
Dirty Data, Broken AI: The hidden threat derailing your competitive edge

In this research study, we uncover website data collection practices and their effects on AI initiatives.
Tags
Read time
10 min read
Last updated
January 19, 2025

Executive Summary

AI is revolutionizing industries, but it cannot function without data–vast amounts of it.

Nowhere is this more evident than in AI-driven marketing and advertising use cases. Data is the fuel that powers AI’s ability to personalize experiences, optimize decisions, and drive growth. AI-powered marketing isn’t simply a competitive advantage; it’s a requirement for commercial success. 

This study uncovered a major threat to AI initiatives: due to a lack of proper technical controls, companies are populating AI models with dirty data. Across 134 major U.S. websites, encompassing approximately 1.7 trillion data events per month:

 
  • ✔️ 55% of the data companies collect on digital channels is for marketing, advertising, and personalization use cases.
  • ✔️ 40% of the trackers collecting this data ignore consumer preferences for data sharing.
  • ✔️ 215 billion unpermissioned, “dirty data” events are generated each month.

The consequence of feeding this dirty data into AI models goes beyond risk of regulatory enforcement. Using dirty data for business-critical AI initiatives introduces the risk of being forced to retrain AI models–halting productivity and opening the door to competitors.

In today’s AI-driven marketplace, ensuring a baseline of permissioned data isn’t just a legal necessity; it’s essential to competitive advantage and ultimately, survival. It’s time for leaders and board members to call attention to this major, overlooked issue and demand appropriate oversight and investment.


AI isn’t a trend–it’s the new foundation for competitive edge and market dominance

The new basis of competition is rapidly emerging, and it centers on AI. 

  • Worldwide spending on artificial intelligence (AI) will more than double by 2028 when it is expected to reach $632 billion.
  • The generative AI market is poised to explode, growing to $1.3 trillion over the next 10 years from a market size of just $40 billion in 2022.

This wave of investment is rewriting the rules of competition. 67% of surveyed organizations are ramping up investments in generative AI after seeing early value in pilot projects. Those who embrace AI will surge ahead, while those who delay will be left behind. 


Data is the lifeblood driving AI’s success–or failure

For AI to succeed, it must be built on a foundation of clean, permissioned data. Most leaders recognize this; 55% of companies are avoiding certain generative AI applications due to data quality and security concerns. 

  • Data quality issues are often defined by challenges with accuracy, coverage for business-specific problems, and legal considerations–such as the rights to data gathered from public sources. 
  • Data security concerns often refer to the technical safeguards needed to protect data during transfer, storage, and processing–for example, encryption protocols to prevent breaches. 

These examples paint an incomplete picture. The most critical and often overlooked risk to data quality–and perhaps to AI itself–is none of these things.

The biggest threat to your AI growth strategy is dirty data.

The greatest risk to leveraging AI for growth isn’t what you might think. It’s subtle, lurking behind every browser interaction across the web: dirty data.

In law, there’s a concept called “fruit of the poisonous tree”–if evidence is gathered illegally, everything it touches is tainted, unusable. AI faces a similar problem. AI systems are hungry for data, but when fed with data collected improperly or without consent, it poisons the roots. That poisonous, dirty data seeps into every insight and prediction, corrupting the entire AI operation from the ground up.

Dirty data is data of uncertain origin that has not been explicitly permissioned by its original owners or relevant entities, falling short of regulatory standards. It is gathered without clear consent or proper authorization, introducing compliance risks and undermining the reliability and usability of AI-driven insights and recommendations. 

‍

‍

Are your AI initiatives powered by dirty data?

At Ketch, we conducted a comprehensive study to uncover just how much data is being unlawfully collected online, and the impact this could have on your business. What we found should raise alarms for every company leveraging AI to drive growth.

Opt-out or not? How we tested data collection integrity

We conducted a comprehensive study of 134 prominent U.S. websites including retail, ecommerce, media, and financial services brands. Together, these sites account for over 1.7 trillion data collection events per month.

Our research focused on answering a simple question: when users opt out of data collection, is their choice respected? To get there, we followed a three-phased approach: 

1. Website Visitor Behavior Simulation

Our team deployed advanced browser emulators to mimic typical web visitor behaviors: opening pages, clicking buttons, and interacting with content. 

Companies (and their respective websites, tags, and trackers) treat these emulators as genuine users: targets for tracking, data collection, and monetization.

2. Comparative “Before/After Opt-Out” Data Collection

Through automated browsing sessions, we simulated visits to company websites in two “states”—opted-in and opted-out of data collection—to compare differences in brand tracking response. 

The emulators generated thousands of events, reflecting how brand CMPs (consent management platforms) and website trackers responded to user consent status. 

3. Tracker Analysis

We assembled and fed all simulated visitor events into Ketch’s proprietary tracker database. 

This enabled us to pinpoint which trackers were active, how they operated, and whether they honored opt-out signals, across our sample size of 134 prominent websites. 
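The core of the before/after comparison can be sketched in a few lines. The sketch below is illustrative only: it assumes tracker request logs have already been captured for the two browsing states, and the domain names and PDT list are hypothetical placeholders, not trackers observed in the study.

```python
# Hypothetical set of known privacy-dependent trackers (PDTs).
KNOWN_PDTS = {"ads.example-dsp.com", "px.example-retarget.net", "t.example-analytics.io"}

def remaining_active_pdts(opted_in_requests, opted_out_requests):
    """Return PDTs that fired in BOTH states, i.e. ignored the opt-out."""
    fired_in = {d for d in opted_in_requests if d in KNOWN_PDTS}
    fired_out = {d for d in opted_out_requests if d in KNOWN_PDTS}
    return fired_in & fired_out

# Request domains captured during the two simulated sessions:
opted_in = ["cdn.brand.com", "ads.example-dsp.com", "px.example-retarget.net"]
opted_out = ["cdn.brand.com", "px.example-retarget.net"]  # one PDT still fires

violations = remaining_active_pdts(opted_in, opted_out)
print(sorted(violations))  # the retargeting pixel ignored the opt-out
```

A compliant site would yield an empty set here: every PDT seen in the opted-in session should disappear from the opted-out session.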

The alarming scope of dirty data, exposed 

Our analysis was nothing short of alarming. Across 134 websites and thousands of simulated visitor events, three major findings emerged. 

1. 48% of web tracking facilitates advertising, marketing, and personalization

A whopping 48% of all trackers firing on websites are specifically tied to advertising, marketing, and personalization use cases. Also known as PDTs (privacy dependent trackers), these tracking technologies are critical to the digital advertising ecosystem, enabling companies to capture and share personal data for targeted ads and tailored experiences.


Under current data privacy regulations, firing these trackers equates to a “sell or share of data,” and should be disabled when users opt out. 

Together, this 48% of all tracking on the web forms the backbone of personalized digital marketing and data-sharing strategies across industries.

As for the remaining 52%, many are data trackers and cookies that serve necessary operational purposes, such as supporting account verification. However, the PDT status of 24% of that remaining 52% could not be verified–suggesting that 48% undershoots the true level.

2. 88% of companies ignore user preferences for data collection

Approximately 40% of PDTs remain active after a user opts out. This was true for both large websites (38% remained active) and midsize brands (40% remained active). 


Only 12% of companies fully complied with user preferences, deactivating every PDT after receiving an opt-out. 88% of companies ignore user preferences for data in advertising, marketing, and personalization use cases. 

3. 215 billion dirty data events occur every month

Overall, our findings underscored the massive scale of digital data collection.

  • 51: Average # of data trackers per large website
  • 63: Average # of data trackers per midsize website
  • 250: Distinct data collection technologies activated on a single visit

Across the top sites, there were 665 billion monthly advertising-related data collection events. However, 30% of PDTs continued firing after an opt-out, creating an estimated 215 billion dirty data events.

Dirty data represents 12.7% of all 1.7 trillion data collection events in the study.
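These headline figures fit together arithmetically; the small gaps between the computed shares and the reported ~30% and 12.7% reflect rounding of the underlying, unrounded event counts. A quick sanity check:

```python
# Headline figures from the study, in billions of monthly events.
total_events = 1_700  # ~1.7 trillion total data collection events
ad_events = 665       # advertising-related events across the top sites
dirty_events = 215    # events from PDTs that kept firing after an opt-out

dirty_share_of_ads = dirty_events / ad_events    # ~0.32 -> the "~30%" figure
dirty_share_of_all = dirty_events / total_events # ~0.126 -> reported as 12.7%
print(f"{dirty_share_of_ads:.0%} of ad-related events are dirty")
print(f"{dirty_share_of_all:.1%} of all events are dirty")
```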


The undeniable link between dirty data and business risk: AI-powered marketing 

How exactly does “dirty data” relate to AI? To understand this, we need to unpack where dirty data goes after it’s been collected. When businesses unknowingly intake unlawfully collected data, where does it go? What systems or initiatives depend on it?

For competitive brands and businesses, one destination stands out among the rest: AI for marketing.

Marketing use cases are the frontrunners for AI investment, with advertising and personalization leading the charge

Nowhere is the AI-driven transformation of work happening faster than in marketing and advertising, which emerged as top AI investment areas in 2024:

  • 65% of organizations are already using AI in at least one business function, with marketing and sales being the most popular functions to adopt AI.
  • Over the next 12–18 months, 78% of marketers are planning to expand their use of AI to enhance their marketing capabilities and processes.

These investments are transforming how businesses reach and engage customers, allowing them to personalize campaigns at a scale that was previously unimaginable. AI is no longer a niche experiment—it’s becoming a central gear of the modern marketing engine.

Personalized marketing is the top AI use case for business leaders  

Personalized marketing is a prioritized AI use case across industries (graph source: McKinsey)

A well-executed AI-driven hyper-personalization strategy can deliver 8x the return on investment (ROI) and boost sales by 10% or more.

Three primary use cases are proving indispensable in marketing and advertising today.

  1. Dynamic audience targeting and segmentation. AI rapidly analyzes customer data, identifying new audience segments in real time, allowing businesses to send targeted messages that improve campaign effectiveness and optimize marketing spend.
  2. Hyper-personalized outreach and content generation. AI automates personalized content—emails, ads, or posts—at scale, adapting messaging based on customer behavior to drive higher engagement and conversions.
  3. Optimizing marketing strategies with predictive analytics. AI-powered predictive analytics continuously refine campaigns, predicting which strategies will work best and adjusting in real time to maximize ROI and marketing efficiency.

AI isn’t just adding value to marketing—it’s redefining it. Traditional marketing approaches relied on broad targeting strategies. AI enables businesses to dive deep into individual customer behaviors, preferences, and past interactions, tailoring messages and offers with pinpoint precision.

Customer data is what unlocks successful AI activation 

For these AI initiatives to work—especially in dynamic targeting, hyper-personalization, and predictive marketing—they rely on one indispensable ingredient: customer data. 

We’re talking about real-time digital exhaust from every corner of your brand: website, mobile apps, digital transactions, and other online platforms. Every click, every session, every customer behavior leaves a trail of data that’s valuable for AI to interpret and act upon.

Without this digital data exhaust, AI models are blind, unable to deliver the personalized experiences and optimized strategies that today’s marketing demands.

Why AI needs your customer data

We sat down with Vivek Vaidya, Ketch co-founder and CTO, to understand exactly what it means for AI-driven marketing and advertising to rely on digital data.

As he explained, “AI’s effectiveness is entirely dependent on the data it’s trained on. Every digital touchpoint—whether it’s a website visit, a mobile app interaction, or a digital transaction—creates data that feeds into AI models. Without this constant stream of digital data, AI simply can’t perform.”

So, what kind of data does AI need to drive successful marketing? 

  1. Personal characteristics. What you do indicates who you are. Businesses collect your behavioral and demographic data—such as page views or geographic location—to infer customer preferences and characteristics. These insights can help AI personalize the experience for each user, delivering relevant content based on their specific behaviors.
  2. Conversion signals. These are activity signals that indicate when a campaign or tactic works, such as a purchase, a signup, or a download. This conversion data helps AI understand what success looks like, feeding back into the system to improve future performance.
  3. Marketing tactics. AI requires a constant feedback loop on how your marketing strategies are performing. Whether you’re testing new ad creatives or deploying personalized offers, AI uses this data to refine tactics and optimize campaign effectiveness over time.

Once collected, data must be put to work in AI models

Once the data is collected across digital properties, it feeds AI models. AI-driven personalized marketing depends on continually refining and customizing models with fresh, real-time data.

Your website, mobile apps, and transactions create an ongoing digital footprint, which becomes the fuel that drives AI’s decision-making capabilities. The more data AI has, the better it can:

  • Identify and predict customer preferences through real-time analysis of user behavior
  • Test and optimize marketing tactics by running thousands of simulations based on data inputs
  • Adapt to market shifts and customer trends by learning continuously from new data points

Most companies don’t build this data infrastructure–they license it

An outside observer might assume that brands build these critical data collection tools in-house. However, the 2024 marketing tech stack is not built: it’s assembled. Most businesses rely on an ecosystem of martech and adtech partners (an average of 90, to be exact) to capture, process, and store this business-critical data.

This data is largely collected through JavaScript tags, cookies, and trackers embedded across a brand’s digital properties, creating that “digital exhaust” required to feed AI models. 

You can see this complex network of trackers at work when you browse any website—mapping the journey of data collection that powers AI-driven marketing campaigns.

Relying on such a vast network of third-party vendors for critical data collection introduces significant complexity—and raises important concerns. 

  • Do businesses have visibility into the full scope of how data is being gathered and used? 
  • Do stakeholders understand what happens to the information flowing from their digital properties?
  • How confident can businesses be that these tools are operating as intended?

There is a clear path for dirty data to flow from your digital ecosystem into your AI initiatives

As companies turn to AI for growth, they rely heavily on data collected from their digital channels—websites, apps, and online platforms. But if this data is collected without proper consent, it creates “dirty data” that flows directly into AI systems. 

To recap the chain of risk:

  • Staying competitive today means leveraging AI; companies that adopt it gain a competitive edge.
  • Marketing is the frontrunner for AI investment, and personalized marketing is the top AI use case for revenue growth.
  • AI cannot function without vast amounts of customer data collected from digital properties—websites, mobile apps, digital transactions, and other online platforms.
  • This study assessed the behaviors of the trackers (PDTs) used to collect this data.
  • Despite regulations, roughly 40% of these trackers continue collecting data after users opt out, creating significant risks for privacy compliance and data integrity.

Why you should care: the penalties of powering AI with dirty data

Privacy: the reason you get caught, but not the reason you pay

Look no further than regulators' favorite term for data-powered advertising, “surveillance capitalism,” to understand the increase in consumer privacy enforcement actions. As digital data collection becomes more sophisticated, so do regulators in their comprehension of tracking technologies.

Dirty data collection is the central thrust behind privacy enforcement action:

  • New York Attorney General Letitia James released two comprehensive guides on website privacy controls: the Business Guide to Website Privacy Controls and the Consumer Guide to Web Tracking. Both demonstrate the growing technical sophistication seen in regulators’ scrutiny of privacy practices and the degree to which privacy is now solidly a consumer protection issue. 
  • Regulators are also closely watching the use of tracking pixels on healthcare websites, and the leakage of sensitive healthcare information to third-party vendors. The FTC accused BetterHelp and other healthcare companies of deceptive conduct over their use of such technologies.
  • Brands across the U.S. are inundated with wiretapping claims from plaintiffs’ lawyers, equating digital pixel tracking with eavesdropping on consumer behavior. Without clear precedent or guidance on refuting claims, companies are settling, with the risk of more claims right around the corner.

As regulatory scrutiny intensifies, consumer awareness is also on the rise: 

  • According to The Person Behind the Data, a research report by Ketch and IPG MAGNA, 74% of consumers "highly value" their data privacy and are inclined to reward brands that responsibly manage their information.
  • In a recent study by Consumer Reports, 78% of Americans would support a law regulating how companies can collect, store, share, and use our personal data. 

This growing consumer vigilance, combined with increasingly sophisticated regulatory actions, underscores a clear message: brands that neglect data privacy risk not only enforcement penalties but also the erosion of consumer trust–the cornerstone of long-term reputation and success.

The biggest risk of all: the threat of retraining AI models

Privacy violations might trigger fines, but the real cost of dirty data–and threat to business productivity–lies in having to retrain AI models that have been corrupted by non-compliant data. 

AI models depend on data. When that data is revealed to be dirty—collected without consent or in violation of privacy laws—companies must stop everything. The data is tainted, and using it puts the entire business at risk. Correcting this issue involves retraining the model from the ground up, which is a complex and costly endeavor.


Retraining AI models is expensive

Why is retraining AI so difficult? To understand the challenge, Vivek Vaidya breaks down the high-level steps involved.

  • Rebuilding the data pipeline: AI models rely on vast, carefully prepared datasets to learn patterns and make predictions. Retraining means going back to the beginning: collecting, cleaning, and preparing an entirely new set of data to replace the tainted inputs.
  • High computational costs: Training a model involves processing terabytes of data through powerful servers, often using advanced GPUs or TPUs. These systems run for days or weeks at a time, consuming significant energy and computational resources.
  • Expertise and labor: Skilled data scientists and machine learning specialists must oversee retraining. They adjust algorithms, validate results, and ensure compliance, adding substantial labor costs to the process.

The costs of retraining depend on the model’s size and complexity. Vivek estimates that retraining a “small” AI model with 10 billion parameters and 50–100 terabytes of data can cost between $1.5 million and $4 million. This includes data processing, computational infrastructure, and labor costs.
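To make the shape of that estimate concrete, here is an illustrative breakdown. Every unit cost and quantity below is an assumption chosen for illustration; none are figures from Vivek or the study.

```python
# Rough retraining-cost model for a "small" 10B-parameter AI model.
def retraining_cost(data_tb, gpu_hours, team_weeks,
                    usd_per_tb=2_000,           # assumed: collection + cleaning per TB
                    usd_per_gpu_hour=3.0,       # assumed: cloud GPU rate
                    usd_per_team_week=40_000):  # assumed: blended ML-team cost
    data = data_tb * usd_per_tb          # rebuilding the data pipeline
    compute = gpu_hours * usd_per_gpu_hour  # computational infrastructure
    labor = team_weeks * usd_per_team_week  # expertise and labor
    return data + compute + labor

# Example run: 75 TB of data, ~300k GPU-hours, a team for 12 weeks.
total = retraining_cost(data_tb=75, gpu_hours=300_000, team_weeks=12)
print(f"${total / 1e6:.1f}M")  # → $1.5M, inside the cited $1.5M–$4M range
```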

A more severe consequence: model deletion

In 2022, the FTC penalized Weight Watchers (WW International) for illegally collecting sensitive data from children. Beyond deleting the data, WW was ordered to destroy any algorithms or AI systems trained using it. This “disgorgement” ruling wiped out the models entirely, forcing the company to start over from scratch.

Operational delays cripple competitive advantage

Consider this: businesses leveraging AI for personalized marketing see 10-20% faster growth than those that don’t. Every day your AI is out of commission is another day you’re losing ground in the race for relevance and profitability.

If your initiatives are down for weeks or months while you retrain models or source clean data, the market won’t wait for you to catch up.

The message is stark but simple: businesses that don’t prioritize clean, permissioned data are undermining the very AI initiatives they hope will secure their future.

Dirty data threatens everything. It poisons AI models, invites regulatory scrutiny, and erodes customer trust. And while privacy fines may be recoverable, the operational and strategic costs of dirty data are not.  

Stop dirty data before it stops you: three actions to take today 

This study should serve as a wake-up call for any CEO managing a business that collects personal data. If your business 1) collects personal data and 2) intends to remain competitive, here are three things you need to do right now: 

1. Pause any AI initiatives where you can’t certify that data is permissioned and clean

Stop any AI projects that use data you can’t verify as permissioned and compliant. Convene a team—ideally led by your CTO—and ask them to verify that dirty data isn’t being used to train AI models or fuel AI applications. 

2. Audit your consent management technology and processes, including verification that signals are passed to systems that matter

Find out if your company is one of the 88% that aren’t respecting consumer consent signals. Do you have a CMP (consent management platform)? Does that CMP ensure that you don’t collect data when you’re not supposed to? Can you prove it?
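One concrete test for such an audit: verify that consent state actually gates tracker execution. A minimal sketch, assuming a simple category taxonomy (the category names here are illustrative):

```python
# Privacy-dependent tracker (PDT) categories that constitute a "sell or share"
# of data under current regulations; they must be blocked after an opt-out.
PDT_CATEGORIES = {"advertising", "marketing", "personalization"}

def may_fire(tracker_category: str, user_opted_out: bool) -> bool:
    """Decide whether a tracker is allowed to fire given the consent state."""
    if tracker_category in PDT_CATEGORIES:
        return not user_opted_out  # PDTs must respect the opt-out
    return True  # operational trackers (e.g., account verification) may fire

# An audit asserts this invariant against every tracker observed on the site:
assert may_fire("advertising", user_opted_out=True) is False
assert may_fire("account_verification", user_opted_out=True) is True
```

If any observed tracker violates this invariant in production, the consent signal is not being passed to the systems that matter.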

To realize true privacy, not just the Hollywood facade of privacy, businesses must ensure that privacy commitments are respected and enforced across the data ecosystem—including mission-critical business functions such as marketing, customer data management, and commerce. 

3. Put your CTO in charge of privacy to ensure tech-forward leadership

Privacy may be a legal requirement, but at its core, it’s a technical problem. Achieving compliance and meeting consumer expectations requires a deep understanding of systems, data flows, and technical controls.

Your CTO should own the privacy program, with engineers leading implementation and execution. Legal and privacy teams play a critical role as stakeholders, but the complexities of modern compliance demand technical expertise at the helm. Stop treating data privacy like a check-the-box exercise; it’s a strategic imperative that requires cross-functional collaboration across engineering, marketing, product, and legal to keep regulators at bay and your data clean.

Permissioned data, protected future–the road ahead 

Dirty data isn’t just a compliance issue—it’s a direct threat to the success of AI initiatives. As this study shows, a staggering portion of data fueling AI is improperly permissioned, compromising both legal compliance and the reliability of AI-driven insights and activation.

In an AI-powered market, companies that prioritize clean, compliant data will outpace competitors, while those ignoring data quality risks will face regulatory scrutiny and lose ground.

The message is clear: if AI is the future, then permissioned data is its foundation. By investing in robust data governance and privacy-first practices, businesses can protect their AI investments, safeguard customer trust, and position themselves for sustainable growth.

Published
January 16, 2025
