A new basis of competition is rapidly emerging, and it centers on AI.
Investment is pouring in: 67% of surveyed organizations are ramping up generative AI spending after seeing early value in pilot projects. Those who embrace AI will surge ahead, while those who delay will be left behind.
For AI to succeed, it must be built on a foundation of clean, permissioned data. Most leaders recognize this: 55% of companies are avoiding certain generative AI applications due to data quality and security concerns.
Yet those concerns paint an incomplete picture. The most critical, and most often overlooked, risk to data quality, and perhaps to AI itself, isn't what you might think. It's subtle, lurking behind every browser interaction across the web: dirty data.
In law, there's a concept called "fruit of the poisonous tree": if evidence is gathered illegally, everything it touches is tainted and unusable. AI faces a similar problem. AI systems are hungry for data, but feeding them data collected improperly or without consent poisons the roots. That dirty data seeps into every insight and prediction, corrupting the entire AI operation from the ground up.
Dirty data is data of uncertain origin: data that has not been explicitly permissioned by its owners or the relevant entities and that falls short of regulatory standards. Because it is gathered without clear consent or proper authorization, it introduces compliance risk and undermines the reliability and usability of AI-driven insights and recommendations.
At Ketch, we conducted a comprehensive study to uncover just how much data is being unlawfully collected online, and the impact this could have on your business. What we found should raise alarms for every company leveraging AI to drive growth.
The study covered 134 prominent U.S. websites, including retail, ecommerce, media, and financial services brands. Together, these sites account for over 1.7 trillion data collection events per month.
Our research focused on answering a simple question: when users opt out of data collection, is their choice respected? To get there, we followed a three-phased approach:
First, our team deployed advanced browser emulators to mimic typical web visitor behaviors: opening pages, clicking buttons, and interacting with content.
Websites, along with their tags and trackers, treat these emulators as genuine users: targets for tracking, data collection, and monetization.
Next, through automated browsing sessions, we simulated visits to company websites in two "states" (opted-in and opted-out of data collection) to compare differences in brand tracking response.
The emulators generated thousands of events, reflecting how brands' CMPs (consent management platforms) and website trackers responded to user consent status.
Finally, we fed all simulated visitor events into Ketch's proprietary tracker database.
This enabled us to pinpoint which trackers were active, how they operated, and whether they honored opt-out signals across our sample of 134 prominent websites.
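To make the methodology concrete, here is a minimal sketch of this kind of audit, not our production emulators. It assumes the opt-out state can be signaled with the Global Privacy Control (Sec-GPC) header; a real audit must also exercise each site's consent banner, and the target URL below is a placeholder.

```typescript
// audit.ts -- minimal sketch of a consent audit using Playwright (not Ketch's emulators).
// Assumes opt-out is signaled via the Global Privacy Control header (Sec-GPC);
// real CMPs may instead require clicking a banner or setting consent cookies.
import { chromium, Browser } from 'playwright';

// Visit a page and record every hostname contacted during the session.
async function collectTrackerHosts(browser: Browser, url: string, optedOut: boolean): Promise<Set<string>> {
  const context = await browser.newContext({
    extraHTTPHeaders: optedOut ? { 'Sec-GPC': '1' } : {},
  });
  const page = await context.newPage();
  const hosts = new Set<string>();

  // Every outbound request is a candidate tracker; a real audit then classifies
  // each host (first-party, operational, or privacy-dependent) against a tracker database.
  page.on('request', (req) => hosts.add(new URL(req.url()).hostname));

  await page.goto(url, { waitUntil: 'networkidle' });
  await context.close();
  return hosts;
}

async function main() {
  const browser = await chromium.launch();
  const site = 'https://example.com'; // placeholder target site

  const optedIn = await collectTrackerHosts(browser, site, false);
  const optedOut = await collectTrackerHosts(browser, site, true);

  // Hosts contacted in BOTH states kept firing despite the opt-out signal.
  const stillActive = [...optedOut].filter((h) => optedIn.has(h));
  console.log(`Hosts contacted while opted in:      ${optedIn.size}`);
  console.log(`Hosts still contacted after opt-out: ${stillActive.length}`);

  await browser.close();
}

main();
```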
The results were nothing short of alarming. Across 134 websites and thousands of simulated visitor events, three major findings emerged.
A whopping 48% of all trackers firing on these websites are tied specifically to advertising, marketing, and personalization use cases. Known as privacy-dependent trackers (PDTs), these technologies are critical to the digital advertising ecosystem, enabling companies to capture and share personal data for targeted ads and tailored experiences.
Under current data privacy regulations, firing these trackers constitutes a "sale or share" of data, so they should be disabled when users opt out.
Together, these trackers, 48% of all tracking observed, form the backbone of personalized digital marketing and data-sharing strategies across industries.
As for the remaining 52%, many are trackers and cookies that serve necessary operational purposes, such as supporting account verification. However, for 24% of that remaining share, PDT status could not be verified, suggesting that 48% understates the true level.
Approximately 40% of PDTs remain active after a user opts out. This was true for both large websites (38% remained active) and midsize brands (40% remained active).
Only 12% of companies fully complied with user preferences, deactivating every PDT after receiving an opt-out. The remaining 88% ignored user opt-outs for at least some advertising, marketing, or personalization data.
Overall, our findings underscored the massive scale of digital data collection.
Across the studied sites, there were 665 billion advertising-related data collection events per month. Roughly 30% of these events came from PDTs that continued firing after an opt-out, creating an estimated 215 billion dirty data events each month.
Dirty data thus represents 12.7% of the 1.7 trillion total data collection events in the study.
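For readers checking the arithmetic, the headline share follows directly from the study's own figures:

$$\frac{215\ \text{billion dirty data events}}{1.7\ \text{trillion total events}} = \frac{215}{1700} \approx 12.7\%$$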
How exactly does "dirty data" relate to AI? To understand this, we need to unpack where dirty data goes after it's been collected. When businesses unknowingly take in unlawfully collected data, where does it end up? What systems and initiatives depend on it?
For competitive brands and businesses, one destination stands out above the rest: AI for marketing.
Nowhere is AI transforming work faster than in marketing and advertising, which emerged as top AI investment areas in 2024.
These investments are transforming how businesses reach and engage customers, allowing them to personalize campaigns at a scale that was previously unimaginable. AI is no longer a niche experiment—it’s becoming a central gear of the modern marketing engine.
Personalized marketing is a priority AI use case across industries. A well-executed, AI-driven hyper-personalization strategy can deliver 8x the return on investment (ROI) and boost sales by 10% or more.
Three use cases are proving indispensable in marketing and advertising today: dynamic targeting, hyper-personalization, and predictive marketing.
AI isn't just adding value to marketing; it's redefining it. Where traditional marketing relied on broad targeting strategies, AI enables businesses to dive deep into individual customer behaviors, preferences, and past interactions, tailoring messages and offers with pinpoint precision.
All three use cases rely on one indispensable ingredient: customer data.
We’re talking about real-time digital exhaust from every corner of your brand: website, mobile apps, digital transactions, and other online platforms. Every click, every session, every customer behavior leaves a trail of data that’s valuable for AI to interpret and act upon.
Without this digital data exhaust, AI models are blind, unable to deliver the personalized experiences and optimized strategies that today’s marketing demands.
We sat down with Vivek Vaidya, Ketch's co-founder and CTO, to understand exactly what it means for AI-driven marketing and advertising to rely on digital data.
As he explained: "AI's effectiveness is entirely dependent on the data it's trained on. Every digital touchpoint—whether it's a website visit, a mobile app interaction, or a digital transaction—creates data that feeds into AI models. Without this constant stream of digital data, AI simply can't perform."
So, what kind of data does AI need to drive successful marketing?
Once the data is collected across digital properties, it feeds AI models. AI-driven personalized marketing depends on continually refining and customizing models with fresh, real-time data.
Your website, mobile apps, and transactions create an ongoing digital footprint, which becomes the fuel that drives AI's decision-making capabilities. The more data AI has, the better it can target, personalize, and predict.
An outside observer might assume that brands build these critical data collection tools in-house. However, the 2024 marketing tech stack is not built: it’s assembled. Most businesses rely on an ecosystem of martech and adtech partners (an average of 90, to be exact) to capture, process, and store this business-critical data.
This data is largely collected through JavaScript tags, cookies, and trackers embedded across a brand's digital properties, creating that "digital exhaust" required to feed AI models.
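To see what honoring consent looks like at the tag level, here is a hedged sketch of a consent-gated tracker. The cookie format and URLs are hypothetical stand-ins; real CMPs expose richer consent APIs (for example, the IAB TCF __tcfapi callback).

```typescript
// tag-loader.ts -- illustrative sketch of a consent-gated tracker, not a real vendor SDK.
// The "consent=advertising" cookie is a hypothetical stand-in for a CMP's consent record.
function hasAdvertisingConsent(): boolean {
  return document.cookie.split('; ').some((c) => c === 'consent=advertising');
}

function loadTrackers(): void {
  // A privacy-dependent tracker (PDT) must only fire on an affirmative consent signal.
  if (hasAdvertisingConsent()) {
    const script = document.createElement('script');
    script.src = 'https://tracker.example.com/pixel.js'; // placeholder PDT
    document.head.appendChild(script);
  }
  // When the user has opted out, the correct behavior is to do nothing:
  // no pixel, no cookie, no data leaves the page. Our study found roughly
  // 40% of PDTs skip this check and keep firing anyway.
}

loadTrackers();
```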
Relying on such a vast network of third-party vendors for critical data collection introduces significant complexity—and raises important concerns.
As companies turn to AI for growth, they rely heavily on data collected from their digital channels—websites, apps, and online platforms. But if this data is collected without proper consent, it creates "dirty data" that flows directly into AI systems.
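One defensive pattern, sketched below with hypothetical types and data, is to treat consent as a hard gate at the ingestion layer, so that unconsented events never reach the warehouse, the feature store, or model training.

```typescript
// ingest.ts -- sketch of a consent gate at the ingestion layer; types and data are hypothetical.

interface TrackedEvent {
  userId: string;
  name: string;       // e.g., "page_view", "add_to_cart"
  purposes: string[]; // purposes this event will be used for
}

// Stand-in for a consent lookup against your CMP's records.
type ConsentLookup = (userId: string, purpose: string) => boolean;

// Only events whose every declared purpose is consented may flow downstream
// to analytics, the feature store, or AI model training.
function filterConsented(events: TrackedEvent[], hasConsent: ConsentLookup): TrackedEvent[] {
  return events.filter((e) => e.purposes.every((p) => hasConsent(e.userId, p)));
}

// Example: an in-memory consent record; production systems would query the CMP.
const consents = new Map<string, Set<string>>([
  ['u1', new Set(['analytics', 'advertising'])],
  ['u2', new Set(['analytics'])], // u2 opted out of advertising
]);

const clean = filterConsented(
  [
    { userId: 'u1', name: 'page_view', purposes: ['advertising'] },
    { userId: 'u2', name: 'page_view', purposes: ['advertising'] }, // dropped at the gate
  ],
  (id, purpose) => consents.get(id)?.has(purpose) ?? false,
);

console.log(clean.length); // 1 -- only the consented event survives
```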
Look no further than regulators' favorite term for data-powered advertising, "surveillance capitalism," to understand the rise in consumer privacy enforcement actions. As digital data collection becomes more sophisticated, so does regulators' understanding of tracking technologies.
Dirty data collection is the central thrust behind recent privacy enforcement actions.
And as regulatory scrutiny intensifies, consumer awareness is also on the rise.
This growing consumer vigilance, combined with increasingly sophisticated regulatory action, underscores a clear message: brands that neglect data privacy risk not only enforcement penalties but also the erosion of consumer trust, the cornerstone of long-term reputation and success.
Privacy violations might trigger fines, but the real cost of dirty data, and the real threat to business productivity, lies in having to retrain AI models that have been corrupted by non-compliant data.
AI models depend on data. When that data is revealed to be dirty—collected without consent or in violation of privacy laws—companies must stop everything. The data is tainted, and using it puts the entire business at risk. Correcting this issue involves retraining the model from the ground up, which is a complex and costly endeavor.
Why is retraining AI so difficult? To understand the challenge, Vivek Vaidya breaks down the high-level steps involved.
The costs of retraining depend on the model’s size and complexity. Vivek estimates that retraining a “small” AI model with 10 billion parameters and 50–100 terabytes of data can cost between $1.5 million and $4 million. This includes data processing, computational infrastructure, and labor costs.
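That range is Vivek's estimate. As a rough cross-check (our assumptions, not his), the widely used approximation that training a model costs about 6ND floating-point operations for N parameters and D training tokens points to the same order of magnitude. Assuming 50 TB of text is roughly 1.25 × 10^13 tokens (about four bytes per token):

$$\text{FLOPs} \approx 6ND = 6 \times \left(10^{10}\right) \times \left(1.25 \times 10^{13}\right) \approx 7.5 \times 10^{23}$$

$$\frac{7.5 \times 10^{23}\ \text{FLOPs}}{10^{14}\ \text{FLOP/s per GPU}} \approx 7.5 \times 10^{9}\ \text{GPU-seconds} \approx 2 \times 10^{6}\ \text{GPU-hours}$$

At roughly $1 to $2 per GPU-hour for A100-class hardware at realistic utilization, compute alone lands in the low millions of dollars before data processing and labor, consistent with the quoted range.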
A more severe consequence: model deletion
In 2022, the FTC penalized Weight Watchers (WW International) for illegally collecting sensitive data from children. Beyond deleting the data, WW was ordered to destroy any algorithms or AI systems trained using it. This “disgorgement” ruling wiped out the models entirely, forcing the company to start over from scratch.
Consider this: businesses leveraging AI for personalized marketing see 10-20% faster growth than those that don’t. Every day your AI is out of commission is another day you’re losing ground in the race for relevance and profitability.
If your initiatives are down for weeks or months while you retrain models or source clean data, the market won’t wait for you to catch up.
The message is stark but simple: businesses that don’t prioritize clean, permissioned data are undermining the very AI initiatives they hope will secure their future.
Dirty data threatens everything. It poisons AI models, invites regulatory scrutiny, and erodes customer trust. And while privacy fines may be recoverable, the operational and strategic costs of dirty data are not.
This study should serve as a wake-up call for any CEO managing a business that collects personal data. If your business 1) collects personal data and 2) intends to remain competitive, here are three things you need to do right now:
Stop any AI projects that use data you can't verify as permissioned and compliant. Convene a team—ideally led by your CTO—and ask them to verify that dirty data isn't being used to train AI models or fuel AI applications.
Find out whether your company is one of the 88% that aren't respecting consumer consent signals. Do you have a CMP (consent management platform)? Does that CMP ensure you don't collect data when users have opted out? Can you prove it?
To realize true privacy, not just the Hollywood facade of privacy, businesses must ensure that privacy commitments are respected and enforced across the data ecosystem—including mission-critical business functions such as marketing, customer data management, and commerce.
Privacy may be a legal requirement, but at its core, it’s a technical problem. Achieving compliance and meeting consumer expectations requires a deep understanding of systems, data flows, and technical controls.
Your CTO should own the privacy program, with engineers leading implementation and execution. Legal and privacy teams play a critical role as stakeholders, but the complexities of modern compliance demand technical expertise at the helm. Stop treating data privacy like a check-the-box exercise; it’s a strategic imperative that requires cross-functional collaboration across engineering, marketing, product, and legal to keep regulators at bay and your data clean.
Dirty data isn’t just a compliance issue—it’s a direct threat to the success of AI initiatives. As this study shows, a staggering portion of data fueling AI is improperly permissioned, compromising both legal compliance and the reliability of AI-driven insights and activation.
In an AI-powered market, companies that prioritize clean, compliant data will outpace competitors, while those ignoring data quality risks will face regulatory scrutiny and lose ground.
The message is clear: if AI is the future, then permissioned data is its foundation. By investing in robust data governance and privacy-first practices, businesses can protect their AI investments, safeguard customer trust, and position themselves for sustainable growth.