Author’s Note: This post is asking an honest question about some issues I am noticing with the quality of data in Google Analytics. My business relies on the generosity of Google to provide Google Analytics for free to many satisfied users, and by no means do I want to damage that relationship. But this topic seems so out of control I am hoping this post will show the size of the problem we are currently facing.
We need a data annulment
There is a massive data quality issue happening right now in Google Analytics, and not enough people are talking about it. This is the most challenging data quality issue that I have seen in the 10 years I have used the product.
I worry that if a permanent and swift solution is not found for this issue, it will cause long-term damage to the credibility of the product. That damage will eat away at the market share that Google has fought so hard to gain and drive users to look for other solutions. I am concerned.
Understanding the problem we are facing
For big businesses, looking at data in Google Analytics is business as usual. Traffic sources are consistent, with the top 10 traffic sources rarely changing.
You can no longer say the same for small businesses. Over the past few months, many small businesses have noticed that their traffic numbers are increasing. Good news, right?
Not when the traffic is fake.
The numbers are increasing because of non-existent (fake) traffic. I have written about this in the past (and many others have as well), yet the problem continues to get worse.
In the last month this has become an epidemic. In the screenshot above, there are over 350 visitors coming from a website called social-buttons.com. This is not real traffic.
How do we know the traffic is fake?
If a source of traffic has extreme metrics like 100% bounce rate or 0% bounce rate, it is usually non-human traffic. The same goes for 100% new visitor rates or 1.00 average page views. Humans don't behave this way en masse, but these numbers are possible when computers visit your website.
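To make that rule of thumb concrete, here is a minimal sketch in Python. It assumes you have exported per-source metrics into a simple record; the `SourceStats` fields, the `looks_non_human` name, and the `min_sessions` cutoff are all my own illustrative choices, not anything Google Analytics provides. Treat a flag as a prompt to investigate, not as proof.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Aggregate metrics for one traffic source (hypothetical export format)."""
    source: str
    sessions: int
    bounce_rate: float        # fraction between 0.0 and 1.0
    new_visitor_rate: float   # fraction between 0.0 and 1.0
    pages_per_session: float


def looks_non_human(stats: SourceStats, min_sessions: int = 20) -> bool:
    """Flag a source whose aggregate metrics sit exactly at an extreme."""
    if stats.sessions < min_sessions:
        # With only a handful of sessions, extreme values happen by chance.
        return False
    extreme_bounce = stats.bounce_rate in (0.0, 1.0)
    all_new_visitors = stats.new_visitor_rate == 1.0
    single_page_only = stats.pages_per_session == 1.0
    return extreme_bounce or all_new_visitors or single_page_only


# Made-up numbers for illustration only.
suspect = SourceStats("social-buttons.com", 350, 1.0, 1.0, 1.0)
organic = SourceStats("google", 900, 0.55, 0.70, 2.3)
print(looks_non_human(suspect), looks_non_human(organic))
```

The session cutoff is there because three real people can easily all bounce; hundreds of sessions doing exactly the same thing is the pattern worth flagging.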
Another way to look at this is to compare traffic numbers for web properties that are running both classic and universal Google Analytics.
I happen to have a web property that runs both classic and universal analytics, which allows us to directly compare how staggering this problem has become.
Ever since universal analytics became the default version of Google Analytics, I have been obsessed with understanding the differences in data collection. Is there a difference in visitors? By how much?
For this reason, I still send data to classic and universal analytics properties through Google Tag Manager. This allows me to monitor the difference between the two tracking methods. For the most part, visits were 99% the same between classic and universal.
However, this balance is changing. For one website I operate, our universal analytics property has 500 more sessions than classic analytics over the same period of time. This is the biggest discrepancy that I have seen between classic and universal.
500 visits may not seem like a lot to a site with 5 million views a month, but it is HUGE for a small business owner. More on this in a minute.
Why is this happening with universal analytics?
This particular threat – which we will call the social buttons hack – is only happening to Universal Analytics accounts. These spammers (or hackers if you prefer) are exploiting a new feature in Universal Analytics to inject visits into our reports. This feature is called the measurement protocol.
How do I know? Private conversations with really smart analytics geeks. But also, there is a dead giveaway if you look at this traffic. There is no hostname assigned.
A hostname of (not set) indicates that this data did not come into Google Analytics through a web browser. This is different from other bot spam techniques where hostnames do show up.
While this may make things easy to filter (just require a hostname to record a hit), it is also something that the average Google Analytics user will not have the knowledge to solve.
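For readers who work with exported hit data, the "require a hostname" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `hostname` and `source` field names, the `VALID_HOSTNAMES` set, and the sample hits are hypothetical, and inside Google Analytics itself the equivalent would be a view filter that only includes traffic for your own hostnames.

```python
# Hypothetical list of hostnames that legitimately serve your tracking code.
VALID_HOSTNAMES = {"example.com", "www.example.com"}


def has_valid_hostname(hit: dict) -> bool:
    """Return True only if the hit reports a hostname we actually own."""
    hostname = (hit.get("hostname") or "").strip().lower()
    return hostname in VALID_HOSTNAMES


# Made-up hits for illustration; the field names are assumptions.
hits = [
    {"source": "google", "hostname": "www.example.com"},
    {"source": "social-buttons.com", "hostname": "(not set)"},
    {"source": "social-buttons.com"},  # hostname missing entirely
]

clean_hits = [hit for hit in hits if has_valid_hostname(hit)]
print([hit["source"] for hit in clean_hits])
```

An allow-list is used here instead of a block-list of spam domains because the spam sources keep changing, while the hostnames you own rarely do.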
This post is about the average user
In the past two weeks I have sat down with clients and colleagues who look at Google Analytics on average once a month.
Each of them has made the same observation upon logging in:
“Wow, our traffic is up this month. Awesome!”
There was no real concern about where that traffic came from. No sense of urgency to figure out whether that traffic was from real people.
They just assumed that something good happened to make the number look better.
That is how the average Google Analytics user thinks about data. They assume that it just works.
For good reason, too. For the most part, Google Analytics just works. There was no reason to question things on a massive scale, until now.
There is one tool measuring the web, and it is Google Analytics
According to Builtwith, over 27 million websites are using Google Analytics. That is likely a conservative number.
Half of the top 10k websites use GA. 59% of the top 100k websites and 62% of the top 1 million websites use Google Analytics.
For the first time in my career, millions of websites now have a very good reason to question the accuracy of their data.
Now the question is: Will website owners even notice? And if they do, will this cause website owners to lose trust in Google Analytics?
Introducing the website visitor bubble
While I do not agree with pageviews as a measure of success, I know many companies that judge the performance of their online marketing by pageviews and visitors.
Publishers receive ad revenue by impressions and sell inventory based on their previous traffic. Small businesses set goals to drive more visitors to their websites over the course of the year.
Looking at Google Analytics today, a business may find that they are well on their way to reaching their marketing goals. Heck, they may even beat their own predictions!
If this bot traffic continues to grow, traffic for these websites will also grow. Website owners will make decisions based on this data. Investment will flow into the web channel, and marketers will be praised for their miraculous performance.
Everything will be amazing until the bubble bursts on this traffic. This will happen either by a massive cleanup effort from Google (my hope) or by digging deeper into the data to understand why traffic is increasing so much.
As an analyst, I would have expected that data digging to have happened by now. But as a realist, I can see that most organizations do not think critically about their traffic numbers.
What happens when the bubble bursts
No matter how or when the bubble bursts, it will leave people looking for answers.
Advertisers will want their money back for the fake impressions they received.
Businesses will blame their marketing managers and analysts for not seeing the problem sooner.
Marketers and analysts will blame Google for not fixing the problem sooner.
While this may not seem like a threat at the moment, it has potential to snowball out of control as it hits the masses.
Why do I think it can get out of control?
Most of the time when data discrepancies happen with a product, the power users are the first to notice. Here that would be the top 10k websites, roughly 50% of which run Google Analytics.
But in this case, the top 10k websites do not even notice that this problem exists. When you get millions of visits each month, 500 bot visits are a rounding error. It’s less than 0.05% of traffic.
Who will notice then? The 27 million websites that operate Google Analytics. Let’s say that 20 million of those websites get fewer than 1,000 visitors a month. 500 bot visits would increase their traffic by 50% to 100% or more.
This traffic could come in the form of measurement protocol hits as described above, or it could be traffic from crawlers like SEMalt. No matter the source, this is a huge increase in traffic for these owners. Enough for many of them to take notice. If even 1% of these owners notice, that is 200,000 website owners.
This becomes a big-time problem for Google.
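The back-of-the-envelope numbers in this section are easy to check. Here is a small Python sketch of the arithmetic; the function name `traffic_inflation` is mine, and the 20 million small sites and the 1% who notice are the same rough guesses used above, not measured figures.

```python
def traffic_inflation(real_visits: int, bot_visits: int = 500) -> float:
    """Return bot visits as a fraction of a site's real monthly visits."""
    if real_visits <= 0:
        raise ValueError("real_visits must be positive")
    return bot_visits / real_visits


# Compare a large site, a mid-size site, and two small ones.
for monthly_visits in (5_000_000, 1_000_000, 1_000, 500):
    share = traffic_inflation(monthly_visits)
    print(f"{monthly_visits:>9,} real visits: +{share:.2%}")

# Assumed figures from the text: 20 million small sites, 1% of owners notice.
small_sites = 20_000_000
noticing_owners = small_sites // 100
print(f"{noticing_owners:,} website owners")
```

The same 500 visits are a 0.01% blip for a site with 5 million monthly visits, a 50% jump for a site with 1,000, and a doubling for a site with 500.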
URGENT: We need to solve this problem
I hope that by now I have sufficiently convinced you that this is a problem. It’s an especially large problem when put in the context of the sheer number of websites that are currently affected. We are talking millions of websites!
Because of the scale, I view this as the biggest challenge Google Analytics has faced. It is the largest data quality issue I have seen in my 15 years working with data. This is an urgent problem.
My solution: A data annulment
If I were in charge of this problem, I would start with a clean slate. This would be achieved by deleting all of this junk data from every Google Analytics account and starting fresh.
I would offer an alert to all users when logging in that their data has been cleaned up, explaining what they are seeing.
And then I would put measures in place to proactively prevent this from happening in the future. This would involve cleaning up the holes in the measurement protocol. It would also involve expanding the bot filtering capability beyond the ethical bots that are registered in the IAB Bots and Spiders list.
Last, I would assign an internal task force to constantly monitor data quality issues and issue quick fixes when problems arise.
The Google Analytics team is brilliant, so they will fix this problem
I am confident that Google will fix this problem with a smarter solution than what I have proposed. The main reason why I am writing this article is just to frame up the problem in a perspective that is hard to ignore. This is not a case where power users are complaining to Google about a problem. Instead, it's a problem most website owners don't even know that they have.
This issue is giving millions of website owners a reason to stop trusting Google Analytics data. It is giving them reason to start looking for alternative tools. I don’t want that to happen.
Let’s get this cleaned up, Google!