Is Google Analytics' Newest Data Quality Issue the Most Challenging?

Author’s Note: This post is asking an honest question about some issues I am noticing with the quality of data in Google Analytics. My business relies on the generosity of Google to provide Google Analytics for free to many satisfied users, and by no means do I want to damage that relationship. But this topic seems so out of control I am hoping this post will show the size of the problem we are currently facing.

We need a data annulment

There is a massive data quality issue happening right now in Google Analytics, and not enough people are talking about it. This is the most challenging data quality issue that I have seen in the 10 years I have used the product.

I worry that if a permanent and swift solution is not found for this issue, it will cause long-term damage to the credibility of the product. That damage will eat away at the market share that Google has fought so hard to gain and drive users to look for other solutions. I am concerned.

Understanding the problem we are facing

For big businesses, looking at data in Google Analytics is business as usual. Traffic sources are consistent, with the top 10 traffic sources rarely changing.

You can no longer say the same for small businesses. Over the past few months, many small businesses have noticed their traffic numbers increasing. Good news, right?

Not when the traffic is fake.

Google Analytics Fake Traffic

The increase is due to non-existent (fake) traffic. I have written about this in the past (and many others have as well), yet the problem continues to get worse.

In the last month this has become an epidemic. In the screenshot above, there are over 350 visitors coming from a website called social-buttons.com. This is not real traffic.

How do we know the traffic is fake?

If a source of traffic has extreme metrics like 100% bounce rate or 0% bounce rate, it is usually non-human traffic. The same goes for 100% new visitor rates or 1.00 average page views. Humans don’t behave this way en masse, but these numbers are possible when computers visit your website.

Another way to look at this is to compare traffic numbers for web properties that are running both classic and universal Google Analytics.

I happen to have a web property that runs both classic and universal analytics, which allows for a direct comparison showing how staggering this problem has become.

Universal vs Classic

Ever since universal analytics became the default version of Google Analytics, I have been obsessed with understanding the differences in data collection. Is there a difference in visitors? By how much?

For this reason, I still send data to classic and universal analytics properties through Google Tag Manager. This allows me to monitor the difference between the two tracking methods. For the most part, visits were 99% the same between classic and universal.
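Outside of Tag Manager, the roughly equivalent setup is firing both libraries on the same page, one hit to each property. A bare sketch with placeholder property IDs, assuming the standard ga.js and analytics.js loaders are already on the page:

    // Classic analytics (ga.js): send a pageview to one property...
    var _gaq = _gaq || [];
    _gaq.push(['_setAccount', 'UA-XXXXXX-1']);
    _gaq.push(['_trackPageview']);

    // ...and universal analytics (analytics.js): send a pageview to a second property.
    ga('create', 'UA-XXXXXX-2', 'auto');
    ga('send', 'pageview');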

However, this balance is changing. For one website I operate, our universal analytics property has 500 more sessions than classic analytics over the same period of time. This is the biggest discrepancy that I have seen between classic and universal.

500 visits may not seem like a lot to a site with 5 million views a month, but it is HUGE for a small business owner. More on this in a minute.

Why is this happening with universal analytics?

This particular threat – which we will call the social buttons hack – is only happening to Universal Analytics accounts. These spammers (or hackers if you prefer) are exploiting a new feature in Universal Analytics to inject visits into our reports. This feature is called the measurement protocol.

How do I know? Private conversations with really smart analytics geeks. But also, there is a dead giveaway if you look at this traffic. There is no hostname assigned.

No Hostname Bad Visitors

A hostname of (not set) indicates that this data did not come into Google Analytics through a web browser. This is different from other bot spam techniques where hostnames do show up.
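To illustrate how low the bar is, here is a minimal sketch of how such a hit can be forged. This is an illustration only: the property ID, client ID, and page path are made up, and the referrer is the spam domain from the screenshot above. The parameter names are standard measurement protocol fields; note that no dh (document hostname) parameter is sent, which is why the hostname shows up as (not set).

    // Rough sketch: any HTTP request to this URL registers a pageview against the
    // property UA-12345678-1 (a made-up ID) – no browser and no visit to the site required.
    var hit = 'https://www.google-analytics.com/collect' +
      '?v=1' +                                   // protocol version
      '&tid=UA-12345678-1' +                     // any property ID, guessed or cycled through
      '&cid=555' +                               // arbitrary client ID
      '&t=pageview' +                            // hit type
      '&dp=%2F' +                                // page path "/"
      '&dr=' + encodeURIComponent('http://social-buttons.com'); // the spam "referrer"
    require('https').get(hit);                   // fire it from any server or script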

While this may make filtering easy (just require a valid hostname before recording a hit), it is also something the average Google Analytics user will not have the knowledge to set up.
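For readers who are comfortable with view filters, the fix looks roughly like this – a valid hostname include filter, sketched here with a placeholder domain – and it is exactly the kind of configuration most users will never touch:

    Filter Type:    Custom > Include
    Filter Field:   Hostname
    Filter Pattern: ^(www\.)?yourdomain\.com$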

This post is about the average user

In the past two weeks I have sat down with clients and colleagues who look at Google Analytics on average once a month.

Each of them has made the same observation upon logging in:

“Wow, our traffic is up this month. Awesome!”

There was no real concern about where that traffic came from. No sense of urgency to figure out whether that traffic was from real people.

They just assumed that something good happened to make the number look better.

That is how the average Google Analytics user thinks about data. They assume that it just works.

For good reason, too. For the most part, Google Analytics just works. There was no reason to question things on a massive scale, until now.

There is one tool measuring the web, and it is Google Analytics

According to Builtwith, over 27 million websites are using Google Analytics. That is likely a conservative number.

Google Analytics Market Share

Half of the top 10k websites use GA. 59% of the top 100k websites and 62% of the top 1 million websites use Google Analytics.

For the first time in my career, millions of websites now have a very good reason to question the accuracy of their data.

Now the question is: Will website owners even notice? And if they do, will this cause website owners to lose trust in Google Analytics?

Introducing the website visitor bubble

While I do not agree with pageviews as a measure of success, I know many companies that judge the performance of their online marketing by pageviews and visitors.

Publishers receive ad revenue by impressions and sell inventory based on their previous traffic. Small businesses set goals to drive more visitors to their websites over the course of the year.

Looking at Google Analytics today, a business may find that they are well on their way to reaching their marketing goals. Heck, they may even beat their own predictions!

If this bot traffic continues to grow, traffic for these websites will also grow. Website owners will make decisions based on this data. Investment will flow into the web channel, and marketers will be praised for their miraculous performance.

Everything will be amazing until the bubble bursts on this traffic. This will happen either by a massive cleanup effort from Google (my hope) or by digging deeper into the data to understand why traffic is increasing so much.

As an analyst, you would think that this digging into the data would have happened by now. But as a realist, I can see that most organizations do not think critically about their traffic numbers.

What happens when the bubble bursts

No matter how or when the bubble bursts, it will leave people looking for answers.

Advertisers will want their money back for the fake impressions they received.

Businesses will blame their marketing managers and analysts for not seeing the problem sooner.

Marketers and analysts will blame Google for not fixing the problem sooner.

While this may not seem like a threat at the moment, it has potential to snowball out of control as it hits the masses.

Why do I think it can get out of control?

Most of the time when data discrepancies happen with a product, they are noticed by the power users. In this case, that would be the top 10k websites, roughly 50% of which run Google Analytics.

Top 10k Websites

But in this case, the top 10k websites do not even notice that this problem exists. When you get millions of visits each month, 500 bot visits is a rounding error. It's less than 0.05% of traffic.

Who will notice then? The 27 million websites that run Google Analytics. Let's say that 20 million of those websites get fewer than 1,000 visitors a month. For them, 500 bot visits would inflate traffic by 50% to 100% or more.

This traffic could come in the form of measurement protocol hits as described above, or it could be traffic from crawlers like Semalt. No matter the source, this is a huge increase in traffic for these owners – enough for many of them to take notice. If even 1% of these users notice, that is 200,000 website owners.

This becomes a big time problem for Google.

URGENT: We need to solve this problem

I hope that by now I have sufficiently convinced you that this is a problem. It’s an especially large problem when put in the context of the sheer number of websites that are currently affected. We are talking millions of websites!

Because of the scale, I view this as the biggest challenge Google Analytics has faced. It is the largest data quality issue I have seen in my 15 years working with data. This is an urgent problem.

My solution: A data annulment

If I were in charge of this problem, I would start with a clean slate. This would be achieved by deleting all of this junk data from every Google Analytics account and starting fresh.

I would show an alert to all users when they log in, explaining that their data has been cleaned up and what they are now seeing.

Alert Box

And then I would put measures in place to proactively prevent this from happening in the future. This would involve cleaning up the holes in the measurement protocol. It would also involve expanding the bot filtering capability beyond the ethical bots that are registered in the IAB Bots and Spiders list.

Last, I would assign an internal task force to constantly monitor data quality issues and issue quick fixes when problems arise.

The Google Analytics team is brilliant, so they will fix this problem

I am confident that Google will fix this problem with a smarter solution than what I have proposed. The main reason why I am writing this article is just to frame up the problem in a perspective that is hard to ignore. This is not a case where power users are complaining to Google about a problem. Instead, it’s a problem most website owners don’t even know that they have.

This issue is giving millions of website owners a reason to stop trusting Google Analytics data. It is giving them reason to start looking for alternative tools. I don’t want that to happen.

Let's get this cleaned up, Google!

About the Author

Jeff Sauer is an independent Digital Marketing Consultant, Speaker and Teacher based out of a suitcase somewhere in the world. Formerly of Minneapolis, MN and San Francisco, CA.

  • Well stated, Jeff. I have personally spent the last 5 months trying to educate people on how to clean out their analytics, and there are millions of people frustrated by the effort required. Where is Google in all this? Why aren’t they educating people, or even acknowledging the problem? Isn’t it their product?

    • Hi Mike – I have been following your blog posts and forum comments and appreciate what you are doing to shed light on the subject!
      I think it’s one of three scenarios:

      1) Google is acutely aware of the problem, but doesn't think it is a big enough problem to address proactively.
      2) They are not aware of the problem at a level where decisions get made (i.e. it's not on the radar at all).
      3) They are aware and already working on a solution.

      My hope with this post was to get them to take notice if it were #1 or #2 by showing the scope. Scenario #3 would solve itself.
      If this is the article that causes them to take notice, great. Otherwise, let’s keep on getting more vocal until it does!

  • You say “There is a massive data quality issue happening right now in Google Analytics, and not enough people are talking about it.”

    There’s certainly a large number of people in the GA G+ community who have raised concerns about this – the problem is the total radio silence from the GA team on this! Even the GACP forum is silent and nobody from Google ever confirmed or denied or said anything publicly about this.

    Like you, Mike and thousands of people, we rely on Google Analytics as a primary source of information to optimize our businesses and those of our clients. We certainly do not want to damage the relationship we have with Google, but it is a serious issue!

    • It is getting noticed, but we can both agree that it is not being addressed by the people who can do something to fix the problem (Google). If this article helps frame up the problem for them in a tangible way, great.

  • Lucien

    I add .htaccess rules to block fake visits from bots. Cleaner Analytics, and the server doesn't have to serve content to those bots.

    • Good method. But again, the best way to eliminate this problem for the average user will be for Google to make a massive change and fix the problem proactively.

      • This does not solve the issue for some bots who are sending GA data directly, without ever visiting the site!

  • Kenneth Waters EA

    I noticed the uptick in visits on my website and immediately thought I was doing something right. As I began looking closely at the data in GA, something about what I was seeing troubled me. Not being experienced in this highly technical area, I relied on others posting about the same upticks they were experiencing. Your post, Jeff, confirmed my instinct that the data I was seeing for my website was not all correct. Thanks for sharing.

  • Nico

    Thank you for this article Jeff. The problem is really getting out of hand now. It has been going on for well over a year, starting with Semalt. Last month these nasty spammers made up well over 10% of the "traffic" on the low-traffic websites I run. I can't be bothered to update my filters every time a new site pops up. I run over 50 websites with GA, so keeping them updated and clean is one hell of a job.

    I don't know what I am more angry about, the spammers or the complete radio silence from Google. They know about this issue; just search for referral spam in Google and you will see hundreds or thousands of people talking about it. The same goes for social media channels like Twitter, so this problem can't have escaped the GA team.

    All we want is a statement from Google saying they are working on a solution. If they simply came out and acknowledged the problem and said they are looking into it, I would assume that sooner or later my data would be clean again. Right now, I am seriously thinking of switching to a paid solution from another brand. I think I should be glad I did not spend thousands of euros on upgrading to the paid version of GA…

    • Rich Patton

      Google may not listen, but we do – have you looked into Mixpanel? We do things quite differently, and I’d be happy to discuss.

      • Hi Rich – I do use Mixpanel on a few sites. It is a nice complement for me. Have not seen it as a replacement, though.

        • Rich Patton

          Hey Jeff – great post! For web tracking, I tend to agree with you. GA can provide a great understanding of attribution, especially through their own ad networks, and a high-level view of site performance, which is great for marketers. However, when you want to track a more complex process – such as a registration flow, or cart abandonment – or perform cohort analysis, Mixpanel provides greater detail, which is great for marketers and product teams. Many of our clients use us in tandem, where GA provides the “30k foot view,” and we bring more granularity. Glad to hear you’ve been using us!

  • That sums up perfectly what is going on. Like many others, I got caught by this without knowing what was happening, and for the last few months I have been following this topic very closely. I don't know what GA's position on this is, but I want to think they are working on it.

    Stephane is right, there are many people talking about it. I see it on the forums and on my blog, some of them really worried, and the word is spreading. Right now this is being exploited by a few (mainly one), and given the "success" they are having, more will come, and that could turn into a really bad situation, so I am reluctant to think Google is ignoring it.

  • Peter O’Neill

    Just a point on this (great post btw Jeff): I need to highlight that it is not bots hitting the websites. These sessions are being triggered using the measurement protocol. I know this because I set up a blank property just to get a View ID with no data in it (a client with a lot of websites; it was the easiest way to fix a data extract when we temporarily lost access). Suddenly data appeared for this account even though it was not on any website.

    What this means is that a lot of potential fixes (bot filtering) just aren't possible, and that these companies (or this company) are simply using the measurement protocol to trigger sessions, cycling through all possible GA account IDs. How do you identify fake measurement protocol hits when the referrer is constantly changing, while still letting through genuine server-side tracking?

    We need a solution from Google but think it will be a tricky one. Best option could be to monitor referrers that appear in more than X percent of websites each day, identify the fake ones based on behaviour and strip them out from all websites.

  • Part of the issue with the Google Analytics team not handling this problem is that the solutions available to us as users are cumbersome and incomplete — using custom segments only really helps when you remember to apply them, and only if you use the Google Analytics web front-end exclusively. It does _not_ help any widgets, data collectors, etc. that pull data through the Google Analytics API.

  • nls

    In particular for smaller websites Analytics has become useless because of this spam problem. Checking true statistics has become a manual search through mountains of spam. There are some guides that suggest blocking certain sources but with new spammers popping up every day that is an impossible task for individuals. On large websites 30 referrals from buttons-for-websites,com or 40 sessions from get-free-traffic,org might not even get noticed, but Google has to take action if they want their tool to keep having any use for smaller websites.

  • thegothtable

    All great points, Jeff. Filtering out (not set) hostnames is smart. A lot of the spammers do set their hostnames in a fairly unpredictable manner though—I've seen them set as the Guardian, google.com, my own website, etc.

    A couple things I’d add: some of these spammers are starting to send the traffic as other source/mediums beyond just referrals, e.g., google / organic with completely absurd keywords (which, gotta hand it to them, is clever since there’s hardly any keyword data anymore).

    It would be incredible if the analytics team came up with a solution to retroactively clean up the data, because at this point, even for users with decent command of segments, filters, and regex, the spammers switch source sites so frequently I’ve been running into character limits on my segments and filters! E.g., I have a segment to exclude visits from sources that match this regex: buttons(.*)website|semalt|cenoval.ru|web.*awards|(free-share|social)-buttons|googlsucks|best-seo-(solution|offer)|addons.mozilla.org|hulfingtonpost|darodar|aliexpress|o-o-6-o-o|rf.dev|humanorightswatch.org|guardlan|dreamcatcherhotels|smailik

    It has reached a point where it’s almost too much of a pain to filter/segment out this data even for fairly advanced users… some people would have to resort to adding multiple rules or filters to get rid of everything.

    • Crazy. That is a big problem. The main value proposition of GA is that it’s a free tool that just works for SMBs to measure their websites (and AdWords spend of course). When it no longer just works, then the value goes away quickly.

      Big companies have the wherewithal to ensure data quality for their sites, but not SMBs. It becomes a full time job, just keeping up.

      And now keeping up is a job on top of that!

      • thegothtable

        Yeah, I agree totally about the value of it. Saw some of the comments promoting Mixpanel—which I also use and like, and their support is exceptional. But I don't see it ever being a replacement for GA, particularly for the majority of SMBs, just because they need their analytics platform to work without any setup/maintenance.

  • Gaah! So I have a tiny website (gerardv.com) – I am in the middle of implementing GA to see what works for me (even with only tens of views per day). And what I want to do is use Google Experiments to tweak it. But if the majority of the data collected is spam, then experiments won't work properly either. The offered solutions allow me to filter the data so I see only what I want to see – but the data is still there and will (presumably) affect experiments. So while I am an entertainer, I also have a degree in IT, and this situation is really messing with my head. I might have to go down another more expensive route just to do what should actually be easy (and it would be easy if the data were not mostly spam). I say again: Gaaah! I don't want to filter the data after it is collected, I want to prevent it from being collected.

    • I feel your pain. One thing to note is that the referral spam is sporadic (usually a blast of hits in one day, then it goes away forever), so your A/B split test would split the junk traffic evenly across variations and affect your tests equally.
      The other thing to consider is that it may not affect your tests at all, if the traffic is being pushed in by non-referral means (the measurement protocol piece I mentioned above).
      I understand the desire to proactively block this traffic rather than filter. I guess you could consider modifying your tracking code to check for a valid hostname before even firing GA… That might solve the problem.
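      Something along these lines, as a rough and untested sketch wrapped around the standard analytics.js calls (the hostname pattern and property ID are placeholders you would swap for your own):

          // Only fire GA when the page is actually served from our own hostname.
          // Note: this helps against crawlers that copy pages to other hosts; it cannot
          // stop hits injected directly through the measurement protocol, since those
          // never execute this code at all.
          if (/(^|\.)yourdomain\.com$/.test(window.location.hostname)) {
            ga('create', 'UA-XXXXXX-Y', 'auto');
            ga('send', 'pageview');
          }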

    • thegothtable

      You can’t really place a lot of confidence in the results of a split test on such a small audience. Might be helpful to read some of Optimizely’s explanations (also, Optimizely is a pretty cheap way to split test): https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test#minimum

      • A small daily hit rate but over a longer period will produce statistically significant results.

  • Kristin Toivola Ziegler

    Have been meaning to comment and say thanks for this article! Lots of discussion here on how to handle it, and this has been a nice summary. We've even seen spam in the event tracking as of late. Arggg.

    • Yes, event tracking is out of control… OUT OF CONTROL Zig!

  • Head Down Golf

    Not gonna hold my breath. Each day the spam gets worse. If I try to block a site, 3 new ones pop up the next day. Google doesn’t seem to care

  • Sandra Newton

    This issue is doing my head in. Reporting for clients has become an exercise in frustration as we try to explain which stats can be considered even halfway actionable. It's taking extra time and effort on our part that is a complete waste. I know hackers/spammers are hard to track down and catch, but is there not a point where the whole internet user community just says "enough is enough"!

    • You might want to look into Supermetrics or Analytics Edge. Their awesome tools have built-in filters for this spam that they keep up to date.

      http://supermetrics.com/product/supermetrics-data-grabber/
      http://www.analyticsedge.com/

      • Sandra Newton

        Thanks Jeff – we apply filters to all our stats but it’s still a headache. Those tools look interesting, thanks for the recommendations. We’ve been working on our own reporting tool that extracts data from various sources, but these might actually work as a solution.

  • Bhavesh Desai

    Great article, you opened Pandora's box. I read your post when it was published; at that time I could not observe this on the five small sites I run for my customers. A week back I launched my own site; GA via GTM was implemented on 29th May. The site is not advertised, nor does it target highly searched keywords: only friends visited to see it. But the GA report for 1st June to 5th June shows 243 users in the Audience Overview report, which is wildly exaggerated. Worst part: the 243 users in the Audience Overview report include new and returning visitors, and the Location report shows 241 new users, which means only 2 returning visitors were captured!!! That is not true; I myself visit the site at least 5 times a day.
    I took data for the same period from the website log provided by my hosting company (AWStats) and it shows 87 unique visitors, which is believable.
    FYI – I have checked the Tag Assistant plugin, GTM, and GA, and everything looks OK, so I presume the code implementation is not the issue.

    There is data pollution and risk. Taking a one-time precaution is OK, but as a small agency owner who handles MSBs – micro and small businesses – it is not doable to update filters every day.
    Google may solve the problem, but we don't know when.
    I have a few questions:
    1) Is this kind of problem limited to GA, or does it affect all products that use the same method or protocol? I mean, is it a technology limitation?
    2) Can we use Piwik or any other open source software?
    3) Can we take Log file from hosting companies and process it separately?
    4) We can use GA for AdWords, but for other functions would we have to use non-GA tools?
    Please share your view and help me.

    • Bhavesh,

      1) The problem is limited to GA right now, as far as I know. They are the biggest player, which gives them the biggest target. You could easily manipulate other analytics platforms in the same way. Sort of like viruses on Windows PCs.

      2) You can use whatever software you like. This can be in addition to GA.

      3) You can parse your log file, but it will likely create more questions than answers. Log file analytics are not very insightful, as they have no concept of things that happen within a page. That's why JavaScript tracking became so popular: it is a client-side language.

      4) Yes, GA is the only tool with a native AdWords integration. So if you want those reports, you are pretty much stuck with GA.

      • Bhavesh Desai

        Thanks. Sorry for late reply.

  • Joe Worthington

    Thanks Jeff, you have articulated the frustrations admirably. Google needs to make a proactive decision to create a solution. The ‘blunt instrument’ fix of the spam filter released 17th June is lame, for want of a better word. The event tracking spam … don’t get me started!

  • Tammy Morris

    Thanks for this article! You’re so right: no one is addressing this issue. I have a very small business and my real traffic is low so I need very exact data to show what works and what doesn’t in my advertising endeavors. ihatevitaly.

  • A question that bugs me on the subject is the fakeness of these requests. I did some analysis on these spam referrals in some of our accounts and it seems they are getting "smarter", or more "humanoid" if you like: [kainoto.com/change-your-mind/analytics-spammers.aspx]. The bounce rate is changing as well, falling within the normal range. Pageviews, time on page, everything is starting to look just like real user behavior.

    That's why, in my post on segmenting views [kainoto.com/change-your-mind/filter-spam-visits-in-your-analytics.aspx], I have asked myself whether there is a real, final solution at all. Because there are 2 ways of faking requests:

    1. Generate fake requests from the spammers' servers (as mostly anticipated and assumed in the Analytics community).
    2. Generate fake referrals by simply redirecting real users (through a virus/bot on their computers/browsers) for their first request.

    The second method worries me much more because we cannot (may not) filter those out. We would then just have to avoid analyzing referrals altogether. But how can we know whether these requests are made by method 1 or 2?

    • Bhavesh Desai

      I have also observed the fakers getting smart and bringing readings into the normal range. I am worried about the effect on goal completions (by source/medium). But thanks to the community and members like Analytics Edge and Mike, who provide spam segments you can directly import. Hope Google gets into action before it's too late.

      • The good news is that it’s not too late to retroactively clean up data… the only thing that might be too late is an exodus of users to another option. I just don’t see it happening, though. We are entrenched.

    • That is a good question and a genuine fear. As end users, there is nothing we can do. This is why I have long said that the only real solution to this problem is a data quality advocate at Google who makes it their mission to eliminate spam.

      Google should be attacking this problem with the same vigilance with which they attacked click fraud in the mid-2000s. Click fraud was an existential threat to Google then, and now we hardly hear about it.

      I imagine a utopian world where I can say the same about Google Analytics – that spam is a thing of the past. I think it might happen, but I also don't see any urgency from Google in getting it done. Probably because GA is "free" and AdWords is the core business.

      If enough premium customers complained, it would be more urgent. But in reality, premium users don’t even notice spam referrers, because it’s a rounding error.

      Again, this is and always will be the reason for this post. To protect the small business owner who just wants a pulse on their web traffic.

        • I agree it should be at the core of Google's efforts, no matter that it is a free tool. What I wonder, though, is what they can really do about it. These days I have given it some extensive thought, and I am actually not sure they have any reasonable tools at their disposal if spammers get really smart about it and install browser plugins on the botnet computers of real users: http://kainoto.com/change-your-mind/2-ways-google-analytics-data-can-get-spammed.aspx

          • It's definitely possible for them to do it. That's what they did with AdWords (models against click fraud) and it works really well now. AdWords is a lot tougher to solve, because money is involved. Analytics lasted this long without spam because there was no real business case to pollute analytics. It's crazy that one person can taint the system so badly!

            Yes, but the difference with AdWords is that the ad runs on their own servers, which makes it possible to apply additional logic to the scanned traffic. Also, the fraud is tied to an account Google has to pay – the fraudster who creates the fake traffic – and they can scan that account for abnormal patterns, previous quality, and so on. You have a link between the clicks and some AdWords account.

            Within Analytics, they have limited access to our website servers (unless we implemented additional code on our servers, which would be a privacy issue for users). And the fraudsters are not paid by Google, so they are anonymous (they don't have an account with Google). It's impossible to link the hits to any other account.

          • There are certain aspects that will be tough to detect, but I'm sure they have some smart statistics/engineering folks who could figure that out. And they do collect the data on their own servers and can process it however they want. They are limited to their protocols, but their protocols are all over the web.

  • Any recommendations for the next best tool to use after GA (paid if necessary)? Ta

  • Shane White

    Great write-up! This problem is not new; it's been slowly creeping up over the past 2 years. Over 75% of my Google Analytics traffic is spam! I keep trying to filter the results, but new spam sites just keep popping up! I used to watch my analytics data daily, but now Google Analytics is pretty much worthless. I'm very shocked Google hasn't fixed this yet.

    • Yes, on small traffic sites it can really pollute the ability to perform meaningful analysis. Painful that it’s still a problem!

  • Randal McNally

    Awesome information! I can't keep up with the filtering and it's absurd that I have to. This is a BILLION dollar company, and the fact that this has been going on for years and has continued this long after your article shows just how much Google really cares. If they cared they'd throw some $ and brains at it and fix it. People move heaven and earth to get things done…………. when they want to. Clearly they do not.