8 Steps for Eliminating Bad Data in Google Analytics

December is a weird month for analytics reporting. This is the busiest time of year for many e-commerce companies. Employees are working around the clock to ensure that their websites are performing well.

Most other companies use this time as an opportunity to take time away from the computer. They recharge their batteries. They use up their vacation days. Many employees take a significant amount of time off from their jobs over the holidays.

While you were home alone, the spambots messed with your data

This year, many of us will return to our analytics reports and notice something strange: a lot of referrals from a website we have never seen before!

It might look something like this:

forum.topic.darodar.com

The topic number could be anything from 1 to 100 million. All the referrals are from a domain called darodar.com.

For smaller websites, this may show up as a top 10 traffic source for December. For tiny websites, this may be your largest source of traffic!

Here is an example where the forums are cracking into the top 10 of a website.

Forum Darodar.com

As you can see from my handy annotation of this traffic, these 127 visits are crap. They are not from real people. Your website did not go viral. Your website was not mentioned in some kind of forum.

You were visited by the Santa Claus of traffic: someone who you may have believed in when you were a young analyst. Then you realized that it was just your parents pulling one over on you.

Or in this case, you realized that it was just an international spy-bot.

How can you tell that the traffic is crap?

Traffic from Russia

It’s pretty easy to tell when the traffic you are receiving is crap. The longer you work in web analytics, the more experience you have dealing with these problems.

I can tell when traffic is junk in about .0002 seconds using a mental checklist I have developed over the years.

These are the rules that I teach to the hundreds of students who have gone through my Google Analytics Training course.

Here is the checklist!

1) If it sounds too good to be true, it is

Traffic rarely falls from the sky. Yes, there are moments where something you write goes viral. Or gets picked up by an influencer and spreads.

But these are rare moments for 99.9999% of web pages.

When you look at analytics reports, look to see what has changed since the last period you analyzed. If something major has changed, always assume the data is wrong.

Blind trust in numbers is more dangerous than not looking at the numbers at all.

2) Visit the referral link listed and see what it says

Unfortunately, I did not get a screenshot of the darodar.com referrals when the site was live. But I can assure you that it was not a site that appeared to be driving traffic to any of the sites I own. The language was Russian, and there were no links.

This is always a dead-giveaway that the traffic is not real.

Now, what if the website redirects you to what appears to be a legitimate electronics store? Perhaps this was just a traffic generation strategy?

Darodar.com redirect January 2015

If the URL in your referral report redirects elsewhere, there is little chance that it was legitimate.

3) Traffic has a half-life that usually lasts longer than a few days

The next step of analysis is to understand natural traffic patterns to your website. When something goes viral, it almost always follows the same pattern.

There is a large spike in traffic for a day or two. After a few days traffic cuts in half. A few more days and it cuts in half again. Over time the traffic fades down to zero.
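The halving pattern is simple enough to sketch. Here is an illustrative Python snippet; the starting visit count and half-life are invented for the example, not taken from any real report:

```python
# Illustrative model of the "half-life" decay described above:
# daily visits that halve every few days after a viral spike.
# The numbers here are invented for illustration only.

def decayed_visits(initial_visits, half_life_days, days):
    """Daily visit counts assuming visits halve every half_life_days."""
    return [round(initial_visits * 0.5 ** (day / half_life_days))
            for day in range(days)]

print(decayed_visits(initial_visits=2000, half_life_days=3, days=10))
```

A traffic source that climbs steadily and then drops to zero overnight, like the darodar.com forums below, never fits this curve.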

Here is an example of a natural traffic pattern from when a post caught fire on Jeffalytics:

Natural Traffic Pattern

While it is hard to tell with the scale, the post received 100+ visits per day for months after the initial viral push. A big influx of traffic will often last for months or years before it finally reaches 0. This comes from the initial referrer that made you viral.

It also includes returning direct visits to your website from those who discovered you through that source.

Here is what the traffic pattern looks like for the darodar.com forums:

Russian Forum Traffic

As soon as it was there, it was gone. No half-life. The traffic increased over time and then was cut off abruptly. This is not a natural traffic pattern.

4) To confirm that traffic is non-human or bot traffic, look at your referral metrics

The easiest way for me to tell that traffic is junk is a 100% bounce rate.

Or a 0% bounce rate.

Or a 100% new visitor rate.

Or a 0% new visitor rate.

I have analyzed billions of website visits and seen millions of traffic sources. I have never seen a perfect 0% or 100% rate for any metric from human traffic.

It is virtually impossible for this to happen, because of how we collect analytics data. A new visit in Google Analytics means that a visitor cookie was never before set in a browser. If your content goes viral, there is no chance that it will only reach new visitors.

A 100% bounce rate means that every single person who visited your site left without taking any kind of incremental action. I have only seen this happen with Google AdWords traffic. With a tiny budget. Sending traffic to a terrible landing page with no navigation.
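As a sketch, this sanity check is easy to automate on exported data. The function and the per-source tuple layout below are assumptions for illustration, not a Google Analytics API schema:

```python
# Flag traffic sources whose rates are a perfect 0% or 100%,
# which (as argued above) essentially never happens with humans.
# Assumed input layout per source: (sessions, bounces, new_visitors).

def looks_non_human(sessions, bounces, new_visitors):
    if sessions == 0:
        return False  # nothing to judge
    bounce_rate = bounces / sessions
    new_rate = new_visitors / sessions
    return bounce_rate in (0.0, 1.0) or new_rate in (0.0, 1.0)

sources = {
    "forum.topic52.darodar.com": (127, 127, 127),  # all bounced, all new
    "google / organic":          (950, 412, 519),
}

for name, metrics in sources.items():
    if looks_non_human(*metrics):
        print("suspicious:", name)
```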

5) Use secondary dimensions to validate your findings

While high bounce rates or new visitor percentages are usually a dead giveaway, you may want more evidence of a problem. This is where secondary dimensions come in handy.

Try applying secondary dimensions to your source/medium report. Does the traffic look natural?

I like to look at the service provider report to see if the ISP looks legitimate.

Service Provider by Referrer

You can also find interesting properties by looking at the city, country and region of the visitor.

City Report

After looking through several secondary dimensions, noticeable patterns will start to emerge.

6) Block out bad traffic as soon as you can

You can choose your own tolerance for when you should filter out bad traffic to your website. Or you can use mine.

Here are two ground rules for when you should filter out your traffic:

  • If a non-human traffic source makes the report for your top 10 traffic sources, remove it as soon as possible
  • If a bot traffic source accounts for more than 1% of your traffic, remove it asap

Would you like to see what our Video Lesson about Getting Clean Data looks like?

Much like filtering your internal IP addresses, you want your data to reflect your marketing audience. You do this by eliminating non-marketing visitors from Google Analytics. My rule of thumb is to apply a filter when these visitors represent more than 1% of your traffic.

Why 1%? Because if this traffic is more than 1% of your traffic, it can have a noticeable effect on your ability to analyze results. Let’s take the example of darodar from above and examine further.

Darodar Qualitative Traffic Metrics

All the key metrics in the behavior report are significantly different for this traffic than the rest of the site. Especially the session duration metric. This difference is enough to affect your ability to accurately report on website activity. You need a filter.

How do you apply a filter for this traffic?

Easy. Create an advanced filter with the following pattern to protect against future visits:

Darodar.com Referral Filter

The filter pattern is:

.*darodar.com

While it appears that the darodar.com traffic went away in December, I still recommend a filter. This pattern will prevent it from ever coming back into your reports.
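Before trusting a filter pattern, you can sanity-check it against known referrers. Here is a quick sketch using Python's re module; note that the escaped dot (darodar\.com) is slightly stricter than the pattern above, since an unescaped . matches any character:

```python
import re

# GA filter patterns are partial-match regular expressions.
# Escaping the dot keeps the pattern from matching, say,
# "darodarXcom"; both forms will catch these referrers.
pattern = re.compile(r".*darodar\.com")

referrers = [
    "forum.topic52.darodar.com",
    "forum.topic1048915.darodar.com",
    "google.com",
]

spam = [r for r in referrers if pattern.search(r)]
print(spam)  # only the darodar.com referrers remain
```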

To learn more about filters of spam traffic, I recommend reading this excellent article by Analytics Edge.

7) Apply an advanced segment for your historical data

Applying a filter helps you proactively block future traffic, but what about the past? Advanced Segments can be your best friend here.

Create an advanced segment that uses a regular expression to block darodar.com traffic. Here is how this looks:

Advanced Segment Block Darodar.com

If you are uncomfortable with advanced segments, here is a link to the segment. Install Jeff’s Block Russian Forums segment. You can also find this in the Google Analytics Gallery.

When applied to your site, you may notice a large difference in key metrics like time on site. This site was over-reporting time on site by 15 seconds because of darodar.com!

Advanced Segment Applied

The only downside of the advanced segment is that it could result in data sampling for large sites. With that said, large sites may not even notice the forum traffic in their reports, so this may not be necessary.

If you noticed more referrals than just those from Darodar.com, we also have you covered. Here is an advanced segment that covers several more odd referrers.
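If you prefer to build your own segment, a single alternation regex can cover several offenders at once. The domain names below are taken from this post and its comments, not from the shared segment itself, so verify them against your own referral report:

```python
import re

# One regex covering several spam referrers mentioned in this post
# and its comments; extend the list as new domains appear.
spam_names = ["darodar", "semalt", "smailik", "social-buttons"]
spam_pattern = re.compile("|".join(re.escape(name) for name in spam_names))

def is_spam_referrer(referrer):
    """True if the referrer hostname contains any known spam name."""
    return bool(spam_pattern.search(referrer))
```

In Google Analytics, the equivalent segment condition would be Source matches regex darodar|semalt|smailik|social-buttons.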

8) Annotate your account to explain the blip in traffic

Being the good data citizen that you are, it’s important to let others know about your discovery. Spend 2 minutes annotating your account with an explanation of what happened. Being funny is not required, but it does tend to make analytics less boring.

Annotations for Russian Spammers

There you have it. This is a simple checklist that you can apply to just about any situation you have in analytics.

To recap, here are the 8 steps that you should follow:

  1. If it sounds too good to be true, it is
  2. Visit the referral link listed and see what it says
  3. Traffic has a half-life that usually lasts longer than a few days
  4. To confirm that traffic is non-human or bot traffic, look at your referral metrics
  5. Use secondary dimensions to validate your findings
  6. Block out bad traffic as soon as you can
  7. Apply an advanced segment for your historical data
  8. Annotate your account to explain the blip in traffic

About the Author

Jeff Sauer is an independent Digital Marketing Consultant, Speaker and Teacher based out of a suitcase somewhere in the world. Formerly of Minneapolis, MN and San Francisco, CA.

  • Thanks for this article Jeff. I had a sneaky suspicion darodar was similar to a semalt when I started to see it popping up in a few different site reports.

  • Dominic Hurst

    Excellent post Jeff. The 8 steps mentioned need to be etched into the minds of all analysts.

  • Guest

    thank you sooo muuuuch for this! I already started to think i was crazy

  • Chris

    Thanks for the post! I woke up to traffic from Samara this morning.

  • Eduardo Ruiz

    thanks a lot! helped with the smailik forums

    • tiger62

      I’m getting this smailik spam on my GA. How did you get rid of it please? Thanks in advance.

      • Eduardo Ruiz

        administration (on top menu), filters, and follow the “How do you apply a filter for this traffic?” on this article. =)

        • What Eduardo said. This should be really easy to follow along with.

          Your clean data only applies moving forward. You will need to apply an advanced segment for retroactive data.

  • WS

    Thank you!!

  • I really hope that GA takes preventive measures against these types of traffic, because it continues to mess with our analytics data. I can no longer see which ones are the legit sources of traffic. I just take note of the sites that I am using to promote our materials.

    • I agree. The problem has gotten worse in recent weeks. I may have to write another post to just highlight how bad it is and ask Google to take quick action.

  • Rajiv

    I have configured an Advanced Filter to track the sub-domains record as follows:

    Filter
    Type: Custom filter > Advanced

    Field A: Hostname

    Extract A: (.*)

    Field B: Request URI

    Extract B: (.*)

    Output To: Request URI

    Constructor: $A1$B1

    After that, I am able to see the sub-domains record and view the full page URL in reports. But when I check reports in All Pages (e.g. Behavior >> All Pages) or select Landing Page as the primary dimension, and then click the icon next to a displayed full URL to visit that page, the browser opens a URL with a doubled domain name, so the page does not open successfully.

    For example :

    In the landing page list, the following URL is given:

    http://www.sitegeek.com/compareHosting/arvixe_vs_hostgator

    If I click on the icon next to the displayed URL, the following URL opens in the browser:

    https://sitegeek.comwww.sitegeek.com/compareHosting/arvixe_vs_hostgator

    Is the first domain with HTTPS coming from the ‘View’? Where is this taken from?

    How Can I remove double domains?

    Thanks,

    Rajiv

    • The double domain usually means you are applying the filter twice. I have seen this happen before, and that appears to be happening in your case. Check that your view doesn’t have the same filter applied twice.
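A toy sketch of why a doubled filter produces the doubled hostname: the advanced filter's constructor $A1$B1 prepends the hostname capture to the request URI, so running the same filter twice prepends the hostname twice. The function name here is mine, for illustration only:

```python
# The advanced filter rewrites Request URI as $A1$B1:
# hostname capture followed by the URI capture.
def apply_advanced_filter(hostname, request_uri):
    return hostname + request_uri  # $A1$B1

uri = "/compareHosting/arvixe_vs_hostgator"
once = apply_advanced_filter("www.sitegeek.com", uri)
twice = apply_advanced_filter("www.sitegeek.com", once)
print(once)
print(twice)  # doubled hostname, roughly matching the symptom above
```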

  • Pietro

    I’m trying to filter out .*social-buttons.com but it’s not working. Is the expression correct? Thanks.

    • Filters only work moving forward. Test your filter before committing to see if it would work on visits over the last 7 days. The other test I use is to create an advanced segment and see if that reduces the visits accordingly.

      • Pietro

        Thank you Jeff, I’m testing, but it say that this setting doesn’t affect datas, and i’m sure I have hits from social-buttons.com everyday. Thank you again

        • Usually spam traffic does not affect your data consistently. It is not something that happens every day, but usually in batches. Try creating an advanced segment with that filter as a regex to test. Keep testing until you get it right.

  • Alejo

    Good article. I’m thinking about a slightly more radical solution: block all the traffic which comes from this kind of site (or bot) with htaccess. I just don’t want their visits at all, not only to ignore them in the analytics numbers ¬¬

    • You can try that, but it is my understanding they might be using hacked/infected computers. This would mean a moving target for IP Address. I do hope it works and let me know what you find out!

  • JulianPope

    How do I get rid of ru or (not set) language traffic from my site? I have been banning bad referral traffic via Referral Exclusion List, but that doesn’t always seem to work, any ideas? thanks

    • Julian. Referral Exclusion List is NOT how you should be dealing with this problem. Filters are the way to exclude certain referrers. Google is working on a solution that is supposed to handle this as well, but not sure on timing.

  • Molly Adell Bochanyin

    Hi Jeff,

    I’m looking at a particular site my business owns and there are quite a few referrals I believe are spam. I’ve gone in to set a condition to exclude these (about 20). I also set a filter to remove one. I am worried that if I filter all 20 I’ve located, I could discover that one is an ad placement somewhere that I will need to go back and locate. Is this a good reason for only setting a condition rather than filtering everything? If I understand correctly, you cannot get the information back once it is removed by a filter?

    Thank you,

    • Hi Molly,

      You should always create a second view in Google Analytics before applying any filters. Your understanding is correct that if you filter out something, it is gone for good. The second view with no filters can be there if you need to go back and do retroactive analysis, and I highly recommend doing that.
      My recommendation is always to be conservative with filtering, because you never know if something will be needed. But the obvious spam items can be safely discarded.
      As for your ad network placements, they should be using campaign tagging anyway, which will make their source visible to you (i.e. DoubleClick or whatever the network is). You should not be seeing an ad network as a referral, because that means it is not being properly tracked in the first place.
      The two views rule applies throughout.

      Jeff

  • Great article Jeff, Thank you! Once I blocked all the referral traffic, I am getting a bunch that is coming directly to the site, so it is saying direct (as the source of traffic) and None (as the referral). Also, the language, country, service providers are all appearing as not set. This shouldn’t even be happening since I haven’t even set my WP site as active to be crawled by search engines. I got it down by filtering about 40% but now the other 60% are all direct hits, 100% bounce rate and less than 20 seconds on the site. They are killing my site before I even launch. Any suggestions would be greatly appreciated. Thank you,

    Bethanie

    • This could be one of two things:

      1) There is direct referral spam coming in everywhere through the measurement protocol. This is hard to detect, but you might want to look at the hostname or service providers report to see if there is a pattern.

      2) You may have eliminated legit referrers and caused traffic to be set as direct improperly. Not Set is often a sign that you have incomplete tagging or are tagging events that didn’t happen and double counting visitors. Are you using GTM at all?

      And please tell me you aren’t using the referral exclusion list to do this vs. filtering.

      • Hey Jeff,
        Thank you so much for responding. I am going to do my best to respond in a way that makes sense to your more developed web mind.
        1. For my infancy stage site, it seems to be like A LOT of direct (not set) traffic, it is over 500 month. And I can see myself on there for hours trying to build the site out through my ISP. I have dug around with no luck, so I will see if my hosting company can help investigate.
        2. So far I have only excluded the obvious ones like the free share buttons, free social buttons, etc. The ones that literally have 8 different variations like ww3, ww4, ww6, site3. Just crazy.

        Sooooo, apparently I was doing a bunch of no-no stuff. And thanks to your blog and screenshots (love screenshots), I am now filtering instead of using referral exclusions. And now I have more than one view. I think I would have made your skin crawl had you seen my GA admin console a few days ago.

        And since I had to Google GTM, that is a pretty profound No on that one. I am reading about it and will see how I can fit that in while I am battling spam bots with ninja stars!

        Thank you so much for your help!!!
        Bethanie

        • Glad it’s at least some help. I am not sure that it has completely solved your problem, but checking with ISP/web host does help.

  • Jess D’souza

    Thanks for sharing this article. I have worked with GA, and yes, spam traffic is one of the irritating problems there. Sometimes it shows huge traffic from an invalid source, and it also skews the reported bounce rate.

  • Andrew Tewksbury

    Is there a reason this traffic starts to pop up? I recall you talking a bit about this in our class this fall. Mainly, what benefits/reasons people go to the effort of creating fake traffic, or is it purely to mess with Larry and Sergey?

    • I think it is a traffic play, just like any other SPAM. Vitali gets visitors to his site because people click on referral list to see what they are. Not all do it, or convert, but you just need a few and it’s profitable (since they use infected computers to create the reports anyway).

  • Steph Dyson

    I’ve tried putting in filters but when I go to verify it, it says that the filter would not have changed my data in the past 7 days (when I know it would). Have I done something wrong?

    • It is probably because your spam referrals occurred more than 7 days ago. Unfortunately spam occurs sporadically and inconsistently, so filters don’t really work that well against them. Once you spot the problem, they stop sending data. If you are not doing this for spam filters, then the story might be that you have an incorrect filter. If the data was in your account within the past 7 days, then the filter should indicate a positive match.

      If you want to provide more details, I can continue to help troubleshoot.

    • Joyce Hall

      Sure would have been nice if this had been answered, because I have the same problem.

  • sneha

    Hello. For the past 3 days I cannot see the users. Earlier, when I went to my website, I showed up as an active user. But for the last 3 days I cannot see myself as an active user, which means it is not tracking properly. Can someone help?

  • paula

    I manage Google Analytics for my divisions’ sites and always find such a struggle trying to explain this to other project managers that work in GA. This article explains this issue SO comprehensively to people who aren’t as advanced. THANK YOU!

  • I recently started with Google Adwords, but when I checked my referrals it seems legit at first, but when I looked closer then I found the scam traffic. I wonder if Google itself sends or add scam referral sites to it’s adword system!

  • Well, can any optimizer tell me whether bot or fake traffic affects organic traffic? After receiving lots of traffic from referrals, my organic traffic is going down from sky to land. Please tell me how to get rid of this? Thanks, Telezone

    • At this point I have not seen a lot of spoofed organic traffic. Mostly referrals.

      • i didn’t get what you said please describe.

        • It is hard to spoof organic traffic from Google, because only Google classifies referrals from their site as organic search. But they likely have filters to block out people trying to spoof this traffic or they match on hostname to prevent this from happening.

          They don’t have the same filters for referrals, which makes it easy to spoof.