How to detect fake search engine crawlers and requests with empty user agents

Bots can be categorized as either good bots or bad bots.

Good bots are beneficial to all online businesses. They help create the required visibility for sites on the internet, and also help these businesses build online authority.

When you search for a site or phrases related to the site's products or services, you get relevant results listed on the search page. This is made possible with the help of search engine spiders/bots, or crawler bots.

Good bots are regulated. There's a specific pattern to these types of 'regulated' bots and you also get the option to tweak the crawler activity on your site.

These bots help in improving the website’s SEO.

Bad bots don’t play by the rules. They're illegitimate, follow a distinctly ‘malicious’ pattern, and are mostly unregulated. Imagine thousands of page visits originating from a single IP address within a very short span of time. This activity stresses your web servers and chokes the available bandwidth, directly impacting genuine users trying to access a product or service on your site.
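The burst pattern described above — many requests from one IP in a short window — is straightforward to detect. Here is a minimal, illustrative sliding-window sketch (hypothetical names; this is not Shield's actual implementation):

```python
from collections import defaultdict, deque
import time

class BurstDetector:
    """Flag an IP that makes more than `limit` requests within
    `window` seconds. Illustrative only - not Shield's code."""

    def __init__(self, limit=100, window=10.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Discard timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.limit  # True => burst-like, likely a bad bot
```

Real products weigh many more signals than raw request rate, but the principle is the same: behaviour, not self-identification.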

What is the Fake Web Crawler signal?

Picture the Google web crawler for a moment – the bot that scans your website for search engine results. How do we know that this particular bot is an official Google bot? Because it tells us it is, through the User Agent ID.

When you browse to a web page, your browser sends along a piece of text that gives the web server details of the browser you’re using. For example, that it’s Google Chrome, the version is 73.0.3686, it’s 64-bit, etc. This is the User Agent ID.

The Google web crawler does the same thing, and it usually has the text ‘Googlebot‘ in there somewhere, or something similar.

But what’s to stop any bot from throwing ‘Googlebot’ into its User Agent string and making us think it’s a Google web crawler?

Absolutely nothing, of course. And they do.

What can we do about that? Well, it turns out there are ways to confirm whether a bot is really a Google bot or not. Shield Security has been doing this for a long time already.

And because we can confirm whether a bot is real-Google or fake-Google, we can now use this as a ‘bad bot’ signal. We can confidently state that a fake-Google user agent is a bad bot, masquerading as a good one.
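One well-known way to confirm a real Googlebot is Google's own documented procedure: reverse-DNS the requesting IP, check that the hostname belongs to googlebot.com or google.com, then forward-DNS that hostname and confirm it resolves back to the same IP. A sketch of that logic (the lookup functions are injectable parameters here purely so the logic can be exercised without network access; Shield's internal implementation may differ):

```python
import socket

def is_real_googlebot(ip,
                      reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward_lookup=socket.gethostbyname):
    """Reverse DNS + forward-confirm verification for Googlebot."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    # The PTR record must sit under Google's crawler domains.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward resolution must point back at the original IP.
        return forward_lookup(host) == ip
    except OSError:
        return False
```

The forward-confirmation step matters: an attacker controls the reverse DNS of their own IP space and can set a PTR record that *looks* like Google's, but they cannot make Google's forward DNS resolve that hostname to their IP.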

What is the Empty User Agents signal?

We discussed what User Agents are, above. All normal traffic from real people includes a user agent. But some bots are sloppy and neglect to include a user agent in their requests. This suggests that they’re perhaps a bot.

Care needs to be taken with this setting, as not all web hosts are properly configured to populate the user agent request value in PHP. This can make it appear that there’s no User Agent sent with a request when, in fact, there is. You’ll need to test this on your hosting platform to ensure that you can use this signal.
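The check itself is trivial. A sketch, using a plain dictionary to stand in for whatever your framework exposes as request headers (in PHP this would be `$_SERVER['HTTP_USER_AGENT']`):

```python
def has_empty_user_agent(headers: dict) -> bool:
    """Flag requests whose User-Agent header is missing or blank."""
    ua = headers.get("User-Agent", "")
    return not ua.strip()
```

The hosting caveat above is why testing matters: if the platform never passes the header through, every request looks empty and this signal would punish legitimate visitors.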

Read more about Bot Signals here.

How to detect fake search engine crawlers and requests with empty user agents

Shield uses the bot signals explained above to detect characteristics and behaviour commonly associated with illegitimate bots. Based on these signals, Shield can:

  • Detect fake search engine crawlers
  • Detect requests with empty user agents

The settings can be found under the main navigation menu > Config > IP Blocking > Bot Behaviours.

Here you can configure the "Fake Web Crawler" and "Empty User Agents" bot signals independently of each other and decide how you want Shield to respond. You’ll have 4 options to choose from:

  • Activity Log Only. This option lets you see the activity of these bots on the Activity Log before applying any offenses or blocks to offenders. It’ll let you test-drive the signal before making it take effect.
  • Increment Offense (by 1). This option puts another black mark against an IP. As always with the offense system, once the limit is reached for an IP address, it is blocked from accessing the site.
  • Double Offense (by 2). We’ve added the ability to give weight to certain behaviours. By allowing the offense counter to increment by 2, the IP will reach the limit more quickly, and be blocked sooner.
  • Immediate block. If you decide that a particular signal on your site is severe enough, you can have Shield immediately mark that IP as blocked.
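The four response options above can be sketched as a simple offense counter (hypothetical names and structure; not Shield's actual implementation):

```python
class OffenseTracker:
    """Illustrative counter mirroring the four response options:
    'log', 'increment', 'double', and 'block'."""

    def __init__(self, limit=10):
        self.limit = limit
        self.offenses = {}   # ip -> offense count
        self.blocked = set()

    def apply(self, ip, response):
        """Apply a signal's configured response; return True if IP is now blocked."""
        if response == "log":
            return False          # Activity Log Only: record, no penalty
        if response == "block":
            self.blocked.add(ip)  # Immediate block
            return True
        step = 2 if response == "double" else 1
        self.offenses[ip] = self.offenses.get(ip, 0) + step
        if self.offenses[ip] >= self.limit:
            self.blocked.add(ip)  # offense limit reached
        return ip in self.blocked
```

Doubling the increment simply halves the number of offenses an IP can commit before it hits the limit, which is why it suits behaviours you consider more serious.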

You can see all activity from these bots in your Activity Log. For example, a fake web crawler: