
According to Cloudflare’s 2025 data, approximately 30% of all global web traffic is generated by bots.
Not all bots are created equal. This 30% can be categorized into two distinct groups:
- Verified Bots (The “Good” Bots): These are automated services that site owners generally want to interact with. They follow rules (like robots.txt) and provide a service.
- Unverified & Malicious Bots (The “Bad” Bots): These make up the majority of the “unwanted” traffic, don’t follow robots.txt rules, and often hide their identity.
Just because bots are known and verified doesn’t mean you want their traffic. Here are some steps to deal with them.
Known Bots and Robots.txt
The relationship between bots and robots.txt is best described as an Honor System.
The file itself has no technical power to stop a bot; it is simply a “Keep Off the Grass” sign that bots choose to respect or ignore before entering a site.
For known bots like Googlebot, Bingbot, or DuckDuckBot, the robots.txt file is the law. These are known as “Verified Bots” in Cloudflare’s ecosystem.
The robots.txt file is always located in the root directory of a website.
For WordPress users, keep in mind that WordPress generates a virtual file if a physical one doesn’t exist.
You can add a physical robots.txt file by placing one in the root of your WordPress installation.
There are also plugins that let you manage it from the dashboard.
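As a reference, a minimal physical robots.txt dropped into the web root could look like the sketch below; the sitemap URL is just a placeholder.
# Applies to every crawler; an empty Disallow blocks nothing
User-agent: *
Disallow:
# Optional: point crawlers at your sitemap (placeholder URL)
Sitemap: https://example.com/sitemap.xml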
Blocking User Agents using Robots.txt Directives
Just because a bot has been verified doesn’t mean its crawl requests are welcome.
This is how you instruct one of Meta’s bots not to crawl your website:
User-Agent: meta-webindexer/1.1
Disallow: /
Even after it sees the rule, the bot might finish its current “list” of pages it already planned to visit before stopping.
You can block as many good bots as you want using robots.txt directives so you can focus on the bad actors.
Check the bots directory to learn more.
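For example, to turn away several crawlers at once, give each one its own group. The user-agent tokens below are just common examples; use whichever names the directory lists for the bots you want to refuse.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PetalBot
Disallow: /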
Blocking Paths using Robots.txt Directives
I include directives for both existing and non-existent directories as a formal ‘keep out’ sign for bots, while my Cloudflare firewall rules handle the actual enforcement with a hard block.
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /tag/
Disallow: /feed/
Disallow: /category/
Disallow: /search/
Disallow: /author/
Disallow: /pages/
Disallow: /blog/
Disallow: /page/
It’s my way of telling the ‘good’ bots: ‘I don’t want to cause you any trouble, and I’d appreciate it if you didn’t cause me any either.’ By setting these rules early, we both know where the boundaries are.
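On the enforcement side, a Cloudflare WAF custom rule set to Block along these lines is one possible sketch; the paths mirror the robots.txt list and starts_with() is a standard function in the Rules language. I’ve left /wp-admin/ out of the sketch because blocking it outright would also lock out your own dashboard unless you add an exception.
(starts_with(http.request.uri.path, "/wp-includes/") or starts_with(http.request.uri.path, "/tag/") or starts_with(http.request.uri.path, "/feed/") or starts_with(http.request.uri.path, "/author/") or starts_with(http.request.uri.path, "/search/"))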
Cloudflare Verified Bot Categories
The Verified Bots category is a whitelist of automated services that have been manually reviewed and confirmed as “legitimate” or “helpful” by Cloudflare.
Instead of just looking at a User-Agent string (which can be easily faked), Cloudflare verifies these bots against signals such as the IP ranges their operators publish, and exposes the result in fields you can match on, for example:
(cf.verified_bot_category eq "Page Preview")
You can check a list of all the bots included in that category by visiting the bots directory.
To ensure your content is correctly ranked, shared, and monetized, your firewall should always whitelist three key categories: Search Engine Crawlers, Page Previews, and Advertising & Marketing bots.
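As a sketch, those three categories could be combined into a single expression for an allow or skip rule; the names below are the ones Cloudflare uses in its bots directory, so confirm the exact spelling before relying on them.
(cf.verified_bot_category in {"Search Engine Crawler" "Page Preview" "Advertising & Marketing"})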
A targeted crawling strategy means allowing Search Engine Crawlers in general, but specifically directing bots like Yandex to back off via robots.txt.
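In robots.txt, that nudge could look like the group below; “Yandex” is the token Yandex documents for all of its robots, but check the directory for the exact name of any other bot you target.
User-agent: Yandex
Disallow: /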
If you don’t want to whitelist an entire category, you can create a rule like this one, which allows a user agent containing a specific keyword as long as it belongs to the known bots list:
(cf.client.bot and http.user_agent contains "AhrefsBot")
That rule confirms the bot’s identity so you aren’t accidentally letting in a malicious scraper using a fake User-Agent.
Block Access to File Extensions
Since all my images were converted to WebP, I have instructed known bots not to request the old formats.
The site doesn’t require external stylesheets or scripts either.
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.js$
Disallow: /*.css$
Most of the hits on those file types come from bots; once the cache has been cleared and some time has passed, a real user will rarely request those files directly.
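If you want Cloudflare to back that sign up with a hard block, one possible sketch is a custom rule like the one below; cf.client.bot and http.request.uri.path.extension are standard fields, and whether you block, challenge, or only log the matches is your call.
(cf.client.bot and http.request.uri.path.extension in {"jpg" "jpeg" "png" "js" "css"})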