Should you block AI crawlers from scraping your website?

What do Amazon, Quora, The Guardian, The New York Times, CNN, WikiHow, and Shutterstock have in common? They all block GPTBot.

Perhaps you were unaware, but AI tools like ChatGPT and Google Bard use crawlers such as GPTBot to scrape the internet, using other people’s content as fodder to train their large language models (LLMs).

Your content feeds this money-making machine and you don’t see a bean from it. The great AI revolution is built on your back, broken or not.

This raises ethical and intellectual property concerns, but let’s forget all that. I’m here to tell you that you can take matters into your own hands (sometimes).

Read on for instructions and insight.

Big names are blocking AI in droves

A study by Originality.ai analysing the top 1000 websites globally found that, as of September 2023, 26% of them were blocking GPTBot specifically, and 13.9% were blocking the Common Crawl bot (CCBot).

Key findings include:

  • The first top 100 website to block GPTBot was Reuters on August 8, 2023 – just 1 day after GPTBot launched.
  • Within two weeks, major sites like Amazon, Quora, The New York Times, Shutterstock, WikiHow and CNN had blocked GPTBot.
  • Overall, 26% of the top 1000 block GPTBot, while 14% block the Common Crawl bot and 7% block a generic ChatGPT user agent. Very few are targeting other AI bots so far.
  • The study found that the top 100 websites were initially more likely to block GPTBot, but the block rate is now similar – 26% for both the top 100 and the top 1000.
  • News and media outlets have rushed to block GPTBot, including The New York Times, The Guardian, CNN, USA Today, Reuters, Washington Post, NPR, CBS, NBC, Bloomberg and more.

While only 26% of top sites are blocking AI bots, the findings suggest that websites are starting to respond by limiting access. For now, media outlets in particular are leading the charge in restricting AI bots like GPTBot.

How to block GPTBot

The biggest culprit for internet scraping is GPTBot, the crawler behind ChatGPT – the golden child of AI tools because it’s free, capable, and updated every few months with a better model.

Here’s what OpenAI has to say about GPTBot:

Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.

OpenAI.

Dunno, sounds a bit flaky.

The good news – you can block GPTBot and some other bots from crawling your website, eliminating the possibility of your content being used to train these models.

Here’s how:

Add a robots.txt file

To stop GPTBot from accessing your entire website, you need to add two lines to your robots.txt file. Here’s how:

  1. Create a plain text file called robots.txt.
  2. Add the following lines to block GPTBot:

User-agent: GPTBot
Disallow: /

  3. Upload the robots.txt file to your website’s root directory.

This will prevent GPTBot from crawling any pages on your site.
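Once the file is live, you can sanity-check the rule yourself. Here’s a minimal sketch using Python’s built-in urllib.robotparser – swap example.com for your own domain:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # your site's robots.txt
parser.read()

# False means GPTBot is blocked from the root path
print(parser.can_fetch("GPTBot", "https://example.com/"))

If this prints False, your Disallow rule is doing its job.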

Block GPTBot from Specific Directories

You can also use robots.txt to block GPTBot from only certain folders or pages.

For example:

User-agent: GPTBot 
Allow: /public-pages/
Disallow: /private-pages/

This allows GPTBot to access /public-pages/ but blocks it from /private-pages/.

Adjust the paths and instructions accordingly for your website structure. Make sure to upload your updated robots.txt file after making any changes.
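Remember the Common Crawl bot from the study above? Its user agent is CCBot, and robots.txt lets you stack rules for several bots in one file. A minimal sketch:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /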

How to block Google Bard

Google does not provide a user-agent disallow feature for Bard itself, but it does offer a standalone product token, Google-Extended, which stops Bard and Vertex AI generative APIs from using your website’s content.

You can update your robots.txt to block Google-Extended from accessing your content, or parts of it with this text:

User-agent: Google-Extended
Disallow: /

It’s as simple as that.
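Google has said that blocking Google-Extended doesn’t affect how Googlebot crawls, indexes, or ranks your site in Search. And if you’d rather wall off only part of your site, point Disallow at a path instead – the /premium/ folder here is just a placeholder:

User-agent: Google-Extended
Disallow: /premium/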

How to block Bing Chat AI

Bing provides a few options to control how your website content is used in Bing Chat answers and for training Bing’s AI models. You can implement these controls using standard meta tags.

To limit Bing’s use of your content:

  1. To prevent your full content from appearing in Bing Chat, while still allowing titles, snippets, and URLs, add a “NOCACHE” meta tag to your pages.
  2. To prevent any of your content from appearing in Bing Chat answers, and to block its use in training AI models, add a “NOARCHIVE” meta tag.
  3. To block AI model training but still allow titles, snippets, and URLs in Bing Chat, use both “NOCACHE” and “NOARCHIVE” tags. Bing will treat the combination as “NOCACHE”.
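As a rough sketch, here’s how those tags might look in a page’s <head> – double-check the exact values against Bing’s documentation before relying on them:

<!-- Titles, snippets, and URLs may appear in Bing Chat, but not full content -->
<meta name="robots" content="NOCACHE">

<!-- Keep content out of Bing Chat answers and out of AI model training -->
<meta name="robots" content="NOARCHIVE">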

You can find out more about these meta tags in Bing’s webmaster documentation.

Should you block AI bots from your website?

Now that you know how to block GPTBot, Bard, and Bing Chat AI, should you?

The answer isn’t clear-cut.

On one hand, there are clear ethical and legal concerns over your content (I.E. YOUR INTELLECTUAL PROPERTY) being used by AI without you being compensated for it.

Why shouldn’t you get a slice of the pie? And if not, why not shut up shop?

On the other hand, blocking AI crawlers from your website means your content can’t be used to answer questions in generative AI chat and search results. You will effectively shut the door to a technology that is projected to explode in popularity over the next decade, its value growing twentyfold by 2030.

It should also be said that the whole concept of the open web is for content to be freely accessible, be it to humans or bots.

Should you bite the hand that might feed you?

The truth is that when OpenAI published the details needed to block its web crawler, GPTBot, big companies like Amazon and CNN raced to do so.

If you’re a news publisher or sell valuable information, blocking bots makes sense to preserve your copyright. You don’t want your paywalled articles copied freely.

But many businesses rely on discoverability, and being excluded from training data – and chat outputs – could put you at a disadvantage.

For instance, say you sell solar-powered lights. If your site is blocked but competitors’ sites aren’t, AI results will only show alternatives, losing you revenue.

Worse, if biased or false data about your industry abounds elsewhere online, AI tools could inadvertently spread misconceptions by answering based solely on that flawed training. In this case, staying visible actually defends your position.

There’s also the advertising angle. Common Crawl, which indexes the web for public AI training sets, is also used by companies to identify websites for ad targeting. Opting out means losing this potential revenue stream.

But above all this is the fact that AI bots crawl, scrape, and take content without consent to train their models, and that leaves a bad taste in our mouths.

The jury is out. For now, we haven’t blocked any bots, but we will review our position as AI regulations are introduced in the future.

Jakk Ogden is the founder and CEO of Content Hero.

