Allow GPTBot to Crawl Your Site


Editor’s Note: This post was originally published in August 2023, shortly after OpenAI announced GPTBot. Our core instinct in response to this update was: don’t block your way out of visibility, and we still support that. However, things have changed significantly since then.

We’ve added 2026 updates throughout this post where our thinking has been refined, corrected, or made more specific, plus a full update at the end. We have preserved the original post to show what’s changed.


A couple of weeks ago, OpenAI, the creator of ChatGPT, released information about its web crawler, GPTBot.

OpenAI states: 

“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.

Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”

Webmasters can manage GPTBot’s access to their websites through robots.txt, and have the option to block it entirely. Many sites have moved quickly to keep GPTBot’s little robot hands off of their data. But should they? 
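For readers who have not seen what that block looks like in practice, here is a minimal sketch using Python’s standard-library robots.txt parser. The robots.txt content and the example.com URL are hypothetical, and the parser only shows how a rule-respecting crawler would read the file; it says nothing about how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, "https://example.com/blog/"))
```

Under these rules the GPTBot check comes back False and the Googlebot check comes back True. Keep in mind that robots.txt is a request, not an enforcement mechanism: it only constrains crawlers that choose to honor it.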


2026 Update: The question has changed. When this post was originally written, GPTBot was the bot. By 2026, the major AI vendors each run several distinct crawlers with separate purposes and separate controls.

OpenAI alone operates GPTBot (model training), OAI-SearchBot (AI search indexing), and ChatGPT-User (live fetches on behalf of a real person who asked a question).

Anthropic runs ClaudeBot, Claude-SearchBot, and Claude-User. Each can be controlled independently. The “should I block GPTBot?” question has been replaced by a more useful one: which bots, accessing which content, verified how? More on this below.


Let’s start with the concerns.

There are plenty of understandable concerns about allowing GPTBot unrestricted access to your site.

Most commonly, people are concerned about copyright infringement – and rightfully so: OpenAI didn’t exactly lead with trust when they began collecting data and training their models without any of us knowing about it in 2021.

Sarah Silverman and others are actively suing OpenAI for leveraging their work in its models and not crediting or compensating them.


2026 Update: The legal risk is real, but it concentrates in specific places. The copyright concerns from 2023 haven’t gone away. If anything, questions of data ownership and downstream use are more unsettled now than they were then.

What’s become clearer is where the risk concentrates: pricing data, inventory feeds, and proprietary content carry very different legal and competitive exposure than public marketing copy or editorial content.

A blanket “allow everything” posture trades away protections you didn’t need to give up. A blanket “block everything” posture costs you visibility you can’t afford to lose. The useful conversation is about content type, not site-wide access.


But I have to admit that this entire conversation feels like deja vu, circa 2014 when Google first began placing featured snippets in SERPs.

Many in the SEO community were upset by the idea that Google would “steal” content and feature it directly on the internet’s homepage. Publishers worried that people would find the answers they sought directly on Google, and never click through to learn more from the site itself. 

That did happen in some cases, but for the most part, featured snippets continued to drive traffic to websites.

The cost to advertisers of traffic lost to AI is very real, and I don’t mean to diminish that. But isn’t the entire purpose of a search engine to provide people with answers to their questions in the easiest way possible?

It’s not exactly an apples-to-apples comparison because ChatGPT doesn’t cite its sources (yet), but it’s not hard to squint and see the similarities.

Now let’s talk about the upsides.

Brands risk harming their reputation if they opt out of participating in LLMs.

Yes, there is certainly a risk to having data and information misrepresented or miscredited within ChatGPT, but I would argue that the risk of not being present where your audience is searching is far greater.

Imagine a website blocking Googlebot in the early 2000s. That’s a decision any business owner today would have come to regret!


2026 Update: We no longer have to imagine. We have the data.

In May 2026, we analyzed our own website traffic at seerinteractive.com.

 

The machine visitors are overwhelmingly there to find and surface the site: indexing plus live retrieval is about 87% of all agent activity, training only 13%.

The takeaway is not to block broadly; it is to welcome the declared crawlers and reserve blocking for the undeclared and malicious.

We’ve seen what happens when brands get this wrong.

In our Agentic Commerce audits, we’ve scored major national brands in the low single digits out of 100 for one reason: their pages were blocked to crawlers.

Strong brand, real demand, and a total absence from AI answers, because the front door was bolted shut. That is an access problem, and it is fixable in an afternoon once it is spotted.


The push and pull between publishers and the general public has always been there, and will continue to persist. I encourage all publishers, SEOs, and webmasters to step out of their respective roles and put on the hat of an everyday person using the internet: at the end of the day, people just want answers and information as quickly and easily as possible.

And our job as marketers is to remove friction between those users and their answers.

ChatGPT provides an easy solution to getting those answers and that information. It’s a matter of time before people start to take full advantage of it. 

Factual vs. fluid searches

Dr. Pete shared some thoughts on the pros and cons of AI and search engines at August 2023’s MozCon. 

He argues that search engines still provide the best answers for factual searches today, but AI excels at creatively solving problems when people bring it pieces of information they’re looking to string together.

He calls these “fluid” searches – instances where people kind of know what they want, but need help gathering information to draw conclusions.

[Image: fluid searches (sourced from Dr. Pete’s MozCon slides)]

I published a post about six months ago saying that ChatGPT won’t kill Google. I still stand behind that statement, especially with SGE’s recent developments. It took Google longer than I thought to catch up with their response to ChatGPT, but not surprisingly, they’re right back in the game. 

ChatGPT’s usage is slipping, but I still believe it will be a staple in many people’s tech toolkit. I don’t know if the general public will ever shift to it as their primary search engine, but it will only become more useful as it begins to ingest data in closer to real time through its web crawler.


2026 Update: The “slipping” concern aged out quickly. The 2023 dip in ChatGPT usage looks like a blip from here. AI-powered answers have become a primary surface for product discovery, brand evaluation, and purchase decisions.

The factual vs. fluid distinction Dr. Pete drew at MozCon still holds intellectually, but in practice, AI has expanded into both categories. The more important question for marketers now isn’t whether people are using AI search, because they are. It’s whether your brand shows up when they do.


My recommendation? Don’t fight it. 

I, for one, welcome our new robot overlords.

In the marketing and search industry, we see time and time again that search engines evolve by removing more and more friction. That’s what we’re seeing here – creating better experiences for users.

My general rule of thumb: if you don’t want search engines (or now, LLMs like GPT) to use your information, don’t put it publicly on the internet.


2026 Update: The “don’t fight it” instinct is still right. The rule of thumb needs a revision. The spirit of “don’t block your way out of visibility” has only become more true. But “if it’s public, it’s fair game” is too blunt a heuristic for 2026.

Here’s the more useful version: allow the declared and verifiable, block the undeclared and evasive, and gate what’s genuinely sensitive by content type, not site-wide.

Your public marketing copy, editorial content, and product descriptions?

Open to the legitimate crawlers.

Real-time pricing, inventory feeds, account data?

Those warrant a real data agreement before access is granted.

And importantly: a bot that claims to be GPTBot isn’t necessarily GPTBot. A user-agent string is a claim anyone can type. Verification matters, and it has to go deeper than the label.


What The Last Three Years Have Taught Us (2026 Update)

If you’re still asking “Bots: Yes or No?”, you’re asking the wrong question.

Every page you own is now visited by crawlers that train models, power AI search answers, fetch pages on behalf of a real person, or quietly scrape you for someone else’s benefit. Block them all and you disappear from the AI answers your customers increasingly trust. Open the doors to everyone and you take on cost, competitive, and legal risk you never agreed to. The work is in telling them apart.

After running bot access audits across enterprise clients, here’s where we’ve landed.

Replace one Yes/No with three better questions.

A blanket allow or a blanket block both leave value on the table. The useful conversation breaks the decision into three parts.

Which bots? Not all bots want the same thing. Some train models, some power live AI answers, some fetch a page for a person who asked, and some are scrapers wearing a disguise. Purpose decides value.

Which content? Your marketing copy, your reviews, your pricing, and your inventory carry very different risk. Exposure should be decided page type by page type, not site-wide.

Verified how? A bot that says it is GPTBot is not necessarily GPTBot. Trust without verification is just an open door with a friendly sign on it.


Sort by purpose: the same vendor runs different bots for different jobs.

The single most useful fact here: the major AI vendors split their crawling into separate, independently controllable bots, so you can decide by purpose. Our guidance is to welcome the declared, legitimate crawlers that make you findable, let approved agents transact, and draw the hard line at the undeclared and the malicious.
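To make “decide by purpose” concrete, here is a rough sketch that turns a purpose-level decision into per-bot robots.txt groups. The purpose labels for the OpenAI bots come from this post; the labels for the Anthropic bots are our assumption based on the parallel naming, and the function name and gated paths are hypothetical. Treat it as an illustration, not a recommended policy.

```python
# Purpose labels: the OpenAI entries follow the descriptions in this post.
# The Anthropic entries are an ASSUMPTION inferred from the parallel naming.
BOT_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user_fetch",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user_fetch",
}


def build_robots_txt(allowed_purposes: set, gated_paths: list) -> str:
    """Emit one robots.txt group per bot, decided by the bot's purpose."""
    groups = []
    for bot, purpose in BOT_PURPOSE.items():
        lines = [f"User-agent: {bot}"]
        if purpose in allowed_purposes:
            # Specific Disallow rules go first so first-match parsers honor them.
            lines += [f"Disallow: {path}" for path in gated_paths]
            lines.append("Allow: /")
        else:
            lines.append("Disallow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Hypothetical choice: allow every declared purpose, gate the inventory path.
    print(build_robots_txt({"training", "search", "user_fetch"}, ["/inventory/"]))
```

The design choice worth noticing is that the decision is keyed on purpose, not on vendor, so a new bot only needs a purpose label to inherit the right rule.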

The verdicts below are our default starting posture, not a rule.

 

Put the two axes together and the decision stops being abstract. Read down for the kind of bot, across for the kind of page. This is a starting posture to adapt to your goals, your platform, and your risk tolerance, not a fixed policy.


Which content gets exposed?

Expose first, gate second. Open marketing, product descriptions, and reviews to the declared, legitimate crawlers right away, because that is pure visibility upside, and let approved agents transact through declared protocols.

Hold price and inventory back behind a platform or data agreement, because that is where competitive and legal exposure concentrates. Phase access in deliberately rather than flipping the whole site open at once.
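The “expose first, gate second” rule can be written down as a small lookup. This is a minimal sketch: the content-type labels, the function name, and the three outcome strings are hypothetical, and a real policy would live in your CDN or bot-management layer, not in application code.

```python
# Content buckets as described in this post; the label strings are hypothetical.
OPEN_CONTENT = {"marketing", "product_description", "review"}
GATED_CONTENT = {"price", "inventory", "account"}


def access_decision(bot_is_verified: bool, content_type: str,
                    has_data_agreement: bool = False) -> str:
    """Return 'allow', 'deny', or 'needs_review' for one bot and one page type."""
    if not bot_is_verified:
        # Undeclared or unverifiable bots are blocked regardless of content.
        return "deny"
    if content_type in OPEN_CONTENT:
        return "allow"
    if content_type in GATED_CONTENT:
        # Sensitive data opens up only once an agreement is in place.
        return "allow" if has_data_agreement else "deny"
    # Unknown page types get a human decision, not a silent default.
    return "needs_review"


if __name__ == "__main__":
    print(access_decision(True, "review"))
    print(access_decision(True, "price"))
    print(access_decision(True, "price", has_data_agreement=True))
```

Phasing access in then amounts to moving a content type from one bucket to the other, or recording an agreement, one step at a time.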


Sort by legitimacy: a user-agent is a claim, not proof.

Sorting by purpose only works if the bot is honestly who it says it is. Identity sits on a ladder of confidence. 

The higher you climb, the harder a bot is to fake, and a well-behaved crawler should be transparent: an honest name, a way to verify it, and a contact if something breaks.

  1. User-Agent String: The name the bot announces itself as. Useful for sorting, trivially easy to forge. Necessary, never sufficient.
  2. Published IP Ranges & Reverse DNS: Confirm the request came from the vendor’s own machines. Works well where the vendor publishes their ranges, for example OpenAI. Note the gap: some vendors, including Anthropic, do not publish ranges and use shared cloud IPs, so this check is vendor-dependent.
  3. Managed Bot Verification: Infrastructure platforms maintain verified-bot lists and score traffic by behavior, catching impersonators that pass the name check. This is where most teams get leverage without building it themselves.
  4. Web Bot Auth (Emerging): A cryptographic signature that lets a bot prove its identity rather than merely assert it. Early, but the clear direction of travel for trustworthy automation.
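The second rung of the ladder is the easiest to sketch. The example below checks a claimed bot name against a table of published IP ranges. The ranges shown are placeholders from the reserved documentation blocks, not any vendor’s real ranges, and the function name and result labels are hypothetical; in practice you would load the ranges each vendor publishes and refresh them regularly.

```python
import ipaddress

# PLACEHOLDER ranges taken from the reserved documentation blocks.
# They are NOT OpenAI's real ranges; load the vendor's published list instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def check_claim(user_agent: str, remote_ip: str) -> str:
    """Compare the bot name a request claims against published IP ranges."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for bot, cidrs in PUBLISHED_RANGES.items():
        if bot.lower() in ua:
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return "ip_match" if in_range else "ip_mismatch"
    # No published ranges for this name: this rung cannot settle the question.
    return "no_published_ranges"


if __name__ == "__main__":
    print(check_claim("Mozilla/5.0 (compatible; GPTBot/1.0)", "192.0.2.44"))
    print(check_claim("Mozilla/5.0 (compatible; GPTBot/1.0)", "203.0.113.9"))
    print(check_claim("Mozilla/5.0 (compatible; ClaudeBot/1.0)", "203.0.113.9"))
```

The third outcome is the vendor-dependent gap described above: when no ranges are published, the check has nothing to compare against, and you need a higher rung to decide.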

In August 2025, Cloudflare reported that when a declared AI crawler was blocked by robots.txt, it observed traffic switching to a generic browser identity, rotating IP addresses and networks, and continuing to pull content, at a scale of millions of requests a day.

Cloudflare removed the vendor from its verified-bot list; the vendor disputed the findings. Whoever is right, the lesson holds: a policy enforced only on a user-agent name is a policy that the least trustworthy actors can simply step around.

Verification has to go deeper than the label.


Our recommended posture: controlled, phased, verified access.

  1. Measure before you decide.
    Read the crawler logs first. Establish who is actually visiting and whether today’s policy matches today’s intent before changing a single line.
  2. Allow the declared, legitimate crawlers.
    Search, retrieval, and training all make you more findable and more accurately represented in AI answers. Treat blocking a declared crawler as the exception that needs a reason.
  3. Let approved agents transact.
    Open the door to agent-driven commerce deliberately, through approved and verifiable protocols, and in phases rather than all at once.
  4. Expose by content type, gate the sensitive, verify identity.
    Open marketing, descriptions, and reviews. Hold price, inventory, and account data behind a real agreement. Enforce policy on identity that can be checked, not on a user-agent name anyone can type.
  5. Allowlist your own and your partners’ tools.
    The SEO and site-monitoring crawlers you and your agency depend on (Screaming Frog being the obvious one) look like undeclared bots to a blocking rule. Add their IP addresses to an allowlist before you switch on undeclared-bot blocking, so the policy does not lock out the people working on your behalf, us included.
  6. Nuance is key.
    Bots can disguise themselves as other bots, so before blocking anything, make sure you’re not cutting off legitimate crawlers. Monitor your traffic sources closely for unexpected dips; they’re often the first sign something legitimate got caught in the crossfire. When in doubt, err on the side of caution. Monitor, monitor, monitor.
  7. Put one owner on it.
    This decision lives at the seam between IT, merchandising, and legal. Assign someone to own it.
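Step 1 is the one most teams can start on today. Here is a rough sketch of reading crawler activity out of an access log and splitting it by purpose. It assumes the common “combined” log format, where the user agent is the last quoted field. The sample lines and user-agent strings are simplified and made up for illustration, and the counts are based on unverified user-agent names, so treat the output as a first look, not a final answer.

```python
import re
from collections import Counter

# Purposes for OpenAI's bots as described in this post.
BOT_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user_fetch",
}

# In the combined log format the user agent is the final quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')

# Made-up, simplified sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/May/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [01/May/2026:10:00:01 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.12 - - [01/May/2026:10:00:02 +0000] "GET /faq/ HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.13 - - [01/May/2026:10:00:03 +0000] "GET /faq/ HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.14 - - [01/May/2026:10:00:04 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"',
]


def classify(user_agent: str):
    """Return the purpose of a known bot, or None for everything else."""
    ua = user_agent.lower()
    for token, purpose in BOT_PURPOSE.items():
        if token.lower() in ua:
            return purpose
    return None


def purpose_shares(log_lines) -> dict:
    """Share of known-bot requests by purpose (human traffic is ignored)."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        purpose = classify(match.group(1)) if match else None
        if purpose:
            counts[purpose] += 1
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


if __name__ == "__main__":
    print(purpose_shares(SAMPLE_LOG))
```

Because a user-agent name is only a claim, a sensible next pass is to run the same breakdown on requests that also pass an identity check, and compare the two.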

We do not recommend opening real-time pricing or inventory to general bot access without flagging the infrastructure, competitive, and legal risks first.

The goal is to be deliberate, so you capture the AI visibility you want without taking on the exposure you did not agree to.

For questions about where your site stands, reach out to our team.




