Google’s Robots.txt Parser is Misbehaving

The newly-released open source robots.txt parser is not, as Google claims, the same as the production Googlebot parsing code. In addition, we have found cases where each of the official resources disagrees with the others. As a result, there is currently no way of knowing how the real Googlebot treats robots.txt instructions. Read on for example robots.txt files that are treated differently by Googlebot and by the open source parser.

Googlers: if you’re reading this, please help us clarify for the industry how Googlebot really interprets robots.txt.

Google recently released an open source robots.txt parser that they claimed is “production code”. This was very much needed because, as they said in the announcement blog post, “for 25 years, the Robots Exclusion Protocol (REP) was only a de-facto standard”.

Before the release, the substantial documentation from Google, together with Google’s online checking tool, seemed to give us a reasonable way of knowing how Googlebot would treat robots.txt directives in the wild.

Since the release of the open source parser, we have found situations where each of the three sources (documentation, online checker, open source parser) behaves differently from the other two (see below for more on each of these situations):

Table of misbehaving robots.txt parser

This might all be just about OK if, as claimed, the open source parser is the authoritative answer now. I guess we could rely on the community to build correct documentation from the authoritative source, and build tools from the open source code. The online checker is part of the old Google Search Console and is clearly not being actively maintained, and documentation can be wrong or fall out of date. But to change the rules without an announcement in an area of extreme importance is dangerous for Google, in my opinion.

The existence of robots exclusion protocols is central to their ability to cache the entire public web without copyright concerns. In most situations, this kind of mass copying would require opt-in permission – it’s only the public interest in the existence of web search engines that allows them to work on an assumption of default permission with an opt-out. That opt-out is crucial, however, and Google is in very dangerous territory if they are not respecting robots.txt directives.

It gets worse though. The open source parser is not the same as the production Googlebot robots.txt parsing code. Specifically, in the third case above, where the open source parser disagrees with the documentation and with the online checker, real Googlebot behaves as we would previously have expected (in other words, it agrees with the documentation and online checker, and disagrees with the open source parser). You can read more below about the specifics.

The open source parser is missing Google-specific features

Even if you don’t know C++ (as I don’t), you can see from the comments on the code that there is a range of places where the open source parser contains Google-specific features or differences from the specification they are trying to create. One example, line 330 of robots.cc, is one of a number of changes to make Googlebot more forgiving, in this case to work even if the colon is missed from a “User-agent:” statement.
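To make the “forgiving” behaviour concrete, here is a minimal Python sketch of a line splitter that accepts whitespace in place of a missing colon. It is not a port of robots.cc. The function name `split_directive` and its exact edge-case handling are my own assumptions for illustration.

```python
def split_directive(line):
    """Split one robots.txt line into (key, value), or return None.

    Hypothetical helper: tolerates a missing colon by falling back to
    the first run of whitespace as the separator.
    """
    # Drop trailing comments and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    if ":" in line:
        key, value = line.split(":", 1)
    else:
        # Forgiving path: "User-agent googlebot" with no colon.
        parts = line.split(None, 1)
        if len(parts) != 2:
            return None
        key, value = parts
    return key.strip().lower(), value.strip()
```

With this sketch, `User-agent: googlebot` and `User-agent googlebot` both yield the same key and value, which is the kind of leniency the code comment describes.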

Given these enhancements, it’s reasonable to believe that Google has, in fact, open-sourced their production parsing code rather than a sanitised specification-compliant version that they extend for their own purposes. In addition, they have said officially that they have retired a number of enhancements that are not supported by the draft specification.

Take the code, their official announcements, and additional statements such as Gary Illyes confirming at Pubcon that it’s production code, and we might think it reasonable to believe Google on this occasion:

The parser that Google open sourced for robots.txt in their GitHub is the actual production code. @methode #Pubcon

— Patrick Stox (@patrickstox) October 8, 2019

That would be a mistake.

If you use the open source tool to build tests for your robots.txt file, you could easily find yourself getting incorrect results. The biggest problem we have found so far is the way that it treats googlebot-image and googlebot-news directives (and rules targeting other sub-googlebots as well as other non-googlebot bots from Google like Adsbot) differently to the way the real Googlebot does.

Worked example with googlebot-image

In the absence of directives specifically targeting googlebot-image, the image bot is supposed to follow regular Googlebot directives. This is what the documentation says. It’s how the old online checker works. And it’s what happens in the wild. But it’s not how the open source parser behaves:

googlebot-image misbehaving
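As a rough illustration of the fallback described above, the following Python sketch chooses which user-agent group a crawler should obey. The names `parse_groups`, `select_group` and `DOCUMENTED_FALLBACKS` are hypothetical, and the fallback table contains only the two sub-googlebots named in this post. It models the documented behaviour, not the internals of any Google code.

```python
# Assumed fallback table: sub-googlebots that inherit regular Googlebot rules.
DOCUMENTED_FALLBACKS = {
    "googlebot-image": ("googlebot",),
    "googlebot-news": ("googlebot",),
}


def parse_groups(robots_txt):
    """Return {user_agent: [(directive, path), ...]} for a simple robots.txt."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif key in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((key, value))
            last_was_agent = False
    return groups


def select_group(groups, crawler, fallbacks=DOCUMENTED_FALLBACKS):
    """Pick the group name for a crawler: exact match, then fallbacks, then '*'."""
    crawler = crawler.lower()
    candidates = (crawler, *fallbacks.get(crawler, ()), "*")
    for candidate in candidates:
        if candidate in groups:
            return candidate
    return None
```

For a file with a `*` group and a `googlebot` group but no `googlebot-image` group, `select_group(groups, "googlebot-image")` returns `"googlebot"`. Passing `fallbacks={}` returns `"*"` instead, which is what a parser with no notion of the fallback would choose.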

Unfortunately, we can’t fall back on either the documentation or the old online checker as they both have errors too:

The online checker has errors

googlebot/1.2 is equivalent to googlebot user-agent

Now, it’s quite hard to work out exactly what this part of the documentation means. Reviewing the specification and parser, it seems to mean that only letters, underscores, and hyphens are allowed in user-agents in robots.txt directives, and that anything after the first disallowed character is ignored.

But, it is easy to understand the example – that googlebot/1.2 should be treated as equivalent to googlebot.

That’s what the documentation says. It’s also how it’s treated by the new open source parser (and, I believe, how the real Googlebot works). But it’s not how the online robots.txt checker works:

Google Search Console robots.txt checker is wrong
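My reading of that rule can be sketched in a few lines of Python. This assumes the interpretation given above (keep the leading run of letters, underscores and hyphens, and ignore the rest). The function name `product_token` is made up for this example.

```python
import re

# Leading run of letters, underscores and hyphens (assumed allowed characters).
_TOKEN = re.compile(r"[A-Za-z_-]+")


def product_token(user_agent_value):
    """Reduce a robots.txt User-agent value to the token used for matching."""
    value = user_agent_value.strip()
    if value == "*":
        # The wildcard group is kept as-is.
        return "*"
    match = _TOKEN.match(value)
    return match.group(0).lower() if match else ""
```

Under this interpretation, `product_token("googlebot/1.2")` and `product_token("googlebot")` give the same token, so the two user-agent lines would address the same group.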

The documentation differs from reality too

Unfortunately, we can’t even try to build our own parser after reading the documentation carefully because there are places where it differs from the online checker and the new open source parser (and, I believe, production Googlebot).

For example:

user agent matches the first most specific rule

There are some examples in the documentation to make it clear that the “most-specific” part refers to the fact that if your robots.txt file disallows /foo, but explicitly allows /foo/bar, then /foo/bar (and anything contained in that, such as /foo/bar/baz) will be allowed.

But note the “first” in there. This means, to my understanding, that if we allow and disallow the exact same path, then the first directive should be the one that is obeyed:

Search Console is correct in this instance

But it turns out the order doesn’t matter:

Search Console doesn't match the documentation
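To illustrate a precedence rule that is independent of order, here is a small Python sketch of “most specific rule wins” using plain prefix matching (no wildcard support). The function name `is_allowed` is hypothetical. The `tie_break` parameter is my own assumption, since this post only establishes that order does not matter, not which directive wins when an allow and a disallow target the exact same path.

```python
def is_allowed(rules, path, tie_break="allow"):
    """Decide whether path may be crawled under [(directive, pattern), ...].

    The longest matching pattern wins. If an allow and a disallow of the
    same length both match, tie_break ("allow" or "disallow") decides.
    """
    best_len = -1
    verdicts = set()
    for directive, pattern in rules:
        # An empty pattern matches nothing; skip rules that do not apply.
        if not pattern or not path.startswith(pattern):
            continue
        if len(pattern) > best_len:
            best_len = len(pattern)
            verdicts = {directive}
        elif len(pattern) == best_len:
            verdicts.add(directive)
    if not verdicts:
        return True  # No rule matched: crawling is allowed by default.
    if len(verdicts) == 1:
        return "allow" in verdicts
    return tie_break == "allow"
```

Because the function only compares pattern lengths, reversing the list of rules never changes the result, which is consistent with what the checker does in the screenshots.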

In summary, we have no way of knowing what real Googlebot will do in every situation

All the sources disagree

And we know that real Googlebot can’t agree with all of them (and have tested one of these areas in the wild):

And actual Googlebot behaves differently to all of them

What should happen now?

Well, we (and you) need to do some more testing to figure out how Googlebot behaves in the real world. But what I’m hoping for most is some clarity, and some changes, from Google. So if you’re reading this, googlers:

Please give us the real story on the differences between the newly-open-sourced parser and production Googlebot code
Given the proven differences between the old Search Console online checker and real Googlebot behaviour, please remove confusion by deprecating the old checker and building a compliant new one into the new Search Console
