
  • Ridiculously huge SEO Crawl – help

    Posted by Smokinlizardbreath on November 24, 2022 at 8:41 pm

    We were tasked with doing a site assessment for a nonprofit, and when I ran it through Screaming Frog with just the vanilla default settings, it came up with close to 500,000 URLs. 98% of those turned out to be for a calendar on one page that is creating a unique URL for every day of every year from 1849 to 2400. I also noticed they don’t have an XML sitemap. Would having a sitemap get rid of all those excess URLs from the calendar? I can’t find a precedent for this and I’m not sure what to tell them. Can someone help with this? Thank you! (When I excluded the calendar from the crawl, they had about 5,500 URLs, which is closer to what I expected.)

  • Neither-Emu7933

    Guest
    November 24, 2022 at 9:34 pm

    A sitemap won’t stop those URLs from existing, but it will allow you to show Google which URLs you care about.

    To stop Google wasting crawl budget on them, though, I would block them in the robots.txt file. But make sure Google hasn’t indexed them first; if it has, put noindex on those URLs, wait until they are removed from the index, and only then block the crawler.
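    For example, if the date pages all sit under one path, the block is only a couple of lines in robots.txt. The patterns below are guesses; copy the real pattern from your Screaming Frog crawl:

        # Hypothetical patterns - match whatever the crawl actually shows
        User-agent: *
        Disallow: /calendar/    # dates generated as path segments
        Disallow: /*?date=      # dates generated as a query parameter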

  • slapbumpnroll

    Guest
    November 24, 2022 at 10:39 pm

    OK, a few things:

    1. You can check whether these URLs have been found and indexed by Google using the URL Inspection tool in Google Search Console.
    2. It’s unlikely they have, but if so, you can set a noindex tag on them (or block crawling in robots.txt); see the sketch after this list.
    3. If they really aren’t useful in any way, it might be worth asking the dev whether the calendar can stop generating unique URLs at all.
    4. Regarding the sitemap: yes, it’s good not to include them in it, but if Google has already found them, the sitemap alone won’t get them removed.
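    For point 2, the noindex tag is a single line in the <head> of each generated date page. A sketch, assuming you can edit the calendar’s shared page template:

        <!-- One line in the shared calendar template covers every generated date URL -->
        <meta name="robots" content="noindex">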

  • SEOPub

    Guest
    November 25, 2022 at 1:30 am

    A sitemap won’t fix this problem.

    You need to get the developer involved and have them figure out why all those pages are being created and how to stop it.

  • Flaneur_7508

    Guest
    November 25, 2022 at 6:38 am

    They probably won’t be indexed. I’m betting they’re canonicalized to the main calendar page. Just disallow them in robots.txt.
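    If that’s the case, the <head> of each date URL will point back at the main calendar page with something like this (URL made up for illustration):

        <link rel="canonical" href="https://example.org/calendar/">

    View the source of a few of the crawled date URLs to confirm before relying on it.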

  • Cultural-Recipe2404

    Guest
    November 25, 2022 at 7:17 am

    Before blocking in robots.txt, check the Pages report in Search Console. If the pages are indexed, put noindex nofollow tags on them, let Google recrawl the site so the pages drop from the index, and then block in robots.txt. If you skip this step, the pages will stay in the index, because a robots.txt block stops Google from recrawling them and ever seeing the noindex tag.

    If they aren’t indexed, you can go straight to blocking them in robots.txt 🙂
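    If editing the generated pages’ HTML is awkward, the same noindex nofollow can be sent as an HTTP header instead. A sketch for Apache with mod_headers enabled, assuming the date URLs live under /calendar/ (adjust the pattern to the real one):

        # Temporary: serve noindex,nofollow on every calendar date URL
        <If "%{REQUEST_URI} =~ m#^/calendar/#">
            Header set X-Robots-Tag "noindex, nofollow"
        </If>

    Once the Pages report shows them dropped from the index, remove the header and add the robots.txt disallow.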

  • ArtisZ

    Guest
    November 25, 2022 at 9:54 am

    If it’s one script that generates these calendar entries, then you ought to add a meta noindex, nofollow instruction to its output. If SEO is the only concern, this will get the job done.

    Otherwise, as others have said, the behaviour of that script ought to be changed.
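    If the date pages come out of one template function, the meta tag fix is tiny. A sketch in TypeScript, purely for illustration; the real script could be written in anything:

        // Hypothetical template helper for the generated date pages: one added
        // line covers every URL, however many dates the calendar produces.
        function renderDatePageHead(dateSlug: string): string {
          return [
            `<title>Events for ${dateSlug}</title>`,
            `<meta name="robots" content="noindex, nofollow">`, // the fix
          ].join("\n");
        }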
