The Best Of

Shawn Rubel
Nov 21, 2023, 8:03 PM
Forwarded from another channel:
What is the biggest site you’ve ever crawled (in terms of number of pages) and what is your tool of choice to crawl it?
Forwarded thread from another channel:
Eric Wu
Nov 21, 2023, 8:13 PM
It’s been a while, but the largest was close to 200,000,000 URLs for a single site
That tool was something I built myself because Screaming Frog would freeze
This was 8+ yrs ago, so SF is way better now, but I’d imagine you’d still have to use the server version
For anything over 500,000 URLs I’d probably look at a cloud solution if you can’t make your own
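For illustration, here is a minimal sketch of the kind of self-built crawler Eric describes (hypothetical, not his actual tool). The scaling trick it shows is keeping the frontier and seen-set on disk instead of in RAM, which is the usual reason desktop crawlers freeze at tens of millions of URLs. The site URL, file name, and batch size are all assumptions.

import asyncio
import sqlite3
from urllib.parse import urldefrag, urljoin

import aiohttp                   # assumed dependency: pip install aiohttp
from bs4 import BeautifulSoup    # assumed dependency: pip install beautifulsoup4

SITE = "https://www.example.com/"    # hypothetical site
DB = sqlite3.connect("frontier.db")  # frontier lives on disk, not in RAM
DB.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")

def enqueue(url: str) -> None:
    # PRIMARY KEY + INSERT OR IGNORE doubles as the dedupe / seen-set
    DB.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))

def next_batch(n: int = 100) -> list[str]:
    return [r[0] for r in DB.execute("SELECT url FROM urls WHERE done = 0 LIMIT ?", (n,))]

async def fetch(session: aiohttp.ClientSession, url: str) -> None:
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            html = await resp.text()
    except Exception:
        html = ""    # count failures as done so the crawl can't stall
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link, _ = urldefrag(urljoin(url, a["href"]))
        if link.startswith(SITE):    # stay on-site
            enqueue(link)
    DB.execute("UPDATE urls SET done = 1 WHERE url = ?", (url,))

async def crawl() -> None:
    enqueue(SITE)
    async with aiohttp.ClientSession() as session:
        while batch := next_batch():
            await asyncio.gather(*(fetch(session, u) for u in batch))
            DB.commit()

asyncio.run(crawl())

SQLite is a stand-in here; at 200M URLs a real crawler would shard the frontier across machines or use a proper queue, which is essentially what the cloud tools mentioned below do for you.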
Andrew Prince
Nov 21, 2023, 8:17 PM
A few dozen million using Botify with scheduled crawls that could take several days
Kyle Faber
Nov 21, 2023, 8:19 PM
60M (pales compared to @e).
JetOctopus (back when it was more affordable to scale)
Eric Wu
Nov 21, 2023, 8:23 PM
Honestly I rarely crawl more than 50,000 URLs these days
I haven’t had to do internal link mapping in ages
ah
Nov 21, 2023, 9:06 PM
I segment because anything over 2 million takes forever to scan and chances are the devs will mess something up in that time lol.
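The segmenting idea is easy to sketch: split a URL inventory (say, a sitemap export) by top-level directory and crawl each piece as its own smaller job. A toy example, with hypothetical file names:

from collections import defaultdict
from urllib.parse import urlparse

segments: dict[str, list[str]] = defaultdict(list)
with open("all_urls.txt") as f:          # hypothetical: one URL per line
    for url in filter(None, map(str.strip, f)):
        path = urlparse(url).path
        top = path.split("/")[1] if "/" in path else "root"
        segments[top or "root"].append(url)

for name, urls in segments.items():
    with open(f"segment_{name}.txt", "w") as out:   # one crawl job per file
        out.write("\n".join(urls))
    print(f"{name}: {len(urls)} URLs")

Each segment file can then be fed to the crawler as a list-mode crawl, so a dev change mid-crawl only invalidates one segment instead of the whole run.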
ah
Nov 21, 2023, 9:12 PM
When I was first starting to write front-end scripts, I found myself annoyed that the largest companies often had the worst source, i.e. Yahoo/MSN (2001)
Ash Nallawalla
Nov 21, 2023, 11:46 PM
My last employer had an estimated 10 quintillion URLs, thanks to a previous SEO's love for the long tail. It was estimated to take Google 350 billion years to crawl. I spent 2.5 years bringing it down to a manageable 5 million pages in the index. We used Screaming Frog to crawl samples, but it could do 9 million pages using the database storage mode.
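Ash's 350-billion-year figure holds up as a back-of-the-envelope calculation if you assume a steady crawl rate of roughly one URL per second against a single host (the assumption here, not a figure from the thread):

urls = 10 * 10**18                    # 10 quintillion URLs
seconds = 350e9 * 365.25 * 24 * 3600  # 350 billion years in seconds
print(urls / seconds)                 # ~0.9, i.e. about one URL per second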
Ramon Eijkemans
Nov 22, 2023, 3:38 AM
Millions, but usually it’s not necessary to crawl them all. SF database mode is quite powerful these days, and usually you see everything you need to see after the first million
Shawn Huber
Nov 22, 2023, 7:33 AM
The last place I worked was, like @e, hundreds of millions of URLs. The answer: never crawled the full site, both from a timing standpoint and from the cost to crawl.
Since it was a very template-based programmatic site, I had to sample-crawl a good cohort of URLs for issues; server logs and GSC data were my best friends. I also used to help monitor whether engineers rolled back any SEO updates or changes to the templates, so I could get them fixed before they became an issue.
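A sketch of the kind of template monitoring Shawn describes: fetch one representative URL per page template, fingerprint the SEO-relevant elements, and diff against a stored baseline so a rolled-back template change surfaces quickly. The URLs and the choice of elements are hypothetical, not his actual setup:

import hashlib
import json

import requests                  # assumed dependency
from bs4 import BeautifulSoup    # assumed dependency

REPRESENTATIVES = {              # one sample URL per page template (made up)
    "product": "https://www.example.com/p/12345",
    "category": "https://www.example.com/c/widgets",
}

def fingerprint(url: str) -> str:
    """Hash the SEO-relevant parts of a page so template changes show up as a diff."""
    soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
    parts = {
        "title": soup.title.string if soup.title else None,
        "canonical": (soup.find("link", rel="canonical") or {}).get("href"),
        "robots": (soup.find("meta", attrs={"name": "robots"}) or {}).get("content"),
        "h1": soup.h1.get_text(strip=True) if soup.h1 else None,
    }
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()

today = {name: fingerprint(url) for name, url in REPRESENTATIVES.items()}
# Persist `today`, re-run on a schedule, and alert on any template whose hash moved.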
Shawn Huber
Nov 22, 2023, 7:34 AM
I used both Screaming Frog and Sitebulb
Hevlin Costa
Nov 22, 2023, 8:47 AM
A couple million pages using Botify, but never crawling the entire site; we prioritize
Paul Baterina
Nov 22, 2023, 11:05 AM
I have a couple mill... and I am dying to bring Botify on. But business decisions are tough.
ah
Nov 22, 2023, 11:05 AM
you can’t go wrong. I am going to renew my license lol
Shawn Huber
Nov 22, 2023, 11:05 AM
We did get Quattr after a bit of time (less expensive than Botify) which did help a bit with piecing together some insights, but still wasn't able to crawl every URL
ah
Nov 22, 2023, 11:08 AM
Sitebulb has a server version which I am thinking of getting this year so I can work remotely more.
Abishek Rajendra
Nov 22, 2023, 11:08 AM
We used OnCrawl for a one-time ~200M crawl, and Botify for 35M every month.
Shawn Huber
Nov 22, 2023, 11:12 AM
I talked with Patrick and Gareth a few times about crawling such a large site - they were the ones who helped me see the value of sampled crawling versus a full-site crawl. It was before the server version came out, though - it seems like a solid option for crawling.
ah
Nov 22, 2023, 11:15 AM
They are wonderful people. I contribute ideas to expand their platform and they usually build them out. I am always surprised how much is in there.
Alex Wilson
Nov 22, 2023, 12:49 PM
Once you get past a couple hundred K, I question the sanity of it and immediately tell the client: I wasn't able to crawl it all; we need to pare this down.
ah
Nov 22, 2023, 1:10 PM
I once crawled a React site where 26k pages out of 500k had no headers. That, and their competitors were using embedded random-anchor-text widgets. It was rather blatant
Eric Wu
Nov 22, 2023, 9:14 PM
For those doing crawls in the millions each month, what questions are you trying to answer with the large crawls that a sample crawl wouldn’t answer?
Ian Cappelletti
Nov 23, 2023, 7:36 AM
Some sites are frankensteins built across like 30 apps and 20 CMSs. Unless you have a traveler's guide and collect notes from all of the teams, there's no way of knowing if your sample crawling is representative
Ian Cappelletti
Nov 23, 2023, 7:37 AM
Like, a single subfolder with 200 URLs might be operating on something completely different from the pages in the parent
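One way to hedge against that is stratified sampling: take up to N URLs from every subfolder rather than a flat random slice, so a 200-URL section running on its own stack still lands in the sample. A toy sketch, with a hypothetical input file and sample size:

import random
from collections import defaultdict
from urllib.parse import urlparse

PER_FOLDER = 50          # assumed per-subfolder sample size
by_folder: dict[str, list[str]] = defaultdict(list)
with open("all_urls.txt") as f:          # hypothetical: full URL inventory
    for url in filter(None, map(str.strip, f)):
        folder = "/".join(urlparse(url).path.split("/")[:2]) or "/"
        by_folder[folder].append(url)

sample = [u for urls in by_folder.values()
          for u in random.sample(urls, min(PER_FOLDER, len(urls)))]
print(f"{len(sample)} URLs sampled across {len(by_folder)} subfolders")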
Dan
Feb 2, 2024, 10:16 AM
270M with a custom crawler, but you’ll never be able to crawl ‘all the pages’ unless it’s continuously matched with log file data.
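A sketch of that reconciliation: diff the crawl export against the set of URLs seen in server logs, in both directions. File names, and the assumption that the log URLs are already extracted and normalized one per line, are hypothetical:

crawled = {line.strip() for line in open("crawl_export.txt")}      # hypothetical export
logged = {line.strip() for line in open("access_log_urls.txt")}    # URLs parsed from logs

print(f"Requested in logs but never crawled: {len(logged - crawled)}")
print(f"Crawled but never requested: {len(crawled - logged)}")

Run continuously, the first set is what keeps a sample or partial crawl honest: it surfaces the pages the crawler never discovered.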
