

Noah
Jan 19, 2025, 11:09 AM
Forwarded thread from another channel:
Noah
Oct 29, 2024, 8:58 AM
Hey all, I looked at a site that uses Cloudflare and thought it was suffering from indexation issues (it has more than 4 million pages, so crawl budget is a concern of theirs).
I have a few questions:
• When I crawl the site at 0.5 URLs/second with 1 thread, I get 429s after 10 pages.
• I have in theory been whitelisted, but I still run into that problem.
• If Cloudflare is blocking a site from being crawled by Google, how could one tell without having access to Cloudflare? Does Cloudflare let you look at X number of pages and then block you?
• Once I get access, what settings should I check to see if it’s blocking bots (including Google)?
• And how can one look at the logs inside Cloudflare to see activity that would indicate Googlebot being blocked?
And as a follow-up, if anyone wants to write a thread on "I think my site is blocking Googlebot from crawling it, how do I fix it?", please jump in.
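For anyone reproducing the 429s with their own crawler, here is a minimal sketch (not from the thread) of how a crawler could decide how long to pause after a rate-limit response. It assumes the server may send a standard `Retry-After` header, either as seconds or as an HTTP date; the function name `retry_delay_seconds` and the default delays are hypothetical choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, headers, attempt,
                        base_delay=2.0, max_delay=300.0, now=None):
    """Return how many seconds to wait before retrying (0.0 if no wait is needed).

    `headers` is a plain dict of response headers; `attempt` is the number of
    retries already made. Defaults are illustrative, not recommendations.
    """
    if status not in (429, 503):
        return 0.0

    # Header names are case-insensitive, so normalise before looking up.
    lowered = {k.lower(): v for k, v in headers.items()}
    retry_after = lowered.get("retry-after")

    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():
            # Retry-After given as a number of seconds.
            return min(float(value), max_delay)
        try:
            when = parsedate_to_datetime(value)  # Retry-After as an HTTP date
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            now = now or datetime.now(timezone.utc)
            return min(max((when - now).total_seconds(), 0.0), max_delay)

    # No usable header: fall back to capped exponential backoff.
    return min(base_delay * (2 ** attempt), max_delay)
```

If the 429s never carry a `Retry-After` header, the fallback simply doubles the pause on each attempt, which at least keeps a test crawl from hammering the rule that is triggering.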
Jono Alderson
Oct 29, 2024, 9:00 AM
Firewall rules etc execute sequentially; you might be being blocked by a rule earlier in the stack, or a process/system higher up. There's a debugging tool somewhere(?) in the UI that does a trace overview that'll tell you!
Jono Alderson
Oct 29, 2024, 9:00 AM
Otherwise, the logs system can be filtered by IP etc
Noah
Oct 29, 2024, 9:01 AM
Thanks @Jono Alderson! Really appreciate you jumping in.
Mika Lepistö
Oct 29, 2024, 10:13 AM
That's a rate limiting response and I would expect Googlebot to adjust as it gets to know the site over time if it's even happening. (See below)
If you're hitting it, you may need to look at rules around that even though you might be whitelisted in other areas.
Cloudflare also has verified bots that it allows. Googlebot is one of them. There is an application process as well, if you ever need to build something yourself. This only matters if you don't have access to whitelist yourself aka you're building a tool for larger audiences.
I've personally never seen Googlebot be blocked. That would probably need to be something that's done (un)intentionally.
If it's a hard block rather than a rate limit, indexing requests in GSC (or the rich snippets tester, if you don't have access to GSC) would probably fail, but I'd check to see if those use the same user agent.
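Since user agents can be spoofed, it may help to confirm that a "Googlebot" hit in the logs really came from Google before drawing conclusions. A minimal sketch (not from the thread) of the reverse-then-forward DNS check that Google documents for verifying its crawlers; the function names are hypothetical, and `socket.gethostbyname_ex` only returns IPv4 addresses, so IPv6 hits would need a different forward lookup.

```python
import socket

# Domains Google documents for reverse DNS of its crawlers.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_google_crawler_hostname(hostname):
    """True if the reverse-DNS hostname sits under a Google crawler domain."""
    return hostname.lower().rstrip(".").endswith(GOOGLE_CRAWLER_SUFFIXES)


def verify_googlebot_ip(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm.

    The lookup functions are injectable so the logic can be tested offline.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not is_google_crawler_hostname(hostname):
            return False
        # The hostname must resolve back to the same IP we started with.
        return ip in forward_lookup(hostname)
    except OSError:
        # Covers socket.herror / socket.gaierror when a lookup fails.
        return False
```

Running this over the IPs that received 429s separates real Googlebot requests from crawlers that only borrow its user agent string.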
ah
Oct 29, 2024, 10:47 AM
Is it a regional issue with Cloudflare? They have been known to suffer from this at times. I’d reach out to them directly, assuming you already have direct access to Cloudflare.
Derek Perkins
Oct 29, 2024, 1:11 PM
Here is an HN thread about Cloudflare + Googlebot issues from a few months ago. There was a thread we started in this Slack group then, but it's not visible anymore.
Mika Lepistö
Oct 29, 2024, 1:18 PM
@John Mueller thoughts/insight, especially the HN article above where 429s may be reported as 500?
Dave Smart
Oct 29, 2024, 1:29 PM
I have seen 429s reported as 5xx. I always assumed that's because, practically, as far as Google is concerned, they have the same effect as a 503, so they are reported as such (like 410s are reported as 404). Personally, I wish they would report the actual status, for debugging purposes.
For logging, I am a huge fan of Logflare, which is usually a cheaper option than Cloudflare's logging and streams it all nicely to well-ordered, sensibly time-partitioned BigQuery tables.
A quick Looker report to get status codes coming from the Googlebot ASN is fairly simple, or I've even converted them to NCSA format with a quick Node script I wrote so you can run them through things like the Screaming Frog log tool.
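The Node script itself isn't shown in the thread. As a rough illustration of the idea only, here is a minimal Python sketch that turns one log record into an NCSA combined log line. The record keys (`ip`, `timestamp`, `method`, `path`, `status`, `bytes`, `referer`, `user_agent`) are hypothetical; the real column names depend on how the logs are stored.

```python
from datetime import datetime, timezone


def to_ncsa_combined(record):
    """Format one log record (a dict with hypothetical keys) as an NCSA combined line."""
    # Timestamp is assumed to be epoch seconds; rendered as [dd/Mon/yyyy:HH:MM:SS +0000].
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    time_str = ts.strftime("%d/%b/%Y:%H:%M:%S %z")

    protocol = record.get("protocol", "HTTP/1.1")
    request_line = f'{record["method"]} {record["path"]} {protocol}'

    # The format uses "-" for missing values and for a zero-byte body.
    size = record.get("bytes")
    size_str = str(size) if size else "-"
    referer = record.get("referer") or "-"
    user_agent = (record.get("user_agent") or "-").replace('"', '\\"')

    return (
        f'{record["ip"]} - - [{time_str}] "{request_line}" '
        f'{record["status"]} {size_str} "{referer}" "{user_agent}"'
    )
```

Writing one such line per record to a text file gives log analysis tools that expect access-log style input something they can read.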
Derek Perkins
Oct 29, 2024, 4:55 PM
I thought Cloudflare had shut down apps, where you couldn't add Logflare anymore...
Derek Perkins
Oct 29, 2024, 5:03 PM
> Googlebot treats the `429` status code as a signal that the server is overloaded, and it's considered a server error.
>
here are the docs where Google talks about treating 429 as a 5xx error
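When tallying status codes from logs, it can help to group them the way the quoted docs describe, with 429 counted alongside 5xx. A minimal sketch (not from the thread); the bucket names are my own labels, not Google's.

```python
from collections import Counter


def crawl_effect(status):
    """Bucket an HTTP status by its rough effect on crawling (labels are illustrative)."""
    # 429 is grouped with 5xx, per the documentation quoted above.
    if status == 429 or 500 <= status <= 599:
        return "server_error"
    if 200 <= status <= 299:
        return "success"
    if 300 <= status <= 399:
        return "redirect"
    if 400 <= status <= 499:
        return "client_error"
    return "other"


def summarize(statuses):
    """Count how many responses fall into each bucket."""
    return Counter(crawl_effect(s) for s in statuses)
```

Comparing the `server_error` count here with the 5xx figures in GSC is one way to check whether the errors Google reports are really rate limiting.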
Mika Lepistö
Oct 29, 2024, 5:27 PM
Let's also consider the 429 could be pass-through from origin.
Dave Smart
Oct 30, 2024, 12:40 AM
> I thought Cloudflare had shut down apps, where you couldn't add Logflare anymore...
>
You can add the script as a worker directly instead, now that apps are deprecated.
Dave Smart
Oct 30, 2024, 6:23 AM
To use Logflare as a worker: in the Logflare interface, open the source, click the setup icon, and scroll down the modal to Custom Cloudflare Worker. Take that script, create a new worker in Cloudflare, and make sure you set the route to capture everything you want it to. Perhaps you want it to cover just a subdomain, or everything, or whatever you like.
