
The Best Of


Kyle Faber
Dec 7, 2024, 8:32 AM
Great technical / beginner question and answers.
Forwarded thread from another channel:
JaQueen
Dec 5, 2024, 8:02 PM
*New tech SEO intern here - how do I fix this?*
I'm auditing 2 domains owned by the same nonprofit, and both are missing a *robots.txt file*.
One domain is on WordPress using the All in One SEO plugin - but it is not serving a robots.txt to search engines.
The second domain is built on Ruby on Rails. For this one, the developer intern asked me, "What content do I put into the robots.txt?" And of course my answer is, "I don't know."
*Can you help me answer the question - What content do I put into the robots.txt file?*
For the WordPress issue, I hear I need to contact the host for help. But for the Ruby on Rails domain ... PLEASE HELP.
What I've done so far:
So, of course I asked ChatGPT, lol. But the issue is, this domain is a database-driven website, so I'm not too sure if this is all that I need. Your help would be greatly appreciated.
Below, I've added a picture with the code I generated with ChatGPT. Please tell me what you think I'm missing and what needs to change! I can DM you the company website if needed!
Tree
Dec 5, 2024, 8:05 PM
The bare minimum could be having the file with no distinct directives - just crawl everything - e.g. the top 3 lines of your screenshot.*
As you learn the backend of your site and its problem areas, you can update it accordingly.
*I'll also shout out the newish robots.txt report in GSC, as well as the Crawl stats report, to start getting a better understanding of what's being crawled.
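For reference, a bare-minimum robots.txt along those lines typically looks something like this (a generic sketch of the idea, not the exact contents of the screenshot):

```
# Allow all bots to crawl everything
User-agent: *
Disallow:
```

An empty Disallow value blocks nothing, so every compliant crawler is free to fetch every URL.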
Kyle Faber
Dec 5, 2024, 8:06 PM
Welcome @JaQueen!
No need to add a robots.txt if you don’t need it - or, if you do add one, you only need rules that `allow` or `disallow` crawling of certain paths or path patterns.
So the answer to your question is: nothing unless you need it, and it’s wholly dependent on each site, how they’re built, where you want bots accessing (or not), etc.
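(As an illustration of what such rules look like once a site does need them - the paths below are made up, not taken from either of the domains in this thread:)

```
User-agent: *
# Block an admin area from all crawlers
Disallow: /admin/
# Block parameterized URLs, e.g. internal search results
Disallow: /*?query=
# Re-open a specific path beneath the blocked directory
Allow: /admin/help/
```

Google applies the most specific matching rule, so the longer Allow path wins over the broader Disallow above it.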
JaQueen
Dec 5, 2024, 8:10 PM
Thank you, @Tree (I love your name!)! When you say top 3 lines, do you mean these?:
JaQueen
Dec 5, 2024, 8:15 PM
Thank you for the welcome and the help, @Kyle Faber! I appreciate you bringing this up because this is where I'm confused. So, long story short: I just completed a tech SEO cohort and presented this info with these two errors, and was told, after checking the domains in GSC, that there was no robots.txt and that this could be causing the errors below. With this info, would you still recommend that I only stick with the top allow and disallow lines (plus a disallow for the search results pages, since there are thousands that could be generated from the database queries)? I'm so happy to have your advice!
Tree
Dec 5, 2024, 8:18 PM
The 1st set of 3 lines.*
The first line is just a comment (all comments start with #), not required.
The next 2 lines specify who the rules apply to - all search engine bots - and the last line is what not to crawl. Since it's empty, it means crawl everything.
JaQueen
Dec 5, 2024, 8:18 PM
@Tree! Ah thank you! and wow, I had no idea I could just add that and keep it moving!
Kyle Faber
Dec 5, 2024, 8:18 PM
The errors in your screenshot do not relate to robots.txt!
Kyle Faber
Dec 5, 2024, 8:20 PM
The first one is URLs with a noindex tag on them, which may or may not be a problem, and would need further investigation.
The second is URLs that didn’t respond. That is a bigger problem and would need to be investigated to see where and why that may be occurring.
JaQueen
Dec 5, 2024, 8:24 PM
Thank you, @Kyle Faber, I truly appreciate this. I was told that not having the robots.txt was a huge issue while giving this presentation; we stopped the presentation, looked into the domain URLs, and when no robots.txt showed up for either, my coach told me to get that handled before trying to proceed with the fixes. From what you're saying, we can continue to resolve the other issues with or without the robots.txt? And thank you (sorry for my long explanation, lol) so very much. You rock!
Kyle Faber
Dec 5, 2024, 8:30 PM
Correct.
The three things are independent of each other.
Not having robots.txt does not cause issues by itself, but having one may solve problems if you know what rules need to be in it. This is all dependent on the site and would not be something you can effectively use GPT to help fill in.
This also has no impact on the noindex issue.
For that, there’s a noindex tag either in a meta robots tag or an x-robots-tag in the HTTP headers on 923 URLs. This may or may not be a problem. Sometimes you noindex URLs intentionally, as you don’t want them in the index.
The result of fixing it would be the URLs become indexable, but that may not be a good thing.
Again, it depends on the site.
As for the response errors, that means the pages did not respond to the request. That’s an issue because those pages are wholly inaccessible. Fixing that will mean those pages will hopefully be properly loading for users and bots, which allows them to possibly be crawled and indexed.
You can work to address each independently of the other, and you don’t need to address the robots.txt first before the other two issues you were originally investigating.
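For anyone newer to this, the two places a noindex directive can live (the meta robots tag and the x-robots-tag header Kyle mentions above) generally look like this - a generic illustration, not pulled from the site being audited:

```
<!-- Option 1: a meta robots tag in the page's <head> -->
<meta name="robots" content="noindex">

<!-- Option 2: the same directive sent as an HTTP response header -->
X-Robots-Tag: noindex
```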
Kyle Faber
Dec 5, 2024, 8:38 PM
I will say, it’s hard to give a comprehensive answer without more information, so I recommend taking my feedback here as high level vs really specific. Details almost always matter :)
JaQueen
Dec 5, 2024, 8:44 PM
@Kyle Faber this is incredibly helpful. Sheesh I just learned so much. Thanks for this, truly!
Tony Castillo
Dec 6, 2024, 6:26 AM
@Kyle Faber - Wanted to ask a follow-up question for my own knowledge as well on this thread ...
"Not having robots.txt does not cause issues by itself ..."
I had the perception that search engines would treat a site with a missing robots.txt file as having a 'Disallow' directive for the entire site, since no rules could be found. Is that no longer the case? Looking to update my knowledge on the matter, since I've seen this behavior occur earlier in the year, but hoping to learn what's new if the rules have changed.
Kyle Faber
Dec 6, 2024, 6:58 AM
The opposite is the case, actually. No robots.txt means there’s unrestricted access to crawl.
Here’s a great resource by Ahrefs:
Tony Castillo
Dec 6, 2024, 7:38 AM
Thanks for correcting my misinformation @Kyle Faber
Kyle Faber
Dec 6, 2024, 7:43 AM
Happy to help @Tony Castillo, it was a great question!
Kyle Faber
Dec 6, 2024, 5:03 PM
As a follow-up, here’s better, more direct info on how the lack of a robots.txt is handled:
See whether Google can process your robots.txt files - The robots.txt report shows which robots.txt files Google found for the top 20 hosts on your site, the last time they were crawled, and any warnings.
Tony McCreath
Dec 6, 2024, 5:16 PM
It's worth noting that a robots.txt file that is not accessible (e.g. it returns a 500 status code) is a signal to Google not to crawl until the file is fixed.
As mentioned, 404-type status codes are fine and indicate no restriction.
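If you want to see which case a site falls into, a quick status check against the file shows the code crawlers would get back - for example with curl (example.com standing in for the real domain):

```
curl -sI https://example.com/robots.txt | head -n 1
# HTTP/1.1 200 OK        -> file found, its rules apply
# HTTP/1.1 404 Not Found -> no file, no restrictions
# HTTP/1.1 5xx ...       -> Google holds off crawling until the file is reachable again
```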
Kyle Faber
Dec 6, 2024, 5:22 PM
Correct, that’s a great call out @Tony McCreath!
5xx errors will always be handled differently and with greater scrutiny than 404. Good to be aware of the nuance and exceptions :)
JaQueen
Dec 6, 2024, 9:39 PM
Wow, @Kyle Faber, the note you left for Tony answered another issue I've found in the audits. The WordPress admin pages were showing up in the 4xx reports, as well as pages from old subdirectories that no longer exist (more 4xx). So these are getting crawled by every bot because there's no robots.txt? And I'm guessing they show up as high-priority errors due to wasting crawl budget? Perhaps the file will help with these issues. But since the websites are small (approx. 3000 URLs on one, 5000 on the other), maybe a couple thousand of these showing up as indexation issues isn't a big deal? My hope is that once we apply proper directives to these pages, the domain authority will rise.
Will adding the robots.txt AND noindex code to these pages be the best method to resolve these errors? Or do you think the robots.txt will have little to no impact on ranking?
Also, thank you for the incredible info you've left in this thread. I gave the articles you sent to our developers.
JaQueen
Dec 6, 2024, 9:41 PM
Hi @Tony McCreath, ah! So does this mean a 5xx return on robots.txt tells Google not to crawl the website? Or is it that Google doesn't crawl the robots.txt file but will still crawl the website? P.S. Great info. Thanks for sharing!
Kyle Faber
Dec 6, 2024, 10:15 PM
@JaQueen 404s are a natural part of the web. By themselves, they are not inherently an issue.
As I said higher up, a lot of what you need to do is in the details.
Crawl budget isn’t a concern until you hit the high tens or hundreds of thousands of URLs (or more), so at 3-5k it’s not something I’d be concerned about (from a crawl budget standpoint).
As for domain authority, I’m unsure the definition you’re using when mentioning that, but regardless of definition, crawl and indexing control won’t impact any sense of authority relative to the domain.
As for noindex + robots.txt, I wouldn’t pair those together (at least at the outset). Funny thing about noindex is that the URL that contains it needs to be crawled in order to be seen and work. If you block it, noindex does not work.
Again, details here matter, so I can’t give direct answers to exactly how to proceed in your situation.
Lastly, re: your question to @Tony McCreath - 5xx errors indicate a problem with the server/website, so crawling will start to be limited/throttled if Google sees them consistently over a period of time. That applies to the website in general if it’s returning consistent 5xx errors over time.
Again, there’s nuance to this, but it can be taken as the general expected outcome :)
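To make the noindex point above concrete, this is the combination that backfires (the path here is hypothetical):

```
# robots.txt - blocks crawling of the whole section
User-agent: *
Disallow: /old-section/

<!-- A page under /old-section/ - crawlers never fetch it, -->
<!-- so this noindex is never seen and cannot take effect -->
<meta name="robots" content="noindex">
```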
Tony McCreath
Dec 6, 2024, 10:22 PM
Google wants to know if a site is signalling that they can crawl it. A 4xx indicates the site does not care, go ahead. A 5xx indicates the robots.txt file is broken. Google takes that to mean it can't access the file, so it does not know if it can crawl or not, so it decides not to crawl. It will periodically try the robots.txt file again until it's back up.
Google normally checks the robots.txt files about once a day, and it caches them for later reference to what it can crawl.
4xxs are not high-priority errors. They are a natural part of a website (as Kyle said). Pages come and go. It is worth auditing 4xx to see if any are caused by mistakes that can be fixed (e.g. via a redirect and/or fixing any links to them).
4xx pages are excluded from the index by default. No need to noindex them as long as they return 404 or 410.
Trying to disallow your missing pages in robots.txt will not help unless you really have massive crawling issues. Google crawls 4xx far less frequently than working pages and will eventually forget about them if they are not linked to from indexed pages.
JaQueen
Dec 8, 2024, 8:58 AM
@Kyle Faber OK! Thanks for the clarity on that and your thorough responses. I can't tell you how much of a difference you're making.
With your knowledge that the 4xxs really aren't too much of an issue for these small sites, do you think it's worth it at all to tell Google to delete them if the business marketing goal is to improve rank and domain authority?
For one of the sites, there are 1000 4xx pages, while the domain itself consists of only about 3000 URLs in total.
Do you think that telling Google to delete these pages (using GSC) could potentially increase rank/domain authority? The back story: the domain authority on this particular site is less than 20 and ads are not very effective. We've increased impressions per month by thousands with a few of my SEO/UX suggestions on the home page and by implementing my then (and now) novice level of content marketing improvements (topic clustering/author bios/some on-page suggestions/etc.).
So I'm moving into more depth on the technical side, and a mentor mentioned that if Google crawls thousands of 4xx pages (plus redirects), removing them from GSC may or may not increase domain authority.
Before spending time figuring out how to bulk-add these pages into GSC (perhaps it's a simple copy-paste from a spreadsheet), I wanted to get your high-level opinion! Have you ever seen such an action improve site rank/visibility in search engine queries or ads?
And thank you so much for this thread. I am learning hella info and hope that the thread is helping others too!
JaQueen
Dec 8, 2024, 9:09 AM
@Tony McCreath hmm, thanks for that! Especially the way you broke down what the 4xx and 5xx codes signal in a way that was easy for me to digest and understand. I find that the classes and articles/Google insights are not as clear for me.
Like you mentioned, I'll look to see what pages are linking to the 4xxs, if any. Also, the site does have a few hundred, if not close to 1000, redirects. I'm not sure how to find redirects that lead to 4xxs quickly, but hopefully Sitebulb or Screaming Frog has that in their reports and I just haven't gotten to that error on the list yet.
I'm getting a feeling that the "Response Codes: Internal no response" report could be housing some redirects that go to 4xxs, but that's just me guessing until I start analyzing those pages.
Thank you both for this info and knowledge - it's saving me from a useless waste of several hours and sparing the developers a lot of confusion!!
Super grateful!
Tony McCreath
Dec 8, 2024, 1:26 PM
Sitebulb and Screaming Frog should uncover redirect errors and links to 404 pages. And backlink tools may pick up broken links from external sites.
JaQueen
Dec 11, 2024, 7:22 AM
Thank you, @Tony McCreath. Any suggestions for a good backlink tool?
Tony McCreath
Dec 11, 2024, 1:36 PM
I've not used any in a while. Ahrefs, SEMrush, Majestic and the GSC Links report come to mind.
