

Noah
Jan 19, 2025, 9:59 AM
Forwarded thread from another channel:
Shawn Huber
Jan 10, 2025, 9:33 AM
Do the AI bots respect the usual directives like search bots? noindex/nofollow, etc.
Joe Pawlikowski
Jan 10, 2025, 9:49 AM
What I've found so far is that when performing a function within ChatGPT, it will not work if your domain disallows it via robots.txt. But I've also seen results from those sites cited in a GPT Search context. So they're likely not respecting robots.txt in general, but they are for direct user actions.
Shawn Huber
Jan 10, 2025, 9:50 AM
What about on-page directives for noindex?
Shawn Huber
Jan 10, 2025, 10:03 AM
I asked GPT.....
No, *ChatGPT does not respect* noindex *rules* because it does not actively crawl or index web pages. Instead, when retrieving information via its web browsing tool, ChatGPT queries public search engines or directly visits URLs provided by the user. Here’s why:
1. *How ChatGPT Retrieves Information:*
• It uses a search engine to access publicly available information.
• It does not perform direct web crawling like a search engine crawler (e.g., Googlebot).
2. *Noindex Tags in Context:*
• Noindex tags are instructions to search engine crawlers to avoid indexing specific pages.
• ChatGPT bypasses this concept because it doesn’t parse or store information based on a site’s robots.txt or meta tags.
3. *Respect for Private Data:*
• ChatGPT adheres to ethical guidelines and OpenAI’s policies, which prohibit scraping or using data intended to be private or confidential.
If you’re concerned about sensitive or restricted content, the best approach is to avoid sharing such URLs here. Additionally, let me know if you’d like more detail on handling noindex or private data management!
Shawn Huber
Jan 10, 2025, 10:04 AM
Though the way I'm reading this, as long as you don't share the URL directly with GPT, it relies on search engines, so if the URL is marked noindex, the search engines shouldn't return it.
Shawn Huber
Jan 10, 2025, 10:05 AM
No, ChatGPT does not look at or respect robots.txt files because it doesn’t function as a web crawler. When using its web browsing tool, it interacts with public search engines or directly visits URLs provided by the user, but it does not independently crawl or index websites.
*Why ChatGPT Doesn’t Use robots.txt:*
1. *Not a Crawler:*
• The robots.txt file is designed to provide instructions to web crawlers like Googlebot or Bingbot, not tools like ChatGPT.
• Since ChatGPT doesn’t actively crawl the web, it doesn’t consult robots.txt for access permissions.
2. *Access to Public Content:*
• When browsing, it accesses content that is publicly available and retrievable by a browser or search engine.
3. *Ethical Data Use:*
• ChatGPT follows OpenAI’s policies to avoid accessing or using private, sensitive, or restricted content.
If you have specific concerns about privacy or ethical use of data, feel free to ask for clarification!
Shawn Huber
Jan 10, 2025, 10:12 AM
This is interesting - there are a lot of posts out there about adding
```
Allow: /privatePage/
Disallow: /privatePage/publicFile.jpg
```
But if GPT's answer is accurate, then it doesn't matter - you can't block them from accessing your site via standard practices; you'd need to do it at the CDN level.
Joe Pawlikowski
Jan 10, 2025, 10:40 AM
wondering why the tool I was using wouldn't function then
Shawn Huber
Jan 10, 2025, 10:46 AM
FWIW, I just tested a URL marked noindex and one that is blocked via robots.txt and GPT has zero issues looking at the page and giving me all the details about it
Shawn Huber
Jan 10, 2025, 10:46 AM
It even gave me a "helpful" guide on how to use the noindex directive on said URLs :melting_face:
Renee Bigelow
Jan 12, 2025, 10:15 AM
OpenAI has different bots for different use cases. Here is their documentation for how each works:
One for search, one for user queries and one for model training. They all use robots.txt tags.
Renee Bigelow
Jan 12, 2025, 10:17 AM
I can’t tell you how much they adhere to it, but I do suspect that many people who think they are blocking are not including all three.
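One way to check the "not including all three" point is to run a robots.txt file through Python's standard `urllib.robotparser` and see which agents it actually blocks. This is only a rough sketch: the three user-agent tokens below (`GPTBot`, `OAI-SearchBot`, `ChatGPT-User`) are my understanding of what OpenAI's crawler documentation lists and are not quoted in this thread, so verify them against the current docs. The sample robots.txt and the `example.com` URL are made up for illustration, and the result only says what the file requests, not whether a bot complies.

```python
from urllib.robotparser import RobotFileParser

# Assumption: user-agent tokens as I understand OpenAI's crawler docs
# (training, search, user-initiated fetches). Check the current docs.
OPENAI_AGENTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def blocked_agents(robots_txt, url, agents=OPENAI_AGENTS):
    """Return {agent: True if robots_txt disallows url for that agent}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt that names only one of the three agents.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

result = blocked_agents(SAMPLE, "https://example.com/page")
# GPTBot is blocked; the other two fall through to the wildcard group.
print(result)
```

With a file like the sample, only the training crawler is told to stay out, which matches the suspicion above that a partial block is easy to mistake for a full one.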
Shawn Huber
Jan 12, 2025, 10:33 AM
Thank you @Renee Bigelow! I’ll play around with this.
John Mueller
Jan 12, 2025, 12:41 PM
Of the AI services, I think OpenAI is the one (well, apart from Google, obvs - I work on some of the controls policies) that has the opt-out well defined & complies. You can test these things (I do), just beware that LLMs will explain things by guessing: if your url is and you ask what it's about, then don't be surprised if it says cheese. The others either don't have (documented) controls or are flexible (one of them will fetch both robots.txt and your page at about the same time; if your page comes back first, it wins). But all of this is for training by crawling, and that's the easy part of AI on the web.
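For anyone who wants to test this from their own server logs, here is a minimal sketch of one possible check: did a given bot request a page before it ever requested robots.txt? It assumes the access log has already been parsed into `(timestamp, user_agent, path)` tuples. That record shape, the function name, and the `ExampleBot` / `OtherBot` agents are all hypothetical and chosen for illustration only.

```python
from datetime import datetime


def fetched_page_before_robots(records, agent_token):
    """Check the order of a bot's first requests.

    records: iterable of (timestamp, user_agent, path) tuples, assumed to be
    parsed from an access log already (hypothetical shape).
    Returns True if the bot's earliest request was not /robots.txt,
    False if it was, and None if the bot never appears.
    """
    token = agent_token.lower()
    hits = sorted((ts, path) for ts, ua, path in records if token in ua.lower())
    if not hits:
        return None
    return hits[0][1] != "/robots.txt"


# Made-up log records for two hypothetical bots.
records = [
    (datetime(2025, 1, 12, 10, 0, 0), "ExampleBot/1.0", "/pricing"),
    (datetime(2025, 1, 12, 10, 0, 1), "ExampleBot/1.0", "/robots.txt"),
    (datetime(2025, 1, 12, 10, 5, 0), "OtherBot/2.0", "/robots.txt"),
    (datetime(2025, 1, 12, 10, 5, 2), "OtherBot/2.0", "/pricing"),
]

print(fetched_page_before_robots(records, "ExampleBot"))
print(fetched_page_before_robots(records, "OtherBot"))
```

This only looks at request order within the log window you feed it; a bot may have cached robots.txt from an earlier visit, so a `True` here is a prompt to look closer, not proof of non-compliance.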
John Mueller
Jan 12, 2025, 12:45 PM
And back to meta tags, afaik only Bing has a control via meta-tag. Everything else is robots.txt. There are more control systems in discussion, but robots.txt is easy to understand, very strong & observable, and reasonably granular. I'm biased, but I think with transparency (AI model creators tracking how they crawl & what they comply with) it's a strong choice. Fun times. Your robots.txt knowledge will be useful. (JS sites are another fun angle here)
Shawn Huber
Jan 12, 2025, 1:05 PM
Thank you @John Mueller, appreciate the detailed response!
Renee Bigelow
Jan 12, 2025, 1:12 PM
Now wondering which one fetches both…
Shawn Huber
Jan 12, 2025, 1:18 PM
They seem so trustworthy and innocent with their responses :rolling_on_the_floor_laughing:
John Mueller
Jan 12, 2025, 2:02 PM
They can't know about themselves unless it's explicitly in the system prompt or if it does live lookups (in which case, you can just read the docs yourself). They train from public data, and there won't be public data about it before it's live. Always read the docs directly (and ideally verify the behavior, if you need to rely on it).
John Mueller
Jan 12, 2025, 2:04 PM
IMO this is one area where SEOs can be fantastic consultants to site owners - you know how crawling works, you know robots.txt, you can learn how LLMs work, and you can help make decisions on what makes sense for gaining awareness / visibility for companies & brands. The whole AI space is filled with hypemobiles, but you can choose to be a reasonable consultant and help clients to make reasonable decisions (which could be to block all or parts, or leave it all open -- at least make a decision). If you've been doing good work as an SEO for them, they'll trust you (hopefully) more than a rando Sam-From-The-Internet who claims AI is here to explode & fix everything.
