Request: Currently, on multiple websites, a rule is added to the robots.txt file to keep the content-source paths out of the index: Disallow: /pf/api/
However, there is a fundamental difference between authorizing crawling (access) and authorizing indexing (visibility). We would like to implement the changes below, based on the following points:
Google seeks to index content that is useful to human users (HTML). URLs under /pf/api/ generally return raw JSON.
Googlebot "consumes" this JSON to understand and build the web page (rendering).
However, Google has no interest in displaying a raw data file (JSON) in its search results because it provides no value to the end user.
By modifying the robots.txt to allow /pf/api/, we are telling Google: "You have permission to read this data to build my pages." This does not mean: "You must display these URLs in your search results."
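For illustration only, the robots.txt side of the change could look like the sketch below; the exact rule depends on each site's current file, and the `User-agent: *` group shown here is an assumption.

```
# Sketch: either remove the existing "Disallow: /pf/api/" line, or override it
# so crawlers are allowed to fetch the content-source responses.
User-agent: *
Allow: /pf/api/
```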
The best practice for APIs is therefore not to block them in robots.txt, but to allow crawling and send an indexing directive in the HTTP response headers instead; a URL blocked by robots.txt is never fetched, so a noindex directive on it would never even be seen.
https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#xrobotstag
Could you please check whether this header could be added to the API responses at the server or CDN level: X-Robots-Tag: noindex
Why is this the perfect solution?
The bot can read the content (Crawl: OK): It can therefore render the articles correctly and see the text.
The bot cannot index it (Index: NO): The API URL itself will never be visible in Google search results.
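Purely as an illustration of the mechanics (this is not PB Engine's actual configuration surface, and the Node/Express-style server layer shown here is an assumption), attaching the header at the application level could look like the sketch below; at the CDN level it is typically an equivalent one-line response-header rule on the /pf/api/* path pattern.

```typescript
import express from "express";

const app = express();

// Illustrative sketch: attach "X-Robots-Tag: noindex" to every response served
// under /pf/api/, so crawlers may fetch the JSON but never index the URL itself.
app.use("/pf/api", (_req, res, next) => {
  res.setHeader("X-Robots-Tag", "noindex");
  next();
});

// Hypothetical content-source route, standing in for the real API handlers.
app.get("/pf/api/example-source", (_req, res) => {
  res.json({ example: "content-source payload" });
});

app.listen(3000);
```

Once deployed, fetching any /pf/api/ URL and inspecting the response headers (for example with curl -i) should show X-Robots-Tag: noindex while the JSON body is still returned normally.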
Hi Fatih,
Thanks for the update and for looking into this. That sounds great.
Please keep me posted on the approach you decide on, as it will help us align on our side and anticipate any constraints.
Best regards,
Hi Edem,
Thanks for submitting this idea; it makes sense. We're exploring a generic way to set response headers from PB Engine, on both content-source responses (/pf/api/) and page responses. We'll consider this header as part of that work; we still need to decide whether only a whitelist of headers will be allowed, or any header.
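For illustration of the whitelist option only (this is not PB Engine's actual implementation; the names below are hypothetical), the filtering step could be as simple as:

```typescript
// Hypothetical sketch of the allow-list option: site-configured response headers
// are applied only if their names appear in a fixed allow list.
const ALLOWED_RESPONSE_HEADERS = new Set(["x-robots-tag", "cache-control"]);

type HeaderConfig = Record<string, string>;

function filterConfiguredHeaders(configured: HeaderConfig): HeaderConfig {
  const applied: HeaderConfig = {};
  for (const [name, value] of Object.entries(configured)) {
    if (ALLOWED_RESPONSE_HEADERS.has(name.toLowerCase())) {
      applied[name] = value;
    }
  }
  return applied;
}

// Example: X-Robots-Tag passes the filter, an arbitrary header does not.
console.log(
  filterConfiguredHeaders({
    "X-Robots-Tag": "noindex",
    "X-Internal-Debug": "true",
  })
);
```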