Your robots.txt file tells search engines which pages they can crawl and which they can't.
A robots.txt file
Here are the three main parts of this file and what each does:
User-agent: This line identifies which crawler the rules apply to, and an asterisk ("*") means the rules apply to all search engine robots.
Disallow/Allow: These lines tell search engine robots which parts of your website, or which specific sections, they can and cannot crawl.
Sitemap: This line indicates the location of your sitemap.
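Put together, a minimal robots.txt file using these three parts could look like the sketch below. The /admin/ path and the example.com URLs are placeholders for illustration, not values taken from this article.

    # Apply the rules below to all search engine robots
    User-agent: *
    # Block crawling of the admin section
    Disallow: /admin/
    # But still allow this one public page inside it
    Allow: /admin/public-page.html
    # Tell crawlers where the sitemap lives
    Sitemap: https://www.example.com/sitemap.xml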
Advice
Add your sitemap index URL (the main sitemap that contains all your sitemaps) to your robots.txt file to help crawlers discover and understand your site structure more quickly.
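For example, a single line like the one below is enough; the URL is a placeholder for your own sitemap index.

    # Crawlers will discover every sitemap listed inside the index file
    Sitemap: https://www.example.com/sitemap_index.xml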
URL errors
Unlike site errors, URL errors only affect the crawlability of specific pages on your site.
Here is a summary of the different types:
404 Errors
A 404 error means that the search engine bot was unable to find the URL, and it is one of the most common URL errors.
It occurs when:
You have changed the URL of a page without updating the old links that pointed to it
You have deleted a page or article from your site without adding a redirect
You have broken links, for example URLs that contain typos (a quick check is sketched below)
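One option is to request each suspect URL and look at the HTTP status code it returns. The short Python sketch below does that; the URLs and the use of the requests library are assumptions for illustration, not details from this article.

    import requests

    # Hypothetical list of URLs to check; replace these with links from your own site
    urls = [
        "https://www.example.com/old-page/",
        "https://www.example.com/blog/renamed-post/",
    ]

    for url in urls:
        try:
            # A HEAD request is enough to read the status code without downloading the page body
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code == 404:
                print(f"404 Not Found: {url}")
            else:
                print(f"{response.status_code}: {url}")
        except requests.RequestException as exc:
            print(f"Request failed for {url}: {exc}")

Any URL that prints 404 needs attention: update the link, restore the page, or add a redirect to its new location.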
This is what a basic 404 page looks like on an Nginx server.
A basic 404 page with the message "404 Not Found".
But most companies today use custom 404 pages.
These custom pages improve the user experience and allow you to maintain consistency with your website's design and branding.
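On an Nginx server, for instance, a custom 404 page is typically wired up with the error_page directive. The snippet below is a sketch with a placeholder file name, not your site's actual configuration; it goes inside the server block.

    # Serve a branded page instead of the default Nginx 404
    error_page 404 /custom_404.html;

    location = /custom_404.html {
        # Only serve this file as an internal error response, never on a direct request
        internal;
    }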
Amazon custom 404 page with a picture of a dog named "Brandi"
Soft 404 errors
Soft 404 errors occur when the server returns a 200 code but Google thinks it should be a 404 error.
Code 200 means everything is fine. It is the expected HTTP response code if there are no problems.
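You can check what a page actually returns with a few lines of Python. The URL below is a placeholder, and the requests library is an assumption for illustration.

    import requests

    # Placeholder URL; swap in any page you suspect is a soft 404
    url = "https://www.example.com/search?q=no-results"

    response = requests.get(url, timeout=10)

    # A soft 404 candidate: the server answers 200 but the page body is nearly empty
    print("Status code:", response.status_code)
    print("HTML length:", len(response.text), "characters")

If the status code is 200 but there is little or no real content, either return a proper 404 (or 410) or add content that makes the page worth indexing.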
So what causes soft 404 errors?
Problems with JavaScript files: A JavaScript resource is blocked or cannot be loaded, so the page renders without its real content
Thin content: The page has so little content that it does not provide enough value to the user, such as an empty internal search results page.
Low-quality content or duplicate content: The page is not useful to users or is a copy of another page. Examples include placeholder pages that shouldn't be live, such as those still filled with "lorem ipsum" text, and duplicate pages that don't use canonical URLs, which tell search engines which version is the main page (a canonical tag example is shown below).
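A canonical URL is declared with a link tag in the head of the duplicate page. The URLs in this sketch are placeholders.

    <!-- Points search engines to the preferred version of the page -->
    <link rel="canonical" href="https://www.example.com/main-page/">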