When Should I Block URLs in My Robots.txt File?

  • Thread starter Phoenix1
Phoenix1

Active member
  • #1
I have a quick question. I recently launched a classifieds website and the software is somewhat of a mess. It's not optimized for search engines at all and I think there are many pages that aren't necessary for the crawlers to crawl. So my question is, how do I know what to block in my robots.txt file? Which pages should I block? What will happen if I block them? I am pretty new at this.
 
Cameron

Active member
  • #2
This is actually a great question. While not a lot of focus is placed on the robots.txt file in the SEO world anymore, it's still one of the best methods for managing search engine crawling on a website. Many other, more modern methods for handling SEO have been introduced through the years, but nothing compares to the power of this one file. All the big sites use theirs extensively, from Amazon to eBay to Google itself. Have you ever seen their files? I'll link to them here:

https://www.amazon.com/robots.txt

https://www.ebay.com/robots.txt

https://www.google.com/robots.txt

As you can see, the robots.txt file is alive and well. These big guys have got tons blocked.

To answer your question, there are three primary reasons you might want to block a URL, a group of URLs, or a directory in your robots.txt file. The first reason is to keep duplicate content on your own website from being crawled. I see that you're running a classifieds site, so I'm sure you've got category pages with lots of sorting options and the like. You may even have ad pages with ancillary pages that display only the ad images, or printer-friendly versions. All of these types of pages can introduce duplicate content, which can severely damage your rankings. I've seen website rankings plummet because of duplicate content. While there are other methods to handle this sort of thing, it's best to just block these duplicate pages and be done with it. Canonical tags and 301 redirects only work some of the time, but the robots.txt file works all of the time.
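
To make that concrete, here is a rough sketch of what those duplicate-content rules could look like. The paths are placeholders, not anything your software necessarily uses, so you'd swap in whatever actually generates the printer-friendly pages, image-only views, and sorted category pages on your site:

  User-agent: *
  # Printer-friendly copies of ad pages (placeholder path)
  Disallow: /print/
  # Image-only views of ads (placeholder path)
  Disallow: /ad-images/
  # Sorted versions of category pages (placeholder directory)
  Disallow: /category-sort/

Each Disallow line is matched against the start of the URL path, so a single directory rule covers every page underneath it.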

The second reason you might want to use this file is to preserve your crawl budget. Google and the other search engines are only willing to crawl a certain number of pages on your website per day, and if you've got them crawling duplicate and low-value pages, many of your good pages might never be seen. That's not a good thing. Furthermore, the more low-value pages you allow the search engines to crawl, the less they want to crawl. They like high-value pages, and they'll reward you with lots of crawling and good rankings if those are the only kind you show them.
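
If your low-value URLs share a pattern, like a query parameter, you don't have to list them one by one. The major search engines, Google and Bing included, support the * wildcard in robots.txt patterns, so a couple of rules can cover thousands of URLs. The parameter names below are made up; check what your own software appends before copying anything like this:

  User-agent: *
  # Any sorted or filtered variation of a category page (made-up parameters)
  Disallow: /*?sort=
  Disallow: /*&view=

A wildcard rule like that keeps the crawler focused on your clean, rewritten URLs instead of burning its daily crawl on endless parameter combinations.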

The final reason it's good to use your robots.txt file is to block access to low-value, thin content. On your classifieds site, you may have pages that contain only a form, such as a contact seller or send to a friend page. These are very low value, and your website rankings will suffer if you allow them to be crawled. So block them and save yourself a headache.
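
Those form pages are usually just a couple more Disallow lines in the same file; all of these rules sit together under one User-agent group. The paths here are guesses, since every classifieds package names them differently:

  User-agent: *
  # Thin, form-only pages (guessed paths)
  Disallow: /contact-seller/
  Disallow: /send-to-friend/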

I hope this helps.
 
Newman

Active member
  • #3
I ran a classifieds website for over 14 years and battled all kinds of duplicate content on it. First, the homepage itself had a number of versions. I found that many of the URLs that duplicated the homepage had question marks (?) in them, so I blocked them in robots.txt. Then I found a distinct rewritten URL that was a copy of the homepage, so I blocked that too. As for the categories, I had primary URLs that were rewritten like /CategoryName/Number/PageX.html, so they were fine. They included a bunch of sort options though, and all of those sort URLs had parameters in them with more question marks, so I blocked all of the sort option pages.

As for the ad pages, each one led to another page that someone could click on to see just the ad images. Then there was the printer-only page. And then there were the pages with URLs that included more parameters: contact seller, send to a friend, vote on ad, view votes, and another one or two; I think there were five in all. So for every ad page, there were seven extra system-generated pages. Those were all thin. The printer-friendly and the image pages were duplicate content. The sort pages for the categories were duplicates. It was a mess.

I blocked all of these extra pages. The only ones I allowed the search engines to crawl were the one homepage, the category pages and their associated pagination, and the single ad pages with none of the extra pages. That seemed to work well. I didn't use 301 redirects or canonical tags. I tried and never had any luck with those options.
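
For anyone who wants to picture it, a stripped-down file along those lines looks something like this. I'm not copying my old file word for word, just sketching the idea, so treat the paths and names as stand-ins:

  User-agent: *
  # Anything with a query string: homepage duplicates, category sort
  # options, and the contact/send/vote pages hung off each ad
  Disallow: /*?
  # The one rewritten URL that duplicated the homepage (stand-in name)
  Disallow: /index2.html
  # Image-only and printer-friendly versions of ad pages (stand-in paths)
  Disallow: /ad-images/
  Disallow: /print/

The clean homepage, the rewritten category pages, and the single ad pages never match any of those patterns, so they stay crawlable.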

The thing about blocking pages in the robots.txt file is that if you're blocking pages that are linked directly from another page you're keeping crawled, you shouldn't notice any temporary ranking drop. If you're blocking a directory that's got only one link going into it but tons of pages contained inside, like an entire section of the website, you'll notice a temporary drop in rankings. The reason for this is that the shallow, one-hop pages still have link flow (juice) going directly to them. The section that gets blocked, with only one entrance path, becomes problematic because the link flow gets cut off to all of the interior pages completely. So if you have 1,000 pages in a section and you block the doorway into that section, all of the URLs contained inside have to be drained of their PageRank completely before they get dropped from Google's index. That can take months or even years, depending on the website. But it needs to be done, so just do it.

Basically, my point is, block all pages that don't need to be crawled. The ones that you wouldn't expect to show up in the Google results.
 