Xenforo: Analyzing the Robots.txt File

JGaulard

Moderator
Staff member
A critically important file for just about any website on the planet is the robots.txt file. This one little file can make or break your site. While I'm not going to get into what this file is or how its syntax works, I will discuss its implications for a site that's running the Xenforo forum software. There's currently no official robots.txt file for this software, so what I share below may or may not be correct. What I write is based on my years of experience and testing. The truth of the matter is, we simply don't know how some aspects of crawling and indexing are treated by Google, so what we go on is hunches and guesses. And testing.

Down below, I'll share a decent robots.txt for Xenforo. I'll also explain, line by line, why I think it's important to contain a certain blocking protocol. What I've found that's missing from a lot of discussion on this topic in the Xenforo forum is the "why" behind the claim. "We really need to block this directory!" "Why?" "I have no idea. Someone else said to do it." Well, here I'll attempt to give you the whys.

In the most basic sense, we should block search engines from crawling anything that isn't important enough to be indexed and shown in search results. The reason for this is two-fold. First, we don't want to waste crawl budget by having the bots crawl unnecessary pages on our sites, and second, we don't want pages that may harm our rankings to be discovered and crawled.

Before I go any further, let me explain a few things I've learned through the years. First, even smaller websites of a few thousand pages can waste crawl budget by having Google and the other engines crawl unnecessary pages. I've heard Google claim that only websites with millions of pages need concern themselves with this. It's my opinion that that statement is bull. Almost all website owners need to concern themselves with this because, for some reason, when Google and the other engines crawl junk pages, error pages, and redirects, they tend to forget about the good pages and go elsewhere. I've seen this a thousand times, and trust me when I say, I look at log files and crawl rates quite a bit.

Next, and this is a big one: just because a page happens to include a meta noindex tag in its code, you still shouldn't allow the search engines to crawl that page. For some odd reason, folks out there in the wild tend to think that once they "noindex" a page, it's fair game for each and every search engine to crawl it, as if the engine will just forget about it. I'm here to say that when search engines crawl pages that include noindex meta tags, they index those pages. They won't show the page in search results, that's true, but they'll certainly index it, and those pages will harm your rankings if they're thin or junk. I've seen rankings dwindle through the years because thin pages with noindex tags were crawled and indexed anyway. "Noindex" doesn't mean that search engines won't index a page. It means that they won't return it in search results.

If you've got one or two pages that you think you should noindex because of internal duplicate content or competition, then sure, stick the noindex in there. But if you've got a huge directory that's chock full of pages that have no business being crawled at all, don't think that the noindex meta tag is going to help you here. Those pages will be crawled and a very low PageRank will be assigned to them. Those low-PageRank pages will tank the entire website. Trust me.
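The key point here is that a noindex directive lives inside the page's HTML, so a crawler has to fetch the page before it can even see the tag; robots.txt stops the fetch from happening at all. Here's a minimal Python sketch of what an engine has to do to discover the tag (the class name and sample markup are my own for illustration, not taken from Xenforo's templates):

```python
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    """Flags a page whose <meta name="robots"> content includes 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {k: (v or "") for k, v in attrs}
        if a.get("name", "").lower() == "robots" and "noindex" in a.get("content", "").lower():
            self.noindex = True

# Hypothetical thin page -- the crawler has already spent the fetch
# by the time it learns the page is noindexed.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body>thin page</body></html>'
detector = NoindexDetector()
detector.feed(page)
print(detector.noindex)  # True
```

In other words, noindex is a post-crawl instruction, while a robots.txt Disallow is a pre-crawl one. That's the whole difference this section turns on.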

The big question that we don't have a definitive answer for is this: if we block pages with the robots.txt file, will those pages hurt us? Will they be considered thin? It's my opinion that blocking them will help more than not if the pages are thin, redirects, or error pages. Take a look at this:

"While Google won't crawl or index the content blocked by robots.txt, we might still find and index information about disallowed URLs from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Search results completely by using your robots.txt in combination with other URL blocking methods, such as password-protecting the files on your server, or inserting meta tags into your HTML."

The above was taken from the Google Support area. When reading anything from Google Support, it's extremely important to read closely and take things literally. Our goal is to have Google neither crawl nor index the pages and directories we block in the robots.txt file. Now, the reason I say that blocking pages that don't help a website's indexation will help more than hurt is that I've had quite a few websites actually recover from ranking drops after I blocked certain directories and pages for a few months. It's been my experience that once a page is blocked, and once Google discovers that it's so, it'll take 90 days for that page to disappear from Google's index, if it was indexed previously. There's much to say about this subject, so if you have any questions, please ask below.

Okay, let's begin. This is the robots.txt file, as of this moment, that I propose for Xenforo:

User-agent: *

Disallow: /forum/account
Disallow: /forum/admin.php
Disallow: /forum/attachments
Disallow: /forum/conversations
Disallow: /forum/find-threads
Disallow: /forum/goto
Disallow: /forum/help
Disallow: /forum/login
Disallow: /forum/lost-password
Disallow: /forum/members
Disallow: /forum/misc
Disallow: /forum/online
Disallow: /forum/posts
Disallow: /forum/register
Disallow: /forum/search
Disallow: /forum/threads/*/post
Disallow: /forum/threads/*/latest
Disallow: /forum/watched
Disallow: /forum/whats-new

* I'm assuming the Xenforo software is installed in the /forum/ directory at the web host.
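Note that the two /forum/threads/ lines rely on the * wildcard, which Google honors but the original robots.txt standard doesn't. If you want to sanity-check which URLs a pattern actually catches, here's a rough Python sketch of Google-style matching. It's a deliberate simplification: it ignores Allow rules and longest-match precedence, so treat it as a checker for this one file, not a full parser.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # Google-style robots.txt patterns: '*' matches any run of
    # characters, '$' anchors the end, everything else is literal.
    # Rules match from the start of the path.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

# A few of the Disallow lines from the file above.
DISALLOW = [
    "/forum/attachments",
    "/forum/threads/*/post",
    "/forum/threads/*/latest",
    "/forum/whats-new",
]

def is_blocked(path: str) -> bool:
    return any(pattern_to_regex(p).match(path) for p in DISALLOW)

print(is_blocked("/forum/threads/41/post-123"))   # True  (redirect URL)
print(is_blocked("/forum/threads/41/"))           # False (canonical thread URL)
```

The useful property to confirm is exactly that last pair: the wildcard rules catch the redirect variants while leaving the canonical thread URLs crawlable.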

Below, I'll go line by line to justify my reasoning for blocking something. Some will be very quick explanations and some will be more in-depth.

Disallow: /forum/account

This is not a critical block. Google and others generally don't crawl any of these pages because there are no links to them. And if they do, the pages will return a 403 status code, meaning "Forbidden." I still like to block the directory anyway, just in case the search engines do find links to the pages. The fewer error pages being returned, the better. Although, I may be wrong.

Disallow: /forum/admin.php

This page really doesn't need to be blocked, but if it's ever discovered, it'll be considered thin content.

Disallow: /forum/attachments

This is a hugely important line. We have three options with the attachments directory:

1. Let each URL be crawled, which will result in tons of Soft 404 errors in the Google Console.
2. Block the URLs in the permission system of Xenforo, which will result in tons of 403 Forbidden (Crawl Anomaly) errors in the Google Console.
3. Block the URLs in the robots.txt file, which will result in tons of Blocked by Robots.txt entries in the Google Console.

Your guess is as good as mine when it comes to which is worse. Soft 404s are terrible and the jury is out when it comes to the other two options. If you keep the links to these pages but block them in the permission system and allow the pages to return 403 errors, you're wasting link juice. The juice will evaporate. You're essentially linking to a dead page. If you block the pages in the robots.txt file, the link juice will remain within your website. It's a toss up. I'm honestly not sure what to do.

FYI - while it appears that these links lead to images, they actually link to pages that contain images. Look at these links very carefully. They need to be blocked one way or another. If they aren't blocked, each and every one of them will be considered thin junk to Google.
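To make the trade-off concrete, here's a toy Python function mapping the three outcomes described above to the way Search Console tends to report them. The bucket names echo Search Console's coverage reports, but the 512-byte "thin page" cutoff is an invented placeholder: Google doesn't publish how it detects Soft 404s, so this is purely a mental model, not a real classifier.

```python
def console_bucket(status: int, blocked_by_robots: bool, body_bytes: int) -> str:
    # Rough mapping of a crawl outcome to a Search Console report.
    # The 512-byte threshold is an assumption for illustration only.
    if blocked_by_robots:
        return "Blocked by robots.txt"
    if status == 403:
        return "Crawl anomaly (403 Forbidden)"
    if status == 200 and body_bytes < 512:
        return "Soft 404"
    return "Crawled normally"

# The three attachment-handling options, one per line:
print(console_bucket(200, False, 300))  # left crawlable, thin page
print(console_bucket(403, False, 0))    # blocked via Xenforo permissions
print(console_bucket(200, True, 300))   # blocked via robots.txt
```

Only the third option stops the fetch before it happens, which is why it's the only one that conserves both crawl budget and link equity.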

Disallow: /forum/conversations

Again, this isn't a critical block. If crawled, these pages will return 403 errors when the visitor isn't logged into the system. See my justification for blocking such pages above.

Disallow: /forum/find-threads

Same as above.

Disallow: /forum/goto

Here's an interesting one that's been discussed fairly regularly in the Xenforo community. Pages that contain the /goto/ directory path are actually 301 redirects. The problem with redirects is that while they are strict commands, it can take the search engines quite some time to actually obey those commands. Sure, the crawlers will follow the redirects, but may wait quite a bit to actually canonicalize the redirected URLs to the destination URLs. In the meantime, the redirected page and the target page will be considered duplicates.
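If you want to see how much crawl budget the /goto/ redirects are actually eating, your server's access log will tell you. A small Python sketch, using fabricated sample lines in common Apache/Nginx log format (the IPs, timestamps, and URLs are made up for illustration):

```python
import re
from collections import Counter

# Fabricated sample access-log lines.
sample_log = """\
66.249.66.1 - - [01/Jan/2024:00:00:01 +0000] "GET /forum/goto/post?id=123 HTTP/1.1" 301 0
66.249.66.1 - - [01/Jan/2024:00:00:02 +0000] "GET /forum/threads/41/ HTTP/1.1" 200 5120
66.249.66.1 - - [01/Jan/2024:00:00:03 +0000] "GET /forum/goto/latest?id=9 HTTP/1.1" 301 0
"""

request = re.compile(r'"GET (\S+) HTTP/[\d.]+" (\d{3})')

goto_hits = Counter()
for line in sample_log.splitlines():
    m = request.search(line)
    if m and m.group(2) == "301" and m.group(1).startswith("/forum/goto"):
        # Count redirect fetches per path, ignoring query strings.
        goto_hits[m.group(1).split("?")[0]] += 1

print(sum(goto_hits.values()))  # 2 of the 3 sample fetches were /goto/ redirects
```

Run something like this over a real log filtered to Googlebot's user agent and you'll see how much of the bot's time goes to URLs that never resolve to content directly.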

Disallow: /forum/help

The help pages are just a handful of pages that are duplicated across pretty much every Xenforo website on the internet. It's better to block these pages, lest they be considered duplicates.

Disallow: /forum/login

Same thing. Duplicates that contain noindex.

Disallow: /forum/lost-password

Same.

Disallow: /forum/members

These pages are almost identical to the /attachments/ pages. They're mostly thin. The only difference is that if crawled and indexed, they won't return Soft 404 errors; they'll likely be indexed and treated as thin pages that hurt in the long run. It's better to block all of these pages one way or another.

Disallow: /forum/misc

These pages are mostly found behind the /posts/ directory that I'll get to below. If you allow members to add their city and state to their accounts, users browsing the website will be able to click links that 301 redirect to Google Maps. These links contain the /misc/ path, so it's best to block them.

Disallow: /forum/online

Again, not critical. These pages will return 403 errors if the user/search engine isn't logged in to an account.

Disallow: /forum/posts

This path is important to block. If not blocked, then each post that contains a reaction will lead to a page that uses the /posts/ path. These pages are thin and contain a noindex meta tag.

Disallow: /forum/register

Same as the login and lost-password pages.

Disallow: /forum/search

The search results area should be blocked at all costs. This is what we call a spider trap in the biz. Search engines can get lost in these worthless pages.

Disallow: /forum/threads/*/post
Disallow: /forum/threads/*/latest


Blocking these two paths is highly debatable. Every single thread has a regular canonical URL that leads to it, as well as one or both of these types of 301 redirects. Please see the /goto/ section above for an explanation. Under most circumstances, these links are even more prevalent than the /goto/ links.

Disallow: /forum/watched

Not critical, as these pages will return 403 errors if followed when the user isn't logged in.

Disallow: /forum/whats-new

The most critical of all. Block, block, block. These pages self-replicate based on user session, and they all contain the noindex meta tag. There are also no links to the replicated pages, making them very low value. If not blocked, you'll end up with hundreds of thousands of these pages before too long. These types of pages are why Google invented the Panda penalty. Very similar to the search pages above, but worse.

Well there you have it. I hope I've given you something to chew on. The topic of what to block and what not to can drive you crazy. If you have anything to add or if you dispute something I've written, I'd love to read what you have to write. Prove me wrong. Please. I love to learn and I have a very small ego. I won't be offended.
 

KodyWallice

Member
If you block the /attachments/ directory, aren't you blocking the uploaded images from being crawled and indexed for image search too?
 

JGaulard

Moderator
Staff member
That's a very good question. The answer is no: you won't be blocking the images from being crawled and indexed by the search engines. Images are stored in the /data/attachments/ directory. The regular /attachments/ directory is simply a URL rewrite that has nothing to do with the real path, which includes the /data/ portion.
 

JGaulard

Moderator
Staff member
Interesting thing happened once I blocked these two URL types in the robots.txt file:

Disallow: /forum/threads/*/post
Disallow: /forum/threads/*/latest

I suspected something like this might happen. When I checked the indexed URLs at google.com by typing in site:indyfor.com, no new URLs had appeared over the previous 24 hours, even though a few posts had been written during that time. I think the reason no new pages were showing is that Google was crawling multiple URLs for each page. URLs such as:

https://indyfor.com/threads/41/
https://indyfor.com/threads/41/post-123

The first URL is the canonical one. The second 301 redirects to the first. 301 redirects take a while to settle in, and in the meantime, it appears that there's duplicate content on the website. In fact, it would appear that there's duplicate content across the entire website. Xenforo software introduces tons of redirects, so it's our job to find these URL types and block them in robots.txt.
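One way to spot this duplication in a crawl export or log file is to normalize every URL back to its canonical thread form and see how many variants collapse into one. A quick Python sketch, with the URL shapes assumed from the examples above (a real crawl may contain other redirect variants):

```python
import re

def canonical_thread(url: str) -> str:
    # Collapse /threads/<slug>/post-<n> and /threads/<slug>/latest
    # back to the canonical /threads/<slug>/ URL (assumed pattern).
    return re.sub(r"(/threads/[^/]+/)(?:post-\d+|latest)$", r"\1", url)

urls = [
    "https://indyfor.com/threads/41/",
    "https://indyfor.com/threads/41/post-123",
    "https://indyfor.com/threads/41/latest",
]
print({canonical_thread(u) for u in urls})  # all three collapse to one URL
```

If a large fraction of the crawled URLs collapse this way, that's the duplication the search engines were seeing before the robots.txt block took effect.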
 