XenForo: Why Isn't Google Crawling All My Pages?

JGaulard

I was reading through a few threads in the XenForo community the other day when I stumbled across a topic that's near and dear to my heart. A member asked why his pages weren't being crawled by Google at nearly the rate he thought they should be. As a matter of fact, he said that after his migration from vBulletin, only about 30% of his pages had been crawled and indexed. Now, if you aren't familiar with search engine indexation rates, this isn't a very good stat. Many more pages should have been indexed by this point. I believe this member mentioned that he made the site migration a year earlier, which should have given Google ample time to take note of the new URL mapping. Then again, Google has been notoriously slow at resolving such things in the past, so that may be what's going on.

I've been dealing with my own indexation rates on a few of my XenForo sites as well, so this member isn't alone. I've done a lot of testing though and have come to the conclusion that there are many pages contained within the default install of XenForo that are completely unnecessary for Google to even know about. Luckily, the software comes with a powerful template and permission system that's fairly easy to work with. And the community they offer is great. I've asked numerous questions and have received prompt replies from other members as well as the developers themselves. Okay, enough of my gushing over XenForo.

There's this thing Google's got that's called crawl budget. You'll need to look this up to familiarize yourself with the concept. It's based on how popular a website is and what kinds of pages are being crawled on that website. Google rarely crawls all pages on a website. Instead, it runs an algorithm that determines how many pages, and which ones, should be targeted for crawling. Now here's the kicker - just because a page exists, that doesn't mean that it'll be crawled. It can even be easily accessible and front and center and, for reasons I'll explain below, Googlebot will completely ignore it. Weird, I know.

Let's pretend that you've got 20 URLs on your website. Ten of those URLs lead to pages that are perfect and are filled with wonderful content, but the other 10 are 301 redirects that lead to those good pages. While one might suspect that the first 10 pages would be easily crawled and indexed, the second set of 10 URLs has somewhat polluted the entire population, resulting in Google only crawling five of the good pages. Because there were lousy 301 redirected URLs on the site, Google decided that the entire site isn't good enough to crawl, so it only crawled a portion of it and then went off to do something else. I'm obviously oversimplifying this scenario, but you get the idea.

Here's the thing: for every one good direct thread URL in the XenForo URL structure, there are two 301 redirected URLs that point to the same thread. And when someone replies to a post in that thread, another 301 redirected URL is created. And when someone replies to a post via a quote, another 301 redirect. And when someone reacts to a post, another page is formed. Not a 301 redirect, but an entire page that's got a noindex tag on it. It's worthless for Google to crawl, as are all of the 301 redirects. I mean, think about it. Let's say that one thread (the only thing Google should even know about) has 5,000 posts added to it and 5,000 replies. And let's say that each and every post was reacted to with a thumbs up or a smiley face. Right there you've got 15,000 extra URLs for Google to crawl and evaluate when it should only know about one. That's right. Just one URL. Now say your site is huge and you've got millions of threads with millions of replies and so forth. If you think that Google and the other search engines are going to do all that crawling, you've got another think coming.
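To put a number on the thread example above, here's a rough back-of-the-envelope sketch in Python. It assumes the same simplified counting used in that example (one 301 redirect URL per post, one per reply, and one noindex page per reaction); the function name is made up for illustration and real URL counts on a live forum will differ.

```python
def extra_urls(posts: int, replies: int, reactions: int) -> int:
    """Count the non-indexable URLs in the simplified thread example.

    Assumption: one 301 redirect URL per post, one per reply, and
    one noindex page per reaction.
    """
    return posts + replies + reactions


extra = extra_urls(posts=5_000, replies=5_000, reactions=5_000)
indexable = 1  # the one thread URL Google should actually know about

# Share of all discoverable URLs that are worth indexing.
share = indexable / (indexable + extra)

print(extra)           # 15000
print(f"{share:.4%}")  # 0.0067%
```

In other words, under those assumptions, fewer than one in ten thousand of the URLs tied to that thread is one you'd actually want indexed.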

So there's that, but there's also more. Member pages and attachment pages also shouldn't be crawled. Google shouldn't even know about these pages, which can be either crawled freely, blocked with robots.txt (PageRank drain), or blocked with the permissions system so they return 403 pages when accessed (PageRank drain). Again, if you've got millions of member page URLs and millions of attachment page URLs that are accessible to Google, your problem just got compounded. An SEO firm called Botify has cleverly named these pages "non-compliant." Check out these two articles put out by them:

https://www.botify.com/blog/crawl-budget-optimization-for-classified-websites

https://www.botify.com/blog/seo-compliant-urls
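If you want to see which of the states mentioned above (crawled freely or blocked with robots.txt) your member and attachment URLs currently fall into, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt rules, the example.com URLs, and the `is_crawlable` helper are all hypothetical, and the `/members/` and `/attachments/` prefixes assume a forum installed at the site root. It only reports what a given robots.txt allows; it doesn't say which approach is the right one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; swap in your own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /attachments/
"""


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules in ROBOTS_TXT let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up URLs standing in for a thread, a member page, and an attachment.
urls = [
    "https://example.com/threads/some-thread.1/",
    "https://example.com/members/some-member.42/",
    "https://example.com/attachments/some-image-jpg.7/",
]

for url in urls:
    status = "crawlable" if is_crawlable(url) else "blocked by robots.txt"
    print(f"{url} -> {status}")
```

To check a live site instead of the inline rules, `RobotFileParser` can also load a real file with `set_url()` followed by `read()`.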

So, the big question is, how do we fix this? Well, I'll tell you. Actually, I already did. All you need to do is read this beautifully written post. Please let me know what you think.