How to Fix Index Bloat


Index bloat is bad on so many levels. Unless you're a webmaster who dwells on these kinds of things, you may not even know your website has a problem. You most likely launched your site a while back and have been adding to it ever since. It began to rank for some keywords a few months to a few years later and you've been coasting along since then. Did you know that your site may have the potential to rank much, much better than it currently does? Did you know that you may have an issue we in the webmaster and SEO world refer to as index bloat?

So what is index bloat, anyway? I'll explain it this way. The index I'm referring to here is Google's. It could be Bing's or any other search engine's, but for now, let's focus on Google, since that's really the only one that matters. If your website consists of a homepage and four other pages, then you've got a five-page site. All Google should know about and have indexed are those five pages. Let's say, though, that you've also got some additional pages on the site that you don't want to count. They don't really matter to search engines, but they help the website operate the way it's supposed to. Every page except for the homepage sorts five different ways. Right there, you've got 20 extra pages. Now let's say that you've got some old URLs that 301 redirect to those four pages. Okay, add on eight more there. Now let's say that you've got a contact page that spins off an additional page every time someone uses it. There's an endless number of pages there. Basically, what I'm trying to say is that, depending on which content management system you're using, you can have a huge number of pages beyond those you currently know about that are polluting Google's index. You may not be aware of this, but these extra pages are dragging down your entire site's ranking.
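To put numbers on that example, here's a quick Python sketch of the arithmetic. The figures are the ones from the paragraph above, the function name is my own, and the endless contact pages are left out because they can't be counted.

```python
# Numbers from the five-page example above; the names are illustrative only.
real_pages = 5                    # homepage + four other pages
sortable_pages = real_pages - 1   # every page except the homepage
sort_orders = 5                   # each sortable page sorts five ways
redirecting_urls = 8              # old URLs that 301 to the four pages


def crawlable_urls(real, sortable, orders, redirects):
    """Return (total URLs a crawler can find, share that are real pages)."""
    extra = sortable * orders + redirects
    total = real + extra
    return total, real / total


total, share = crawlable_urls(
    real_pages, sortable_pages, sort_orders, redirecting_urls
)
print(f"{total} crawlable URLs, {share:.0%} of them real pages")
```

So even this tiny site exposes 33 URLs, and only about 15% of them are pages you actually meant to publish.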

I'd like to get something out of the way right now. If you block pages in the robots.txt file, Google won't crawl them. That doesn't mean it won't index them. Blocked pages can still sit in the index, although they may fall out of it in a few months to years. Eventually, they'll disappear, unless they're linked to from a prominent position on the site. The thing is, there's no guarantee that they'll disappear and it's a gamble keeping these pages around. It's not good practice. The same is true for pages that have a noindex meta tag on them. And the same is true for 301 redirects. And the same is true for pages with the canonical tag on them. And the same is true for thin and junk pages. All of these types of pages need to go. They need to disappear from your site in order for you to regain your rankings or to allow them to flourish in the first place.
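If you want to see which of your URLs a robots.txt file actually blocks, Python's standard library can tell you. This is only a sketch: the rules and URLs below are made up for illustration, and the check only shows what a crawler is asked not to fetch, not what is or isn't in the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for a site that blocks its sort and contact URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact/
Disallow: /sort/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical URLs; swap in a list exported from your own crawl.
urls = [
    "https://example.com/",
    "https://example.com/about/",
    "https://example.com/contact/12345",
    "https://example.com/sort/price-asc",
]
print(blocked_urls(ROBOTS_TXT, urls))
```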

I've been working in SEO for over a decade and I can't tell you how many times I've seen a supposed SEO professional tell a client that they need to place a noindex meta tag on a thin page in order to remove it from Google's index. This is complete garbage advice. The truth is, if you'd like to remove a page from Google's index, you should delete the page or put it behind authentication. Also, you can forget about the canonical tag and 301 redirects for the sake of search engines. They don't need them. I've seen pages consolidate into one automatically within minutes of creation, and I've seen pages that were supposed to merge via a 301 redirect never merge, even after 10 years. Yes, use 301 redirects, but only for users. Don't count on Google or the other search engines obeying them at all. Again, if you've got these types of redirects, you'll need to change your website's architecture so search engines can't see them. If you don't, you'll have a bad case of duplicate content and you don't want that.
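In practice, "delete the page or put it behind authentication" comes down to which HTTP status the server sends back. Here's a minimal sketch of that decision. The paths and the function name are hypothetical, and using 410 for deleted pages and 401 for protected ones is my assumption about a reasonable setup, not the only way to do it.

```python
from http import HTTPStatus

# Hypothetical path sets; a real site would derive these from its own routing.
DELETED_PATHS = {"/thin-page/", "/old-landing-page/"}
PRIVATE_PREFIXES = ("/account/", "/members/")


def status_for(path, logged_in=False):
    """Pick the status a request should get under the delete-or-protect rule."""
    if path in DELETED_PATHS:
        return HTTPStatus.GONE            # 410: the page was removed on purpose
    if path.startswith(PRIVATE_PREFIXES) and not logged_in:
        return HTTPStatus.UNAUTHORIZED    # 401: crawlers never see the content
    return HTTPStatus.OK                  # 200: a page you want indexed


for path in ("/", "/thin-page/", "/account/settings"):
    print(path, int(status_for(path)))
```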

If you're running WordPress, the big sources of bloat are author pages, tag pages, and the paginated archive pages under the /page/ directory. Those last ones are what the numbered links (1, 2, 3, and so on) at the bottom of the homepage point to. There are plugins that remove the /page/ directory pages and the author pages. As for the tag pages, don't use tags at all. You'll only get yourself in trouble. And as for those individual image pages that get spun off from every image you place in your posts, there are plugins to deal with them as well.
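To get a feel for how much of a WordPress crawl falls into those buckets, you can sort the URLs by pattern. The sketch below assumes the default /author/, /tag/, and /page/N/ permalink bases and the plain ?attachment_id= form for image pages. Your site may use different bases, so treat the patterns as placeholders.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed default WordPress URL shapes; adjust if your permalink bases differ.
PATTERNS = {
    "author": re.compile(r"^/author/[^/]+"),
    "tag": re.compile(r"^/tag/[^/]+"),
    "pagination": re.compile(r"/page/\d+/?$"),
}


def classify(url):
    """Label a URL as one of the bloat types above, or as regular content."""
    parts = urlsplit(url)
    if "attachment_id=" in parts.query:
        return "attachment"
    for label, pattern in PATTERNS.items():
        if pattern.search(parts.path):
            return label
    return "content"


# Hypothetical crawl output.
crawled = [
    "https://example.com/",
    "https://example.com/page/2/",
    "https://example.com/author/jane/",
    "https://example.com/tag/seo/",
    "https://example.com/?attachment_id=42",
    "https://example.com/how-to-fix-index-bloat/",
]
counts = Counter(classify(url) for url in crawled)
print(counts)
```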

My big point of this post is this: in order to reduce index bloat, you'll need to delete pages. And to do this, you'll need to alter your website's architecture. This isn't an easy task, but it's necessary. In order to find out whether you've got index bloat, you can use one of the many SEO services out there to analyze your site (Moz, Botify, SEMrush, etc.). You can also do a scan yourself with an application such as Xenu's Link Sleuth. This is a very handy and free program that I've used a lot in the past. It crawls your site and makes you aware of URLs you never knew existed. Only after you have all the necessary information will you be able to determine the best course of action.
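Once you have a list of crawled URLs from one of those tools, one simple way to spot bloat is to group the URLs that differ only by their query string. This is a rough sketch with made-up URLs and a function name of my own. It won't catch every kind of duplicate, only the sort and filter variants.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def group_variants(urls):
    """Map each base URL (query string removed) to its crawled variants.

    Only bases with more than one variant are returned.
    """
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        groups[base].append(url)
    return {base: found for base, found in groups.items() if len(found) > 1}


# Hypothetical crawl export.
crawled = [
    "https://example.com/ads/",
    "https://example.com/ads/?sort=price",
    "https://example.com/ads/?sort=date",
    "https://example.com/about/",
]
for base, found in group_variants(crawled).items():
    print(f"{base} has {len(found)} crawlable variants")
```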

If you've got any questions about any of this, please ask. As I said, I've got lots of experience so I may be able to help out.

I have a website that was launched all the way back in 2004. It's a classifieds site. Back then, each ad page had a "Contact Seller" link that led to a contact page. Each of these pages had a different URL. So basically, there were thousands of these pages through the years. Hundreds of thousands. If a site visitor wasn't logged into their account, they wouldn't land on the actual contact page. They'd land on a login page that was identical to the overall website login page, except for the URL. The thing is, every "login" page had a different URL, the same one as its contact page. Essentially, it worked like this: if the user is logged in, go to the unique contact seller page. If the user is not logged in, go to the same unique URL, but show the login form instead of the contact form. It's pretty standard for these types of sites.

For a few years, I had these contact pages blocked in the robots.txt file, which worked well. Before that though, I allowed Google to crawl each and every one of them. Since none of them appeared in the search results, I assume they were being canonicalized with the site login page. There was no problem with this.

Just recently, I made a change to the site so that neither the login page nor any of these contact seller pages exist anymore. Now, all of them are returning a 404 status code. They're dead pages. The weird thing is, as these pages are being deleted, I'm seeing some from all the way back in 2004 when I use the site: command. I know this because they used to have a very distinct page title that I haven't used in over a decade.

I guess I'm writing this post just to say that Google has a very long memory. If you think a page is long gone just because you haven't seen it in a while, it's likely not gone. This is why, for pages that I don't want to show up in the search results, I prefer to either block them in the robots.txt file or kill them off completely so they return a 404 or 410 code. I don't like redirects or canonicalizations. They only cause issues in the future.