Google's Allotted Index Size Per Website

Cameron

I've got a website that's definitely been hit by Google Panda. The rankings recently hit the floor after an update. I know I've got tons of thin and junk pages in the index, so I'm currently working on getting them all out. It's been a long road, but I have made the necessary changes so I'm confident of success in the future. I did want to report a few areas of interest though. It's been a few months since I began removing pages and some odd things are occurring.

First off, Google doesn't like to let go of pages. If you're looking to lift your site from a thin content penalty, you'd better start soon. Don't listen to everyone online who tells you to 301 redirect the old pages to new ones. Just delete the old pages. Force them to return 404 or 410 status codes. It doesn't matter which one. Just get rid of the pages and make sure they return errors.
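For anyone who wants to double-check that the deleted pages really are returning errors, here's a minimal sketch in Python. It assumes you've already collected the status code for each removed URL some other way (server logs, a crawler, whatever you use). The function name and the example URLs are made up for illustration.

```python
# Hypothetical helper: flag removed URLs that are NOT yet answering 404 or 410.
# `statuses` maps each removed URL to the HTTP status code it currently returns.
def audit_removed(statuses: dict[str, int]) -> dict[str, int]:
    """Return only the URLs that still need attention."""
    return {url: code for url, code in statuses.items() if code not in (404, 410)}


# Example data (invented): one page still live, one still redirecting.
observed = {
    "https://example.com/old/page-1": 410,
    "https://example.com/old/page-2": 404,
    "https://example.com/old/page-3": 200,  # still live, needs deleting
    "https://example.com/old/page-4": 301,  # still redirecting
}

leftovers = audit_removed(observed)
for url, code in leftovers.items():
    print(f"{code}  {url}")
```

Anything this prints is a page that's still live or still redirecting instead of returning an error.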

What I'm noticing as I delete pages is that once one page is gone, it'll sort of open a spot for a better one to appear in the index. Here's my theory. I'm still working this out in my head, so I hope it makes sense. And if you have any sort of follow up on this, please let me know below. I'd appreciate it.

I think Google crawls an entire website and then makes a decision on how many pages to allow in its index. So say you've got a website with a total of 10,000 pages, both good and bad. Some of the pages are worthy of being returned in the search results and some are so thin that they're just junk. Google crawls all of the pages and, in this case, determines that the "bucket size" it'll create for the site will be about 5,000 pages. This is based on both the number of pages (the size of the website) and PageRank. Google knows about all 10,000 pages and has put some of them in a holding tank, but it'll only return 5,000 of those pages in search. Now remember that some of the pages that Google will include in this "search returnable" 5,000 page group are thin and bad. Conversely, some of the good pages will be placed in the holding tank. The reserve, if you will. Pages in this reserve don't show in the search results.

I think Google rates websites by how many pages they have beyond the "returnable" group, and that's what hits you with a Panda penalty. So if a 4,000 page website gets crawled and Google deems its bucket to hold 3,900 pages, that means that most of the pages are good to show in the results and the site has a fairly high overall PageRank. If a 90,000 page website gets crawled and Google deems it to have a bucket size of only 900 search returnable pages, then that's a horrible website that needs to be cleaned up tremendously.
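To put numbers on those two examples, here's the arithmetic as a quick Python sketch. The "returnable share" is just my own label for bucket size divided by total pages. It's part of my theory, not anything Google publishes.

```python
# Hypothetical metric from the theory above: the share of a site's pages
# that fall inside the "search returnable" bucket.
def returnable_share(bucket_size: int, total_pages: int) -> float:
    return bucket_size / total_pages


good_site = returnable_share(3_900, 4_000)   # the 4,000 page example
bad_site = returnable_share(900, 90_000)     # the 90,000 page example

print(f"{good_site:.1%}")  # 97.5%
print(f"{bad_site:.1%}")   # 1.0%
```

So the first site has nearly all of its pages in the bucket, while the second has about one page in a hundred.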

The reason I make all these claims is that I have a website that includes approximately 8,000 good pages. On top of that, it's got about 30,000 junk orphan pages from some previous software that need to be removed, and it's also got about 50,000 pages that have been blocked in robots.txt and by other methods. What I'm seeing is that Google only shows about 6,000 pages in its index, and that number has been steady for a while. But as the bad pages get removed by Google, that 6,000 number appears to be rising, as if the overall quality score is increasing and Google is deeming this website's bucket larger and larger. I'm also seeing pages return to the index that I haven't seen in months. I'll use the site: command to find pages, and ones I thought had been removed a long time ago are now being revealed. It's almost as if the more the website shrinks, the more good pages are shown. I know, it's strange, but it seems to be true.

One more critical observation. When I blocked pages with the robots.txt file in the past, it seems that Google counted those blocked URLs as pages in the index. Meaning, it would bump out good pages and show these blocked pages instead. So by blocking the pages with the robots.txt file, I was essentially shrinking the bucket. Once I unblocked the pages and had them return 403, 404, and 410 header status codes, the good pages began getting crawled and included in the index again. It's so odd.
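One thing that may explain this: a crawler that obeys robots.txt never requests a blocked URL, so it has no chance to see the 404 or 410 that URL now returns. Here's a rough Python sketch, using the standard library's urllib.robotparser, for checking whether a removed URL is still blocked. The robots.txt rules and URLs are invented for the example, and I can't promise that Python's parser matches Googlebot's behavior in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-software/
"""


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """True if the given robots.txt rules disallow `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# A deleted page that is still disallowed: the crawler can't see its status code.
print(is_blocked(ROBOTS_TXT, "https://example.com/old-software/page1.html"))
# A page outside the disallowed path.
print(is_blocked(ROBOTS_TXT, "https://example.com/articles/good-page.html"))
```

If a deleted URL comes back as blocked, the Disallow rule probably needs to come out before the error status can be picked up.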

If you have any input or experience with this, please let me know down below. I'd love to learn more about it.
 