Room in Google's Index?

CaptainDan

Member
I was watching a Google Webmasters video a few days ago when I heard something interesting. If memory serves, the person in the video was discussing how web pages canonicalize into one another. Basically, they were explaining how important it is to send Google clear signals so it can exclude pages that don't need to be in the index and merge pages that are duplicates. The person really emphasized how critical it is to let Google merge similar and duplicate pages, because a website's page count goes down, which helps the website rank overall. Then, and this is the interesting part, the person said that if a site's pages are allowed to be merged, or canonicalized, more pages will be allowed into the index. As if there's a cap on the number of pages each website is allowed, and by removing and merging the cruft, more good pages can fit into the container. I found that little statement fascinating.

This whole thing got me thinking. Perhaps we're not considering things correctly. We like to talk a lot about crawl budget, where Googlebot only visits a certain number of our web pages per day, but what if we're actually focusing on the wrong thing? What if, instead of crawl budget, we should be thinking about "container size," or how many pages, good or bad, we've currently got in the index, versus how many we'd like to get into the index? What if, because we've got so many thin and bad pages currently in the index, our container is full? Because of that, Googlebot sees no need to crawl many of our other pages, and as a result, we're stuck with a low crawl rate.

If this is the case, I'd say the primary goal of any webmaster should be to keep the cruft out of Google's index. Make sure duplicate and similar pages fold into one another, stay away from thin pages, stay away from noindex, stay away from blocking pages in robots.txt, and if at all possible, don't let Google know about any bad pages at all.
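
Just for reference, this is roughly what that "folding" signal looks like in a page's head. The example.com URL is only a placeholder, not anything specific to my setup:

    <!-- Tells Google which URL is the "real" one, so duplicate/similar pages can fold into it -->
    <link rel="canonical" href="https://www.example.com/widgets/blue-widget/">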

What's your opinion on this? I know this is probably old news, but I think I've been pondering things a bit backwards for years.
 

KodyWallice

Member
So you're saying that pages that are blocked by robots.txt fill up the index just like thin pages do? Can you please elaborate on that? Also, how do 301 redirects play into this?
 

CaptainDan

Member
KodyWallice said:
So you're saying that pages that are blocked by robots.txt fill up the index just like thin pages do? Can you please elaborate on that? Also, how do 301 redirects play into this?
Pretty much. A while ago, I read on Google's website that they don't recommend blocking duplicate content with robots.txt. I suspect that's because each and every URL that's blocked still counts for something. Those blocked URLs can't work for you, and I can't imagine they're good for a site. Now, if you have a huge directory filled with pages that you don't want indexed and there's only one path to those pages, then yes, robots.txt is the way to block them. Otherwise, if every single page you'd like to block has its own link pointing to it, you might want to figure out a way to remove those links and somehow get rid of those pages. Examples of these individual pages would be user profile pages on classifieds and forum sites. Each and every post links to these pages, and Google doesn't need to know about any of them. They shouldn't be crawling them. So if you placed a noindex meta tag on them and allowed Google to crawl them, you'd be wasting crawl budget. And if you blocked them in robots.txt, you'd be filling up your index "bucket" with blocked pages. No good either way. The best thing to do in this case is to remove the links and then have those pages return an error, such as a 403 status code, so when Google crawls the URL, it'll drop the page from the index. And since the links only show for users who are logged in (which search engine crawlers aren't), new user pages will never even be discovered.
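
To make that concrete, here's a rough sketch of the two situations. The directory name, the /members/ path, and the nginx setup are just assumptions for the example, not anything Google prescribes:

    # robots.txt - fine when there's a single directory path to everything you want blocked
    User-agent: *
    Disallow: /internal-reports/

    # nginx sketch for the user-profile case: once the on-page links are removed for
    # logged-out visitors, any crawler that still hits an old profile URL gets a 403.
    # A real setup would let logged-in sessions through (e.g. via a cookie check).
    location ^~ /members/ {
        return 403;
    }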

As for 301 redirects, I think they're terrible. Search engines follow them all the time but only canonicalize them with the target pages some of the time. I can't stand redirects, though I understand they're a necessary evil. If your site has 301 redirects that can be removed or handled another way, focus on that. No website should have 301 redirects sitting permanently in its internal site structure. If you have any more questions, please ask. Also, here's a video for you from Matt Cutts. Listen closely to how he suggests that pages fold into one another. That's the ultimate goal.
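
For anyone wondering what these look like in practice, here's a basic 301 rule as it might appear in an Apache .htaccess file. This is just a minimal sketch with placeholder paths; the point is a single hop straight to the final URL, not a chain:

    # Apache .htaccess (mod_alias): one permanent redirect from the old path to the new one
    Redirect 301 /old-page/ https://www.example.com/new-page/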

 
Room in Google's Index? was posted on 08-23-2020 by CaptainDan in the Tech Forum.