CaptainDan
Well-Known Member
- #1
I was watching a Google Webmasters video a few days ago when I heard something interesting. If memory serves, the person in the video was discussing how web pages canonicalize into one another. Basically, they were explaining how important it is to send Google clear signals so it can exclude pages that don't need to be in the index and merge pages that are duplicates. The person really emphasized how critical it is to let Google merge similar and duplicate pages, because a website's page count will go down, which helps the website rank overall. Then, and this is the interesting part, the person said that if a site's pages are allowed to be merged, or canonicalized, more pages will be allowed into the index. As if there's a cap on the number of pages each website is allowed, and by removing and merging the cruft, more good pages can fit into the container. I found that little statement fascinating.
This whole thing got me thinking. Perhaps we're not considering things correctly. We talk a lot about crawl budget, where Googlebot only visits a certain number of our pages per day, but what if we're focusing on the wrong thing? What if, instead of crawl budget, we should be thinking about "container size," or how many pages, good or bad, we've currently got in the index versus how many we'd like to get into the index? What if, because we've got so many thin and bad pages currently in the index, our container is full? And because of that, Googlebot simply doesn't need to crawl many of our other pages, so we're stuck with a low crawl rate.
If this is the case, I'd say the primary goal of any webmaster should be to keep the cruft out of Google's index. Make sure duplicate and similar pages fold into one another, stay away from thin pages, stay away from noindex, stay away from blocking pages in robots.txt, and if at all possible, don't let Google know about any bad pages at all.
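If you want to sanity-check what "fold into one another" signals your pages are actually sending, here's a minimal sketch (Python standard library only) that fetches a few URLs and reports the rel=canonical and meta robots values they declare. The example.com URLs are placeholders for illustration only; swap in your own pages. It only reads on-page tags, not HTTP headers or sitemap hints.

```python
# Minimal sketch: report the rel=canonical and meta robots signals a page declares.
# Standard library only; the example.com URLs below are placeholders.

from html.parser import HTMLParser
from urllib.request import urlopen, Request


class HeadSignalParser(HTMLParser):
    """Collects <link rel="canonical"> and <meta name="robots"> values from HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")


def audit(url):
    """Fetch a URL and return (canonical href, robots meta content)."""
    req = Request(url, headers={"User-Agent": "canonical-audit/0.1"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    parser = HeadSignalParser()
    parser.feed(html)
    return parser.canonical, parser.robots


# Hypothetical URLs: a parameterized duplicate and the page it should fold into.
for url in ["https://example.com/widgets?sort=price",
            "https://example.com/widgets"]:
    canonical, robots = audit(url)
    print(f"{url}\n  canonical -> {canonical}\n  robots    -> {robots}")
```

Running something like this across a crawl of your own site makes it easy to spot duplicates that point nowhere, or pages you thought were merging that actually declare themselves canonical.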
What's your opinion on this? I know this is probably old news, but I think I've been pondering things a bit backwards for years.