Help Google help us
There has been some buzz lately about blogs making Google's ranking algorithm less useful and reliable than it used to be.
Not only do blogs use linking extensively, but they also introduced new linking behaviors like trackback. Comments and feedback usually contain links too (including at least one back to the comment author's homepage). Even though the blogging community is a small part of the web, it wouldn't be surprising if this "new" publishing behavior had an influence on web search engines.
The Register has an article on trackback impacting Google, with a striking example of a search query whose first ten results are all Movable Type trackback pages. Comments definitely have an effect too, as "comment spammers" are starting to exploit blogs to get Google juice.
Although I haven't felt Google's quality degrading much so far, I think it is worth bloggers making a small effort to keep this problem from getting worse.
Google probably needs to find a way to improve itself and adapt to the evolving web, but in the meantime, bloggers can help it remain the great tool we love.
There are several steps bloggers can take to make the web a better place to find useful information: using robots.txt, getting rid of spam comments, avoiding aggregation pages and setting the <title> HTML tag.
Robots.txt
The robots.txt file is a tool to keep well-behaved crawlers off some portions of a site. It is described on the web robots page.
Here is the robots.txt file I just added to my blog. It tells crawlers to stay away from trackback and comment pages.
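(The exact paths depend on where your blog tool serves trackback pings and comments; the ones below are illustrative and match a typical Movable Type setup, where both go through CGI scripts.)

    User-agent: *
    Disallow: /cgi-bin/mt/mt-tb.cgi
    Disallow: /cgi-bin/mt/mt-comments.cgi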
This does have some negative aspects, as comments do contain useful information. It would be possible to filter out only the trackback pages, but I think "comment spam" is just too annoying, so I'm going with the more radical approach.
Monitor and exterminate spam comments
Don't you hate getting these "Viagra comments"?
If you don't choose the radical solution of keeping search engines off the comment pages of your blog, or if that is simply not an option (comments are not on a separate page), you should at least monitor the comments posted to your site. Any blog owner should do that anyway, but it is especially important to delete comments whose only purpose is to improve the ranking of commercial pages in search results.
I don't have any specific knowledge that spammers use bots to post these bogus comments, but if they do, any kind of anti-bot technology would also help keep them away, at the cost of making it more trouble for a regular reader to post a valid comment.
Avoid aggregation pages
This is a trickier problem. Basically, if you have multiple posts on different topics on the same page, you are going to confuse search engines. The front page of any blog falls into this category: not only does it aggregate multiple posts, it also usually contains blogrolls and link lists.
Search engines usually include some kind of word-distance logic, so that words grouped closely together rank higher than words spread apart, but they are still sensitive to this problem.
Lots of blogs have individual pages for archived posts, and I think that's good. But many still use aggregate pages as their main permanent archive (and use fragment identifiers, '#', in their permalinks).
For example, I am thinking of removing monthly archives from this site, although I haven't done it yet.
Setting the <title>
I have discussed this topic in detail in a previous post. Basically, setting the <title> for each page will make search results more readable, as the main result link will be more meaningful.
Setting the title properly has many usability benefits besides that, so I highly recommend doing it.
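For instance, an individual archive template can put the entry title into the page title. A minimal sketch, using Movable Type template tags since that is what many blogs run ("My Weblog" is a placeholder for your site name):

    <head>
      <title>My Weblog: <$MTEntryTitle$></title>
    </head>

That way, the search result link reads "My Weblog: Help Google help us" instead of a generic site name repeated on every result.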
Update: Another good point about using meaningful titles.
Here are some other ideas that can't be used right away, but that could help sort out the web.
Richer robots.txt specifications
As you can see, the robots.txt specification isn't very fancy or rich: you can only allow or disallow paths on your site.
I wish robots.txt allowed a category for pages that can be crawled but should not be indexed or included in search results.
For example, a blog's front page or main archive page (with all the blog titles and permalinks) would fall into this category, as they are aggregation pages.
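As a sketch, the extended syntax could look something like this; the "Noindex" directive is my own invention, not part of the current specification, and the paths are illustrative:

    User-agent: *
    # crawl these pages to discover links,
    # but keep them out of search results
    Noindex: /index.html
    Noindex: /archives/monthly/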
HTML attributes for crawler hinting
Also, with some new HTML attributes, you could give the crawler hints about sub-page units.
For example, by adding a "noindexing" attribute to the blogroll block, you could tell the crawler that the words appearing in that block shouldn't be indexed.
Similarly, an "indexingblock" attribute could be used to separate multiple indexing regions within a page that aggregates several posts.
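Here is a quick sketch of how such markup could look; both attributes are hypothetical, of course:

    <!-- hint: don't index the words in the blogroll -->
    <div class="sidebar" noindexing>
      <a href="http://example.com/">A friend's blog</a>
    </div>

    <!-- hint: treat each post as a separate indexing region -->
    <div indexingblock>
      <h3>First post title</h3>
      <p>First post body...</p>
    </div>
    <div indexingblock>
      <h3>Second post title</h3>
      <p>Second post body...</p>
    </div>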
You might think that such a hinting system would be too much work, but in fact blogs are probably the best target for this kind of markup, because most blogs are template-based: you would just need to get the major blog tools to include these tags in their default templates.
There is one problem with this technique, though: it makes pages more work for search engines to index.
Link typing
Having some attributes to mark links with extra semantics might also bring some value. For example, when you post a strong criticism of a website, you'd probably rather not provide that website with extra Google juice.
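For instance, the standard "rel" attribute on links could carry such a type; the "criticism" value here is made up for illustration:

    <!-- hypothetical: tell search engines this link is not an endorsement -->
    I was very disappointed by
    <a href="http://example.com/gadget" rel="criticism">this gadget</a>.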
Here is an article on O'Reilly Network about link typing.
The problem is that this might give birth to a new type of link spamming: if a link attribute carries a negative connotation, spammers might use it to lower their competitors' rankings.
Update: Wired mentions "Let us link to a page we hate without boosting its ranking" (item 75) in the Google To-Do List (part of "101 Ways to Save the Internet").
Update: The idea of hinting the search engine either via robots.txt or html block tags or link typing is gaining more traction. I posted a quick summary.
Update: Eric Wolfram has a post on how to score higher in Google. I don't know how much impact good HTML accessibility practices have on your PageRank, or even whether you should care about your rank, but by following them you help Google classify your pages better, and you help your users too.
This includes setting the title and giving links useful text (no "click here" links).
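For example (the URL is a placeholder):

    <!-- weak: the link text says nothing about the destination -->
    Click <a href="http://example.com/recipes/bread">here</a> for my bread recipe.

    <!-- better: the link text describes the destination -->
    See my <a href="http://example.com/recipes/bread">bread recipe</a>.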
Links
A post by the Movable Type team on comment spam.
You've posted a wealth of great information here and I thank you ... but I'm also curious as to how Permalink pages get searched at all.
That is to say, currently most search engines will return the main page of my blog when certain key words/phrases are searched for (although, currently, Google is only returning my Atmom XML link, so I guess they haven't indexed me yet).
HOWEVER, I've never been able to generate ANY search results for my individual posts/archives/permalinks on any of the search engines. Are they even being indexed? Is there a trick here I'm missing? I never had this problem when I had a normal website, but with the blog I seem invisible...
Posted by: Lucas Brachish (November 4, 2004 01:54 PM)

Sorry about the various spelling mistakes above... yikes.
Correction:
I meant "Atom XML", of course, referring to the Blogger-generated feed of my site. Not "Atmom".
If anyone has comments regarding my Permalink questions, please drop me a line. Thanks!
All your pages should get indexed if they are linked from a page that is indexed. New pages may take some time to get indexed (though usually less than a month).
Check that your robots.txt doesn't tell bots not to index your archive.
Well, about six weeks after I wrote the post above, some of my pages finally got searched/saved by Google. However, I took two steps prior to this, and I'm not sure which one actually did the trick:
1) I manually created an index (a hot-linked list of pages) on my sidebar (which appears on every page) instead of just relying on the JavaScript or Blogger-generated "recent posts" list. I was hoping Google might follow these hyperlinks through my site.
2) I hyperlinked the headline of every single post to its respective individual archive page (which doesn't look amazing, but is quite functional). Again, the intention was to create a web of links that Google would follow... I was suspicious that Google might be ignoring the "permalink" link or giving it a low ranking, so linking the headline gave Google a context-appropriate link to follow.
Still, it took a couple of months before all of my pages (I have fewer than 50 pages total right now) were listed on Google, and most new pages are still taking over a month to be added... a couple of pages may be missing completely... but my index/homepage updates quite regularly, so that's good.
At least Google sends a dozen hits to my page every day (90% of which just go to the article about labial reconstruction, for better or worse). Yahoo and MSN only link my home page and one or two of my big monthly archive pages... they're not crawling my site properly at all, and it shows in my traffic.
I'm suspicious that the search engines are beginning to weed out all but the home pages of blogs in general, even when content-filled and relevant articles are being posted and archived (as opposed to "personal journal," link-collection, and news-of-the-nanosecond weblogs). Oh, this isn't the case yet, but it seems to be the direction the algorithms are slowly going.
Lucas, glad your problem got mostly sorted out.
What makes you think that search engines would start "weeding out all but the homepages of blogs"?
It seems like it should be the reverse, since I agree with you that individual entry pages are better search-engine material: they have a more focused topic, usually have a relevant title and don't change much.
On the other hand, I do think that the front page of blogs is crawled more frequently.
Julien,
Well, both my individual archive pages and my aggregated archives (monthly pages) have disappeared periodically from Google -- but my homepage always remains. And the home page is definitely crawled the most by the various search engines. I've also noticed blog home pages popping up more often in search results lately, instead of permalink pages (but that could just be my imagination). That's what's made me think that for blogs, Google is concentrating more on just homepages, not considering blog permalinks to be timely.
Currently, only a few of my pages are searchable on Google -- the monthly aggregate pages, the home page, and a couple of other random pages. The rest of my pages are indexed in Google as URL-only. That is to say, they would never turn up in a normal search, because none of their content is cached. You can see what I mean by running a Google search on "site:http://celebritycola.blogspot.com" (as of March 7, 2005). After finally getting Google to crawl every one of my pages and follow every one of my links, my reward was two weeks of great search traffic, and then, suddenly, the dropping of all of my pages (except the home page), with the number of pages SLOWLY building back up.
I think this might be because the monthly aggregate pages (such as http://celebritycola.blogspot.com/2004_09_01_celebritycola_archive.html ) are duplicating content from the individual archive pages (the permalinks). I understand that Google will sometimes place a "penalty" on pages that use content duplicated elsewhere on a site (or even on other sites).
So Google seems to see my monthly aggregate archives as the "real" pages, while it treats the individual pages as duplicated or partial, so it doesn't record their content. Since the home page is also an aggregate page (and is crawled by Google more often), it also becomes more of a focus for Google, while the permalinks are ignored.
My only solution, it would seem, would be to drop my archive pages altogether, forcing Google to look at my individual pages instead (I'm going to try that soon, after I've backed up my site). I can't ask the search robots to skip the homepage and monthly archives (while indexing everything else), however, because of the way the Blogger/Blogspot templates work (whatever robot text I put in the header of the template goes into ALL the pages). So it's an all-or-nothing proposition.
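(By "robot text in the header" I mean a meta tag along these lines, which the shared template would stamp into the head of every generated page, home page and archives alike, with no way to emit it only for the archive pages:

    <meta name="robots" content="noindex,follow">

)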
On the other hand, my Yahoo traffic just keeps getting better and better (search results send users directly to useful data on the individual pages of my site). Outside of Yahoo and Google, however, few search engines throw me any traffic at all. MSN barely sends me a visitor a month. But when all of my individual pages are properly listed and cached on Google, it gives me more traffic than everything else combined (if only Google had me completely listed more consistently). We’ll see . . .
Note: I'm also hesitant to pull down my archive pages, because for a while traffic will certainly continue to be sent to those pages (and some links point to those pages), and if the pages aren't there then it'll look rather bad (and it might be a while before Google indexes all of my individual posts)... And since the archive pages are auto-generated, I can't just replace them with a "This page has been moved here..." page. Aaargh.
Posted by: Lucas Brachish (March 7, 2005 04:38 PM)

Lucas,
Do you actually think Google is stupid enough to penalize Blogspot sites for 'duplicate content'? Google owns Blogspot; why would they penalize it? The duplicate-content rule is for multiple *sites* that contain the same content.
Posted by: dwayne (May 18, 2005 12:49 PM)