Monday, July 19, 2010

Crawling and Indexing explained

The terms crawling and indexing (and indexing's cousin, caching) are frequently used together, but you should not consider them synonyms.

Crawling is the process of an engine requesting, and successfully downloading, a unique URL. Obstacles to crawling include a lack of links pointing to the URL, server downtime, robots exclusion, and links (such as some JavaScript links) from which bots cannot extract a valid URL.
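Of these obstacles, robots exclusion is the easiest to check yourself. Here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper name is_crawl_allowed are all made up for illustration, and a real engine may interpret the rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real check would download the
# file from the site instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

A URL that fails this check is excluded from crawling no matter how many links point to it.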

Indexing is the result of successful crawling. I would consider a URL to be indexed (by Google) when an info: or cache: query produces a result, signifying the URL's presence in the Google index.

Obstacles to indexing can include duplication (the engine might decide to index only one version of content for which it finds many nearly identical URLs), unreliable server delivery (the engine may decide to not index a page that it can access during only one-third of its attempts), and so on.

What's the difference between crawling and indexing, in terms of time? In one test of a newly introduced URL, the text cache query kept returning "Your search - cache:[URL] - did not match any documents." and only began showing a result after 15 days. What was interesting is that the cached file showed the URL "as retrieved on xDate (7 days prior)." So make special note that the URL was crawled and cached a week before it appeared in the index.

A better, more comprehensive test would be to watch the server logs and see how many times the file was requested, and with what frequency, between the original request date and the date on which the cache query showed results. Additional testing would try to find ways to shorten that time, for example by increasing the number (and prominence) of incoming links.
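The log-watching test could be sketched roughly as follows. This assumes the server writes the common "combined" access log format and that the bot identifies itself in the user-agent string. The sample log lines, IP addresses, dates, and the function name bot_requests are invented for illustration and are not results from the test described above.

```python
import re
from datetime import datetime, timezone

# Matches the timestamp, request path and user agent of a "combined"
# format access log line (an assumption about the server's log format).
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Invented sample data, for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jul/2010:10:00:00 +0000] "GET /new-page.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.7 - - [03/Jul/2010:09:30:00 +0000] "GET /new-page.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows; U) Firefox/3.6"',
    '192.0.2.1 - - [08/Jul/2010:10:00:00 +0000] "GET /new-page.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]


def bot_requests(lines, path, bot_token, start, end):
    """Return sorted timestamps of bot requests for path within [start, end]."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("path") != path:
            continue
        if bot_token.lower() not in match.group("agent").lower():
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if start <= ts <= end:
            hits.append(ts)
    return sorted(hits)


if __name__ == "__main__":
    start = datetime(2010, 7, 1, tzinfo=timezone.utc)
    end = datetime(2010, 7, 16, tzinfo=timezone.utc)
    hits = bot_requests(SAMPLE_LOG, "/new-page.html", "googlebot", start, end)
    gaps = [later - earlier for earlier, later in zip(hits, hits[1:])]
    print("requests:", len(hits))
    print("gaps between requests:", [str(gap) for gap in gaps])
```

The count gives the number of crawl requests in the window, and the gaps between consecutive timestamps give the frequency. Note that a user-agent string can be spoofed, so this only approximates real bot traffic.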

Spider Simulators:

What you see as a visitor in your browser when viewing a web site differs a lot from what search engine spiders see when indexing your pages.

Find out what spiders see when they crawl your websites by using these simulators:

Spider Simulator - SEO Chat
http://www.seochat.com/seo-tools/spider-simulator/

Spider View - Iwebtool
http://www.iwebtool.com/spider_view

Search Engine Spider Simulator - Anownsite
http://www.anownsite.com/webmaster-resources/search-engine-spider-simulator.php

SE Bot Simulator - XML Sitemaps
http://www.xml-sitemaps.com/se-bot-simulator.html

SE Spider - LinkVendor
http://www.linkvendor.com/seo-tools/se-spider.html

Spider Simulator from Summit Media
http://tools.summitmedia.co.uk/spider/
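
To get a feel for what such simulators do, here is a rough sketch built on Python's standard html.parser module. It is a simplification and not how any of the tools above (or any real search engine) is implemented: it keeps the visible text, ignores script and style content, and collects only links that carry a usable href. The class name SpiderView, the function spider_view, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Collects the plain text and followable links of an HTML page."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            # Assumption: a simple bot cannot follow javascript: pseudo-links.
            if href and not href.lower().startswith("javascript:"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def spider_view(html):
    """Return (text, links) as a very simple text-only spider might see them."""
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.links


# Made-up sample page, for illustration only.
SAMPLE_PAGE = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    '<body><h1>Hello</h1><a href="/about.html">About</a>'
    '<a href="javascript:void(0)">Menu</a></body></html>'
)

if __name__ == "__main__":
    text, links = spider_view(SAMPLE_PAGE)
    print("text:", text)
    print("links:", links)
```

In the sample page, the "Menu" link has only a javascript: target, so it contributes anchor text but no URL for a bot to follow.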