04.24.06 Matt Cutts Teaches Us To Crawl
By David A. Utter
The Google engineer followed up his WebmasterWorld PubCon Boston discussion of
Google's Bigdaddy infrastructure update and "crawl cache" with a lengthier look
at the topic.
Cutts' latest blog post reviewed Bigdaddy's crawl-caching proxy in greater depth. He even provided helpful charts to illustrate the process.
A webmaster may see numerous fetches from multiple Googlebots, each using some bandwidth while making its appointed rounds. That makes for a more accurate Google index, but the bandwidth usage has given some webmasters fits.
The proxy used in the Bigdaddy infrastructure works like other proxies. It handles the effort of retrieving pages from websites, and fulfills requests from the various Google crawlers. Rather than each spider hitting a website separately, the crawlers hit the cache.
Cutts breaks down the crawl caching in a summary during his post (spacing added; we like Matt, but we'd really like him to enjoy the Return key a bit more often :) :
So the crawl caching proxy works like this: if service X fetches a page, and
then later service Y would have fetched the exact same page, Google will sometimes
use the page from the caching proxy.
Joining service X (AdSense, blogsearch, News crawl, any Google service that uses
a bot) doesn't queue up pages to be included in our main web index. Also, note
that robots.txt rules still apply to each crawl service appropriately. If service
X was allowed to fetch a page, but a robots.txt file prevents service Y from fetching
the page, service Y wouldn't get the page from the caching proxy.
Finally, note that the crawl caching proxy is not the same thing as the cached
page that you see when clicking on the "Cached" link by web results. Those cached
pages are only updated when a new page is added to our index.
It's more accurate to think of the crawl caching proxy as a system that sits outside
of webcrawl, and which can sometimes return pages without putting extra load on
external sites.
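To make the two rules in that summary concrete (a shared cache, plus a robots.txt check for each service), here is a minimal Python sketch. It is only an illustration, not Google's implementation: the `CrawlCachingProxy` class, the `ServiceX` and `ServiceY` user-agent names, and the `fetch` callable are all hypothetical, and the toy handles a single site's robots.txt. It also always reuses a cached page, whereas Cutts says Google only "sometimes" does.

```python
from urllib.robotparser import RobotFileParser


class CrawlCachingProxy:
    """Toy model of a shared fetch cache in front of several crawl services.

    All names here are hypothetical; this is not how Google's proxy is built.
    """

    def __init__(self, fetch, robots_txt):
        self._fetch = fetch        # callable(url) -> page body, stands in for an HTTP fetch
        self._cache = {}           # url -> page body
        self.origin_fetches = 0    # number of times the external site was actually hit
        self._robots = RobotFileParser()
        self._robots.parse(robots_txt.splitlines())

    def get(self, service, url):
        # robots.txt is checked per service, even if the page is already cached.
        if not self._robots.can_fetch(service, url):
            return None
        # Only the first allowed request for a URL reaches the external site.
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
            self.origin_fetches += 1
        return self._cache[url]


if __name__ == "__main__":
    # Assumed robots.txt: ServiceY is barred from /private/, everyone else is allowed.
    robots = "User-agent: ServiceY\nDisallow: /private/\n"
    proxy = CrawlCachingProxy(lambda url: "<html>" + url + "</html>", robots)

    proxy.get("ServiceX", "http://example.com/page")       # fetched from the site
    proxy.get("ServiceY", "http://example.com/page")       # served from the cache
    proxy.get("ServiceX", "http://example.com/private/x")  # fetched from the site
    blocked = proxy.get("ServiceY", "http://example.com/private/x")
    print(proxy.origin_fetches, blocked)
```

In this sketch the second request for the same page costs the site no bandwidth, while the blocked service gets nothing back even though the page is sitting in the cache.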
The proxy's essential goal, reducing bandwidth use, appears to have been met to
Google's satisfaction. Cutts wrote that "it was working so smoothly that I didn't
know it was live."
About the Author: David Utter is a staff writer for WebProNews covering technology and business.