Thursday, August 25, 2005

How Big Is Your Index?

Posted by Phil Aaronson at 12:15 PM

I guess it started with this, a claim on the Yahoo! Search blog that their new index sported 19.2 billion web pages. And Google cried, na-ha. And Yahoo! cried, ya-ha.

And then some academics thought it would be clever to estimate the relative sizes, and in some way lend credence to either Google's or Yahoo!'s na/ya-ha. Their methodology: send random two-word queries to both engines, dropping anything with more than 1000 results. Compare peni... errr result sizes in the hopes that these corner-case queries would correlate with the overall index size. Their conclusion:
It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Google's index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine or if the Yahoo! search engine is not returning all the documents that match our specific search queries, we find it puzzling that Yahoo!'s search engine consistently returned fewer results than Google.
And it was subsequently picked up by Slashdot and made the rounds all over the internet. It's how I found it.

I'll admit, I'm seriously biased. I work for Yahoo!, but this paper has some serious biases of its own. Just for fun I ran through their results and made a quick histogram of the difference between Yahoo! and Google result counts (duplicates removed). The main mode was a difference of 3. So I dumped every query that resulted in three results from Google and zero from Yahoo!. I put the list of terms here. There are 755 of them. I didn't check every one, but it's pretty clear that a big chunk of the list all point to these three results:
They're all dictionaries. Big lists of words that Yahoo! search decided to exclude from its results. And when a significant chunk of the results are pointing to the same three dictionaries, I think it's safe to say that they're not going to tell you much about the overall index size. And I didn't even look at the terms that are 4 results different. Neat idea, but flawed.
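
The histogram step described above can be sketched roughly like this. It is only an illustration, not the script I actually ran: the sample rows, the function names, and the (query, yahoo_count, google_count) layout are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical sample rows: (query, yahoo_count, google_count).
# The real data came from the study's published results; these are made up.
rows = [
    ("alpha beta", 0, 3),
    ("gamma delta", 0, 3),
    ("epsilon zeta", 10, 8),
    ("eta theta", 5, 5),
]

def difference_histogram(rows):
    """Count how many queries fall at each (Google - Yahoo!) difference."""
    return Counter(google - yahoo for _, yahoo, google in rows)

def queries_with_counts(rows, yahoo_count, google_count):
    """Return the queries whose result counts match the given pair exactly."""
    return [q for q, y, g in rows if y == yahoo_count and g == google_count]

hist = difference_histogram(rows)

# The most common difference is the main mode of the histogram.
mode_diff, mode_size = hist.most_common(1)[0]

# Dump every query with three results from Google and zero from Yahoo!.
suspects = queries_with_counts(rows, yahoo_count=0, google_count=3)
```

On the real data, `mode_diff` would be the difference of 3 mentioned above, and `suspects` would be the list of 755 terms.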

[Update: 8/22/2005] Interestingly, the study has been updated to only include queries that returned more than 25 results, effectively cutting out the 755 or so three-result queries I questioned. The old version of the paper is available here. The new version is here.

[Update: 8/25/2005] The new methodology being employed tries to filter out dictionaries two ways. The first is by eliminating any query that returns under 25 results from either Yahoo! or Google. This directly addresses those queries that are almost all dictionary results on Google that I wrote about earlier. The second is by adding a third term to each query, a random exclusion term.
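
As a rough sketch of those two filters, assuming my reading of the paper is right: the function names are hypothetical, and I'm assuming the original 1000-result cutoff still applies to both engines alongside the new 25-result floor.

```python
def build_query(word_a, word_b, excluded):
    """Two random words plus a third, excluded term (hypothetical helper)."""
    return f"{word_a} {word_b} -{excluded}"

def keep_query(yahoo_count, google_count, low=25, high=1000):
    """Keep a query only if both engines return between low and high results.

    Assumption: a query is dropped if EITHER engine falls outside the range.
    """
    counts = (yahoo_count, google_count)
    return all(low <= c <= high for c in counts)

# One of the study's own queries, rebuilt from its parts.
example = build_query("unwell", "escapade", "relatively")

# A three-results-vs-zero query like the ones I questioned is now dropped.
dropped = not keep_query(yahoo_count=0, google_count=3)
```

The floor of 25 removes the three-result dictionary queries outright. The exclusion term only removes pages that happen to contain that one extra word, which is why it mostly catches the bigger word lists.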

The two measures were only successful in a limited way. Here, I've plotted a histogram of the difference in results between Yahoo! and Google for the original version and the newly revised version. Queries where Yahoo! returned more results run negative; queries where Google returned more run positive. I've limited the plot to 500 queries on the y-axis to better view the shape of the distribution.

Original version:

New Verified version:

In the new "verified" version, the peak is at a difference of 22 results, shared by 120 different queries. Going through a few of these, it's pretty easy to find a dozen or more dictionary-like results for these terms. The exclusion term eliminated the larger, more complete dictionaries, but those aren't the only ones: there are plenty of "less complete" dictionaries to be found ... and Google finds them. The chunk carved out of the distributions above? I believe we're still looking at dictionary and dictionary-like filtering, and it clearly remains.

For fun I pulled out the two queries that were at the local minimum, at a difference of -93 results. They were:
unwell escapade -relatively
horribly hardihood -outpull
The first, unwell escapade -relatively, is filled with blog results on Yahoo!. Google appears to be a little more timid about adding blogs to their index. This certainly jibes well with my own experience. The second query, horribly hardihood -outpull? I have no idea why. It was skipped by the analysis, but I kept it in because the duplicates-removed result counts are under 1000. The study's authors are a little more strict.

Here's a great way to compare Google vs. Yahoo! that Christian Langreiter put together. Pretty fun.

[Update: 8/29/2005] I ran across this study that was conducted in 2000 on 25 hand-picked single-term search queries. Back then they were comparing the Fast, Northern Light and AltaVista search engines. But aside from the number of queries and the fact that they are single words, the methodology is very similar. The first few queries return over 1000 results today.


