How Google Base works
I've been searching for a while now for information about how Google Base groups its attributes to provide a multifaceted search, and I thought it would be useful to share a little with you guys. If you have any additional papers or information, please post your comments!
After playing with Google Base for a little while I noticed that it returns the counts of the attributes only when there are 1000 results or fewer. This led me to think that Google may have chosen a sampling approach to figure out which user-defined attributes are common in the result set.
See for example the results for "mp3" which gives me more than 167,000 results. It allows me to narrow my results by the attributes (labels in this case) "products" and "music". It won't show counts, since the number of results exceeds 1000 and it had to sample the results to figure out which attributes are common.
I refine my search by clicking the "products" attribute, reducing my set to 145,000 results. Google now shows the refinements "Condition", "Manufacturer", "Brand", "Product type", "Location", "Book", "Capacitors" and "Amplifiers" because these occur frequently in the top 1000 results (let's assume a threshold of 10%). It also shows a drop-down for the attribute "Price" which lets me specify a price range to search for. I assume this happens when every record in the sample contains that attribute, or at least when the share is above a high threshold, say 90%. That's as expected for products: they come with a price :)
Now I narrow down to "Search for mp3 > Products > Brand: apple > Nano", returning only 64 results. Since I have fewer than 1000 results Google has analyzed all the documents for me, so it's now able to show me the exact counts for each attribute.
Google takes the top-k results (top 1000) and analyzes each document using the document-at-a-time (DAAT) approach. Fetching the metadata for 1000 documents doesn't take much time. For each document it increases the counts of each element found in that document. After analyzing, it throws away the attributes that occur only a few times, below a certain threshold. These are not interesting to show to users because they narrow the search too much.
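To make the idea concrete, here is a minimal sketch of the counting step as I imagine it. This is not Google's code: the function name `suggest_facets`, the dict-per-document representation and the 10% / 90% thresholds are all my own assumptions, and it counts attribute names only, not their values.

```python
from collections import Counter

TOP_K = 1000        # sample size, taken from the observed 1000-result cutoff
MIN_SHARE = 0.10    # assumed threshold for showing an attribute as a refinement
RANGE_SHARE = 0.90  # assumed threshold for treating an attribute as near-universal


def suggest_facets(ranked_docs, top_k=TOP_K,
                   min_share=MIN_SHARE, range_share=RANGE_SHARE):
    """Count attributes over the top-k documents of a ranked result list.

    ranked_docs is a hypothetical list of dicts (attribute name -> value),
    best match first.
    """
    sample = ranked_docs[:top_k]
    counts = Counter()
    for doc in sample:
        # Document-at-a-time: visit each document once and bump the
        # counter of every attribute it carries.
        counts.update(doc.keys())

    total = len(sample)
    refinements, universal = {}, {}
    for attr, n in counts.items():
        share = n / total
        if share >= range_share:
            universal[attr] = n      # e.g. "Price" on product listings
        elif share >= min_share:
            refinements[attr] = n    # common enough to offer as a filter
        # anything rarer is dropped: it would narrow the search too much

    return {
        # Counts are exact only when the sample covers the whole result set.
        "exact": len(ranked_docs) <= top_k,
        "refinements": refinements,
        "universal": universal,
    }
```

When `exact` is false, a front end would show the attribute names without counts, which matches what I see for the big "mp3" result set.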
The paper "Sampling search-engine results" describes in depth how this top-k sampling works and how accurate its results are. Because Google applies some relevancy ranking algorithm to the results before taking the top-k sample, it is likely to return a very good selection of relevant attributes for your search.
I hope you now get the idea of what's going on behind the scenes of Google Base. Of course the actual implementation may differ a bit; this is just my analysis of it. I wasn't able to find any papers from Google that confirm this theory.
Google Base isn't rocket science as some might think - you'll see a lot of similar products before long and its techniques will become as common as full-text indexing.


