A Middle-Aged Engineer's Monologue - crumbjp







Search evaluation at Google
Posted: Monday, September 15, 2008

This series of posts has described Google's search quality efforts in areas such as ranking and search UI. Now I'll describe search evaluation. Simply put, search evaluation is the process of measuring the quality of our search results and our users' experience with search.


Let me introduce myself. I'm Scott Huffman, an engineering director responsible for leading search evaluation, working with a talented team of statisticians and software engineers. I've been here since 2005, and have been working on search in one form or another for the past fourteen years or so.


When I'm interviewing folks interested in joining the search evaluation team, I often use this scenario to describe what we do: Imagine a Google ranking engineer bursts into your office. "I have a great idea for improving our search results!" she exclaims. "It's simple: Whenever a page's title starts with the letter T, move it up in the results three slots." This engineer comes armed with several example search queries where, lo and behold, this idea actually improves the results significantly.





Now, you and I may think that this "letter T" hack is really a silly idea, but how can we know for sure? Search evaluation is charged with answering such questions. This hack hasn't really come up, but we are constantly evaluating everything, which can include:

  • proposed improvements to segmentation of Chinese queries
  • new approaches to fight spam
  • techniques for improving how we handle compound Swedish words
  • changes to how we handle links and anchortext
  • and everything in between


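 - Proposals for improving search results in Chinese-character languages
 - New approaches to fighting spam
 - New techniques for handling Swedish
 - Changes to how links and anchor text are handled
 - Everything else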

As Udi mentioned in his initial post on search quality, in 2007 we launched over 450 improvements to Google search, and every one of them went through a comprehensive evaluation process.

As Udi mentioned when discussing search quality, in 2007 we made some 450 improvements to Google search, each of which went through a comprehensive evaluation process.

Not surprisingly, we take search evaluation very seriously. Precise evaluation enables our teams to know "which way is up". One of our tenets in search quality is to be very data-driven in our decision-making. We try hard not to rely on anecdotal examples, which are often misleading in search (where decisions can affect hundreds of millions of queries a day). Meticulous, statistically-meaningful evaluation gives us the data we need to make real search improvements.


Evaluating search is difficult for several reasons.

First, understanding what a user really wants when they type a query -- the query's "intent" -- can be very difficult. For highly navigational queries like [ebay] or [orbitz], we can guess that most users want to navigate to the respective sites. But how about [olympics]? Does the user want news, medal counts from the recent Beijing games, the IOC's homepage, historical information about the games, ... ? This same exact question, of course, is faced by our ranking and search UI teams. Evaluation is the other side of that coin.

Second, comparing the quality of search engines (whether Google versus our competitors, Google versus Google a month ago, or Google versus Google plus the "letter T" hack) is never black and white. It's essentially impossible to make a change that is 100% positive in all situations; with any algorithmic change you make to search, many searches will get better and some will get worse.

Third, there are several dimensions to "good" results. Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.

Fourth, evaluating Google search quality requires covering an enormous breadth. We cover over a hundred locales (country/language pairs) with in-depth evaluation. Beyond locales, we support search quality teams working on many different kinds of queries and features. For example, we explicitly measure the quality of Google's spelling suggestions, universal search results, image and video searches, related query suggestions, stock oneboxes, and many, many more.

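 - It is very hard to understand what a user really wants from a search (the query's intent).

 - Comparing search quality against other engines is also hard. A change that makes things 100% better is essentially impossible.

 - "Good" has several different dimensions.

 - Google search quality has to cover a very large range: over a hundred locales (country/language pairs). Beyond locales, the search quality team supports many kinds of queries and features.
   For example: spelling suggestions, universal search results, image/video search, query suggestions, stock oneboxes (?), and others.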

To get at these issues, we employ a variety of evaluation methods and data sources:

Human evaluators. Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways. We sometimes show evaluators whole result sets by themselves or "side by side" with alternatives; in other cases, we show evaluators a single result at a time for a query and ask them to rate its quality along various dimensions.

Live traffic experiments. We also make use of experiments, in which small fractions of queries are shown results from alternative search approaches. Ben Gomes talked about how we make use of these experiments for testing search UI elements in his previous post. With these experiments, we are able to see real users' reactions (clicks, etc.) to alternative results.

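 - Human evaluators. Google makes use of evaluators in many countries and languages. The evaluators are carefully trained and are asked to evaluate search results in several different ways.

 - Live traffic experiments. We also show results from alternative approaches, on an experimental basis, for a small fraction of queries. Ben Gomes discussed these experiments for testing search UI elements in his previous post.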
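The post does not say how queries are routed into a live traffic experiment. As a rough illustration only, one common way to show alternative results to a small, stable fraction of traffic is to hash an identifier into a bucket. The function name `assign_arm`, the identifier format, and the 1% default are all assumptions made for this sketch, not anything described by Google.

```python
import hashlib

def assign_arm(unit_id, experiment_name, experiment_fraction=0.01):
    """Deterministically assign a unit (e.g. a query or user id) to an arm.

    Hypothetical helper: hashes the experiment name and unit id into a
    64-bit bucket and sends the lowest `experiment_fraction` of buckets
    to the experiment arm.
    """
    key = f"{experiment_name}:{unit_id}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    threshold = int(experiment_fraction * 2**64)
    return "experiment" if bucket < threshold else "control"

if __name__ == "__main__":
    # Illustrative ids only; roughly 10% should land in the experiment arm.
    arms = [assign_arm(f"user-{i}", "related-searches-v2", 0.10) for i in range(1000)]
    print(arms.count("experiment"), "of", len(arms), "units in the experiment arm")
```

Hashing rather than drawing a random number each time keeps the same unit in the same arm across requests, which makes clicks and other reactions easier to attribute to one variant.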

Clearly, we can never measure anything close to all the queries Google will get in the future. Every day, in fact, Google gets many millions of queries that we have never seen before, and will never see again. Therefore, we measure statistically, over representative samples of the query-stream. The "letter T" hack probably does improve a few queries, but over a representative sample of queries it affects, I'm confident it would be a big loser.
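To make "measuring statistically over a representative sample" concrete, here is a minimal sketch. It assumes each sampled query has a numeric quality score under the baseline and under the proposed change, and that we look at the per-query difference. The data, the function name `mean_delta_with_ci`, and the normal-approximation interval are all illustrative assumptions, not Google's actual method.

```python
import math
import statistics

def mean_delta_with_ci(deltas, z=1.96):
    """Return (mean, low, high) for per-query score changes.

    `deltas` holds (experiment score - baseline score) for each sampled
    query. The interval is a simple normal approximation; z=1.96 gives
    roughly 95% coverage for large samples.
    """
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two sampled queries")
    mean = statistics.fmean(deltas)
    stderr = statistics.stdev(deltas) / math.sqrt(n)
    return mean, mean - z * stderr, mean + z * stderr

if __name__ == "__main__":
    # Hypothetical sample: two striking wins (the "anecdotes") and
    # thirty-eight small losses.
    sample_deltas = [2.0, 1.5] + [-0.5] * 38
    mean, low, high = mean_delta_with_ci(sample_deltas)
    print(f"mean change {mean:.3f}, interval [{low:.3f}, {high:.3f}]")
```

In this made-up sample, the two impressive wins are exactly the kind of anecdote an engineer might bring along, yet the whole interval sits below zero, so the change loses on the sample as a whole.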


One of the key skills of our evaluation team is experimental design. For each proposed search improvement, we generate an experiment plan that will allow us to measure the key aspects of the change. Often, we use a combination of human and live traffic evaluation. For instance, consider a proposed improvement to Google's "related searches" feature to increase its coverage across several locales. Our experiment plan might include live traffic evaluation in which we show the updated related search suggestions to users and measure click-through rates in each locale and break these down by position of each related search suggestion. We might also include human evaluation, in which for a representative sample of queries in each locale, we ask evaluators to rate the appropriateness, usefulness, and relevance of each individual related search suggestion. Including both types of evaluation allows us to understand the overall behavioral impact on users (via the live traffic experiment), and measure the detailed quality of the suggestions in each locale along multiple dimensions (via the human evaluation experiment).

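The plan would likely include a live traffic experiment, looking at things such as changes in the related search suggestions, click-through rates in each locale, and the position of each related search suggestion.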
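The breakdown described above (click-through rate per locale, split by the position of each related search suggestion) can be sketched as a simple aggregation. The event format, a `(locale, position, clicked)` tuple per displayed suggestion, and the locale codes are assumptions made for illustration; the post does not describe the actual logging format.

```python
from collections import defaultdict

def ctr_by_locale_and_position(events):
    """Compute click-through rate for each (locale, position) pair.

    `events` is an iterable of (locale, position, clicked) tuples, one
    per suggestion impression (hypothetical log format).
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for locale, position, was_clicked in events:
        key = (locale, position)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / shown[key] for key in shown}

if __name__ == "__main__":
    # Made-up impressions for two locales.
    events = [
        ("sv-SE", 1, True),
        ("sv-SE", 1, False),
        ("sv-SE", 2, False),
        ("ja-JP", 1, True),
    ]
    for (locale, position), ctr in sorted(ctr_by_locale_and_position(events).items()):
        print(f"{locale} position {position}: CTR {ctr:.2f}")
```

A table like this shows the behavioral side only; the human ratings of appropriateness, usefulness, and relevance would be collected and analyzed separately.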


Choosing an appropriate sample of queries to evaluate can be subtle. When evaluating a proposed search improvement, we consider not only whether a given query's results are changed by the proposal, but also how much impact the change is likely to have on users. For instance, a query whose first three results are changed is likely much higher impact than one for which results 9 and 10 are swapped. In Amit Singhal's previous post on ranking, he discussed synonyms. Recently, we evaluated a proposed update to make synonyms more aggressive in some cases. On a flat (non-impact-weighted) sample of affected queries, the change appeared to be quite positive. However, using an evaluation of an impact-weighted sample, we found that the change went much too far. For example, in Chinese, it synonymized "small" (小) and "big" (大)... not a good idea!

In Amit Singhal's previous post on ranking, he talks about synonyms.
On the impact-weighted sample, however, the result could hardly be called a success.
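The gap between a flat sample and an impact-weighted one can be illustrated with a toy calculation. Everything here is an assumption for the sake of the sketch: the post does not say how impact is computed, so the code borrows a DCG-style position discount (changes near the top count more), and it uses a weighted average as a stand-in for actually drawing an impact-weighted sample.

```python
import math

def impact_weight(changed_positions):
    """Hypothetical impact score: changes near the top of the page count more."""
    return sum(1.0 / math.log2(pos + 1) for pos in changed_positions)

def flat_mean(queries):
    """Average quality change, treating every affected query equally."""
    return sum(q["delta"] for q in queries) / len(queries)

def impact_weighted_mean(queries):
    """Average quality change, weighting each query by its impact score."""
    weights = [impact_weight(q["changed_positions"]) for q in queries]
    weighted_sum = sum(w * q["delta"] for w, q in zip(weights, queries))
    return weighted_sum / sum(weights)

if __name__ == "__main__":
    # Made-up data: three queries where results 9 and 10 change slightly
    # for the better, and one where the top three results get worse.
    queries = [
        {"changed_positions": [9, 10], "delta": 1.0},
        {"changed_positions": [9, 10], "delta": 1.0},
        {"changed_positions": [9, 10], "delta": 1.0},
        {"changed_positions": [1, 2, 3], "delta": -2.0},
    ]
    print(f"flat mean: {flat_mean(queries):+.3f}")
    print(f"impact-weighted mean: {impact_weighted_mean(queries):+.3f}")
```

With these made-up numbers the flat average comes out positive while the impact-weighted one comes out negative, which mirrors the pattern described for the synonyms change.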


We're serious about search evaluation because we are serious about giving you the highest quality search experience possible. Rather than guess at what will be useful, we use a careful data-driven approach to make sure our "great ideas" really are great for you. In this environment, the "letter T" hack never had a chance.

Posted by Scott Huffman, Engineering Director