How to "steal" Google's "did you mean" feature

I really like Google, and its classic "did you mean" feature is great. Unfortunately, when I wanted to implement it in my project, I realized that none of Google's APIs provide it.

There are some threads on stackoverflow.com (like this one) where people ask how the algorithm works or whether they can implement it on their own websites. Someone posted a link to a great article by Peter Norvig, How to Write a Spelling Corrector. It's a good read, but it's not exactly how Google's "did you mean" works (see Search 101 on YouTube), and it's not really useful for real applications because:

I believe that for most developers these two conditions are unachievable, and they are for me too. So I wondered whether I could bypass these drawbacks and let Google do all the work for me.

Dummy way: Examining Google's HTML code

My first idea was to write a script that runs the query through Google's search and then checks whether Google offered a "did you mean" suggestion.

Source code in PHP

$lang = 'en';
$query = 'white hose';

// pretend we're an ordinary browser
$agents = array(
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.60 Safari/534.24",
    "Opera/9.63 (Windows NT 6.0; U; ru) Presto/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5",
);

// create the url. $lang is very important parameter because it specifies in what language
// we are searching and therefore it influences the search results a lot.
$url = 'http://www.google.com/search?client=firefox-a&hl=' . $lang . '&q=' . urlencode($query);

// download the search results
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agents[rand(0, count($agents) - 1)]);
$data = curl_exec($ch);
curl_close($ch);

// span with class 'spell' means that Google suggested something
$template_span = '<span class=spell ';
$template_a = '<a rel="nofollow" target="blank" href';
$template_end_a = '</a>';

// check if the html source code contains "did you mean" block
if (strpos($data, $template_span) === false) {
    $answer = false;
} else {
    $str = substr($data, strpos($data, $template_span));
    $str = substr($str, strpos($str, $template_a));
    // and here's the Google's suggestion
    $answer = strip_tags(substr($str, 0, strpos($str, $template_end_a)));
}

if ($answer) {
    echo 'did you mean: ' . $answer;
} else {
    echo 'no suggestion';
}

How does it work?

This script takes the search query $query and the language $lang, generates a Google URL and downloads the results page. Then it looks for the "did you mean" pattern, which is <span class=spell .

For example, if I searched for "white hose" instead of "white house", the URL would be:

http://www.google.com/search?client=firefox-a&hl=en&q=white+hose
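
The URL construction can also be isolated into a small helper. This is only a sketch: the function name build_search_url is made up for illustration, and it uses http_build_query() instead of the manual concatenation from the script above.

```php
<?php
// Hypothetical helper (not part of the original script): builds the same
// search URL, letting http_build_query() take care of the escaping.
function build_search_url($query, $lang = 'en')
{
    $params = array(
        'client' => 'firefox-a', // client whose HTML output was easy to traverse
        'hl'     => $lang,       // search language, influences the results a lot
        'q'      => $query,      // the possibly misspelled search query
    );

    // '&' is passed explicitly so the result doesn't depend on php.ini;
    // the default encoding turns spaces into '+', just like urlencode().
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

echo build_search_url('white hose', 'en'), "\n";
```

Keeping the parameters in an array makes it easier to experiment with other client or hl values without touching the string concatenation.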

It may be unclear where client=firefox-a came from. There's no deeper reason: I was just looking for a client whose HTML output is easy to traverse, and firefox-a seemed to do nicely.

If you run this script, you should see that Google correctly recognised that I was searching for white house.

Conclusion

Admittedly, this is a very naive way of analysing a page's content, but since the example is so simple (it checks just one pattern), there's no need for more advanced techniques like XPath (Parsing HTML pages using XPath).

Another caveat is that Google's HTML structure may change at any time, and then this script would stop working.
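
For comparison, here is roughly how the same check might look with XPath. Treat it as a sketch under assumptions: the function name extract_suggestion is hypothetical, the sample markup is made up rather than taken from a real Google response, and the expression simply mirrors the pattern above (the first link after a span with class spell). It is just as exposed to markup changes as the string-based version.

```php
<?php
// Sketch only: extracts the suggestion with XPath instead of strpos()/substr().
// Assumes the suggestion is the text of the first <a> element found inside or
// after a <span class="spell">, which mirrors the pattern used above.
function extract_suggestion($html)
{
    if (!is_string($html) || trim($html) === '') {
        return false; // download failed or the page was empty
    }

    // real-world HTML is rarely valid, so keep libxml's warnings quiet
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query(
        '(//span[@class="spell"]//a | //span[@class="spell"]/following::a)[1]'
    );

    if ($links === false || $links->length === 0) {
        return false; // no "did you mean" block found
    }

    return trim($links->item(0)->textContent);
}

// Made-up markup for illustration, not a real Google response
$sample = '<html><body><p><span class="spell">Did you mean:</span> '
        . '<a rel="nofollow" href="/search?q=white+house"><b><i>white house</i></b></a>'
        . '</p></body></html>';

var_dump(extract_suggestion($sample));
var_dump(extract_suggestion('<html><body><p>plain results</p></body></html>'));
```

The function returns false when nothing is found, so it can be dropped in where $answer is computed in the script above.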

Less dummy way: "Stealing" Firefox's suggestions

While tracing requests from a website in Firefox using Burp Suite, I noticed a lot of requests coming from Firefox itself as soon as I started typing into the search bar. All of these requests went to suggestqueries.google.com.

I ran Fiddler2 to understand what exactly Firefox calls and when.

The URL structure is almost the same as in the first example above:

http://suggestqueries.google.com/complete/search?output=firefox&client=firefox&hl=en-US&q=404

and the response is just ordinary JSON:

Source code in PHP

The source code is very similar to the first example:

$lang = 'en';
$query = 'white hose';

$url = 'http://suggestqueries.google.com/complete/search?output=firefox&client=firefox&hl=' . $lang . '&q=' . urlencode($query);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.0; rv:2.0.1) Gecko/20100101 Firefox/4.0.1");
$data = curl_exec($ch);
curl_close($ch);

$suggestions = json_decode($data, true);

if ($suggestions) {
    echo 'suggestions: ';
    print_r($suggestions);
} else {
    echo 'no suggestion';
}

You can get almost the same results by calling http://www.google.com/complete/search?q=white+hose, but the response is not plain JSON like the one from suggestqueries.google.com.

Output

(
    [0] => white hose
    [1] => Array
        (
            [0] => white house
            [1] => white house black market
            [2] => white hose reel
            [3] => white house tours
            [4] => white horse lyrics
            [5] => white house easter egg roll
            [6] => white horse challenge
        )

)
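
To turn this list into a single "did you mean" answer, one option is to take the top suggestion and compare it with what the user typed. The sketch below assumes the response always has the shape shown above (the original query followed by an array of suggestions). The function name did_you_mean and the "top suggestion wins" rule are my own assumptions, not something Google documents.

```php
<?php
// Sketch only: picks a "did you mean" candidate from the suggest response.
// Assumes the decoded JSON looks like the output above:
// [ original query, [ suggestion, suggestion, ... ] ].
function did_you_mean($json, $query)
{
    $decoded = json_decode((string) $json, true);

    if (!is_array($decoded)
        || !isset($decoded[1])
        || !is_array($decoded[1])
        || !isset($decoded[1][0])
        || !is_string($decoded[1][0])
    ) {
        return false; // invalid JSON or no suggestions at all
    }

    $top = $decoded[1][0];

    // if the top suggestion equals the query, there is nothing to correct
    if (strcasecmp($top, trim($query)) === 0) {
        return false;
    }

    return $top;
}

// Sample response built from the output shown above
$json = '["white hose",["white house","white house black market","white hose reel"]]';

$answer = did_you_mean($json, 'white hose');
echo $answer ? 'did you mean: ' . $answer : 'no suggestion', "\n";
```

Note that these are autocomplete suggestions, so the top entry may be a longer completion of the query rather than a spelling correction. A stricter comparison may be needed in practice.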

Conclusion

I think the second example is much better than the first one. Neither is pretty, but if you really want to implement a suggestion feature, this is probably the best solution. By the way, there's an issue on google-ajax-apis where developers are asking for an official API.

I read that Yahoo has a spelling suggestions API, but I haven't tried it. There's also a spelling suggestion API for Python, but I haven't tried that either.
