Saturday 28 February 2015

Python & Google Search API Requests

It's been a while since I've been able to blog something. Anyway, out of all the geeky things I could write about which I've been tinkering with (reponsive HTML email layouts, Symfony2 secure users set ups, etc)., I just thought I'd share something for a Google API that should have easy instructions for accessing it through Python somewhere but doesn't. You'll see lots of information out there about putting a CustomSearch box and results in your browser with Javascript block, but we want to do it with Python, right?

Google provide a generic "how to access Google APIs through Python" guide here:
https://developers.google.com/api-client-library/python/

However, it's not too helpful in showing you specifically how to just get Google search results out, which is what we want! It also doesn't tell you that at the time of writing Google haven't yet made a version of their module for Python 3.

Anyway, the background seems to be that years ago Google had a rather clunky XML/SOAP interface for making requests using an old Search API. Thankfully they got rid of it and replaced with a much nicer JSON API. There is a free version of this new one called the CustomSearch API, which limits you to 100 requests per day. (If you want more there is a paid version.) The catch for this is that you are also limited to only ten search results per request. I guess Google unhelpfully assume you will want to paginate search results, with ten results per page? So for example, if you want to get 100 results searching for something, you'll have to call the CustomSearch API ten times (i.e. 10 requests = 10 x 10 search results).

Shut Up & Show Me the Instructions


OK, so anyway it's Google you're dealing with. That means, if you want to use their CustomSearch API, you need a Google account. So if you haven't created a Google account yet, go ahead and create one. Then what you'll need to do is:
(a) create a Project and assign a Google API to it
(b) create a Search Engine with an ID

(a) Create a Project and assign a Google API to it


Go to the Google Developers Console and login:
https://console.developers.google.com

Click the "Create Project" button
-Enter a Project Name, e.g. "testSearch1"
-Enter a Project ID (Google will suggest one for you)

Click the "Enable an API" button
-click on the Custom Search API
Select "Credentials" in the left navigation bar under "APIs & auth"
   
Under Public API access, click on Create new Key
Then click >Server Key >Create (don't bother entering anything in optional IP addresses).
Copy the key code it gives you and save it somewhere safe.

(b) create a Search Engine with an ID


Next, go to the CustomSearch Search Engine management console page. You need to go here, because you need a search engine and its ID to be able to do Google search requests.
https://www.google.com/cse/manage/all

Click on >New search engine
-Enter a Search Engine name, e.g. Google
-Enter a site to search, e.g. www.google.com
-Choose Search the entire web but emphasize included sites
-After this is finished, in the Edit screen for the search engine you've created, click on the Search Engine ID button. Copy this ID code, you'll need it for later.



Check the available Google APIs pages and find the details for the CustomSearch API. You'll notice it has one entity: cse. This is what we're going to focus on.
https://developers.google.com/apis-explorer/#p/
https://developers.google.com/resources/api-libraries/documentation/customsearch/v1/python/latest/customsearch_v1.cse.html

Have a look in there and you will see a large number of parameters we can pass in to the entity when doing a request via the API. The only parameters we really need though are:
q - the search query string
cx - the search engine ID we created
start - the offset from the beginning of the search results.

Let's get it going with Python


OK, our goal is write a quick script in Python to run on the command line. It will make a number of requests using the Google CustomSearch API and send the results to JSON output file. (I'm assuming you're using Python 2.x and have pip installed to download your Python packages.)

Install the Google API client module:
pip install -U google-api-python-client

Edit this program below, inserting your own Project ID and Search Engine ID. I've saved it as "search_google.py", and made a new directory in my current directory named "output" first. This is where the results file will be outputted.

#!/usr/bin/env python

import datetime as dt
import json, sys
from apiclient.discovery import build


if __name__ == '__main__':
    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.json"
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    output_fname = output_dir + 'srch_res_' + now_sfx + '.json'
    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])

    # Key codes we created earlier for the Google CustomSearch API
    search_engine_id = '[My Search Engine ID]'
    api_key = '[My Project API Key]'
   
    # The build function creates a service object. It takes an API name and API
    # version as arguments.
    service = build('customsearch', 'v1', developerKey=api_key)
    # A collection is a set of resources. We know this one is called "cse"
    # because the CustomSearch API page tells us cse "Returns the cse Resource".
    collection = service.cse()

    output_f = open(output_fname, 'ab')

    for i in range(0, num_requests):
        # This is the offset from the beginning to start getting the results from
        start_val = 1 + (i * 10)
        # Make an HTTP request object
        request = collection.list(q=search_term,
            num=10, #this is the maximum & default anyway
            start=start_val,
            cx=search_engine_id
        )
        response = request.execute()
        output = json.dumps(response, sort_keys=True, indent=2)
        output_f.write(output)
        print('Wrote 10 search results...')

    output_f.close()
    print('Output file "{}" written.'.format(output_fname))


Run the program on the command line with the format:
python search_google.py "[my search term]" [number of requests x 10]
e.g.
python search_google.py "vintage tiddlywinks" 3

Your JSON file should be outputted in the output directory you created, containing your search results.