Google provide a generic "how to access Google APIs through Python" guide here:
https://developers.google.com/api-client-library/python/
However, it's not too helpful in showing you specifically how to just get Google search results out, which is what we want! It also doesn't tell you that at the time of writing Google haven't yet made a version of their module for Python 3.
Anyway, the background seems to be that years ago Google had a rather clunky XML/SOAP interface for making requests using an old Search API. Thankfully they got rid of it and replaced it with a much nicer JSON API. There is a free version of this new one called the CustomSearch API, which limits you to 100 requests per day. (If you want more, there is a paid version.) The catch is that you are also limited to only ten search results per request. I guess Google unhelpfully assume you will want to paginate search results, with ten results per page? So for example, if you want to get 100 results searching for something, you'll have to call the CustomSearch API ten times (i.e. 10 requests x 10 results each = 100 search results).
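To make that arithmetic concrete, here is a small sketch that works out how many requests you need for a given number of results, and the 1-based starting position for each request (this is what the "start" parameter described further down expects). The helper name start_offsets is made up for illustration, and it assumes ten results per request.

```python
RESULTS_PER_REQUEST = 10  # the per-request cap described above


def start_offsets(total_results, per_request=RESULTS_PER_REQUEST):
    """Return the 1-based start index for each request that is needed."""
    # Ceiling division: 25 results still need 3 requests
    num_requests = (total_results + per_request - 1) // per_request
    return [1 + i * per_request for i in range(num_requests)]


print(start_offsets(100))  # ten start values: 1, 11, 21, ... 91
print(start_offsets(25))   # three start values: 1, 11, 21
```

Bear in mind that each of those start values costs one request out of the free daily quota of 100.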
Shut Up & Show Me the Instructions
OK, so anyway it's Google you're dealing with. That means, if you want to use their CustomSearch API, you need a Google account. So if you haven't created a Google account yet, go ahead and create one. Then what you'll need to do is:
(a) create a Project and assign a Google API to it
(b) create a Search Engine with an ID
(a) Create a Project and assign a Google API to it
Go to the Google Developers Console and log in:
https://console.developers.google.com
Click the "Create Project" button
-Enter a Project Name, e.g. "testSearch1"
-Enter a Project ID (Google will suggest one for you)
Click the "Enable an API" button
-click on the Custom Search API
Select "Credentials" in the left navigation bar under "APIs & auth"
Under Public API access, click on Create new Key
Then click >Server Key >Create (don't bother entering anything in optional IP addresses).
Copy the key code it gives you and save it somewhere safe.
(b) Create a Search Engine with an ID
Next, go to the CustomSearch Search Engine management console page. You need to go here, because you need a search engine and its ID to be able to do Google search requests.
https://www.google.com/cse/manage/all
Click on >New search engine
-Enter a Search Engine name, e.g. Google
-Enter a site to search, e.g. www.google.com
-Choose Search the entire web but emphasize included sites
-After this is finished, in the Edit screen for the search engine you've created, click on the Search Engine ID button. Copy this ID code; you'll need it later.
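You now have two values to look after: the API key and the Search Engine ID. Rather than pasting them straight into a script, one option is to keep them in environment variables. This is only a minimal sketch: the variable names GOOGLE_API_KEY and GOOGLE_CSE_ID and the helper load_credentials are my own inventions, not anything Google defines.

```python
import os


def load_credentials(environ=os.environ):
    """Fetch the API key and search engine ID from environment variables."""
    # Both variable names are hypothetical; pick whatever suits your setup
    api_key = environ.get('GOOGLE_API_KEY')
    search_engine_id = environ.get('GOOGLE_CSE_ID')
    if not api_key or not search_engine_id:
        raise RuntimeError('Set GOOGLE_API_KEY and GOOGLE_CSE_ID first')
    return api_key, search_engine_id


# Demonstrate with a stand-in dict instead of the real environment
fake_env = {'GOOGLE_API_KEY': 'abc123', 'GOOGLE_CSE_ID': 'xyz789'}
print(load_credentials(fake_env))
```

Calling load_credentials() with no argument reads the real environment and raises an error if either value is missing.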
Check the available Google APIs pages and find the details for the CustomSearch API. You'll notice it has one entity: cse. This is what we're going to focus on.
https://developers.google.com/apis-explorer/#p/
https://developers.google.com/resources/api-libraries/documentation/customsearch/v1/python/latest/customsearch_v1.cse.html
Have a look in there and you will see a large number of parameters we can pass in to the entity when doing a request via the API. The only parameters we really need though are:
q - the search query string
cx - the search engine ID we created
start - the position of the first result to return, counting from 1 (so 1 gets results 1-10, 11 gets results 11-20, and so on).
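To see how these three parameters fit together, here is a rough sketch that builds the raw request URL for the JSON API by hand. It assumes the https://www.googleapis.com/customsearch/v1 endpoint, with the API key passed as the "key" parameter; the helper name build_search_url and the placeholder values are made up. It only constructs the URL and doesn't send anything.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

# Assumed REST endpoint for the CustomSearch JSON API
BASE_URL = 'https://www.googleapis.com/customsearch/v1'


def build_search_url(api_key, search_engine_id, query, start=1):
    """Build (but do not send) a CustomSearch request URL."""
    params = [
        ('key', api_key),          # the project API key
        ('cx', search_engine_id),  # the search engine ID
        ('q', query),              # the search query string
        ('start', start),          # 1-based position of the first result
    ]
    return BASE_URL + '?' + urlencode(params)


# Placeholder credentials, asking for results 11-20
print(build_search_url('MY_API_KEY', 'MY_ENGINE_ID',
                       'vintage tiddlywinks', start=11))
```

The client library used in the script further down does the equivalent for you, so this is just to show what the parameters amount to.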
Let's get it going with Python
OK, our goal is to write a quick script in Python to run on the command line. It will make a number of requests using the Google CustomSearch API and send the results to a JSON output file. (I'm assuming you're using Python 2.x and have pip installed to download your Python packages.)
Install the Google API client module:
pip install -U google-api-python-client
Edit the program below, inserting your own Project API Key and Search Engine ID. I've saved it as "search_google.py", and first made a new directory named "output" in my current directory. This is where the results file will be written.
#!/usr/bin/env python
import datetime as dt
import json, sys
from apiclient.discovery import build

if __name__ == '__main__':
    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.json"
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    output_fname = output_dir + 'srch_res_' + now_sfx + '.json'
    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])
    # Key codes we created earlier for the Google CustomSearch API
    search_engine_id = '[My Search Engine ID]'
    api_key = '[My Project API Key]'
    # The build function creates a service object. It takes an API name and API
    # version as arguments.
    service = build('customsearch', 'v1', developerKey=api_key)
    # A collection is a set of resources. We know this one is called "cse"
    # because the CustomSearch API page tells us cse "Returns the cse Resource".
    collection = service.cse()
    output_f = open(output_fname, 'ab')
    for i in range(0, num_requests):
        # This is the offset from the beginning to start getting the results from
        start_val = 1 + (i * 10)
        # Make an HTTP request object
        request = collection.list(q=search_term,
                                  num=10,  # this is the maximum & default anyway
                                  start=start_val,
                                  cx=search_engine_id
                                  )
        response = request.execute()
        output = json.dumps(response, sort_keys=True, indent=2)
        output_f.write(output)
        print('Wrote 10 search results...')
    output_f.close()
    print('Output file "{}" written.'.format(output_fname))
Run the program on the command line with the format:
python search_google.py "[my search term]" [number of requests, 10 results each]
e.g.
python search_google.py "vintage tiddlywinks" 3
Your JSON file, containing your search results, should be written to the output directory you created.
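One wrinkle: because the script appends one JSON document per request, a file made from more than one request holds several JSON objects back to back, which json.load won't read in one go. Here's a sketch of one way to read them back, using json.JSONDecoder.raw_decode to peel off one object at a time. The helper name iter_json_docs and the sample data are made up, and the "items", "title" and "link" fields are what I'd expect to find in a CustomSearch response, so check them against your own output.

```python
import json


def iter_json_docs(text):
    """Yield each JSON document found back to back in a string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip any whitespace sitting between documents
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample standing in for the contents of an output file
sample = ('{"items": [{"title": "A", "link": "http://a.example"}]}'
          '{"items": [{"title": "B", "link": "http://b.example"}]}')

for response in iter_json_docs(sample):
    # A response with no hits may have no "items" key at all
    for item in response.get('items', []):
        print(item['title'] + ' - ' + item['link'])
```

To try it on a real results file, read the file's contents into a string and pass that to iter_json_docs in place of the sample.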