Scraping Google Images

Motivation

A few years ago, when I first started writing small scripts, there were a bunch of repos for scraping images from Google. So when I wanted to create a simple MMS sender, I was surprised to find that all the popular solutions now use some form of Chromium emulation to scrape Google Images (PhantomJS, Selenium, Puppeteer).

Chromium continues to dominate the browser market share, and I don't want to let it creep into my scripts. Scripts should be simple, easy to understand, modify, and replicate. A headless browser is simply an engineering schism for me.

The Issue

Google used to expose a 'meta_rg' class that you could use to scrape their site. That has since been axed, and because results are now generated dynamically, you either need to find a new source for the data or another way to load and parse a basic image search.

I explore querying a hanging Google API, but there's also a great post by Denis Skopa on extracting image metadata from a basic search. Check it out.

Solution

TL;DR: Use this URL to request a JSON file. It will contain links to the original full-size images. Parse out the links and download each image. GitHub

(https://www.google.com/search?q={query}&tbm=isch&async=_id:islrg_c,_fmt:json&asearch=ichunklite&ijn={page})

Replace '{query}' with what you're searching for (ex. search?q=linux&)

Replace '{page}' with the page of results you want to request (ex. &ijn=2)
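Filling in the two placeholders by hand gets tedious, so here is a minimal sketch of a helper that does it. The names `SEARCH_URL` and `build_search_url` are made up for this example; the only thing taken from above is the URL template itself. It uses `quote_plus` from the standard library so that spaces in the query become `+`.

```python
from urllib.parse import quote_plus

# The template from above, with the same two placeholders.
SEARCH_URL = (
    "https://www.google.com/search?q={query}&tbm=isch"
    "&async=_id:islrg_c,_fmt:json&asearch=ichunklite&ijn={page}"
)


def build_search_url(query, page=0):
    """Return the search URL for `query` and a zero-based results page."""
    # quote_plus turns spaces into '+' and escapes reserved characters
    # such as '&' so they cannot break the query string.
    return SEARCH_URL.format(query=quote_plus(query), page=int(page))


if __name__ == "__main__":
    print(build_search_url("an apple"))
    print(build_search_url("linux", page=2))
```

Calling `build_search_url("linux", page=2)` gives a URL containing `search?q=linux&` and ending in `&ijn=2`, matching the two examples above.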

URL Parameters

URL parameters are used to write or communicate queries in the browser. Also called query strings, they let you request information from a resource (website) and shape its output more specifically. The parameters start after a question mark (?) and use ampersands (&) as separators. The equals sign (=) separates the key=value pairs of the query. So we can break down the above URL like so:

https:// -> connection protocol, requests resource from port 443.
www.google.com -> domain
/search?   -> path, and start of URL parameters
q=an+apple -> search for 'an apple'
&tbm=isch  -> run an image search
&async=_id:islrg_c,_fmt:json -> get full content results, in json format
&asearch=ichunklite -> return mostly just the search's assets
&ijn=0 -> return page 0 of results (100/pg)
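The same breakdown can be reproduced in code. This is just a sketch using Python's standard `urllib.parse`; the function name `describe_url` is invented for the example, and the meaning of each Google parameter is still the interpretation given above, not something the parser knows about.

```python
from urllib.parse import parse_qs, urlsplit


def describe_url(url):
    """Split a URL into scheme, host, path, and a dict of its parameters."""
    parts = urlsplit(url)
    # parse_qs splits on '&' and '=', decodes '+' to a space, and maps
    # each key to a list of values (a key may repeat in a query string).
    params = parse_qs(parts.query)
    return parts.scheme, parts.netloc, parts.path, params


if __name__ == "__main__":
    url = (
        "https://www.google.com/search?q=an+apple&tbm=isch"
        "&async=_id:islrg_c,_fmt:json&asearch=ichunklite&ijn=0"
    )
    scheme, host, path, params = describe_url(url)
    print(scheme, host, path)
    for key, values in params.items():
        print(f"{key} = {values[0]}")
```

Note that `q` comes back as `an apple`, with the `+` decoded to a space.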

If you want to see what this looks like, you can modify the URL to return an HTML page instead of a JSON file, and your browser will render the response.

Try it: html, json

JSON Structure

Once we have the response in JSON format, we just need to parse out each picture's source URL. You can use the data schema below to get a list of URLs.

ichunklite
  │
  └─results
        │
        ├───result_1
        │    │
        │    └──viewer_metadata
        │        │
        │        └─original_image
        │               │
        │               └─url
        │
        └───result_2
             │
             └──...
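Walking that schema in Python might look like the sketch below. It assumes the key names are exactly as drawn (`ichunklite`, `results`, `viewer_metadata`, `original_image`, `url`) and that `results` is a list; in case it turns out to be an object keyed by result, the sketch falls back to its values. The function name `extract_image_urls` is hypothetical, and results missing any of the nested keys are simply skipped.

```python
import json


def extract_image_urls(payload):
    """Collect original image URLs from a parsed (or raw) JSON response."""
    data = json.loads(payload) if isinstance(payload, str) else payload
    results = data.get("ichunklite", {}).get("results", [])
    if isinstance(results, dict):
        # Assumption: if results is keyed by name, the values are the results.
        results = results.values()

    urls = []
    for result in results:
        original = result.get("viewer_metadata", {}).get("original_image", {})
        url = original.get("url")
        if url:  # skip results without an original image URL
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Made-up sample that follows the schema above.
    sample = {
        "ichunklite": {
            "results": [
                {"viewer_metadata": {"original_image": {"url": "https://example.com/a.jpg"}}},
                {"viewer_metadata": {}},
            ]
        }
    }
    print(extract_image_urls(sample))
```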

Download the Image

Use whatever you want to download the image. Risky Python below:

import requests

def download(url, file_path):
    response = requests.get(url)
    if response.status_code == 200: # Successful request
        with open(file_path, 'wb') as img:
            img.write(response.content)

Be sure to add some error handling if you plan to use this in a project.
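As one possible starting point for that error handling, here is a sketch of a more defensive download function. To stay dependency-free it swaps `requests` for the standard library's `urllib.request`, which is a substitution on my part and not what the snippet above uses. The `timeout` default and the choice to return True/False instead of raising are also just assumptions for the example.

```python
import urllib.request


def download(url, file_path, timeout=10):
    """Save the resource at `url` to `file_path`; return True on success."""
    try:
        # urlopen raises HTTPError for 4xx/5xx responses and URLError for
        # connection problems; both are OSError subclasses, as are socket
        # timeouts. A malformed URL raises ValueError.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError) as err:
        print(f"skipping {url}: {err}")
        return False

    # Only touch the disk once the whole body has been read, so a failed
    # request never leaves an empty or partial file behind.
    with open(file_path, "wb") as img:
        img.write(data)
    return True
```

Returning a boolean makes it easy to loop over a list of URLs and keep going when one of them is dead, which happens a lot with scraped image links.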

Some Code

Check out my Python implementation on GitHub.