- Published on
Scraping Google Images - No Selenium
- Authors
- Name
- Ryan Talley
Motivation
A few years ago when I first started writing small scripts there were a bunch of repos to scrape images from google. So when I wanted to create a simple mms sender I was surprised to find that all popular solutions use some form of chromium emulation to scrape Google Images (PhantomJS, Selenium, Puppeteer).
Chromium continues to dominate the browser market share, and I don't want to let it creep into my scripts. Scripts should be simple, easy to understand, modify, and replicate. A headless browser is simply an engineering schism for me.
The Issue
Google's used to report a 'meta_rg' class that you could use to scrape their site. Now that's axed and since results are generated dynamically, you either need to find a new source for the data, or another way to parse + load a basic image search.
I explore querying a hanging Google api, but there's a great post by Denis Skopa on extracting image metadata from a basic search. Check it out.
Solution
tldr; Use this url to request a json file. It will contain all links to the original full size image. Parse out the links, and request a download of the image. github
Replace '{query}' with what you're searching for (ex. search?q=linux&)
Replace '{page}' to specify which page of results to request (&ijn=2)
URL Parameters
URL parameters are used to write or communicate queries in the browser. Also called query strings, they help you request information and it's output more specifically from a resource (website). The parameters usually start with a question mark (?) and use ampersands (&) as separators. The equals sign (=) separates the key=value pairs of the query. So we can breakdown the above URL like so:
https:// -> connection protocol, requests resource from port 443.
www.google.com -> domain
/search? -> subdomain and start of URL parameters
q=an+apple -> search for 'an apple'
&tbm=isch -> search for an image
&async=_id:islrg_c,_fmt:json -> get full content results, in json format
&asearch=ichunklite -> return mostly just the search's assets
&ijn=0 -> return page 0 of results (100/pg)
If you want to see what this looks like we can modify the URL to return an html page instead of a json file and your browser will render the request.
Json Structure
Once we have the requests in json format, we just need to parse out the pictures' source url. You can use the data schema below to get a list of urls.
ichunklite
│
└─results
│
├───result_1
│ │
│ └──viewer_metadata
│ │
│ └─original_image
│ │
│ └─url
│
└───result_2
│
└──...
Download the Image
Use whatever you want to download the image. Risky python below
import requests
def download(url, file_path):
response = requests.get(url)
if response.status_code == 200: # Successful request
with open(file_path, 'wb') as img:
img.write(response.content)
Be sure to add some error handling if you plan to use this in a project.
Some Code
Checkout my python implementation on github.