Tutorial

Let's see how to extract data from a sample website, http://books.toscrape.com/. If you prefer, you can watch a video demonstrating the same steps.

First, add the Siteparse Google Sheets add-on via the G Suite Marketplace.

Then make a new spreadsheet and choose Add-ons | Siteparse | Help / Activate. This will enable the SITEPARSE() custom function for this spreadsheet and show brief instructions.

Next, open the target site you're scraping (http://books.toscrape.com/) in a new tab. You need to use a Chrome or Firefox browser.

We're going to select the data from the webpage using XPath queries. Siteparse also supports CSS Selectors, but XPath is more convenient for scraping because it allows you to select attributes of HTML elements and to select elements based on the text they contain. Look at this tutorial for an introduction to XPath language, or see this XPath reference if you already have basic understanding (there also is a CSS Selectors reference).

In most cases you'll be able to scrape data without actually writing the selectors manually, though!

We are going to scrape title, image and price for each book. To get the title, right-click on one of the book titles and select Inspect. A snippet similar to the highlighted one is shown in Chrome Dev Tools.

Right-click on the element and select Copy | Copy XPath.

Paste the XPath in the spreadsheet column where you want to put the titles. Repeat with a different title. Typically the two XPaths are going to differ by a single number. Replace that number with * and you'll get an XPath selecting all titles.

The text of longer titles is truncated, but notice the complete long titles are available in the title attribute of the a element. to select it, complete the XPath with /@title as shown.

You can do the same with price. Scraping the image is a bit different - you can easily retrieve the image address (url) using the XPath:

//*[@id="default"]/div[1]/div/div/div/section/div[2]/ol/li[*]/article/div[1]/a/img/@src

Note that SITEPARSE automatically converts a relative url into an absolute one!

To show a single image wrap the SITEPARSE() function in an IMAGE() function. To get this for all image urls returned, use ARRAYFORMULA():

=ARRAYFORMULA(IMAGE(SITEPARSE($A4, D$3)))

You can examine the complete spreadsheet example, including crawling multi-page listings!