Basic scraping

Try =SITEPARSE(url, "title"), replacing url with the address of the page you want to scrape, to check that the URL is correct and the page can be retrieved. Then substitute "title" with a CSS selector or an XPath that selects the data you need. If the second argument starts with /, it is interpreted as an XPath; otherwise it is assumed to be a CSS selector.
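A rough local model of this dispatch rule, using lxml (site_parse is a made-up stand-in for the add-on, and its CSS handling is deliberately simplified to bare tag names and .class selectors; SITEPARSE itself accepts full CSS):

```python
from lxml import etree

def site_parse(html, selector):
    """Simplified model: leading / means XPath, otherwise treat as CSS."""
    doc = etree.HTML(html)
    if selector.startswith("/"):      # leading slash -> XPath
        return doc.xpath(selector)
    if selector.startswith("."):      # .class -> match on the class attribute
        return doc.xpath(f"//*[contains(concat(' ', @class, ' '), ' {selector[1:]} ')]")
    return doc.xpath("//" + selector)  # bare tag name

html = '<html><head><title>Example</title></head><body><p class="intro">hi</p></body></html>'
print(site_parse(html, "title")[0].text)   # Example
print(site_parse(html, ".intro")[0].text)  # hi
print(site_parse(html, "//title/text()"))  # ['Example']
```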

To get an XPath or CSS selector, right-click the element in the page and choose Inspect element. This opens the Elements tab (Inspector in Firefox) of the browser's developer tools. Then right-click the highlighted element and pick Copy full XPath, Copy XPath or Copy selector. To check which elements a given selector or XPath matches, press Ctrl-F in the Elements tab and paste it: all matches will be highlighted.

Only XPaths can access an HTML element's attributes, such as the URL of a link. A link is coded as <a href="...">link text</a>; the attribute is called href in this example, but URLs can also be found in the src attribute of an <img> tag and elsewhere. An XPath to access the attribute above would end with /a/@href.
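For example, with lxml (the page below is made up for illustration), an @href step returns the attribute values as strings:

```python
from lxml import etree

html = '<html><body><a href="https://example.com/">link text</a></body></html>'
doc = etree.HTML(html)
# //a/@href selects the href attribute of every link on the page
print(doc.xpath("//a/@href"))  # ['https://example.com/']
```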

Scraping multiple pieces of data of the same type in a page

To scrape multiple pieces of data with a single selector (e.g. a column in a table), find the XPaths of two of the elements and replace the part where they differ with *. For example (with illustrative paths), from:

/html/body/table/tr[1]/td[2]
/html/body/table/tr[2]/td[2]

you can make:

/html/body/table/tr[*]/td[2]

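The wildcard trick can be tried with lxml (the table HTML is hypothetical; note that in XPath tr[*] matches tr elements with any child element, which covers every row here since each row holds td cells):

```python
from lxml import etree

html = """<html><body><table>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>20</td></tr>
</table></body></html>"""
doc = etree.HTML(html)

# The specific cells are /html/body/table/tr[1]/td[2] and
# /html/body/table/tr[2]/td[2]; the wildcard hits the whole column:
print(doc.xpath("/html/body/table/tr[*]/td[2]/text()"))  # ['10', '20']
```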
Scraping multiple selectors at once

You can also scrape multiple types of data from the same page with different selectors at once, e.g. a product name and title. Just pass a range of cells containing the selectors as the selector parameter of SITEPARSE() (see this example to retrieve a book title, Kindle price and rating for a book on Amazon at once).
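The idea is to fetch and parse the page once and evaluate every selector against it. A hypothetical equivalent in lxml (sample page and selectors are made up):

```python
from lxml import etree

html = """<html><body>
  <h1 id="title">Example Book</h1>
  <span class="price">$9.99</span>
  <span class="rating">4.5</span>
</body></html>"""

doc = etree.HTML(html)  # parse once...
selectors = ["//h1/text()",
             "//span[@class='price']/text()",
             "//span[@class='rating']/text()"]
rows = [doc.xpath(s) for s in selectors]  # ...evaluate each selector
print(rows)  # [['Example Book'], ['$9.99'], ['4.5']]
```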

Advanced XPath expressions

Some XPath expressions that fit particular use cases are:

  • Select a span element with a given CSS class (myclass):

    //span[contains(concat(' ', @class, ' '), ' myclass')]

  • Select a link (<a> element) whose text starts with mytext:

    //a[starts-with(., 'mytext')]

  • Select a link (<a> element) pointing to a given URL (myurl):

    //a[@href='myurl']

  • Select a <p> element (paragraph) containing mytext:

    //p[contains(., 'mytext')]

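Each of these expressions can be checked with lxml on a small sample page (the HTML below is made up for illustration):

```python
from lxml import etree

html = """<html><body>
  <span class="big myclass">styled</span>
  <a href="https://example.com/page">mytext and more</a>
  <p>some mytext here</p>
</body></html>"""
doc = etree.HTML(html)

# span with class myclass (the class attribute may hold several classes)
assert doc.xpath("//span[contains(concat(' ', @class, ' '), ' myclass')]")
# link whose text starts with mytext
assert doc.xpath("//a[starts-with(., 'mytext')]")
# link pointing to a given URL
assert doc.xpath("//a[@href='https://example.com/page']")
# paragraph containing mytext
assert doc.xpath("//p[contains(., 'mytext')]")
```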
Scrape data behind login

To scrape data behind login forms, you need to:

  1. get the session cookies by submitting a login form, and

  2. include them in subsequent scraping requests, so that the site will serve the data it would serve a logged-in user.
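These two steps can be sketched with Python's standard library; the tiny local server below stands in for the real site, and all paths and cookie names are made up:

```python
import http.cookiejar
import http.server
import threading
import urllib.request

class Site(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/login":
            # "Logging in" hands back a session cookie.
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123")
            self.end_headers()
            self.wfile.write(b"logged in")
        else:
            # /data serves the real content only if the cookie comes back.
            cookie = self.headers.get("Cookie", "")
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"secret" if "session=abc123" in cookie else b"denied")

    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Site)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# One cookie jar shared across calls: cookies captured at login are replayed.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open(base + "/login").read()        # step 1: get the session cookies
data = opener.open(base + "/data").read()  # step 2: include them when scraping
print(data)  # b'secret'
server.shutdown()
```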

You can watch a tutorial video, or look at a demo sheet.

You can describe the sequence of actions in a 2-column range of the spreadsheet and pass it in place of the selectors. The actions can be:

  • If the first column contains a selector / XPath and the second column contains a value, the selector must point to an <input> element, and the value will be filled in

  • If the second column contains the value #CLICK, the element selected in the first column will be clicked

  • You can put #WAIT in the first column and the number of seconds to wait in the second column

  • You can place selectors / XPaths in the first column with an empty second column; the matching data will then be scraped and returned
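A rough Python model of how such an action list could be interpreted (FakePage is a made-up stand-in for the real headless-browser session; the real add-on's internals may differ):

```python
import time

def run_actions(page, actions):
    """Interpret a 2-column action range, returning any scraped data."""
    scraped = []
    for first, second in actions:
        if first == "#WAIT":
            time.sleep(float(second))           # pause before the next action
        elif second == "#CLICK":
            page.click(first)                   # click the selected element
        elif second == "":
            scraped.append(page.scrape(first))  # empty 2nd column -> scrape
        else:
            page.fill(first, second)            # fill an <input> with a value
    return scraped

class FakePage:
    """Records calls instead of driving a browser."""
    def __init__(self):
        self.log = []
    def fill(self, sel, value):
        self.log.append(("fill", sel, value))
    def click(self, sel):
        self.log.append(("click", sel))
    def scrape(self, sel):
        self.log.append(("scrape", sel))
        return f"<data at {sel}>"

page = FakePage()
actions = [
    ("//input[@name='user']", "alice"),
    ("//input[@name='pass']", "secret"),
    ("//button[@type='submit']", "#CLICK"),
    ("#WAIT", "0"),
    ("//div[@class='balance']", ""),
]
result = run_actions(page, actions)
print(result)  # one scraped value, from the row with the empty second column
```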

Whenever a SITEPARSE() call is passed a list of actions to perform, it returns all session cookies, so that they can be passed as a third parameter to subsequent calls to SITEPARSE(). Typically the actions submit information to a form, and the cookies are then used in subsequent calls to scrape the data.

All actions must complete within 30 seconds, or the call will time out.