XPath for Crawling with Scrapy

Ah, XML. The ancestor of HTML and predecessor of JSON. Fifteen years ago, it was the wave of the future, the hip new way to send large amounts of formatted data. Everyone wanted it, and wanted it to contain their information. You couldn’t create a cool new product without using it! Now, its primary purpose (outside of the derivative HTML, of course) seems to be to contain settings and information for various enterprise software platforms.

How the mighty have fallen.

Of course, I’m being more than a little facetious here, but I’ve known of many people using this line of reasoning to justify not learning about XML or its associated tools and syntaxes. While we might encounter CSS selectors and JSON parsers in day-to-day coding, tools built specifically for XML, such as XPath, often fall out of favor. This is extremely unfortunate.

XPath, designed to extract data from XML documents, and CSS selectors, designed to select elements from HTML documents, can both be used with HTML. Most HTML parsing and web crawling libraries (lxml, Selenium, Scrapy -- with the notable exception of BeautifulSoup) are compatible with both.

While CSS selectors are great, and they’re constantly rolling out new and better features that make them greater, they were still specifically designed for styling. When the going gets tough, it’s 4 a.m., and you’re trying to parse some god-awful website with convoluted HTML that looks like it was written in Notepad by an 8-year-old, do you want a selector syntax that was written to make it easy for website designers to put pretty background colors on things? No! What you need is a selector syntax that was designed to dig through crap and target elements with precision and flexibility! You need XPath.

Here is a page:

<html>
    <div class="large" id="content">
        <span>A line of text</span><br/>
        <span><a href="http://google.com">A link</a></span>
    </div>
    <div class="short" id="footer">
    </div>
</html>

 

Let’s take a look at some XPath basics that can be used to select elements on this page:

/html

This selects the root element, the <html> tag. Pretty easy, right?

And if I want to select that link inside the first div in the page, I can use:

/html/div/span/a

This selects all the <a> tags that are children of <span> tags that are children of <div> tags, starting from the root <html> element.

But web pages have a lot of nested elements in them, so what if I want to immediately drill down to a tag without having to start with “/html”?

//a

will select all of the <a> tags on the page. In this case, it will select the only <a> tag on the page.

Similarly,

//div/span/a

Will select the same element.
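
To see these in action, here is a minimal sketch that runs the expressions above against the sample page. It uses lxml (one of the XPath-capable libraries mentioned earlier) and parses the snippet as XML so the structure stays exactly as written; a forgiving HTML parser may insert elements such as <body>, which is one more reason the // form is usually the safer bet on real pages.

from lxml import etree

page = """<html>
    <div class="large" id="content">
        <span>A line of text</span><br/>
        <span><a href="http://google.com">A link</a></span>
    </div>
    <div class="short" id="footer">
    </div>
</html>"""

root = etree.fromstring(page)

print(root.xpath('/html/div/span/a'))    # a one-element list holding the <a> element
print(root.xpath('//a')[0].get('href'))  # http://google.com
print(root.xpath('//div/span/a'))        # the same <a> element again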

Using XPath, I can also move around the tree of XML (er, HTML) tags by using:

..

Those of you who have used computer terminals may recognize this as the “pop me up in the directory structure” command. Similarly, this will select the parent element of the currently selected element. For example:

//a/..

Will select the parent of the only <a> tag on the page: a <span> element.
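
Continuing the little lxml sketch from above:

parent = root.xpath('//a/..')[0]
print(parent.tag)   # 'span' -- the parent of the only <a> on the page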

Of course, there’s more to XML than tags, so we also need to figure out how to select elements by attributes.

//div[@class]

Selects the <div> tags on the page with an attribute called “class.”

Now, here’s where the pro-CSS Selector group tends to get a bit crazy. With CSS selectors, you don’t have to specifically type out “class” -- the CSS Selectors, already geared towards HTML, use the simple notation of preceding class attribute values with a dot (WARNING: The following is not XPath, it is a CSS Selector!):

div.large

Which selects all the <div> elements with a class value of “large.” However, can CSS selectors select all the elements that have a class attribute, regardless of its value? Can they select only the elements without a class? Not without getting into some Regular Expressions madness, if your library of choice supports it. Similarly, if someone is using a custom attribute in their HTML (perhaps for their own internal business logic), CSS selectors will not be able to support it, while XPath will handle it just as easily as the built-in HTML attributes like “class” and “id.” It’s all the same to XPath!

And, of course, if you want to do the equivalent of the above CSS selector in XPath, you can write:

//div[@class='large']

A little longer, yes, but far more flexible for unusual situations.
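
To make that concrete, here are the attribute selections from this section run against the same parsed sample page (reusing the root element from the earlier sketch):

print(root.xpath('//div[@class]'))                        # both <div>s -- each has a class attribute
print(root.xpath('//div[not(@class)]'))                   # <div>s with no class at all: an empty list here
print(root.xpath('//div[@class="large"]')[0].get('id'))   # 'content'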

Another nice feature, for dealing with lists of sibling elements, is the ability to select individual elements based on their index:

//div[1]

selects the first div element on the page. (Unlike array indexes in most programming languages, XPath positions start at 1, not 0.)

You can also do this with the identically-functioning expression:

//div[position() = 1]

Another interesting feature that XPath has that CSS Selectors do not is the ability to select tags based on their content using the “text()” function:

//span[text()='A line of text']

Will select the <span> element surrounding “A line of text.”
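
Once more against the parsed sample page from the earlier sketch:

span = root.xpath('//span[text()="A line of text"]')[0]
print(span.text)   # 'A line of text'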

There are other functions like “text().” These include contains(), last(), and position(), among others. The first of these, contains(), can be handy for identifying elements based on their attributes or contents. The other two can be handy for selecting elements based on their position in large groups of sibling elements.

Here’s some HTML we’ll test these functions out on:

<ul>
<li id="1">Thing one</li>
<li id="2">Thing two</li>
<li id="3">Thing three</li>
<li id="4">Thing four</li>
</ul>

 

Let’s take a look at the “contains()” function:

//li[contains(text(), "Thing ")]

This takes the text value of each of the <li> elements, checks whether it contains “Thing ” (note the space after the word), and selects the elements for which that is true.

The expression:

//li[contains(text(), "Thing t")]

Will return only the elements containing the text “Thing two” and “Thing three.” Try doing that with CSS selectors!

The last() and position() functions are relatively straightforward:

//li[last()]

Will return the last item in the list, “Thing four.”

//li[position() < 3]

Will return the first two items in the list (whose positions are 1 and 2 -- both less than three).
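
Here is a quick sketch of those list selections, parsing the <ul> snippet above with lxml as before:

from lxml import etree

ul = etree.fromstring("""<ul>
<li id="1">Thing one</li>
<li id="2">Thing two</li>
<li id="3">Thing three</li>
<li id="4">Thing four</li>
</ul>""")

print([li.text for li in ul.xpath('//li[contains(text(), "Thing t")]')])
# ['Thing two', 'Thing three']
print(ul.xpath('//li[last()]')[0].text)                      # 'Thing four'
print([li.text for li in ul.xpath('//li[position() < 3]')])  # ['Thing one', 'Thing two']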

XPath also has its own sort of built-in wildcard syntax, a little like a very lightweight regular expression. This can come in handy in situations where a library may not support regular expressions, or where regular expressions would be inconvenient to use.

The following selects all of the elements in the document (recursively), that have at least one attribute:

//*[@*]

Where the asterisk, of course, acts as a wildcard. If you’re looking for this type of functionality, you might also be interested in the union operator, a pipe: |. This combines multiple selections and returns elements that match either one. This example (using the first HTML example) returns the divs with either the class “large” or the class “short”:

//div[@class="large"] | //div[@class="short"]
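
And one last check against the parsed sample page from the first example:

print(len(root.xpath('//*[@*]')))   # 3 -- the two <div>s and the <a> each carry at least one attribute
both = root.xpath('//div[@class="large"] | //div[@class="short"]')
print([d.get('id') for d in both])  # ['content', 'footer']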

So that’s the crash course in XPath. There are some other features that aren’t covered here, as well as some neat ways of combining features, but this should be enough to get you started, at least as far as parsing HTML is concerned.

But the question remains: Once you’ve come up with your XPath statement, how do you actually use it in your web scrapers? Let’s look at Scrapy for an example of how this can be done.

This example will use a crawler that scrapes Wikipedia pages, going from article to article by following internal links. On each page, it will identify a few pieces of information and put them in an item object. These pieces of information are:

  • The article title

  • A list of links on the page (internal links to other articles)

  • The last modified date on the page (found in the footer)

The Python for the Article item looks like this:

from scrapy import Item, Field

class Article(Item):
    title = Field()
    links = Field()
    lastModified = Field()

And the Scrapy code for the spider looks something like this:

# These import paths match the older Scrapy releases where SgmlLinkExtractor lives;
# newer versions use scrapy.spiders and LinkExtractor instead.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),), callback="parse_item", follow=True)]

    def parse_item(self, response):
        # <PARSING CODE HERE>
        return item

 

The “parse_item” function needs to be filled out with a few XPath expressions in order to extract all of the information we want from the response object provided by Scrapy.

Getting the title from the page is fairly straightforward. It is the first (and only) <h1> tag on the page. We can use Scrapy’s “response.xpath” function in order to do this:

title = response.xpath('//h1/text()')[0].extract()

This fetches the list of inner texts of all <h1> tags, takes the first element (remember, there should only be one in the list, but indexing converts it from a list to a single selector), and uses Scrapy’s “extract()” to convert it to text data.

Getting the last modified date is a little more complicated, but not too bad:

lastMod = response.xpath('//li[@id="footer-info-lastmod"]/text()')[0].extract()

And we can use Python (not XPath) to clean up the text a bit, leaving us with only the date:

lastMod = lastMod.replace("This page was last modified on ", "")

The tricky part here is the links on the page. Wikipedia internal article links have two properties in common:

  • They begin with “/wiki/”

  • They do not contain a colon (“:”) character. This is reserved for special pages (such as history or talk pages)

We also want to make sure we are selecting the value of the attribute (the “href” attribute value -- the actual link) rather than the contents of the <a> tag. This can be performed with the following, more complex XPath selector:

//a[starts-with(@href, "/wiki/") and not(contains(@href,":"))]/@href

This selects all the <a> tags, limited to ones whose href attribute starts with “/wiki/” AND does not contain “:”. It then drills down into the href attribute itself and selects that content. There are a couple of functions here that haven’t been discussed yet (“starts-with”, the “and” operator, “not”), although their functionality should be pretty straightforward, given what you’ve learned about XPath syntax so far.

Putting this all together, we can fill out the “parse_item” function in the Scrapy crawler like this:

    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        links = response.xpath('//a[starts-with(@href, "/wiki/") and not(contains(@href,":"))]/@href').extract()
        lastMod = response.xpath('//li[@id="footer-info-lastmod"]/text()')[0].extract()
        lastMod = lastMod.replace("This page was last modified on ", "")
        item['title'] = title
        item['links'] = links
        item['lastModified'] = lastMod
        return item

This should run just fine and grab titles, lists of links, and last modified dates for every article encountered! The complete code for this crawler can be downloaded as a zip file here.
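
If you are building the project yourself rather than downloading it, note that the spider assumes a standard Scrapy project layout (the kind created by “scrapy startproject”), with the Article item defined in the project’s items module. From the project directory, the crawl can then be kicked off from the command line, writing the scraped items to a file of your choosing (the filename here is just an example):

scrapy crawl article -o articles.json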

 

 

 
