Posts to Scrape

  • Second Edition Coming this Fall!

    As much as I like books, they do have one major problem: print doesn't update automatically. The good news is, I can update it manually! The second edition of Web Scraping with Python will be coming out this Fall. I'm currently working on the following major changes:

    • Updates of all libraries. This is especially important for BeautifulSoup, which now requires passing in an explicit HTML parser (see the short example after this list). 
    • Less reliance on external websites (they tend to update, move, or go away). Examples will use http://pythonscraping.com whenever possible. 
    • New chapters for the following topics:
      • Scrapy (updated to use Python 3! It had just a section in the first edition, but now it gets its own chapter)
      • Distributed web scraping
      • Advanced web crawling and scraping patterns -- designing large scale scrapers from basic principles. 
    • Moving code examples to IPython notebooks.
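
    For anyone updating old code along with the new edition, the BeautifulSoup change amounts to one extra argument. A minimal sketch (the HTML string is just a stand-in; "html.parser" is the built-in parser, and "lxml" is a common alternative):

    from bs4 import BeautifulSoup

    html = "<html><body><h1>Hello</h1></body></html>"

    # Old style: works, but newer versions of BeautifulSoup warn that no parser was specified
    soup = BeautifulSoup(html)

    # New style: name the parser explicitly
    soup = BeautifulSoup(html, "html.parser")
    print(soup.h1.get_text())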

    In addition, I'm implementing a more philosophic change with this edition. Previously, I tried to keep all the code samples in line with what was actually printed in the text, even if the code samples relied on old websites or out of date versions of libraries. This was done to avoid confusing readers; I wanted them to see the same code on the screen that they were reading in the book. However, this time around I'll be keeping code samples as updated and working as possible (and will be accepting pull requests!) regardless of how they might diverge from the code written in the book. 

    Although the specific topics and outline are fairly set in stone at this point, I'm always willing to accept requests and feedback! Hope you enjoy.

  • XPath for Crawling with Scrapy

    Ah, XML. The ancestor of HTML and predecessor of JSON. Fifteen years ago, it was the wave of the future, the hip new way to send large amounts of formatted data. Everyone wanted it, and wanted it to contain their information. You couldn't create a cool new product without using it! Now, its primary purpose (outside of its derivative, HTML, of course) seems to be containing settings and configuration for various enterprise software platforms.

    How the mighty have fallen.

    Of course, I'm being more than a little facetious here, but I've known many people to use this line of reasoning to justify not learning about XML or its associated tools and syntaxes. While we might encounter CSS selectors and JSON parsers in day-to-day coding, tools built specifically for XML, such as XPath, often fall out of favor. This is extremely unfortunate.

    XPath, designed to extract data from XML documents, and CSS selectors, designed to select elements from HTML documents, can both be used with HTML. Most HTML parsing and web crawling libraries (lxml, Selenium, Scrapy -- with the notable exception of BeautifulSoup) are compatible with both.
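
    To make the comparison concrete, here is a quick sketch using Scrapy's Selector class (the HTML string is just a stand-in), which happily accepts either syntax:

    from scrapy.selector import Selector

    html = '<div id="content"><a href="http://google.com">A link</a></div>'
    sel = Selector(text=html)

    # The same element, selected two ways
    print(sel.xpath('//div[@id="content"]/a/@href').extract())  # XPath
    print(sel.css('div#content a::attr(href)').extract())       # CSS selector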

    While CSS selectors are great, and they're constantly rolling out new and better features that make them greater, they were still specifically designed for styling. When the going gets tough, it's 4 a.m., and you're trying to parse some god-awful website with convoluted HTML that looks like it was written in Notepad by an 8-year-old, do you want a selector syntax that was written to make it easy for website designers to put pretty background colors on things? No! What you need is a selector syntax that was designed to dig through crap and target elements with precision and flexibility! You need XPath.

    Here is a page:

    <html>
        <div class="large" id="content">
            <span>A line of text</span><br/>
            <span><a href="http://google.com">A link</a></span>
        </div>
        <div class="short" id="footer">
        </div>
    </html>


    Let’s take a look at some XPath basics that can be used to select elements on this page:

    /html

    This selects the root element, the <html> tag. Pretty easy, right?

    And if I want to select that link inside the first div in the page, I can use:

    /html/div/span/a

    This selects all the <a> tags that are children of <span> tags that are children of <div> tags directly under the root <html> tag.

    But web pages have a lot of nested elements in them, so what if I want to immediately drill down to a tag without having to start with “/html”?

    //a

    will select all of the <a> tags on the page. In this case, it will select the only <a> tag on the page.

    Similarly,

    //div/span/a

    Will select the same element.
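
    If you'd like to follow along in Python, here's a minimal sketch using lxml (parsing the sample page as XML with lxml.etree, which works here because the snippet happens to be well-formed); the later snippets in this post assume the same tree object:

    from lxml import etree

    page = '''<html>
        <div class="large" id="content">
            <span>A line of text</span><br/>
            <span><a href="http://google.com">A link</a></span>
        </div>
        <div class="short" id="footer"></div>
    </html>'''

    tree = etree.fromstring(page)
    print(tree.xpath('/html'))             # the root element
    print(tree.xpath('/html/div/span/a'))  # absolute path down to the link
    print(tree.xpath('//a'))               # the same link, found anywhere in the document
    print(tree.xpath('//div/span/a'))      # and again, without starting from the root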

    Using XPath, I can also move around the tree of XML (er, HTML) tags by using:

    ..

    Those of you who have used computer terminals may recognize this as the “pop me up in the directory structure” command. Similarly, this will select the parent element of the currently selected element. For example:

    //a/..

    Will select the parent of the only <a> tag on the page: a <span> element.
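
    Continuing the lxml sketch (tree is the parsed sample page from above):

    parent = tree.xpath('//a/..')[0]
    print(parent.tag)  # 'span' -- the parent of the link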

    Of course, there’s more to XML than tags, so we also need to figure out how to select elements by attributes.

    //div[@class]

    Selects the <div> tags on the page with an attribute called “class.”

    Now, here's where the pro-CSS-selector group tends to get a bit crazy. With CSS selectors, you don't have to specifically type out "class" -- CSS selectors, already geared towards HTML, use the simple notation of preceding class attribute values with a dot (WARNING: the following is not XPath, it is a CSS selector!):

    div.large

    Which selects all the <div> elements with a class value of "large." However, can CSS selectors select all the elements that have any class at all, regardless of the class's value? Can they select only the elements without a class? Not without getting into some regular-expressions madness, if your library of choice even supports it. Similarly, if someone is using a custom attribute in their HTML (perhaps for their own internal business logic), CSS selectors will not be able to support it, while XPath will handle it just as easily as the built-in HTML attributes like "class" and "id." It's all the same to XPath!

    And, of course, if you want to do the equivalent of the above CSS selector in XPath, you can write:

    //div[@class='large']

    A little longer, yes, but far more flexible for unusual situations.
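
    In lxml terms again (still using the tree parsed from the sample page):

    print(tree.xpath('//div[@class]'))          # both divs -- each has a class attribute
    print(tree.xpath('//div[@class="large"]'))  # only the "content" div
    print(tree.xpath('//div[@id="footer"]'))    # any attribute works the same way, custom or built in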

    Another nice feature, for dealing with lists of sibling elements, is the ability to select individual elements based on their index:

    //div[1]

    selects the first div element on the page. Note that, unlike Python lists, XPath indices start at 1, not 0.

    You can also do this with the identically functioning expression:

    //div[position() = 1]
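
    A quick check with lxml (again using the tree parsed from the sample page above):

    first_div = tree.xpath('//div[1]')[0]
    print(first_div.get('id'))                      # 'content'
    print(tree.xpath('//div[position() = 1]/@id'))  # ['content'] -- the same selection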

    Another interesting feature that XPath has that CSS Selectors do not is the ability to select tags based on their content using the “text()” function:

    //span[text()='A line of text']

    Will select the span element surrounding "A line of text."

    There are other functions like "text()." These include contains(), last(), and position(), among others. The first of these, contains(), is handy for identifying elements based on their attributes or contents. The other two are handy for selecting elements based on their position in large groups of sibling elements.

    Here’s some HTML we’ll test these tags out on:

    <ul>
    <li id="1">Thing one</li>
    <li id="2">Thing two</li>
    <li id="3">Thing three</li>
    <li id="4">Thing four</li>
    </ul>


    Let’s take a look at the “contains()” function:

    //li[contains(text(), "Thing ")]

    This takes the text value of each of the li elements, checks whether it contains "Thing " (note the space after the word), and selects the element if it does.

    The expression:

    //li[contains(text(), "Thing t")]

    Will return only the elements containing the text “Thing two” and “Thing three.” Try doing that with CSS selectors!

    The last and position functions are relatively straightforward:

    //li[last()]

    Will return the last item in the list, "Thing four."

    //li[position() < 3]

    Will return the first two items in the list (whose positions are 1 and 2 -- both less than three).
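
    Here's how those expressions behave when run through lxml (the list HTML from above, parsed with lxml.etree since it happens to be well-formed):

    from lxml import etree

    ul = etree.fromstring('''<ul>
    <li id="1">Thing one</li>
    <li id="2">Thing two</li>
    <li id="3">Thing three</li>
    <li id="4">Thing four</li>
    </ul>''')

    print(ul.xpath('//li[contains(text(), "Thing t")]/text()'))  # ['Thing two', 'Thing three']
    print(ul.xpath('//li[last()]/text()'))                       # ['Thing four']
    print(ul.xpath('//li[position() < 3]/text()'))               # ['Thing one', 'Thing two']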

    XPath also has its own sort of built-in Regular Expressions-type language. This can come in handy in situations where a library may not support regular expressions, or where regular expressions might be inconvenient to use.

    The following selects all of the elements in the document (recursively), that have at least one attribute:

    //*[@*]

    Where the asterisk, of course, acts as a wildcard. If you're looking for this type of functionality, you might also be interested in the "or" operator, written as a pipe (|). This can select multiple types of elements and return those that match either expression. This example (using the first HTML sample) returns divs with either the class "large" or "short":

    //div[@class="large"] | //div[@class="short"]

    So that's the crash course in XPath. There are some other features that aren't covered here, as well as some neat feature-combination techniques, but this should be enough to get you started, at least as far as parsing HTML is concerned.

    But the question remains: Once you’ve come up with your XPath statement, how do you actually use it in your web scrapers? Let’s look at Scrapy for an example of how this can be done.

    This example will use a crawler that scrapes Wikipedia pages, going from article to article and following internal links. On each page, it will identify a few pieces of information and put them into an item object. These pieces of information are:

    • The article title

    • A list of links on the page (internal links to other articles)

    • The last modified date on the page (found in the footer)

    The Python for the Article item looks like this:

    from scrapy import Item, Field

    class Article(Item):
        title = Field()
        links = Field()
        lastModified = Field()

    And the Scrapy code for the spider looks something like this:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ArticleSpider(CrawlSpider):
        name = "article"
        allowed_domains = ["en.wikipedia.org"]
        start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
        rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),),
                      callback="parse_item", follow=True)]

        def parse_item(self, response):
            <PARSING CODE HERE>
            return item


    The function “parse_item” needs to be filled out with a few XPath rules in order to extract all the information we want from the page response object provided by Scrapy.

    Getting the title from the page is fairly straightforward. It is the first (and only) <h1> tag on the page. We can use Scrapy’s “response.xpath” function in order to do this:

    title = response.xpath('//h1/text()')[0].extract()

    This fetches a list of the inner texts of h1 tags, takes the first element (remember, there should only be one in the list, but this converts it from a list to a single object), and uses Scrapy's extract() function to convert it to text data.

    Getting the last modified date is a little more complicated, but not too bad:

    lastMod = response.xpath('//li[@id="footer-info-lastmod"]/text()')[0].extract()

    And we can use Python (not XPath) to clean up the text a bit, leaving us with only the date:

    lastMod = lastMod.replace("This page was last modified on ", "")

    The tricky part here is the links on the page. Wikipedia internal article links have two properties in common:

    • They begin with “/wiki/”

    • They do not contain a colon (":") character. The colon is reserved for special pages (such as history or talk pages).

    We also want to make sure we are selecting the value of the attribute (the “href” attribute value -- the actual link) rather than the contents of the <a> tag. This can be performed with the following, more complex XPath selector:

    //a[starts-with(@href, "/wiki/") and not(contains(@href,":"))]/@href

    This selects all the <a> tags, limited to the ones whose href attribute starts with "/wiki/" AND does not contain ":". It then drills down into the href attribute itself and selects its content. There are a couple of functions here that haven't been discussed yet (starts-with(), the "and" operator, and not()), although their functionality should be pretty straightforward, given what you've learned about XPath syntax so far.

    Putting this all together, we can fill out the “parse_item” function in the Scrapy crawler like this:

        def parse_item(self, response):
            item = Article()
            title = response.xpath('//h1/text()')[0].extract()
            links = response.xpath('//a[starts-with(@href, "/wiki/") and not(contains(@href,":"))]/@href').extract()
            lastMod = response.xpath('//li[@id="footer-info-lastmod"]/text()')[0].extract()
            lastMod = lastMod.replace("This page was last modified on ", "")
            item['title'] = title
            item['links'] = links
            item['lastModified'] = lastMod
            return item

    This should run just fine and grab titles, lists of links, and last modified dates for every article encountered! The complete code for this crawler can be downloaded as a zip file here.


  • Selenium Headers

    So, after my DefCon talk a few weeks ago, someone came up to me and mentioned that something I had said -- that Selenium browsers have the same headers as regular browsers -- was incorrect. He said that there were subtle differences between the headers that Selenium sent and the headers that a normal, human-operated browser would send. I only had the chance to speak with him very briefly, so I may have misunderstood, but I thought I'd put this to the test and see what the deal is between normal browser headers and headers as seen through the Selenium Python library. Obviously, some might point out that this sort of thing is probably Googlable -- however, I invite them to try googling things like "Selenium different headers" and see where that gets you. No such luck finding anything relevant there!

    In the interest of totally unofficial research, I'm using the page http://www.procato.com/my+headers/ as a neutral third-party judge of browser headers.

    Firefox

    Headers via a human opening browser:

    Host    www.procato.com
    User-Agent    Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:35.0) Gecko/20100101 Firefox/35.0
    Accept    text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language    en-US,en;q=0.5
    Accept-Encoding    gzip, deflate
    Connection    keep-alive

    Headers via Selenium:

    Host www.procato.com
    User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:35.0) Gecko/20100101 Firefox/35.0
    Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language en-US,en;q=0.5
    Accept-Encoding gzip, deflate
    Referer http://www.procato.com/my+headers/
    Connection keep-alive

    Chrome:

    Headers via a human opening the browser:

    Host    www.procato.com
    Connection    keep-alive
    User-Agent    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
    Accept    */*
    Referer    http://www.procato.com/my+headers/
    Accept-Encoding    gzip, deflate, sdch
    Accept-Language    en-US,en;q=0.8

    Headers via Selenium:

    Host www.procato.com
    Connection keep-alive
    User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
    Accept */*
    Referer http://www.procato.com/my+headers/
    Accept-Encoding gzip, deflate, sdch
    Accept-Language en-US,en;q=0.8

    So far, it looks like the headers are exactly the same (as I'd expect, given that, fundamentally, the exact same software is being used to make the exact same request here). Just for fun, let's take a look at the headers that PhantomJS (obviously, via Selenium) is sending:

    User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.8 Safari/534.34
    Referer http://www.procato.com/my+headers/
    Accept */*
    Connection Keep-Alive
    Accept-Encoding gzip
    Accept-Language en-US,*
    Host www.procato.com

    Well, there's certainly a big fat "PhantomJS" in there. If you were checking headers for potential bots, I'd say you'd want to block that. However, I haven't been able to figure out what he was talking about, other than that he maybe meant "block things with PhantomJS in the headers" and I simply misunderstood. Anyway, if anyone has any advice or pointers, I welcome them in the comments!
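
    If the giveaway User-Agent is a concern, one workaround (a hedged sketch -- the capability key is PhantomJS's page-settings mechanism, and the spoofed string below is just an example) is to hand PhantomJS a different User-Agent when creating the driver:

    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Copy PhantomJS's default capabilities and swap in a Firefox-looking User-Agent
    caps = DesiredCapabilities.PHANTOMJS.copy()
    caps['phantomjs.page.settings.userAgent'] = (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:35.0) '
        'Gecko/20100101 Firefox/35.0')

    driver = webdriver.PhantomJS(desired_capabilities=caps)
    driver.get('http://www.procato.com/my+headers/')
    print(driver.page_source)
    driver.quit()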

  • Terms of Service and Robots.txt

    It's a commonly held rule of thumb: "If you're going to scrape the web, make sure you follow the Terms of Service and robots.txt" in order to avoid trouble. And sure, if you follow both of these things (and rate limit your bots, of course) you probably won't be getting any cease and desist letters any time soon. But is following the TOS and robots.txt necessary for avoiding trouble? Not really. 

    Let's start with robots.txt -- this is the weaker of the two documents, legally speaking. And by "weaker of the two" I mean "it means nothing." This is because the Robots Exclusion Standard is an unofficial standard: the IETF doesn't recognize it, no government recognizes it, and it lives at a strange, obscure location that often isn't linked from anywhere else on the site. Depending on robots.txt to stop bots is like having an open storefront with a note under your doormat that says "Do Not Enter!" and then complaining that no one follows the "note under the doormat" protocol that you and your friends came up with.

    In fact, robots.txt can sometimes promote unwanted bot activity by acting as a listing for all of the locations you don't want scraped on the site. Obviously, security through obscurity is hardly security worth having, but you might think twice about putting out a sign that says "do not go to our hidden login page at this location" for every bot that wanders by. 
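
    That said, if you do decide to honor robots.txt in your own scrapers, Python's standard library will parse it for you. A minimal sketch (the Wikipedia URL is just an example target):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('http://en.wikipedia.org/robots.txt')
    rp.read()

    # Check whether a generic bot ('*') is allowed to fetch a given URL
    print(rp.can_fetch('*', 'http://en.wikipedia.org/wiki/Python_%28programming_language%29'))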

    The Terms of Service, of course, does not directly prevent bots by itself, but it can give you better legal standing for prosecuting or suing the companies or individuals that control them, in certain circumstances. If you have a TOS at the bottom of your page and users never have to explicitly agree to it, this is known as a "browsewrap" and is legally unenforceable, as courts have shown time and time again. Of course, if you find that sites are actually blocking you (usually by blocking your IP address) and sending you cease-and-desist letters based on their browsewrap TOS, you do have to stop scraping their site. They can't take you to court based on the TOS alone, but they can take additional action to block you from the site, and then take you to court if you try to circumvent that.

    However, if you have a TOS that users actually have to agree to in order to use the site or create an account, you can enforce the terms without much of this additional action. Of course, in order to enforce the terms of even a "clickwrap" TOS (the kind where users actually have to click to agree), you still need to gather the evidence, identify the violators, put together a case, lawyer up, and file with the courts. Obviously, that takes time and money, and large companies are unlikely to follow through if you haven't caused any lasting damage and you immediately cease and desist when notified to do so.

    So, keep proceeding with an overabundance of caution if you must, but, in my experience, it's best to proceed with a little knowledge, research, and calculated risks ;)

  • Scraping with JavaScript

    One Amazon reviewer recently mentioned this in a review:

    At the time of publication the world is awash in Javascript-littered websites. Producing a book that dedicates only a few of its pages to scraping web pages after or while Javascript is running makes this book an anachronism and of reduced current value. I don't mean this to come across as harsh, but this is a 6-star book for scraping Tripod and Angelfire sites.

    Initially, I wrote a comment response to the reviewer and left it at that, but then another person on Twitter mentioned the same thing -- a small chapter on JavaScript just isn't enough, when most websites today use JavaScript.

    Because this seems to be a common reaction, and because I think it's a very interesting topic -- "what is JavaScript and how do you scrape it?" -- I'd like to address it in greater detail here, and explain why I wrote the book the way I did, and what I will change in upcoming editions.

    To understand how to approach the problem of scraping JavaScript, you have to look at what it does. Ultimately, all it does is modify the HTML and CSS on the page, and send requests back to the server. "But, but..." you might be thinking, "what about drag and drop? Or JavaScript animations? It makes pretty things move!" Just HTML and CSS changes. All of them. Ajax loading -- HTML and CSS changes. Logging users in via an Ajax form -- a request back to the server followed by HTML and CSS changes.

    Yes, sure, you can scrape the JavaScript itself, and in some cases this can be useful -- such as scraping latitudes and longitudes directly from code that powers a Google Map, rather than scraping the generated HTML itself. And, in fact, this is one technique I mention in the book. However, 99% of the time, what you're going to be doing (and what you can fall back on in any situation), is executing the JavaScript (or interacting with the site in a way that triggers the JavaScript), and scraping the HTML and CSS changes that result. 

    Contrary to what seems to be popular belief, scraping, parsing, cleaning, and analyzing HTML isn't useless in the world of JavaScript -- it's necessary! HTML is HTML is HTML, whether it's generated by JavaScript on the front end or by a PHP script on the back end. In the case of PHP, the server takes care of the hard work for you; in the case of JavaScript, you have to do that yourself.

    But how? If you've read the book, you already know the answer: Selenium and PhantomJS. 

    from selenium import webdriver
    import time
    driver = webdriver.PhantomJS(executable_path='')
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
    time.sleep(3)
    print(driver.find_element_by_id("content").text)
    driver.close()

    These seven lines (including the print statement) can solve your Ajax loading problems. Note: there are also more robust ways of waiting -- checking whether a particular element on the page has loaded before returning, as shown in the sketch below -- but waiting a few seconds usually works fine as well.
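
    Here's what that more robust approach might look like with Selenium's WebDriverWait and expected conditions (the element id "loadedButton" is a placeholder -- substitute whatever element only appears after the Ajax call finishes):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS(executable_path='')
    driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")

    # Block for up to 10 seconds, or until the target element shows up
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))

    print(driver.find_element_by_id("content").text)
    driver.close()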

    But, of course, there's another class of HTML and CSS changes JavaScript can create -- those are user-triggered. And in order to get user-triggered changes, well, the user has to trigger the page. In Chapter 13, "Testing with Selenium," I discuss these in detail. 

    Key to this sort of testing is the concept of Selenium elements. This object was briefly encountered in Chapter 10, and is returned by calls like:
    usernameField = driver.find_element_by_name('username')

    Just as there are a number of actions you can take on various elements of a website in your browser, there are many actions Selenium can perform on any given element. Simple clicks can be sent to the element directly, while the fancier interactions go through Selenium's ActionChains class. Among these are:

    from selenium.webdriver import ActionChains

    myElement.click()                    # simple actions can go to the element directly
    actions = ActionChains(driver)       # more elaborate interactions go through ActionChains
    actions.click_and_hold(myElement)
    actions.release(myElement)
    actions.double_click(myElement)
    actions.send_keys_to_element(myElement, "content to enter")
    actions.perform()

    All of these actions can be strung together in chains, put into functions to act on variable elements, and can even be used to drag and drop elements (see Github: https://github.com/REMitchell/python-scraping/blob/master/chapter13/4-dr...)
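
    As a taste of that chaining, here's a minimal drag-and-drop sketch (sourceElement and targetElement are placeholders for elements you've already located with the driver):

    from selenium.webdriver import ActionChains

    # Chain the whole gesture and run it with a single perform()
    ActionChains(driver).drag_and_drop(sourceElement, targetElement).perform()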

    After your JavaScript has been executed, whether it's something you had to wait around for or something you had to take action to make happen -- you scrape the resulting HTML! That's all covered in the first half of the book. Let me say that again: knowing how to scrape HTML is not just good for (as one reviewer put it) scraping Angelfire and Geocities sites -- you need it to scrape every site, whether it's loaded with JavaScript, a server-side script, or monkey farts*. If there's content you can see in your browser, there's HTML there. You don't need special tools to scrape JavaScript pages (other than the tools necessary to execute the JavaScript, or trigger it to execute), just like you don't need special tools to scrape .aspx pages and PHP pages.

    So there you have it: in just a few paragraphs, I've covered all you need to know to scrape every JavaScript-powered website. In the book, I devote a full 10 pages to the topic, followed by sections in later chapters that revisit Selenium and JavaScript execution. In future editions, I will likely take some time to explain why you don't need an entire book devoted to "scraping JavaScript sites," but rather that the information about scraping websites in general is relevant -- and necessary -- to scraping JavaScript. Hindsight is 20/20!


    *I know someone's going to take this as an opportunity to mention Flash, Silverlight, or other third-party browser plugins. I know, I know. You don't have to mention it. I'm hoping they go away! Setting aside the extra software you have to add to your browser to make them work, however, this principle holds true.