Posts to Scrape

Second Edition Is Out!

Well, the second edition has been out for a few months now, but the nice thing about being primarily an author of books (as opposed to an author of blog posts) is that you're expected to produce a lot of content all at once rather than churn out continuous updates!

A lot of people have asked me about the changes between the first and second editions. The publishing industry gets a bad reputation for releasing "editions" with minor updates, but I promise, this is a good one. Four new chapters:

Second Edition Coming this Fall!

As much as I like books, they do have one major problem: print doesn't update automatically. The good news is, I can update it manually! The second edition of Web Scraping with Python will be coming out this fall. I'm currently working on the following major changes:

XPath for Crawling with Scrapy

Ah, XML. A sibling of HTML and predecessor of JSON. Fifteen years ago, it was the wave of the future, the hip new way to send large amounts of formatted data. Everyone wanted it, and wanted their information in it; you couldn't launch a cool new product without it. Now, its primary purpose (outside of its derivative, XHTML, of course) seems to be to contain settings and configuration for various enterprise software platforms.

How the mighty have fallen.
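Even so, XPath remains the workhorse query language for crawling with Scrapy. As a minimal illustration of the syntax (using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath, and a made-up XML fragment of the enterprise-config variety XML now mostly lives in):

```python
import xml.etree.ElementTree as ET

# A made-up XML fragment, typical of the config files XML now inhabits
doc = ET.fromstring(
    "<config>"
    "<server name='web01'><port>8080</port></server>"
    "<server name='web02'><port>9090</port></server>"
    "</config>"
)

# XPath path expression: every <port> under any <server>
ports = [p.text for p in doc.findall("./server/port")]
print(ports)  # ['8080', '9090']

# XPath attribute predicate: just the server named 'web02'
port = doc.find("./server[@name='web02']/port").text
print(port)  # '9090'
```

Scrapy's own selectors (`response.xpath(...)`) accept the full XPath 1.0 language; the path-and-predicate pattern above carries over directly.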

Selenium Headers

So, after my DefCon talk a few weeks ago, someone came up to me and mentioned that something I had said -- that Selenium browsers send the same headers as regular browsers -- was incorrect. They said there were subtle differences between the headers that Selenium sent and the headers that a normal human-operated browser would send. I only had the chance to speak with them briefly, so I may have misunderstood, but I thought I'd put the claim to the test and compare normal browser headers with the headers seen through the Selenium Python library.
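One way to run that test is to point both a human-driven browser and a Selenium-driven one at an endpoint that echoes request headers back, then diff the two header sets. A minimal sketch of the comparison step (the echo endpoint and the Selenium setup are omitted, and the header values below are made up for illustration, not measured):

```python
def diff_headers(a, b):
    """Return the headers whose values differ between two request-header dicts."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}

# Illustrative values only -- capture real ones from a header-echo endpoint
human_headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept-Language": "en-US,en;q=0.9",
}
selenium_headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept-Language": "en-US",
}

print(diff_headers(human_headers, selenium_headers))
# {'Accept-Language': ('en-US,en;q=0.9', 'en-US')}
```

Any key that shows up in the diff -- or any header present in one set and missing from the other -- is a potential fingerprint a site could use to spot the automated browser.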

Terms of Service and Robots.txt

It's a commonly held rule of thumb: "If you're going to scrape the web, make sure you follow the Terms of Service and robots.txt" in order to avoid trouble. And sure, if you follow both of these (and rate limit your bots, of course), you probably won't be getting any cease-and-desist letters any time soon. But is following the ToS and robots.txt actually necessary for avoiding trouble? Not really.
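If you do choose to honor robots.txt, Python's standard library can do the parsing for you. A small sketch using urllib.robotparser, with a made-up set of rules (in practice you'd call read() against a live site's /robots.txt instead of parsing inline text):

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed from text for illustration
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check a URL against the rules before fetching it
print(rp.can_fetch("MyBot", "http://example.com/public/page"))   # True
print(rp.can_fetch("MyBot", "http://example.com/private/page"))  # False
```

Note that this tells you only what the site *asks* bots to do; as the post argues, it carries no legal force on its own.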

Scraping with JavaScript

One Amazon reviewer recently wrote:

At the time of publication the world is awash in Javascript-littered websites. Producing a book that dedicates only a few of its pages to scraping web pages after or while Javascript is running makes this book an anachronism and of reduced current value. I don't mean this to come across as harsh, but this is a 6-star book for scraping Tripod and Angelfire sites.