Second Edition Is Out!

Well, the second edition has been out for a few months now, but the nice thing about being primarily an author of books (as opposed to being an author of blog posts) is that you're expected to be able to produce a lot of content all at once rather than churning out continuous updates!

A lot of people have asked me about the changes between the first and second editions. The publishing industry gets a bad reputation for releasing "editions" with minor updates, but I promise, this is a good one. Four new chapters:

  • Web Crawling Models
    • This demonstrates common patterns that web crawlers tend to follow and how to build them extensibly and flexibly using software engineering best practices. 
  • Scrapy
    • This conveniently follows the "Web Crawling Models" chapter, as some of the same principles carry over. When the first edition was written, Scrapy did not support Python 3. While there was a section dedicated to the framework (in Python 2), it was fairly cursory and felt a little tacked on. I greatly expanded the content, brought everything over to Python 3, and made it its own chapter.
  • Crawling Through APIs
    • This replaces the first edition's "Using APIs" chapter. The content in that chapter was, honestly, pretty introductory and felt out of place with the rest of the material. Rather than simply "using" APIs as they're meant to be used, this chapter describes finding hidden APIs in websites and scraping their data as an alternative to more intensive solutions like Selenium.
  • Web Crawling in Parallel
    • A glaring omission from the previous edition -- how to write crawlers that run in parallel, whether across the same site or different sites. Please use with caution, and be kind to web servers!

In addition to the new chapters, I added sections and expanded content in almost every chapter. The appendix from the previous edition was removed entirely: I dropped the "Intro to Python" and "Internet 101" material, which was not necessary for the target audience, and moved the "Legalities and Ethics" section into its own chapter with expanded and revised content. Despite the trimming, the new edition is 51 pages longer with at least 80 pages of entirely new content.

The entire book has been heavily revised. I held back some of the more "complicated" errata that had been piling up so I could address them in the second edition, and I don't think a single code block was left untouched during the editing and reformatting. Most of the code samples are in IPython notebooks, except for chapters 5 (Scrapy) and 16 (scraping in parallel), due to the special nature of how those are organized and/or executed.

If you liked the first edition, I do recommend the second edition as well! While it's obviously tough to convince someone to buy a second copy of a book they already have, there are always Safari subscriptions, public libraries, and -- hey -- maybe you have a friend who needs this book in their life and could lend it to you for a bit?

Happy scraping!
