Scraping with JavaScript

One Amazon reviewer recently made this point:

At the time of publication the world is awash in Javascript-littered websites. Producing a book that dedicates only a few of its pages to scraping web pages after or while Javascript is running makes this book an anachronism and of reduced current value. I don't mean this to come across as harsh, but this is a 6-star book for scraping Tripod and Angelfire sites.

Initially, I wrote a comment in response to the reviewer and left it at that, but then someone on Twitter raised the same point -- a small chapter on JavaScript just isn't enough when most websites today use JavaScript.

Because this seems to be a common reaction, and because I think it's a very interesting topic -- "What is JavaScript and how do you scrape it?" -- I'd like to address it in greater detail here, explain why I wrote the book the way I did, and describe what I will change in upcoming editions.

To understand how to approach the problem of scraping JavaScript, you have to look at what it does. Ultimately, all it does is modify the HTML and CSS on the page, as well as send requests back to the server. "But, but..." you might be thinking, "What about drag and drop? Or JavaScript animations? It makes pretty things move!" Just HTML and CSS changes. All of them. Ajax loading -- HTML and CSS changes. Logging users in through an Ajax form -- a request back to the server followed by HTML and CSS changes.

Yes, sure, you can scrape the JavaScript itself, and in some cases this can be useful -- such as scraping latitudes and longitudes directly from code that powers a Google Map, rather than scraping the generated HTML itself. And, in fact, this is one technique I mention in the book. However, 99% of the time, what you're going to be doing (and what you can fall back on in any situation), is executing the JavaScript (or interacting with the site in a way that triggers the JavaScript), and scraping the HTML and CSS changes that result. 

Contrary to what seems to be popular belief, scraping, parsing, cleaning, and analyzing HTML isn't useless in the world of JavaScript -- it's necessary! HTML is HTML is HTML, whether it's generated by JavaScript on the front end or a PHP script on the back end. In the case of PHP, the server takes care of the hard work for you, and in the case of JavaScript, you have to do that yourself.

But how? If you've read the book, you already know the answer: Selenium and PhantomJS. 

from selenium import webdriver
import time
# executable_path points to your PhantomJS binary
driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # give the Ajax content a few seconds to load
print(driver.find_element_by_id("content").text)
driver.close()

These seven lines (including the print statement) can solve your Ajax loading problems. Note: there are also ways of waiting before returning content by checking whether a particular element on the page has loaded (sketched below), but waiting a few seconds usually works fine as well.
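If you'd rather wait for a specific element than sleep for a fixed number of seconds, here is a minimal sketch of that approach using Selenium's WebDriverWait and expected_conditions helpers. The element ID "loadedButton" is an assumption, standing in for whatever element only appears on your page after the Ajax call has finished, and the 10-second timeout is arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Wait (up to 10 seconds) for an element that only exists once loading is done;
    # "loadedButton" is a placeholder ID -- substitute one from your own page
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "loadedButton")))
    print(driver.find_element_by_id("content").text)
finally:
    driver.close()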

But, of course, there's another class of HTML and CSS changes JavaScript can create -- the user-triggered kind. And in order to get user-triggered changes, well, the user (or your code, acting as the user) has to trigger them on the page. In Chapter 13, "Testing with Selenium," I discuss these in detail.

Key to this sort of testing is the concept of Selenium elements. This type of object was briefly encountered in Chapter 10, and is returned by calls like:
usernameField = driver.find_element_by_name('username')

Just as there are a number of actions you can take on various elements of a website in your browser, there are many actions Selenium can perform on any given element. Among these are:
myElement.click()
myElement.click_and_hold()
myElement.release()
myElement.double_click()
myElement.send_keys_to_element("content to enter")

All of these actions can be strung together in chains, put into functions to act on variable elements, and can even be used to drag and drop elements (see GitHub: https://github.com/REMitchell/python-scraping/blob/master/chapter13/4-dr...).
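As a rough illustration (not the book's exact example), here's how those actions chain together through Selenium's ActionChains interface; the URL and the "source"/"target" element IDs are placeholders for whatever draggable elements your page actually has:

from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://example.com/some-draggable-page")  # placeholder URL

# Placeholder IDs -- substitute the elements you actually want to drag and drop
source = driver.find_element_by_id("source")
target = driver.find_element_by_id("target")

# Chain several low-level actions together, then execute them with perform()
ActionChains(driver).click_and_hold(source).move_to_element(target).release().perform()

# Or use the built-in convenience method for the same result
ActionChains(driver).drag_and_drop(source, target).perform()

driver.close()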

After your JavaScript has been executed -- whether it's something you had to wait around for, or something you had to take action to make happen -- you scrape the resulting HTML! That's all covered in the first half of the book. Let me say that again: knowing how to scrape HTML is not just good for (as one reviewer put it) scraping Tripod and Angelfire sites -- you need it to scrape every site, whether it's loaded with JavaScript, a server-side script, or monkey farts*. If there's content you can see in your browser, there's HTML there. You don't need special tools to scrape JavaScript pages (other than the tools necessary to execute the JavaScript, or trigger it to execute), just as you don't need special tools to scrape .aspx or PHP pages.
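To make that concrete, here's a minimal sketch (my own, not from the book) of handing the post-JavaScript HTML to an ordinary parser. Once the page has rendered, driver.page_source is just HTML, and BeautifulSoup treats it like any other document:

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)  # let the Ajax content finish loading

# page_source holds the HTML *after* the JavaScript has run
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(id="content").get_text())
driver.close()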

So there you have it: in just a few paragraphs, I've covered all you need to know to scrape every JavaScript-powered website. In the book, I devote a full 10 pages to the topic, followed by sections in later chapters that revisit Selenium and JavaScript execution. In future editions, I will likely take some time to explain why you don't need an entire book devoted to "scraping JavaScript sites," and why the information about scraping websites in general is relevant -- and necessary -- to scraping JavaScript. Hindsight is 20/20!

 

*I know someone's going to take this as an opportunity to mention Flash, Silverlight, or other third-party browser plugins. I know, I know. You don't have to mention it. I'm hoping they go away! Setting aside the extra software you have to add to your browser to make them work, however, the principle holds true.

Comments

Two points to add:
- I tried to be a smart alec and asked what you'd do if the website is actually one big JPG. Turns out there's a chapter on that.
- People are entirely too dismissive of the amazing things you can find with a dedicated search of Angelfire sites. For example: http://gizmodo.com/5993535/holy-crap-is-this-mark-zuckerbergs-childhood-...

First of all, nice tutorial. I am a web scraper and have created many web scrapers using PHP and .NET, along with JavaScript.
