Terms of Service and Robots.txt

It's a commonly held rule of thumb: "If you're going to scrape the web, make sure you follow the Terms of Service and robots.txt" in order to avoid trouble. And sure, if you do both (and rate limit your bots, of course), you probably won't be getting any cease and desist letters any time soon. But is following the Terms of Service (TOS) and robots.txt necessary for avoiding trouble? Not really.
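
For reference, here is roughly what "follow robots.txt and rate limit your bots" looks like in code. This is only a minimal sketch using Python's standard urllib.robotparser module; the robots.txt rules, the "example-bot" user agent, the example.com URLs, and the crawl helper are all made up for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from a list so nothing hits the network.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)


def crawl(urls, fetch, user_agent="example-bot", default_delay=1.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    # Use the site's Crawl-delay if it declares one, else our own default.
    delay = parser.crawl_delay(user_agent) or default_delay
    results = []
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt asks us to stay out of this path
        if results:
            sleep(delay)  # rate limit: wait before every request after the first
        results.append(fetch(url))
    return results
```

In real use, fetch would be whatever function makes the HTTP request, and the rules would presumably be loaded with parser.set_url() and parser.read() rather than from a hard-coded list.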

Let's start with robots.txt, which is the weaker of the two documents, legally speaking. And by "weaker of the two" I mean "it means nothing." For most of its history the Robots Exclusion Standard was an unofficial convention: the IETF only published it as RFC 9309 in 2022, and even that document says its rules are not a form of access authorization. No government recognizes it, and it lives at a strange, obscure location that is often not linked from anywhere on the site. Depending on robots.txt to stop bots is like having an open storefront with a note under your doormat that says "Do Not Enter!", then complaining that no one follows the "note under the doormat" protocol that you and your friends came up with.

In fact, robots.txt can sometimes promote unwanted bot activity by acting as a listing of all the locations on the site that you don't want scraped. Obviously, security through obscurity is hardly security worth having, but you might think twice about putting out a sign that says "do not go to our hidden login page at this location" for every bot that wanders by.
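
To make that concrete, here's a minimal sketch of how little work it takes a bot to turn a robots.txt into a to-do list. The file contents and paths below are hypothetical, and disallowed_paths is just an illustrative helper, not part of any library.

```python
# Hypothetical robots.txt content; every path here is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/login
Disallow: /internal/reports/
Allow: /blog/

User-agent: BadBot
Disallow: /
"""


def disallowed_paths(robots_txt):
    """Return every path named in a Disallow line, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue  # blank or malformed line
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths(ROBOTS_TXT))
# ['/admin/login', '/internal/reports/', '/']
```

A well-behaved crawler reads those lines as places to avoid; a badly behaved one can read the exact same lines as a list of places worth a look.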

The Terms of Service, of course, does not prevent bots by itself, but in certain circumstances it can give a site better legal standing for prosecuting or suing the companies or individuals that control them. If the TOS sits at the bottom of the page and users never have to explicitly agree to it, this is known as a "browsewrap" and is generally unenforceable, as courts have found time and time again. Of course, if you find that a site is actually blocking you (usually by blocking your IP address) and sending you cease and desist letters based on its browsewrap TOS, you do have to stop scraping it. The site can't take you to court based on the TOS alone, but it can take additional action to block you, and then take you to court if you try to circumvent that.

However, if a site has a TOS that users have to actually agree to in order to use the site or create an account, it can enforce the terms without much of this additional action. Of course, in order to enforce the terms of even a "clickwrap" TOS (the kind where users actually have to click to agree), the site still needs to gather the evidence, identify the violators, put together a case, lawyer up, and file with the courts. Obviously, that takes time and money, and even large companies are unlikely to follow through if you haven't caused any lasting damage and you immediately cease and desist when notified to do so.

So, keep proceeding with an overabundance of caution if you must, but, in my experience, it's best to proceed with a little knowledge, research, and calculated risks ;)
