Lesson 39:

Downloading from the Web with the Requests Module

The requests module lets you download files and web pages from the web without having to handle low-level networking details yourself.

requests does not come with Python, so it must be installed separately with pip (pip install requests).


In [2]:
# Test the requests module by importing it
import requests

# Download the file at this URL; the result is a Response object that can be queried
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')

The outcome of a request can be checked via the response's status code, which is an integer:

  • 404 is the typical 'file not found' code.
  • 200 is the typical 'success' code.

In [3]:
res.status_code


Out[3]:
200
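
Because status codes are plain integers, they can be compared directly or translated into a readable label. The following is a minimal sketch that does not touch the network: it uses the standard library's http.HTTPStatus enum to look up the standard phrase for a code. The helper name describe_status is hypothetical and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The standard library does not know this code
        phrase = "Unknown"
    return f"{code} {phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

In the lesson's session, describe_status(res.status_code) would give the label for the download above.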

The request succeeded, and the downloaded content is stored in the response object. The text is available through its text attribute:


In [5]:
# Print the first 1000 characters
print(res.text[:1000])


The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Romeo and Juliet

Author: William Shakespeare

Posting Date: May 25, 2012 [EBook #1112]
Release Date: November, 1997  [Etext #1112]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***













*Project Gutenberg is proud to cooperate with The World Library*
in the presentation of The Complete Works of William Shakespeare
for your reading for education and entertainment.  HOWEVER, THIS
IS NEITHER SHAREWARE NOR PUBLIC DOMAIN. . .AND UNDER THE LIBRARY
OF THE FUTURE CONDITIONS OF THIS PRESENTATION. . .NO CHARGES MAY
BE MADE FOR *ANY* ACCESS TO THIS MATERIAL.  YOU ARE ENCOURAGED!!
TO GIVE IT AWAY TO ANYONE YOU LIKE, BUT

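A typical way to deal with the status is to call the raise_for_status() method, which raises an exception if the download failed (stopping the program if the exception is not caught) and does nothing if it succeeded. It can be used together with conditional checks on the status code and with try and except statements.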


In [7]:
# Run method on existing response object; won't raise anything because no error
res.raise_for_status()

# An example bad request

badres = requests.get('https://automatetheboringstuff.com/134513135465614561456')
badres.raise_for_status()


---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-7-862807cc57ca> in <module>()
      5 
      6 badres = requests.get('https://automatetheboringstuff.com/134513135465614561456')
----> 7 badres.raise_for_status()

/usr/local/lib/python3.5/site-packages/requests/models.py in raise_for_status(self)
    829 
    830         if http_error_msg:
--> 831             raise HTTPError(http_error_msg, response=self)
    832 
    833     def close(self):

HTTPError: 404 Client Error: Not Found
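
To keep a failed download from stopping the whole program, the raise_for_status() call can be wrapped in try and except. The sketch below is one possible way to do that; the helper name check_response is hypothetical, and the Response object is built by hand purely so the example runs without a network connection. In real use the object would come from requests.get().

```python
import requests


def check_response(response):
    """Return True if the response succeeded, False if it carries an HTTP error."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        # Report the problem instead of letting the exception stop the program
        print('Download failed:', err)
        return False
    return True


# Offline demonstration (an assumption for illustration only):
# create an empty Response and set its status code manually.
fake_bad = requests.Response()
fake_bad.status_code = 404
print(check_response(fake_bad))
```

With the objects from the cell above, check_response(res) would return True and check_response(badres) would report the 404 error and return False.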

Files downloaded in this way should be written to a file opened in 'wb' (write-binary) mode, even when the content is plain text, to preserve the Unicode encoding of the text. An explanation of Unicode and its relationship to Python can be found here.

To store this file, we therefore need to write it in 'byte' chunks to a binary file. A useful method to help do this is the response object's iter_content method.


In [10]:
# Open/create a file to store the bytes, using a new name
playFile = open('files/RomeoAnd Juliet.txt', 'wb')

# Iteratively write each 100,000 byte 'chunk' of data into this file
for chunk in res.iter_content(100000):
    playFile.write(chunk)
    
# Close to save file
playFile.close()
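
The same open, write, and close steps can also be expressed with a with statement, which closes the file automatically even if a write fails partway through. This is a sketch rather than the lesson's own code: the function name save_response and its parameters are hypothetical, and it only assumes that the object passed in has an iter_content() method, as a requests Response does.

```python
def save_response(response, path, chunk_size=100000):
    """Write the body of a response to path in binary chunks.

    Returns the total number of bytes written.
    """
    bytes_written = 0
    # The with block closes the file when it ends, even after an error
    with open(path, 'wb') as out_file:
        for chunk in response.iter_content(chunk_size):
            # write() on a binary file returns the number of bytes written
            bytes_written += out_file.write(chunk)
    return bytes_written
```

Called with the res object from earlier and a target file name, this would save the play and return its size in bytes.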

The requests module is the preferred way to download files from the web in Python, and its documentation covers a variety of use cases.

It excels at downloading specific files from specific URLs; it is not designed to drive interactive pages, such as complex login flows. A browser simulator like selenium is often superior for such actions.

Recap

  • The requests module is a third-party module for downloading web pages and files.
  • requests.get() returns a Response object.
  • The Response object stores the data downloaded from a URL and can be handled like any other variable in the program.
  • The status_code attribute holds the status code of the response, and the raise_for_status() method raises an exception if the download failed; together they indicate the success or failure of the operation.
  • The iter_content() method can be used to iteratively write byte chunks to a file opened in write-binary mode, in order to save the download locally.