Load the content from a website with urllib.request

In this example we use urllib.request to load the content from a website.


In [1]:
import urllib.request

Some sites block requests that carry urllib's default 'User-Agent', so we set a custom 'User-Agent' header before loading the content from the remote site.


In [2]:
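url = 'https://medium.com/tag/machine-learning'
# Send a custom User-Agent instead of urllib's default one
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
# Open the URL; con is the HTTP response object
con = urllib.request.urlopen(req)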
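
Before sending, it can help to confirm what the request will actually carry. The following is a minimal sketch that only builds the Request object and inspects it, so it needs no network access. One detail worth knowing: Request stores header names in capitalized form, so the header is looked up as 'User-agent'.

```python
import urllib.request

# Build the request without sending it; no network access happens here.
url = 'https://medium.com/tag/machine-learning'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

# Request normalizes header names with str.capitalize(),
# so 'User-Agent' is stored under the key 'User-agent'.
user_agent = req.get_header('User-agent')

print(req.get_method(), req.full_url)  # HTTP method and target URL
print(req.header_items())              # all headers as (name, value) pairs
print(user_agent)
```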

Let's check the HTTP status code and the reason phrase returned by the server.


In [3]:
print(con.status, con.reason)


200 OK
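
The call above succeeded, but urlopen raises an exception when the server answers with an error status or cannot be reached at all. Here is a hedged sketch of how both cases might be handled. The helper name `fetch_status` and its `timeout` default are assumptions made for illustration; they are not part of the original notebook.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Hypothetical helper: return (status, reason).

    status is None when no HTTP response was received.
    """
    req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as con:
            return con.status, con.reason
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return err.code, err.reason
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme, ...
        return None, str(err.reason)
```

Calling `fetch_status(url)` with the URL from this notebook should return the same status and reason as printed above, provided the site is reachable.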

We can also read a specific HTTP response header, such as 'Content-Type'. getheader returns None if the header is not present.


In [4]:
con.getheader('Content-Type')


Out[4]:
'text/html; charset=utf-8'
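
The header value combines a media type and a charset parameter. The response's `headers` attribute is based on `email.message.Message`, which can split the two. The sketch below uses a standalone Message object as a stand-in for `con.headers`, filled with the value shown above, so that it runs without a network connection.

```python
from email.message import Message

# Stand-in for con.headers, filled with the header value shown above.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

media_type = headers.get_content_type()    # media type without parameters
charset = headers.get_content_charset()    # charset parameter, lowercased

print(media_type, charset)
```

With a live response, the same two methods can be called on `con.headers` directly.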

Now we can read the content of the website. read() returns the body as bytes, so we display only the first 500 bytes.


In [5]:
text = con.read()
text[:500]


Out[5]:
b'<!DOCTYPE html><html xmlns:cc="http://creativecommons.org/ns#"><head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0, viewport-fit=contain"><title>The most insightful stories about Machine Learning \xe2\x80\x93 Medium</title><link rel="canonical" href="https://medium.com/tag/machine-learning"><link id="feedLink" rel="al'
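
The output is a bytes object, which is why the dash in the title appears as the escape sequence \xe2\x80\x93. To work with the page as text, the bytes can be decoded with the charset announced in the Content-Type header. A minimal sketch, using a short sample copied from the output above in place of the full `text` variable:

```python
# Sample bytes copied from the output above; in the notebook this would be `text`.
sample = (b'<title>The most insightful stories about Machine Learning '
          b'\xe2\x80\x93 Medium</title>')

# Decode with the charset announced in the Content-Type header (utf-8).
html = sample.decode('utf-8')

print(type(html).__name__)
print(html)
```

After decoding, the three bytes \xe2\x80\x93 become a single en dash character in the resulting string.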
