However, some steps in a program are done so many times by so many people that, eventually, someone writes a package that bundles up those operations into something easy to use that saves you having to figure out the gory details.
In this lesson we'll see how packages are useful in the context of tackling problems programmatically. The problem we will use here as an example is: download a data file that we know is hosted on a web site and then do some analysis of those data.
In this example, we are going to access and analyse a data set that we created: Cities from Wikipedia Data. Have a look at those data in your browser now by pasting the URL into a new tab. We'll use the urlopen
function from the urllib.request package to help us access the data.
A: A package is just a bundle of useful functions.
There's a little more to it than this, but the simplest way to think of this is that when we type import foo
then we are including functions from the foo
package in our code. You'll see how this works below.
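As a minimal sketch of what an import gives you, the example below uses Python's built-in math package (chosen here only for illustration; it is not part of the data exercise). Once the package is imported, its functions are reached through the package name.

```python
import math  # make the contents of the math package available to our code

# Functions from the package are reached through the package name
root = math.sqrt(16)
print(root)

# Packages can bundle useful values as well as functions
print(math.pi)
```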
The point of the bundle-of-functions is that they can help us to achieve quite a lot very quickly since we don't need to reinvent the wheel and can just make use of someone else's code. In the same way that we won't mark you down for Googling the answer to a coding question, we won't mark you down for using someone else's package to help you get going with your programming. That's the whole point!
Often, if you're not sure where to start, Google (or StackOverflow) is the place to go:
how to read text file on web server python
Boom!
A: Because it keeps our code 'clean' and allows us to have functions with the same name that do different things.
Imagine the following:
read_data(...)
that takes a URL and extracts the data that you want from that page as a CSV file, except...read_data(...)
functions are specific to that one web site! So, if both packages have functions called read_data()
, then after you have done:
import nomis_reader
import ldn_data_store_reader
How does Python know which read_data
you want to use?
The answer is the namespace: you can only access the NOMIS read_data
function by typing nomis_reader.read_data(...)
and you can only access the Data Store read_data
function by typing ldn_data_store_reader.read_data(...)
.
The length and awkwardness of that last line is why you can write this:
import ldn_data_store_reader as ldn
The alias ("...as ldn"
) lets you type ldn.read_data(...)
and Python will know exactly which function you mean!
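To see namespaces and aliases working with code you can actually run, here is a small sketch that uses two standard-library packages, math and cmath, as stand-ins for the two imagined reader packages. Both provide a function called sqrt, and the package name (or alias) in front of the dot tells Python which one you mean. The alias cm is just a name chosen for this example.

```python
import math
import cmath as cm  # 'cm' is an alias, so we can type cm.sqrt(...)

# Same function name, different packages, different behaviour
real_root = math.sqrt(4)      # works with ordinary (real) numbers
complex_root = cm.sqrt(-4)    # works with complex numbers

print(real_root)
print(complex_root)
```

Without the namespace, a second sqrt would simply replace the first and you would lose access to one of them.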
The first step to writing a program is thinking about your goal and the steps required to achieve that. We don't write programs like we write essays: all at once by writing a whole lot of code and then hoping for the best when we hit 'submit'.
When you're tackling a programming problem you break it down into separate, simpler steps, and then tick them off one by one. Doing this gets easier as you become more familiar with programming, but it remains crucial and, in many cases, good programmers in large companies spend more time on design than they do on actual coding.
To a computer, reading data from a remote location (e.g. a web site halfway around the world) is not really any different from reading one that's sitting on your local hard drive (e.g. on your desktop). To simplify things a great deal: the computer really just needs to know the location of the file and an appropriate protocol for accessing that file (e.g. http, https, ftp, local...) and then a clever programming language like Python will typically have packages that can take care of the rest.
In all cases -- local and remote -- you use the package to handle the hard bit of knowing how to actually 'read' data (because all files are just 1
s and 0
s of data) at the device level and then Python gives you back a 'file handle' that helps you to achieve things like 'read a line' or 'close an open file'. You can think of a file handle as something that gives you a 'grip' on a file-like object no matter where or what it is, and the package is the way that this magic is achieved.
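The idea that a file handle behaves the same way wherever the file lives can be sketched without a network connection. The example below writes a tiny, made-up CSV file to a temporary location, reads its first line with the built-in open function, and then reads it again with urlopen using a file:// address as a stand-in for a remote web site. The file contents and variable names are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Create a small local file to read (the contents are made up for illustration)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write("id,name\n1,London\n")
    local_path = tmp.name

# 1. Local access: open() gives us a file handle
handle = open(local_path, encoding="utf-8")
first_local = handle.readline()   # 'read a line'
handle.close()                    # 'close an open file'

# 2. Access by URL: urlopen() also gives us a file-like handle
response = urlopen(Path(local_path).as_uri())    # a file:// URL
first_url = response.readline().decode("utf-8")  # bytes need decoding to text
response.close()

os.remove(local_path)  # tidy up the temporary file

print(repr(first_local))
print(repr(first_url))
```

Both handles support the same 'read a line' and 'close' operations; the only difference is that urlopen hands back bytes that need decoding.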
Let's recall our problem: download a data file that we know is hosted on a web site and then do some analysis of those data.
We need to break this seemingly hard problem down into something simpler and can do this by thinking about it as three separate steps:
We can tackle each of those in turn, getting the first bit working, then adding the second bit, etc. It's just like using Lego to build something: you take the same pieces and assemble them in different ways to produce different things.
So, as I said, in Step #1 we are going to download a file hosted on a remote web site at Cities from Wikipedia Data (a shortened bit.ly link makes it easier to copy and paste and avoids really long lines: http://bit.ly/2vrUFKi).
We aren't going to try to turn it into data or otherwise make 'sense' of it yet, we just want to get it. We are then going to build from this first step towards more substantial exercises and, eventually, you could easily request megabytes of data in real-time according to flexibly-specified parameters!
Because we're accessing data from a URL we will use the urlopen
function from the urllib.request
package.
If you're wondering how we know to use this function and package, you might Google something like: read remote csv file python 3 which in turn might get you to a StackOverflow question and answer like this.
Of course, just knowing that you need urlopen
doesn't necessarily help you to actually use it. In addition to finding example code on StackOverflow, you can also ask the package itself for help with dir
and help
.
The 'Dive into Python' web site will tell you that "dir returns a list of the attributes and methods of any object". That introduces yet another term ('modules') that we don't want to get into right now, but everything in Python is an object and so dir
will give you help with packages, variables, functions... you name it.
import math
dir(math)
What dir
gives you is information about things you can potentially do: it's like navigating the menu of a web site -- you aren't yet looking at the information you need, you're trying to figure out if the site even has what you need. So dir
on a package will give you a list of the functions (and any variables) that the person who created the package has provided.
Typically, the information given by dir
is highly abbreviated and is really just a prelude to using help
.
The help
function gives you the actual detail you need about how to use a particular function: what are the inputs, what are the outputs, and what will the function actually do?
Both of these can be used on a package, a function, or a variable that you've created; for example:
import math
help(math.acos)
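Because the list returned by dir can be long, it can help to filter it. The sketch below is one possible way to do that: it keeps only the names that do not start with an underscore, and then prints just the first line of the help text for one function. The variable name public_names is simply chosen for this example.

```python
import math

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(math) if not name.startswith('_')]
print(public_names)

# The text shown by help() comes from the function's 'docstring';
# here we print just its first line as a quick summary
print(math.acos.__doc__.splitlines()[0])
```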
Let's see this in action!
In [ ]:
from urllib.request import urlopen
print("dir(urlopen) returns:\n")
print(dir(urlopen))
print("\n\n")
print("help(urlopen) returns:\n")
print(help(urlopen)) # Notice! help() prints its own text and returns None, so print() adds 'None' at the end
Note: you can also get help in Jupyter by typing ?urlopen
in a code block and then hitting 'run'.
In [ ]:
?urlopen
In [ ]:
from urllib.request import urlopen
# Given the info you were given above, what do you
# think the value of 'url' should be? What
# type of variable is it? int or string?
url = ???
# Read the URL stream into variable called 'response'
# using the function that we imported above
response = ???(url)
# Now read from the stream, decoding so that we get actual text
datafile = response.read().decode('utf-8')
print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")
In [ ]:
from urllib.request import urlopen
url = "http://bit.ly/2vrUFKi"
response = urlopen(url)
datafile = response.read().decode('utf-8')
print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")
Note that the datafile variable is of type str
(because we decoded it as such). If we hadn't decoded it, the result would have been of type bytes
which wouldn't be as easy for us (humans) to work with.
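The difference between bytes and text is easiest to see with a character that is not plain ASCII. The short sketch below uses a made-up value (it is not taken from the data file) to show what decoding does.

```python
# A made-up value containing a non-ASCII character, encoded as UTF-8 bytes
raw = 'Zürich'.encode('utf-8')
print(raw)                    # the 'ü' appears as two escaped bytes
print(type(raw).__name__)

# Decoding turns the bytes back into readable text
text = raw.decode('utf-8')
print(text)
print(type(text).__name__)

# The lengths differ: bytes are counted one by one, text by character
print(len(raw), len(text))
```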
Now that we've read our data as text, we can print it to check.
In [ ]:
print(datafile)
So this is definitely text, but even though it has been nicely formatted visually in the notebook, right now it's actually not in a very convenient format to work with in our code. The datafile
str
object is actually a single string of text that the notebook is interpreting as having line breaks. We can see this by printing the 'raw' object (without jupyter notebook formatting for us) using the repr
function:
In [ ]:
print(repr(datafile))
Note the \n
character that jupyter notebook is using to determine when to print a new line for us to read nicely.
To split the text into individual lines ourselves (ready to work with in our code), we can use the handily named .splitlines()
method (more on methods below):
In [ ]:
url = "http://bit.ly/2vrUFKi"
response = urlopen(url)
datafile = response.read().decode('utf-8').splitlines()
print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")
Note now that the datafile variable has type list
. When we print this datafile
list
object (without repr
) the \n
characters have gone and the list elements are split where those new line characters were previously (look carefully at where the '
are):
In [ ]:
print(datafile)
We can see this more clearly if we use a for loop to print out each element of the list (each element being a row of the original online file):
In [ ]:
for row in datafile:
    print(row)
The last row should be 10,Sheffield,10,-163545.3257,7055177.403,685368
.
If you've managed to get the code above to run and have received 11 rows of text in response to your urlopen
query then, congratulations, you've now read a text file sitting on a server in, I think, Alberta, Canada and Python didn't care.
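Once Step 1 works, one possible next move is to wrap it in a small function so that it can be reused. The sketch below is an illustration rather than part of the original exercise: the function name read_lines, the timeout value, and the decision to return an empty list on failure are all assumptions. So that it runs without a network connection, the usage example reads from a 'data:' URL, which carries its made-up content inside the URL itself.

```python
from urllib.error import URLError
from urllib.request import urlopen


def read_lines(url, timeout=10):
    """Return the text found at url as a list of lines (empty list on failure)."""
    try:
        # 'with' closes the connection for us once we have finished reading
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8').splitlines()
    except URLError as err:
        print("Could not read " + url + ": " + str(err))
        return []


# Made-up sample content encoded into the URL (%2C is a comma, %0A is a new line)
sample_url = "data:text/plain;charset=utf-8,id%2Cname%0A1%2CLondon"
rows = read_lines(sample_url)
print(rows)
```

Swapping sample_url for the address of the real file would, network permitting, return the rows of the cities data in the same form.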
We now need to work on turning the response we got to our urlopen
request into useful data. You'll notice that we are dealing with a CSV (Comma-Separated Value) file and that the format is quite simple since none of the rows have fields that themselves contain commas. So to turn this into data we just need to split the row into separate fields using the commas.
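It is worth seeing why the 'no commas inside fields' condition matters. The sketch below compares a plain split with Python's built-in csv package on a made-up row whose second field contains a comma (the tricky row is invented for illustration and is not in our data file).

```python
import csv

simple_row = '10,Sheffield,10,-163545.3257,7055177.403,685368'
tricky_row = '1,"Kingston upon Hull, City of",5'  # made-up row with a comma inside a field

# Splitting on commas works for the simple row...
print(simple_row.split(','))

# ...but cuts the quoted field of the tricky row in two
naive = tricky_row.split(',')
print(naive)

# The csv package understands the quotes and keeps the field whole
parsed = next(csv.reader([tricky_row]))
print(parsed)
```

For the simple file used in this lesson, splitting on commas is enough.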
In the code below, dir('string')
lists the available functions for strings (because 'string'
is itself a string); we could just as easily have written dir('foo')
or dir('supercalifragilisticexpialidocious')
because 'foo' and 'supercalifragilisticexpialidocious' are also strings and so have the same functions available.
In the output below, the functions that start and end with __
are special methods that Python normally calls for you behind the scenes (e.g. __len__ is what len() uses), so you can skip over these and focus on the ones further down that are designed to be called directly by programmers. Can you spot the method that is most likely to be useful?
Remember that you can find out what methods are supported by a string using dir(<string>)
:
dir('supercalifragilisticexpialidocious')
I'm going to save you some time (this time!) and tell you that we're interested in the split
method. Why not use the help
function to figure out how to make use of it?
In [ ]:
help('supercalifragilisticexpialidocious'.split)
Now, using the output of the help
command, how would you use split
to turn that word into a list like this:
['sup','rcalifragilisticexpialidocious']
If you replace the ???
with the right bits of code then running the block below will print out "You got it!". You only need to change the ???
and nothing else!
In [ ]:
if ['sup','rcalifragilisticexpialidocious']=='supercalifragilisticexpialidocious'.split(???, maxsplit=???):
    print("You got it!")
else:
    print("Not yet!")
In [ ]:
if ['sup','rcalifragilisticexpialidocious']=='supercalifragilisticexpialidocious'.split('e', maxsplit=1):
    print("You got it!")
else:
    print("Not yet!")
In [ ]:
# Some other string methods
print('supercalifragilisticexpialidocious'.upper())
print('supercalifragilisticexpialidocious'.title())
OK, so you've tracked down the way to split a string
using a delimiter and even how to limit the number of 'words' that come out of the split operation. And you already saw another of these methods above (i.e. splitlines
). We work a lot with string
s, so it's handy to get to know the readily-available methods well.
Let's test string
splitting using our sample data (the last line of the 'simple' CSV file) to make sure it works the way we think it does... We want to turn the string
below into a list like this:
['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']
Again, we only need to change the ???
.
In [ ]:
test = datafile[-1].split(???)
print(test)
In [ ]:
test = datafile[-1].split(',')
print(test)
Hopefully you can see that a) is a list
and b) is a str
(string). Because a) is a list we can easily access each element. For example try the following code yourself:
Here's a clue:
print("The population of " + myList[1] + " is " + myList[5])
It is much more difficult to access the individual pieces of information from the string...
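Here is a short sketch of that idea using the last row of the file as quoted earlier. Following the clue above, it assumes that position 1 holds the city name and position 5 holds the population; the variable names are chosen for this example.

```python
row = '10,Sheffield,10,-163545.3257,7055177.403,685368'

# Split the string into a list of fields
fields = row.split(',')

# Each field is still a string, so convert numbers before doing maths with them
name = fields[1]
population = int(fields[5])

print("The population of " + name + " is " + str(population))
print(population + 1)  # now it behaves like a number
```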
You can hopefully see how we're breaking a complex problem down into a set of increments, each of which is a bit easier to write and understand.
In [ ]:
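total = 0
count = 0
for idx, row in enumerate(datafile):
    if idx > 0:  # skip the header row
        fields = row.split(',')  # each row is still a string, so split it first
        total = total + int(fields[-1])  # the last field is the population
        count += 1
        #print(row)
mean = total / count
print(mean)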
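The same calculation can also be sketched as a small reusable function and tried on data that needs no download. The rows below are entirely made up (including the header names), and the function name mean_of_last_column is hypothetical; the point is only to show the skip-the-header, split, convert, and average steps in one place.

```python
def mean_of_last_column(rows):
    """Return the mean of the last field of every row except the header."""
    values = [int(row.split(',')[-1]) for row in rows[1:]]  # rows[1:] skips the header
    return sum(values) / len(values)


# Made-up rows laid out like a simple CSV file (names and numbers are invented)
sample = [
    "id,name,population",
    "1,Alphaville,100",
    "2,Betatown,200",
    "3,Gammaford,600",
]

result = mean_of_last_column(sample)
print(result)
```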
The content and structure of this teaching project itself is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license, and the contributing source code is licensed under The MIT License.
Supported by the Royal Geographical Society (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.
This notebook may depend on the following libraries: None