Notebook-12: Introduction to Packages

Functions vs. Packages

We saw in a previous Code Camp Lesson how functions are a useful programming tool to enable us to re-use chunks of code. Basically, a function is a way to do something to something in a portable, easy-to-use little bundle of code.

However, some steps in a program are done so many times by so many people that, eventually, someone writes a package that bundles up those operations into something easy to use that saves you having to figure out the gory details.

In this lesson we'll see how packages are useful in the context of tacking problems programmatically. The problem we will use here as an example is: download a data file that we know is hosted on a web site and then do some analysis of those data.

In this example, we are going to access and analyse a data set that we created: Cities from Wikipedia Data. Have a look at those data in your browser now by pasting the URL into a new tab. We'll use the urlopen package to help us access the data.

Q: How is a package different from a function?

A: A package is just a bundle of useful functions.

There's a little more to it than this, but simplest way to think of this is that when we type import foo then we are including functions from the foo package in our code. You'll see how this works below.

The point of the bundle-of-functions is that they can help us to achieve quite a lot very quickly since we don't need to reinvent the wheel and can just make use of someone else's code. In the same way that we won't mark you down for Googling the answer to a coding question, we won't mark you down for using someone else's package to help you get going with your programming. That's the whole point!

Often, if you're not sure where to start Google (or StackOverflow) is the place to go:

how to read text file on web server python

Boom!

Q: But why have packages at all, why not just copy and paste all the functions that we need into our notebook directly?

A: Because it keeps our code 'clean' and allows us to have functions with the same name that do different things.

Imagine the following:

  • You want to write a program to read data from two different web sites (e.g. NOMIS and the London Data Store)
  • It turns out that both already have handy Python packages that help you to read the data if you give them the URL of the page to read
  • Even better, both packages have a simple function called read_data(...) that takes a URL and extracts the data that you want from that page as a CSV file, except...
  • Because the web sites are completely different the read_data(...) functions are specific to that one web site!

So, if both packages have functions callend read_data(), then after you have done:

import nomis_reader
import ldn_data_store_reader

How does Python know which read_data you want to use?

The answer is the namespace: you can only access the NOMIS read_data function by typing nomis.read_data(...) and you can only access the Data Store read_data function by typing ldn_data_store_reader.read_data(...).

Import... As...

The length and awkwardness of that last line is why you can write this:

import ldn_data_store_reader as ldn

Because the alias ("...as ldn") lets you type ldn.read_data(...) and Python will know exactly which function you mean!

Packages in Programming Context

The first step to writing a program is thinking about your goal and the steps required to achieve that. We don't write programs like we write essays: all at once by writing a whole lot of code and then hoping for the best when we hit 'submit'. 

When you're tackling a programming problem you break it down into separate, simpler steps, and then tick them off one by one. Doing this gets easier as you become more familiar with programming, but it remains crucial and, in many cases, good programmers in large companies spend more time on design than they do on actual coding.

To a computer, reading data from a remote location (e.g. a web site halfway around the world) is not really any different from reading one that's sitting on your your local hard drive (e.g. on your desktop). To simplify things a great deal: the computer really just needs to know the location of the file and an appropriate protocol for accessing that file (e.g. http, https, ftp, local...) and then a clever programming language like Python will typically have packages that can kind of take of the rest.

In all cases -- local and remote -- you use the package to handle the hard bit of knowing how to actually 'read' data (because all files are just 1s and 0s of data) at the device level and then Python gives you back a 'file handle' that helps you to achieve things like 'read a line' or 'close an open file'. You can think of a filehandle as something that gives you a 'grip' on a file-like object no matter where or what it is, and the package is the way that this magic is achieved.

Let's recall our problem: download a data file that we know is hosted on a web site and then do some analysis of those data.

We need to break this seemingly hard problem down into something simpler and can do this by thinking about it as three separate steps:

  1. First we want to read a remote file (i.e. a text file somewhere the planet),
  2. Then we want to turn it into a local data structure (i.e a list or a dictionary),
  3. Finally we want to perform some calculations on the data (e.g. calculate the mean).

We can tackle each of those in turn, getting the first bit working, then adding the second bit, etc. It's just like using lego to build something: you take the same pieces and assemble them in different ways to produce different things.

Step 1: Reading a Remote File

So, as I said, in Step #1 we are going to download a file hosted on a remote web site at Cities from Wikipedia Data (this can also be stored as a bit-link to make it easier to copy+paste and avoid really long lines: http://bit.ly/2vrUFKi)

We aren't going to to try to turn it into data or otherwise make 'sense' of it yet, we just want to get it. We are then going to build from this first step towards more substantial exercises and, eventually, you could easily request Megabytes of data in real-time according to flexibly-specified parameters!

Because we're accessing data from a URL we will use the urlopen function from the urllib.request package.

If you're wondering how we know to use this function and package, you might google something like: read remote csv file python 3 which in turn might get you to a StackOverflow question and anwer like this.

Getting help with packages

Of course, just knowing that you need urlopen doesn't necessarily help you to actually use it. In addition to finding example code on StackOverflow, you can also ask the package itself for help with dir and help.

dir

The 'Dive into Python' web site will tell you that "dir returns a list of the attributes and methods of any object". That introduces yet another term ('modules') that we don't want to get into right now, but everything in Python is an object and so dir will give you help with packages, variables, functions... you name it.

import math
dir(math)

What dir gives you is information about things you can potentially do: it's like navigating the menu of a web site -- you aren't yet looking at the information you need, you're trying to figure out if the site even has what you need. So dir on a package will give you a list of the functions (and any variables) that the person who created the package has provided.

Typically, the information given by dir is highly abbreviated and is really just a prelude to using help.

help

The help function gives you the actual detail you need about how to use a particular function: what are the inputs, what are the outputs, and what will the function actually do ?

Both of these can be used on a package, a function, or a variable that you've created; for example:

import math
help(math.acos)

Let's see this in action!


In [ ]:
from urllib.request import urlopen
print("dir(urlopen) returns:\n")
print(dir(urlopen))
print("\n\n")
print("help(urlopen) returns:\n")
print(help(urlopen)) # Notice!

Note: you can also get help in Jupyter by typing ?urlopen in a code block and then hitting 'run'.


In [ ]:
?urlopen

A Challenge for you!

Fix the code to set the url variable, then use it with the urlopen function to open it.


In [ ]:
from urllib.request import urlopen

# Given the info you were given above, what do you 
# think the value of 'url' should be? What
# type of variable is it? int or string? 
url = ???

# Read the URL stream into variable called 'response'
# using the function that we imported above
response = ???(url)

#now read from the stream, decoding so that we get actual text
datafile = response.read().decode('utf-8')

print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")

In [ ]:
from urllib.request import urlopen

url = "http://bit.ly/2vrUFKi"
response = urlopen(url)
datafile = response.read().decode('utf-8')

print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")

Note that the datafile variable is of type string (because we decoded it as such). If we hadn't decoded it, the result would have been of type bytes which wouldn't be as easy for us (humans) to work with.

Now that we've read our data as text, we can print it to check.


In [ ]:
print(datafile)

So this is definitely text, but even though it has been nicely formatted visually in the notebook, right now it's actually not in a very convenient format to work with in our code. The datafile str object is actually a single string of text that the notebook is interpreting as having line breaks. We can see this by printing the 'raw' object (without jupyter notebook formatting for us) using the repr function:


In [ ]:
print(repr(datafile))

Note the \n character that jupyter notebook is using to determine when to print a new line for us to read nicely.

To split the text into individual lines ourselves (ready to work with in our code), we can use the handily named .splitlines() method (more on methods below):


In [ ]:
url = "http://bit.ly/2vrUFKi"

response = urlopen(url)
datafile = response.read().decode('utf-8').splitlines()

print("datafile variable is of type: '" + datafile.__class__.__name__ + "'.\n")

Note now, that the data variable has type list. When we print this datafile list object (without repr) the \n characters have gone and the list elements are split where those new line characters were previously (look carefully at where the ' are):


In [ ]:
print(datafile)

We can see this more clearly if we use a for loop to print out each element of the list (each element being a row of the original online file):


In [ ]:
for row in datafile:
    print(row)

The last row should be 10,Sheffield,10,-163545.3257,7055177.403,685368.

If you've managed to get the code above to run and have received 11 rows of text in response to your urlopen query then, congratulations, you've now read a text file sitting on a server in, I think, Alberta, Canada and Python didn't care.

Step 2: Turning Text into Data

We now need to work on turning the response we got to our urlopen request into useful data. You'll notice that we are dealing with a CSV (Comma-Separated Value) file and that the format is quite simple since none of the rows have fields that themselves contain commas. So to turn this into data we just need to split the row into separate fields using the commas.

In the code below, dir('string') lists the available function for strings (because 'string' is itself a String; we could just as easily written dir('foo') or dir('supercalifragilisticexpialidocious') because 'foo' and 'supercalifragilisticexpialidocious' are also strings and so have the same functions available.

In the output below, the functions that start and end with __ are generally considered private, so you can skip over these and focus on the ones further down that are designed to be useful to programmers. Can you spot the method that is most likely to be useful?

A Brief Musical Interlude

Just in case you need help pronouncing supercalifragilisticexpialidocious:

Remember that you can find out what methods are supported by a string using dir(<string>):

dir('supercalifragilisticexpialidocious')

I'm going to save you some time (this time!) and tell you that we're interested in the split method. Why not use the help function to figure out how to make use of it?


In [ ]:
help('supercalifragilisticexpialidocious'.split)

Now, using the output of the help command, how would you use split to turn that word into a list like this:

['sup','rcalifragilisticexpialidocious']

If you replace the ??? with the right bits of code then running the block below will print out "You got it!". You only need to change the ??? and nothing else!


In [ ]:
if ['sup','rcalifragilisticexpialidocious']=='supercalifragilisticexpialidocious'.split(???, maxsplit=???):
    print("You got it!")
else:
    print("Not yet!")

In [ ]:
if ['sup','rcalifragilisticexpialidocious']=='supercalifragilisticexpialidocious'.split('e', maxsplit=1):
    print("You got it!")
else:
    print("Not yet!")

In [ ]:
# Some other string methods
print('supercalifragilisticexpialidocious'.upper())
print('supercalifragilisticexpialidocious'.title())

OK, so you've tracked down the way to split a string using a delimiter and even how to limit the number of 'words' that come out of the split operation. And you already saw another of these methods above (i.e. splitlines). We work a lot with strings, so it's handy to get to know the readily-available methods well.

Let's test string splitting using our sample data (the last line of the 'simple' CSV file) to make sure it works the way we think it does... We want to turn the string below into a list like this:

['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']

Again, we only need to change the ???.


In [ ]:
test = datafile[-1].split(???)
print(test)

In [ ]:
test = datafile[-1].split(',')
print(test)

Consider

A question: why do you think that I consider a) to be data and not b)?

a)

['10', 'Sheffield', '10', '-163545.3257', '7055177.403', '685368']

b)

"10,Sheffield,10,-163545.3257,7055177.403,685368"

Decide what you think the answer is before clicking for the solution

Hopefully you can see that a) is a list and b) is a str (string). Because a) is a list we can easily access each element. For example try the following code yourself:

Here's a clue:

print("The population of " + myList[1] + " is " + myList[5])

It is much more difficult to access the individual pieces of information from the string...

You can hopefully see how we're breaking a complex problem down into a set of increments , each of which is a bit easier to write and understand.

Step 3 Analysis


In [ ]:
total = 0
count = 0

for idx,row in enumerate(datafile):
    
    if(idx > 0):
        total = total + int(row[-1])
        count += 1
    #print(row)
mean = total / count 

print(mean)

Credits!

Contributors:

The following individuals have contributed to these teaching materials:

License

The content and structure of this teaching project itself is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license, and the contributing source code is licensed under The MIT License.

Acknowledgements:

Supported by the Royal Geographical Society (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

Potential Dependencies:

This notebook may depend on the following libraries: None