In What is Data 1, we saw some examples of tabular data. Lots of people just use spreadsheet tools like Excel or Google Sheets for manipulating tabular data, and that's fine. But some tasks end up being really complicated in spreadsheets, and there's an easier approach: use a programming language!
That might sound scary, but we're going to make it easy. First, we'll use a programming language called Python, which is widely used for teaching beginners. Second, we'll start with the very basics. And third, we'll try to make sure that even the complicated things are actually pretty straightforward.
There are lots of introductory tutorial resources for learning Python. But we're going to make a start right here, since we'll be able to integrate it with different ways of dealing with data. If you already know some Python or a similar programming language, feel free to skip this lesson.
The really cool thing about the page you are reading now is that it's got little boxes (we call them cells) where you can write in bits of code and run the code. Just below, where it says "In ..." followed by a number inside [], is a code cell. The expression in the cell is a Python expression. You don't need to be an experienced programmer to see what it means!
If you place your mouse cursor in the cell and then click on the ▸ symbol in the top menu, the code in the cell will run. But the best thing is that you can edit the cell and re-run the code. So replace the 4 by a 5 and run the code to see what happens.
In [70]:
3 + 4
Out[70]:
Unless something went very wrong, the cell with the result (that is, the cell where it says "Out ..." followed by a number inside []) will now contain an 8.
Try a few other things. Change the numbers some more; add together several numbers in the same line (e.g., 1 + 2 + 3 + ...); see what happens if you change the + to a - (subtraction) or a * (multiplication). What about division? In Python, the division operator is a /.
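As a rough sketch of the operators just mentioned, the cell below gives each result a name so that the results can be printed side by side. The names are purely illustrative (naming things is covered a little further on), and print() simply displays its values. Note that in Python 3, / gives a number with a decimal point even when the division is exact.

```python
# Illustrative names only; each expression could also be typed into a cell on its own.
added = 1 + 2 + 3      # adding several numbers in one line
subtracted = 3 - 5     # subtraction can give a negative result
multiplied = 3 * 5     # multiplication
divided = 8 / 2        # division gives a float in Python 3

print(added, subtracted, multiplied, divided)  # prints: 6 -2 15 4.0
```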
Programmers love lists. And lists are easy to use in Python. Here's a list of numbers from What is Data 1. You should recognise them as our rainy Edinburgh days:
In [71]:
[23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]
Out[71]:
Now, you can eyeball the list and see that it's got eight items in it. But if the list were too big to inspect by eye, you could get Python to tell you how long it is.
In [72]:
len([23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95])
Out[72]:
So len is a special Python expression that computes the length of a list. You should be able to see that in the cell above we've written len followed by (, then the list, then a ). We call len() a function and in this case we've applied the function to something else, namely the list. We usually write len() with () after the name of the function to remind us that it is a function. When we apply a function to something else, the function should give us a result. That's what happens when we run the code.
However, it can be cumbersome to write out the whole list every time we want to do something with it. Let's give the list a name that we can use as a short way of referring to it:
In [73]:
rainy_days = [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]
Now it's up to us (pretty much) what name we use. We could have called the list Monty or xyz or l. But it makes more sense to use a name that is appropriate and relevant for the thing it names.
OK, now for the cool bit. What happens if we apply len() to rainy_days?
In [74]:
len(rainy_days)
Out[74]:
Yes! We get the same result as before. Awesome!
Suppose now that we want to figure out the total number of rainy days across all eight years in our little dataset. We could write 23.87 + 19.85 + 19.22 + ... but this is very tedious. And there's a smarter way.
In [75]:
sum(rainy_days)
Out[75]:
Yes, we can apply the function sum() to a list to add up all the items in the list.
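To check that sum() really does the same job as the long-hand addition, here is a small sketch that computes the total both ways. The names by_hand and from_sum are made up for illustration. Because decimal numbers are stored with tiny rounding errors, the comparison uses math.isclose() from Python's standard library rather than expecting the two totals to match to the last digit.

```python
import math

rainy_days = [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]

# The tedious way: write out every number
by_hand = 23.87 + 19.85 + 19.22 + 28.93 + 29.41 + 22.23 + 23.50 + 24.95

# The smarter way: let sum() walk through the list
from_sum = sum(rainy_days)

print(round(from_sum, 2))              # prints: 191.96
print(math.isclose(by_hand, from_sum)) # prints: True
```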
You may be able to see where we're going with this. What was the average (or mean) number of days of rainfall per year over the eight-year period? Well, we need to get the total for the period and divide it by the number of years, right? Let's break it down into steps.
In [76]:
total = sum(rainy_days)
num_years = len(rainy_days)
total / num_years
Out[76]:
Hmm. Almost 24 days of solid rainfall per year. That's pretty impressive (or depressing, depending on your point of view).
In this example, we've given names to the results of applying functions as we go along. This can make things easier to follow, but isn't strictly necessary. We could just calculate the average in one go.
In [77]:
sum(rainy_days) / len(rainy_days)
Out[77]:
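As a cross-check on the calculation above, Python's standard library also has a ready-made mean() function in its statistics module. This sketch is only meant to confirm the result, not to replace working it out step by step; the names by_hand and from_library are illustrative.

```python
import math
import statistics

rainy_days = [23.87, 19.85, 19.22, 28.93, 29.41, 22.23, 23.50, 24.95]

# The mean worked out as total divided by count
by_hand = sum(rainy_days) / len(rainy_days)

# The same quantity from the standard library
from_library = statistics.mean(rainy_days)

print(round(by_hand, 3))                    # prints: 23.995
print(math.isclose(by_hand, from_library))  # prints: True
```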
Hint: define a new variable whose value is the list with elements 4.5, 3.1, 8.6, then adapt the example above.
Letters, words, sentences, whole books: these can all be treated as strings in Python. What is a string? You can think of it as a sequence of characters, including all letters of the alphabet, funny characters like $, § and _, punctuation marks, and even the "whitespace" between words. When we want to write a string down in Python, we have to enclose it in quotes. Either single quotes (') or double ones (") will do.
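Here is a minimal sketch showing that the choice of quotes makes no difference to the string itself. The names single, double and phrase are invented for this example, and == asks Python whether two values are equal. Having both kinds of quote is handy when the text itself contains one of them.

```python
single = 'snicker-snack'
double = "snicker-snack"

# The quotes are not part of the string, so these two are identical
print(single == double)  # prints: True

# Double quotes outside let us use an apostrophe inside
phrase = "Python's lists"
print(len(phrase))       # prints: 14
```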
In the next example, we again use the trick of introducing a name for our string, and we can also find out the length of the string.
In [78]:
jabber = "The vorpal blade went snicker-snack"
len(jabber)
Out[78]:
So there are 35 characters in the string, including the spaces.
Let's briefly play around with a simpler example:
In [79]:
s = "bora"
What happens if we add a string to a string?
In [80]:
s + s
Out[80]:
We can even multiply a string by a number to repeat it:
In [81]:
s * 5
Out[81]:
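To make the two string operations concrete, this sketch names the results (doubled and repeated are illustrative names) and checks the length of the repeated string. Adding strings joins them end to end, and multiplying by a whole number repeats the string that many times.

```python
s = "bora"

doubled = s + s    # joins the two strings end to end
repeated = s * 5   # five copies of s in a row

print(doubled)        # prints: borabora
print(repeated)       # prints: boraboraboraborabora
print(len(repeated))  # prints: 20
```

Note that + only works between two strings: s + 5 raises an error rather than guessing what was meant.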
Quite often we will encounter strings that we want to split into smaller parts. Suppose, for example, that we have a hyphenated word such as strawberry-ice-cream-flavoured that we want to break into component parts. Here's how we might do it:
In [82]:
"strawberry-ice-cream-flavoured".split('-')
Out[82]:
split() is similar to a function like len(), except that it is put right after a full-stop at the end of the string that we want to split. (Yes, this is a bit confusing and you'll just have to take it on trust for the moment that there is a good reason for doing it this way.) Inside the parentheses, we've written the character, namely -, that marks the points where we want to divide up the string. And finally, you can see that although we applied split() to a single string, the output is a list of strings.
Suppose we want to split the string jabber into its component words. We do this by splitting on the space between words. One way is to put quotes around the space character:
In [83]:
jabber.split(" ")
Out[83]:
Because it's so common to split a string on whitespace, Python allows us to do this just by leaving the parentheses empty:
In [84]:
jabber.split()
Out[84]:
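For jabber the two approaches give the same list, but they are not quite identical in general. The sketch below uses a made-up string, spaced, with two spaces after the first word: splitting on a single space leaves an empty string in the result, while the empty-parentheses form treats any run of whitespace as one gap.

```python
spaced = "The  vorpal blade"   # note the two spaces after "The"

on_single_space = spaced.split(" ")
on_whitespace = spaced.split()

print(on_single_space)  # prints: ['The', '', 'vorpal', 'blade']
print(on_whitespace)    # prints: ['The', 'vorpal', 'blade']
```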
Let's give a name to the list produced in this way, and again check the length of the list.
In [85]:
words = jabber.split()
len(words)
Out[85]:
A new trick with lists involves something called indexing. The best way to illustrate this is with an example.
In [86]:
words[0]
Out[86]:
Writing a number $n$ in square brackets after a list gives you back the $n^{th}$ item in the list. But (and this may seem deeply weird if you've not encountered it before) we start counting at 0. So words[0] returns the first element of words, words[1] returns the second element, and so on.
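As a small sketch of counting from 0, the cell below picks out the first, second and last items of words. The names first, second and last are illustrative. Since counting starts at 0, the last item sits at position len(words) - 1.

```python
jabber = "The vorpal blade went snicker-snack"
words = jabber.split()

first = words[0]                # counting starts at 0
second = words[1]
last = words[len(words) - 1]    # the final valid position

print(first, second, last)  # prints: The vorpal snicker-snack
```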
But be careful. If you index with a number that is equal to or greater than the number of items in the list, Python will throw a hissy fit (more technically, an error).
In [87]:
words[5]
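The list words has five items, so its valid positions are 0 to 4 and words[5] falls just off the end. The sketch below goes slightly beyond this lesson by using try and except, which let us catch the error and look at its message instead of stopping the program. The name message is illustrative.

```python
words = "The vorpal blade went snicker-snack".split()

try:
    words[5]                      # one position past the last item
except IndexError as error:
    message = str(error)          # the text Python attaches to the error
    print("IndexError:", message) # prints: IndexError: list index out of range
```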