This is a 4-week (8-hour) course that will introduce you to the basics of handling, manipulating, exploring, and modeling data with Python. This notebook is a review of some Python language essentials. We'll go over this material during the first class, but taking a look at it before then will help you out, especially if you haven't used Python before.
The environment you're in right now is called a Jupyter notebook. Jupyter notebooks are an interactive environment that data scientists use for collaboration and communication. Each cell in the notebook can contain either text or code (often Python, but R, Julia, and lots of other languages are supported). This allows you to seamlessly weave explanations and plots into your code.
Each cell in a notebook can be executed independently, but declarations persist across cells. For example, I can define a variable in one cell...
In [1]:
my_variable = 10
... and then access that variable in a later cell:
In [2]:
print(my_variable)
We'll be using Jupyter notebooks extensively in this class. I'll give a more detailed introduction during the first class, but for now, the most important thing is to understand how to run code in the notebook.
As I mentioned above, there are two fundamental types of cells in a notebook - text (i.e. markdown) and code. When you click on a code cell, you should see a cursor appear in the cell that allows you to edit the code in that cell. A cell can have multiple lines - to begin a new line, press Enter. When you want to run the cell's code, press Shift+Enter.
Try changing the values of the numbers that are added together in the cell below, and observe how the output changes:
In [3]:
a = 10
b = 15
print(a + b)
You can also edit the text in markdown cells. To display the editable, raw markdown in a text cell, double click on the cell. You can now put your cursor in the cell and edit it directly. When you're done editing, press Shift+Enter to render the cell into a more readable format.
Try editing the text cell below with your name:
Make some edits here -> Hello, my name is Nick!
To change whether a cell contains text or code, use the drop-down in the toolbar. When you're in a code cell, the drop-down reads "Code", and when you're in a text cell, it reads "Markdown".
Now that you know how to navigate the notebook, let's review some basic Python.
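So what is Python? This is actually a surprisingly tricky question! There are (at least) two answers: Python is a language specification, and Python is an interpreter that runs on your computer.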
Python is an open source programming language that is extremely popular in the data science and web development communities. The roots of its current popularity in data science and scientific computing have an interesting history, but suffice to say that it's darn near impossible to be a practicing data scientist these days without at least being familiar with Python.
The guiding principles behind the design of the Python language specification are described in "The Zen of Python", which you can find in PEP 20 or by executing:
In [4]:
import this
Python syntax should be easy to write, but most importantly, well-written Python code should be easy to read. Code that follows these norms is called Pythonic. We'll touch a bit more on what it means to write Pythonic code in class.
A unique feature of Python is that whitespace matters, because it defines scope. Many other programming languages use braces or begin/end keywords to define scope. For example, in JavaScript, you write a for loop like this:
var count;
for (count = 0; count < 10; count++) {
    console.log(count);
    console.log("<br />");
}
The curly braces here define the code executed in each iteration of the for loop. Similarly, in Ruby you write a for loop like this:
for count in 0..9
  puts "#{count}"
end
In this snippet, the code executed in each iteration of the for loop is whatever comes between the first line and the end keyword.
In Python, for loops look a bit different:
In [5]:
print('Entering the for loop:\n')
for count in range(10):
    print(count)
    print('Still in the for loop.')
print("\nNow I'm done with the for loop.")
Note that there is no explicit symbol or keyword that defines the scope of code executed during each iteration - it's the indentation that defines the scope of the loop. When you define a function or class, or write a control structure like a for loop or if statement, you should indent the next line (4 spaces is customary). Each subsequent line at that same level of indentation is considered part of the scope. You only escape the scope when you return to the previous level of indentation.
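To make the indentation rule concrete, here is a small sketch with several levels of nesting. The function name classify_numbers and its labels are made up for illustration; the point is only how each level of indentation opens a new scope.

```python
def classify_numbers(numbers):
    """Label each number as 'even' or 'odd' (hypothetical helper for illustration)."""
    labels = []
    for n in numbers:              # level 1: inside the function
        if n % 2 == 0:             # level 2: inside the for loop
            labels.append('even')  # level 3: inside the if block
        else:
            labels.append('odd')
    return labels                  # back at level 1: runs after the loop finishes


print(classify_numbers([1, 2, 3]))
```

If you indent the return statement one level further, it becomes part of the loop body and the function returns after the first element, so it's worth paying attention to where each line sits.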
If you open up the terminal on your computer and type python, it starts an interactive program that prints a version banner followed by a >>> prompt.
This is a program called CPython (written in C, hence the name) that parses, interprets, and executes code written to the Python language standard. CPython is known as the "reference implementation" of Python - it is an open source project (you can download and build the source code yourself if you're feeling adventurous) run by the Python Software Foundation. It was led for many years by Guido van Rossum, the original creator of Python, who served as its "Benevolent Dictator for Life" until he stepped down from that role in 2018.
When you simply type python into the command line, CPython brings up a REPL (Read-Eval-Print Loop, pronounced "repple"), which is essentially an infinite loop that reads lines as you write them, evaluates the code, and prints the result.
For example, try typing
>>> x = "Hello world"
>>> print(x)
in the REPL. After you hit Enter on the first line, the interpreter assigns the value "Hello world" to a string variable x. After you hit Enter on the second line, it prints the value of x.
We can accomplish the same result by typing the same code
x = "Hello world"
print(x)
into a file called test.py and running python test.py from the command line. The only difference is that when you provide the argument test.py to the python command, the REPL doesn't appear. Instead, the CPython interpreter executes the contents of test.py from top to bottom until it reaches the end of the file, then exits. We won't use the REPL much in this course, but it's good to be aware that it exists. In fact, behind the pretty front end, this Jupyter notebook is essentially just wrapping the CPython interpreter, executing the code in each cell as we run it.
So to review, "Python" sometimes refers to a language specification and sometimes refers to an interpreter that's installed on your computer. We will use the two definitions interchangeably in this course; hopefully, it should be obvious from context which definition we're referring to.
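If you're curious which interpreter and which version of the language you're actually running, the standard library can tell you. This is just a quick sketch; the variable names are my own.

```python
import platform
import sys

# Name of the interpreter implementation, e.g. 'CPython' or 'PyPy'
implementation = platform.python_implementation()

# Version of the language that the interpreter implements, as (major, minor)
language_version = sys.version_info[:2]

print('Interpreter:', implementation)
print('Language version: {}.{}'.format(*language_version))
```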
One fundamental idea in Python is that everything is an object. This is different from some other languages like C and Java, which have fundamental, primitive data types like int and char. This means that things like integers and strings have attributes and methods that you can access. For example, if you want to read some documentation about an object my_thing, you can access its __doc__ attribute like this:
In [6]:
thing_1 = 47 # define an int object
print(thing_1.__doc__)
In [7]:
thing_1 = 'blah' # reassign thing_1 to a string object
print(thing_1.__doc__)
To learn more about what attributes and methods a given object has, you can call dir(my_object):
In [8]:
dir(thing_1)
Out[8]:
That's interesting - it looks like the string object has a method called __add__. Let's see what it does:
In [9]:
thing_2 = 'abcd'
thing_3 = thing_1.__add__(thing_2)
print(thing_3)
So calling __add__ with two strings creates a new string that is the concatenation of the two originals. As an aside, there are a lot more methods we can call on strings - split, upper, find, etc. We'll come back to this.
The + operator in Python is just syntactic sugar for the __add__ method:
In [10]:
thing_4 = thing_1 + thing_2
print(thing_4)
print(thing_3 == thing_4)
Any object you can add to another object in Python has an __add__ method. With integer addition, this works exactly as we would expect:
In [11]:
int_1 = 11
int_2 = 22
sum_1 = int_1.__add__(int_2)
sum_2 = int_1 + int_2
print(sum_1)
print(sum_2)
print(sum_1 == sum_2)
But it's unclear what it should mean to add an int to a str, so Python raises a TypeError rather than guessing:
In [12]:
thing_1 + int_1
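When you hit this error, the usual way out is to say explicitly which operation you mean by converting one of the operands. A minimal sketch, using fresh variable names so it can run on its own:

```python
text = 'blah'
number = 11

try:
    result = text + number      # str.__add__ doesn't know how to handle an int
except TypeError as error:
    print('Python refused:', error)

# Say what you mean by converting explicitly
as_text = text + str(number)    # concatenation
print(as_text)

as_number = int('22') + number  # arithmetic
print(as_number)
```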
There are a few native Python data types, each of which we'll use quite a bit. The properties of these types work largely the same way as they do in other languages. If you're ever confused about what type a variable my_var is, you can always call type(my_var).
Just like in other languages, bools take values of either True or False. All of the traditional Boolean operations are present:
In [13]:
bool_1 = True
type(bool_1)
Out[13]:
In [14]:
dir(bool_1)
Out[14]:
In [15]:
bool_2 = False
In [16]:
bool_1 == bool_2
Out[16]:
In [17]:
bool_1 + bool_2
Out[17]:
In [18]:
bool_1 and bool_2
Out[18]:
In [19]:
type(bool_1 * bool_2)
Out[19]:
In [20]:
int_1 = 2
type(int_1)
Out[20]:
In [21]:
dir(int_1)
Out[21]:
In [22]:
int_2 = 3
print(int_1 - int_2)
In [23]:
int_1.__pow__(int_2)
Out[23]:
In [24]:
int_1 ** int_2
Out[24]:
One change from Python 2 to Python 3 is the default way that integers are divided. In Python 2, the result of 2/3 is 0, the result of 4/3 is 1, etc. In other words, dividing integers in Python 2 always returned an integer, rounded down with any remainder discarded. In Python 3, the result of dividing two integers with / is always a float, with a decimal approximation of the remainder included. For example:
In [25]:
int_1 / int_2
Out[25]:
In [26]:
type(int_1 / int_2)
Out[26]:
In [27]:
int_1.__truediv__(int_2)
Out[27]:
In [28]:
int_1.__divmod__(int_2)
Out[28]:
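If you do want whole-number division in Python 3, the // (floor division) and % (remainder) operators are available alongside /. The short sketch below puts them side by side; the values are arbitrary.

```python
a = 7
b = 2

print(a / b)         # 3.5: true division always gives a float
print(a // b)        # 3: floor division rounds down to an integer
print(a % b)         # 1: the remainder
print(divmod(a, b))  # (3, 1): quotient and remainder together, as a tuple

# Floor division rounds toward negative infinity, not toward zero
print(-7 // 2)       # -4
```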
In [29]:
float_1 = 23.46
type(float_1)
Out[29]:
In [30]:
dir(float_1)
Out[30]:
In [31]:
float_2 = 3.
In [32]:
float_1 / float_2
Out[32]:
With ints and floats, we can also use comparison operators, just like in other languages:
In [33]:
int_1 < int_2
Out[33]:
In [34]:
float_1 >= int_2
Out[34]:
In [35]:
float_1 == float_2
Out[35]:
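One caveat with == on floats: because floats are stored as binary approximations, two values that look equal on paper may differ in the last few digits. A common workaround, sketched here, is to compare with a tolerance using math.isclose from the standard library.

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: total is actually 0.30000000000000004
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
```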
In [36]:
str_1 = 'hello'
type(str_1)
Out[36]:
In [37]:
dir(str_1)
Out[37]:
We already saw that the + operator concatenates two strings. Generalizing from this, what do you expect the * operator to do?
In [38]:
a = 'Hi'
print(a*5)
There are a number of very useful methods built into Python str objects. A few that you might find yourself needing to use when dealing with text data include:
In [39]:
# Count the number of occurrences of a substring
"Hi there I'm Nick".count('i')
Out[39]:
In [40]:
# Find the index of the first occurrence of a substring
"Hi there I'm Nick".find('i')
Out[40]:
In [41]:
"Hi there I'm Nick".find('i', 2)
Out[41]:
In [42]:
# Insert variables into a string
digit = 7
'The digit "7" should appear at the end of this sentence: {}.'.format(digit)
Out[42]:
In [43]:
another_digit = 15
'This sentence will have two digits at the end: {} and {}.'.format(digit, another_digit)
Out[43]:
In [44]:
# Replace a sub-string with another sub-string
my_sentence = "Hi there I'm Nick"
my_sentence.replace('e', 'E')
Out[44]:
In [45]:
my_sentence.replace('N', '')
Out[45]:
There are plenty more useful string methods - use either the dir() function or Google to learn more about what's available.
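As a taste of what else is there, here is a short sketch of a few more string methods that come up constantly when cleaning text: strip, split, upper, join, and startswith. The example sentence is padded with spaces on purpose.

```python
sentence = "  Hi there I'm Nick  "

cleaned = sentence.strip()   # remove leading and trailing whitespace
words = cleaned.split()      # split on whitespace into a list of words
shouted = cleaned.upper()    # upper-case copy; the original is unchanged
rejoined = '-'.join(words)   # glue the words back together with dashes

print(words)
print(shouted)
print(rejoined)
print(cleaned.startswith('Hi'))
```

Note that none of these methods modify the original string. Each one returns a new object, which is why the results are assigned to new names.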
So, to sum it up - basic data types like bool, int, float, and str are all objects in Python. The methods in each of these object classes define what operations can be done on them and how those operations are performed. For the sake of readability, however, many of the common operations like + and < are provided as syntactic sugar.
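The same mechanism is open to your own classes. The sketch below defines a toy Money class (entirely hypothetical, not something from a library) whose __add__ and __lt__ methods are what Python calls when you write + and <.

```python
class Money:
    """A toy amount of money in whole cents (hypothetical class for illustration)."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called for `self + other`
        if not isinstance(other, Money):
            return NotImplemented
        return Money(self.cents + other.cents)

    def __lt__(self, other):
        # Called for `self < other`
        if not isinstance(other, Money):
            return NotImplemented
        return self.cents < other.cents

    def __repr__(self):
        # Called when the object is printed or displayed
        return 'Money({})'.format(self.cents)


lunch = Money(850)
coffee = Money(300)
print(lunch + coffee)
print(coffee < lunch)
```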
When we were looking at the methods in the various data type classes above, we saw a bunch of methods like __add__ and __pow__ with double leading underscores and double trailing underscores (sometimes shortened to "dunders"). As it turns out, underscores are a bit of a thing in Python. Idiomatic use dictates a few important uses of underscores in variable and function names:
Multi-word variable and function names are written in snake_case (my_variable) rather than camelCase (myVariable).
A single leading underscore (_my_function or _my_variable) denotes a function or variable that is not meant for end users to access directly. Python doesn't have a sense of strong encapsulation, i.e. there are no strictly "private" methods or variables like in Java, but a leading underscore is a way of "weakly" signaling that the entity is for private use only.
A single trailing underscore (type_) is used to avoid conflict with Python built-in functions or keywords. In my opinion, this is often poor style. Try to come up with a more descriptive name instead.
Double leading and trailing underscores (__init__, __add__) mark special variables or methods that correspond to some sort of "magic" syntax. As we saw above, the __add__ method of an object describes what the result of some_object + another_object is.
For lots more detail on the use of underscores in Python, check out this post.
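The underscore conventions can be seen side by side in a small class. TemperatureLog and everything in it are hypothetical names that I made up purely to illustrate the naming.

```python
class TemperatureLog:
    """Hypothetical class, used only to illustrate naming conventions."""

    def __init__(self, readings):        # dunder: called by TemperatureLog(...)
        self._readings = list(readings)  # leading underscore: internal detail

    def __len__(self):                   # dunder: called by len(log)
        return len(self._readings)

    def _total(self):                    # leading underscore: internal helper
        return sum(self._readings)

    def average_reading(self):           # snake_case public method
        return self._total() / len(self)


log = TemperatureLog([20.0, 22.0, 24.0])
print(len(log))
print(log.average_reading())
print(log._readings)  # still accessible: the underscore is only a convention
```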
Single variables can only take us so far. Eventually, we're going to want ways of storing many individual variables in a single, structured format.
The list is one of the most commonly used Python data structures. A list is an ordered collection of (potentially heterogeneous) objects. Similar structures that exist in other languages are often called arrays.
In [46]:
my_list = ['a', 'b', 'c', 'a']
In [47]:
len(my_list)
Out[47]:
In [48]:
my_list.append(1)
print(my_list)
To access individual list elements by their position, use square brackets:
In [49]:
my_list[0] # indexing in Python starts at 0!
Out[49]:
In [50]:
my_list[4]
Out[50]:
In [51]:
my_list[-1] # negative indexes count backward from the end of the list
Out[51]:
Lists can hold arbitrary objects!
In [52]:
type(my_list[0])
Out[52]:
In [53]:
type(my_list[-1])
Out[53]:
In [54]:
# let's do something crazy
my_list.append(my_list)
type(my_list[-1])
Out[54]:
In [55]:
my_list
Out[55]:
In [56]:
my_list[-1]
Out[56]:
In [57]:
my_list[-1][-1]
Out[57]:
In [58]:
my_list[-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1]
Out[58]:
Lists are also mutable objects, meaning that any part of them can be changed at any time. This makes them very flexible objects for storing data in a program.
In [59]:
my_list = ['a', 'b', 1]
In [60]:
my_list[0] = 'c'
my_list
Out[60]:
In [61]:
my_list.remove(1)
my_list
Out[61]:
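Mutability has one consequence that surprises a lot of newcomers: assigning a list to a second name does not copy it. Both names refer to the same object, so a change made through one is visible through the other. A small sketch, with made-up variable names:

```python
original = ['a', 'b', 1]

alias = original            # a second name for the same list object
duplicate = list(original)  # a new list with the same elements

original.append('z')

print(alias)      # the change shows up here
print(duplicate)  # but not here
print(alias is original, duplicate is original)
```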
In [62]:
my_tuple = ('a', 'b', 1, 'a')
In [63]:
my_tuple[2]
Out[63]:
In [64]:
my_tuple[0] = 'c'
In [65]:
my_tuple.append('c')
In [66]:
my_tuple.remove(1)
In [67]:
my_set = {'a', 'b', 1, 'a'}
print(my_set) # note that the order is not preserved and the duplicate 'a' is dropped
In [68]:
my_set.add('c')
print(my_set)
Note above that the order of items in a set doesn't have the same meaning as in lists and tuples.
In [69]:
my_set[0]
Sets are used for a couple of reasons. Sometimes, finding the number of unique items in a list or tuple is important. In this case, we can convert the list/tuple to a set, then call len on the new set. For example,
In [70]:
my_list = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
my_list
Out[70]:
In [71]:
my_set = set(my_list)
len(my_set)
Out[71]:
The other reason is that the in keyword for testing a collection for membership of an object is much faster for a set than for a list.
In [72]:
my_list = list(range(1000000)) # list of numbers 0 - 999,999
my_set = set(my_list)
In [73]:
%%timeit
999999 in my_list
In [74]:
%%timeit
999999 in my_set
Any idea why there's such a discrepancy?
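Here's a hint, in the form of a sketch you can run outside a notebook using the standard library's timeit module instead of the %%timeit magic. A list has to check its elements one by one, while a set is built on a hash table and can jump (almost) directly to where an element would be stored. The collection size and repetition count below are arbitrary choices of mine.

```python
import timeit

numbers_list = list(range(100000))
numbers_set = set(numbers_list)
target = 99999  # worst case for the list: the last element

# A list compares the target against its elements one by one
list_seconds = timeit.timeit(lambda: target in numbers_list, number=100)

# A set hashes the target and looks in the corresponding slot
set_seconds = timeit.timeit(lambda: target in numbers_set, number=100)

print('list: {:.6f} s'.format(list_seconds))
print('set:  {:.6f} s'.format(set_seconds))
```

The exact timings depend on your machine, but the gap should grow as the collection gets larger.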
In [75]:
my_dict = {'name': 'Nick',
           'birthday': 'July 13',
           'years_in_durham': 4}
In [76]:
my_dict['name']
Out[76]:
In [77]:
my_dict['years_in_durham']
Out[77]:
In [78]:
my_dict['favorite_restaurant'] = 'Mateo'
my_dict['favorite_restaurant']
Out[78]:
In [79]:
my_dict['age'] # hey, that's personal. Also, it's not a key in the dictionary.
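In addition to accessing values by keys, you can retrieve the keys and values by themselves. In Python 3, these come back as view objects, which you can loop over directly or convert to lists with list():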
In [80]:
my_dict.keys()
Out[80]:
In [81]:
my_dict.values()
Out[81]:
Note that if you're using Python 3.5 or earlier, the order that you insert key/value pairs into the dictionary doesn't necessarily correspond to the order they're stored in (we inserted favorite_restaurant after years_in_durham!). This behavior changed in CPython 3.6 (released in December 2016), where dictionaries keep insertion order as an implementation detail, and insertion order became a guaranteed part of the language in Python 3.7.
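Two more dictionary tools are worth knowing early: .get, which avoids the KeyError we ran into above by returning a default for missing keys, and .items, which lets you loop over keys and values together. A minimal sketch with its own small dictionary:

```python
person = {'name': 'Nick', 'birthday': 'July 13', 'years_in_durham': 4}

# .get returns a default instead of raising KeyError for a missing key
age = person.get('age', 'unknown')
print(age)

# `in` tests whether a key is present
print('name' in person)

# .items yields (key, value) pairs
for key, value in person.items():
    print('{}: {}'.format(key, value))
```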
As data scientists, we're data-driven people, and we want our code to be data-driven, too. Control structures are a way of adding a logical flow to your programs, making them reactive to different conditions. These concepts are largely the same as in other programming languages, so I'll quickly introduce the syntax here for reference without much comment.
Like most programming languages, Python provides a way of conditionally evaluating lines of code.
In [82]:
x = 3
if x < 2:
    print('x less than 2')
elif x < 4:
    print('x less than 4, greater than or equal to 2')
else:
    print('x greater than or equal to 4')
In [83]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)
To iterate a specific number of times, you can create an iterable object with the range function:
In [84]:
for i in range(5): # iterate over all integers (starting at 0) less than 5
    print(i)
In [85]:
for i in range(2, 6, 3): # iterate over integers (starting at 2) less than 6, increasing by 3
    print(i)
In [86]:
my_list = ['a', 'b', 'c']
idx = 0
while idx < len(my_list):
    print(my_list[idx])
    idx += 1
In [87]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)
There are occasionally other reasons for using while loops (waiting for an external input, for example), but we won't make extensive use of them in this course.
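If the reason you reached for a while loop was that you needed the index as well as the element, the built-in enumerate gives you both in a for loop. Its cousin zip walks through several collections in step. A quick sketch:

```python
letters = ['a', 'b', 'c']

# enumerate yields (index, element) pairs, so no manual counter is needed
for idx, letter in enumerate(letters):
    print(idx, letter)

# zip pairs up elements from several iterables
numbers = [1, 2, 3]
pairs = list(zip(letters, numbers))
print(pairs)
```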
Of course, as data scientists, one of our most important jobs is to manipulate data in a way that provides insight. In other words, we need ways of taking raw data, doing some things to it, and returning nice, clean, processed data back. This is the job of functions!
It turns out that Python has a ton of functions built in already. When we have a task that can be accomplished by a built-in function, it's almost always a good idea to use it. This is because many of the Python built-in functions are actually written in C, not Python, and C tends to be much faster for certain tasks.
In [88]:
my_list = range(1000000)
In [89]:
%%timeit
sum(my_list)
In [90]:
%%timeit
my_sum = 0
for element in my_list:
    my_sum += element
my_sum
Some common mathematical functions that are built into Python:
sum
divmod
round
abs
max
min
And some other convenience functions, some of which we've already seen:
int, float, str, set, list, dict: for converting between data structures
len: for finding the number of elements in a data structure
type: for finding the type that an object belongs to
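Here is a quick tour of the built-ins listed above in action. The input values are arbitrary, picked so the results are easy to check by hand.

```python
values = [3, -7, 2.5, 10]

print(sum(values))               # 8.5
print(max(values), min(values))  # 10 -7
print(abs(-7))                   # 7
print(round(2.567, 1))           # 2.6
print(divmod(17, 5))             # (3, 2)

# Converting between types and structures
print(int('42') + 1)             # 43
print(float('2.5'))              # 2.5
print(str(99) + '!')             # 99!
print(len(set('banana')))        # 3 unique letters
print(len(values), type(values))
```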
In [91]:
def double_it(x):
    return x * 2
In [92]:
double_it(5)
Out[92]:
Python has dynamic typing, which (in part) means that the arguments to functions aren't assigned a specific type:
In [93]:
double_it('hello') # remember 'hello' * 2 from before?
Out[93]:
In [94]:
double_it({'a', 'b'}) # but there's no notion of multiplication for sets
When defining a function, you can add defaults to arguments that you want to be optional. When defining and providing arguments, required arguments always go first, and the order they're provided in matters. Optional arguments follow, and can be passed by their keyword in any order.
In [95]:
def multiply_them(x, y, extra_arg1=None, extra_arg2=None):
    if extra_arg1 is not None:
        print(extra_arg1)
    if extra_arg2 is not None:
        print(extra_arg2)
    print('multiplying {} and {}...'.format(x, y))
    return x * y
In [96]:
multiply_them(3, 5)
Out[96]:
In [97]:
multiply_them(3, 5, extra_arg1='hello')
Out[97]:
In [98]:
multiply_them(3, 5, extra_arg2='world', extra_arg1='hello')
Out[98]:
In [99]:
multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5) # SyntaxError: positional arguments can't follow keyword arguments
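That last call is rejected with a SyntaxError because positional arguments can't come after keyword arguments. The sketch below shows the calling patterns that do work, using a hypothetical describe_pet function of my own so that it runs independently of the cells above.

```python
def describe_pet(name, species, sound=None, age=None):
    """Build a one-line description (hypothetical function for illustration)."""
    description = '{} is a {}'.format(name, species)
    if sound is not None:
        description += ' that says {}'.format(sound)
    if age is not None:
        description += ' and is {} years old'.format(age)
    return description


# Positional arguments first, in order; keywords afterwards, in any order
print(describe_pet('Rex', 'dog', age=3, sound='woof'))

# Required arguments can also be passed by keyword, which removes any ambiguity
print(describe_pet(species='cat', name='Tom'))
```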
Knowing how to create your own functions can be a rabbit hole - once you know that you can make Python do whatever you want it to do, it can be easy to go overboard. Good data scientists are efficient data scientists - you shouldn't reinvent the wheel by reimplementing a bunch of functionality that someone else worked hard on. Doing anything nontrivial can take a ton of time, and without spending even more time to write tests, squash bugs, and address corner cases, your code can easily end up being much less reliable than code that someone else has spent time perfecting.
Python has a very robust standard library of modules that come with every Python installation. For even more specialized work, the Python community has also open-sourced tens of thousands of packages, any of which is a simple pip install away.
The Python standard library is a collection of packages that ships with Python itself. In other words, it contains a bunch of code that you can import into code you're writing, but that you don't have to download separately after downloading Python.
Here are a few examples -
In [100]:
import random # create (pseudo-) random numbers
random.random()
Out[100]:
In [101]:
import math # common mathematical functions that aren't built into base Python
print(math.factorial(5))
In [102]:
math.log10(100)
Out[102]:
In [103]:
import statistics # some basic summary statistics
my_list = [1, 2, 3, 4, 5]
statistics.mean(my_list)
Out[103]:
In [104]:
statistics.median(my_list)
Out[104]:
In [105]:
statistics.stdev(my_list)
Out[105]:
There are dozens of packages in the standard library, so if you find yourself writing a function for something lots of other people might want to do, it's definitely worth checking whether that function is already implemented in the Python standard library.
We'll use a handful of packages from the standard library in this course, which I'll introduce as they appear.
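As one example of the "someone has already written this" principle: tallying how often each item appears in a list is such a common task that the collections module in the standard library ships a Counter class for it. A minimal sketch:

```python
from collections import Counter

letters = ['a', 'a', 'a', 'a', 'b', 'b', 'b']

counts = Counter(letters)     # tally how often each item appears
print(counts['a'])            # 4
print(counts['b'])            # 3
print(counts.most_common(1))  # [('a', 4)]
```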
Nonetheless, the standard library can't contain functionality that covers everything people use Python for. For more specialized packages, the Python Software Foundation runs the Python Package Index (PyPI, pronounced pie-pee-eye). PyPI is a package server that is free to upload and download from - anyone can create a package and upload it to PyPI, and anyone can download any package from PyPI at any time.
To download and install a package from PyPI, you typically use a program called pip (pip installs packages) by running the command pip install <package name> from the command line.
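Before installing something, it can be handy to check from Python whether a module is already available. One way to do that, sketched below, is importlib.util.find_spec from the standard library. The helper name is_installed and the deliberately nonexistent package name are my own inventions.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed('math'))  # part of the standard library
print(is_installed('a_package_that_does_not_exist'))
```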
This notebook is fairly information-dense, especially if you haven't used Python before. Keep it close by for reference as the course goes along! Thankfully, Python syntax is fairly friendly toward beginners, so picking up the basics usually doesn't take too long. I hope you'll find as the course goes along that the Python syntax starts to feel more natural. Don't get discouraged; know when to ask for help, and look online for resources. And remember - the Python ecosystem is deep, and it can take years to master!