Welcome!

This is a 4-week (8-hour) course that will introduce you to the basics of handling, manipulating, exploring, and modeling data with Python. This notebook is a review of some Python language essentials. We'll go over this material during the first class, but taking a look at it before then will help you out, especially if you haven't used Python before.

The environment you're in right now is called a Jupyter notebook. Project Jupyter is an interactive environment that data scientists use for collaboration and communication. Each cell in the notebook can either contain text or code (often Python, but R, Julia, and lots of other languages are supported). This allows you to seamlessly weave explanations and plots into your code.

Each cell in a notebook can be executed independently, but declarations persist across cells. For example, I can define a variable in one cell...


In [1]:
my_variable = 10

... and then access that variable in a later cell:


In [2]:
print(my_variable)


10

We'll be using Jupyter notebooks extensively in this class. I'll give a more detailed introduction during the first class, but for now, the most important thing is to understand how to run code in the notebook.

As I mentioned above, there are two fundamental types of cells in a notebook - text (i.e. markdown) and code. When you click on a code cell, you should see a cursor appear in the cell that allows you to edit the code in that cell. A cell can have multiple lines - to begin a new line, press Enter. When you want to run the cell's code, press Shift+Enter.

Try changing the values of the numbers that are added together in the cell below, and observe how the output changes:


In [3]:
a = 10
b = 15
print(a + b)


25

You can also edit the text in markdown cells. To display the editable, raw markdown in a text cell, double click on the cell. You can now put your cursor in the cell and edit it directly. When you're done editing, press Shift+Enter to render the cell into a more readable format.

Try editing the text cell below with your name:

Make some edits here -> Hello, my name is Nick!

To change whether a cell contains text or code, use the drop-down in the toolbar. When you're in a code cell, it will look like this:

[Screenshot: toolbar drop-down for a code cell]

and when you're in a text cell, it will look like this:

[Screenshot: toolbar drop-down for a text cell]

Now that you know how to navigate the notebook, let's review some basic Python.

What is Python?

This is actually a surprisingly tricky question! There are (at least) two answers:

  • A language specification
  • A program on your computer that interprets and executes code written to that language specification

Python (the language)

Python is an open source programming language that is extremely popular in the data science and web development communities. The roots of its current popularity in data science and scientific computing have an interesting history, but suffice to say that it's darn near impossible to be a practicing data scientist these days without at least being familiar with Python.

The guiding principles behind the design of the Python language specification are described in "The Zen of Python", which you can find here or by executing:


In [4]:
import this


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Python syntax should be easy to write, but most importantly, well-written Python code should be easy to read. Code that follows these norms is called Pythonic. We'll touch a bit more on what it means to write Pythonic code in class.

What's the deal with whitespace?

A distinctive feature of Python is that whitespace matters, because it defines scope. Many other programming languages use braces or begin/end keywords to define scope. For example, in JavaScript, you write a for loop like this:

var count;
for (count = 0; count < 10; count++) {
    console.log(count);
    console.log("<br />");
}

The curly braces here define the code executed in each iteration of the for loop. Similarly, in Ruby you write a for loop like this:

for count in 0..9
   puts "#{count}"
end

In this snippet, the code executed in each iteration of the for loop is whatever comes between the first line and the end keyword.

In Python, for loops look a bit different:


In [5]:
print('Entering the for loop:\n')
for count in range(10):
    print(count)
    print('Still in the for loop.')

print("\nNow I'm done with the for loop.")


Entering the for loop:

0
Still in the for loop.
1
Still in the for loop.
2
Still in the for loop.
3
Still in the for loop.
4
Still in the for loop.
5
Still in the for loop.
6
Still in the for loop.
7
Still in the for loop.
8
Still in the for loop.
9
Still in the for loop.

Now I'm done with the for loop.

Note that there is no explicit symbol or keyword that defines the scope of code executed during each iteration - it's the indentation that defines the scope of the loop. When you define a function or class, or write a control structure like a for loop or if statement, you should indent the next line (4 spaces is customary). Each subsequent line at that same level of indentation is considered part of the scope. You only escape the scope when you return to the previous level of indentation.
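To make the nesting rule concrete, here's a small sketch in which each deeper level of indentation opens a new block. The function name describe_numbers is made up purely for illustration:

```python
def describe_numbers(limit):
    # everything indented under "def" belongs to the function body
    descriptions = []
    for number in range(limit):
        # everything indented under "for" runs once per iteration
        if number % 2 == 0:
            # this line only runs when the condition is true
            descriptions.append('{} is even'.format(number))
        else:
            descriptions.append('{} is odd'.format(number))
    # back at the function's indentation level: the loop is over
    return descriptions

result = describe_numbers(3)
print(result)
```

Try removing the indentation from the return line and see what Python says about it.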

Python (the interpreter)

If you open up the terminal on your computer and type python, it runs a program that looks something like this:

[Screenshot: the Python interactive prompt running in a terminal]

This is a program called CPython (written in C, hence the name) that parses, interprets, and executes code written to the Python language standard. CPython is known as the "reference implementation" of Python - it is an open source project (you can download and build the source code yourself if you're feeling adventurous) run by the Python Software Foundation. For most of its history it was led by Guido van Rossum, the original creator of Python and its "Benevolent Dictator for Life" until he stepped down from that role in 2018.

When you simply type python into the command line, CPython brings up a REPL (Read-Eval-Print Loop, pronounced "repple"), which is essentially an infinite loop that reads lines as you write them, evaluates the code, and prints the result.

For example, try typing

>>> x = "Hello world"
>>> print(x)

in the REPL. After you hit Enter on the first line, the interpreter assigns the value "Hello world" to a string variable x. After you hit Enter on the second line, it prints the value of x.

We can accomplish the same result by typing the same code

x = "Hello world"
print(x)

into a file called test.py and running python test.py from the command line. The only difference is that when you provide the argument test.py to the python command, the REPL doesn't appear. Instead, the CPython interpreter interprets the contents of test.py line-by-line until it reaches the end of the file, then exits. We won't use the REPL much in this course, but it's good to be aware that it exists. In fact, behind the pretty front end, this Jupyter notebook is essentially just wrapping the CPython interpreter, executing commands line by line as we enter them.

So to review, "Python" sometimes refers to a language specification and sometimes refers to an interpreter that's installed on your computer. We will use the two definitions interchangeably in this course; hopefully, it should be obvious from context which definition we're referring to.
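If you're curious which interpreter is actually running your code, the standard library can tell you. This is just a small sketch; the output depends on your installation, so I haven't shown it here:

```python
import platform
import sys

implementation = platform.python_implementation()   # for example 'CPython' or 'PyPy'
version = platform.python_version()                  # version string in major.minor.patch form

print(implementation, version)
print(sys.executable)    # path to the interpreter program itself
```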

Variables, Objects, Operators, and Naming

One fundamental idea in Python is that everything is an object. This is different than some other languages like C and Java, which have fundamental, primitive data types like int and char. This means that things like integers and strings have attributes and methods that you can access. For example, if you want to read some documentation about an object my_thing, you can access its __doc__ attribute like this:


In [6]:
thing_1 = 47    # define an int object
print(thing_1.__doc__)


int(x=0) -> integer
int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is a number, return x.__int__().  For floating point
numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base.  The literal can be preceded by '+' or '-' and be surrounded
by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4

In [7]:
thing_1 = 'blah'    # reassign thing_1 to a string object
print(thing_1.__doc__)


str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

To learn more about what attributes and methods a given object has, you can call dir(my_object):


In [8]:
dir(thing_1)


Out[8]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

That's interesting - it looks like the string object has a method called __add__. Let's see what it does -


In [9]:
thing_2 = 'abcd'
thing_3 = thing_1.__add__(thing_2)
print(thing_3)


blahabcd

So calling __add__ with two strings creates a new string that is the concatenation of the two originals. As an aside, there are a lot more methods we can call on strings - split, upper, find, etc. We'll come back to this.

The + operator in Python is just syntactic sugar for the __add__ method:


In [10]:
thing_4 = thing_1 + thing_2
print(thing_4)
print(thing_3 == thing_4)


blahabcd
True

Any object you can add to another object in Python has an __add__ method. With integer addition, this works exactly as we would expect:


In [11]:
int_1 = 11
int_2 = 22
sum_1 = int_1.__add__(int_2)
sum_2 = int_1 + int_2
print(sum_1)
print(sum_2)
print(sum_1 == sum_2)


33
33
True

But it's unclear what to do when someone tries to add an int to a str:


In [12]:
thing_1 + int_1


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-3b4969d216fd> in <module>()
----> 1 thing_1 + int_1

TypeError: Can't convert 'int' object to str implicitly
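Python refuses to guess here (remember the Zen of Python!), so the fix is to convert explicitly. A minimal sketch, using fresh variable names (label and count) that I've made up so it doesn't depend on the cells above:

```python
label = 'blah'
count = 11

as_text = label + str(count)     # convert the int to a str, then concatenate
as_number = int('22') + count    # or convert a numeric string to an int, then add

print(as_text)
print(as_number)
```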

Data types

There are a few native Python data types, each of which we'll use quite a bit. The properties of these types work largely the same way as they do in other languages. If you're ever confused about what type a variable my_var is, you can always call type(my_var).

Booleans

Just like in other languages, bools take values of either True or False. All of the traditional Boolean operations are present:


In [13]:
bool_1 = True
type(bool_1)


Out[13]:
bool

In [14]:
dir(bool_1)


Out[14]:
['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

In [15]:
bool_2 = False

In [16]:
bool_1 == bool_2


Out[16]:
False

In [17]:
bool_1 + bool_2


Out[17]:
1

In [18]:
bool_1 and bool_2


Out[18]:
False

In [19]:
type(bool_1 * bool_2)


Out[19]:
int
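The arithmetic results above make sense once you know that bool is a subclass of int in Python, with True behaving like 1 and False like 0. One handy consequence, sketched below with a made-up list of flags, is that you can count how many conditions are true by summing them:

```python
print(isinstance(True, int))     # bool is a subclass of int

flags = [True, False, True, True]
number_true = sum(flags)         # True counts as 1, False as 0
print(number_true)

# the logical operators and, or, and not work on bools as you'd expect
either = True or False
negated = not True
print(either, negated)
```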

Integers

Python ints are whole (positive, negative, or 0) numbers implemented as long objects of arbitrary size. Again, all of the standard operations are present:


In [20]:
int_1 = 2
type(int_1)


Out[20]:
int

In [21]:
dir(int_1)


Out[21]:
['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

In [22]:
int_2 = 3
print(int_1 - int_2)


-1

In [23]:
int_1.__pow__(int_2)


Out[23]:
8

In [24]:
int_1 ** int_2


Out[24]:
8

One change from Python 2 to Python 3 is the default way that integers are divided. In Python 2, the result of 2/3 is 0, the result of 4/3 is 1, etc. In other words, dividing integers in Python 2 always returned an integer, with the result rounded down to the nearest whole number. In Python 3, dividing two integers with / always returns a float, so the fractional part is kept. For example:


In [25]:
int_1 / int_2


Out[25]:
0.6666666666666666

In [26]:
type(int_1 / int_2)


Out[26]:
float

In [27]:
int_1.__truediv__(int_2)


Out[27]:
0.6666666666666666

In [28]:
int_1.__divmod__(int_2)


Out[28]:
(0, 2)
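If you do want a whole-number result in Python 3, use the floor division operator //, and use % to get the remainder. Together they give the same pair that divmod returns. A quick sketch with throwaway numbers:

```python
quotient = 7 // 2      # floor division: rounds down to a whole number
remainder = 7 % 2      # modulo: what's left over
print(quotient, remainder)
print(divmod(7, 2) == (quotient, remainder))

# floor division rounds toward negative infinity, not toward zero
negative_quotient = -7 // 2
print(negative_quotient)
```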

Floats

Python floats are also consistent with other languages:


In [29]:
float_1 = 23.46
type(float_1)


Out[29]:
float

In [30]:
dir(float_1)


Out[30]:
['__abs__',
 '__add__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getformat__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__int__',
 '__le__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmod__',
 '__rmul__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setformat__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 'as_integer_ratio',
 'conjugate',
 'fromhex',
 'hex',
 'imag',
 'is_integer',
 'real']

In [31]:
float_2 = 3.

In [32]:
float_1 / float_2


Out[32]:
7.82

With ints and floats, we can also use comparison operators, just like in other languages:


In [33]:
int_1 < int_2


Out[33]:
True

In [34]:
float_1 >= int_2


Out[34]:
True

In [35]:
float_1 == float_2


Out[35]:
False
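One caveat about ==: floats are stored as binary approximations, so two values that look equal on paper can differ in the last few digits. Here's a small sketch of the classic example, with math.isclose from the standard library as a safer way to compare:

```python
import math

total = 0.1 + 0.2
exactly_equal = (total == 0.3)              # False, because of rounding error
close_enough = math.isclose(total, 0.3)     # True, compares within a tolerance

print(total)
print(exactly_equal, close_enough)
```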

Strings


In [36]:
str_1 = 'hello'
type(str_1)


Out[36]:
str

In [37]:
dir(str_1)


Out[37]:
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

We already saw that the + operator concatenates two strings. Generalizing from this, what do you expect the * operator to do?


In [38]:
a = 'Hi'
print(a*5)


HiHiHiHiHi

There are a number of very useful methods built into Python str objects. A few that you might find yourself needing to use when dealing with text data include:


In [39]:
# count the number of occurrences of a sub-string
"Hi there I'm Nick".count('i')


Out[39]:
2

In [40]:
# Find the index of the first occurrence of a sub-string
"Hi there I'm Nick".find('i')


Out[40]:
1

In [41]:
"Hi there I'm Nick".find('i', 2)


Out[41]:
14

In [42]:
# Insert variables into a string
digit = 7
'The digit "7" should appear at the end of this sentence: {}.'.format(digit)


Out[42]:
'The digit "7" should appear at the end of this sentence: 7.'

In [43]:
another_digit = 15
'This sentence will have two digits at the end: {} and {}.'.format(digit, another_digit)


Out[43]:
'This sentence will have two digits at the end: 7 and 15.'

In [44]:
# Replace a sub-string with another sub-string
my_sentence = "Hi there I'm Nick"
my_sentence.replace('e', 'E')


Out[44]:
"Hi thErE I'm Nick"

In [45]:
my_sentence.replace('N', '')


Out[45]:
"Hi there I'm ick"

There are plenty more useful string methods - use either the dir() function or Google to learn more about what's available.
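As promised earlier, here are a few more of those methods in action. This is just a sketch with a made-up messy string; strip, split, join, and upper are all standard str methods:

```python
raw_text = '  alice,bob,carol  '

cleaned = raw_text.strip()        # remove leading and trailing whitespace
names = cleaned.split(',')        # break the string into a list on commas
rejoined = ' & '.join(names)      # glue the pieces back together with a separator
shouted = rejoined.upper()        # upper-case every letter

print(names)
print(rejoined)
print(shouted)
```

Note that none of these methods change the original string; each one returns a new string (or list).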

So, to sum it up - basic data types like bool, int, float, and str are all objects in Python. The methods in each of these object classes define what operations can be done on them and how those operations are performed. For the sake of readability, however, many of the common operations like + and < are provided as syntactic sugar.

An aside: What's the deal with those underscores???

When we were looking at the methods in the various data type classes above, we saw a bunch of methods like __add__ and __pow__ with double leading underscores and double trailing underscores (sometimes shortened to "dunders"). As it turns out, underscores are a bit of a thing in Python. Idiomatic use dictates a few important uses of underscores in variable and function names:

  • Underscores are used to separate words in names. That is, idiomatic Python uses snake_case (my_variable) rather than camelCase (myVariable).
  • A single leading underscore (_my_function or _my_variable) denotes a function or variable that is not meant for end users to access directly. Python doesn't have a sense of strong encapsulation, i.e. there are no strictly "private" methods or variables like in Java, but a leading underscore is a way of "weakly" signaling that the entity is for private use only.
  • A single trailing underscore (type_) is used to avoid conflict with Python built-in functions or keywords. In my opinion, this is often poor style. Try to come up with a more descriptive name instead.
  • Double leading underscores and double trailing underscores (__init__, __add__) denote special variables or methods that implement some sort of "magic" syntax. As we saw above, the __add__ method of an object describes what the result of some_object + another_object is.

For lots more detail on the use of underscores in Python, check out this post.
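To see the "magic" in action, here's a minimal sketch of a class that defines its own __add__. The class name Money and its _cents attribute are invented for this example; the point is only that + ends up calling __add__:

```python
class Money:
    def __init__(self, cents):
        self._cents = cents    # leading underscore: "please don't touch this directly"

    def __add__(self, other):
        # called when Python evaluates some_money + other_money
        return Money(self._cents + other._cents)

    def __repr__(self):
        # called when Python needs a printable representation
        return 'Money({})'.format(self._cents)

total = Money(150) + Money(275)
print(total)
```

Don't worry if the class syntax is unfamiliar. For now, just notice how the underscore conventions from the list above show up in real code.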

Collections of objects

Single variables can only take us so far. Eventually, we're going to want ways of storing many individual values in a single, structured format.

Lists

The list is one of the most commonly used Python data structures. A list is an ordered collection of (potentially heterogeneous) objects. Similar structures that exist in other languages are often called arrays.


In [46]:
my_list = ['a', 'b', 'c', 'a']

In [47]:
len(my_list)


Out[47]:
4

In [48]:
my_list.append(1)
print(my_list)


['a', 'b', 'c', 'a', 1]

To access individual list elements by their position, use square brackets:


In [49]:
my_list[0]    # indexing in Python starts at 0!


Out[49]:
'a'

In [50]:
my_list[4]


Out[50]:
1

In [51]:
my_list[-1]    # negative indexes count backward from the end of the list


Out[51]:
1

Lists can hold arbitrary objects!


In [52]:
type(my_list[0])


Out[52]:
str

In [53]:
type(my_list[-1])


Out[53]:
int

In [54]:
# let's do something crazy
my_list.append(my_list)
type(my_list[-1])


Out[54]:
list

In [55]:
my_list


Out[55]:
['a', 'b', 'c', 'a', 1, [...]]

In [56]:
my_list[-1]


Out[56]:
['a', 'b', 'c', 'a', 1, [...]]

In [57]:
my_list[-1][-1]


Out[57]:
['a', 'b', 'c', 'a', 1, [...]]

In [58]:
my_list[-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1]


Out[58]:
['a', 'b', 'c', 'a', 1, [...]]

Lists are also mutable objects, meaning that any part of them can be changed at any time. This makes them very flexible objects for storing data in a program.


In [59]:
my_list = ['a', 'b', 1]

In [60]:
my_list[0] = 'c'
my_list


Out[60]:
['c', 'b', 1]

In [61]:
my_list.remove(1)
my_list


Out[61]:
['c', 'b']
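Mutability has one consequence that surprises many newcomers: assigning a list to a second name doesn't copy it, so changes made through one name are visible through the other. A short sketch (the variable names are made up):

```python
original = ['a', 'b', 'c']
alias = original                # both names now refer to the same list object
alias.append('d')
print(original)                 # the change shows up here too

independent = list(original)    # list(...) builds a new (shallow) copy
independent.append('e')
print(original)                 # unaffected by changes to the copy
print(independent)
```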

Tuples

A tuple in Python is very similar to a list, except that tuples are immutable. This means that once they're defined, they can't be changed. Otherwise, they act very much like lists.


In [62]:
my_tuple = ('a', 'b', 1, 'a')

In [63]:
my_tuple[2]


Out[63]:
1

In [64]:
my_tuple[0] = 'c'


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-0f5a2ef37dc5> in <module>()
----> 1 my_tuple[0] = 'c'

TypeError: 'tuple' object does not support item assignment

In [65]:
my_tuple.append('c')


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-65-204bb512c7ec> in <module>()
----> 1 my_tuple.append('c')

AttributeError: 'tuple' object has no attribute 'append'

In [66]:
my_tuple.remove(1)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-66-3fcce5bc5de6> in <module>()
----> 1 my_tuple.remove(1)

AttributeError: 'tuple' object has no attribute 'remove'
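Immutability isn't only a restriction. Tuples are a natural fit for small, fixed groupings of values, and you can unpack them into separate variables in one step. A sketch with an invented (x, y) point:

```python
point = (3, 4)
x, y = point                 # tuple unpacking: assigns 3 to x and 4 to y
print(x, y)

# unpacking also gives a tidy way to swap two variables
x, y = y, x
print(x, y)

# because these tuples can't change, they can be stored in a set (more on sets next)
visited = {(0, 0), (3, 4)}
print(point in visited)
```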

Sets

A set in Python acts somewhat like a list that contains only unique objects.


In [67]:
my_set = {'a', 'b', 1, 'a'}
print(my_set)    # note that the order isn't preserved


{1, 'b', 'a'}

In [68]:
my_set.add('c')
print(my_set)


{1, 'b', 'a', 'c'}

Note above that the order of items in a set doesn't have the same meaning as in lists and tuples.


In [69]:
my_set[0]


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-c84e5ee70af8> in <module>()
----> 1 my_set[0]

TypeError: 'set' object does not support indexing

Sets are used for a couple of reasons. Sometimes, finding the number of unique items in a list or tuple is important. In this case, we can convert the list/tuple to a set, then call len on the new set. For example,


In [70]:
my_list = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
my_list


Out[70]:
['a', 'a', 'a', 'a', 'b', 'b', 'b']

In [71]:
my_set = set(my_list)
len(my_set)


Out[71]:
2

The other reason is that the in keyword, which tests whether a collection contains an object, is much faster for a set than for a list.


In [72]:
my_list = list(range(1000000))    # list of numbers 0 - 999,999
my_set = set(my_list)

In [73]:
%%timeit
999999 in my_list


100 loops, best of 3: 12.5 ms per loop

In [74]:
%%timeit
999999 in my_set


10000000 loops, best of 3: 59.9 ns per loop

Any idea why there's such a discrepancy?
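Here's a hint, plus a way to reproduce the comparison outside of a notebook. A list has to be scanned element by element until a match is found, whereas a set is backed by a hash table and can jump almost directly to where the item would be stored. The sketch below uses the standard library's timeit module instead of the %%timeit magic; the sizes and repeat counts are arbitrary choices of mine, and the exact timings will vary by machine:

```python
import timeit

numbers_list = list(range(100000))
numbers_set = set(numbers_list)

# look for the last element: the worst case for a list scan
# (lambda wraps each membership test in a small function that timeit can call repeatedly)
list_seconds = timeit.timeit(lambda: 99999 in numbers_list, number=100)
set_seconds = timeit.timeit(lambda: 99999 in numbers_set, number=100)

print('list: {:.6f} s, set: {:.6f} s'.format(list_seconds, set_seconds))
```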

Dictionaries

The final fundamental data structure we'll cover is the Python dictionary (aka "hash" in some other languages). A dictionary is a map of keys to values.


In [75]:
my_dict = {'name': 'Nick',
           'birthday': 'July 13',
           'years_in_durham': 4}

In [76]:
my_dict['name']


Out[76]:
'Nick'

In [77]:
my_dict['years_in_durham']


Out[77]:
4

In [78]:
my_dict['favorite_restaurant'] = 'Mateo'
my_dict['favorite_restaurant']


Out[78]:
'Mateo'

In [79]:
my_dict['age']    # hey, that's personal. Also, it's not a key in the dictionary.


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-79-5491ea872585> in <module>()
----> 1 my_dict['age']    # hey, that's personal. Also, it's not a key in the dictionary.

KeyError: 'age'

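In addition to accessing values by keys, you can retrieve the keys and values by themselves. In Python 3 these come back as list-like "view" objects, which you can loop over or pass to list() to get an actual list: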


In [80]:
my_dict.keys()


Out[80]:
dict_keys(['birthday', 'years_in_durham', 'name', 'favorite_restaurant'])

In [81]:
my_dict.values()


Out[81]:
dict_values(['July 13', 4, 'Nick', 'Mateo'])

Note that the order of the keys above doesn't match the order in which we inserted them (we inserted favorite_restaurant after years_in_durham!). In Python 3.5 and earlier, dictionaries make no promise about ordering. CPython 3.6 (released in December 2016) began preserving insertion order as an implementation detail, and Python 3.7 made it an official part of the language.
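A few more dictionary tools are worth knowing early on. The sketch below uses a trimmed-down, made-up dictionary called person; get, in, and items are all standard dict features:

```python
person = {'name': 'Nick', 'years_in_durham': 4}

# get returns a default instead of raising a KeyError for a missing key
age = person.get('age', 'unknown')
print(age)

# the in keyword tests whether a key is present
has_name = 'name' in person
print(has_name)

# items() yields (key, value) pairs, which is handy in a for loop
for key, value in person.items():
    print(key, '->', value)

# convert the keys to a real list if you need one
key_list = list(person.keys())
```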

Control structures

As data scientists, we're data-driven people, and we want our code to be data-driven, too. Control structures are a way of adding a logical flow to your programs, making them reactive to different conditions. These concepts are largely the same as in other programming languages, so I'll quickly introduce the syntax here for reference without much comment.

if-elif-else

Like most programming languages, Python provides a way of conditionally evaluating lines of code.


In [82]:
x = 3

if x < 2:
    print('x less than 2')
elif x < 4:
    print('x less than 4, greater than or equal to 2')
else:
    print('x greater than or equal to 4')


x less than 4, greater than or equal to 2

For loops

In Python, a for loop iterates over the contents of a container like a list. For example:


In [83]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)


a
b
c

To iterate a specific number of times, you can create an iterable sequence of integers with the range function:


In [84]:
for i in range(5):   # iterate over all integers (starting at 0) less than 5
    print(i)


0
1
2
3
4

In [85]:
for i in range(2, 6, 3):    # iterate over integers (starting at 2) less than 6, increasing by 3
    print(i)


2
5
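Two built-in helpers come up constantly alongside for loops: enumerate, when you want the position as well as the element, and zip, when you want to walk through two collections in parallel. A short sketch with made-up lists:

```python
letters = ['a', 'b', 'c']
numbers = [10, 20, 30]

indexed = []
for index, letter in enumerate(letters):      # yields (0, 'a'), (1, 'b'), (2, 'c')
    indexed.append((index, letter))

paired = []
for letter, number in zip(letters, numbers):  # walks both lists in parallel
    paired.append(letter + str(number))

print(indexed)
print(paired)
```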

While loops

Python also has the concept of while loops. For stylistic reasons, while loops are used somewhat less often than for loops. For example, compare the two following blocks of code:


In [86]:
my_list = ['a', 'b', 'c']
idx = 0
while idx < len(my_list):
    print(my_list[idx])
    idx += 1


a
b
c

In [87]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)


a
b
c

There are occasionally other reasons for using while loops (waiting for an external input, for example), but we won't make extensive use of them in this course.
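A while loop is the natural choice when you don't know ahead of time how many iterations you'll need. As a small sketch (the starting value and threshold are arbitrary), here's a loop that keeps halving a number until it drops to 1 or below:

```python
value = 100
steps = 0
while value > 1:        # keep going until the condition becomes false
    value = value / 2
    steps += 1

print(steps, value)
```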

Functions

Of course, as data scientists, one of our most important jobs is to manipulate data in a way that provides insight. In other words, we need ways of taking raw data, doing some things to it, and returning nice, clean, processed data back. This is the job of functions!

Built-in Python functions

It turns out that Python has a ton of functions built in already. When a task can be accomplished by a built-in function, it's almost always a good idea to use it. This is because many of the Python built-in functions are actually written in C, not Python, and C tends to be much faster for certain tasks.

https://docs.python.org/3.5/library/functions.html


In [88]:
my_list = range(1000000)

In [89]:
%%timeit
sum(my_list)


10 loops, best of 3: 20.8 ms per loop

In [90]:
%%timeit
my_sum = 0
for element in my_list:
    my_sum += element
my_sum


10 loops, best of 3: 65.2 ms per loop

Some common mathematical functions that are built into Python:

  • sum
  • divmod
  • round
  • abs
  • max
  • min

And some other convenience functions, some of which we've already seen:

  • int, float, str, set, list, dict: for converting between data types
  • len: for finding the number of elements in a data structure
  • type: for finding the type that an object belongs to
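Here's a quick tour of the mathematical built-ins listed above, using a made-up list of measurements:

```python
measurements = [3, -7, 12, 5]

total = sum(measurements)                  # add up all the elements
largest = max(measurements)                # biggest element
smallest = min(measurements)               # smallest element
magnitude = abs(smallest)                  # absolute value
quotient, remainder = divmod(total, 4)     # whole-number quotient and remainder together
rounded = round(3.14159, 2)                # round to 2 decimal places

print(total, largest, smallest, magnitude)
print(quotient, remainder, rounded)
```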

Custom functions

Of course, there are plenty of times we want to do something that isn't provided by a built-in. In that case, we can define our own functions. The syntax is quite simple:


In [91]:
def double_it(x):
    return x * 2

In [92]:
double_it(5)


Out[92]:
10

Python has dynamic typing, which (in part) means that the arguments to functions aren't assigned a specific type:


In [93]:
double_it('hello')   # remember 'hello' * 2 from before?


Out[93]:
'hellohello'

In [94]:
double_it({'a', 'b'})    # but there's no notion of multiplication for sets


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-94-4d7ca36e3636> in <module>()
----> 1 double_it({'a', 'b'})    # but there's no notion of multiplication for sets

<ipython-input-91-fb8ccbb7bf08> in double_it(x)
      1 def double_it(x):
----> 2     return x * 2

TypeError: unsupported operand type(s) for *: 'set' and 'int'

Required arguments vs optional arguments

When defining a function, you can add defaults to arguments that you want to be optional. When defining and providing arguments, required arguments always go first, and the order they're provided in matters. Optional arguments follow, and can be passed by their keyword in any order.


In [95]:
def multiply_them(x, y, extra_arg1=None, extra_arg2=None):
    if extra_arg1 is not None:
        print(extra_arg1)
    if extra_arg2 is not None:
        print(extra_arg2)
    
    print('multiplying {} and {}...'.format(x, y))
    return x * y

In [96]:
multiply_them(3, 5)


multiplying 3 and 5...
Out[96]:
15

In [97]:
multiply_them(3, 5, extra_arg1='hello')


hello
multiplying 3 and 5...
Out[97]:
15

In [98]:
multiply_them(3, 5, extra_arg2='world', extra_arg1='hello')


hello
world
multiplying 3 and 5...
Out[98]:
15

In [99]:
multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5)


  File "<ipython-input-99-89490f4161a8>", line 1
    multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5)
                                                         ^
SyntaxError: positional argument follows keyword argument
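One more detail that follows from the same rules: required arguments can also be passed by keyword, as long as no positional argument comes after a keyword argument. A minimal sketch with a made-up function called scale:

```python
def scale(value, factor=2):
    # factor is optional; it falls back to 2 when the caller leaves it out
    return value * factor

default_result = scale(5)                    # uses the default factor=2
custom_result = scale(5, factor=10)          # overrides the default by keyword
keyword_result = scale(factor=3, value=4)    # required arguments can be named too

print(default_result, custom_result, keyword_result)
```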

Modules

Knowing how to create your own functions can be a rabbit hole - once you know that you can make Python do whatever you want it to do, it can be easy to go overboard. Good data scientists are efficient data scientists - you shouldn't reinvent the wheel by reimplementing a bunch of functionality that someone else worked hard on. Doing anything nontrivial can take a ton of time, and without spending even more time to write tests, squash bugs, and address corner cases, your code can easily end up being much less reliable than code that someone else has spent time perfecting.

Python has a very robust standard library of modules that comes with every Python installation. For even more specialized work, the Python community has also open-sourced tens of thousands of packages, any of which is a simple pip install away.

The standard library

The Python standard library is a collection of packages that ships with Python itself. In other words, it contains a bunch of code that you can import into code you're writing, but that you don't have to download separately after downloading Python.

Here are a few examples -


In [100]:
import random    # create (pseudo-) random numbers
random.random()


Out[100]:
0.9662408468878749

In [101]:
import math    # common mathematical functions that aren't built into base Python
print(math.factorial(5))


120

In [102]:
math.log10(100)


Out[102]:
2.0

In [103]:
import statistics    # some basic summary statistics
my_list = [1, 2, 3, 4, 5]
statistics.mean(my_list)


Out[103]:
3

In [104]:
statistics.median(my_list)


Out[104]:
3

In [105]:
statistics.stdev(my_list)


Out[105]:
1.5811388300841898

There are dozens of packages in the standard library, so if you find yourself writing a function for something lots of other people might want to do, it's definitely worth checking whether that function is already implemented in the Python standard library.
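For instance, counting how often each item appears in a list is a very common chore, and the standard library's collections module already has a Counter class for it. A short sketch with a made-up list:

```python
from collections import Counter

letters = ['a', 'b', 'a', 'c', 'a', 'b']
counts = Counter(letters)        # tallies how many times each element appears

print(counts['a'])
print(counts.most_common(1))     # the single most frequent element and its count
```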

We'll use a handful of packages from the standard library in this course, which I'll introduce as they appear.

Third party libraries and the Python Package Index

Nonetheless, the standard library can't contain functionality that covers everything people use Python for. For more specialized packages, the Python Software Foundation runs the Python Package Index (PyPI, pronounced pie-pee-eye). PyPI is a package server that is free to upload and download from - anyone can create a package and upload it to PyPI, and anyone can download any package from PyPI at any time.

To download and install a package from PyPI, you typically use a program called pip (pip installs packages) by running the command pip install <package name> from the command line.

Wrapping up

This notebook is fairly information-dense, especially if you haven't used Python before. Keep it close by for reference as the course goes along! Thankfully, Python syntax is fairly friendly toward beginners, so picking up the basics usually doesn't take too long. I hope you'll find as the course goes along that the Python syntax starts to feel more natural. Don't get discouraged; know when to ask for help, and look online for resources. And remember - the Python ecosystem is deep, and it can take years to master!

Other resources

