Welcome!

This is a 4 week (8 hour) course that will introduce you to the basics of handling, manipulating, exploring, and modeling data with Python.

About me

I'm a data scientist at Automated Insights. Previously, I was a PhD student in Physics at Duke, doing research in machine learning and complex systems. I like running, cooking, and curling.

About you

  • Name
  • Current job/school
  • Favorite non-technical hobby/activity

About this class

The goal of this class is to introduce you to some concepts that form the foundations of modern data science, and to put those concepts to use with the Python data science ecosystem. I'm expecting that you know the basics of programming, but not necessarily that you've programmed in Python before. In other words, I'm going to introduce how to write a for loop in Python, but I won't explain what a for loop is.

This class is going to:

  • focus on teaching important high-level concepts.
  • give an introduction to the common tools for doing data science in Python.
  • be a bit of a whirlwind.

This class is not going to:

  • be a comprehensive introduction to data science.
  • make you an expert in data science.
  • teach you much about machine learning.

This class is meant to be interactive. Instead of me lecturing for the full 8 hours, we'll alternate between walking through materials together and working in small groups on prompts that will solidify concepts. At the end of each week, there will be a few take-home prompts for you to work on before the next class.

The syllabus

  • Week 1: Intro to Python and Jupyter
  • Week 2: What is data? + doing fast operations with NumPy
  • Week 3: Exploratory data analysis
  • Week 4: Building models (maybe)

The Jupyter notebook

The environment you're in right now is called a Jupyter notebook. Project Jupyter is an interactive environment that data scientists use for collaboration and communication. Each cell in the notebook can either contain text or code (often Python, but R, Julia, and lots of other languages are supported). This allows you to seamlessly weave explanations and plots into your code.

The Jupyter front-end is called the notebook or dashboard. This is the part that you interact with directly. The back-end, where your code is actually run, is called the kernel. In particular, this notebook uses a kernel for executing Python code, but kernels for many other languages also exist. Since Jupyter is an open-source project, anyone with the time and dedication can make a kernel for executing code in their favorite language.

Each cell in a notebook can be executed independently, but declarations persist across cells. For example, I can define a variable in one cell...


In [ ]:
my_variable = 10

... and then access that variable in a later cell:


In [ ]:
print(my_variable)

Jupyter has two fundamental modes: command mode and edit mode. In edit mode, we can make changes to the content of specific cells. When you're in edit mode, the cell you're currently working in will be surrounded by a green box. Press Enter or double click on a cell to enter edit mode.

Command mode is used to switch between cells, or to make changes to the notebook structure. For example, if you want to add a new cell to your notebook, you do this from command mode. Press Esc to leave edit mode and enter command mode.

As I mentioned above, there are two fundamental types of cells in a notebook - text (i.e. markdown) and code. When you click on a code cell, you should see a cursor appear in the cell that allows you to edit the code in that cell. A cell can have multiple lines - to begin a new line, press Enter. When you want to run the cell's code, press Shift+Enter.

Try changing the values of the numbers that are added together in the cell below, and observe how the output changes:


In [ ]:
a = 11
b = 19
print(a + b)

You can also edit the text in markdown cells. To display the editable, raw markdown in a text cell, double click on the cell. You can now put your cursor in the cell and edit it directly. When you're done editing, press Shift+Enter to render the cell into a more readable format.

Try editing the text cell below with your name:

Make some edits here -> Hello, my name is Nick Haynes!

To change whether a cell contains text or code, use the drop-down in the toolbar. When you're in a code cell, the drop-down reads "Code", and when you're in a text cell, it reads "Markdown".

Keyboard shortcuts

Good programmers are efficient programmers! Jupyter has a large number of idiomatic keyboard shortcuts that are helpful to know. A few of my favorites are:

Command mode

  • a, b: insert a cell above or below the current one, respectively.
  • Esc: exit cell editor mode
  • dd: delete the current cell
  • m: change cell type to markdown
  • y: change cell type to code

Edit mode

  • Tab: code completion
  • Shift+Tab: documentation tool tip

There's a full list of Jupyter's keyboard shortcuts here.


In [ ]:

Your turn

Without using your mouse:

  • Enter edit mode in this cell and enter your name here:
  • Insert a cell below this one and print a string using the Python interpreter
  • Change that cell to markdown and render it. What changes?
  • Delete that cell.

In [ ]:
%lsmagic

Magic commands

Jupyter gives you access to so-called "magic" commands that aren't part of official Python syntax, but can make your life a lot easier. All magic commands are preceded with a % (a single % for single-line expressions, double %% for multi-line expressions). For example, many of the common bash commands are built in:


In [ ]:
%ls    # list the files and folders in the current directory

In [ ]:
%cd images

In [ ]:
%ls

Another very helpful magic command we'll use quite a bit is the %timeit command:


In [ ]:
%cd ..

In [ ]:
%%timeit
my_sum = 0
for i in range(100000):
    my_sum += i
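Magic commands only work inside Jupyter/IPython. As a rough sketch of how you might time the same loop in plain Python, the standard library's timeit module does a similar job (the function name sum_with_loop and the choice of 10 runs are just for illustration):

```python
import timeit


def sum_with_loop(n=100000):
    """Add up the integers 0..n-1 with an explicit for loop."""
    my_sum = 0
    for i in range(n):
        my_sum += i
    return my_sum


# Call the function 10 times and report the average time per call.
num_runs = 10
total_seconds = timeit.timeit(sum_with_loop, number=num_runs)
print('{:.6f} seconds per run'.format(total_seconds / num_runs))
```

The exact numbers will vary from machine to machine, so treat them as relative measurements rather than absolute ones.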

What is Python?

This is actually a surprisingly tricky question! There are (at least) two answers:

  • A language specification
  • A program on your computer that interprets and executes code written to that language specification

Python (the language)

Python is an open source programming language that is extremely popular in the data science and web development communities. The roots of its current popularity in data science and scientific computing have an interesting history, but suffice to say that it's darn near impossible to be a practicing data scientist these days without at least being familiar with Python.

The guiding principles behind the design of the Python language specification are described in "The Zen of Python", which you can find here or by executing:


In [ ]:
import this

Python syntax should be easy to write, but most importantly, well-written Python code should be easy to read. Code that follows these norms is called Pythonic. We'll touch a bit more on what it means to write Pythonic code in class.

What's the deal with whitespace?

A distinctive feature of Python is that whitespace matters, because it defines scope. Many other programming languages use braces or begin/end keywords to define scope. For example, in JavaScript, you write a for loop like this:

var count;
for (count = 0; count < 10; count++) {
    console.log(count);
    console.log("<br />");
}

The curly braces here define the code executed in each iteration of the for loop. Similarly, in Ruby you write a for loop like this:

for count in 0..9
   puts "#{count}"
end

In this snippet, the code executed in each iteration of the for loop is whatever comes between the first line and the end keyword.

In Python, for loops look a bit different:


In [ ]:
print('Entering the for loop:\n')
a = 0
for count in range(10):
    print(count)
    a += count
    print('Still in the for loop.')

print("\nNow I'm done with the for loop.")
print(a)

Note that there is no explicit symbol or keyword that defines the scope of code executed during each iteration - it's the indentation that defines the scope of the loop. When you define a function or class, or write a control structure like a for loop or if statement, you should indent the next line (4 spaces is customary). Each subsequent line at that same level of indentation is considered part of the scope. You only escape the scope when you return to the previous level of indentation.
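Here's a small sketch of how indentation levels nest when one control structure sits inside another (the variable names are just for illustration). Each extra level of indentation opens a new block, and returning to an earlier level closes it:

```python
evens = []
odds = []
for number in range(6):
    # indented once: part of the for loop
    if number % 2 == 0:
        # indented twice: part of the if block
        evens.append(number)
    else:
        # indented twice: part of the else block
        odds.append(number)
    # back to one level: still inside the loop, but outside the if/else

# back to zero levels: the loop is over
print(evens)
print(odds)
```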

Python (the interpreter)

If you open up the terminal on your computer and type python, it runs a program that looks something like this:

This is a program called CPython (written in C, hence the name) that parses, interprets, and executes code written to the Python language standard. CPython is known as the "reference implementation" of Python - it is an open source project (you can download and build the source code yourself if you're feeling adventurous) run by the Python Software Foundation and led by Guido van Rossum, the original creator and "Benevolent Dictator for Life" of Python.

When you simply type python into the command line, CPython brings up a REPL (Read-Eval-Print Loop, pronounced "repple"), which is essentially an infinite loop that takes lines as you write them, interprets and executes the code, and prints the result.

For example, try typing

>>> x = "Hello world"
>>> print(x)

in the REPL. After you hit Enter on the first line, the interpreter assigns the value "Hello world" to a string variable x. After you hit Enter on the second line, it prints the value of x.

We can accomplish the same result by typing the same code

x = "Hello world"
print(x)

into a file called test.py and running python test.py from the command line. The only difference is that when you provide the argument test.py to the python command, the REPL doesn't appear. Instead, the CPython interpreter interprets the contents of test.py line-by-line until it reaches the end of the file, then exits. We won't use the REPL much in this course, but it's good to be aware that it exists. In fact, behind the pretty front end, this Jupyter notebook is essentially just wrapping the CPython interpreter, executing commands line by line as we enter them.

So to review, "Python" sometimes refers to a language specification and sometimes refers to an interpreter that's installed on your computer. We will use the two definitions interchangeably in this course; hopefully, it should be obvious from context which definition we're referring to.

Idiomatic Python

Above, we talked about the concept of Pythonic code, which emphasizes an explicit, readable coding style. In practice, there are also a number of conventions codified in a document called PEP 8 (PEP = Python Enhancement Proposal, a community suggestion for possible additions to the Python language). These conventions make Python code written by millions of developers easier to read and comprehend, so sticking to them as closely as is practical is a very good idea.

A few useful conventions that we'll see in this class are:

  • Indentation is 4 spaces. Python 3 does not allow mixing tabs and spaces for indentation.
  • Single and double quotes (' and ") are interchangeable, but neither is preferred. Instead, pick a single style and stick to it.
  • Functions and variables are named with snake_case (all lowercase letters, words separated by underscores).
  • Python doesn't have a sense of strict constants, but variables intended to be used as constants should be named like UPPERCASE_VARIABLE (all uppercase letters, words separated by underscores).

I'll introduce other conventions as they arise.
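To make these conventions concrete, here's a short sketch that follows them. The names SECONDS_PER_MINUTE and minutes_to_seconds are made up for this example:

```python
SECONDS_PER_MINUTE = 60    # intended as a constant: all uppercase with underscores


def minutes_to_seconds(num_minutes):    # function and argument names: snake_case
    """Convert a number of minutes to seconds."""
    return num_minutes * SECONDS_PER_MINUTE    # 4 spaces of indentation


total_seconds = minutes_to_seconds(5)    # variable name: snake_case
print('Total seconds: {}'.format(total_seconds))    # single quotes used throughout
```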

Python 2 vs Python 3

As you may have heard, there's a bit of a rift in the Python community between Python 2 and Python 3.

Python 3.0 was released in 2008, introducing a few new features that were not backwards compatible with Python 2.X. Since then, the core Python developers have released several new versions of 3.X (3.6 is the most recent, released in December 2016), and they have announced that Python 2.X will no longer be officially supported after 2020. We'll be using Python 3.5 for this class:

  • The differences between Python 2 and Python 3 are relatively small. See the official FAQ here, another good explainer here.
  • Python 3.X is under active development, while Python 2.X has been deprecated and will not be supported after 2020.
  • The way that Python 3 handles Unicode strings (which we'll talk about next week) is much easier to use than in Python 2.
  • As of 2017, the vast majority of major libraries support both 2 and 3, and a number of major Python projects have pledged to drop support for 2.X by 2020.

Long story short - I firmly believe that 3.X is the clear choice for anyone who isn't supporting a legacy project.

Variables, Objects, Operators, and Naming

One fundamental idea in Python is that everything is an object. This is different from some other languages like C and Java, which have fundamental, primitive data types like int and char. This means that things like integers and strings have attributes and methods that you can access. For example, if you want to read some documentation about an object my_thing, you can access its __doc__ attribute like this:


In [ ]:
thing_1 = 47    # define an int object
print(thing_1.__doc__)

In [ ]:
thing_1 = 'blah'    # reassign thing_1 to a string object
print(thing_1.__doc__)

In [ ]:
print(thing_1)

To learn more about what attributes and methods a given object has, you can call dir(my_object):


In [ ]:
dir(thing_1)

That's interesting - it looks like the string object has a method called __add__. Let's see what it does -


In [ ]:
thing_2 = 'abcd'
thing_3 = thing_1.__add__(thing_2)
print(thing_3)

So calling __add__ with two strings creates a new string that is the concatenation of the two originals. As an aside, there are a lot more methods we can call on strings - split, upper, find, etc. We'll come back to this.

The + operator in Python is just syntactic sugar for the __add__ method:


In [ ]:
thing_4 = thing_1 + thing_2
print(thing_4)
print(thing_3 == thing_4)

Any object you can add to another object in Python has an __add__ method. With integer addition, this works exactly as we would expect:


In [ ]:
thing_1 = '1'
thing_2 = '2'
int(thing_1) + int(thing_2)

In [ ]:
int_1 = 11
int_2 = 22
sum_1 = int_1.__add__(int_2)
sum_2 = int_1 + int_2
print(sum_1)
print(sum_2)
print(sum_1 == sum_2)

But it's unclear what to do when someone tries to add an int to a str:


In [ ]:
thing_1 + int_1
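Running that cell raises a TypeError. As a sketch of one way to handle this, you can convert one of the operands yourself so that Python knows which kind of addition you mean. The try/except syntax used here to catch the error isn't something we've covered yet, and the variable names are just for illustration:

```python
label = '1'
count = 11

try:
    label + count    # str + int is ambiguous, so Python refuses to guess
except TypeError as error:
    print('TypeError:', error)

# Be explicit about which kind of addition you want instead.
as_text = label + str(count)      # string concatenation
as_number = int(label) + count    # integer addition
print(as_text)
print(as_number)
```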

Data types

There are a few native Python data types, each of which we'll use quite a bit. The properties of these types work largely the same way as they do in other languages. If you're ever confused about what type a variable my_var is, you can always call type(my_var).

Booleans

Just like in other languages, bools take values of either True or False. All of the traditional Boolean operations are present:


In [ ]:
bool_1 = True
type(bool_1)

In [ ]:
dir(bool_1)

In [ ]:
bool_2 = False

In [ ]:
bool_1 == bool_2

In [ ]:
type(bool_1 + bool_2)

In [ ]:
type(bool_1 and bool_2)

In [ ]:
bool_1 * bool_2
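The cells above mostly show what happens when bools are used in arithmetic. As a quick sketch of the logical operators themselves, Python spells them out as the keywords and, or, and not:

```python
bool_1 = True
bool_2 = False

print(bool_1 and bool_2)    # True only if both are True
print(bool_1 or bool_2)     # True if at least one is True
print(not bool_1)           # flips True to False and vice versa

# bool is a subclass of int, so True and False act like 1 and 0 in arithmetic.
print(isinstance(bool_1, int))
print(bool_1 + bool_1)
```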

Integers

Python ints are whole (positive, negative, or 0) numbers implemented as long objects of arbitrary size. Again, all of the standard operations are present:


In [ ]:
int_1 = 2
type(int_1)

In [ ]:
dir(int_1)

In [ ]:
int_2 = 3
print(int_1 - int_2)

In [ ]:
int_1.__pow__(int_2)

In [ ]:
int_1 ** int_2

One change from Python 2 to Python 3 is the default way that integers are divided. In Python 2, the result of 2/3 is 0, the result of 4/3 is 1, etc. In other words, dividing integers in Python 2 always returned an integer, with the result rounded down. In Python 3, dividing two integers with / always returns a float, so the fractional part is kept. For example:


In [ ]:
int_1 / int_2

In [ ]:
type(int_1 / int_2)

In [ ]:
int_1.__truediv__(int_2)

In [ ]:
int_1.__divmod__(int_2)

In [ ]:
int_1 % int_2
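If you do want integer division in Python 3, the // operator provides it. Here's a small sketch comparing the division-related operators side by side:

```python
int_1 = 2
int_2 = 3

print(int_1 / int_2)     # true division: always a float
print(int_1 // int_2)    # floor division: rounds down to an int
print(-7 // 2)           # rounds toward negative infinity, not toward zero
print(7 % 2)             # remainder after floor division

quotient, remainder = divmod(7, 2)    # both results at once
print(quotient, remainder)
```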

Floats

Python floats are also consistent with other languages:


In [ ]:
float_1 = 23.46
type(float_1)

In [ ]:
dir(float_1)

In [ ]:
float_2 = 3.0
type(float_2)

In [ ]:
float_1 / float_2

With ints and floats, we can also do comparison operators like in other languages:


In [ ]:
int_1 < int_2

In [ ]:
float_1 >= int_2

In [ ]:
float_1 == float_2

In [ ]:
int_1 = 1
float_1 = 1.0
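The last cell above defines an int and a float with the same value. As a short sketch of how the two compare, and of one common surprise with floats (the use of math.isclose is my suggestion, not something we rely on elsewhere in these notes):

```python
import math

int_1 = 1
float_1 = 1.0

print(int_1 == float_1)                # values are compared numerically
print(type(int_1) == type(float_1))    # but the types are different

# Floats are binary approximations, so exact equality can surprise you.
print(0.1 + 0.2 == 0.3)
print(math.isclose(0.1 + 0.2, 0.3))    # compare with a tolerance instead
```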

Strings


In [ ]:
str_1 = 'hello'
type(str_1)

In [ ]:
dir(str_1)

We already saw that the + operator concatenates two strings. Generalizing from this, what do you expect the * operator to do?


In [ ]:
a = 'Hi'
print(a*5)

There are a number of very useful methods built into Python str objects. A few that you might find yourself needing to use when dealing with text data include:


In [ ]:
# count the number of occurrences of a sub-string
"Hi there I'm Nick".count('i')

In [ ]:
# Find the index of the first occurrence of a sub-string (optionally starting the search at a given index)
"Hi there I'm Nick".find('i')

In [ ]:
"Hi there I'm Nick".find('i', 2)

In [ ]:
# Insert variables into a string
digit = 7
'The digit "7" should appear at the end of this sentence: {}.'.format(digit)

In [ ]:
another_digit = 15
'This sentence will have two digits at the end: {} and {}.'.format(digit, another_digit)

In [ ]:
# Replace a sub-string with another sub-string
my_sentence = "Hi there I'm Nick"
my_sentence.replace('e', 'E')

In [ ]:
my_sentence.replace('N', '')

There are plenty more useful string functions - use either the dir() function or Google to learn more about what's available.
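As a quick sketch, here are a few more of the str methods mentioned earlier (split and upper), plus a couple of others that I find come up often with text data:

```python
my_sentence = "Hi there I'm Nick"

words = my_sentence.split(' ')    # break the string into a list at each space
print(words)
print(my_sentence.upper())        # all uppercase
print(my_sentence.lower())        # all lowercase
print(my_sentence.startswith('Hi'))

rejoined = '-'.join(words)        # glue a list of strings back together
print(rejoined)
```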

None

Python also has a special way of representing missing values, called None. This value behaves similarly to NULL or nil in other languages like SQL, JavaScript, and Ruby.


In [ ]:
missing_val = None
type(missing_val)

In [ ]:
missing_val is None

In [ ]:
print(missing_val and True)

In [ ]:
missing_val + 1

None is helpful for passing optional values in function arguments, or to make it explicitly clear that you're not passing data that has any value. None is a different concept than NaN, which we'll see next week.
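Here's a minimal sketch of None used as the default for an optional function argument. The function describe_temperature is made up for this example, and function definitions are covered in more detail later in these notes:

```python
def describe_temperature(degrees, units=None):
    """Format a temperature, using None to mean 'no units were given'."""
    if units is None:    # use `is`, not ==, when checking for None
        return '{} degrees'.format(degrees)
    return '{} degrees {}'.format(degrees, units)


print(describe_temperature(72))
print(describe_temperature(22, units='Celsius'))
```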

So, to sum it up - basic data types like bool, int, float, and str are all objects in Python. The methods in each of these object classes define what operations can be done on them and how those operations are performed. For the sake of readability, however, many of the common operations like + and < are provided as syntactic sugar.

An aside: What's the deal with those underscores???

When we were looking at the methods in the various data type classes above, we saw a bunch of methods like __add__ and __pow__ with double leading underscores and double trailing underscores (sometimes shortened to "dunders"). As it turns out, underscores are a bit of a thing in Python. Idiomatic use dictates a few important uses of underscores in variable and function names:

  • Underscores are used to separate words in names. That is, idiomatic Python uses snake_case (my_variable) rather than camelCase (myVariable).
  • A single leading underscore (_my_function or _my_variable) denotes a function or variable that is not meant for end users to access directly. Python doesn't have a sense of strong encapsulation, i.e. there are no strictly "private" methods or variables like in Java, but a leading underscore is a way of "weakly" signaling that the entity is for private use only.
  • A single trailing underscore (type_) is used to avoid conflict with Python built-in functions or keywords. In my opinion, this is often poor style. Try to come up with a more descriptive name instead.
  • Double leading underscore and double trailing underscore (__init__, __add__) correspond to special variables or methods that correspond to some sort of "magic" syntax. As we saw above, the __add__ method of an object describes what the result of some_object + another_object is.

For lots more detail on the use of underscores in Python, check out this post.
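To see the "magic" syntax in action, here's a toy sketch of a class that defines its own __add__ method. We haven't covered classes, so don't worry about the details. The class name Money and its dollars attribute are invented for this example:

```python
class Money:
    """A toy class, used only to illustrate the __add__ method."""

    def __init__(self, dollars):
        # Called when a new Money object is created.
        self.dollars = dollars

    def __add__(self, other):
        # Called whenever Python evaluates `self + other`.
        return Money(self.dollars + other.dollars)


total = Money(5) + Money(7)    # Python translates this to Money(5).__add__(Money(7))
print(total.dollars)
```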

Collections of objects

Single variables can only take us so far. Eventually, we're going to want ways of storing many individual variables in a single, structured format.

Lists

The list is one of the most commonly used Python data structures. A list is an ordered collection of (potentially heterogeneous) objects. Similar structures that exist in other languages are often called arrays.


In [ ]:
my_list = ['a', 'b', 'c', 'a']

In [ ]:
len(my_list)

In [ ]:
my_list.append(1)
print(my_list)

To access individual list elements by their position, use square brackets:


In [ ]:
my_list[0]    # indexing in Python starts at 0!

In [ ]:
my_list[4]

In [ ]:
my_list[-1]    # negative indexes count backward from the end of the list

Lists can hold arbitrary objects!


In [ ]:
type(my_list[0])

In [ ]:
type(my_list[-1])

In [ ]:
# let's do something crazy
my_list.append(my_list)
type(my_list[-1])

In [ ]:
my_list

In [ ]:
my_list[-1]

In [ ]:
my_list[-1][-1]

In [ ]:
my_list[-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1]

Lists are also mutable objects, meaning that any part of them can be changed at any time. This makes them very flexible objects for storing data in a program.


In [ ]:
my_list = ['a', 'b', 1]

In [ ]:
my_list[0] = 'c'
my_list

In [ ]:
my_list.remove(1)
my_list

Tuples

A tuple in Python is very similar to a list, except that tuples are immutable. This means that once they're defined, they can't be changed. Otherwise, they act very much like lists.


In [ ]:
my_tuple = ('a', 'b', 1, 'a')
print(my_tuple)

In [ ]:
my_tuple

In [ ]:
my_tuple[2]

In [ ]:
my_tuple[0] = 'c'

In [ ]:
my_tuple.append('c')

In [ ]:
my_tuple.remove(1)

In [ ]:
list_1 = [1, 2, 3]
list_2 = [4, 5, 6]
list_1 + list_2

Sets

A set in Python acts somewhat like a list that contains only unique objects.


In [ ]:
my_set = {'a', 'b', 1, 'a'}
print(my_set)    # note that the order may differ from the order above, and the duplicate 'a' is gone

In [ ]:
my_set.add('c')
print(my_set)

Note above that the order of items in a set doesn't have the same meaning as in lists and tuples.


In [ ]:
my_set[0]

Sets are used for a couple of reasons. Sometimes, finding the number of unique items in a list or tuple is important. In this case, we can convert the list/tuple to a set, then call len on the new set. For example,


In [ ]:
my_list = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
my_list

In [ ]:
my_set = set(my_list)
len(my_set)

The other reason is that the in keyword for testing a collection for membership of an object is much faster for a set than a list.


In [ ]:
my_list = list(range(1000000))    # list of numbers 0 - 999,999
my_set = set(my_list)

In [ ]:
%%timeit
999999 in my_list

In [ ]:
%%timeit
999999 in my_set

Any idea why there's such a discrepancy?

Dictionaries

The final fundamental data structure we'll cover is the Python dictionary (aka "hash" in some other languages). A dictionary is a map of keys to values.


In [ ]:
my_dict = {'name': 'Nick',
           'birthday': 'July 13',
           'years_in_durham': 4}
my_dict

In [ ]:
my_dict['name']

In [ ]:
my_dict['years_in_durham']

In [ ]:
my_dict['favorite_restaurants'] = ['Mateo', 'Piedmont']
my_dict['favorite_restaurants']

In [ ]:
my_dict['age']    # hey, that's personal. Also, it's not a key in the dictionary.

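In addition to accessing values by keys, you can retrieve the keys and values by themselves. In Python 3, these come back as view objects that behave much like lists (wrap them in list() if you need an actual list):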


In [ ]:
my_dict.keys()

In [ ]:
my_dict.values()

Note that if you're using Python 3.5 or earlier, the order that you insert key/value pairs into the dictionary doesn't correspond to the order they're stored in by default (we inserted favorite_restaurants after years_in_durham!). This default behavior was just recently changed in CPython 3.6 (released in December 2016), and it became a guaranteed part of the language in Python 3.7.
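We saw above that asking for a missing key raises a KeyError. As a sketch of a few ways to work with keys more safely, you can test for membership with in, use the get method, or loop over key/value pairs with items:

```python
my_dict = {'name': 'Nick', 'birthday': 'July 13', 'years_in_durham': 4}

print('age' in my_dict)                 # test for a key without raising KeyError
print(my_dict.get('age'))               # get returns None for a missing key...
print(my_dict.get('age', 'unknown'))    # ...or a default value that you supply

sorted_keys = sorted(my_dict.keys())    # an actual list of keys, in sorted order
print(sorted_keys)

for key, value in my_dict.items():      # loop over key/value pairs
    print(key, value)
```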

Control structures

As data scientists, we're data-driven people, and we want our code to be data-driven, too. Control structures are a way of adding a logical flow to your programs, making them reactive to different conditions. These concepts are largely the same as in other programming languages, so I'll quickly introduce the syntax here for reference without much comment.

if-elif-else

Like most programming languages, Python provides a way of conditionally evaluating lines of code.


In [ ]:
x = 3

if x < 2:
    print('x less than 2')
elif x < 4:
    print('x less than 4, greater than or equal to 2')
else:
    print('x greater than or equal to 4')

For loops

In Python, a for loop iterates over the contents of a container like a list. For example:


In [ ]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)

To iterate for a specific number of times, you can loop over the object returned by the range function:


In [ ]:
for i in range(5):   # iterate over all integers (starting at 0) less than 5
    print(i)

In [ ]:
for i in range(2, 6, 3):    # iterate over integers (starting at 2) less than 6, increasing by 3
    print(i)
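To see exactly which numbers a range will produce, you can convert it to a list. This sketch also shows the built-in enumerate function, which I find handy when you need both the position and the value of each element:

```python
first_five = list(range(5))
print(first_five)

stepped = list(range(2, 6, 3))
print(stepped)

countdown = list(range(10, 0, -2))    # a negative step counts down
print(countdown)

my_list = ['a', 'b', 'c']
for index, element in enumerate(my_list):    # yields (position, value) pairs
    print(index, element)
```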

While loops

Python also has the concept of while loops. For stylistic reasons, while loops are used somewhat less often than for loops. For example, compare the two following blocks of code:


In [ ]:
my_list = ['a', 'b', 'c']
idx = 0
while idx < len(my_list):
    print(my_list[idx])
    idx += 1

In [ ]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)

There are occasionally other reasons for using while loops (waiting for an external input, for example), but we won't make extensive use of them in this course.
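As a sketch of a case where a while loop is the more natural fit, consider a loop where you don't know ahead of time how many iterations you'll need. This made-up example keeps halving a number until it drops below 1:

```python
value = 100
num_halvings = 0

while value >= 1:    # keep going until the condition is no longer true
    value = value / 2
    num_halvings += 1

print(num_halvings)
print(value)
```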

Your turn

  • Using the following dictionary:
    my_dict = {
      'a': 3, 
      'b': 2, 
      'c': 10, 
      'd': 7, 
      'e': 9, 
      'f' : 12, 
      'g' : 13
      }
    Print out:
    • the keys of all values that are even.
    • the key with the maximum value.
    • the sum of all the values.

In [ ]:
my_dict = {
  'a': 3, 
  'b': 2, 
  'c': 10, 
  'd': 7, 
  'e': 9, 
  'f' : 12, 
  'g' : 13
  }

for key, val in my_dict.items():
    print(key, val)

  • If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1000.

In [ ]:
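Here's one possible sketch of solutions to the two prompts above, using only the control structures we've seen so far. There are plenty of other ways to do it, and the variable names are my own:

```python
my_dict = {'a': 3, 'b': 2, 'c': 10, 'd': 7, 'e': 9, 'f': 12, 'g': 13}

even_keys = []    # keys whose values are even
max_key = None    # key with the largest value seen so far
total = 0         # running sum of all values

for key, val in my_dict.items():
    if val % 2 == 0:
        even_keys.append(key)
    if max_key is None or val > my_dict[max_key]:
        max_key = key
    total += val

print(even_keys)
print(max_key)
print(total)

# Sum of all the multiples of 3 or 5 below 1000.
multiples_sum = 0
for n in range(1000):
    if n % 3 == 0 or n % 5 == 0:
        multiples_sum += n

print(multiples_sum)
```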

Functions

Of course, as data scientists, one of our most important jobs is to manipulate data in a way that provides insight. In other words, we need ways of taking raw data, doing some things to it, and returning nice, clean, processed data back. This is the job of functions!

Built-in Python functions

It turns out that Python has a ton of functions built in already. When we have a task that can be accomplished by a built-in function, it's almost always a good idea to use it. This is because many of the Python built-in functions are actually written in C, not Python, and C tends to be much faster for certain tasks.

https://docs.python.org/3.5/library/functions.html


In [1]:
my_list = list(range(1000000))

In [2]:
%%timeit
sum(my_list)


100 loops, best of 3: 8.71 ms per loop

In [3]:
%%timeit
my_sum = 0
for element in my_list:
    my_sum += element
my_sum


10 loops, best of 3: 58 ms per loop

Some common mathematical functions that are built into Python:

  • sum
  • divmod
  • round
  • abs
  • max
  • min

And some other convenience functions, some of which we've already seen:

  • int, float, str, set, list, dict: for converting between data structures
  • len: for finding the number of elements in a data structure
  • type: for finding the type that an object belongs to
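As a quick sketch, here's what several of the built-ins listed above look like in use (the list of numbers is arbitrary):

```python
numbers = [3, -7, 2, 10]

print(sum(numbers))
print(max(numbers), min(numbers))
print(abs(-7))
print(round(3.14159, 2))    # round to 2 decimal places
print(divmod(17, 5))        # (quotient, remainder)
print(len(numbers))
print(type(numbers))

as_int = int('42')          # convert a str to an int
print(as_int + 1)
```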

Custom functions

Of course, there are plenty of times we want to do something that isn't provided by a built-in. In that case, we can define our own functions. The syntax is quite simple:


In [4]:
def double_it(x):
    return x * 2

In [5]:
double_it(5)


Out[5]:
10

Python has dynamic typing, which (in part) means that the arguments to functions aren't assigned a specific type:


In [6]:
double_it('hello')   # remember 'hello' * 2 from before?


Out[6]:
'hellohello'

In [7]:
double_it({'a', 'b'})    # but there's no notion of multiplication for sets


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-4d7ca36e3636> in <module>()
----> 1 double_it({'a', 'b'})    # but there's no notion of multiplication for sets

<ipython-input-4-fb8ccbb7bf08> in double_it(x)
      1 def double_it(x):
----> 2     return x * 2

TypeError: unsupported operand type(s) for *: 'set' and 'int'

Required arguments vs optional arguments

When defining a function, you can add defaults to arguments that you want to be optional. When defining and providing arguments, required arguments always go first, and the order they're provided in matters. Optional arguments follow, and can be passed by their keyword in any order.


In [8]:
def multiply_them(x, y, extra_arg1=None, extra_arg2=None):
    if extra_arg1 is not None:
        print(extra_arg1)
    if extra_arg2 is not None:
        print(extra_arg2)
    
    print('multiplying {} and {}...'.format(x, y))
    return x * y

In [9]:
multiply_them(3, 5)


multiplying 3 and 5...
Out[9]:
15

In [10]:
multiply_them(3, 5, extra_arg1='hello')


hello
multiplying 3 and 5...
Out[10]:
15

In [11]:
multiply_them(3, 5, extra_arg2='world', extra_arg1='hello')


hello
world
multiplying 3 and 5...
Out[11]:
15

In [12]:
multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5)


  File "<ipython-input-12-89490f4161a8>", line 1
    multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5)
                                                         ^
SyntaxError: positional argument follows keyword argument

Your turn

  • Write a function that finds the number of elements in a list (without using the built-in len function). Now, use %%timeit to compare the speed to len for a list of 100,000 elements.

In [16]:
my_list = [1, 2, 3]
for el in my_list:
    print(el)


1
2
3

In [15]:
def count_elements(my_list):
    counter = 0
    for el in my_list:
        counter += 1
    
    return counter

In [17]:
count_elements(my_list)


Out[17]:
3

In [21]:
my_list = list(range(1000))

In [22]:
%%timeit
count_elements(my_list)


10000 loops, best of 3: 43.7 µs per loop

In [23]:
%%timeit
len(my_list)


10000000 loops, best of 3: 77.8 ns per loop

  • Write a function that finds the minimum value in a list of numbers (without using the built-in min function). Include an optional argument that specifies whether to take the absolute values of the number first, with a default value of False.

In [24]:
def get_min(my_list):
    potential_min = my_list[0]
    for el in my_list[1:]:
        if el < potential_min:
            potential_min = el
    
    return potential_min

In [25]:
my_list = [3, 2, 1]
print(get_min(my_list))


1
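The get_min function above doesn't yet include the optional argument from the prompt. Here's one sketch of how it might be added. The argument name use_abs is my own choice:

```python
def get_min(my_list, use_abs=False):
    """Return the smallest element, optionally comparing absolute values."""
    # Build the list of values to compare, taking absolute values if asked.
    values = []
    for el in my_list:
        if use_abs:
            values.append(abs(el))
        else:
            values.append(el)

    # Same search as before: track the smallest value seen so far.
    potential_min = values[0]
    for el in values[1:]:
        if el < potential_min:
            potential_min = el

    return potential_min


print(get_min([3, -2, 1]))
print(get_min([3, -2, 1], use_abs=True))
```

Because use_abs defaults to False, calling get_min with just a list behaves the same way as the original version.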

Modules

Knowing how to create your own functions can be a rabbit hole - once you know that you can make Python do whatever you want it to do, it can be easy to go overboard. Good data scientists are efficient data scientists - you shouldn't reinvent the wheel by reimplementing a bunch of functionality that someone else worked hard on. Doing anything nontrivial can take a ton of time, and without spending even more time to write tests, squash bugs, and address corner cases, your code can easily end up being much less reliable than code that someone else has spent time perfecting.

Python has a very robust standard library of external modules that come with every Python installation. For even more specialized work, the Python community has also open-sourced tens of thousands of packages, any of which is a simple pip install away.

The standard library

The Python standard library is a collection of packages that ships with Python itself. In other words, it contains a bunch of code that you can import into code you're writing, but that you don't have to download separately after downloading Python.

Here are a few examples -


In [30]:
import random    # create (pseudo-) random numbers
random.random()  # choose a float between 0 and 1 (uniformly)


Out[30]:
0.8578264351462442

In [31]:
import math    # common mathematical functions that aren't built into base Python
print(math.factorial(5))


120

In [32]:
math.log10(100)


Out[32]:
2.0

In [33]:
import statistics    # some basic summary statistics
my_list = [1, 2, 3, 4, 5]
statistics.mean(my_list)


Out[33]:
3

In [34]:
statistics.median(my_list)


Out[34]:
3

In [35]:
statistics.stdev(my_list)


Out[35]:
1.5811388300841898

In [36]:
dir(statistics)


Out[36]:
['Decimal',
 'Fraction',
 'StatisticsError',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_coerce',
 '_convert',
 '_counts',
 '_decimal_to_ratio',
 '_exact_ratio',
 '_isfinite',
 '_ss',
 '_sum',
 'collections',
 'groupby',
 'math',
 'mean',
 'median',
 'median_grouped',
 'median_high',
 'median_low',
 'mode',
 'pstdev',
 'pvariance',
 'stdev',
 'variance']

There are dozens of packages in the standard library, so if you find yourself writing a function for something lots of other people might want to do, it's definitely worth checking whether that function is already implemented in the Python standard library.
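
One quick way to do that check from inside a notebook is sketched below. It assumes you already have a guess at which module to look in (statistics here), and the variable name public_names is just for illustration. It filters the dir() listing down to the names that don't start with an underscore, and compares that with the module's own __all__ list of names it means to export.

```python
import statistics

# Names starting with an underscore are internal helpers, so skip them.
# Note that dir() also lists things the module itself imported (e.g. math).
public_names = []
for name in dir(statistics):
    if not name.startswith("_"):
        public_names.append(name)
print(public_names)

# The names the module's authors intend for you to use
print(statistics.__all__)
```

Calling help(statistics.median) (or typing statistics.median? in a notebook cell) then shows the documentation for any function that looks promising.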

We'll use a handful of packages from the standard library in this course, which I'll introduce as they appear.

Third party libraries and the Python Package Index

The standard library can't cover everything people use Python for. For more specialized packages, the Python Software Foundation runs the Python Package Index (PyPI, pronounced pie-pee-eye). PyPI is a package server that is free to upload to and download from - anyone can create a package and upload it to PyPI, and anyone can download any package from PyPI at any time.

To download and install a package from PyPI, you typically use a program called pip (pip installs packages) by running the command pip install <package name> from the command line.

How does import work?

Above, we saw some examples of importing external modules to use in our code. In general, a single Python file or a directory of files can be imported.

When Python sees the import my_module command, it first checks whether my_module has already been imported or is built into the interpreter. If not, it searches the directories listed in sys.path, in order. The first entry is typically the directory of the script being run (or the current working directory in an interactive session), followed by the standard library and then the directory on your computer where packages from PyPI are installed. If one of those directories contains a Python file my_module.py or a directory of Python files my_module/, that code is run and its functions and classes become accessible under the name my_module. If Python doesn't find anything, it raises an ImportError.
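
You can inspect this search yourself. The cell below is a minimal sketch using the sys module from the standard library. The exact directories printed depend on your installation, so no output is shown, and the variable name already_imported is just for illustration.

```python
import sys
import statistics

# Directories Python searches, in order, when it sees an import statement
for directory in sys.path:
    print(directory)

# The file that the statistics module was actually loaded from
print(statistics.__file__)

# Modules that have already been imported are cached in sys.modules
already_imported = "statistics" in sys.modules
print(already_imported)
```

Because the directory you're working in is searched early, naming one of your own files random.py or statistics.py can hide the standard library module of the same name.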

There are several ways of arranging the namespace for imports:


In [37]:
import statistics
statistics.median(my_list)


Out[37]:
3

In [40]:
import statistics as nick
nick.median(my_list)


Out[40]:
3

In [41]:
from statistics import median
median(my_list)


Out[41]:
3

In [43]:
mean(my_list)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-43-69d9613d7984> in <module>()
----> 1 mean(my_list)

NameError: name 'mean' is not defined

In [48]:
from statistics import *
def median(x):
    return x
median(my_list)


Out[48]:
[1, 2, 3, 4, 5]

In [45]:
mean(my_list)


Out[45]:
3

Star imports (from some_module import *) are almost always a bad idea and should be avoided. Can you think of why that is?

Your turn

Write a function that calculates the median of a list of numbers (without using statistics). Use the randint function from the random module to create a list of integers to test your function.
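
As a starting point, here is a hedged sketch of building the test list only (the median function is still up to you). The range 1 to 10, the list length of 7, and the seed value are arbitrary choices for illustration.

```python
import random

random.seed(42)  # optional: makes the "random" numbers repeatable

# randint(a, b) returns an integer between a and b, including both endpoints
test_list = []
for _ in range(7):
    test_list.append(random.randint(1, 10))

print(test_list)
```

Try lists with both an odd and an even number of elements, since the median is calculated differently in the two cases.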


In [ ]:

Wrapping up

This notebook is fairly information-dense, especially if you haven't used Python before. Keep it close by for reference as the course goes along! Thankfully, Python syntax is fairly friendly toward beginners, so picking up the basics usually doesn't take too long. I hope you'll find as the course goes along that the Python syntax starts to feel more natural. Don't get discouraged; know when to ask for help, and look online for resources. And remember - the Python ecosystem is deep, and it can take years to master!

Other resources

Take-home exercises

Write a function that takes a string and a character as its arguments, replaces every occurrence of that character in the string with '#', and returns the new string. For example, replace_chars('this sentence starts with "t"', 't') should return '#his sen#ence s#ar#s wi#h "#"'. Try doing this by hand as well as using a built-in Python function.


In [ ]:

Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.


In [ ]:

Using Python's random module, write a program that rolls a die (i.e. generates a random integer between 1 and 6) 100,000 times. Write pure Python functions to calculate the mean and variance (look up the formulas if you can't remember them) of the rolls.


In [ ]: