Working with data 2017. Class 1

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. About Python
  2. Data types, structures and code
  3. Read csv files to dataframes
  4. Basic operations with dataframes
  5. My first plots
  6. Debugging python
  7. Summary

In [1]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))


1. About python

  • Python: Easy programming language, great for text analysis = English. (R = Dutch)
  • iPython notebook: where you write Python = Word
  • Python packages: extensions = Tabs in word

In [1]:
#Using the symbol "#" you write comments
#printing something to the screen is easy:
print("Hello World") 

#Now click on the play button in the toolbar above (or press Ctrl + Enter)


Hello World

1.1 Jupyter notebook

Let's check the different screens of the jupyter notebook

DASHBOARD


In [9]:
Image("./images/dashboard_files_tab.png",width=500)


Out[9]:

NEW NOTEBOOK


In [10]:
Image("./images/dashboard_files_tab_new.png",width=200)


Out[10]:

BUTTONS TO REMOVE AND RENAME


In [11]:
Image("./images/dashboard_files_tab_btns.png",width=400)


Out[11]:

CELLS IN JUPYTER NOTEBOOKS

Two types of cells: Markdown and Code cells

  • Markdown cells (like this one): For text
  • Code cells (like the previous one): For python

They have two modes:

  • Edit mode
  • Command mode: For shortcuts

EDIT MODE


In [12]:
Image("./images/edit_mode.png")


Out[12]:

COMMAND MODE


In [13]:
Image("./images/command_mode.png")


Out[13]:

RUN PYTHON

Write some code in a code cell (the default one) and click the "play button" (shortcut Ctrl+Enter)


In [4]:
Image("./images/menubar_toolbar.png")


Out[4]:

WHY JUPYTER NOTEBOOKS

  • Allow for interactive use
  • Can combine text, code and plots easily
  • Jupyter notebooks allow you to do fancy things. For instance:
    • Autocomplete if you press Tab
    • Help if you write "?" after a name (example in next cell)
    • Magic commands like "%matplotlib inline", which shows the plots inside the notebook

In [5]:
#Let's say that a = 5.3, and ask jupyter for help with a. We'll see more on this later.
#Select this cell and run it (Ctrl + Enter)
a = 5.3
a?

1.2 Python packages

  • Packages in python are extras, giving more functionalities.
  • They are the equivalent to the tabs in excel, you have one for plotting, one for sorting the data, etc.
  • Before using them you usually need to install them (but they are already installed in the server) and then import them to python

In [16]:
## HOW TO IMPORT PACKAGES AND READ A CSV (we'll learn this in one hour)
#Standard mode
import pandas
spreadsheet = pandas.read_csv("data/class1_test_csv.csv") 

#Standard mode with packages that have long names
import pandas as pd
spreadsheet = pd.read_csv("data/class1_test_csv.csv") 

#Standard mode when you only want to import one function
from pandas import read_csv
spreadsheet = read_csv("data/class1_test_csv.csv") 

#Import everything, DO NOT USE! It's against the Zen of Python (https://www.python.org/dev/peps/pep-0020/) 
from pandas import *
spreadsheet = read_csv("data/class1_test_csv.csv")

To install new packages you can use pip. For example run the code cell below


In [17]:
#Let's install the package pandas, which is used to work with tables of data (dataframes)
!pip install pandas


Requirement already satisfied: pandas in /opt/anaconda/anaconda3/lib/python3.5/site-packages
Requirement already satisfied: python-dateutil>=2 in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: pytz>=2011k in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: numpy>=1.7.0 in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from python-dateutil>=2->pandas)

1.3 Python

  • Programming language created by Guido van Rossum (graduated from UvA in 1982, created Python in 1989)
  • Named after Monty Python's Flying Circus.
  • Emphasizes readability => It's easy
  • Code written in Python2 often doesn't work in Python3 because of small but incompatible differences (for example, print is a function in Python3).
  • Many people still use Python2, but Python3 is much better for data analysis and we will use it here.

The Zen of Python

  • Beautiful is better than ugly.
  • Simple is better than complex.
  • Readability counts.
  • If the implementation is hard to explain, it's a bad idea.
  • There should be one -- and preferably only one -- obvious way to do it.
  • Although that way may not be obvious at first unless you're Dutch.

2. PYTHON: Variables and code

Python uses variables and code.

2.1 Variables

Variables tell the computer to save something (a number, a string, a spreadsheet) with a name. For instance, if you write variable_name = 3, the computer knows that variable_name is 3.

  • Data types: Numbers, strings and others
  • Data structures:
    • Lists, tables...

2.2 Code

  • Instructions to modify variables
  • Can be organized in functions
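
A minimal sketch of what "code organized in a function" can look like. The function add_bonus and the variable grade_hw_1 are hypothetical examples, not something we will use later in the course.

```python
def add_bonus(grade, bonus=0.5):
    """Return the grade plus a bonus, never going above 10."""
    return min(grade + bonus, 10)

# Instructions that modify a variable
grade_hw_1 = 7.0
grade_hw_1 = add_bonus(grade_hw_1)  # the variable now holds the new value
print(grade_hw_1)
```

Writing the instruction once inside a function means you can reuse it for every grade instead of repeating the same lines.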

2.1 Variables: Data Types and Structures

2.1.1 Data Types and operations

  • number
    • int: -2, 0, 1
    • float: 3.5, 4.23
  • string: "I'm a string" => Can use either single quotes or double quotes. Better to use double quotes.
  • boolean: False/True
  • None: None

You can do operations with them, such as multiplication for numbers, concatenation for strings, etc


In [18]:
print(type(3))
print(type(3.5))
print(type("I'm a string"))
print(type(False))
print(type(None))


<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
<class 'NoneType'>

2.1.1.1 Numbers

Python3 converts ints into floats when you try to do something that requires decimals.


In [19]:
##Using python as a calculator
print(5+2) #5+2
print(5*2) #5x2
print(5/2) #5/2
print(5**2) #5^2, 5 to the power of two
print(5%2) #This is called modulo, and gives you the remainder when you divide 5 by 2 (5 = 2*2 + 1)


7
10
2.5
25
1
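
To make the int-to-float conversion concrete, here is a short sketch (the variable names are only for illustration). In Python3 the division operator / always gives a float, while // (integer division) keeps the result as an int.

```python
exact_division = 4 / 2     # "/" always returns a float in Python3
integer_division = 5 // 2  # "//" drops the decimals and returns an int
mixed_sum = 2 + 0.5        # int + float gives a float

print(exact_division, type(exact_division))
print(integer_division, type(integer_division))
print(mixed_sum, type(mixed_sum))
```
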

In [20]:
##We can also "assign" the number to a "variable". 
#The variable name can be whatever you want, but cannot start with a number and CANNOT contain spaces. 
#Please use variable names that describe what they represent
#"grades_hw_1" is much better than "a14"
var1 = 5
var2 = 2
print(var1+var2) #5+2
print(var1*var2) #5x2
print(var1/var2) #5/2
print(var1**var2) #5^2, 5 to the power of two
print(var1%var2) #This is called modulo, and gives you the remainder when you divide 5 by 2 (5 = 2*2 + 1)


7
10
2.5
25
1

2.1.1.2 Strings:

Strings are a series of characters, such as "eggs and bacon".

Beware of the encoding

  • The computer uses 0s and 1s to encode strings
  • We used to use ASCII encoding, which reads blocks of 7 bits (0/1). This is enough to represent 128 characters (2^7): lower and upper case letters and some punctuation, but not accented or special symbols (e.g. é,ó,í). It's the default of python2 (bad for text analysis).
  • Nowadays we use UTF-8 encoding, which can handle all symbols in any language. It's the default of python3.
  • But some programs use UTF-16, ASCII or ISO-8859-1, which can make your code break. If at some point you're reading a file and the content shows up as weird symbols, this is likely the problem. Look for an "encoding" option when reading the file.
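
The "weird symbols" problem can be reproduced in a few lines. This is only a sketch with a made-up example word: we encode a string with UTF-8 and then read the same bytes back with the wrong encoding (ISO-8859-1, called "latin-1" in Python).

```python
text = "café"

# Convert the string to bytes (0s and 1s) using UTF-8
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # the é takes two bytes in UTF-8

# Reading those bytes with the wrong encoding gives weird symbols
wrong_text = utf8_bytes.decode("latin-1")
print(wrong_text)

# Reading them with the right encoding gives back the original string
right_text = utf8_bytes.decode("utf-8")
print(right_text)
```

When reading files the fix is usually the same: tell the reader which encoding to use, for instance through the encoding argument that pandas.read_csv accepts.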

In [21]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/c/c4/Utf8webgrowth.svg")


Out[21]:

In [5]:
v = 5.321233258340857891

print("The value was {}".format(v))


The value was 5.3212332583408575
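
If you don't want all those decimals, format() can also round the number for you. A short sketch, reusing the same value: {:.2f} means "show as a float with 2 decimals".

```python
v = 5.321233258340857891

two_decimals = "The value was {:.2f}".format(v)  # keep 2 decimals
no_decimals = "The value was {:.0f}".format(v)   # keep 0 decimals

print(two_decimals)
print(no_decimals)
```
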

In [3]:
print("eggs")
print("eggs" + "and" + "bacon") #concatenating strings
print("eggs" + " and " + "bacon") #concatenating strings with spaces
print("eggs and bacon".upper()) #upper case; use lower() for lower case

##String formatting. Each element inside format() is added in the place of each {}
print("{} {} and {} = diabetes".format(5,"sausages","bacon")) #used to format strings. 

##Checking if string is contained
print("bacon" in "eggs and bacon") #checks if the string "bacon" is part of the string "eggs and bacon"


eggs
eggsandbacon
eggs and bacon
EGGS AND BACON
5 sausages and bacon = diabetes
True

In [6]:
## We can also use variables
var1 = "eggs"
var2 = "bacon"
print(var1)
print(var1 + "and" + var2) #concatenating strings
print(var1 + " and " + var2) #concatenating strings with spaces

var_combined = var1 + " and " + var2
print(var_combined.upper()) #upper case; use lower() for lower case

##String formatting. Each element inside format() is added in the place of each {}
print("{} {} and {} = diabetes".format(5,var1,var2)) #used to format strings. 

##Checking if string is contained
print("bacon" in var_combined) #checks if the string "bacon" is part of the string "eggs and bacon"


eggs
eggsandbacon
eggs and bacon
EGGS AND BACON
5 eggs and bacon = diabetes
True

In [7]:
var_combined


Out[7]:
'eggs and bacon'

In [10]:
#lower and upper case are different characters
print("Bacon" in var_combined)


False

2.1.1.3 Booleans (True/False)

Useful when you are comparing variables to data (variable1 == variable2)

Careful: True (capitalized) is a boolean, but true (not capitalized) is not defined, so Python raises a NameError.
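
A small sketch of what happens with the lowercase version. The variable name is_raining is made up; the try/except is only there so the cell keeps running after the error.

```python
is_raining = True          # capitalized: a boolean
print(type(is_raining))

try:
    is_raining = true      # not capitalized: Python looks for a variable called "true"
except NameError as error:
    print("Python complains:", error)
```
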

Common mistake

  • We compare variables using "==". variable1 == variable2 asks the computer if variable1 is equal to variable2. The computer answers with a boolean (True/False)
  • If you write variable1 = variable2, you are not comparing but assigning: this tells the computer to save the value of variable2 in variable1
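
The difference between the two symbols, as a short sketch with made-up values:

```python
variable1 = 3
variable2 = 5

# "==" asks a question and does not change anything
are_equal = (variable1 == variable2)
print(are_equal)   # the answer is a boolean
print(variable1)   # variable1 still has its old value

# "=" gives an order: save the value of variable2 in variable1
variable1 = variable2
print(variable1)
print(variable1 == variable2)  # now the comparison gives a different answer
```
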

In [24]:
print("bacon in var_combined: ", "bacon" in var_combined)
print("bacon == var1: ","bacon" == var1) ##look at the == symbol
print("bacon == var2: ", "bacon" == var2)
print("3 > 5: ", 3 > 5) 
print("3 < 5: ", 3 < 5)


bacon in var_combined:  True
bacon == var1:  False
bacon == var2:  True
3 > 5:  False
3 < 5:  True

2.* How Python reads your code

THE COMPUTER READS IT LINE BY LINE, FROM TOP TO BOTTOM


In [12]:
## OPERATIONS ON DATA TYPES
#Tells the computer that b = 3
b = 3

#Tells the computer that b = 5
b = 5
#Prints b, which is now 5 (the last value we assigned)
print(b)


5

Tell the computer exactly what you want to do on what variable

  • a = 3
  • print(a) #this is good, you are telling the computer to print a
  • print() #the computer has no idea what to print, it doesn't care that one line before you were talking about a

In [26]:
#The computer prints a (3)
a = 3
print(a)


3

In [27]:
#The computer prints only an empty line, because we didn't tell it what to print
a = 3
print()




We learned about data types and some basic operations (*,%,etc) and how to print them.

But now we want to combine them, which is convenient when you have many values. For instance, you may want to read all the numbers in a csv file without creating thousands of variables. We combine them in DATA STRUCTURES (next notebook)
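
As a small preview (a sketch with made-up grades, the details come in the next notebook), a list lets you keep many numbers under one name:

```python
# One variable instead of grade_1, grade_2, grade_3, grade_4
grades = [7.5, 8.0, 6.5, 9.0]

number_of_grades = len(grades)                 # how many values are in the list
average_grade = sum(grades) / number_of_grades  # operate on all of them at once

print(number_of_grades)
print(average_grade)
```
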