Working with data 2017. Class 2

Contact

Javier Garcia-Bernardo garcia@uva.nl

0. Structure

  1. Data types, structures and code II
  2. Merging and concatenating dataframes
  3. My second plots
  4. Summary

In [1]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

#Usual imports
import pandas as pd
import numpy as np
import pylab as plt


PYTHON: Variables and code

Python uses variables and code.

Variables

Variables tell the computer to save something (a number, a string, a spreadsheet) with a name. For instance, if you write variable_name = 3, the computer knows that variable_name is 3.

  • Data types: Numbers, strings and others
  • 1.1 Data structures:
    • Lists, tables... (full of data types)

Code

  • Instructions to modify variables
  • 1.2 Can be organized in functions
  • Variables can be seen for all or part of the code: 1.3 Scope of variables
  • 1.4. For loops: Repeat a similar statement many times
  • 1.5 Control-flow: if-else statements, try-except statement and for-loops
  • 1.6 Try-except: error catching

1.1 Dictionary (type of data structure)

  • Like the index of a book, it finds a page very quickly.
  • It combines keys (words in the index) with values (the page number associated with each word): {key1: value1, key2: value2}
  • The keys can be numbers, strings or tuples, but NOT lists (if you try, Python raises TypeError: unhashable type: 'list')
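
A minimal sketch of the rule about keys, using made-up coordinates as an illustration (the names coordinates and bad_dict are not part of the class material). It uses try-except, which is explained in section 1.6, only to show the error message without stopping the notebook.

```python
# Tuples work as keys because they cannot be modified (they are hashable)
coordinates = {(52.37, 4.90): "Amsterdam", (52.09, 5.12): "Utrecht"}
print(coordinates[(52.37, 4.90)])

# Lists can be modified, so Python refuses to use them as keys
try:
    bad_dict = {[52.37, 4.90]: "Amsterdam"}
except TypeError as error:
    error_message = str(error)
    print(error_message)
```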

In [3]:
#Dictionary
this_is_a_dict = {"Javier": "garcia@uva.nl", "Friend1": "f1@uva.nl", "Friend2": "f2@uva.nl"}
print(this_is_a_dict)
print(type(this_is_a_dict))


{'Friend1': 'f1@uva.nl', 'Javier': 'garcia@uva.nl', 'Friend2': 'f2@uva.nl'}
<class 'dict'>

OPERATIONS IN DICT

Get


In [4]:
#Get an element
print(this_is_a_dict["Friend2"])
print(this_is_a_dict.get("Friend2"))


f2@uva.nl
f2@uva.nl

In [6]:
#The difference between the two is that while the first line gives an error if the key (e.g. "Friend5")
#is not part of the dictionary, the second one returns None
print(this_is_a_dict.get("Friend5")) #not enough friends


None
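
As a small addition, .get() also accepts a second argument that is returned instead of None when the key is missing, and the in operator tells you whether a key exists. The dictionary emails below is a shortened, illustrative copy of the one above.

```python
emails = {"Javier": "garcia@uva.nl", "Friend1": "f1@uva.nl"}

# The second argument of .get() is returned when the key is missing
missing = emails.get("Friend5", "no email known")
print(missing)

# The "in" operator checks whether a key exists
has_friend1 = "Friend1" in emails
print(has_friend1)
```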

Add


In [7]:
#Create an element
this_is_a_dict["Friend3"] = "f3@uva.nl"

In [12]:
this_is_a_dict


Out[12]:
{'Friend1': 'f1@uva.nl',
 'Friend2': 'f2@uva.nl',
 'Friend3': 'f3@uva.nl',
 'Javier': 'garcia@uva.nl'}

In [9]:
#Print the keys
print(this_is_a_dict.keys())


dict_keys(['Friend1', 'Javier', 'Friend3', 'Friend2'])

In [10]:
#Print the values
print(this_is_a_dict.values())


dict_values(['f1@uva.nl', 'garcia@uva.nl', 'f3@uva.nl', 'f2@uva.nl'])

Remove


In [13]:
del this_is_a_dict["Friend3"]
print(this_is_a_dict)


{'Friend1': 'f1@uva.nl', 'Javier': 'garcia@uva.nl', 'Friend2': 'f2@uva.nl'}

Creating a dictionary from two lists: ZIP


In [15]:
#Creating dictionary using two lists
list_names = ["Javier", "Friend1", "Friend2"]
list_numbers = ["garcia@uva.nl","f1@uva.nl","f2@uva.nl"]

#Put both together using zip
this_is_a_dict = dict(zip(list_names,list_numbers))
print(this_is_a_dict)


{'Friend1': 'f1@uva.nl', 'Javier': 'garcia@uva.nl', 'Friend2': 'f2@uva.nl'}

In [16]:
#The zip object is another strange data structure that we cannot see (like range)
print(zip(list_names,list_numbers))


<zip object at 0x7fb1ca22df88>

In [17]:
#But we can convert it to a list to see how it looks (like range)
print(list(zip(list_names,list_numbers)))


[('Javier', 'garcia@uva.nl'), ('Friend1', 'f1@uva.nl'), ('Friend2', 'f2@uva.nl')]

Why use a dict? Because looking up a key is much faster than searching a list: finding an element in a dict takes roughly the same time no matter how large it is, while in a list the time grows with the number of elements. As rough orders of magnitude:

  • With 10 elements, finding an element in a dict is about 2x faster than finding it in a list
  • With 1000 elements, finding an element in a dict is about 100x faster than finding it in a list
  • With 1 million elements, finding an element in a dict is about 100000x faster than finding it in a list

Dicts are useful, for instance, to assign values to words.
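
You can check the speed difference yourself with the standard timeit module. This is only a rough sketch: the size n, the variable names and the number of repetitions are arbitrary choices, and the exact times depend on your computer.

```python
import timeit

n = 100000
big_list = list(range(n))
big_dict = dict(zip(big_list, big_list))  # keys and values are the same numbers

target = n - 1  # worst case for the list: the last element

# Repeat each lookup 100 times and measure the total time in seconds
time_list = timeit.timeit(lambda: target in big_list, number=100)
time_dict = timeit.timeit(lambda: target in big_dict, number=100)

print("list:", time_list)
print("dict:", time_dict)
```

The list has to be scanned element by element, while the dict goes (almost) directly to the key.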

1.2-3 Code: Operations, functions, control flow and loops

  • We have the data in data structures, composed of several data types.
  • We need code to edit everything

1.2 Functions

  • A fragment of code that takes some standard input (arguments) and returns some standard output.
  • Example: The mean function. Gets a list of numbers as input, gives the mean as output. Gives an error if you try to calculate the mean of some strings.
  • We have already seen many functions. Add, mean...

In [18]:
## Our own functions
def mean_ours(list_numbers): #list_numbers is the argument
    """
    This is called the docstring, it is a comment describing the function. In this case the function calculates the mean of a list of numbers.
    
    input 
    list_numbers: a list of numbers
    
    output: the mean of the input   
    
    """
    #what the function gives back
    return sum(list_numbers)/len(list_numbers)

##INDENTATION!!
##Colon (:) at the end of the "def" line

In [19]:
mean_ours?

In [20]:
aList = [2,3,4]
print(mean_ours(aList)) #this is how you call the function


3.0

How the arguments of a function work

If there are many arguments, the first value that you pass is matched to the first argument of the function, the second to the second, etc.

For instance, these are the arguments of the function pd.read_csv()

`pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',...)`

Writing

`pd.read_csv("data/ams_green.csv","\t",None,0)`

matches

`filepath_or_buffer TO "data/ams_green.csv",
sep TO "\t",
delimiter TO None, 
header TO 0`

You can also pass the arguments by name. For instance

`pd.read_csv("data/ams_green.csv",header= 0, sep="\t",delimiter=None)`

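is identical to the line before. In this case the values you pass do not have to be in the same order as the arguments. Passing by name is also the safer habit: recent pandas versions (2.0 and later) only accept the first argument of pd.read_csv by position, so the purely positional call above only works in older versions.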
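
The same matching rules can be tried out without any file. The function describe_file below is a hypothetical helper written for this illustration (it is not part of pandas); it only reports the values it received.

```python
def describe_file(path, sep=",", header=0):
    """Hypothetical helper that only reports the values it received."""
    return "path={} sep={!r} header={}".format(path, sep, header)

# Arguments matched by position: first value -> path, second -> sep, third -> header
by_position = describe_file("data/example.csv", "\t", 0)

# Arguments matched by name: the order does not matter
by_name = describe_file("data/example.csv", header=0, sep="\t")

# Arguments that are not passed keep their default value
only_default = describe_file("data/example.csv")

print(by_position)
print(by_position == by_name)
print(only_default)
```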

1.3 Scope: Global vs local variables

  • Variables inside functions are only seen by that function
  • Variables outside functions are seen by all functions, and mutable ones (such as lists) can be modified by them (dangerous)

In [21]:
def f(): 
    local_var1 = 2
    local_var2 = 3
    local_var = local_var1*local_var2
    print(local_var)

#Call the function
f()


6

Variables created inside functions are only seen within the function


In [22]:
def f(): 
    local_var1 = 2
    local_var2 = 2
    local_var = local_var1*local_var2

#Call the function
f()
#We haven't created local_var
print(local_var)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-22-2e798224af18> in <module>()
      7 f()
      8 #We haven't created local_var
----> 9 print(local_var)

NameError: name 'local_var' is not defined

In [23]:
def f(): 
    local_var1 = 2
    local_var2 = 2
    local_var = local_var1*local_var2
    return local_var

#Call the function
gvar = f()
#Now we have the value of local_var, stored in gvar (generally it is not a good idea to reuse the same name)
print(gvar)


4

Variables created outside functions are seen by all the code (be careful!)


In [24]:
local_var = "python"

def f(): 
    print(local_var) #this can read the variable outside, but NOT reassign it (methods like .pop() and .append() do modify a list in place)
    #it's okay for functions not to return anything, by default they return None
    
#Call the function
f()
#We can also see it from outside the function
print(local_var)


python
python
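
A short sketch of what "read but not change" means in practice. The names counter, visited and try_to_change are made up for this example. Assigning to a name inside a function creates a new local variable, while calling .append() modifies the list that lives outside.

```python
counter = 0
visited = []

def try_to_change():
    counter = 10          # creates a NEW local variable, the outside one is untouched
    visited.append("f")   # modifies the outside list in place
    return counter

local_value = try_to_change()
print(local_value)  # value of the local variable
print(counter)      # the outside variable did not change
print(visited)      # the outside list did change
```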

1.4 For-Loops

  • Iterate over a list (or an array, or a set, or the keys of a dictionary..), like this:

for element in [1,2,3,4,5]:
    print(element)

The computer:

  • Reads the first line (for element in [1,2,3,4,5]) and realizes it is a for loop
  • It then assigns element = 1 (the first element in the list) and does whatever it is inside the for loop (print(element))
  • Then it assigns element = 2 (the second element) and does whatever it is inside the loop
  • It continues like this until the last element
  • When there are no more elements in the list it exits the loop and continues reading the code behind the loop (if any)

You can use any name instead of element (for i in range(10), for instance).

The indentation and the colon are important: you get a SyntaxError without them.
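
Since a for loop can also iterate over the keys of a dictionary, here is a minimal sketch of that case. The dictionary emails is a shortened, illustrative copy of the one in section 1.1, and .items() is the dictionary method that gives the key and the value together.

```python
emails = {"Javier": "garcia@uva.nl", "Friend1": "f1@uva.nl"}

# Looping over a dictionary gives its keys
names = []
for name in emails:
    names.append(name)
print(names)

# .items() gives the key and the value at the same time
lines = []
for name, email in emails.items():
    lines.append(name + " -> " + email)
print(lines)
```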


In [9]:
for x in ["Adam","Utrecht"]:
    print(x)


Adam
Utrecht

In [12]:
for i,x in enumerate(["Adam","Utrecht"]):
    print(i,x)


0 Adam
1 Utrecht

In [13]:
i = 0
for x in ["Adam","Utrecht"]:
    print(i,x)
    i = i + 1


0 Adam
1 Utrecht

In [27]:
#Imagine we want to find what some articles are talking about, we could do it like this,
#but it's unfeasible when you have more than a dozen articles

list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles

print("python" in list_articles[0])
print("python" in list_articles[1])
print("python" in list_articles[2])
print("python" in list_articles[3])
#...


True
False
False
False

In [30]:
#but we can use for loops 
for a in list_articles:
    print("python" in a)


True
False
False
False

In [31]:
#this is very common as well (especially in other programming languages)
for index in [0,1,2,3]:
    print("python" in list_articles[index])


True
False
False
False

In [34]:
list(enumerate(list_articles))


Out[34]:
[(0, 'article 1: blah python'),
 (1, 'article 2: blah Trump'),
 (2, 'article 3: blah Trump'),
 (3, 'article 4: blah Trump')]

In [32]:
#this is sometimes useful when we want both the article and the index 
for index,article in enumerate(list_articles):
    print(index, "python" in article)


0 True
1 False
2 False
3 False

What if we want to stop a loop? Then we can use break


In [68]:
for index,article in enumerate(list_articles):
    if index == 2: break
    print(index, "python" in article)


0 True
1 False

What if we want to skip some rounds? Then we use continue


In [69]:
for index,article in enumerate(list_articles):
    if index%2 == 0: 
        continue #this skips the rest of the code below if the number is even
    print(index, "python" in article)


1 False
3 False

1.5 Control flow = if-else statements

  • Controls the flow. If something happens, then do something. Else, do another thing. Like this

`

article = "Trump is going to make America great"

if "python" in article:
    print("python",article)
elif "climate change" in article:
    print("climate change",article)
else:
    print("no python", article)

`

The computer:

  • Reads the first line (if "python" in article) and realizes it is an if-else statement
  • It then checks if "python" in article is True.
    • If it is True, it reads whatever is inside the if statement (in this case print("python",article)) and goes to the end of all the if-elif-else.
    • If it is False, it goes to the elif (else if), and checks if "climate change" in article is True.
      • If it is True, it reads whatever it is inside and goes to the end
      • If it is False, it goes to the else and prints whatever it is inside

You only need the if; the elif and else are optional. For instance, without the else the code above wouldn't print anything.

You can have as many elifs as you want.

The indentation and the colon are important: you get a SyntaxError without them.

Let's write code that tells us if an article is about python or Trump


In [2]:
article = "article 2: blah Trump python"
if "python" in article:
    print("Article referring to Python")
    
if "Trump" in article:
    print("Article referring to Trump")


Article referring to Python
Article referring to Trump

1.x1 Let's combine all we learned so far

Write a function that prints if an article is related to "python" or "Trump", and run it for many cases

We can wrap it into a function


In [39]:
def python_or_trump(article):
    """
    prints if an article is related to python or trump
    
    input
    article: string with words
    
    """
    
    if "python" in article:
        print("Article referring to Python")
    elif "Trump" in article:
        print("Article referring to Trump")
    else:
        print("Article not referring to Python or Trump")

In [40]:
article = "article 2: blah Trump"
print(article)
#this is how you call the function
python_or_trump(article)


article 2: blah Trump
Article referring to Trump

In [41]:
#stops when python is found, never checks for Trump
article = "article 2: blah Trump python"
print(article)
python_or_trump(article)


article 2: blah Trump python
Article referring to Python

In [42]:
article = "article 2: blah blah"
print(article)
python_or_trump(article)


article 2: blah blah
Article not referring to Python or Trump

Now we do it for many articles


In [43]:
list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles

for article in list_articles:
    python_or_trump(article)


Article referring to Python
Article referring to Trump
Article referring to Trump
Article referring to Trump

1.x2 Let's combine all we learned so far

Write a function that counts the number of articles with the words "python" and "Trump"


In [44]:
def count_words(list_articles):
    """
    input: list of articles
    
    output: number of articles with the word trump and with the word python
    """
    count_trump = 0
    count_python = 0
    for article in list_articles:
        if "python" in article.lower():
            count_python = count_python + 1 #count_python += 1
        if "trump" in article.lower():
            count_trump = count_trump + 1 #count_trump += 1
    
    return count_trump,count_python

In [47]:
import numpy as np
list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles
    
g_count_trump,g_count_python =  count_words(list_articles)
print(g_count_python)
print(g_count_trump)
print("python articles: ", g_count_python)
print("trump_articles: ", g_count_trump)


1
3
python articles:  1
trump_articles:  3

In [50]:
[0]*10


Out[50]:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Let's make it a bit more flexible


In [53]:
#Let's use a list of numbers instead of two separate variables for the counter

list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles

def count_words(list_articles):
    counters = [0]*2 # [0,0]
    for article in list_articles:
        if "python" in article:
            counters[0] += 1 #count_python += 1
            #counters[0] = counters[0] + 1
        if "Trump" in article:
            counters[1] +=  1 #count_trump += 1
    
    return counters
    
counters =  count_words(list_articles)
print("python articles: ")
print(counters[0])
print("trump_articles: ")
print(counters[1])


python articles: 
1
trump_articles: 
3

In [55]:
# And allow for any two words, not just python or Trump
list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles

def count_words(list_articles,words):
    counters = [0]*2
    for article in list_articles:
        if words[0] in article:
            counters[0] += 1 #count_python += 1
        if words[1] in article:
            counters[1] +=  1 #counter of the second word
    
    return counters
    
counters =  count_words(list_articles,words=["python","blah"])
print("python articles: ", counters[0])
print("blah_articles: ", counters[1])


python articles:  1
blah_articles:  4

In [67]:
words = ["python","Trump","blah"]
list(enumerate(words))


Out[67]:
[(0, 'python'), (1, 'Trump'), (2, 'blah')]

In [63]:
list(range(len(words))),words


Out[63]:
([0, 1, 2], ['python', 'Trump', 'blah'])

In [66]:
enumerate(words)
zip(range(len(words)),words)


Out[66]:
<zip at 0x7fb1ca1904c8>

In [59]:
# And allow for any number of words, not just two
list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]#many articles

def count_words(list_articles,words):
    counters = [0] * len(words)
    for article in list_articles:
        
        for i,word in enumerate(words):
        
            if word in article:
                counters[i] += 1
            
        
    return counters
    
words = ["python","Trump","blah"]
counters =  count_words(list_articles,words)
print(words)
print(counters)


['python', 'Trump', 'blah']
[1, 3, 4]

In [60]:
#We can make a dictionary out of it
d_word2counter = dict(zip(words,counters))
d_word2counter["Trump"]


Out[60]:
3
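
As a possible next step, the counters can live in a dictionary from the start, so there is no need to keep track of positions in a list. This is only a sketch: count_words_dict is a hypothetical variant of count_words, not a function used elsewhere in the class.

```python
def count_words_dict(list_articles, words):
    """Hypothetical variant of count_words that returns {word: number of articles}."""
    counters = {}
    for word in words:
        counters[word] = 0          # every word starts at zero
    for article in list_articles:
        for word in words:
            if word in article:
                counters[word] += 1
    return counters

list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]

d_word2counter = count_words_dict(list_articles, ["python", "Trump", "blah"])
print(d_word2counter)
```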

What if we want a loop but we don't know when we need to stop?

Then we can use the while loop:

while condition:
    do something
    update condition #otherwise the loop is infinite

However, in Python it is not too common.
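
A minimal runnable sketch of the pattern above: halve a number until it reaches 1 and count how many rounds that takes. The names number and steps are chosen only for this example.

```python
number = 100
steps = 0
while number > 1:           # the condition is checked before every round
    number = number // 2    # update the condition, otherwise the loop never ends
    steps = steps + 1

print(steps)
print(number)
```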

1.6 Try - except

Exception handling. Sometimes the code tries something that can result in an error, and you want to catch the error and react to it.


In [74]:
#For instance this fails, because we don't have more than 2 friends
this_is_a_dict = {"Javier": "garcia@uva.nl", "Friend1": "f1@uva.nl", "Friend2": "f2@uva.nl"}

In [75]:
print(this_is_a_dict["Friend5"])


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-75-70999cd0a563> in <module>()
----> 1 print(this_is_a_dict["Friend5"])

KeyError: 'Friend5'

In [77]:
f5 = this_is_a_dict.get("Friend5")
if f5 is None: #f5 == None 
    print("Not enough friends")


Not enough friends

In [72]:
#example how to fix it
#the indents are important, as well as the colons
try:
    print(this_is_a_dict["Friend5"])
except KeyError:
    print("Not enough friends")


Not enough friends

In [71]:
#but this one is very common and we have a function that does it for us
print(this_is_a_dict.get("Friend5"))


None
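
Try-except is most useful when no ready-made function exists. A typical case, sketched here with made-up data (raw_values is not from the class files), is converting strings to numbers when some of them are not numbers: float() raises a ValueError for those.

```python
raw_values = ["3", "4.5", "n/a", "10"]

numbers = []
not_numbers = []
for value in raw_values:
    try:
        numbers.append(float(value))
    except ValueError:
        # float("n/a") raises ValueError, so we end up here
        not_numbers.append(value)

print(numbers)
print(not_numbers)
```

Catching only ValueError (instead of every possible error) keeps unexpected problems visible.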

2. Writing and reading from disk

You can also write files line by line

"r": read; "w": write (creates the file if it doesn't exist, erases its content if it does); "w+": like "w", but also allows reading


In [78]:
with open("data/file_to_write.csv","w+") as f:
    f.write("I'm line number {}".format(0))
    f.write("I'm line number {}".format(1))
    f.write("I'm line number {}".format(2))
    f.write("I'm line number {}".format(3))
    f.write("I'm line number {}".format(4))

But remember to add a newline character (\n) at the end of each line


In [79]:
with open("data/file_to_write.csv","w+") as f:
    f.write("I'm line number {}\n".format(0))
    f.write("I'm line number {}\n".format(1))
    f.write("I'm line number {}\n".format(2))
    f.write("I'm line number {}\n".format(3))
    f.write("I'm line number {}\n".format(4))
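
To see what the modes do, the sketch below writes to a temporary folder (so no real data is touched) and opens the same file several times. The file name modes_example.txt is made up, and the "a" (append) mode is an extra that is not in the list above.

```python
import os
import tempfile

# Use a temporary folder so that no real data is overwritten
folder = tempfile.mkdtemp()
path = os.path.join(folder, "modes_example.txt")

with open(path, "w") as f:       # the file does not exist yet: "w" creates it
    f.write("first version\n")

with open(path, "w") as f:       # opening with "w" again erases the old content
    f.write("second version\n")

with open(path, "a") as f:       # "a" (append) adds to the end instead
    f.write("one more line\n")

with open(path, "r") as f:
    content = f.read()
print(content)
```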

There are 3 ways to read a file

We won't be reading files like this too often, but sometimes you need to read them line by line (instead of loading the whole file like we do with pandas)

  • Read it all as a string

In [80]:
#Ways to read files
with open("data/file_to_write.csv","r") as f:
    #way 1
    all_file = f.read()
print(all_file)


I'm line number 0
I'm line number 1
I'm line number 2
I'm line number 3
I'm line number 4

  • Read it as a list of lines, breaking at each "\n"

In [81]:
with open("data/file_to_write.csv") as f:
    #way 2
    all_file_by_line = f.readlines()
print(all_file_by_line)


["I'm line number 0\n", "I'm line number 1\n", "I'm line number 2\n", "I'm line number 3\n", "I'm line number 4\n"]

  • Read it line by line

In [82]:
with open("data/file_to_write.csv") as f:
    #way 3
    for line in f:
        print(line)


I'm line number 0

I'm line number 1

I'm line number 2

I'm line number 3

I'm line number 4


In [83]:
print("Hi")
print("Hi again")


Hi
Hi again

You can delete the "\n" (and any other whitespace) at the end of the string with .rstrip()


In [ ]:
with open("data/file_to_write.csv") as f:
    #way 3
    for line in f:
        print(line.rstrip())

In-class exercises


In [ ]:
with open("data/file_to_write.csv","w+") as f:
    f.write("I'm line number {}\n".format(0))
    f.write("I'm line number {}\n".format(1))
    f.write("I'm line number {}\n".format(2))
    f.write("I'm line number {}\n".format(3))
    f.write("I'm line number {}\n".format(4))

1. Use a loop to do the same as above (write 5 lines to a file)


In [87]:
with open("data/file_to_write.csv","w+") as f:
    for test in range(5):
        f.write("I'm line number {}\n".format(test))

2. Use an if-else statement to write only if the number is larger than 3


In [95]:
with open("data/file_to_write.csv","w+") as f:
    for test in range(5):
        if test > 3:
            f.write("I'm line number {}\n".format(test))

In [96]:
with open("data/file_to_write.csv","r") as f:
    print(f.read())


I'm line number 4

3. Encapsulate everything in a function, and call the function


In [97]:
def makesomethingup():
    with open("data/file_to_write.csv","w+") as f:
        for test in range(5):
            if test > 3:
                f.write("I'm line number {}\n".format(test))
    return None

makesomethingup()
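
One possible way to go a step further is to turn the fixed values into arguments. The function write_lines and its arguments are hypothetical names for this sketch, and it writes to a temporary folder instead of data/ so that it runs anywhere.

```python
import os
import tempfile

def write_lines(filename, n_lines, minimum):
    """Hypothetical generalization: write the lines whose number is larger than minimum."""
    written = 0
    with open(filename, "w") as f:
        for number in range(n_lines):
            if number > minimum:
                f.write("I'm line number {}\n".format(number))
                written += 1
    return written  # how many lines were written

path = os.path.join(tempfile.mkdtemp(), "file_to_write.csv")
n_written = write_lines(path, n_lines=5, minimum=3)
print(n_written)

with open(path) as f:
    content = f.read()
print(content)
```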

Everything in your computer/phone/online is based on these things you already know:

  • data types: numbers
  • data structures: lists and dictionaries
  • code: if-else and for-loops

Using these blocks you can create anything


In [84]:
#A character is a special type of number
ord("b")


Out[84]:
98

In [85]:
#A string is very similar to a list of characters
"abdc"[3]


Out[85]:
'c'

In [86]:
#A boolean is a number
print(True == 1)


True

In [ ]:
#A numpy array is a special type of list

#A pandas dataframe is a list of numpy arrays

#A set is a dictionary without values {"d":1,"e":3} vs {"d","e"}
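
A few of these statements can be checked directly. This is a small sketch with illustrative variable names: booleans behave like the numbers 1 and 0 when added, chr() is the opposite of ord(), and the keys of a dictionary form a set.

```python
list_articles = ["article 1: blah python",
                 "article 2: blah Trump",
                 "article 3: blah Trump",
                 "article 4: blah Trump"]

# True counts as 1 and False as 0, so adding booleans counts the matches
n_trump = 0
for article in list_articles:
    n_trump = n_trump + ("Trump" in article)
print(n_trump)

# chr() converts the number back into the character
letter = chr(98)
print(letter)

# The keys of a dictionary behave like a set
words_as_dict = {"d": 1, "e": 3}
words_as_set = {"d", "e"}
same_keys = set(words_as_dict.keys()) == words_as_set
print(same_keys)
```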
