20171025 - My data exploration lab notes

Code Like a Pythonista

Iterating over list/set/dict


In [1]:
# Iterating over set
animals = {'cat', 'dog', 'fish', 'monkey'}
for animal in animals:
    print('{}'.format(animal))


cat
monkey
dog
fish

In [2]:
# Iterating over set (and generating an index)
animals = {'cat', 'dog', 'fish', 'monkey'}
for idx, animal in enumerate(animals):
    print('{}: {}'.format(idx,animal))


0: cat
1: monkey
2: dog
3: fish
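The heading also mentions lists and dicts, while the cells above only cover sets. Below is a minimal sketch of the other two cases. The `legs` dict is invented for illustration. Note that a set has no guaranteed iteration order, so the set output above may differ between runs; sorting the set first gives a reproducible order.

```python
# Iterating over a list: order is preserved, unlike a set
animals = ['cat', 'dog', 'fish']
for idx, animal in enumerate(animals):
    print('{}: {}'.format(idx, animal))

# Iterating over a dict: items() yields (key, value) pairs
# (the 'legs' data is a made-up example)
legs = {'cat': 4, 'dog': 4, 'fish': 0}
for animal, count in legs.items():
    print('{} has {} legs'.format(animal, count))

# Sorting a set gives a reproducible iteration order
animal_set = {'cat', 'dog', 'fish', 'monkey'}
ordered = sorted(animal_set)
print(ordered)
```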

List/Set/Dict Comprehensions

Python supports list comprehensions, which construct lists in a natural way that resembles mathematical set-builder notation.

[ output_expression for item in iterable if condition ]


In [3]:
#List comprehension
list1 = [x**2 for x in range(10)]
print(list1)

list2 = [x for x in list1 if x % 2 == 0]
print(list2)

#Set comprehension
nums = {x for x in range(10)}
print(nums)  # Prints "{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}"

#Dict comprehension
mcase = {'a':10, 'b': 34, 'A': 7, 'Z':3}
mcase_frequency = { k.lower() : mcase.get(k.lower(), 0) + mcase.get(k.upper(), 0) for k in mcase.keys() }
print(mcase_frequency)

#Set comprehension from list
names = [ 'Bob', 'JOHN', 'alice', 'bob', 'ALICE', 'J', 'Bob' ]
names_set = { name[0].upper() + name[1:].lower() for name in names if len(name) > 1 }
print(names_set)

#Nested list comprehension
matrix = [ [ 1 if item_idx == row_idx else 0 for item_idx in range(0, 3) ] for row_idx in range(0, 3) ]
print(matrix)


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
[0, 4, 16, 36, 64]
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
{'a': 17, 'b': 34, 'z': 3}
{'Alice', 'John', 'Bob'}
[[1, 0, 0], [0, 1, 0], [0, 0, 1]]

The power of comprehension


In [4]:
numbers = range(20)

numbers_doubled_odds = []
for n in numbers:
    if n%2 == 1:
        numbers_doubled_odds.append(n*2)
        
print(numbers_doubled_odds)

#vs

numbers_doubled_odds = [n*2 for n in numbers if n%2==1]
print(numbers_doubled_odds)


[2, 6, 10, 14, 18, 22, 26, 30, 34, 38]
[2, 6, 10, 14, 18, 22, 26, 30, 34, 38]

In [5]:
# Calculating prime numbers
noprimes = [j for i in range(2, 8) for j in range(i*2, 100, i)]
primes = [x for x in range(2, 100) if x not in noprimes]
print(primes)


[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]
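In the cell above, `x not in noprimes` scans a list that contains many duplicates. A possible variant, sketched here, collects the composites in a set so that each membership test is fast. The `limit` and `max_factor` names are introduced only for this example; the upper bound of factors is taken as the square root of the limit, which covers the `range(2, 8)` used above.

```python
# Same idea as the cell above, with the composites stored in a set
limit = 100
max_factor = int(limit ** 0.5) + 1  # factors above sqrt(limit) add nothing new

# Set comprehension: duplicates collapse automatically
noprimes = {j for i in range(2, max_factor) for j in range(i * 2, limit, i)}

# Membership tests on a set do not need to scan every element
primes = [x for x in range(2, limit) if x not in noprimes]
print(primes)
```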

Basic Numpy data structures


In [6]:
import numpy as np

array1D = np.array([1,2,3,4,5,6,7,8, 9, 10])

#Standard print
print('Data in arr1D:\n', array1D)

#The value of the cell's last expression is displayed as Out[n]
array1D


Data in arr1D:
 [ 1  2  3  4  5  6  7  8  9 10]
Out[6]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [7]:
array2D = np.array([[1,2,3,4],[5,6,7,8]])

print('Data in arr2D:\n', array2D)

array2D


Data in arr2D:
 [[1 2 3 4]
 [5 6 7 8]]
Out[7]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [8]:
# Slicing uses the same start:stop syntax as Python lists, with one slice per dimension

array2D = np.array([[1,2,3,4],[5,6,7,8]])
print(array2D)

mini_array2D = array2D[:2, 1:3]
print(mini_array2D)


[[1 2 3 4]
 [5 6 7 8]]
[[2 3]
 [6 7]]
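One difference from list slicing is worth knowing: a basic NumPy slice is a view on the original array, not a copy. The sketch below shows the effect; the names `view` and `independent` are chosen only for this example.

```python
import numpy as np

array2D = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# A basic slice is a view: it shares memory with the original array
view = array2D[:2, 1:3]
view[0, 0] = 99
print(array2D[0, 1])  # the original array changed as well

# .copy() gives an independent array
independent = array2D[:2, 1:3].copy()
independent[0, 0] = -1
print(array2D[0, 1])  # unchanged by the edit to the copy
```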

Creating arrays


In [9]:
array2D = np.zeros((2,4))
print(array2D)

array2D = np.ones((2,4))
print(array2D)

array2D = np.full((2,4),0.8)
print(array2D)

array2D = np.random.random((2,4))
print(array2D)

print(type(array2D))
print(array2D.shape)


[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]
[[ 0.8  0.8  0.8  0.8]
 [ 0.8  0.8  0.8  0.8]]
[[ 0.32501545  0.64297193  0.04874627  0.06266212]
 [ 0.11432724  0.95436057  0.62908533  0.3934264 ]]
<class 'numpy.ndarray'>
(2, 4)

Changing the shape of arrays


In [10]:
array1D = np.arange(12)
print(array1D,'\n')

array2D = array1D.reshape(2,6) 
print(array2D,'\n')

array2D = array1D.reshape(6,2) 
print(array2D,'\n')

array3D = array1D.reshape(2,2,3) 
print(array3D,'\n')

array3D = array1D.reshape(3,2,2) 
print(array3D,'\n')


[ 0  1  2  3  4  5  6  7  8  9 10 11] 

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]] 

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]] 

[[[ 0  1  2]
  [ 3  4  5]]

 [[ 6  7  8]
  [ 9 10 11]]] 

[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]] 
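Two related details, sketched under the same setup as the cell above: one dimension can be given as -1 so that NumPy infers it, and a shape whose sizes do not multiply to the number of elements raises a `ValueError`.

```python
import numpy as np

array1D = np.arange(12)

# -1 lets NumPy infer that dimension from the total element count
array2D = array1D.reshape(3, -1)
print(array2D.shape)

# 12 elements cannot be split into 5 equal rows
try:
    array1D.reshape(5, -1)
except ValueError as err:
    print('Cannot reshape:', err)

# ravel() flattens back to one dimension
flat = array2D.ravel()
print(flat)
```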

Statistics


In [11]:
print('\n Numpy 2-dim array')
tab10n5 = np.random.randn(10,5)

print(tab10n5)
print('\n Standard deviation of array')
print(np.std(tab10n5))


 Numpy 2-dim array
[[-1.91676919 -0.82903614  0.09831266 -0.26945091 -0.30614273]
 [-0.64949701  0.15896014  0.62038264 -0.50294463 -0.28051979]
 [-0.92855886 -0.7160973   1.77312128 -0.55484851  0.39640845]
 [ 0.53648903 -1.94113288  0.69364212  0.70478747  0.55072403]
 [-0.4904765   0.04574453  1.5269125  -1.16301585 -0.16958285]
 [-1.67551136 -1.19950876  1.48486584 -0.21099859  0.2339799 ]
 [-1.32631813 -0.71617488 -0.61762391 -1.06785925  1.5826729 ]
 [-0.08690198  0.53616966 -0.62694456  0.81235137 -1.43032088]
 [-0.41919501  0.25580246 -1.56213368  2.12270945 -1.49628911]
 [ 1.28077043 -1.0620417   0.57734305 -0.02029614  0.5593769 ]]

 Standard deviation of array
0.978572096479

In [12]:
#TODO Statistics functions
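
One possible starting point for this cell, not the only solution: a sketch of common NumPy statistics functions on a small fixed array (chosen here so the results can be checked by hand). The `axis` argument selects whether the statistic is computed over the whole array, per column (`axis=0`) or per row (`axis=1`).

```python
import numpy as np

# Small fixed array so the results are easy to verify by hand
tab = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

print(np.mean(tab))              # mean of all elements
print(np.mean(tab, axis=0))      # one mean per column
print(np.mean(tab, axis=1))      # one mean per row
print(np.median(tab))            # median of all elements
print(np.min(tab), np.max(tab))  # smallest and largest element
print(np.std(tab, axis=0))       # population standard deviation per column
```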

Stacking together different arrays

Take a quick look at the tutorial and fill in the next cell.


In [13]:
#TODO stacking arrays
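
A minimal sketch of what this cell could contain, assuming two small example arrays `a` and `b` (invented for illustration). `np.vstack` stacks row-wise, `np.hstack` stacks column-wise, and `np.concatenate` does either, depending on `axis`.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Stack vertically: rows of b go below rows of a
stacked_rows = np.vstack((a, b))
print(stacked_rows.shape)

# Stack horizontally: columns of b go to the right of a
stacked_cols = np.hstack((a, b))
print(stacked_cols.shape)

# concatenate along axis 0 gives the same result as vstack here
same_as_vstack = np.concatenate((a, b), axis=0)
```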

Plotting


In [14]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 3*np.pi, 50)
plt.plot(x, np.sin(x**2))
plt.title('A simple chirp');


Importing data


In [15]:
import numpy as np

data = np.genfromtxt('country-of-birth-london-min.csv', skip_header=1, delimiter=';')
print(data[:5,:])


[[      nan  7363000.  7736000.  7864000.  7984000.  8121000.  8233000.
   8330000.  8441000.]
 [      nan  5187000.  5171000.  5199000.  5243000.  5176000.  5270000.
   5337000.  5359000.]
 [      nan   191000.   209000.   229000.   272000.   277000.   285000.
    279000.   290000.]
 [      nan    47000.   113000.   111000.   127000.   151000.   145000.
    143000.   178000.]
 [      nan    56000.    97000.    85000.    91000.   104000.   123000.
    135000.   134000.]]
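
The first column prints as `nan` because `np.genfromtxt` reads every field as a float by default, and the country names cannot be converted. The sketch below reproduces this on a tiny in-memory sample: the header labels `y1` and `y2` are placeholders, and the numbers are copied from the output above. Passing `usecols` is one way to load only the numeric columns.

```python
import io
import numpy as np

# Tiny sample in the same layout as the CSV; 'y1' and 'y2' are placeholder headers
sample = "Country;y1;y2\nTotal;7363000;7736000\nIndia;191000;209000\n"

# Default dtype is float, so the text column becomes nan
data = np.genfromtxt(io.StringIO(sample), skip_header=1, delimiter=';')
print(data)

# usecols selects only the numeric columns
numbers = np.genfromtxt(io.StringIO(sample), skip_header=1, delimiter=';',
                        usecols=(1, 2))
print(numbers)
```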

In [16]:
raw_data = np.genfromtxt('country-of-birth-london-min.csv', delimiter=';', dtype=None)
names = raw_data[:5,:1].astype(str)
print(names)


[['Country']
 ['Total']
 ['United Kingdom']
 ['India']
 ['Poland']]

In [17]:
#TODO - Calculate summary statistics
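
A possible starting point for this cell, sketched on the first three rows printed by cell 15 with the `nan` column dropped. Judging by the names printed in cell 16, these rows appear to be Total, United Kingdom and India, but that pairing is an assumption. The variable names are chosen only for this example.

```python
import numpy as np

# First three rows of the output of cell 15, without the nan column
values = np.array([
    [7363000., 7736000., 7864000., 7984000., 8121000., 8233000., 8330000., 8441000.],
    [5187000., 5171000., 5199000., 5243000., 5176000., 5270000., 5337000., 5359000.],
    [191000., 209000., 229000., 272000., 277000., 285000., 279000., 290000.],
])

# axis=1 computes one statistic per row
row_means = values.mean(axis=1)
row_min = values.min(axis=1)
row_max = values.max(axis=1)

# Relative change between the first and the last column of each row
change = (values[:, -1] - values[:, 0]) / values[:, 0]

print(row_means)
print(row_min, row_max)
print(change)
```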

In [18]:
#TODO - Visualize summary using matplotlib charts

Miniproject - data exploration

Global Terrorism Database

Instructions:

  1. Download the Global Terrorism Database data set from https://www.kaggle.com/START-UMD/gtd
  2. Take a quick look at the data set. Check what's inside, how the data is structured, and where the data is corrupted (missing values, bad structure, etc.).
  3. Come up with 5 questions to ask of the data. Ask yourself what is really interesting in the data set and what is not so obvious, e.g. trends, patterns, correlations.
  4. Create a Jupyter notebook and use (at least) Python, NumPy, pandas and matplotlib to answer all your questions.
  5. Create a new GitHub repository and put your Jupyter notebook there.
  6. Also create a readme.md file in the root directory of your GitHub repository with all the necessary instructions (what is in the repo, which libraries are needed to run the code, where to find the data set and where to save it; this is necessary because the data set is too big for a GitHub repo).
  7. Provide the necessary documentation and introduction in your notebook using Markdown, covering at least: data source description, data structure, the import process, and the data processing steps.
  8. Put some data visualization in your notebook. Sometimes it's much easier to present an answer with a chart than with numbers.
  9. Check that your notebook runs smoothly: use the 'Restart & Run All' command from the menu. Save it.
  10. Export the notebook as HTML as well, and save the file in the repo.
  11. Do not forget to commit and push all the changes to your repo on GitHub.
  12. Smile :) You did a good job!

FAQ:

  1. Can I take a look at the different solutions provided on Kaggle? Yes, you can. But check more than one solution. Try to understand what the authors are trying to solve and how it could be used in your project. Look for really good examples: easy to understand and not too complicated. Remember that you are creating the notebook as an instruction for someone else, so try not to complicate the process.
  2. Can I take a look at a friend's solution that he/she has just put on GitHub? Yes, you can. But it's the smart way of solving the project. I'm sure that you want to be smarter next semester, so try to create a better solution of your own :)
  3. Jupyter Notebook provides an R kernel, so can I use R instead? Nope, R sucks. Even if you love R, try to solve the project using Python.
