To get up and running, install Anaconda (https://www.continuum.io/downloads) and use the Spyder IDE that ships with it. Anaconda includes SciPy and scikit-learn; further packages can be installed with conda or pip.

Install pip for easy installations

First install pip by downloading get-pip.py from http://pip.readthedocs.io/en/stable/installing/ to a folder of your choice.

In cmd (or the command prompt in Spyder under Tools), change to the directory where get-pip.py is and run this:

python get-pip.py

then install each package from the command prompt like this:

pip install numpy

pip install yahoo_finance

See what version of Python you have installed by opening cmd and typing: python --version
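
The same check can be made from inside Python, which helps when the interpreter used by Spyder may not be the one cmd finds. A minimal sketch using only the standard library (the variable names are illustrative):

```python
# Report the version of the interpreter that is running this code
import sys
import platform

version_string = platform.python_version()   # full version as text
major, minor = sys.version_info[:2]          # numeric parts, handy for comparisons

print('python: {}'.format(version_string))
print('executable: {}'.format(sys.executable))
```

The executable path shows which installation is actually in use.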

Confirm the installation of SciPy, NumPy, Matplotlib, pandas and scikit-learn. Together they form the Python ecosystem used in these notes.


In [6]:
#scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
#numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib 
import matplotlib 
print('matplotlib: {}'.format(matplotlib.__version__)) 
# pandas 
import pandas 
print('pandas: {}'.format(pandas.__version__))
# scikit-learn 
import sklearn 
print('sklearn: {}'.format(sklearn.__version__))


scipy: 0.18.1
numpy: 1.12.1
matplotlib: 2.0.0
pandas: 0.19.2
sklearn: 0.18.1

Python crash course


In [1]:
# Strings 
data = 'hello world' 
print(data[0]) 
print(len(data)) 
print(data)


h
11
hello world

Indexing starts at 0 in Python, so the first letter of 'hello world' is at position 0.
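
A short sketch of what zero-based positions mean in practice, using the same string. Negative positions and slices are standard Python; the variable names are only for illustration.

```python
# Indexing and slicing a string (positions start at 0)
data = 'hello world'
first = data[0]     # position 0 is the first character
last = data[-1]     # negative positions count back from the end
word = data[0:5]    # slice from position 0 up to, but not including, 5
print(first, last, word)
```

This prints: h d hello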


In [2]:
# Numbers 
value = 123.1 
print(value) 
value = 10 
print(value)


123.1
10

In [3]:
# Boolean 
a = True 
b = False 
print(a, b)


True False

In [4]:
# Multiple Assignment 
a, b, c = 1, 2, 3 
print(a, b, c)


1 2 3

In [5]:
# No value 
a = None 
print(a)


None

Flow Control

There are three main types of flow control that you need to learn: If-Then-Else conditions, For-Loops and While-Loops.


In [6]:
# If-Then-Else Conditional
value = 99 
if value == 99: 
    print('That is fast') 
elif value > 200: 
    print('That is too fast') 
else: 
    print('That is safe')


That is fast

Notice the colon (:) at the end of the condition and the meaningful indentation of the code block under the condition.
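
To see why the indentation matters, here is a small sketch (the values and messages are made up) where two lines belong to the if block and one does not:

```python
# Indentation decides which lines belong to a block
value = 250
messages = []
if value > 200:
    messages.append('too fast')      # indented: part of the if block
    messages.append('slow down')     # indented: still part of the if block
messages.append('check complete')    # not indented: runs whatever the condition was
print(messages)
```

With value = 250 all three messages are collected; with a value of 200 or less only 'check complete' would be.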


In [7]:
# For-Loop 
for i in range(10): 
    print(i)


0
1
2
3
4
5
6
7
8
9

In [8]:
# While-Loop 
i = 0 
while i < 10: 
    print(i) 
    i += 1


0
1
2
3
4
5
6
7
8
9

Data Structures

There are three data structures in Python that you will find the most used and useful. They are tuples, lists and dictionaries.


In [9]:
# Tuple
# Tuples are read-only collections of items.
a = (1, 2, 3) 
print(a)


(1, 2, 3)
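
"Read-only" means that a tuple cannot be modified once it is created. A minimal sketch of what happens when you try, and of the usual workaround of building a new tuple:

```python
# Tuples cannot be changed after creation
a = (1, 2, 3)
try:
    a[0] = 10                      # item assignment is not allowed on a tuple
except TypeError as err:
    error_message = str(err)
    print('TypeError: {}'.format(error_message))

# Build a new tuple instead of modifying the old one
b = (10,) + a[1:]
print(b)
```

The original tuple a is left untouched and b holds (10, 2, 3).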

In [10]:
# List
# Lists use the square bracket notation and can be indexed using array notation.
mylist = [1, 2, 3] 
print("Zeroth Value: %d" % mylist[0]) 
mylist.append(4) 
print("List Length: %d" % len(mylist)) 
for value in mylist: 
    print(value)


Zeroth Value: 1
List Length: 4
1
2
3
4

In [11]:
# Dictionary
# Dictionaries are mappings of names to values, like key-value pairs. Note the use of the curly bracket and colon notations when defining the dictionary.
mydict = {'a': 1, 'b': 2, 'c': 3} 
print("A value: %d" % mydict['a']) 
mydict['a'] = 11 
print("A value: %d" % mydict['a']) 
print("Keys: %s" % mydict.keys()) 
print("Values: %s" % mydict.values()) 
for key in mydict.keys(): 
    print(mydict[key])


A value: 1
A value: 11
Keys: dict_keys(['a', 'b', 'c'])
Values: dict_values([11, 2, 3])
11
2
3
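
A few other common dictionary operations, sketched on the same kind of dictionary. The key 'z' is a made-up missing key used only to show the default.

```python
# Looping over key-value pairs and handling missing keys
mydict = {'a': 1, 'b': 2, 'c': 3}

pairs = []
for key, value in mydict.items():    # items() yields (key, value) pairs
    pairs.append((key, value))
    print(key, value)

has_a = 'a' in mydict                # membership test on the keys
missing = mydict.get('z', 0)         # get() returns the default instead of raising KeyError
print(has_a, missing)
```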

Functions

The biggest gotcha with Python is the whitespace. Ensure that you have an empty new line after indented code. The example below defines a new function to calculate the sum of two values and calls the function with two arguments.


In [12]:
# Sum function 
def mysum(x, y): 
    return x + y

# Test sum function 
result = mysum(1, 3) 
print(result)


4

NumPy Crash Course

NumPy provides the foundation data structures and operations for SciPy. These are arrays (ndarrays) that are efficient to define and manipulate.


In [13]:
# define an array 
import numpy 
mylist = [1, 2, 3] 
myarray = numpy.array(mylist) 
print(myarray) 
print(myarray.shape)


[1 2 3]
(3,)

Notice how we easily converted a Python list to a NumPy array.
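
The conversion also works in the other direction, and the element type can be chosen when the array is created. A small sketch (the exact integer width reported by dtype depends on the platform):

```python
# Converting between lists and arrays, and checking the element type
import numpy

mylist = [1, 2, 3]
myarray = numpy.array(mylist)
print(myarray.dtype)                       # an integer type, width depends on platform

floats = numpy.array(mylist, dtype=float)  # ask for floating point values instead
print(floats)

back = myarray.tolist()                    # back to a plain Python list
print(back)
```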


In [14]:
# Access Data
# Array notation and ranges can be used to efficiently access data in a NumPy array.
# access values 
import numpy 
mylist = [[1, 2, 3], [3, 4, 5]] 
myarray = numpy.array(mylist) 
print(myarray) 
print(myarray.shape) 
print("First row: %s" % myarray[0]) 
print("Last row: %s" % myarray[-1]) 
print("Specific row and col: %s" % myarray[0, 2]) 
print("Whole col: %s" % myarray[:, 2])


[[1 2 3]
 [3 4 5]]
(2, 3)
First row: [1 2 3]
Last row: [3 4 5]
Specific row and col: 3
Whole col: [3 5]
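
Ranges can be used on both axes at once. A sketch on the same 2x3 array (the variable names are illustrative):

```python
# Slicing rows and columns with ranges
import numpy

myarray = numpy.array([[1, 2, 3], [3, 4, 5]])
first_two_cols = myarray[:, 0:2]   # all rows, columns 0 and 1
last_col = myarray[:, -1]          # all rows, last column
row_tail = myarray[1, 1:]          # second row, from column 1 to the end
print(first_two_cols)
print(last_col)
print(row_tail)
```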

In [15]:
# Arithmetic
# NumPy arrays can be used directly in arithmetic.
import numpy 
myarray1 = numpy.array([2, 2, 2]) 
myarray2 = numpy.array([3, 3, 3]) 
print("Addition: %s" % (myarray1 + myarray2)) 
print("Multiplication: %s" % (myarray1 * myarray2))


Addition: [5 5 5]
Multiplication: [6 6 6]
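
Note that the multiplication above is element by element. A short sketch contrasting it with a scalar operation and a dot product on the same arrays:

```python
# Element-wise arithmetic versus scalar and dot-product operations
import numpy

myarray1 = numpy.array([2, 2, 2])
myarray2 = numpy.array([3, 3, 3])

scaled = myarray1 * 10                 # the scalar is applied to every element
dot = numpy.dot(myarray1, myarray2)    # sum of element-wise products: 2*3 + 2*3 + 2*3
total = myarray2.sum()                 # sum of all elements
print(scaled, dot, total)
```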

Matplotlib Crash Course

Matplotlib can be used for creating plots and charts. The library is generally used as follows:

1. Call a plotting function with some data (e.g. .plot()).
2. Call many functions to set up the properties of the plot (e.g. labels and colors).
3. Make the plot visible (e.g. .show()).
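
The steps above can be sketched in a form that also runs without a display. This is an assumption-laden variant: it selects the non-interactive Agg backend and writes the figure to memory with savefig() as a stand-in for .show(); the title text is made up.

```python
# Plot, set properties, then render the figure to memory instead of a window
import io
import matplotlib
matplotlib.use('Agg')                 # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy

myarray = numpy.array([1, 2, 3])
plt.plot(myarray)                     # step 1: plot the data
plt.xlabel('some x axis')             # step 2: set properties of the plot
plt.ylabel('some y axis')
plt.title('basic line plot')
buffer = io.BytesIO()
plt.savefig(buffer, format='png')     # step 3: render (stand-in for plt.show())
plt.close()

png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```

In an interactive session such as Spyder, plt.show() in place of the savefig() lines displays the plot directly.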


In [17]:
# Line Plot
# example below creates a simple line plot from one dimensional data
# basic line plot 
import matplotlib.pyplot as plt 
import numpy 
myarray = numpy.array([1, 2, 3]) 
plt.plot(myarray) 
plt.xlabel('some x axis') 
plt.ylabel('some y axis') 
plt.show()



In [18]:
# Scatter Plot
# a simple example of creating a scatter plot from two dimensional data
# basic scatter plot 
import matplotlib.pyplot as plt 
import numpy 
x = numpy.array([1, 2, 3]) 
y = numpy.array([2, 4, 6]) 
plt.scatter(x,y) 
plt.xlabel('some x axis') 
plt.ylabel('some y axis') 
plt.show()


Pandas Crash Course

Pandas provides data structures and functionality to quickly manipulate and analyze data. The key to understanding Pandas for machine learning is understanding the Series and DataFrame data structures. Pandas is a very powerful tool for slicing and dicing your data.


In [19]:
# Series
# A Series is a one-dimensional array of data whose rows are labeled by an index (often, but not necessarily, a time axis).
import numpy 
import pandas 
myarray = numpy.array([1, 2, 3]) 
rownames = ['a', 'b', 'c'] 
myseries = pandas.Series(myarray, index=rownames)
print(myseries)


a    1
b    2
c    3
dtype: int32

In [20]:
# You can access the data in a Series by position, like a NumPy array, and by label, like a dictionary, for example
print(myseries.iloc[0])  # by position; plain myseries[0] is deprecated in recent pandas
print(myseries['a'])


1
1
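
Position-based and label-based access can also be made explicit with the .iloc and .loc accessors, which avoids ambiguity when the labels are themselves integers. A small sketch on the same Series:

```python
# Explicit position-based and label-based access on a Series
import numpy
import pandas

myseries = pandas.Series(numpy.array([1, 2, 3]), index=['a', 'b', 'c'])

by_position = myseries.iloc[0]    # first element, whatever its label
by_label = myseries.loc['a']      # element whose label is 'a'
subset = myseries.loc['a':'b']    # label slices include the end label
print(by_position, by_label)
print(subset)
```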

In [21]:
# DataFrame
# A DataFrame is a two-dimensional array where the rows and the columns can be labeled.
import numpy 
import pandas 
myarray = numpy.array([[1, 2, 3], [4, 5, 6]]) 
rownames = ['a', 'b'] 
colnames = ['one', 'two', 'three'] 
mydataframe = pandas.DataFrame(myarray, index=rownames, columns=colnames) 
print(mydataframe)


   one  two  three
a    1    2      3
b    4    5      6

In [22]:
# Data can be indexed using column names.
print("method 1:") 
print("one column:\n%s" % mydataframe['one']) 
print("method 2:") 
print("one column:\n%s" % mydataframe.one)


method 1:
one column:
a    1
b    4
Name: one, dtype: int32
method 2:
one column:
a    1
b    4
Name: one, dtype: int32
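
Rows and single cells can be selected by label as well, using .loc. A sketch on the same DataFrame (the variable names are illustrative):

```python
# Selecting rows, cells and several columns from a DataFrame
import numpy
import pandas

myarray = numpy.array([[1, 2, 3], [4, 5, 6]])
mydataframe = pandas.DataFrame(myarray, index=['a', 'b'],
                               columns=['one', 'two', 'three'])

row_a = mydataframe.loc['a']             # one row, returned as a Series
cell = mydataframe.loc['b', 'two']       # one value, by row and column label
two_cols = mydataframe[['one', 'three']] # a list of names selects several columns
print(row_a)
print(cell)
print(two_cols)
```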

Load CSV Files with Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function. This function is very flexible and returns a pandas.DataFrame. See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
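
As a small illustration of that flexibility, read_csv() accepts any file-like object, so it can be tried without a file on disk. The sketch below is an assumption for demonstration only: csv_text is a made-up in-memory sample with three columns, not the real dataset.

```python
# Reading CSV text with and without a header row
from io import StringIO
from pandas import read_csv

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = "6,148,1\n1,85,0\n8,183,1\n"
names = ['preg', 'plas', 'class']

# No header row in the data, so names= supplies the column labels
sample = read_csv(StringIO(csv_text), names=names)
print(sample.shape)

# With a header row, the labels are read from the first line
with_header = read_csv(StringIO("preg,plas,class\n" + csv_text))
print(list(with_header.columns))
```

Both calls produce a DataFrame with 3 rows and 3 columns and the same column names.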


In [ ]:
# Load CSV using Pandas
# https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
from pandas import read_csv
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.shape)

In [26]:
# Load CSV using Pandas from URL
from pandas import read_csv
url = 'https://goo.gl/vhm1eU'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
print(data.shape)


(768, 9)

In [27]:
# Look at the raw data
# The first column is the row index
peek = data.head(20)
print(peek)


    preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118    84    47   230  45.8  0.551   31      1
17     7   107    74     0     0  29.6  0.254   31      1
18     1   103    30    38    83  43.3  0.183   33      0
19     1   115    70    30    96  34.6  0.529   32      1

In [30]:
# Dimensions of your data
# You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.
shape = data.shape
print(shape)


(768, 9)

The shape shows 768 rows and 9 columns.


In [31]:
# Data Type For Each Attribute
#The type of each attribute is important. 
#Strings may need to be converted to floating point values or 
#integers to represent categorical or ordinal values. You can get an idea 
#of the types of attributes by peeking at the raw data, as above. 
#You can also list the data types used by the DataFrame to 
#characterize each attribute using the dtypes property.
types = data.dtypes
print(types)


preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
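
When an attribute does need converting, astype() is one way to do it. A minimal sketch on a made-up DataFrame (not the diabetes data), where age arrives as strings and class is treated as categorical:

```python
# Converting attribute types with astype()
import pandas

# Hypothetical small frame; the values are illustrative only
df = pandas.DataFrame({'age': ['50', '31', '32'], 'class': [1, 0, 1]})
print(df.dtypes)                                # age holds strings at this point

df['age'] = df['age'].astype(int)               # strings to integers
df['class'] = df['class'].astype('category')    # integers to a categorical type
print(df.dtypes)
```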

Stopped on page 33.