In [1]:
# A first function. Find the length of a list.
a_list = [1, 2, 3]
len(a_list)
Out[1]:
In [1]:
len({"a": [1, 2, 3], "b": 4})
Out[1]:
len()
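Takes a sized object (such as a list, tuple, string, or dict) as input and returns the number of items it contains as an integer. For a dict, that is the number of keys.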
range()
Takes an integer n as input and returns the successive integers starting at 0 and stopping just before n, so range(3) covers 0, 1, and 2.
In [2]:
range(3)
Out[2]:
Notice that in Python 3, range() returns a range object rather than a list. A range object stores only its start, stop, and step values and produces each integer on demand, which saves memory compared with an explicit list. It is not converted to a list automatically; to show the contents as a list, we can use a type cast, as with the tuple above.
Sometimes, in older Python docs, you will see xrange. In Python 2, xrange returned the lazy, range-like object and range returned an actual list. Python 3 removed xrange, and its range behaves like the old xrange. Beware of this!
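A short sketch of the difference, using only built-ins. The variable names r and big are illustrative and not part of the original notebook.

```python
# A range object is a lazy sequence: it supports len(), indexing, and
# membership tests without ever building the full list in memory.
r = range(3)
print(type(r))               # <class 'range'>
print(list(r))               # [0, 1, 2]
print(len(r), r[1], 2 in r)  # 3 1 True

# Even a very large range is cheap to create, because only
# start, stop, and step are stored.
big = range(10**9)
print(len(big))              # 1000000000
print(big[-1])               # 999999999
```

Calling list(big) here would try to allocate a billion integers, which is exactly the cost the range object avoids.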
In [2]:
# Experiment with the builtin function all
all([1, "first", 3.4])
Out[2]:
In [6]:
any([False, False])
Out[6]:
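The two cells above can be read as follows: all() returns True when every item in the iterable is truthy, and any() returns True when at least one item is truthy. A small sketch with illustrative values:

```python
values = [1, "first", 3.4]

print(all(values))           # True: every item is truthy
print(all([1, "", 3.4]))     # False: the empty string is falsy
print(any([False, False]))   # False: no item is truthy
print(any([False, 0, "x"]))  # True: "x" is truthy

# Edge case: on an empty iterable, all() is True and any() is False.
print(all([]), any([]))      # True False
```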
In [ ]:
list(range(3))
In [7]:
fd = open("t.txt", "w")
In [9]:
fd.write("a line")
Out[9]:
In [10]:
fd.close()
In [11]:
!ls -l
In [8]:
help(fd)
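The open / write / close sequence above can also be written with a with block, which closes the file automatically even if an error occurs partway through. This is a minimal sketch, not the notebook's own method: it writes into a temporary directory instead of t.txt so that nothing in the working directory is overwritten, and the names path, n_written, and contents are illustrative.

```python
import os
import tempfile

# Hypothetical location for the demo file (a fresh temporary directory).
path = os.path.join(tempfile.mkdtemp(), "t.txt")

with open(path, "w") as fd:
    # write() returns the number of characters written.
    n_written = fd.write("a line")

print(n_written)   # 6
print(fd.closed)   # True: the with block closed the file

# Read the file back to confirm what was written.
with open(path) as fd:
    contents = fd.read()
print(contents)    # a line
```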
In [14]:
data = [1, 2, 3, 4]
In [15]:
import numpy as np
np.mean(data)
Out[15]:
In [16]:
np.median(data)
Out[16]:
In [17]:
arr = np.array(data)
arr
Out[17]:
In [18]:
np.reshape(arr, (2,2))
Out[18]:
In [7]:
np.std(data)
Out[7]:
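One detail worth knowing about np.std: by default it computes the population standard deviation (dividing by N). Passing ddof=1 gives the sample standard deviation (dividing by N - 1), which is what many statistics packages report by default. A minimal sketch on the same data; the variable names are illustrative.

```python
import numpy as np

data = [1, 2, 3, 4]

# Default ddof=0: divide the sum of squared deviations by N.
population_std = np.std(data)

# ddof=1: divide by N - 1 instead (sample standard deviation).
sample_std = np.std(data, ddof=1)

print(population_std)  # about 1.118
print(sample_std)      # about 1.291
```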
How do you find out what's in a package?
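One common way, sketched here with the standard-library math module: dir() lists the names a module defines, and each function carries a docstring that help() displays. The variable public_names is illustrative.

```python
import math

# dir() returns every name defined in the module; filter out the
# private ones that start with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]

print(public_names[:5])
print("sqrt" in public_names)  # True

# The docstring is the text that help(math.sqrt) would show.
print(math.sqrt.__doc__)
```

In a notebook, typing math. and pressing Tab, or running math.sqrt? in a cell, gives the same information interactively.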
In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:
numpy: Numerical Python
Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.
scipy: Scientific Python
Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.
pandas: Labeled Data Manipulation in Python
Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.
matplotlib: Visualization in Python
Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).
In [19]:
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC