OverviewNotes_PYTDS


PYT-DS SAISOFT

DATA SCIENCE WITH PYTHON

Where Have We Been, What Have We Seen?

My focus in this course is two-track:

  • develop high level intuitions about statistical and machine learning concepts

  • practice with nuts-and-bolts tools of the trade, namely pandas, matplotlib, numpy, other visualization tools (seaborn, bokeh...), and geospatial extensions (geopandas, basemap).

However, these two tracks are not strictly distinct, since navigating the extensive APIs associated with nuts-and-bolts tools requires developing high-level intuitions. The tracks are complementary and require each other.

HIGH LEVEL INTUITIONS

What are some examples of high level intuitions?

I talked at some length about the long-raging debates between two schools of thought in statistics: frequentist and Bayesian. Some of these debates have been concealed from us. The successes of Bayesian thinking, also known as subjectivist, tend to involve early electronic computers and prototypical examples of machine learning, which emerged in the UK and US, especially during WW2, and were highly classified.

Here in 2018, we're getting more of a picture of what went on at Bletchley Park. Neal Stephenson's Cryptonomicon, a work of historical science fiction, helped break the ice around sharing these stories. I learned a lot about cryptography simply from reading about the history of RSA.

Frequentists focus on sampling sufficiently to make reliable estimates regarding a larger population, deemed approachable in the asymptote but with diminishing returns. Why sample a million people if choosing the right few hundred gives the same predictions? Find out what sampling techniques give the most bang for the buck and then consider yourself ready to predict what will happen on the larger scale. The focus is on finding correlating patterns, whether or not causation might be implied.
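Here is a minimal sketch of that diminishing-returns idea. The population is simulated, and the 30% "yes" rate and the sample sizes are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so reruns match

# Hypothetical population: one million people, roughly 30% answering "yes".
population = rng.random(1_000_000) < 0.30
true_rate = population.mean()

# Estimate the rate from random samples of increasing size.
estimates = {}
for n in (100, 400, 1_600, 6_400):
    sample = rng.choice(population, size=n, replace=False)
    estimates[n] = sample.mean()
    # The standard error shrinks like 1 / sqrt(n):
    # four times the sample buys only twice the precision.
    std_err = np.sqrt(estimates[n] * (1 - estimates[n]) / n)
    print(f"n={n:>5}  estimate={estimates[n]:.3f}  std_err={std_err:.3f}")

print(f"population rate = {true_rate:.3f}")
```

Each fourfold increase in sample size roughly halves the standard error, which is one way to put a number on "bang for the buck".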

Infrequent!

Alan Turing, famously the subject of the much fictionalized The Imitation Game, was tasked with cracking military grade encryption and enlisted the aid of machines to brute force through more possible permutations in a shorter time, without mistakes, than human computers could match.

However this was not brute force in an entirely mindless sense. They had several clues, more as time went on. One begins with prior or a priori knowledge (axioms if you will, subject to revision), and during the search process itself (at runtime) the process might spontaneously choose the more promising branches for exploration.

Chess and Go players may do something similar, as it's naturally impractical to explore the entire tree of possible moves many moves ahead. A way of culling or limiting one's search, today sometimes called "back propagation" or "response to feedback", makes the term "brute force" too coarse. And yet the raw horsepower that machines bring to computation cannot be denied either.

Turing's machines were sensitive to feedback, in the sense of the children's game where we say "warmer" or "colder" depending on whether the guided search is getting closer to or further from a target. Today we hear a lot about "gradient descent" which, similar to Newton's Method, is a way of finding a local or perhaps global minimum, according to what the rates of change say. "This may be as good as it gets" in terms of face recognition. But then you may always feed in more faces.
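To make "warmer" and "colder" concrete, here is a bare-bones sketch of gradient descent on a toy cost function. The function names, the learning rate and the step count are my own illustrative choices, not anything from a library.

```python
def gradient_descent(grad, x0, rate=0.1, steps=100):
    """Follow the slope downhill from x0 (rate and steps are illustrative defaults)."""
    x = x0
    for _ in range(steps):
        x = x - rate * grad(x)   # step against the slope: "warmer" means lower cost
    return x


# Toy cost f(x) = (x - 3)**2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
def slope(x):
    return 2 * (x - 3)


minimum = gradient_descent(slope, x0=-10.0)
print(round(minimum, 4))
```

With a bowl-shaped cost like this one the search settles on the single global minimum. A bumpier cost function could leave it stuck in a local dip, which is the "as good as it gets" caveat.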

Bayesians see themselves working on a belief system or "model" of a system, fine tuning it to match incoming data. To the extent expectations are met, so is the model considered reliable.
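A small sketch of such fine-tuning, assuming the simplest textbook case: a belief about a coin's bias, kept as a Beta distribution and updated one flip at a time. The flips are invented data.

```python
# Belief about a coin's chance of heads, stored as a Beta(a, b) distribution.
# Beta(1, 1) is a flat prior: every bias is considered equally plausible.
a, b = 1, 1

# Hypothetical incoming data: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 1, 1, 0, 1]

history = []
for flip in flips:
    a += flip        # each head adds one to a
    b += 1 - flip    # each tail adds one to b
    history.append(a / (a + b))   # posterior mean after this flip

posterior_mean = a / (a + b)
print(history)
print(posterior_mean)
```

The belief starts at an even 0.5 and drifts toward the observed frequency as evidence accumulates, without ever discarding the prior entirely.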

Icing on the cake is when a model predicts something no one expected. This proves, or at least adds credibility to the belief, that one's model might be ahead of the curve, in terms of its predictive powers.

The power of machines to respond to "warmer" and "colder" is the basis for many recent advances in data science, as is the faster speed of GPUs.

Our crystal balls have become better at recognizing things, for example faces in pictures, written characters, spoken words, word patterns in Tweets or scientific articles.

One might say these crystal balls are "predictive" in the sense that we anticipate they'll guess correctly when confronted with new sample data.

However, in English, we don't ordinarily consider "recognizing a dog to be a dog" as anything like divination, as in "seeing the future". From a machine's point of view, "getting it right" is a form of prediction.

PRACTICAL TOOLS

The two tools we started with were the numpy ndarray and the pandas Series type. A Series is really a 2D (or dim 2) addressing scheme, but with only a single vertical vector or column. A DataFrame lines these vertical vectors together, left to right, giving us spreadsheet-like structures already familiar from centuries of working with tabular arrangements of rows and columns, defined to form "cells".
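As a quick sketch of that relationship, here are two Series lined up into a DataFrame. The animal labels and measurements are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same index labels (labels and values are invented).
labels = ["emu", "cat", "dog"]
height = pd.Series([1.2, 0.4, 0.9], index=labels, name="height_m")
legs = pd.Series([2, 4, 4], index=labels, name="legs")

# A Series pairs each label with a value in its single column.
print(height["cat"])

# A DataFrame lines the Series up left to right, one column per Series.
animals = pd.concat([height, legs], axis=1)
print(animals)

# A cell is addressed by row label and column label.
print(animals.loc["cat", "legs"])
```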

Let's create a kind of chess board with labels reminiscent of standard chess notation. The cells or values will be Unicode, meaning we might use the appropriate chess piece glyphs.


In [1]:
import numpy as np
import pandas as pd

In [2]:
squares = np.array(list(64 * " "), dtype=str).reshape(8, 8)  # np.str was removed in NumPy 1.24; plain str works

In [3]:
squares


Out[3]:
array([[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']], dtype='<U1')

In [4]:
print('♔♕♖')


♔♕♖

In [5]:
squares[0][0] = '♖'
squares[7][0] = '♖'
squares[0][7] = '♖'
squares[7][7] = '♖'

In [6]:
squares


Out[6]:
array([['♖', ' ', ' ', ' ', ' ', ' ', ' ', '♖'],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '],
       ['♖', ' ', ' ', ' ', ' ', ' ', ' ', '♖']], dtype='<U1')

In [7]:
chessboard = pd.DataFrame(squares, index=range(1,9), 
                          columns = ['wR','wKn', 'wB', 'K',
                                     'Q','eB', 'eKn', 'eR' ] )

In [8]:
chessboard


Out[8]:
  wR wKn wB K Q eB eKn eR
1  ♖                    ♖
2
3
4
5
6
7
8  ♖                    ♖

I emphasize this is not really chess. Chess masters do not think in terms of west and east, nor does our DataFrame show the checkerboard pattern of light and dark squares.

Our original ndarray (8 x 8) is now "framed" much as a canvas is framed by a possibly elaborately carved metadata apparatus. Indexes may be hierarchical, such as with years divided into months down the side, and animals into phyla across the top.
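A hedged sketch of what such a hierarchical frame might look like, using pandas MultiIndex objects on both axes. All the labels and counts here are invented.

```python
import numpy as np
import pandas as pd

# Rows: years divided into months. Columns: phyla divided into animals.
# Every label and number below is made up for illustration.
rows = pd.MultiIndex.from_product([[2017, 2018], ["Apr", "May"]],
                                  names=["year", "month"])
cols = pd.MultiIndex.from_tuples([("Arthropoda", "bee"),
                                  ("Chordata", "cat"),
                                  ("Chordata", "dog")],
                                 names=["phylum", "animal"])

sightings = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)
print(sightings)

year_2018 = sightings.loc[2018]       # all months of one year
chordates = sightings["Chordata"]     # all animals of one phylum
cell = sightings.loc[(2018, "Apr"), ("Chordata", "dog")]   # one cell, fully addressed
print(cell)
```

Selecting by an outer label peels off one level of the hierarchy, while a full tuple on each axis drills down to a single cell.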

Suppose we want to stretch our chessboard values back into a string-like spaghetti strand, not even 2D? The string-like state is a base serialization of what is perhaps meant to be much higher dimensional data.
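One way to sketch that spaghetti strand is to ravel the board and join the cells. The board is rebuilt here so the snippet runs on its own, and the variable names are mine.

```python
import numpy as np

# Rebuild the 8 x 8 board from the cells above: blanks, with a rook in each corner.
board = np.full((8, 8), " ", dtype="<U1")
board[0, 0] = board[0, 7] = board[7, 0] = board[7, 7] = "♖"

# ravel() walks the board row by row, left to right.
strand = "".join(board.ravel())
print(len(strand))
print(repr(strand[:8]))

# The strand carries everything needed to rebuild the board.
restored = np.array(list(strand), dtype="<U1").reshape(8, 8)
print((restored == board).all())
```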

Applying the rule that we may reshape between any two shapes whose dimensions multiply to the same total, I will take my 8 x 8 chessboard and turn it into a binary-tree-like 2 x 2 x 2 x 2 x 2 x 2 data structure (2 to the 6th power is 64, the same as 8 x 8). Finding a Rook might take patience. I leave it for you in the Lab.


In [9]:
string_board = chessboard.values
binary_tree = string_board.reshape(2,2,2,2,2,2)
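A short sketch of the reshape rule, using stand-in integer cells rather than the chessboard itself so as not to spoil the Lab. The use of np.unravel_index is my suggestion, offered as a hint.

```python
import numpy as np

cells = np.arange(64)                    # stand-in values: each cell holds its flat position
deep = cells.reshape(2, 2, 2, 2, 2, 2)

print(np.prod(deep.shape))               # same total as 8 * 8

# A shape whose dimensions do not multiply to 64 is rejected.
try:
    cells.reshape(3, 5, 4)
except ValueError as err:
    print("reshape failed:", err)

# unravel_index translates a flat position into a multi-dimensional address.
address = np.unravel_index(10, deep.shape)
print(address, deep[address])
```

The address is just the flat position written out in binary, one digit per axis.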

LAB:

Find the rooks, or at least one of them. The first one is easy and I give it to you for free:


In [10]:
binary_tree[0][0][0][0][0][0]  # six axes, so six indices


Out[10]:
'♖'

At the time of this writing, pandas is having a tad of an identity crisis, as the Panel type, which echoes the "panel data" behind the name pandas, is being phased out. A stack of DataFrames added more overhead than the law of diminishing returns would support. I'm sure you'll be able to dig up more research about all the decision-making that went on.