Copyright (C) 2017 J. Patrick Hall, jphall@gwu.edu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In [4]:
print('Hello World!') # Python 3
# print 'Hello World!' # Python 2 only; this form is a SyntaxError in Python 3
In [5]:
# A bare expression on the last line of a cell is also echoed to the notebook output
x = 'Hello World!'
x
Out[5]:
Python contains many libraries, often called modules, for different purposes.
conda, readily available through the Anaconda release of Python (https://www.continuum.io/downloads), is often a good solution for installing and managing packages/modules.
Modules are loaded with the import statement.
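As a small illustration of the import statement, the sketch below shows three common forms: a plain import, an aliased import, and importing a single name. The modules math and string are chosen here only because they ship with Python; the variable names are illustrative.

```python
import math                # plain import: names are reached as math.<name>
import string as st        # aliased import, the same pattern as `import pandas as pd`
from math import sqrt      # import one name directly into the current namespace

circle_area = math.pi * 2 ** 2            # area of a circle with radius 2
first_letters = st.ascii_uppercase[:3]    # first three uppercase letters
root = sqrt(16)

print(circle_area, first_letters, root)
```

Aliases such as pd and np are conventions rather than requirements, but following them makes code easier for other readers to recognize.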
In [6]:
# import packages
import string # module with string utilities
import pandas as pd # large module with many utilities for dataframes, here aliased as 'pd'
import numpy as np # large module with many numeric and mathematical utilities, here aliased as 'np'
import matplotlib.pyplot as plt # module for plotting
# "magic" syntax to display matplotlib graphics in a notebook
# magic statements start with '%' and are often used to control notebook behavior
%matplotlib inline
In [7]:
n_rows = 1000
n_vars = 2
In [8]:
# list comprehension
# str() converts to string
# range(arg1, arg2) generates the integers from arg1 up to, but not including, arg2
num_col_names = ['numeric' + str(i+1) for i in range(0, n_vars)]
num_col_names
Out[8]:
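To make the list comprehension above concrete, here is a minimal sketch that builds the same names with an explicit loop and checks that both approaches agree. It reuses n_vars = 2 from the earlier cell; names_loop, names_comp and indices are illustrative names.

```python
n_vars = 2  # same value as in the cell above

# range(0, n_vars) yields 0 and 1: the stop value is excluded
indices = list(range(0, n_vars))

# explicit loop version
names_loop = []
for i in range(0, n_vars):
    names_loop.append('numeric' + str(i + 1))

# equivalent list comprehension
names_comp = ['numeric' + str(i + 1) for i in range(0, n_vars)]

print(indices)
print(names_loop == names_comp)
```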
In [9]:
type(num_col_names) # type() can be used to determine the class of an object in Python
Out[9]:
In [10]:
# anonymous functions
# the lambda statement is used to define simple anonymous functions
# map() is very similar to lapply() in R - it applies a function to the elements of a list
# in Python 3 map() returns an iterator, so list() is used to materialize the names
char_col_names = list(map(lambda j: 'char' + str(j+1), range(0, n_vars)))
char_col_names
Out[10]:
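One detail worth seeing in isolation: in Python 3, map() returns a lazy iterator rather than a list, and an iterator can only be consumed once. The sketch below, with illustrative variable names, shows this behavior and the equivalent list comprehension.

```python
n_vars = 2  # same value as in the earlier cell

lazy_names = map(lambda j: 'char' + str(j + 1), range(0, n_vars))
is_list = isinstance(lazy_names, list)   # a map object is not a list in Python 3

names_from_map = list(lazy_names)        # consuming the iterator produces the names
leftover = list(lazy_names)              # the iterator is now exhausted

# the same result written as a list comprehension
names_from_comp = ['char' + str(j + 1) for j in range(0, n_vars)]

print(is_list, names_from_map, leftover)
```

Wrapping map() in list() right away avoids surprises when the result is reused later, for example as dataframe column names.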
In [11]:
# string.ascii_uppercase is a string constant of uppercase letters
print(string.ascii_uppercase)
# another list comprehension
# slice first seven letters of the string
text_draw = [(letter * 8) for letter in string.ascii_uppercase[:7]]
text_draw
Out[11]:
In [12]:
randoms = np.random.randn(n_rows, n_vars)
randoms[0:5]
Out[12]:
In [13]:
type(randoms)
Out[13]:
In [14]:
num_cols = pd.DataFrame(randoms, columns=num_col_names)
num_cols.head()
Out[14]:
In [15]:
type(num_cols)
Out[15]:
In [16]:
char_cols = pd.DataFrame(np.random.choice(text_draw, (n_rows, n_vars)),
columns=char_col_names)
char_cols.head()
Out[16]:
In [17]:
scratch_df = pd.concat([num_cols, char_cols], axis=1)
scratch_df.head()
Out[17]:
In [18]:
# Pandas allows slicing a dataframe by integer position using iloc[]
# (the older ix[] indexer has been removed from recent versions of Pandas)
# iloc[:, 0] means all rows of the 0th column - or numeric1
scratch_df.iloc[:, 0].plot.hist(title='Histogram of Numeric1')
Out[18]:
In [19]:
scratch_df.plot.scatter(x='numeric1', y='numeric2',
title='Numeric1 vs. Numeric2')
Out[19]:
In [20]:
# one column returns a Pandas series
# a Pandas series is like a single column vector
scratch_df.iloc[:, 0].head()
Out[20]:
In [21]:
type(scratch_df.iloc[:, 0])
Out[21]:
In [22]:
# more than one column makes a dataframe
# iloc enables location by integer position
scratch_df.iloc[:, 0:2].head()
Out[22]:
In [23]:
type(scratch_df.iloc[:, 0:2])
Out[23]:
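The difference between a series and a dataframe depends on how the columns are requested, not only on how many there are. A minimal sketch on a small toy dataframe (the name toy and its values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'numeric1': [0.1, 0.2, 0.3],
                    'numeric2': [1.0, 2.0, 3.0]})

one_col_series = toy.iloc[:, 0]     # a single integer position returns a Series
one_col_frame = toy.iloc[:, [0]]    # a list of positions returns a DataFrame, even with one column
two_cols = toy.iloc[:, 0:2]         # a slice of positions returns a DataFrame

print(type(one_col_series).__name__, type(one_col_frame).__name__, two_cols.shape)
```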
In [24]:
scratch_df['numeric1'].head()
Out[24]:
In [25]:
scratch_df.numeric1.head()
Out[25]:
In [26]:
# loc[] allows for location by column or row label
scratch_df.loc[:, 'numeric1'].head()
Out[26]:
In [27]:
# loc can accept lists as an input
scratch_df.loc[:, ['numeric1', 'numeric2']].head()
Out[27]:
In [28]:
scratch_df[0:3]
Out[28]:
In [29]:
# select rows by integer position
scratch_df.iloc[0:5, :]
Out[29]:
In [30]:
# select by row label
# loc slices are label-based and include both endpoints, so labels 0 through 5 are returned
scratch_df.loc[0:5, :]
Out[30]:
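The two cells above look alike but return different numbers of rows, because iloc slices exclude the stop position while loc slices include the stop label. A small sketch with made-up data (nums and lettered are illustrative names):

```python
import pandas as pd

# default integer labels 0..9, as in scratch_df
nums = pd.DataFrame({'value': range(10)})
rows_by_position = len(nums.iloc[0:5, :])   # positions 0-4
rows_by_label = len(nums.loc[0:5, :])       # labels 0-5, endpoint included

# with string labels the distinction is easier to see
lettered = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])
by_position = lettered.iloc[0:2, :]         # first two rows
by_label = lettered.loc['a':'c', :]         # rows 'a', 'b' and 'c'

print(rows_by_position, rows_by_label)
print(list(by_position.index), list(by_label.index))
```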
In [31]:
scratch_df[scratch_df.numeric2 > 0].head()
Out[31]:
In [32]:
scratch_df[scratch_df.char1 == 'AAAAAAAA'].head()
Out[32]:
In [33]:
scratch_df[scratch_df.char1.isin(['AAAAAAAA', 'BBBBBBBB'])].head()
Out[33]:
In [34]:
scratch_df[scratch_df.numeric2 > 0].loc[5:10, 'char2']
Out[34]:
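The row selections above all work by passing a boolean series to the dataframe. As a hedged sketch on toy data, such masks can be stored in variables and combined with & (and), | (or) and ~ (not); the frame and mask names below are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'numeric2': [-1.0, 0.5, 2.0, -0.3, 1.2],
    'char1': ['AAAAAAAA', 'BBBBBBBB', 'CCCCCCCC', 'CCCCCCCC', 'BBBBBBBB'],
})

positive = toy['numeric2'] > 0                          # boolean Series, one value per row
in_ab = toy['char1'].isin(['AAAAAAAA', 'BBBBBBBB'])     # boolean Series

both = toy[positive & in_ab]          # rows satisfying both conditions
either = toy[positive | in_ab]        # rows satisfying at least one condition
neither = toy[~(positive | in_ab)]    # rows satisfying none

print(list(both.index), list(either.index), list(neither.index))
```

Parentheses are needed around each comparison when the conditions are written inline, because & and | bind more tightly than > and ==.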
In [35]:
# must use .copy() or scratch_df2 will be just another name for the same dataframe
scratch_df2 = scratch_df.copy()
# Pandas supports in place overwrites of data
# overwrite last 500 rows of char1 with ZZZZZZZZ
scratch_df2.loc[500:, 'char1'] = 'ZZZZZZZZ'
scratch_df2.tail()
Out[35]:
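To show why .copy() matters, the following sketch contrasts plain assignment, which only creates a second name for the same object, with an independent copy. The dataframe and the replacement strings are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame({'char1': ['AAAAAAAA', 'BBBBBBBB']})

alias = original                # a second name for the same dataframe
independent = original.copy()   # a separate dataframe with its own data

# editing the copy leaves the original untouched
independent.loc[0, 'char1'] = 'ZZZZZZZZ'
after_copy_edit = original.loc[0, 'char1']

# editing through the alias changes the original
alias.loc[1, 'char1'] = 'YYYYYYYY'
after_alias_edit = original.loc[1, 'char1']

print(after_copy_edit, after_alias_edit)
```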
In [36]:
# iat[] allows fast access to a single value by integer row and column position
scratch_df2.iat[0, 0] = 1000
scratch_df2.head()
Out[36]:
In [37]:
scratch_df2.sort_values(by='char1').head()
Out[37]:
In [38]:
scratch_df3 = scratch_df2.sort_values(by=['char1', 'numeric1'],
ascending=[False, True]).copy()
scratch_df3.head()
Out[38]:
In [39]:
scratch_df2.sort_index().head()
Out[39]:
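Sorting by values reorders the rows but keeps each row's original index label, which is why sort_index() can undo the sort. A minimal sketch on a four-row toy dataframe (values chosen only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'char1': ['B', 'A', 'B', 'A'],
                    'numeric1': [2.0, 4.0, 1.0, 3.0]})

# char1 descending, then numeric1 ascending within each char1 value
by_both = toy.sort_values(by=['char1', 'numeric1'], ascending=[False, True])
sorted_labels = list(by_both.index)     # original labels travel with their rows

# sorting on the index restores the original row order
restored = by_both.sort_index()
restored_labels = list(restored.index)

print(sorted_labels, restored_labels)
```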
In [40]:
# create a toy dataframe to join/merge onto scratch_df
scratch_df3 = scratch_df3.drop(['numeric1', 'numeric2'] , axis=1)
scratch_df3.columns = ['char3', 'char4']
scratch_df3.tail()
Out[40]:
In [41]:
# default axis=0 stacks the dataframes row-wise, with an outer join on the column names
# this will create a 2000 row × 6 column dataset because the two dataframes share no column names
scratch_df4 = pd.concat([scratch_df, scratch_df3])
scratch_df4
Out[41]:
In [42]:
# axis=1 places the dataframes side by side
# rows are matched by an outer join on the index labels, so the row order of scratch_df3 does not matter
scratch_df5 = pd.concat([scratch_df, scratch_df3], axis=1)
scratch_df5.head()
Out[42]:
In [43]:
scratch_df5.shape
Out[43]:
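The two concat calls above behave quite differently, which a small example can make visible. In this sketch, left and right are hypothetical three-row dataframes with no column names in common and with index labels in a different order.

```python
import pandas as pd

left = pd.DataFrame({'numeric1': [1.0, 2.0, 3.0]})                   # labels 0, 1, 2
right = pd.DataFrame({'char3': ['x', 'y', 'z']}, index=[2, 0, 1])    # same labels, different order

# axis=0 (default): rows are stacked and the column names are unioned,
# so cells with no source value are filled with NaN
stacked = pd.concat([left, right])

# axis=1: columns are placed side by side and rows are aligned on index labels
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape, side_by_side.shape)
print(side_by_side.loc[0, 'char3'], side_by_side.loc[2, 'char3'])
```

Alignment on labels means the value 'y', stored under label 0 in right, ends up next to the numeric1 value stored under label 0 in left, regardless of row order.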
In [44]:
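# append rows; DataFrame.append() has been removed from recent versions of Pandas, so pd.concat() is used
scratch_df6 = pd.concat([scratch_df, scratch_df])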
scratch_df6.shape
Out[44]:
In [45]:
scratch_df.equals(scratch_df)
Out[45]:
In [46]:
scratch_df.equals(scratch_df.sort_values(by='char1'))
Out[46]:
In [47]:
scratch_df.equals(scratch_df2)
Out[47]:
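The three comparisons above can be reproduced on a toy dataframe to see what equals() is sensitive to. This sketch assumes a small made-up frame; it shows that row order matters and that restoring the order restores equality.

```python
import pandas as pd

toy = pd.DataFrame({'char1': ['B', 'A', 'C'], 'numeric1': [1.0, 2.0, 3.0]})
reordered = toy.sort_values(by='char1')

same_object = toy.equals(toy)                           # identical data and order
same_rows_new_order = toy.equals(reordered)             # same rows, different order
after_restoring = toy.equals(reordered.sort_index())    # original order restored

print(same_object, same_rows_new_order, after_restoring)
```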
In [48]:
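scratch_df.mean(numeric_only=True) # restrict to numeric columns; recent versions of Pandas raise an error on the character columns otherwise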
Out[48]:
In [49]:
scratch_df.mode()
Out[49]:
In [50]:
scratch_df.describe()
Out[50]:
In [51]:
# use summary function size() on groups created by groupby()
counts = scratch_df.groupby('char1').size()
plt.figure()
counts.plot.bar(title='Frequency of char1 values (Histogram of char1)')
Out[51]:
In [52]:
# groupby the values of more than one variable
group_means = scratch_df.groupby(['char1', 'char2']).mean()
group_means
Out[52]:
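As a worked example of the two groupby patterns above, the sketch below uses a five-row toy dataframe so that the counts and means can be checked by hand. The data and the name flat are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'char1': ['A', 'A', 'B', 'B', 'B'],
    'char2': ['X', 'Y', 'X', 'X', 'Y'],
    'numeric1': [1.0, 2.0, 3.0, 5.0, 7.0],
})

# number of rows per value of char1
counts = toy.groupby('char1').size()

# mean of numeric1 within each (char1, char2) combination
group_means = toy.groupby(['char1', 'char2'])['numeric1'].mean()

# reset_index() turns the group keys back into ordinary columns
flat = group_means.reset_index()

print(counts)
print(flat)
```

The ('B', 'X') group contains the values 3.0 and 5.0, so its mean is 4.0.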
In [53]:
# Pandas .T performs a transpose
scratch_df.T.iloc[:, 0:5]
Out[53]:
Often, instead of simply transposing, a data set will need to be reformatted in a melt/stack - column split - cast action described in Hadley Wickham's Tidy Data: https://www.jstatsoft.org/article/view/v059i10
See the stack and unstack methods for Pandas dataframes.
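A minimal sketch of the reshaping idea, assuming a small made-up wide table with an id column: pd.melt() produces the long format, pivot() casts it back to wide, and stack()/unstack() do the analogous move through the index.

```python
import pandas as pd

wide = pd.DataFrame({'id': ['a', 'b'],
                     'numeric1': [1.0, 2.0],
                     'numeric2': [3.0, 4.0]})

# melt: one row per (id, variable) pair
long = pd.melt(wide, id_vars='id', var_name='variable', value_name='value')

# cast back to wide: one column per variable
back_to_wide = long.pivot(index='id', columns='variable', values='value')

# stack moves the column labels into an inner index level; unstack reverses it
stacked = wide.set_index('id').stack()
unstacked = stacked.unstack()

print(long)
print(back_to_wide)
```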
In [54]:
# export to csv
scratch_df.to_csv('scratch.csv')
In [55]:
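# import from csv
# index_col=0 reads the index written by to_csv() back as the index instead of as an extra column
scratch_df7 = pd.read_csv('scratch.csv', index_col=0)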
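One thing to watch in a CSV round trip is the index: to_csv() writes it as an unnamed first column by default. The sketch below uses an in-memory buffer (io.StringIO) instead of a file so it leaves nothing on disk; the toy data is made up.

```python
import io
import pandas as pd

toy = pd.DataFrame({'numeric1': [0.5, 1.5], 'char1': ['AAAAAAAA', 'BBBBBBBB']})

# write to an in-memory text buffer; the index is written as the first column
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# reading it back naively turns the old index into an extra column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores it as the index instead
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# alternatively, skip writing the index at all
clean = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(naive.columns))
print(list(restored.columns), list(clean.columns))
```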