Many statistical Python packages can work with NumPy arrays.
NumPy arrays, however, are not always easy to use.
Pandas is a package that provides a DataFrame interface, similar to the main data structure in R. Since Pandas has become so popular, many packages now accept both pd.DataFrames and NumPy arrays.
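As a minimal sketch of that interop (the data here is made up, not from this notebook's files): a DataFrame is essentially a NumPy array with labelled rows and columns, and converting in either direction is one call.

```python
import numpy as np
import pandas as pd

# A plain NumPy array has rows and columns, but no labels.
arr = np.array([[1, 2], [3, 4]])

# Wrapping it in a DataFrame adds named columns and an index.
df = pd.DataFrame(arr, columns=['a', 'b'])

# Many libraries accept either; converting back is one call.
back = df.to_numpy()
print(df)
```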
In [1]:
import os
from dotenv import load_dotenv, find_dotenv
# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()
# load up the entries as environment variables
load_dotenv(dotenv_path)
Out[1]:
In [2]:
PROJECT_DIR = os.path.dirname(dotenv_path)
RAW_DATA_DIR = PROJECT_DIR + os.environ.get("RAW_DATA_DIR")
INTERIM_DATA_DIR = PROJECT_DIR + os.environ.get("INTERIM_DATA_DIR")
files=os.environ.get("FILES").split()
print("Project directory is : {0}".format(PROJECT_DIR))
print("Raw data directory is : {0}".format(RAW_DATA_DIR))
print("Interim directory is : {0}".format(INTERIM_DATA_DIR))
In [3]:
# The following jupyter notebook magic makes the plots appear in the notebook.
# If you run in batch mode, you have to save your plots as images.
%matplotlib inline
# matplotlib.pyplot is traditionally imported as plt
import matplotlib.pyplot as plt
# Pandas is traditionaly imported as pd.
import pandas as pd
from pylab import rcParams
# some display options to size the figures. feel free to experiment
pd.set_option('display.max_columns', 25)
rcParams['figure.figsize'] = (17, 7)
Reading a CSV file is really easy in Pandas. There are several formats that Pandas can deal with.
| Format Type | Data Description | Reader | Writer |
|---|---|---|---|
| text | CSV | read_csv | to_csv |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Msgpack | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google Big Query | read_gbq | to_gbq |
We will use pd.read_csv().
As you will see, the Jupyter notebook prints a very nice rendition of the resulting DataFrame object.
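Since the survey files themselves are not included here, a self-contained sketch with a toy CSV (the column names mirror the household file; the values are invented): read_csv accepts a file path or any file-like object.

```python
import io
import pandas as pd

# A toy CSV standing in for one of the survey files
# (REGION and the weight column WTFA_HH are assumed names).
csv_text = "REGION,WTFA_HH\n1,1200\n2,900\n1,1100\n"

# read_csv accepts a path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```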
In [4]:
family=pd.read_csv(RAW_DATA_DIR+'/familyxx.csv')
persons=pd.read_csv(RAW_DATA_DIR+'/personsx.csv')
samadult=pd.read_csv(RAW_DATA_DIR+'/samadult.csv')
In [5]:
househld=pd.read_csv(RAW_DATA_DIR+'/househld.csv')
househld
Out[5]:
Pandas uses two main data structures:
- Series
- DataFrames
A Series is a list of elements with an index. The index can be simple (0, 1, 2, 3, ...) or more complex (01-Jan-17, 02-Jan-17, ...).
A DataFrame can be seen as a dictionary of Series.
This means you can access a column by name.
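A minimal sketch of both structures, on made-up values (the column names ACPT_PER and ACPTCHLD come from the household file below):

```python
import pandas as pd

# A Series: values plus an index (default 0, 1, 2, ...).
acpt = pd.Series([3, 2, 4], name='ACPT_PER')

# The index can also be something richer, such as dates.
dated = pd.Series([3, 2], index=['01-Jan-17', '02-Jan-17'])

# A DataFrame behaves like a dict of Series sharing one index,
# so a column selected by name comes back as a Series.
df = pd.DataFrame({'ACPT_PER': [3, 2], 'ACPTCHLD': [1, 0]})
col = df['ACPT_PER']
print(col)
```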
In [6]:
persons_in_household=househld['ACPT_PER']
You can index the result just as you would index a list.
In [7]:
print(persons_in_household[:5])
But the type is a Series,
In [8]:
print(type(persons_in_household))
and that means it has additional attributes, such as a name and summary statistics.
In [9]:
print(persons_in_household.name)
In [10]:
print(persons_in_household.describe())
In [11]:
accepted=househld[['ACPT_PER','ACPTCHLD']]
print(type(accepted))
In [12]:
print(accepted.head())
In [13]:
accepted.describe()
Out[13]:
In [14]:
accepted.plot(kind='box')
Out[14]:
In [15]:
househld['REGION'].plot(kind='hist')
Out[15]:
What we see is that the number of responses is not evenly divided across regions. But the data also includes a weight for each household that allows us to extrapolate to the overall US population.
We will use groupby to add up the weights.
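The groupby-then-sum pattern can be sketched on toy data (same column names as the household file, invented numbers): rows are bucketed by REGION, then the weights within each bucket are summed.

```python
import pandas as pd

# Toy version of the household table: region code plus sampling weight
# (column names mirror the real file; the numbers are made up).
toy = pd.DataFrame({'REGION':  [1, 2, 1, 2, 2],
                    'WTFA_HH': [100, 200, 150, 50, 100]})

# groupby('REGION').sum() adds the weights within each region.
by_region = toy.groupby('REGION').sum()
print(by_region)
```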
In [16]:
by_region=househld[['REGION','WTFA_HH']].groupby('REGION').sum()
by_region
Out[16]:
Add a column with the number of responses per region.
In [17]:
# Counting one column per group gives a Series, which we can name
# before adding it as a new column.
responses=househld.groupby('REGION')['WTFA_HH'].count()
responses.name = "Responses"
by_region['Responses']=responses
by_region
Out[17]:
And we will change the index to a more complex one, based on the documentation of the household file.
In [18]:
by_region.index=['Northeast','Midwest','South','West']
by_region
Out[18]:
In [19]:
print(by_region.to_json())
In [20]:
non_response_code=househld['NON_INTV']
In [21]:
import math
# math.isnan(x) returns True if x is NaN (Not a Number);
# here a NaN non-interview code means the household did respond.
responded=[math.isnan(x) for x in non_response_code]
notresponded=[not math.isnan(x) for x in non_response_code]
resp=househld[responded]
nonresp=househld[notresponded]
print("Total size: {}".format(househld.shape))
print("Responses: {}".format(resp.shape))
print("Non responses: {}".format(nonresp.shape))
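The same split can also be done with pandas' own NaN tests, isna() and notna(), which return boolean Series directly; a sketch on a toy column (invented values):

```python
import numpy as np
import pandas as pd

# Toy non-interview column: NaN means the household responded.
toy = pd.DataFrame({'NON_INTV': [np.nan, 1.0, np.nan, 2.0]})

# Boolean Series from isna()/notna() can index the DataFrame,
# so no explicit loop over the values is needed.
resp = toy[toy['NON_INTV'].isna()]
nonresp = toy[toy['NON_INTV'].notna()]
print(resp.shape, nonresp.shape)
```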
Now we group the non-responses by NON_INTV, the reason code for why people did not respond.
In [22]:
non_intv_group=nonresp.groupby('NON_INTV')
In [23]:
non_intv_group.size()
Out[23]:
In [24]:
househld['INTV_MON'].hist(by=househld['NON_INTV'].fillna(0))
Out[24]: