Pandas is an open source Python library. Like most coding tools, it lets you manipulate data very easily. This means that my days of picking through data almost by hand are behind me, now that I have moved into the green fields of programming. But after cleaning the data, you still need a program to load it. For me, good old proprietary Stata was very user friendly, but it is expensive! Most user-friendly programs are expensive too. After many months of "trial" versions and hacking my way into several more trials, and after even considering buying a pirated copy of the program, I decided that it was going to be much easier to just learn Python.
Pandas has two main data structures, both built on top of NumPy, which is what makes them fast.
| Structure | Description |
|---|---|
| Series | One-dimensional object, similar to an array. It assigns a labeled index to each item |
| DataFrame | Tabular data structure with rows and columns |
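
To make the two structures a bit more concrete, here is a minimal sketch of how they relate to NumPy and to each other. The names `temps`, `table` and the weather values are made up for illustration only; they are not part of the examples that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a Series pairs an array of values with an index of labels.
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"])

# The values can be pulled out as a plain NumPy array.
values = temps.to_numpy()

# Items can be looked up by label or, like an array, by position.
by_label = temps["tue"]
by_position = temps.iloc[1]

# A DataFrame is a table with labeled rows and columns;
# selecting a single column gives back a Series.
table = pd.DataFrame(
    {"temp": [21.5, 19.0, 23.2], "rain": [False, True, False]},
    index=["mon", "tue", "wed"],
)
rain_column = table["rain"]
```

Both lookups return the same item here, because `"tue"` is the label of the second position. The cells below build a Series and a DataFrame in the same way, using a list and a dictionary.
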
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 50)  # full option name; the short 'max_columns' is ambiguous in newer pandas
%matplotlib inline
In [4]:
series = pd.Series([1, "number", 6, "Happy Series!"])
series
Out[4]:
In [5]:
dictionary = {'Favorite Food': 'mexican', 'Favorite city': 'Portland', 'Hometown': 'Mexico City'}
favorite = pd.Series(dictionary)
favorite
Out[5]:
In [7]:
favorite['Favorite Food']
Out[7]:
In [10]:
favorite[favorite=='mexican']
Out[10]:
In [16]:
favorite.notnull()
Out[16]:
In [18]:
favorite[favorite.notnull()]
Out[18]:
In [19]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
football
Out[19]: