Crash Course in Pandas

Pandas is a very popular Python package that provides data structures and data analysis tools. It includes tools for reading and writing various data formats, processing data sets in an efficient DataFrame object, and the ability to reshape, filter, index, and subset data easily.

This notebook is a quick introduction to some of the features of the Pandas package.

Reading Data into DataFrames

The primary data structure in the Pandas package is the DataFrame. A DataFrame is a table-like object that contains named columns of data. Pandas provides readers for many different file types including CSV, Excel, SAS, HTML, etc. Since CSV is such a common format, let's look at reading a CSV file into a Pandas DataFrame.

First, we need to import the Pandas package. A very common convention is to import Pandas using a shortcut named of 'pd'. This is done with the following line.


In [1]:
import pandas as pd

The function that reads CSV files into DataFrames is called read_csv. In the simplest form, you just supply it with a filename or URL. We have the SAS cars data set stored in github that we can point to directly. We'll use the head method to display just the first few records.


In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/'
                 'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()


Out[2]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

There are many more Pandas data readers that you read about on the Pandas web site.

Displaying Information about DataFrames

We have displayed the DataFrame above. We can get more information about the DataFrame using various properties and methods.

The list of column names can be displayed using the columns property.


In [3]:
df.columns


Out[3]:
Index(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'],
      dtype='object')

The data types of the columns can be displayed using the dtypes property.


In [4]:
df.dtypes


Out[4]:
Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP           float64
Invoice        float64
EngineSize     float64
Cylinders      float64
Horsepower     float64
MPG_City       float64
MPG_Highway    float64
Weight         float64
Wheelbase      float64
Length         float64
dtype: object

For general information about the DataFrame as a whole, you can use the info method.


In [5]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 15 columns):
Make           428 non-null object
Model          428 non-null object
Type           428 non-null object
Origin         428 non-null object
DriveTrain     428 non-null object
MSRP           428 non-null float64
Invoice        428 non-null float64
EngineSize     428 non-null float64
Cylinders      426 non-null float64
Horsepower     428 non-null float64
MPG_City       428 non-null float64
MPG_Highway    428 non-null float64
Weight         428 non-null float64
Wheelbase      428 non-null float64
Length         428 non-null float64
dtypes: float64(10), object(5)
memory usage: 50.2+ KB

Now that we know more about the columns and their data types, we can move on to subsetting DataFrames into other DataFrames or columns.

Subsetting DataFrames

You can select columns from a DataFrame to create a DataFrame with fewer columns, or you can select a single column of data into a Series. The syntax used for this operation is Python's indexing syntax [...].

If you index into a DataFrame using just the name of a column, you will get a Series back. If you index into a DataFrame using a list of names, you will get a DataFrame back.


In [6]:
subdf = df[['Make', 'Model', 'Horsepower']]
subdf.head()


Out[6]:
Make Model Horsepower
0 Acura MDX 265.0
1 Acura RSX Type S 2dr 200.0
2 Acura TSX 4dr 200.0
3 Acura TL 4dr 270.0
4 Acura 3.5 RL 4dr 225.0

In [7]:
model = df['Model']
model.head()


Out[7]:
0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

In addition to selecting columns of data, DataFrames have more advanced ways of selecting data. We'll look at that in the next section.

Row Indexing

Pandas DataFrames have an index associated both with rows and columns. We have seen the column index in the columns property above. We have also indexed into that collection using the column names. You can also index rows by position and label using the iloc and loc properties, respectively.

Currently, our cars DataFrame only has numeric row indexes, it doesn't have labels, so we'll start with iloc.


In [8]:
df.iloc[3:7]


Out[8]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
5 Acura 3.5 RL w/Navigation 4dr Sedan Asia Front 46100.0 41100.0 3.5 6.0 225.0 18.0 24.0 3893.0 115.0 197.0
6 Acura NSX coupe 2dr manual S Sports Asia Rear 89765.0 79978.0 3.2 6.0 290.0 17.0 24.0 3153.0 100.0 174.0

As you can see above, we selected the row indexs from 3 to 7. In Python, numeric slices such as this include the first index, but do not include the last index. That is why you see row indexes 3, 4, 5, and 6 only in the result.

You can also select rows using a list of row indexes.


In [9]:
df.iloc[[4, 5, 10]]


Out[9]:
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
5 Acura 3.5 RL w/Navigation 4dr Sedan Asia Front 46100.0 41100.0 3.5 6.0 225.0 18.0 24.0 3893.0 115.0 197.0
10 Audi A4 3.0 Quattro 4dr manual Sedan Europe All 33430.0 30366.0 3.0 6.0 220.0 17.0 26.0 3583.0 104.0 179.0

Arguably a more useful way of indexing Pandas DataFrames is using row labels. We can set a column as a row index using the set_index method. Then we can use those labels in the loc property for easy data selection.


In [10]:
df = df.set_index('Model')

In [11]:
df.head()


Out[11]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
MDX Acura SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
RSX Type S 2dr Acura Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
TSX 4dr Acura Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
TL 4dr Acura Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
3.5 RL 4dr Acura Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

You'll see in the output above that the Model column is now displayed as a row index. We can use the values in that index in the loc property. This includes both single values, lists of values, or slices.


In [12]:
df.loc['MDX']


Out[12]:
Make           Acura
Type             SUV
Origin          Asia
DriveTrain       All
MSRP           36945
Invoice        33337
EngineSize       3.5
Cylinders          6
Horsepower       265
MPG_City          17
MPG_Highway       23
Weight          4451
Wheelbase        106
Length           189
Name: MDX, dtype: object

In [13]:
df.loc[['TSX 4dr', 'TL 4dr']]


Out[13]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
TSX 4dr Acura Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
TL 4dr Acura Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0

In [14]:
df.loc['MDX':'TL 4dr']


Out[14]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
MDX Acura SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
RSX Type S 2dr Acura Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
TSX 4dr Acura Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
TL 4dr Acura Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0

Note that when using slicing on labels, the endpoint for the slice is included. This is different than for row indexes.

Indexing Both Rows and Columns

The iloc and loc properties can be used to subset both rows and columns simultaneously. However, the iloc property must only contain positional row and column index information, and loc must only contain row and column labels.


In [15]:
df.iloc[1:5, 3:6]


Out[15]:
DriveTrain MSRP Invoice
Model
RSX Type S 2dr Front 23820.0 21761.0
TSX 4dr Front 26990.0 24647.0
TL 4dr Front 33195.0 30299.0
3.5 RL 4dr Front 43755.0 39014.0

In [16]:
df.loc['MDX':'TL 4dr', 'MSRP':'Horsepower']


Out[16]:
MSRP Invoice EngineSize Cylinders Horsepower
Model
MDX 36945.0 33337.0 3.5 6.0 265.0
RSX Type S 2dr 23820.0 21761.0 2.0 4.0 200.0
TSX 4dr 26990.0 24647.0 2.4 4.0 200.0
TL 4dr 33195.0 30299.0 3.2 6.0 270.0

If you want to mix positional and label information between rows and columns, you need to use the ix property.


In [17]:
df.ix[2:6, 'MSRP':'Cylinders']


Out[17]:
MSRP Invoice EngineSize Cylinders
Model
TSX 4dr 26990.0 24647.0 2.4 4.0
TL 4dr 33195.0 30299.0 3.2 6.0
3.5 RL 4dr 43755.0 39014.0 3.5 6.0
3.5 RL w/Navigation 4dr 46100.0 41100.0 3.5 6.0

Of course, the ix property also supports single values, lists of values, and slices for each component.


In [18]:
df.ix[[1,2,5], ['Make', 'MSRP', 'Horsepower']]


Out[18]:
Make MSRP Horsepower
Model
RSX Type S 2dr Acura 23820.0 200.0
TSX 4dr Acura 26990.0 200.0
3.5 RL w/Navigation 4dr Acura 46100.0 225.0

Boolean Indexing

A more dynamic way of indexing DataFrames is to use boolean indexing. Rather than specifying explicit values for the index, we will use an expression to indicate which rows we want to select. The expression creates a boolean Series which indicates the rows to keep. Here is an example of a boolean series created by comparing the MSRP column to 40000.


In [19]:
df['MSRP'] > 40000


Out[19]:
Model
MDX                                 False
RSX Type S 2dr                      False
TSX 4dr                             False
TL 4dr                              False
3.5 RL 4dr                           True
3.5 RL w/Navigation 4dr              True
NSX coupe 2dr manual S               True
A4 1.8T 4dr                         False
A41.8T convertible 2dr              False
A4 3.0 4dr                          False
A4 3.0 Quattro 4dr manual           False
A4 3.0 Quattro 4dr auto             False
A6 3.0 4dr                          False
A6 3.0 Quattro 4dr                  False
A4 3.0 convertible 2dr               True
A4 3.0 Quattro convertible 2dr       True
A6 2.7 Turbo Quattro 4dr             True
A6 4.2 Quattro 4dr                   True
A8 L Quattro 4dr                     True
S4 Quattro 4dr                       True
RS 6 4dr                             True
TT 1.8 convertible 2dr (coupe)      False
TT 1.8 Quattro 2dr (convertible)    False
TT 3.2 coupe 2dr (convertible)       True
A6 3.0 Avant Quattro                 True
S4 Avant Quattro                     True
X3 3.0i                             False
X5 4.4i                              True
325i 4dr                            False
325Ci 2dr                           False
                                    ...  
Tundra Regular Cab V6               False
Tundra Access Cab V6 SR5            False
Matrix XR                           False
Touareg V6                          False
Golf GLS 4dr                        False
GTI 1.8T 2dr hatch                  False
Jetta GLS TDI 4dr                   False
New Beetle GLS 1.8T 2dr             False
Jetta GLI VR6 4dr                   False
New Beetle GLS convertible 2dr      False
Passat GLS 4dr                      False
Passat GLX V6 4MOTION 4dr           False
Passat W8 4MOTION 4dr               False
Phaeton 4dr                          True
Phaeton W12 4dr                      True
Jetta GL                            False
Passat GLS 1.8T                     False
Passat W8                            True
XC90 T6                              True
S40 4dr                             False
S60 2.5 4dr                         False
S60 T5 4dr                          False
S60 R 4dr                           False
S80 2.9 4dr                         False
S80 2.5T 4dr                        False
C70 LPT convertible 2dr              True
C70 HPT convertible 2dr              True
S80 T6 4dr                           True
V40                                 False
XC70                                False
Name: MSRP, dtype: bool

As you can see, the values of the Series are all True or False. If we use that condition as an index value of a DataFrame, we'll get a new DataFrame containing only the rows where the condition was true.


In [20]:
df[df['MSRP'] > 40000].head()


Out[20]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
3.5 RL 4dr Acura Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
3.5 RL w/Navigation 4dr Acura Sedan Asia Front 46100.0 41100.0 3.5 6.0 225.0 18.0 24.0 3893.0 115.0 197.0
NSX coupe 2dr manual S Acura Sports Asia Rear 89765.0 79978.0 3.2 6.0 290.0 17.0 24.0 3153.0 100.0 174.0
A4 3.0 convertible 2dr Audi Sedan Europe Front 42490.0 38325.0 3.0 6.0 220.0 20.0 27.0 3814.0 105.0 180.0
A4 3.0 Quattro convertible 2dr Audi Sedan Europe All 44240.0 40075.0 3.0 6.0 220.0 18.0 25.0 4013.0 105.0 180.0

You can combine conditions using & for "and" and | for "or". For example, if we wanted cars that had an MSRP greater than $40,000, but also had 8 or more Cylinders, we could do the following. Note that due to the order of precedence of comparison and combination operatiors, you need to surround your comparions with parentheses.


In [21]:
df[(df['MSRP'] > 40000) & (df['Cylinders'] > 8)].head()


Out[21]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
Viper SRT-10 convertible 2dr Dodge Sports USA Rear 81795.0 74451.0 8.3 10.0 500.0 12.0 20.0 3410.0 99.0 176.0
Excursion 6.8 XLT Ford SUV USA All 41475.0 36494.0 6.8 10.0 310.0 10.0 13.0 7190.0 137.0 227.0
CL600 2dr Mercedes-Benz Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
SL600 convertible 2dr Mercedes-Benz Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
Phaeton W12 4dr Volkswagen Sedan Europe Front 75000.0 69130.0 6.0 12.0 420.0 12.0 19.0 5399.0 118.0 204.0

Sorting

Sorting in Pandas DataFrames can be done according to the index or column values. The methods used to sort a DataFrame are sort_index for sorting by the index and sort_values for sorting by the data values.

Since we still have our index set as Model, let's sort by that first.


In [22]:
df.sort_index().head()


Out[22]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
3.5 RL 4dr Acura Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
3.5 RL w/Navigation 4dr Acura Sedan Asia Front 46100.0 41100.0 3.5 6.0 225.0 18.0 24.0 3893.0 115.0 197.0
300M 4dr Chrysler Sedan USA Front 29865.0 27797.0 3.5 6.0 250.0 18.0 27.0 3581.0 113.0 198.0
300M Special Edition 4dr Chrysler Sedan USA Front 33295.0 30884.0 3.5 6.0 255.0 18.0 27.0 3650.0 113.0 198.0
325Ci 2dr BMW Sedan Europe Rear 30795.0 28245.0 2.5 6.0 184.0 20.0 29.0 3197.0 107.0 177.0

Sorting by values is done by specify the columns that you want to sort by.


In [23]:
df.sort_values(['MSRP', 'Horsepower']).head()


Out[23]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
Rio 4dr manual Kia Sedan Asia Front 10280.0 9875.0 1.6 4.0 104.0 26.0 33.0 2403.0 95.0 167.0
Accent 2dr hatch Hyundai Sedan Asia Front 10539.0 10107.0 1.6 4.0 103.0 29.0 33.0 2255.0 96.0 167.0
Echo 2dr manual Toyota Sedan Asia Front 10760.0 10144.0 1.5 4.0 108.0 35.0 43.0 2035.0 93.0 163.0
Ion1 4dr Saturn Sedan USA Front 10995.0 10319.0 2.2 4.0 140.0 26.0 35.0 2692.0 103.0 185.0
Rio 4dr auto Kia Sedan Asia Front 11155.0 10705.0 1.6 4.0 104.0 25.0 32.0 2458.0 95.0 167.0

You can also use the ascending= option to specify the order in which the columns should be sorted.


In [24]:
df.sort_values(['MSRP', 'Horsepower'], ascending=[False, True]).head()


Out[24]:
Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Model
911 GT2 2dr Porsche Sports Europe Rear 192465.0 173560.0 3.6 6.0 477.0 17.0 24.0 3131.0 93.0 175.0
CL600 2dr Mercedes-Benz Sedan Europe Rear 128420.0 119600.0 5.5 12.0 493.0 13.0 19.0 4473.0 114.0 196.0
SL600 convertible 2dr Mercedes-Benz Sports Europe Rear 126670.0 117854.0 5.5 12.0 493.0 13.0 19.0 4429.0 101.0 179.0
SL55 AMG 2dr Mercedes-Benz Sports Europe Rear 121770.0 113388.0 5.5 8.0 493.0 14.0 21.0 4235.0 101.0 179.0
CL500 2dr Mercedes-Benz Sedan Europe Rear 94820.0 88324.0 5.0 8.0 302.0 16.0 24.0 4085.0 114.0 196.0

Keep in mind that the sort_index and sort_values methods return a new DataFrame with the sorting applied. If you want to sort a DataFrame in place, you need to specify the inplace=True option.

Let's reset the index before moving on to the next sections.


In [25]:
df.reset_index(inplace=True)

Doing Simple Statistics

Pandas DataFrames have many builtin methods for doing simple statistics. Probably the most common one is describe. It computes various basic statistics for the entire DataFrame.


In [26]:
df.describe()


/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[26]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
mean 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
25% 20334.250000 18866.000000 2.375000 NaN 165.000000 17.000000 24.000000 3104.000000 103.000000 178.000000
50% 27635.000000 25294.500000 3.000000 NaN 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
75% 39205.000000 35710.250000 3.900000 NaN 255.000000 21.250000 29.000000 3977.750000 112.000000 194.000000
max 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

The describe method has a couple of options to select different columns of data (include=) and the percentiles to compute (percentiles=). In the example below, we are selecting all of the columns and changing the displayed percentiles to 30% and 70% (50% is always displayed).


In [27]:
df.describe(include='all', percentiles=[0.3, 0.7])


/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[27]:
Model Make Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
count 428 428 428 428 428 428.000000 428.000000 428.000000 426.000000 428.000000 428.000000 428.000000 428.000000 428.000000 428.000000
unique 425 38 6 3 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
top G35 4dr Toyota Sedan Asia Front NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
freq 2 28 262 158 226 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN NaN NaN NaN 32774.855140 30014.700935 3.196729 5.807512 215.885514 20.060748 26.843458 3577.953271 108.154206 186.362150
std NaN NaN NaN NaN NaN 19431.716674 17642.117750 1.108595 1.558443 71.836032 5.238218 5.741201 758.983215 8.311813 14.357991
min NaN NaN NaN NaN NaN 10280.000000 9875.000000 1.300000 3.000000 73.000000 10.000000 12.000000 1850.000000 89.000000 143.000000
30% NaN NaN NaN NaN NaN 22001.000000 20288.100000 2.400000 NaN 172.100000 18.000000 25.000000 3217.000000 104.000000 178.000000
50% NaN NaN NaN NaN NaN 27635.000000 25294.500000 3.000000 NaN 210.000000 19.000000 26.000000 3474.500000 107.000000 187.000000
70% NaN NaN NaN NaN NaN 35938.000000 32987.500000 3.600000 NaN 240.000000 21.000000 29.000000 3835.300000 111.000000 192.000000
max NaN NaN NaN NaN NaN 192465.000000 173560.000000 8.300000 12.000000 500.000000 60.000000 66.000000 7190.000000 144.000000 238.000000

You can also select individual statistics using their own methods.


In [28]:
df.min()


Out[28]:
Model          3.5 RL 4dr
Make                Acura
Type               Hybrid
Origin               Asia
DriveTrain            All
MSRP                10280
Invoice              9875
EngineSize            1.3
Cylinders               3
Horsepower             73
MPG_City               10
MPG_Highway            12
Weight               1850
Wheelbase              89
Length                143
dtype: object

In [29]:
df.median()


Out[29]:
MSRP           27635.0
Invoice        25294.5
EngineSize         3.0
Cylinders          6.0
Horsepower       210.0
MPG_City          19.0
MPG_Highway       26.0
Weight          3474.5
Wheelbase        107.0
Length           187.0
dtype: float64

All of the above methods also work on individual columns, or subsets of columns.


In [30]:
df['MSRP'].describe()


Out[30]:
count       428.000000
mean      32774.855140
std       19431.716674
min       10280.000000
25%       20334.250000
50%       27635.000000
75%       39205.000000
max      192465.000000
Name: MSRP, dtype: float64

In [31]:
df['MSRP'].min()


Out[31]:
10280.0

In [32]:
df[['MSRP', 'Horsepower']].describe()


Out[32]:
MSRP Horsepower
count 428.000000 428.000000
mean 32774.855140 215.885514
std 19431.716674 71.836032
min 10280.000000 73.000000
25% 20334.250000 165.000000
50% 27635.000000 210.000000
75% 39205.000000 255.000000
max 192465.000000 500.000000

Grouping Data

Another common operation in analyzing data is grouping by variable values. This is primarily done using the groupby method of DataFrames.


In [33]:
grpdf = df.groupby('Origin')
grpdf


Out[33]:
<pandas.core.groupby.DataFrameGroupBy object at 0x7faed59673c8>

You'll notice that in this case the returned value is a DataFrameGroupBy object. Many of the methods available on a DataFrame will also work on the DataFrameGroupBy object.


In [34]:
grpdf.describe()


/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[34]:
MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
Origin
Asia count 158.000000 158.000000 158.000000 156.000000 158.000000 158.000000 158.000000 158.000000 158.000000 158.000000
mean 24741.322785 22602.177215 2.774051 5.185897 190.702532 22.012658 28.265823 3319.316456 105.886076 182.816456
std 11321.069675 9842.984880 0.902310 1.269008 59.392627 6.733307 6.770503 717.842132 7.735249 12.564148
min 10280.000000 9875.000000 1.300000 3.000000 73.000000 13.000000 17.000000 1850.000000 89.000000 153.000000
25% 17208.000000 16271.500000 2.000000 NaN 142.000000 18.000000 25.000000 2751.500000 102.000000 175.000000
50% 23032.500000 20949.500000 2.600000 NaN 187.500000 20.500000 27.000000 3280.000000 105.000000 183.500000
75% 28787.500000 26645.000000 3.500000 NaN 233.750000 24.000000 31.000000 3706.750000 110.000000 191.000000
max 89765.000000 79978.000000 5.600000 8.000000 340.000000 60.000000 66.000000 5590.000000 140.000000 224.000000
Europe count 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000 123.000000
mean 48349.796748 44395.081301 3.206504 6.235772 251.894309 18.731707 26.008130 3680.723577 106.447154 181.845528
std 25318.600464 23080.368888 1.037157 1.679585 80.738884 3.289509 4.167588 579.576871 6.690099 11.479870
min 16999.000000 15437.000000 1.600000 4.000000 100.000000 12.000000 14.000000 2524.000000 93.000000 143.000000
25% 33837.500000 31287.500000 2.500000 5.000000 195.500000 17.000000 24.000000 3285.000000 102.000000 177.000000
50% 40590.000000 37575.000000 3.000000 6.000000 225.000000 19.000000 26.000000 3585.000000 107.000000 182.000000
75% 56382.500000 51042.500000 4.200000 8.000000 302.000000 20.000000 28.500000 3959.500000 110.000000 190.000000
max 192465.000000 173560.000000 6.000000 12.000000 493.000000 38.000000 46.000000 5423.000000 123.000000 204.000000
USA count 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000
mean 28377.442177 25949.340136 3.642857 6.108844 212.823129 19.074830 26.013605 3769.952381 112.020408 193.952381
std 11711.982506 10518.722194 1.194221 1.531090 63.748618 3.982992 5.396582 855.305524 8.788590 15.305265
min 10995.000000 10319.000000 1.600000 4.000000 103.000000 10.000000 12.000000 2348.000000 93.000000 150.000000
25% 20315.000000 18927.000000 2.700000 5.000000 160.000000 17.000000 22.000000 3234.000000 106.000000 185.000000
50% 25520.000000 23217.000000 3.600000 6.000000 200.000000 18.000000 26.000000 3606.000000 111.000000 194.000000
75% 33767.500000 30778.000000 4.600000 8.000000 250.000000 21.000000 29.000000 4235.000000 115.000000 201.500000
max 81795.000000 74451.000000 8.300000 10.000000 500.000000 29.000000 37.000000 7190.000000 144.000000 238.000000

In [35]:
grpdf[['MSRP', 'Horsepower']].describe(percentiles=[0.2, 0.8])


Out[35]:
MSRP Horsepower
Origin
Asia count 158.000000 158.000000
mean 24741.322785 190.702532
std 11321.069675 59.392627
min 10280.000000 73.000000
20% 15680.800000 130.000000
50% 23032.500000 187.500000
80% 30995.000000 240.000000
max 89765.000000 340.000000
Europe count 123.000000 123.000000
mean 48349.796748 251.894309
std 25318.600464 80.738884
min 16999.000000 100.000000
20% 32506.000000 184.000000
50% 40590.000000 225.000000
80% 64248.000000 315.000000
max 192465.000000 493.000000
USA count 147.000000 147.000000
mean 28377.442177 212.823129
std 11711.982506 63.748618
min 10995.000000 103.000000
20% 19099.000000 150.000000
50% 25520.000000 200.000000
80% 36079.000000 275.000000
max 81795.000000 500.000000

It is also possible to loop through all of the groupings using iteration in Python.


In [36]:
for name, group in grpdf:
    print('=' * 35, name, '=' * 35)
    print('')
    print(group.head())
    print('')


=================================== Asia ===================================

            Model   Make   Type Origin DriveTrain     MSRP  Invoice  \
0             MDX  Acura    SUV   Asia        All  36945.0  33337.0   
1  RSX Type S 2dr  Acura  Sedan   Asia      Front  23820.0  21761.0   
2         TSX 4dr  Acura  Sedan   Asia      Front  26990.0  24647.0   
3          TL 4dr  Acura  Sedan   Asia      Front  33195.0  30299.0   
4      3.5 RL 4dr  Acura  Sedan   Asia      Front  43755.0  39014.0   

   EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
0         3.5        6.0       265.0      17.0         23.0  4451.0   
1         2.0        4.0       200.0      24.0         31.0  2778.0   
2         2.4        4.0       200.0      22.0         29.0  3230.0   
3         3.2        6.0       270.0      20.0         28.0  3575.0   
4         3.5        6.0       225.0      18.0         24.0  3880.0   

   Wheelbase  Length  
0      106.0   189.0  
1      101.0   172.0  
2      105.0   183.0  
3      108.0   186.0  
4      115.0   197.0  

=================================== Europe ===================================

                        Model  Make   Type  Origin DriveTrain     MSRP  \
7                 A4 1.8T 4dr  Audi  Sedan  Europe      Front  25940.0   
8      A41.8T convertible 2dr  Audi  Sedan  Europe      Front  35940.0   
9                  A4 3.0 4dr  Audi  Sedan  Europe      Front  31840.0   
10  A4 3.0 Quattro 4dr manual  Audi  Sedan  Europe        All  33430.0   
11    A4 3.0 Quattro 4dr auto  Audi  Sedan  Europe        All  34480.0   

    Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
7   23508.0         1.8        4.0       170.0      22.0         31.0  3252.0   
8   32506.0         1.8        4.0       170.0      23.0         30.0  3638.0   
9   28846.0         3.0        6.0       220.0      20.0         28.0  3462.0   
10  30366.0         3.0        6.0       220.0      17.0         26.0  3583.0   
11  31388.0         3.0        6.0       220.0      18.0         25.0  3627.0   

    Wheelbase  Length  
7       104.0   179.0  
8       105.0   180.0  
9       104.0   179.0  
10      104.0   179.0  
11      104.0   179.0  

=================================== USA ===================================

                 Model   Make   Type Origin DriveTrain     MSRP  Invoice  \
46             Rainier  Buick    SUV    USA        All  37895.0  34357.0   
47       Rendezvous CX  Buick    SUV    USA      Front  26545.0  24085.0   
48  Century Custom 4dr  Buick  Sedan    USA      Front  22180.0  20351.0   
49  LeSabre Custom 4dr  Buick  Sedan    USA      Front  26470.0  24282.0   
50        Regal LS 4dr  Buick  Sedan    USA      Front  24895.0  22835.0   

    EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
46         4.2        6.0       275.0      15.0         21.0  4600.0   
47         3.4        6.0       185.0      19.0         26.0  4024.0   
48         3.1        6.0       175.0      20.0         30.0  3353.0   
49         3.8        6.0       205.0      20.0         29.0  3567.0   
50         3.8        6.0       200.0      20.0         30.0  3461.0   

    Wheelbase  Length  
46      113.0   193.0  
47      112.0   187.0  
48      109.0   195.0  
49      112.0   200.0  
50      109.0   196.0  

There are far too many grouping features in Pandas to cover here. For more information, see the Pandas Group By documentation.

Plotting

There are several packages for creating plots in Python. These include matplotlib, seaborn, bokeh, plot.ly, or even Pandas itself. Many of these packages such as seaborn and the Pandas plotting features use matplotlib in the background. Packages like bokeh and plot.ly are primarily focused on graphics that are rendered in a web browser.

The most basic plotting features can be accessed in the plot method of the DataFrame. To start, we can create a scatter plot of the MSRP values.


In [37]:
%matplotlib inline

In [38]:
df.plot(kind='scatter', x='MSRP', y='Horsepower', figsize=(12,6))


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faed529c748>

Using the cufflinks package, which is a DataFrame-style interface to plot.ly, we can create a graph that is rendered in the browser and includes interactive features.


In [39]:
import cufflinks

# Configure cufflinks to render the graph locally.
cufflinks.go_offline()

df.iplot(kind='scatter', x='MSRP', y='Horsepower', mode='markers', 
         size=8, dimensions=(750, 400))