Pandas

pandas provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

In this section we will look at some of the basic operations that pandas can perform.

Pandas data structures

  • Series - 1D array
  • DataFrame - 2D array
  • Panel - 3D array (deprecated in pandas 0.20 and removed in 0.25)

Data types - dtypes

Types in pandas objects:

  • float
  • int
  • bool
  • datetime64[ns] and datetime64[ns, tz] (in >= 0.17.0)
  • timedelta64[ns]
  • category (in >= 0.15.0)
  • object

dtypes have item sizes, e.g. int64 and int32
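
As a small illustration of item sizes, the sketch below stores the same values as int64 and int32 and compares the bytes used per element. The variable names are made up for this example.

```python
import pandas as pd

# The same values stored with two different integer widths.
s64 = pd.Series([1, 2, 3], dtype="int64")
s32 = s64.astype("int32")

# itemsize is the number of bytes used per element.
print(s64.dtype, s64.dtype.itemsize)  # int64 8
print(s32.dtype, s32.dtype.itemsize)  # int32 4

# nbytes is the total size of the underlying data (3 elements each).
print(s64.nbytes, s32.nbytes)  # 24 12
```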

Standard imports


In [ ]:
import pandas as pd
import numpy as np

Series

Creating a Series by passing a list of values, letting pandas create a default integer index:


In [ ]:
s = pd.Series([1,3,5,np.nan,6,8])
s
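
The default integer index is only used when no index is given. The sketch below contrasts it with an explicit index; the labels "a", "b", "c" are arbitrary and chosen for illustration.

```python
import numpy as np
import pandas as pd

# No index given: pandas builds a RangeIndex 0..5.
s_default = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s_default.index)

# Explicit index: values are looked up by label.
s_labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])
print(s_labeled["b"])  # 3

# The NaN forces the integer values to be stored as float64.
print(s_default.dtype)  # float64
```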

DataFrame

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:


In [ ]:
dates = pd.date_range('20130101', periods=6)
dates

In [ ]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

Creating a DataFrame by passing a dict of objects that can be converted to Series-like objects:


In [ ]:
df2 = pd.DataFrame({ 'A' : 1.,
   'B' : pd.Timestamp('20130102'),
   'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   'D' : np.array([3] * 4,dtype='int32'),
   'E' : pd.Categorical(["test","train","test","train"]),
   'F' : 'foo' })
df2

In [ ]:
df2.dtypes

Panel

  • items: axis 0, each item corresponds to a DataFrame contained inside
  • major_axis: axis 1, it is the index (rows) of each of the DataFrames
  • minor_axis: axis 2, it is the columns of each of the DataFrames

In [ ]:
# Note: pd.Panel only exists in pandas < 0.25; this cell fails on later versions
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
   major_axis=pd.date_range('1/1/2000', periods=5),
   minor_axis=['A', 'B', 'C', 'D'])
wp
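
On pandas versions where Panel is no longer available, one possible substitute is a DataFrame with a two-level row index, built from one DataFrame per item. This is a sketch of that alternative, not an exact equivalent of Panel; the level names "item" and "date" are assumptions made for this example.

```python
import numpy as np
import pandas as pd

major = pd.date_range("1/1/2000", periods=5)
minor = ["A", "B", "C", "D"]

# One DataFrame per item, playing the role of the Panel "items" axis.
frames = {
    name: pd.DataFrame(np.random.randn(5, 4), index=major, columns=minor)
    for name in ["Item1", "Item2"]
}

# Concatenating a dict stacks the frames and uses its keys as the outer index level.
wp_long = pd.concat(frames, names=["item", "date"])
print(wp_long.shape)  # (10, 4)

# Selecting one item returns a regular date-indexed DataFrame.
item1 = wp_long.loc["Item1"]
print(item1.shape)  # (5, 4)
```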

Viewing data


In [ ]:
df.head()

In [ ]:
df.tail(3)

Index, columns, and underlying NumPy data


In [ ]:
df.index

In [ ]:
df.columns

In [ ]:
df.values
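
Since pandas 0.24 the DataFrame.to_numpy() method is available as an alternative to .values for getting the underlying array. A minimal sketch, rebuilding the same kind of frame as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# to_numpy() returns the data as a plain NumPy array, without index or column labels.
arr = df.to_numpy()
print(type(arr), arr.shape, arr.dtype)
```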

Statistical summary


In [ ]:
df.describe()

Transposing data


In [ ]:
df.T

Sorting


In [ ]:
df.sort_index(axis=1, ascending=False)
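
sort_index orders by the axis labels, here the column names. To order rows by the data itself, sort_values can be used. The small frame below is made up so that the result is easy to check by eye.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [3, 1, 2], "B": [9, 8, 7]}, index=["x", "y", "z"])

# Sort rows by the values in column A (ascending by default).
by_value = df_demo.sort_values(by="A")
print(by_value.index.tolist())  # ['y', 'z', 'x']

# Sort columns by their labels, in descending order.
by_cols = df_demo.sort_index(axis=1, ascending=False)
print(by_cols.columns.tolist())  # ['B', 'A']
```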

Selection

Get a column


In [ ]:
df['A']

Get rows


In [ ]:
# By position (integer slice, end is exclusive)
df[0:3]

In [ ]:
# By index label (both endpoints are included)
df['20130102':'20130104']

Selection by Label


In [ ]:
df.loc[dates[0]]

In [ ]:
# Limit columns
df.loc[:,['A','B']]
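
Label-based and position-based selection can return the same rows, but they treat slice endpoints differently. The sketch below uses a frame filled with 0..23 (instead of random numbers) so the two results can be compared directly.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# .loc slices by label and includes both endpoints: Jan 2, 3 and 4.
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# .iloc slices by position and excludes the end: rows 1, 2 and 3.
by_pos = df.iloc[1:4, 0:2]

print(by_label.equals(by_pos))  # True
```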

In [ ]:
df_stock = pd.DataFrame({'Stocks': ["AAPL","CA","CTXS","FIS","MA"],
                'Values': [126.17,31.85,65.38,64.08,88.72]})
df_stock

Adding data


In [ ]:
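# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
new_row = pd.DataFrame([{"Stocks": "GOOG", "Values": 523.53}])
df_stock = pd.concat([df_stock, new_row], ignore_index=True)
df_stock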

Boolean indexing


In [ ]:
df_stock[df_stock["Values"]>65]
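
Boolean masks can also be combined. A short sketch, assuming the original five-row df_stock: conditions are joined with & (and) or | (or), each wrapped in parentheses, and isin tests membership in a list.

```python
import pandas as pd

df_stock = pd.DataFrame({"Stocks": ["AAPL", "CA", "CTXS", "FIS", "MA"],
                         "Values": [126.17, 31.85, 65.38, 64.08, 88.72]})

# Rows whose value lies strictly between 60 and 100.
mask = (df_stock["Values"] > 60) & (df_stock["Values"] < 100)
in_range = df_stock.loc[mask, "Stocks"].tolist()
print(in_range)  # ['CTXS', 'FIS', 'MA']

# Rows whose ticker is in a given list.
picked = df_stock[df_stock["Stocks"].isin(["AAPL", "MA"])]
print(picked["Values"].tolist())  # [126.17, 88.72]
```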

Stats operations


In [ ]:
# Stocks is a string column, so restrict the mean to numeric columns
df_stock.mean(numeric_only=True)

In [ ]:
# Per column
df.mean()

In [ ]:
# Per row
df.mean(axis=1)

Optimized pandas data access

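It is recommended to use the optimized pandas data access methods .at, .iat, .loc and .iloc rather than plain [] indexing. (.ix, which appears in older tutorials, was deprecated in pandas 0.20 and removed in 1.0.)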
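
The scalar accessors .at (by label) and .iat (by position) read or write a single cell. A minimal sketch on a small frame filled with 0.0..23.0 so the expected value is easy to verify:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# Third row, second column: 2 * 4 + 1 = 9.0
v_label = df.at[dates[2], "B"]  # lookup by row label and column label
v_pos = df.iat[2, 1]            # lookup by integer position
print(v_label, v_pos)  # 9.0 9.0

# The same accessors can assign a single value in place.
df.at[dates[2], "B"] = -1.0
print(df.iat[2, 1])  # -1.0
```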


In [ ]:
big_dates = pd.date_range('20130101', periods=60000)
big_dates
big_df = pd.DataFrame(np.random.randn(60000,4), index=big_dates, columns=list('ABCD'))
big_df

In [ ]:
big_df['20200102':'20200104']

In [ ]:
big_df.loc['20200102':'20200104']

In [ ]:
%timeit big_df['20200102':'20200104']
%timeit big_df.loc['20200102':'20200104']

In [ ]:
big_df[30000:30003]

In [ ]:
big_df.iloc[30000:30003]

In [ ]:
%timeit big_df[30000:30003]
%timeit big_df.iloc[30000:30003]
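
%timeit is an IPython magic and only works inside a notebook or IPython shell. Outside of one, a similar comparison can be sketched with the standard timeit module. The helper time_call is hypothetical, and timings depend on the machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

big_dates = pd.date_range("20130101", periods=60000)
big_df = pd.DataFrame(np.random.randn(60000, 4), index=big_dates, columns=list("ABCD"))

def time_call(fn, number=1000):
    """Return the average time in seconds of one call to fn."""
    return timeit.timeit(fn, number=number) / number

t_plain = time_call(lambda: big_df[30000:30003])
t_iloc = time_call(lambda: big_df.iloc[30000:30003])
print(f"[]    : {t_plain * 1e6:.1f} us per call")
print(f".iloc : {t_iloc * 1e6:.1f} us per call")

# Both forms select the same three rows.
same = big_df[30000:30003].equals(big_df.iloc[30000:30003])
print(same)  # True
```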
