Pandas

Source: https://github.com/jdhp-docs/notebooks/blob/master/python_pandas_en.ipynb

Official documentation: http://pandas.pydata.org/pandas-docs/stable/

Getting started:

Import directives



In [ ]:

    
%matplotlib inline
#%matplotlib notebook

from IPython.display import display

import matplotlib
matplotlib.rcParams['figure.figsize'] = (9, 9)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import datetime
import pandas as pd
import numpy as np



In [ ]:

    
pd.__version__

Make data

Series (1D data)

C.f. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

With automatic indices



In [ ]:

    
data_list = [1, 3, np.nan, 7]
series = pd.Series(data_list)
series



In [ ]:

    
data_array = np.array(data_list)
series = pd.Series(data_array)
series

With defined indices



In [ ]:

    
indices = pd.Series([1, 3, 5, 7])
series = pd.Series([10, 30, 50, 70], index=indices)
series



In [ ]:

    
indices = pd.Series(['A', 'B', 'C', 'D'])
series = pd.Series([10, 30, 50, 70], index=indices)
series



In [ ]:

    
data_dict = {'A': 10, 'B': 30, 'C': 50, 'D': 70}
series = pd.Series(data_dict)
series

Get information about a series



In [ ]:

    
series.index



In [ ]:

    
series.values



In [ ]:

    
series.shape



In [ ]:

    
series.dtypes



In [ ]:

    
series.describe()



In [ ]:

    
type(series.describe())



In [ ]:

    
series.memory_usage()

Date ranges



In [ ]:

    
dates = pd.date_range('20130101', periods=6)
dates



In [ ]:

    
dates = pd.date_range(start='2013-01-01', end='2013-01-08')
dates



In [ ]:

    
dates = pd.date_range('2013-01-01', periods=4, freq='M')
dates



In [ ]:

    
num_days = 7

data = np.random.random(num_days)
index = pd.date_range('2017-01-01', periods=num_days)

series = pd.Series(data, index)
series

Frames (2D data)

C.f. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

With automatic indices and columns



In [ ]:

    
data_list = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data_array)
df



In [ ]:

    
data_array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data_array)
df

With defined indices and columns

Using lists:



In [ ]:

    
data = [[1, 2, 3], [4, 5, 6]]
index = [10, 20]
columns = ['A', 'B', 'C']

df = pd.DataFrame(data, index, columns)
df

Using numpy arrays:



In [ ]:

    
data = np.array([[1, 2, 3], [4, 5, 6]])
index = np.array([10, 20])
columns = np.array(['A', 'B', 'C'])

df = pd.DataFrame(data, index=index, columns=columns)
df

Using Series:



In [ ]:

    
data = np.array([[1, 2, 3], [4, 5, 6]])
index = pd.Series([10, 20])
columns = pd.Series(['A', 'B', 'C'])

df = pd.DataFrame(data, index=index, columns=columns)
df

With columns from dict

Dictionary keys define columns label.



In [ ]:

    
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict)
df

To define index as well:



In [ ]:

    
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict, index=[10, 20, 30])
df

Get information about a dataframe



In [ ]:

    
df.index



In [ ]:

    
df.columns



In [ ]:

    
df.values



In [ ]:

    
df.shape



In [ ]:

    
df.dtypes



In [ ]:

    
df.info()



In [ ]:

    
df.describe()



In [ ]:

    
type(df.describe())



In [ ]:

    
df.memory_usage()

More details about dtype

DataFrame's columns can have different types. But what about rows ?

What append when a DataFrame with columns having different type is transposed ?



In [ ]:

    
data_dict = {'A': 'foo',
             'B': [10, 20, 30],
             'C': 3}
df = pd.DataFrame(data_dict)
df



In [ ]:

    
df.dtypes



In [ ]:

    
df2 = df.T
df2



In [ ]:

    
df2.dtypes

Panels (3D data)

Panels are deprecated.

Pandas now focuses on 1D (Series) and 2D (DataFrame) data structures.

The recommended alternative to work with 3-dimensional data is the xarray python library.

An other workaround: one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

See http://pandas.pydata.org/pandas-docs/stable/dsintro.html#deprecate-panel.

Panel4D and PanelND (ND data)

Panel4D and PanelND are deprecated.

Pandas now focuses on 1D (Series) and 2D (DataFrame) data structures.

The recommended alternative to work with n-dimensional data is the xarray python library.

An other workaround: one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

See http://pandas.pydata.org/pandas-docs/stable/dsintro.html#panel4d-and-panelnd-deprecated.

Export/import data (write/read files)

See http://pandas.pydata.org/pandas-docs/stable/io.html

Reader functions are accessibles from the top level pd object.

Writer functions are accessibles from data objects (i.e. Series, DataFrame or Panel objects).



In [ ]:

    
data_array = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data_array, index=[10, 20], columns=[100, 200, 300])
df

HDF5 files

See python_pandas_hdf5_en.ipynb...

CSV files

See http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files

Write CSV files

See http://pandas.pydata.org/pandas-docs/stable/io.html#io-store-in-csv

Simplest version:



In [ ]:

    
df.to_csv(path_or_buf="python_pandas_io_test.csv")



In [ ]:

    
!cat python_pandas_io_test.csv

Setting more options:



In [ ]:

    
# FYI, many other options are available
df.to_csv(path_or_buf="python_pandas_io_test.csv",
          sep=',',
          columns=None,
          header=True,
          index=True,
          index_label=None,
          compression=None,  # allowed values are 'gzip', 'bz2' or 'xz'
          date_format=None)



In [ ]:

    
!cat python_pandas_io_test.csv

Read CSV files

See http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table

Simplest version:



In [ ]:

    
df = pd.read_csv("python_pandas_io_test.csv")
df

Setting more options:



In [ ]:

    
df = pd.read_csv("python_pandas_io_test.csv",
                 sep=',',
                 delimiter=None,
                 header='infer',
                 names=None,
                 index_col=0,
                 usecols=None,
                 squeeze=False,
                 prefix=None,
                 mangle_dupe_cols=True,
                 dtype=None,
                 engine=None,
                 converters=None,
                 true_values=None,
                 false_values=None,
                 skipinitialspace=False,
                 skiprows=None,
                 nrows=None,
                 na_values=None,
                 keep_default_na=True,
                 na_filter=True,
                 verbose=False,
                 skip_blank_lines=True,
                 parse_dates=False,
                 infer_datetime_format=False,
                 keep_date_col=False,
                 date_parser=None,
                 dayfirst=False,
                 iterator=False,
                 chunksize=None,
                 compression='infer',
                 thousands=None,
                 decimal=b'.',
                 lineterminator=None,
                 quotechar='"',
                 quoting=0,
                 escapechar=None,
                 comment=None,
                 encoding=None,
                 dialect=None,
                 #tupleize_cols=False,
                 error_bad_lines=True,
                 warn_bad_lines=True,
                 skipfooter=0,
                 #skip_footer=0,
                 doublequote=True,
                 delim_whitespace=False,
                 #as_recarray=False,
                 #compact_ints=False,
                 #use_unsigned=False,
                 low_memory=True,
                 #buffer_lines=None,
                 memory_map=False,
                 float_precision=None)
df



In [ ]:

    
!rm python_pandas_io_test.csv

JSON files

See http://pandas.pydata.org/pandas-docs/stable/io.html#json



In [ ]:

    
import io

Write JSON files

See http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-writer

Simplest version



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test.json")



In [ ]:

    
!cat python_pandas_io_test.json

Setting `orient`="split"



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test_split.json",
           orient="split")



In [ ]:

    
!cat python_pandas_io_test_split.json

Setting `orient`="records"



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test_records.json",
           orient="records")



In [ ]:

    
!cat python_pandas_io_test_records.json

Setting `orient`="index" (the default option for `Series`)



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test_index.json",
           orient="index")



In [ ]:

    
!cat python_pandas_io_test_index.json

Setting `orient`="columns" (the default option for `DataFrame`) (for `DataFrame` only)



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test_columns.json",
           orient="columns")



In [ ]:

    
!cat python_pandas_io_test_columns.json

Setting `orient`="values" (for `DataFrame` only)



In [ ]:

    
df.to_json(path_or_buf="python_pandas_io_test_values.json",
           orient="values")



In [ ]:

    
!cat python_pandas_io_test_values.json

Setting more options



In [ ]:

    
# FYI, many other options are available
df.to_json(path_or_buf="python_pandas_io_test.json",
           orient='columns',     # For DataFrame: 'split','records','index','columns' or 'values'
           date_format=None,     # None, 'epoch' or 'iso'
           double_precision=10,
           force_ascii=True,
           date_unit='ms')



In [ ]:

    
!cat python_pandas_io_test.json

Read JSON files

See http://pandas.pydata.org/pandas-docs/stable/io.html#io-json-reader

Using `orient`="split"

Dict like data {index -> [index], columns -> [columns], data -> [values]}



In [ ]:

    
!cat python_pandas_io_test_split.json



In [ ]:

    
df = pd.read_json("python_pandas_io_test_split.json",
                  orient="split")
df

Using `orient`="records"

List like [{column -> value}, ... , {column -> value}]



In [ ]:

    
!cat python_pandas_io_test_records.json



In [ ]:

    
df = pd.read_json("python_pandas_io_test_records.json",
                  orient="records")
df

Using `orient`="index"

Dict like {index -> {column -> value}}



In [ ]:

    
!cat python_pandas_io_test_index.json



In [ ]:

    
df = pd.read_json("python_pandas_io_test_index.json",
                  orient="index")
df

Using `orient`="columns"

Dict like {column -> {index -> value}}



In [ ]:

    
!cat python_pandas_io_test_columns.json



In [ ]:

    
df = pd.read_json("python_pandas_io_test_columns.json",
                  orient="columns")
df

Using `orient`="values" (for `DataFrame` only)

Just the values array



In [ ]:

    
!cat python_pandas_io_test_values.json



In [ ]:

    
df = pd.read_json("python_pandas_io_test_values.json",
                  orient="values")
df

Setting more options



In [ ]:

    
df = pd.read_json("python_pandas_io_test.json",
                  orient=None,
                  typ='frame',
                  dtype=True,
                  convert_axes=True,
                  convert_dates=True,
                  keep_default_dates=True,
                  numpy=False,
                  precise_float=False,
                  date_unit=None,
                  encoding=None,
                  lines=False)
df



In [ ]:

    
!rm python_pandas_io_test*.json

YAML

See https://stackoverflow.com/a/35968457



In [ ]:

    
!echo "- {A: 1, B: 2}"     > python_pandas_io_test.yaml
!echo "- {A: 3}" >> python_pandas_io_test.yaml
!echo "- {B: 4}" >> python_pandas_io_test.yaml



In [ ]:

    
!cat python_pandas_io_test.yaml



In [ ]:

    
try:
    import yaml

    with open('python_pandas_io_test.yaml', 'r') as f:
        df = pd.io.json.json_normalize(yaml.load(f))

    print(df)
except:
    pass



In [ ]:

    
!rm python_pandas_io_test.yaml

Other file formats

Many other file formats can be used to import or export data with JSON.

See the following link for more information: http://pandas.pydata.org/pandas-docs/stable/io.html

Select columns



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df

Select a single column

The following instructions return a Series.

Label based selection



In [ ]:

    
df.B



In [ ]:

    
df["B"]



In [ ]:

    
df.loc[:,"B"]

Index based selection



In [ ]:

    
df.iloc[:,1]

Select multiple columns

Label based selection



In [ ]:

    
df[['A','B']]



In [ ]:

    
df.loc[:,['A','B']]

Index based selection



In [ ]:

    
df.iloc[:,0:2]

Select rows



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=["i" + str(i+1) for i in range(9)],
                  columns=['A', 'B', 'C'])
df

Select a single row

The following instructions return a Series.

Label based selection



In [ ]:

    
df.loc["i3"]



In [ ]:

    
df.loc["i3",:]

Index based selection



In [ ]:

    
df.iloc[2]      # Select over index



In [ ]:

    
df.iloc[2,:]      # Select over index

Select multiple rows

Label based selection



In [ ]:

    
df.loc[["i3", "i4"],:]

Index based selection



In [ ]:

    
df.iloc[2:4,:]      # Select over index

Select rows based on values



In [ ]:

    
df.B < 50.



In [ ]:

    
type(df.B < 50.)



In [ ]:

    
df[[True, True, True, True, False, False, False, False, False]]



In [ ]:

    
series_mask = pd.Series({'i1': True,
                         'i2': True,
                         'i3': True,
                         'i4': True,
                         'i5': False,
                         'i6': False,
                         'i7': False,
                         'i8': False,
                         'i9': False})
df[series_mask]



In [ ]:

    
df[df.B < 50.]



In [ ]:

    
df[df['B'] < 50.]



In [ ]:

    
df[(df.A >= 2) & (df.B < 50)]

This can be written:



In [ ]:

    
df.loc[(df.A >= 2) & (df.B < 50)]

This could be written df[df.A >= 2][df.B < 50] but this is a bad practice (named "chained indexing").

"When setting values in a pandas object, care must be taken to avoid what is called chained indexing".

See:

Select rows and columns



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df



In [ ]:

    
df[(df.A >= 2) & (df.B < 50)]



In [ ]:

    
df[(df.B < 20) | (df.B > 50)]



In [ ]:

    
df.loc[(df.B < 20) | (df.B > 50), 'C']



In [ ]:

    
df[(df['A'] >= 2) & (df['B'] < 50)]



In [ ]:

    
df.loc[(df.A >= 2) & (df.B < 50), ['A','B']]

Setting values

Apply a function to selected colunms values



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df



In [ ]:

    
df.B *= 2.
df



In [ ]:

    
df.B = pow(df.B, 2)
df

Apply a function to selected rows values



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df



In [ ]:

    
df[df.B < 50.] *= -1.
df



In [ ]:

    
# df['B'][df['B'] < 50.] = 0    # OK but chain indexing is bad...
# df.A[df.B < 50.] = 0          # OK but chain indexing is bad...
df.loc[df.B < 50., 'A'] = 0
df

WARNING: df[df.B < 50.].A = 0 does NOT work even if df.A[df.B < 50.] and df[df.B < 50.].A seems to produce the same result...

"When setting values in a pandas object, care must be taken to avoid what is called chained indexing".

See:



In [ ]:

    
df.loc[(df.B < 50.) & (df.B > 20), 'C'] = 0
df



In [ ]:

    
df.loc[(df.B < 20) | (df.B > 50), 'C'] = -1
df



In [ ]:

    
df[df.B < 50.] = pow(df[df.B < 50.], 2)
df

Sample rows



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df

With replacement

Draw 3 samples:



In [ ]:

    
df.sample(n=30, replace=True)

Sample 90% of the rows:



In [ ]:

    
df.sample(frac=0.9, replace=True)

Without replacement

Draw 3 samples:



In [ ]:

    
df.sample(n=3)

Sample 90% of the rows:



In [ ]:

    
df.sample(frac=0.9)

Weighted sampling



In [ ]:

    
df.sample(n=30, replace=True, weights=np.arange(len(df)))

Shuffle/permute rows

See:



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df



In [ ]:

    
df = df.sample(frac=1)
df

To reset indexes too:



In [ ]:

    
df = df.sample(frac=1).reset_index(drop=True)
df

Sort a DataFrame



In [ ]:

    
NROWS = 7

col1 = np.arange(1., NROWS, 1)
col2 = np.arange(10., NROWS*10, 10)
col3 = np.arange(100., NROWS*100, 100)
np.random.shuffle(col1)
np.random.shuffle(col2)
np.random.shuffle(col3)

data = np.array([col1,
                 col2,
                 col3]).T
index = np.arange(1, NROWS, 1)
columns = np.array(['A', 'B', 'C'])

np.random.shuffle(index)
np.random.shuffle(data)
np.random.shuffle(columns)

df = pd.DataFrame(data,
                  index=index,
                  columns=columns)
df

Sorting by row index or column label

Sorting by rows



In [ ]:

    
df.sort_index()



In [ ]:

    
df.sort_index(axis=0)    # axis=0 -> sort by row index



In [ ]:

    
df.sort_index(ascending=False)

Sorting by columns



In [ ]:

    
df.sort_index(axis=1)    # axis=1 -> sort by column label



In [ ]:

    
df.sort_index(axis=1, ascending=False)

Sorting by values



In [ ]:

    
df.sort_values(by='B')



In [ ]:

    
df.sort_values(by='B', ascending=False)



In [ ]:

    
df.sort_values(by='B', inplace=True)
df

Missing data

See http://pandas.pydata.org/pandas-docs/stable/10min.html#missing-data



In [ ]:

    
a = np.array([[3, np.nan, 5, np.nan, 7],
              [2, 4, np.nan, 3, 1],
              [3, 4, 5, 6, 1]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df

Get the boolean mask where values are nan



In [ ]:

    
df.isnull()

Drop any rows that have missing data



In [ ]:

    
df.dropna()



In [ ]:

    
df.dropna(how='any')      # but 'any' is the default value...

Drop any rows that have missing data in a given column

See https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-certain-columns-is-nan



In [ ]:

    
df.dropna(subset=['B'])



In [ ]:

    
df.dropna(subset=['B', 'C'])

Drop any columns that have missing data



In [ ]:

    
df.dropna(axis=1)



In [ ]:

    
df.dropna(axis=1, how='any')      # but 'any' is the default value...

Drop any columns that have missing data in a given row



In [ ]:

    
df.dropna(axis=1, subset=[2])



In [ ]:

    
df.dropna(axis=1, subset=[1, 2])

Filling missing data



In [ ]:

    
df.fillna(value=999)

Count the number of NaN values in a given column



In [ ]:

    
df.A.isnull().sum()

Miscellaneous operations on data frames

Transpose of a data frame



In [ ]:

    
data_array = np.array([np.arange(1, 10, 1), np.arange(10, 100, 10), np.arange(100, 1000, 100)]).T
df = pd.DataFrame(data_array,
                  index=np.arange(1, 10, 1),
                  columns=['A', 'B', 'C'])
df



In [ ]:

    
df.T

Merge

See: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge



In [ ]:

    
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 5, 1), np.arange(1000, 5000, 1000), np.arange(10000, 50000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

display(df1)
display(df2)

df = pd.merge(df1, df2, on="ID", suffixes=('_1', '_2'))  #.dropna(how='any')

display(df)

Merge with NaN



In [ ]:

    
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 5, 1), np.arange(1000, 5000, 1000), np.arange(10000, 50000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

df1.iloc[0,2] = np.nan
df1.iloc[1,1] = np.nan
df1.iloc[2,2] = np.nan
df1.iloc[3,1] = np.nan

df2.iloc[0,1] = np.nan
df2.iloc[1,2] = np.nan
df2.iloc[2,1] = np.nan
df2.iloc[3,2] = np.nan

df = pd.merge(df1, df2, on="ID", suffixes=('_1', '_2'))  #.dropna(how='any')

display(df1)
display(df2)
display(df)

Merge with missing rows



In [ ]:

    
a1 = np.array([np.arange(1, 5, 1), np.arange(10, 50, 10), np.arange(100, 500, 100)]).T
df1 = pd.DataFrame(a1,
                   columns=['ID', 'B', 'C'])

a2 = np.array([np.arange(1, 3, 1), np.arange(1000, 3000, 1000), np.arange(10000, 30000, 10000)]).T
df2 = pd.DataFrame(a2,
                   columns=['ID', 'B', 'C'])

display(df1)
display(df2)

print("Left: use only keys from left frame (SQL: left outer join)")
df = pd.merge(df1, df2, on="ID", how="left", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Right: use only keys from right frame (SQL: right outer join)")
df = pd.merge(df1, df2, on="ID", how="right", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Inner: use intersection of keys from both frames (SQL: inner join) [DEFAULT]")
df = pd.merge(df1, df2, on="ID", how="inner", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

print("Outer: use union of keys from both frames (SQL: full outer join)")
df = pd.merge(df1, df2, on="ID", how="outer", suffixes=('_1', '_2'))  #.dropna(how='any')
display(df)

GroupBy

See: http://pandas.pydata.org/pandas-docs/stable/groupby.html



In [ ]:

    
a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df

GroupBy with single key



In [ ]:

    
df.groupby(["A"]).count()



In [ ]:

    
df.groupby(["A"]).sum().B



In [ ]:

    
df.groupby(["A"]).mean().B

GroupBy with multiple keys



In [ ]:

    
df.groupby(["A","B"]).count()

Rolling

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html#pandas-dataframe-rolling

For rolling over a time window, see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling

Basic example

Rolling with an aggregation window of size 2.



In [ ]:

    
s = pd.Series([1., 0., 5., 2., 1.])
print("DATA:")
print(s)

mean_s = s.rolling(2).mean()
print()
print("ROLLING MEAN:")
print(mean_s)

sum_s = s.rolling(2).sum()
print()
print("ROLLING SUM:")
print(sum_s)

min_s = s.rolling(2).min()
print()
print("ROLLING MIN:")
print(min_s)

max_s = s.rolling(2).max()
print()
print("ROLLING MAX:")
print(max_s)

ax = s.plot(figsize=(18, 3), color="blue")
mean_s.plot(color="red", label="mean", ax=ax)
sum_s.plot(color="green", label="sum", style="--", alpha=0.5, ax=ax)
min_s.plot(color="black", label="min", style=":", alpha=0.25, ax=ax)
max_s.plot(color="black", label="max", style=":", alpha=0.25, ax=ax)
ax.legend();

More realistic example



In [ ]:

    
index = np.arange(0, 20, 0.05)
s = pd.Series(np.sin(index))
s = s + np.random.normal(scale=0.4, size=s.shape)

ax = s.plot(figsize=(18, 3))



In [ ]:

    
s.shape

Rolling with an aggregation window of size 20.



In [ ]:

    
s_mean = s.rolling(20).mean()
s_median = s.rolling(20).median()
s_min = s.rolling(20).min()
s_max = s.rolling(20).max()

ax = s_mean.plot(y='duration', figsize=(18, 8), color="red", label="mean", alpha=0.75)
s_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
s_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
s_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(s_min.index, s_min.values, s_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');



In [ ]:

    
s_mean.shape

Pivot



In [ ]:

    
df = pd.DataFrame([["i1", "A", 1],
                   ["i1", "B", 2],
                   ["i2", "A", 3],
                   ["i2", "B", 4]], columns=["foo", "bar", "baz"])
df



In [ ]:

    
df.pivot(index="foo",  columns="bar", values="baz")

Count the number of occurrences of a column value



In [ ]:

    
a = np.array([[3, 5, 5, 5, 7, 7, 7, 7],
              [2, 4, 4, 3, 1, 3, 3, 2],
              [3, 4, 5, 6, 1, 8, 9, 8]]).T
df = pd.DataFrame(a,
                  columns=['A', 'B', 'C'])

df



In [ ]:

    
df.A.value_counts()



In [ ]:

    
df.A.value_counts().plot.bar()

Stats



In [ ]:

    
df = pd.DataFrame(np.random.normal(size=100000))



In [ ]:

    
df.quantile(0.50)



In [ ]:

    
df.quantile([0.25, 0.75])



In [ ]:

    
df.quantile([0.01, 0.001])

Time series

See https://pandas.pydata.org/pandas-docs/stable/timeseries.html#

There are 3 main time related types in Pandas (and the equivalent type for Series and DataFrame indices):

pandas.Timestamp (pandas.DatetimeIndex for indices): pandas equivalent of python's datetime.datetime
pandas.Period (pandas.PeriodIndex for indices): represents a period of time
pandas.Timedelta (pandas.TimedeltaIndex for indices): represents a duration (the difference between two dates or times) i.e. the pandas equivalent of python's datetime.timedelta

A Timestamp is a point in time:



In [ ]:

    
pd.Timestamp(year=2018, month=1, day=1, hour=12, minute=30)

A Period is a range in time (with a "anchored" start time and a "anchored" end time):



In [ ]:

    
p = pd.Period(freq='D', year=2018, month=1, day=1, hour=12, minute=30)
print(p)
print("Start time:", p.start_time)
print("End time:", p.end_time)

A Timedelta is a "floating" duration (i.e. not "anchored" in time):



In [ ]:

    
print(pd.Timedelta(days=5, seconds=30))

ts1 = pd.Timestamp(year=2018, month=1, day=1, hour=12, minute=30)
ts2 = pd.Timestamp(year=2018, month=1, day=2, hour=12, minute=30)
print(ts2 - ts1)

Generate datetime index (with a fixed frequency)

See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html



In [ ]:

    
pd.date_range('2018-01-01', '2018-03-01', freq='D')



In [ ]:

    
pd.date_range('2018-01-01', periods=10, freq='h')



In [ ]:

    
pd.date_range('1/1/2012', periods=10, freq='S')



In [ ]:

    
pd.date_range('3/6/2012 00:00', periods=5, freq='D')



In [ ]:

    
pd.date_range('1/1/2012', periods=5, freq='M')

Generate period index



In [ ]:

    
pd.period_range('2018-01-01', '2018-03-01', freq='D')



In [ ]:

    
pd.date_range('2018-01-01', '2018-03-01', freq='D').to_period()

Plot time series



In [ ]:

    
dti = pd.date_range('2012-01-01 00:00', periods=40, freq='D')
ts = pd.Series(np.random.randint(0, 200, len(dti)), index=dti)



In [ ]:

    
ts.plot();



In [ ]:

    
ts.plot(x_compat=True);



In [ ]:

    
dti = pd.date_range('2018-01-01 00:00', '2018-01-03 00:00', freq='H')
ts = pd.Series(np.random.randint(0, 100, len(dti)), index=dti)

ax = ts.plot(x_compat=True, figsize=(16, 4))    # x_compat is required as matplotlib doesn't understand pandas datetime format -> x_compat=True makes the conversion...

# set monthly locator
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%m-%Y'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

Indexing (select datetime)

See https://pandas.pydata.org/pandas-docs/stable/timeseries.html#indexing



In [ ]:

    
dti = pd.date_range('2012-1-1 00:00', periods=40, freq='D')
ts = pd.Series(np.random.randint(0, 200, len(dti)), index=dti)



In [ ]:

    
ts



In [ ]:

    
ts["2012-01-09"]



In [ ]:

    
ts[datetime.datetime(2012, 1, 9)]



In [ ]:

    
ts[ts.index < "2012-01-09"]



In [ ]:

    
ts[ts.index > "2012-01-20"]



In [ ]:

    
ts["2012-01-09":"2012-01-20"]



In [ ]:

    
ts[datetime.datetime(2012, 1, 9):datetime.datetime(2012, 1, 20)]



In [ ]:

    
ts[ts.index.day <= 3]



In [ ]:

    
ts[ts.index.month == 2]



In [ ]:

    
ts["2012-02"]



In [ ]:

    
ts[ts.index.dayofweek == 1]

Rolling

See https://stackoverflow.com/a/39917862

Basic example

Rolling window size: 1 day



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-2 00:00', '2018-1-2 06:00', '2018-1-2 12:00'])
ts = pd.Series([2., 1., 3., 2., 2., 0.], index=dti)

print("DATA:")
print(ts)

ax = ts.plot(figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00']), ymin=0, ymax=8, color="red", linestyle=":", alpha=0.3);

ts_rw = ts.rolling('D').sum()   # Rolling window size: 1 day

print()
print("MEAN:")
print(ts_rw)

ts_rw.plot(color="red", label="sum", style="*-", alpha=0.75, ax=ax)

ax.legend()
ax.set_xlabel('Time')
ax.grid(True);



In [ ]:

    
ts.rolling('6h').min()



In [ ]:

    
ts.rolling('3h').mean()

More realistic example



In [ ]:

    
dti = pd.date_range('1/1/2018 00:00', periods=6*480, freq='10min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=480/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);



In [ ]:

    
ts.shape



In [ ]:

    
ts_mean = ts.rolling('5H').mean()
ts_median = ts.rolling('5H').median()
ts_min = ts.rolling('5H').min()
ts_max = ts.rolling('5H').max()

ax = ts_mean.plot(y='duration', figsize=(18, 3), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
ts_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(ts_min.index, ts_min.values, ts_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');



In [ ]:

    
ts_mean.shape

Resampling

See https://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling

resample() is a time-based groupby, followed by a reduction method on each of its groups. resample() is similar to using a rolling() operation with a time-based offset.

Basic example



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-2 00:00', '2018-1-2 12:00'])
ts = pd.Series([1., 0., 5., 2., 0.], index=dti)

print("DATA:")
print(ts)

ax = ts.plot(figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00']), ymin=0, ymax=5, color="red", linestyle=":", alpha=0.3);

ts_resampled = ts.resample('D').mean()

print()
print("MEAN:")
print(ts_resampled)

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

ax.legend()
ax.set_xlabel('Time');



In [ ]:

    
ts.resample('6h').min()



In [ ]:

    
ts.resample('3h').sum()

Is there an offset ?

No. resample('D') aggregates values for each day between "00:00:00" and "23:59:59", whatever the first index of ts. See the examples bellow...



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 12:00',
                        '2018-1-2 08:00', '2018-1-2 18:00', '2018-1-2 23:59:59',
                        '2018-1-3 00:00'])
ts = pd.Series([0.,
                10., 20., 30.,
                5.], index=dti)

print("DATA:")
print(ts)

ts_resampled = ts.resample('D').mean()

print()
print("MEAN:")
print(ts_resampled)

# Illustrative plot

ax = ts.plot(x_compat=True, figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00', '2018-1-3 00:00']), ymin=-10, ymax=40, color="red", linestyle=":", linewidth=2, alpha=0.5);
ax.vlines(pd.DatetimeIndex(['2018-1-1 12:00', '2018-1-2 12:00', '2018-1-3 12:00']), ymin=-10, ymax=40, color="green", linestyle=":", linewidth=2, alpha=0.5);

ax.plot(pd.DatetimeIndex(['2018-1-1 12:15', '2018-1-2 11:45']), [40, 40], marker="|", markersize=20, color="green")
ax.plot(pd.DatetimeIndex(['2018-1-2 12:15', '2018-1-3 11:45']), [40, 40], marker="|", markersize=20, color="green")

ax.plot(pd.DatetimeIndex(['2018-1-1 00:15', '2018-1-1 23:45']), [35, 35], marker="|", markersize=20, color="red")
ax.plot(pd.DatetimeIndex(['2018-1-2 00:15', '2018-1-2 23:45']), [35, 35], marker="|", markersize=20, color="red")

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

# set monthly locator
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d-%m-%Y  %H:%M'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

ax.legend()
ax.set_xlabel('Time');



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 01:00',
                        '2018-1-1 05:30', '2018-1-1 07:30',
                        '2018-1-1 10:00'])
ts = pd.Series([0.,
                10., 20.,
                5.], index=dti)

print("DATA:")
print(ts)

ts_resampled = ts.resample('5h').mean()

print()
print("MEAN:")
print(ts_resampled)

# Illustrative plot

ax = ts.plot(x_compat=True, figsize=(18, 3), style="*-", color="blue")
ax.vlines(pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 05:00', '2018-1-1 10:00']), ymin=-10, ymax=40, color="red", linestyle=":", linewidth=2, alpha=0.5);
ax.vlines(pd.DatetimeIndex(['2018-1-1 01:00', '2018-1-1 06:00', '2018-1-1 11:00']), ymin=-10, ymax=40, color="green", linestyle=":", linewidth=2, alpha=0.5);

ax.plot(pd.DatetimeIndex(['2018-1-1 01:05', '2018-1-1 05:55']), [40, 40], marker="|", markersize=20, color="green")
ax.plot(pd.DatetimeIndex(['2018-1-1 06:05', '2018-1-1 10:55']), [40, 40], marker="|", markersize=20, color="green")

ax.plot(pd.DatetimeIndex(['2018-1-1 00:05', '2018-1-1 04:55']), [35, 35], marker="|", markersize=20, color="red")
ax.plot(pd.DatetimeIndex(['2018-1-1 05:05', '2018-1-1 09:55']), [35, 35], marker="|", markersize=20, color="red")

ts_resampled.plot(color="red", style="*-", label="mean", alpha=0.75, ax=ax)

# set monthly locator
ax.xaxis.set_major_locator(mdates.HourLocator(interval=1))
#ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))

# set formatter
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))

# set font and rotation for date tick labels
plt.gcf().autofmt_xdate()

ax.legend()
ax.set_xlabel('Time');

More realistic example



In [ ]:

    
dti = pd.date_range('1/1/2018 00:00', periods=60*480, freq='min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=480/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);



In [ ]:

    
ts.shape



In [ ]:

    
ts_mean = ts.resample('2H').mean()
ts_median = ts.resample('2H').median()
ts_min = ts.resample('2H').min()
ts_max = ts.resample('2H').max()

ax = ts_mean.plot(y='duration', figsize=(18, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_min.plot(ax=ax, color="blue", alpha=0.5, style=":", label="min")
ts_max.plot(ax=ax, color="blue", alpha=0.5, style=":", label="max")

plt.fill_between(ts_min.index, ts_min.values, ts_max.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');



In [ ]:

    
ts_mean.shape

Difference between `rolling()` and `resample()`



In [ ]:

    
rolling_window = '6H'
start = '2018-1-1 00:00'
end = '2018-1-4 00:00'



In [ ]:

    
dti = pd.date_range(start=start, end=end, freq='min')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)
ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.5);



In [ ]:

    
ts2 = ts.rolling(rolling_window).mean()   # Rolling window size: 1 day

ax = ts2.plot(figsize=(18, 3), color="red", alpha=0.75)
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-1, ymax=1, color="red", linestyle=":", alpha=0.5);



In [ ]:

    
ts2 = ts.resample(rolling_window).mean()   # Rolling window size: 1 day

ax = ts2.plot(figsize=(18, 3), color="red", alpha=0.75)
ax.vlines(pd.date_range(start=start, end=end, freq=rolling_window), ymin=-1, ymax=1, color="red", linestyle=":", alpha=0.5);

Group by

See https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html#Digging-into-the-data

Basic example



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:00', '2018-1-2 12:00'])
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)

ax = ts.plot(figsize=(18, 3))

dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-2 00:00'])
ax.vlines(dti, ymin=0, ymax=2, color="red", linestyle=":", alpha=0.3);



In [ ]:

    
ts_mean = ts.groupby(ts.index.time).mean()

print(ts_mean)

ax = ts_mean.plot(y='duration', figsize=(10, 4), color="red", label="mean", alpha=0.75)
ax.legend()
ax.set_xlabel('Time');

Basic example of wrong usage



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:31', '2018-1-2 12:25'])  # Here time is not aligned (non constant frequency)
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)

ax = ts.plot(figsize=(18, 3));



In [ ]:

    
ts_mean = ts.groupby(ts.index.time).mean()

print(ts_mean)

ax = ts_mean.plot(y='duration', figsize=(10, 4), color="red", label="mean", alpha=0.75)
ax.legend()
ax.set_xlabel('Time');

More realistic example



In [ ]:

    
dti = pd.date_range('1/1/2018 00:00', periods=960, freq='h')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=960/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);



In [ ]:

    
ts_mean = ts.groupby(ts.index.time).mean()
ts_median = ts.groupby(ts.index.time).median()
ts_quartile_1 = ts.groupby(ts.index.time).quantile(0.25)
ts_quartile_3 = ts.groupby(ts.index.time).quantile(0.75)

ax = ts_mean.plot(y='duration', figsize=(14, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_quartile_1.plot(ax=ax, color="blue", alpha=0.5, style=":", label="1st quartile")
ts_quartile_3.plot(ax=ax, color="blue", alpha=0.5, style=":", label="3rd quartile")

plt.fill_between(ts_quartile_1.index, ts_quartile_1.values, ts_quartile_3.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');

With Periods



In [ ]:

    
dti = pd.period_range('1/1/2018 00:00', periods=960, freq='h')
ts = pd.Series(np.sin(dti.hour * 2. * np.pi / 24.), index=dti)

ts = ts + np.random.normal(scale=0.4, size=ts.shape)

ax = ts.plot(figsize=(18, 3))
ax.vlines(pd.date_range('1/1/2018 00:00', periods=960/24, freq='D'), ymin=-2, ymax=2, color="red", linestyle=":", alpha=0.3);



In [ ]:

    
ts_mean = ts.groupby(ts.index.start_time.time).mean()                 # Note the ".start_time" here
ts_median = ts.groupby(ts.index.start_time.time).median()             # Note the ".start_time" here
ts_quartile_1 = ts.groupby(ts.index.start_time.time).quantile(0.25)   # Note the ".start_time" here
ts_quartile_3 = ts.groupby(ts.index.start_time.time).quantile(0.75)   # Note the ".start_time" here

ax = ts_mean.plot(y='duration', figsize=(14, 8), color="red", label="mean", alpha=0.75)
ts_median.plot(ax=ax, color="blue", label="median", alpha=0.75)
ts_quartile_1.plot(ax=ax, color="blue", alpha=0.5, style=":", label="1st quartile")
ts_quartile_3.plot(ax=ax, color="blue", alpha=0.5, style=":", label="3rd quartile")

plt.fill_between(ts_quartile_1.index, ts_quartile_1.values, ts_quartile_3.values, facecolor='blue', alpha=0.1)

ax.legend()
ax.set_xlabel('Time');

Round



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 12:00', '2018-1-2 00:31', '2018-1-2 12:25'])  # Here time is not aligned (non constant frequency)
ts = pd.Series([1., 0., 2., 1.], index=dti)

print(ts)



In [ ]:

    
ts.index.round('H')

Count

TODO: is it the cleanest way to do this ?



In [ ]:

    
dti = pd.DatetimeIndex(['2018-1-1 00:00', '2018-1-1 06:45', '2018-1-1 12:00',
                        '2018-1-3 00:00', '2018-1-3 06:00'])

s = pd.Series(np.ones(dti.shape), index=dti)



In [ ]:

    
#dti.groupby(dti.date)     # it works but it returns a dictionary...



In [ ]:

    
s.groupby(dti.date).count().plot.bar(color="blue", alpha=0.5);



In [ ]:

    
s.resample('1d').count().plot.bar(color="blue", alpha=0.5);

Plot

see https://pandas.pydata.org/pandas-docs/stable/visualization.html



In [ ]:

    
#help(df.plot)

Line plot



In [ ]:

    
x = np.arange(0, 6, 0.1)
y1 = np.cos(x)
y2 = np.sin(x)
Y = np.array([y1, y2]).T

df = pd.DataFrame(Y,
                  columns=['cos(x)', 'sin(x)'],
                  index=x)
df.iloc[:10]



In [ ]:

    
df.plot(legend=True)



In [ ]:

    
df.plot.line(legend=True)

Steps



In [ ]:

    
df = pd.DataFrame(np.random.randn(36, 2))

df.plot(drawstyle="steps", linewidth=2)
df.plot(drawstyle="steps-post", linewidth=2);

Bar plot



In [ ]:

    
x = np.arange(0, 6, 0.5)
y1 = np.cos(x)
y2 = np.sin(x)
Y = np.array([y1, y2]).T

df = pd.DataFrame(Y,
                  columns=['cos(x)', 'sin(x)'],
                  index=x)
df

Vertical



In [ ]:

    
df.plot.bar(legend=True)



In [ ]:

    
df.plot.bar(legend=True, stacked=True)

Horizontal



In [ ]:

    
df.plot.barh(legend=True)

Histogram



In [ ]:

    
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.hist(alpha=0.5, bins=100, legend=True);

To normalize the $y$ axis, use density=True:



In [ ]:

    
df.plot.hist(alpha=0.5, bins=100, legend=True, density=True);

Box plot



In [ ]:

    
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.box()

Hexbin plot



In [ ]:

    
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['b'] = df['b'] + np.arange(1000)
df.plot.hexbin(x='a', y='b', gridsize=25)

Kernel Density Estimation (KDE) plot



In [ ]:

    
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.kde()

Area plot



In [ ]:

    
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

df.plot.area()

Pie chart



In [ ]:

    
x = np.random.randint(low=0, high=6, size=(50))

df = pd.DataFrame(x, columns=["A"])
df.A.value_counts()



In [ ]:

    
df.A.value_counts().plot.pie(y="A")

Scatter plot



In [ ]:

    
x1 = np.random.normal(size=(10000))
x2 = np.random.normal(loc=3, scale=2, size=(10000))
X = np.array([x1, x2]).T

df = pd.DataFrame(X, columns=[r'$\mathcal{N}(0,1)$', r'$\mathcal{N}(3,2)$'])

df.plot.scatter(x=r'$\mathcal{N}(0,1)$',
                y=r'$\mathcal{N}(3,2)$',
                alpha=0.2)

Pandas

Import directives

Make data

Series (1D data)

With automatic indices

With defined indices

Get information about a series

Date ranges

Frames (2D data)

With automatic indices and columns

With defined indices and columns

With columns from dict

Get information about a dataframe

More details about dtype

Panels (3D data)

Panel4D and PanelND (ND data)

Export/import data (write/read files)

HDF5 files

CSV files

Write CSV files

Read CSV files

JSON files

Write JSON files

Simplest version

Setting orient="split"

Setting orient="records"

Setting orient="index" (the default option for Series)

Setting orient="columns" (the default option for DataFrame) (for DataFrame only)

Setting orient="values" (for DataFrame only)

Setting more options

Read JSON files

Using orient="split"

Using orient="records"

Using orient="index"

Using orient="columns"

Using orient="values" (for DataFrame only)

Setting more options

YAML

Other file formats

Select columns

Select a single column

Label based selection

Index based selection

Select multiple columns

Label based selection

Index based selection

Select rows

Select a single row

Label based selection

Index based selection

Select multiple rows

Label based selection

Index based selection

Select rows based on values

Select rows and columns

Setting values

Apply a function to selected colunms values

Apply a function to selected rows values

Sample rows

With replacement

Without replacement

Weighted sampling

Shuffle/permute rows

Sort a DataFrame

Sorting by row index or column label

Sorting by rows

Sorting by columns

Sorting by values

Missing data

Get the boolean mask where values are nan

Drop any rows that have missing data

Drop any rows that have missing data in a given column

Drop any columns that have missing data

Drop any columns that have missing data in a given row

Filling missing data

Count the number of NaN values in a given column

Miscellaneous operations on data frames

Transpose of a data frame

Merge

Merge with NaN

Setting `orient`="split"

Setting `orient`="records"

Setting `orient`="index" (the default option for `Series`)

Setting `orient`="columns" (the default option for `DataFrame`) (for `DataFrame` only)

Setting `orient`="values" (for `DataFrame` only)

Using `orient`="split"

Using `orient`="records"

Using `orient`="index"

Using `orient`="columns"

Using `orient`="values" (for `DataFrame` only)

Difference between `rolling()` and `resample()`