Introduction to pandas, the Python Data Analysis Library

What is pandas, and why should I use it?

Wes McKinney, primary developer of pandas:

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In more tangible terms, pandas is a Python package that streamlines common data analysis tasks and makes large datasets easier to handle. The best way to demonstrate this is by example. For these examples, understanding the exact syntax of the code matters less than understanding what the code is doing.
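
As a first taste of the kind of task pandas streamlines, here is a minimal sketch that builds a small time-indexed Series and summarizes it. The temperature values and the variable names are made up for illustration; they do not come from the tutorial's netCDF file.

```python
import pandas as pd

# Hypothetical daily air temperatures (deg C); the values are invented for illustration.
dates = pd.date_range("2015-01-01", periods=6, freq="D")
temperature = pd.Series(
    [2.5, 3.1, 1.8, 0.4, -1.2, 0.9],
    index=dates,
    name="air_temperature",
)

# Summary statistics (count, mean, std, min, quartiles, max) in one call.
print(temperature.describe())

# Label of the minimum value, i.e. the date of the coldest day.
coldest_day = temperature.idxmin()

# Date-based slicing on a DatetimeIndex includes both endpoints.
mean_first_three = temperature.loc["2015-01-01":"2015-01-03"].mean()

print(coldest_day, mean_first_three)
```

The point is less the syntax than the shape of the work: the data carry their own time index, so selecting and summarizing by date takes a line or two.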

The examples in this tutorial series use a netCDF dataset, which can be downloaded by clicking here. It is a surface meteorology file containing a number of variables that could be interesting to study. In addition, all the notebooks may be found here; they are the files with the *.ipynb extension.

Outline

  1. Introduction (this page)
  2. Series Objects - pandas.Series()

    This section walks through reading and plotting a simple time series from a netCDF file using pandas. General topics include reading a netCDF file, writing a function, working with datetime objects, and plotting.

  3. DataFrame Objects - pandas.DataFrame()

    This section is not yet written, but will cover working with multiple variables and datasets. General topics will include joining and concatenating frames from different sources, working with multiple Series objects, and plotting multiple variables.

  4. Importing Data - pandas.read_csv() and pandas.read_table()

    This section is also not yet written. It will cover importing data from text, tabular, and CSV files natively using pandas.
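
Until the DataFrame and Importing Data sections are written, the following rough sketch previews two of the topics listed above: reading CSV text with pandas.read_csv() and concatenating frames from different sources. The CSV contents, column names, and variable names are hypothetical, and in-memory strings stand in for files on disk.

```python
import io

import pandas as pd

# Two small CSV "files" held in memory; the contents are invented for illustration.
csv_a = io.StringIO("time,temperature\n2015-01-01,2.5\n2015-01-02,3.1\n")
csv_b = io.StringIO("time,temperature\n2015-01-03,1.8\n2015-01-04,0.4\n")

# Parse the "time" column as dates and use it as the row index.
frame_a = pd.read_csv(csv_a, parse_dates=["time"], index_col="time")
frame_b = pd.read_csv(csv_b, parse_dates=["time"], index_col="time")

# Stack the two frames row-wise into a single DataFrame.
combined = pd.concat([frame_a, frame_b])
print(combined)
```

With real data, the io.StringIO objects would simply be replaced by file paths.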


In [1]:
import pandas
help(pandas)


Help on package pandas:

NAME
    pandas

FILE
    /Users/jstemm/anaconda/lib/python2.7/site-packages/pandas/__init__.py

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    See http://pandas.sourceforge.net for full documentation. Otherwise, see the
    docstrings of the various objects in the pandas namespace:
    
    Series
    DataFrame
    Panel
    Index
    DatetimeIndex
    HDFStore
    bdate_range
    date_range
    read_csv
    read_fwf
    read_table
    ols

PACKAGE CONTENTS
    _sparse
    _testing
    algos
    compat (package)
    computation (package)
    core (package)
    hashtable
    index
    info
    io (package)
    json
    lib
    msgpack
    parser
    rpy (package)
    sandbox (package)
    sparse (package)
    stats (package)
    tests (package)
    tools (package)
    tseries (package)
    tslib
    util (package)
    version

SUBMODULES
    datetools
    offsets

DATA
    IndexSlice = <pandas.core.indexing._IndexSlice object>
    NaT = NaT
    __docformat__ = 'restructuredtext'
    __version__ = '0.15.2'
    describe_option = <pandas.core.config.CallableDynamicDoc object>
    get_option = <pandas.core.config.CallableDynamicDoc object>
    options = <pandas.core.config.DictWrapper object>
    plot_params = {'xaxis.compat': False}
    reset_option = <pandas.core.config.CallableDynamicDoc object>
    set_option = <pandas.core.config.CallableDynamicDoc object>

VERSION
    0.15.2
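
The help output above was produced with pandas 0.15.2, so it may not match what a different installation reports. As a small sketch, the installed version can be checked directly, along with whether a few of the names listed above are still available. Some of them, such as Panel and ols, are not present in later releases.

```python
import pandas

# Version string of the pandas installation in use.
installed_version = pandas.__version__
print(installed_version)

# A few names taken from the help output above; availability depends on the release.
listed_names = ["Series", "DataFrame", "Panel", "read_csv", "ols"]
available = {name: hasattr(pandas, name) for name in listed_names}
print(available)
```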