In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


Advanced Tables

Why are databases so complex?

  1. Consistency when updating: no duplicate places for information https://en.wikipedia.org/wiki/Database_normalization
  2. Performance https://en.wikipedia.org/wiki/Star_schema

How to Join?

https://en.wikipedia.org/wiki/Join_(SQL)

  1. Two tables can be joined at a time. 'Join' is a binary operator. (Commutations and associations may produce different results, though.)
  2. Tables must have key values that can be matched. Usually one table has a primary key and the other table has a foreign key.

Pandas

  1. Merge, join, and concatenate http://pandas.pydata.org/pandas-docs/version/0.18.1/merging.html#merge-join-and-concatenate
  2. Reshaping and Pivot Tables http://pandas.pydata.org/pandas-docs/version/0.18.1/reshaping.html

In [4]:
import pandas as pd
import numpy as np

In [6]:
help(pd.concat)


Help on function concat in module pandas.tools.merge:

concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
    Concatenate pandas objects along a particular axis with optional set logic
    along the other axes. Can also add a layer of hierarchical indexing on the
    concatenation axis, which may be useful if the labels are the same (or
    overlapping) on the passed axis number
    
    Parameters
    ----------
    objs : a sequence or mapping of Series, DataFrame, or Panel objects
        If a dict is passed, the sorted keys will be used as the `keys`
        argument, unless it is passed, in which case the values will be
        selected (see below). Any None objects will be dropped silently unless
        they are all None in which case a ValueError will be raised
    axis : {0, 1, ...}, default 0
        The axis to concatenate along
    join : {'inner', 'outer'}, default 'outer'
        How to handle indexes on other axis(es)
    join_axes : list of Index objects
        Specific indexes to use for the other n - 1 axes instead of performing
        inner/outer set logic
    verify_integrity : boolean, default False
        Check whether the new concatenated axis contains duplicates. This can
        be very expensive relative to the actual data concatenation
    keys : sequence, default None
        If multiple levels passed, should contain tuples. Construct
        hierarchical index using the passed keys as the outermost level
    levels : list of sequences, default None
        Specific levels (unique values) to use for constructing a
        MultiIndex. Otherwise they will be inferred from the keys
    names : list, default None
        Names for the levels in the resulting hierarchical index
    ignore_index : boolean, default False
        If True, do not use the index values along the concatenation axis. The
        resulting axis will be labeled 0, ..., n - 1. This is useful if you are
        concatenating objects where the concatenation axis does not have
        meaningful indexing information. Note the index values on the other
        axes are still respected in the join.
    copy : boolean, default True
        If False, do not copy data unnecessarily
    
    Notes
    -----
    The keys, levels, and names arguments are all optional
    
    Returns
    -------
    concatenated : type of objects


In [8]:
help(pd.DataFrame.append)


Help on function append in module pandas.core.frame:

append(self, other, ignore_index=False, verify_integrity=False)
    Append rows of `other` to the end of this frame, returning a new
    object. Columns not in this frame are added as new columns.
    
    Parameters
    ----------
    other : DataFrame or Series/dict-like object, or list of these
        The data to append.
    ignore_index : boolean, default False
        If True, do not use the index labels.
    verify_integrity : boolean, default False
        If True, raise ValueError on creating index with duplicates.
    
    Returns
    -------
    appended : DataFrame
    
    Notes
    -----
    If a list of dict/series is passed and the keys are all contained in
    the DataFrame's index, the order of the columns in the resulting
    DataFrame will be unchanged.
    
    See also
    --------
    pandas.concat : General function to concatenate DataFrame, Series
        or Panel objects
    
    Examples
    --------
    
    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df.append(df2)
       A  B
    0  1  2
    1  3  4
    0  5  6
    1  7  8
    
    With `ignore_index` set to True:
    
    >>> df.append(df2, ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8


In [9]:
help(pd.merge)


Help on function merge in module pandas.tools.merge:

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)
    Merge DataFrame objects by performing a database-style join operation by
    columns or indexes.
    
    If joining columns on columns, the DataFrame indexes *will be
    ignored*. Otherwise if joining indexes on indexes or indexes on a column or
    columns, the index will be passed on.
    
    Parameters
    ----------
    left : DataFrame
    right : DataFrame
    how : {'left', 'right', 'outer', 'inner'}, default 'inner'
        * left: use only keys from left frame (SQL: left outer join)
        * right: use only keys from right frame (SQL: right outer join)
        * outer: use union of keys from both frames (SQL: full outer join)
        * inner: use intersection of keys from both frames (SQL: inner join)
    on : label or list
        Field names to join on. Must be found in both DataFrames. If on is
        None and not merging on indexes, then it merges on the intersection of
        the columns by default.
    left_on : label or list, or array-like
        Field names to join on in left DataFrame. Can be a vector or list of
        vectors of the length of the DataFrame to use a particular vector as
        the join key instead of columns
    right_on : label or list, or array-like
        Field names to join on in right DataFrame or vector/list of vectors per
        left_on docs
    left_index : boolean, default False
        Use the index from the left DataFrame as the join key(s). If it is a
        MultiIndex, the number of keys in the other DataFrame (either the index
        or a number of columns) must match the number of levels
    right_index : boolean, default False
        Use the index from the right DataFrame as the join key. Same caveats as
        left_index
    sort : boolean, default False
        Sort the join keys lexicographically in the result DataFrame
    suffixes : 2-length sequence (tuple, list, ...)
        Suffix to apply to overlapping column names in the left and right
        side, respectively
    copy : boolean, default True
        If False, do not copy data unnecessarily
    indicator : boolean or string, default False
        If True, adds a column to output DataFrame called "_merge" with
        information on the source of each row.
        If string, column with information on source of each row will be added to
        output DataFrame, and column will be named value of string.
        Information column is Categorical-type and takes on a value of "left_only"
        for observations whose merge key only appears in 'left' DataFrame,
        "right_only" for observations whose merge key only appears in 'right'
        DataFrame, and "both" if the observation's merge key is found in both.
    
        .. versionadded:: 0.17.0
    
    Examples
    --------
    
    >>> A              >>> B
        lkey value         rkey value
    0   foo  1         0   foo  5
    1   bar  2         1   bar  6
    2   baz  3         2   qux  7
    3   foo  4         3   bar  8
    
    >>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
       lkey  value_x  rkey  value_y
    0  foo   1        foo   5
    1  foo   4        foo   5
    2  bar   2        bar   6
    3  bar   2        bar   8
    4  baz   3        NaN   NaN
    5  NaN   NaN      qux   7
    
    Returns
    -------
    merged : DataFrame
        The output type will the be same as 'left', if it is a subclass
        of DataFrame.


In [10]:
help(pd.DataFrame.combine_first)


Help on function combine_first in module pandas.core.frame:

combine_first(self, other)
    Combine two DataFrame objects and default to non-null values in frame
    calling the method. Result index columns will be the union of the
    respective indexes and columns
    
    Parameters
    ----------
    other : DataFrame
    
    Examples
    --------
    a's values prioritized, use values from b to fill holes:
    
    >>> a.combine_first(b)
    
    
    Returns
    -------
    combined : DataFrame


In [11]:
help(pd.DataFrame.update)


Help on function update in module pandas.core.frame:

update(self, other, join='left', overwrite=True, filter_func=None, raise_conflict=False)
    Modify DataFrame in place using non-NA values from passed
    DataFrame. Aligns on indices
    
    Parameters
    ----------
    other : DataFrame, or object coercible into a DataFrame
    join : {'left'}, default 'left'
    overwrite : boolean, default True
        If True then overwrite values for common keys in the calling frame
    filter_func : callable(1d-array) -> 1d-array<boolean>, default None
        Can choose to replace values other than NA. Return True for values
        that should be updated
    raise_conflict : boolean
        If True, will raise an error if the DataFrame and other both
        contain data in the same place.