version 0.2, May 2016
This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Donne Martin and to Wes McKinney's Python for Data Analysis book.
In [1]:
import pandas as pd
import numpy as np
Create a Series:
In [2]:
ser_1 = pd.Series([1, 1, 2, -3, -5, 8, 13])
ser_1
Out[2]:
Get the array representation of a Series:
In [3]:
ser_1.values
Out[3]:
Index objects are immutable and hold the axis labels and metadata such as names and axis names.
Get the index of the Series:
In [4]:
ser_1.index
Out[4]:
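A minimal sketch of the immutability claim above (illustrative, not part of the original notebook): assigning to an element of an Index raises a TypeError.
idx = ser_1.index
try:
    idx[0] = 99          # Index objects do not support item assignment
except TypeError as err:
    print(err)           # e.g. "Index does not support mutable operations"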
Create a Series with a custom index:
In [5]:
ser_2 = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2
Out[5]:
Get a value from a Series:
In [6]:
ser_2[4] == ser_2['e']
Out[6]:
Get a set of values from a Series by passing in a list:
In [7]:
ser_2[['c', 'a', 'b']]
Out[7]:
Get values greater than 0:
In [8]:
ser_2[ser_2 > 0]
Out[8]:
Scalar multiply:
In [9]:
ser_2 * 2
Out[9]:
Apply a NumPy math function:
In [10]:
np.exp(ser_2)
Out[10]:
A Series is like a fixed-length, ordered dict.
Create a Series by passing in a dict:
In [11]:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = pd.Series(dict_1)
ser_3
Out[11]:
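A small illustration of the dict-like behavior (a sketch, not from the original notebook): membership tests check the index labels, and the Series can be converted back to a dict.
'foo' in ser_3     # True: 'in' checks the index labels, like a dict's keys
ser_3.to_dict()    # {'foo': 100, 'bar': 200, 'baz': 300}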
Reorder a Series by passing in an index (values for indices not found are set to NaN):
In [12]:
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = pd.Series(dict_1, index=index)
ser_4
Out[12]:
Check for NaN with the pandas method:
In [13]:
pd.isnull(ser_4)
Out[13]:
Check for NaN with the Series method:
In [14]:
ser_4.isnull()
Out[14]:
A Series automatically aligns differently indexed data in arithmetic operations:
In [15]:
ser_3 + ser_4
Out[15]:
Name a Series:
In [16]:
ser_4.name = 'foobarbazqux'
Name a Series index:
In [17]:
ser_4.index.name = 'label'
In [18]:
ser_4
Out[18]:
Rename a Series' index in place:
In [19]:
ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4
Out[19]:
A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. DataFrames have both row and column indices and are analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy method.
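A minimal sketch of the view-versus-copy point above (the DataFrame here is illustrative, not from the notebook):
df_tmp = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
col_view = df_tmp['x']          # may be a view of the underlying data
col_copy = df_tmp['x'].copy()   # independent copy; modifying it leaves df_tmp unchanged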
Create a DataFrame:
In [20]:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = pd.DataFrame(data_1)
df_1
Out[20]:
Create a DataFrame specifying a sequence of columns:
In [21]:
df_2 = pd.DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2
Out[21]:
As with Series, columns that are not present in the data are filled with NaN:
In [22]:
df_3 = pd.DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3
Out[22]:
Retrieve a column by key, returning a Series:
In [23]:
df_3['state']
Out[23]:
Retrieve a column by attribute, returning a Series:
In [24]:
df_3.year
Out[24]:
Retrieve a row by position:
In [25]:
df_3.iloc[0]
Out[25]:
Update a column by assignment:
In [26]:
df_3['unempl'] = np.arange(5)
df_3
Out[26]:
Assign a Series to a column (note that if you assign a list or array, its length must match the DataFrame; a Series is instead aligned on the index):
In [27]:
unempl = pd.Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3
Out[27]:
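To illustrate the length requirement mentioned above (a sketch; the values are made up): assigning a list that is shorter than the DataFrame raises a ValueError, whereas the Series above is simply aligned on the index.
try:
    df_3['unempl'] = [6.0, 6.0]   # only 2 values for a 5-row DataFrame
except ValueError as err:
    print(err)                    # e.g. "Length of values does not match length of index"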
Assign to a column that doesn't exist to create a new column:
In [28]:
df_3['state_dup'] = df_3['state']
df_3
Out[28]:
Delete a column:
In [29]:
del df_3['state_dup']
df_3
Out[29]:
Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
In [30]:
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = pd.DataFrame(pop)
df_4
Out[30]:
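To illustrate the "unless an explicit index is specified" clause above (the extra year is illustrative):
pd.DataFrame(pop, index=[2014, 2015, 2016])   # 2016 is absent from the inner dicts, so its row is NaN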
Transpose the DataFrame:
In [31]:
df_4.T
Out[31]:
Create a DataFrame from a dict of Series:
In [32]:
data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = pd.DataFrame(data_2)
df_5
Out[32]:
Set the DataFrame index name:
In [33]:
df_5.index.name = 'year'
df_5
Out[33]:
Set the DataFrame columns name:
In [34]:
df_5.columns.name = 'state'
df_5
Out[34]:
Return the data contained in a DataFrame as a 2D ndarray:
In [35]:
df_5.values
Out[35]:
If the columns have different dtypes, the 2D ndarray's dtype will accommodate all of the columns:
In [36]:
df_3.values
Out[36]:
Reindexing creates a new object with the data conformed to a new index. Any missing values are set to NaN.
In [37]:
df_3
Out[37]:
Reindexing rows returns a new frame with the specified index:
In [38]:
df_3.reindex(list(reversed(range(0, 6))))
Out[38]:
Missing values can be set to something other than NaN:
In [39]:
df_3.reindex(range(6), fill_value=0)
Out[39]:
Fill values when reindexing ordered data such as a time series:
In [40]:
ser_5 = pd.Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
In [41]:
ser_5.reindex(range(5), method='ffill')
Out[41]:
In [42]:
ser_5.reindex(range(5), method='bfill')
Out[42]:
Reindex columns:
In [43]:
df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])
Out[43]:
Reindex rows and columns, filling missing rows with 0:
In [44]:
df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])
Out[44]:
Reindex rows and columns, assigning the result to a new DataFrame:
In [45]:
df_6 = df_3.reindex(index=range(0, 7), columns=['state', 'pop', 'unempl', 'year'])
df_6
Out[45]:
Drop rows from a Series or DataFrame:
In [46]:
df_7 = df_6.drop([0, 1])
df_7
Out[46]:
Drop columns from a DataFrame:
In [47]:
df_7 = df_7.drop('unempl', axis=1)
df_7
Out[47]:
Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.
In [48]:
ser_2
Out[48]:
Select a value from a Series:
In [49]:
ser_2[0] == ser_2['a']
Out[49]:
Select a slice from a Series:
In [50]:
ser_2[1:4]
Out[50]:
Select specific values from a Series:
In [51]:
ser_2[['b', 'c', 'd']]
Out[51]:
Select from a Series based on a filter:
In [52]:
ser_2[ser_2 > 0]
Out[52]:
Select a slice from a Series with labels (note the end point is inclusive):
In [53]:
ser_2['a':'c']
Out[53]:
Assign to a Series slice (note the end point is inclusive):
In [54]:
ser_2['a':'b'] = 0
ser_2
Out[54]:
Pandas supports indexing into a DataFrame.
In [55]:
df_6
Out[55]:
Select specified columns from a DataFrame:
In [56]:
df_6[['pop', 'unempl']]
Out[56]:
Select a slice from a DataFrame:
In [57]:
df_6[:2]
Out[57]:
Select from a DataFrame based on a filter:
In [58]:
df_6[df_6['pop'] > 5]
Out[58]:
Perform a scalar comparison on a DataFrame:
In [59]:
df_6 > 5
Out[59]:
Perform a scalar comparison on a DataFrame and retain the values that pass the filter:
In [60]:
df_6[df_6 > 5]
Out[60]:
Select a slice of rows from a DataFrame (note the end point is inclusive):
In [61]:
df_6.loc[2:3]
Out[61]:
Select a slice of rows from a specific column of a DataFrame:
In [62]:
df_6.loc[0:2, 'pop']
Out[62]:
Select rows based on a comparison on a specific column:
In [63]:
df_6.loc[df_6.unempl > 5.0]
Out[63]:
Adding Series objects with different indexes results in the union of the indexes, with NaN for labels that do not overlap:
In [64]:
np.random.seed(0)
ser_6 = pd.Series(np.random.randn(5),
                  index=['a', 'b', 'c', 'd', 'e'])
ser_6
Out[64]:
In [65]:
np.random.seed(1)
ser_7 = pd.Series(np.random.randn(5),
                  index=['a', 'c', 'e', 'f', 'g'])
ser_7
Out[65]:
In [66]:
ser_6 + ser_7
Out[66]:
Set a fill value instead of NaN for indices that do not overlap:
In [67]:
ser_6.add(ser_7, fill_value=0)
Out[67]:
Adding DataFrame objects with different row or column indexes results in the union of those indexes, with NaN for labels that do not overlap:
In [68]:
np.random.seed(0)
df_8 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                    columns=['a', 'b', 'c'])
df_8
Out[68]:
In [69]:
np.random.seed(1)
df_9 = pd.DataFrame(np.random.rand(9).reshape((3, 3)),
                    columns=['b', 'c', 'd'])
df_9
Out[69]:
In [70]:
df_8 + df_9
Out[70]:
Set a fill value instead of NaN for indices that do not overlap:
In [71]:
df_10 = df_8.add(df_9, fill_value=0)
df_10
Out[71]:
Like NumPy, pandas supports arithmetic operations between DataFrames and Series.
Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
In [72]:
ser_8 = df_10.iloc[0]
df_11 = df_10 - ser_8
df_11
Out[72]:
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and taking the union of the indices that do not match:
In [73]:
ser_9 = pd.Series(range(3), index=['a', 'd', 'e'])
ser_9
Out[73]:
In [74]:
df_11 - ser_9
Out[74]:
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
In [75]:
df_10
Out[75]:
In [76]:
ser_10 = pd.Series([100, 200, 300])
ser_10
Out[76]:
In [77]:
df_10.sub(ser_10, axis=0)
Out[77]:
NumPy ufuncs (element-wise array methods) operate on pandas objects:
In [78]:
df_11 = np.abs(df_11)
df_11
Out[78]:
Apply a function on 1D arrays to each column:
In [79]:
func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)
Out[79]:
Apply a function on 1D arrays to each row:
In [80]:
df_11.apply(func_1, axis=1)
Out[80]:
Apply a function and return a DataFrame:
In [81]:
func_2 = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)
Out[81]:
Apply an element-wise Python function to a DataFrame:
In [82]:
func_3 = lambda x: '%.2f' %x
df_11.applymap(func_3)
Out[82]:
Apply an element-wise Python function to a Series:
In [83]:
df_11['a'].map(func_3)
Out[83]:
In [84]:
ser_4
Out[84]:
Sort a Series by its index:
In [85]:
ser_4.sort_index()
Out[85]:
Sort a Series by its values:
In [86]:
ser_4.sort_values()
Out[86]:
In [87]:
df_12 = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=['three', 'one', 'two'],
                     columns=['c', 'a', 'b', 'd'])
df_12
Out[87]:
Sort a DataFrame by its index:
In [88]:
df_12.sort_index()
Out[88]:
Sort a DataFrame by columns in descending order:
In [89]:
df_12.sort_index(axis=1, ascending=False)
Out[89]:
Sort a DataFrame's values by column:
In [90]:
df_12.sort_values(by=['d', 'c'])
Out[90]:
Ranking is similar to numpy.argsort, except that ties are broken by assigning each tied group the mean of the ranks it spans (for example, two values tied for ranks 4 and 5 each receive 4.5):
In [91]:
ser_11 = pd.Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11
Out[91]:
In [92]:
ser_11.rank()
Out[92]:
Rank a Series according to the order in which values appear in the data:
In [93]:
ser_11.rank(method='first')
Out[93]:
Rank a Series in descending order, using the maximum rank for the group:
In [94]:
ser_11.rank(ascending=False, method='max')
Out[94]:
DataFrames can rank over rows or columns.
In [95]:
df_13 = pd.DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                      'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                      'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
Out[95]:
Rank a DataFrame over rows:
In [96]:
df_13.rank()
Out[96]:
Rank a DataFrame over columns:
In [97]:
df_13.rank(axis=1)
Out[97]:
Labels do not have to be unique in Pandas:
In [98]:
ser_12 = pd.Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12
Out[98]:
In [99]:
ser_12.index.is_unique
Out[99]:
Select Series elements:
In [100]:
ser_12['foo']
Out[100]:
Select DataFrame elements:
In [101]:
df_14 = pd.DataFrame(np.random.randn(5, 4),
                     index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
Out[101]:
In [102]:
df_14.loc['bar']
Out[102]:
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.
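A minimal sketch of the NaN-skipping behavior (the frame below is illustrative, not part of the notebook):
df_nan = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                       'y': [2.0, 5.0, np.nan]})
df_nan.sum()               # NaN values are skipped: x -> 4.0, y -> 7.0
df_nan.sum(skipna=False)   # keep NaN: both sums become NaN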
In [103]:
df_15 = pd.DataFrame(np.random.randn(10, 3),
                     columns=['a', 'b', 'c'])
df_15['cat1'] = (np.random.rand(10) * 3).round(0)
df_15['cat2'] = (np.random.rand(10)).round(0)
In [104]:
df_15
Out[104]:
In [105]:
df_15.sum()
Out[105]:
In [106]:
df_15.sum(axis=1)
Out[106]:
In [107]:
df_15.mean(axis=0)
Out[107]:
In [108]:
df_15.max()
Out[108]:
In [109]:
df_15.idxmax()
Out[109]:
In [110]:
df_15['a'].describe()
Out[110]:
In [111]:
df_15['cat1'].value_counts()
Out[111]:
In [112]:
pd.pivot_table(df_15, index='cat1', aggfunc=np.mean)
Out[112]:
In [113]:
pd.pivot_table(df_15, index='cat1', columns='cat2', values='b', aggfunc=np.sum)
Out[113]: