Pandas_QuickStart

Origin from http://pandas.pydata.org/pandas-docs/stable/
by openthings@163.com, 2016-04.

6.1 Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1,3,5,np.nan,6,8])
s


Out[1]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:


In [2]:
dates = pd.date_range('20130101', periods=6)
dates


Out[2]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [7]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df


Out[7]:
A B C D
2013-01-01 -1.334013 -0.348297 0.388654 -0.225967
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367
2013-01-04 0.034788 -0.677221 0.120449 0.794341
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047
2013-01-06 0.125631 -0.116653 -0.544575 -1.578785

Creating a DataFrame by passing a dict of objects that can be converted to series-like.


In [8]:
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })

df2


Out[8]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo

In [9]:
df2.dtypes


Out[9]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [13]: df2.<TAB>

In [11]:
df2.


Out[11]:
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

As you can see, the columns A, B, C, and D are automatically tab completed. E is there as well; the rest of the attributes have been truncated for brevity.

6.2 Viewing Data


In [14]:
df.head()


Out[14]:
A B C D
2013-01-01 -1.334013 -0.348297 0.388654 -0.225967
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367
2013-01-04 0.034788 -0.677221 0.120449 0.794341
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047

In [15]:
df.tail(3)


Out[15]:
A B C D
2013-01-04 0.034788 -0.677221 0.120449 0.794341
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047
2013-01-06 0.125631 -0.116653 -0.544575 -1.578785

In [16]:
df.index


Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [17]:
df.values


Out[17]:
array([[-1.33401275, -0.34829657,  0.38865407, -0.22596701],
       [-0.13997444, -1.34778853,  0.81707707,  0.19247685],
       [-1.0827386 , -0.5441047 , -1.42388302, -1.24736743],
       [ 0.03478847, -0.67722051,  0.12044917,  0.7943414 ],
       [ 0.42854678, -0.61015602, -0.95089113, -0.0580473 ],
       [ 0.12563068, -0.11665286, -0.54457518, -1.57878468]])

In [18]:
df.describe()


Out[18]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.327960 -0.607370 -0.265528 -0.353891
std 0.710887 0.415940 0.853035 0.896606
min -1.334013 -1.347789 -1.423883 -1.578785
25% -0.847048 -0.660454 -0.849312 -0.992017
50% -0.052593 -0.577130 -0.212063 -0.142007
75% 0.102920 -0.397249 0.321603 0.129846
max 0.428547 -0.116653 0.817077 0.794341

In [19]:
df.T


Out[19]:
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A -1.334013 -0.139974 -1.082739 0.034788 0.428547 0.125631
B -0.348297 -1.347789 -0.544105 -0.677221 -0.610156 -0.116653
C 0.388654 0.817077 -1.423883 0.120449 -0.950891 -0.544575
D -0.225967 0.192477 -1.247367 0.794341 -0.058047 -1.578785

In [20]:
df.sort_index(axis=1, ascending=False)


Out[20]:
D C B A
2013-01-01 -0.225967 0.388654 -0.348297 -1.334013
2013-01-02 0.192477 0.817077 -1.347789 -0.139974
2013-01-03 -1.247367 -1.423883 -0.544105 -1.082739
2013-01-04 0.794341 0.120449 -0.677221 0.034788
2013-01-05 -0.058047 -0.950891 -0.610156 0.428547
2013-01-06 -1.578785 -0.544575 -0.116653 0.125631

In [21]:
df.sort_values(by='B')


Out[21]:
A B C D
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-04 0.034788 -0.677221 0.120449 0.794341
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367
2013-01-01 -1.334013 -0.348297 0.388654 -0.225967
2013-01-06 0.125631 -0.116653 -0.544575 -1.578785

6.3 Selection

Getting


In [22]:
df['A']


Out[22]:
2013-01-01   -1.334013
2013-01-02   -0.139974
2013-01-03   -1.082739
2013-01-04    0.034788
2013-01-05    0.428547
2013-01-06    0.125631
Freq: D, Name: A, dtype: float64

In [23]:
df[0:3]


Out[23]:
A B C D
2013-01-01 -1.334013 -0.348297 0.388654 -0.225967
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367

In [24]:
df['20130102':'20130104']


Out[24]:
A B C D
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367
2013-01-04 0.034788 -0.677221 0.120449 0.794341

6.3.2 Selection by Label

For getting a cross section using a label


In [25]:
df.loc[dates[0]]


Out[25]:
A   -1.334013
B   -0.348297
C    0.388654
D   -0.225967
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label


In [26]:
df.loc[:,['A','B']]


Out[26]:
A B
2013-01-01 -1.334013 -0.348297
2013-01-02 -0.139974 -1.347789
2013-01-03 -1.082739 -0.544105
2013-01-04 0.034788 -0.677221
2013-01-05 0.428547 -0.610156
2013-01-06 0.125631 -0.116653

Showing label slicing, both endpoints are included


In [27]:
df.loc['20130102':'20130104',['A','B']]


Out[27]:
A B
2013-01-02 -0.139974 -1.347789
2013-01-03 -1.082739 -0.544105
2013-01-04 0.034788 -0.677221

Reduction in the dimensions of the returned object


In [30]:
df.loc['20130102',['A','B']]


Out[30]:
A   -0.139974
B   -1.347789
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value


In [31]:
df.loc[dates[0],'A']


Out[31]:
-1.3340127475498547

For getting fast access to a scalar (equiv to the prior method)


In [32]:
df.at[dates[0],'A']


Out[32]:
-1.3340127475498547

6.3.3 Selection by Position

See more in Selection by Position Select via the position of the passed integers


In [33]:
df.iloc[3]


Out[33]:
A    0.034788
B   -0.677221
C    0.120449
D    0.794341
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to numpy/python


In [34]:
df.iloc[3:5,0:2]


Out[34]:
A B
2013-01-04 0.034788 -0.677221
2013-01-05 0.428547 -0.610156

By lists of integer position locations, similar to the numpy/python style


In [35]:
df.iloc[[1,2,4],[0,2]]


Out[35]:
A C
2013-01-02 -0.139974 0.817077
2013-01-03 -1.082739 -1.423883
2013-01-05 0.428547 -0.950891

For slicing rows explicitly


In [36]:
df.iloc[1:3,:]


Out[36]:
A B C D
2013-01-02 -0.139974 -1.347789 0.817077 0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367

For slicing columns explicitly


In [37]:
df.iloc[:,1:3]


Out[37]:
B C
2013-01-01 -0.348297 0.388654
2013-01-02 -1.347789 0.817077
2013-01-03 -0.544105 -1.423883
2013-01-04 -0.677221 0.120449
2013-01-05 -0.610156 -0.950891
2013-01-06 -0.116653 -0.544575

For getting a value explicitly


In [39]:
df.iloc[1,1]


Out[39]:
-1.3477885295869219

For getting fast access to a scalar (equiv to the prior method)


In [40]:
df.iat[1,1]


Out[40]:
-1.3477885295869219

6.3.4 Boolean Indexing

Using a single column’s values to select data.


In [41]:
df[df.A > 0]


Out[41]:
A B C D
2013-01-04 0.034788 -0.677221 0.120449 0.794341
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047
2013-01-06 0.125631 -0.116653 -0.544575 -1.578785

A where operation for getting.


In [42]:
df[df > 0]


Out[42]:
A B C D
2013-01-01 NaN NaN 0.388654 NaN
2013-01-02 NaN NaN 0.817077 0.192477
2013-01-03 NaN NaN NaN NaN
2013-01-04 0.034788 NaN 0.120449 0.794341
2013-01-05 0.428547 NaN NaN NaN
2013-01-06 0.125631 NaN NaN NaN

Using the isin() method for filtering:


In [43]:
df2 = df.copy()

添加一列。


In [44]:
df2['E'] = ['one', 'one','two','three','four','three']

In [45]:
df2


Out[45]:
A B C D E
2013-01-01 -1.334013 -0.348297 0.388654 -0.225967 one
2013-01-02 -0.139974 -1.347789 0.817077 0.192477 one
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367 two
2013-01-04 0.034788 -0.677221 0.120449 0.794341 three
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047 four
2013-01-06 0.125631 -0.116653 -0.544575 -1.578785 three

In [46]:
df2[df2['E'].isin(['two','four'])]


Out[46]:
A B C D E
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367 two
2013-01-05 0.428547 -0.610156 -0.950891 -0.058047 four

6.3.5 Setting

Setting a new column automatically aligns the data by the indexes


In [48]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1


Out[48]:
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

Setting values by position


In [49]:
df.iat[0,1] = 0

Setting by assigning with a numpy array


In [50]:
df.loc[:,'D'] = np.array([5] * len(df))

The result of the prior setting operations


In [51]:
df


Out[51]:
A B C D
2013-01-01 -1.334013 0.000000 0.388654 5
2013-01-02 -0.139974 -1.347789 0.817077 5
2013-01-03 -1.082739 -0.544105 -1.423883 5
2013-01-04 0.034788 -0.677221 0.120449 5
2013-01-05 0.428547 -0.610156 -0.950891 5
2013-01-06 0.125631 -0.116653 -0.544575 5

A where operation with setting.


In [52]:
df2 = df.copy()

In [53]:
df2[df2 > 0] = -df2

In [54]:
df2


Out[54]:
A B C D
2013-01-01 -1.334013 0.000000 -0.388654 -5
2013-01-02 -0.139974 -1.347789 -0.817077 -5
2013-01-03 -1.082739 -0.544105 -1.423883 -5
2013-01-04 -0.034788 -0.677221 -0.120449 -5
2013-01-05 -0.428547 -0.610156 -0.950891 -5
2013-01-06 -0.125631 -0.116653 -0.544575 -5

In [ ]: