Pandas_QuickStart

Adapted from http://pandas.pydata.org/pandas-docs/stable/
by openthings@163.com, 2016-04.

6.1 Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

s = pd.Series([1,3,5,np.nan,6,8])
s









    Out[1]:





0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
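
The float64 dtype in the output above follows from the np.nan entry: NaN is a floating-point value, so pandas stores the whole Series as floats. A minimal sketch of the difference, using illustrative variable names (s_int, s_nan) that are not part of the original:

```python
import numpy as np
import pandas as pd

# Without missing values, integers keep an integer dtype.
s_int = pd.Series([1, 3, 5])

# A single np.nan forces the values to be stored as floats.
s_nan = pd.Series([1, 3, 5, np.nan])

print(s_int.dtype, s_nan.dtype)
print(s_nan.isna().sum())  # number of missing entries
```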

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns:



In [2]:

    
dates = pd.date_range('20130101', periods=6)
dates









    Out[2]:





DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')



In [7]:

    
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
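
Because np.random.randn draws new values on every run, the numbers shown in this document will not be reproduced exactly. One possible way to get repeatable values is a seeded generator; the seed 0 and the name df_demo below are arbitrary choices for illustration, not part of the original:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for repeatability
dates = pd.date_range("20130101", periods=6)

# Same shape, index and column labels as df above, but with repeatable values.
df_demo = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))
print(df_demo.shape)
```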

Creating a DataFrame by passing a dict of objects that can be converted to Series-like structures:



In [8]:

    
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })

df2



In [9]:

    
df2.dtypes









    Out[9]:





A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
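
In the dict above, scalars such as 1. and 'foo' are repeated for every row, while the Series and array entries determine how many rows there are. A small sketch of that behaviour, assuming a hypothetical frame named df_small:

```python
import pandas as pd

df_small = pd.DataFrame({
    "x": 1.0,                      # scalar, repeated for every row
    "y": pd.Series([10, 20, 30]),  # fixes the index to 0, 1, 2
    "z": "foo",                    # scalar string, repeated for every row
})

print(df_small)
print(df_small.dtypes)  # one dtype per column
```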

If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

In [13]: df2.<TAB>



In [11]:

    
df2.A









    Out[11]:





0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

The columns A, B, C, and D are offered as tab completions, and E is there as well; the remaining attributes are omitted here for brevity. The output above is the result of completing to df2.A.
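
Attribute access of this kind is a shorthand for bracket access and applies to column names that are valid Python identifiers. A minimal sketch, using a hypothetical frame with one column name that contains a space:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0], "my col": [3, 4]})

# Attribute access and bracket access return the same column.
same = frame.A.equals(frame["A"])
print(same)

# "my col" cannot be written as an attribute, so brackets are needed.
print(frame["my col"].tolist())
```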

6.2 Viewing Data



In [14]:

    
df.head()



In [15]:

    
df.tail(3)
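
The two calls above show the first and last rows of the frame. As a small sketch on a hypothetical ten-row frame: head() returns five rows when no count is given, and tail(3) returns the last three.

```python
import pandas as pd

frame = pd.DataFrame({"n": range(10)})

top = frame.head()      # first 5 rows by default
bottom = frame.tail(3)  # last 3 rows

print(top)
print(bottom)
```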



In [16]:

    
df.index









    Out[16]:





DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')



In [17]:

    
df.values









    Out[17]:





array([[-1.33401275, -0.34829657,  0.38865407, -0.22596701],
       [-0.13997444, -1.34778853,  0.81707707,  0.19247685],
       [-1.0827386 , -0.5441047 , -1.42388302, -1.24736743],
       [ 0.03478847, -0.67722051,  0.12044917,  0.7943414 ],
       [ 0.42854678, -0.61015602, -0.95089113, -0.0580473 ],
       [ 0.12563068, -0.11665286, -0.54457518, -1.57878468]])
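
Recent pandas documentation recommends DataFrame.to_numpy() over the .values attribute for getting the underlying data as a NumPy array. A short sketch on a hypothetical two-column frame, showing that both give the same array here:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

arr = frame.to_numpy()  # NumPy array of the frame's values
print(type(arr), arr.shape)
print(np.array_equal(arr, frame.values))
```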



In [18]:

    
df.describe()



In [19]:

    
df.T









    Out[19]:






  
    
      
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -1.334013   -0.139974   -1.082739    0.034788    0.428547    0.125631
B   -0.348297   -1.347789   -0.544105   -0.677221   -0.610156   -0.116653
C    0.388654    0.817077   -1.423883    0.120449   -0.950891   -0.544575
D   -0.225967    0.192477   -1.247367    0.794341   -0.058047   -1.578785



In [20]:

    
df.sort_index(axis=1, ascending=False)



In [21]:

    
df.sort_values(by='B')
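
The two sorting calls above differ in what they order: sort_index(axis=1, ascending=False) reorders the columns by label, while sort_values(by='B') reorders the rows by the values in column B. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [3, 1, 2], "B": [0.5, -1.0, 0.0]},
    index=["x", "y", "z"],
)

by_cols = frame.sort_index(axis=1, ascending=False)  # columns ordered B, A
by_b = frame.sort_values(by="B")                     # rows ordered by B, ascending

print(by_cols)
print(by_b)
```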

6.3 Selection

6.3.1 Getting



In [22]:

    
df['A']









    Out[22]:





2013-01-01   -1.334013
2013-01-02   -0.139974
2013-01-03   -1.082739
2013-01-04    0.034788
2013-01-05    0.428547
2013-01-06    0.125631
Freq: D, Name: A, dtype: float64



In [23]:

    
df[0:3]



In [24]:

    
df['20130102':'20130104']
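
The three forms above behave differently: a column name in brackets returns a single column as a Series, an integer slice selects rows by position, and a slice of date strings selects rows by label with the end date included. A sketch on a hypothetical frame with the same date index:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6), "B": range(6, 12)}, index=dates)

col = frame["A"]                        # single column -> Series
first_rows = frame[0:3]                 # integer slice -> rows at positions 0, 1, 2
by_date = frame["20130102":"20130104"]  # label slice -> 3 rows, end date included

print(type(col).__name__, len(first_rows), len(by_date))
```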

6.3.2 Selection by Label

For getting a cross section using a label



In [25]:

    
df.loc[dates[0]]









    Out[25]:





A   -1.334013
B   -0.348297
C    0.388654
D   -0.225967
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label



In [26]:

    
df.loc[:,['A','B']]

Showing label slicing; both endpoints are included:



In [27]:

    
df.loc['20130102':'20130104',['A','B']]
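
Including both endpoints is specific to label-based slicing; positional slicing with iloc excludes the end position, as in ordinary Python. A minimal sketch of the contrast on a hypothetical frame with string labels:

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = frame.loc["b":"c"]  # rows b and c: end label included
by_position = frame.iloc[1:2]  # row b only: end position excluded

print(by_label)
print(by_position)
```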

Reduction in the dimensions of the returned object



In [30]:

    
df.loc['20130102',['A','B']]









    Out[30]:





A   -0.139974
B   -1.347789
Name: 2013-01-02 00:00:00, dtype: float64
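
The reduction happens because the row selector is a single label rather than a list or slice. As a sketch on a hypothetical two-row frame, wrapping the label in a list keeps the result two-dimensional:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r0", "r1"])

as_series = frame.loc["r0", ["A", "B"]]   # scalar row label -> Series
as_frame = frame.loc[["r0"], ["A", "B"]]  # list of labels -> one-row DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```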

For getting a scalar value



In [31]:

    
df.loc[dates[0],'A']









    Out[31]:





-1.3340127475498547

For getting fast access to a scalar (equivalent to the prior method)



In [32]:

    
df.at[dates[0],'A']









    Out[32]:





-1.3340127475498547
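
The difference between the two accessors is scope rather than result: loc also accepts lists and slices, while at is restricted to a single row label and a single column label. A small sketch on a hypothetical frame, showing that both return the same scalar:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["r0", "r1"])

v_loc = frame.loc["r1", "B"]  # general label-based selection
v_at = frame.at["r1", "B"]    # scalar-only label-based access

print(v_loc, v_at)
```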

6.3.3 Selection by Position

See more in the pandas documentation on Selection by Position.

Select via the position of the passed integers:



In [33]:

    
df.iloc[3]









    Out[33]:





A    0.034788
B   -0.677221
C    0.120449
D    0.794341
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similarly to NumPy/Python



In [34]:

    
df.iloc[3:5,0:2]

By lists of integer position locations, similar to the numpy/python style



In [35]:

    
df.iloc[[1,2,4],[0,2]]

For slicing rows explicitly



In [36]:

    
df.iloc[1:3,:]

For slicing columns explicitly



In [37]:

    
df.iloc[:,1:3]
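
The positional forms above can be checked on a small frame whose values make the selected positions easy to read. This is a sketch with a hypothetical 3x4 frame filled with 0 to 11:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

block = frame.iloc[0:2, 1:3]         # rows 0-1, columns B-C (end positions excluded)
picked = frame.iloc[[0, 2], [0, 3]]  # rows 0 and 2, columns A and D

print(block)
print(picked)
```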

For getting a value explicitly



In [39]:

    
df.iloc[1,1]









    Out[39]:





-1.3477885295869219

For getting fast access to a scalar (equivalent to the prior method)



In [40]:

    
df.iat[1,1]









    Out[40]:





-1.3477885295869219

6.3.4 Boolean Indexing

Using a single column’s values to select data.



In [41]:

    
df[df.A > 0]

A where operation for getting.



In [42]:

    
df[df > 0]
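
Indexing with a boolean frame keeps the original shape and replaces entries where the condition is False with NaN, which is the same result DataFrame.where gives. A minimal sketch on a hypothetical 2x2 frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = frame[frame > 0]                     # non-positive entries become NaN
same = masked.equals(frame.where(frame > 0))  # identical to the explicit where call

print(masked)
print(same)
```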

Using the isin() method for filtering:



In [43]:

    
df2 = df.copy()

Add a column.



In [44]:

    
df2['E'] = ['one', 'one','two','three','four','three']



In [45]:

    
df2



In [46]:

    
df2[df2['E'].isin(['two','four'])]
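
isin() returns a boolean Series, so it can be inverted with ~ to keep the rows that are not in the list. A short sketch on a hypothetical single-column frame:

```python
import pandas as pd

frame = pd.DataFrame({"E": ["one", "two", "three", "four"]})

keep = frame[frame["E"].isin(["two", "four"])]   # rows whose E is in the list
drop = frame[~frame["E"].isin(["two", "four"])]  # rows whose E is not in the list

print(keep["E"].tolist(), drop["E"].tolist())
```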

6.3.5 Setting

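Setting a new column automatically aligns the data by the indexes. The Series s1 below is indexed from 2013-01-02, one day later than df, so assigning it as a column of df would leave 2013-01-01 as NaN and drop the 2013-01-07 entry: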



In [48]:

    
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1









    Out[48]:





2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
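
The alignment itself can be sketched as follows. The frame and the column name F here are illustrative assumptions; only s1 matches the definition above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": np.zeros(6)}, index=pd.date_range("20130101", periods=6))
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

# Values are matched on index labels, not assigned by position.
frame["F"] = s1

print(frame)  # 2013-01-01 has no match in s1, so F is NaN there
```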

Setting values by position



In [49]:

    
df.iat[0,1] = 0

Setting by assigning with a numpy array



In [50]:

    
df.loc[:,'D'] = np.array([5] * len(df))

The result of the prior setting operations



In [51]:

    
df

A where operation with setting.



In [52]:

    
df2 = df.copy()



In [53]:

    
df2[df2 > 0] = -df2



In [54]:

    
df2
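
The assignment above replaces every positive entry with its negation and leaves the other entries alone, so no value in the result is positive. A minimal sketch on a hypothetical 2x2 frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

flipped = frame.copy()
flipped[flipped > 0] = -flipped  # only the positive entries are overwritten

print(flipped)
```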



Output of df.describe() (In [18]):

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.327960 -0.607370 -0.265528 -0.353891
std    0.710887  0.415940  0.853035  0.896606
min   -1.334013 -1.347789 -1.423883 -1.578785
25%   -0.847048 -0.660454 -0.849312 -0.992017
50%   -0.052593 -0.577130 -0.212063 -0.142007
75%    0.102920 -0.397249  0.321603  0.129846
max    0.428547 -0.116653  0.817077  0.794341

Output of df (In [7]):

                   A         B         C         D
2013-01-01 -1.334013 -0.348297  0.388654 -0.225967
2013-01-02 -0.139974 -1.347789  0.817077  0.192477
2013-01-03 -1.082739 -0.544105 -1.423883 -1.247367
2013-01-04  0.034788 -0.677221  0.120449  0.794341
2013-01-05  0.428547 -0.610156 -0.950891 -0.058047
2013-01-06  0.125631 -0.116653 -0.544575 -1.578785

Output of df2 (In [8]):

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Output of df after the setting operations (In [51]):

                   A         B         C  D
2013-01-01 -1.334013  0.000000  0.388654  5
2013-01-02 -0.139974 -1.347789  0.817077  5
2013-01-03 -1.082739 -0.544105 -1.423883  5
2013-01-04  0.034788 -0.677221  0.120449  5
2013-01-05  0.428547 -0.610156 -0.950891  5
2013-01-06  0.125631 -0.116653 -0.544575  5