Beginner Tutorial: Pandas Usage Examples

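The following is a brief introduction to the pandas library, aimed mainly at new users. More examples are available at http://pandas.pydata.org/pandas-docs/dev/10min.html. If you want to study pandas in more depth, the book "Python for Data Analysis" is recommended.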



In [1]:

    
# Import commonly used libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 31

Data Structures

The most important pandas data structures are Series and DataFrame. This section introduces some basic operations on them.

Create a Series



In [4]:

    
s = pd.Series([1,3,5,np.nan,6,8])
s









    Out[4]:





0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64
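A note on the output above: the Series is float64 even though most inputs are integers, because np.nan is a floating-point value and forces an upcast. A minimal sketch of this, with illustrative variable names that are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Without missing values, integer input keeps an integer dtype.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so its presence upcasts the whole Series to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)
print(with_nan.dtype)  # float64
```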

Create a DataFrame from a NumPy array, with a datetime index and column names.



In [2]:

    
dates = pd.date_range('20130101',periods=6)
dates









    Out[2]:





DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)



In [3]:

    
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Create a DataFrame from a dictionary whose values can be converted to Series.



In [4]:

    
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2



In [5]:

    
df2.dtypes









    Out[5]:





A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
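The dtypes above show that each column keeps its own type. The scalar entries ('A', 'B', 'F') are repeated for every row of the index that the Series in 'C' supplies. A short sketch of that behaviour with a smaller, made-up dictionary (the names small, x, and y are illustrative):

```python
import pandas as pd

# 'x' is a scalar and 'y' is a Series; the scalar is repeated for every
# row of the index that 'y' provides.
small = pd.DataFrame({'x': 1.0,
                      'y': pd.Series([10, 20, 30], index=list('abc'))})

print(small)
print(small.dtypes)
```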

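Tab completion is supported for column names and public attributes in interactive sessions such as IPython. Here is an example.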



In [ ]:

    
df2.

df2.A df2.boxplot df2.abs df2.C df2.add df2.clip df2.add_prefix df2.clip_lower df2.add_suffix df2.clip_upper df2.align df2.columns df2.all df2.combine df2.any df2.combineAdd df2.append df2.combine_first df2.apply df2.combineMult df2.applymap df2.compound df2.as_blocks df2.consolidate df2.asfreq df2.convert_objects df2.as_matrix df2.copy df2.astype df2.corr df2.at df2.corrwith df2.at_time df2.count df2.axes df2.cov df2.B df2.cummax df2.between_time df2.cummin df2.bfill df2.cumprod df2.blocks df2.cumsum df2.bool df2.D

Viewing Data

View the first few rows



In [10]:

    
df.head()

View the last few rows



In [11]:

    
df.tail()



In [12]:

    
df.index









    Out[12]:





DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)



In [13]:

    
df.columns









    Out[13]:





Index([u'A', u'B', u'C', u'D'], dtype='object')



In [14]:

    
df.values









    Out[14]:





array([[ 2.10889288, -0.51012447, -0.22662373,  0.42826453],
       [ 0.02850516, -0.86149264, -0.40833718,  2.68309202],
       [-1.07837404, -0.08994899,  0.49524732, -0.90692598],
       [-0.65544482,  0.77960699,  1.08173914,  1.91717896],
       [-1.06596263,  1.85296197,  1.64954395,  0.31825327],
       [ 0.6874145 , -0.79814426,  0.78050989, -0.1760253 ]])



In [15]:

    
df.describe()



In [16]:

    
df.T









    Out[16]:






  
    
      
   2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00  2013-01-05 00:00:00  2013-01-06 00:00:00
A             2.108893             0.028505            -1.078374            -0.655445            -1.065963             0.687415
B            -0.510124            -0.861493            -0.089949             0.779607             1.852962            -0.798144
C            -0.226624            -0.408337             0.495247             1.081739             1.649544             0.780510
D             0.428265             2.683092            -0.906926             1.917179             0.318253            -0.176025



In [17]:

    
# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
df.sort_values(by='B')
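Sorting can be done either by the values of a column or by the axis labels. A small sketch on a made-up frame (not the df above, whose values are random) shows both forms:

```python
import pandas as pd

frame = pd.DataFrame({'A': [9, 8, 7], 'B': [3, 1, 2]},
                     index=['r0', 'r2', 'r1'])

by_value = frame.sort_values(by='B')                   # rows ordered by column B
by_label = frame.sort_index()                          # rows ordered by index label
cols_desc = frame.sort_index(axis=1, ascending=False)  # columns in reverse label order

print(by_value)
print(by_label)
print(cols_desc)
```

None of these calls modifies frame; each returns a new, sorted DataFrame.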

Selecting Data



In [18]:

    
df['A']









    Out[18]:





2013-01-01    2.108893
2013-01-02    0.028505
2013-01-03   -1.078374
2013-01-04   -0.655445
2013-01-05   -1.065963
2013-01-06    0.687415
Freq: D, Name: A, dtype: float64



In [19]:

    
df[0:3]



In [20]:

    
df['20130104':'20130106']



In [21]:

    
df.loc[dates[0]]









    Out[21]:





A    2.108893
B   -0.510124
C   -0.226624
D    0.428265
Name: 2013-01-01 00:00:00, dtype: float64



In [22]:

    
df.loc[:,['A','B']]



In [23]:

    
df.loc['20130101':'20130103','A':'B']



In [24]:

    
df.loc['20130101','A':'B']









    Out[24]:





A    2.108893
B   -0.510124
Name: 2013-01-01 00:00:00, dtype: float64



In [25]:

    
df.loc[dates[0],'A']









    Out[25]:





2.1088928765565074



In [26]:

    
df.at[dates[0],'A']









    Out[26]:





2.1088928765565074



In [27]:

    
df.iloc[3]









    Out[27]:





A   -0.655445
B    0.779607
C    1.081739
D    1.917179
Name: 2013-01-04 00:00:00, dtype: float64



In [22]:

    
df.iloc[3:5,0:2]



In [23]:

    
df.iloc[[1,2,4],[0,2]]



In [24]:

    
df.iloc[1:3,:]



In [25]:

    
df.iloc[:,1:3]



In [26]:

    
df.iloc[1,1]









    Out[26]:





0.48324869963712741
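One detail worth noting in the cells above: label-based slices with .loc include both endpoints, while position-based slices with .iloc exclude the stop position, as ordinary Python slices do. A sketch on a small deterministic frame (the names idx, demo, by_label, and by_position are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=idx, columns=list('ABCD'))

# Label slices are inclusive on both ends: 3 rows, columns A and B.
by_label = demo.loc['20130102':'20130104', 'A':'B']

# Position slices exclude the stop: rows 1 and 2, columns 0 and 1.
by_position = demo.iloc[1:3, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```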



In [27]:

    
df[df.A>0]



In [28]:

    
df[df>0]



In [29]:

    
df2=df.copy()

df2['E']=['one', 'one','two','three','four','three']

df2



In [30]:

    
df2[df2['E'].isin(['two','four'])]



In [28]:

    
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))

s1









    Out[28]:





2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64



In [31]:

    
df['F'] = s1



In [32]:

    
df.at[dates[0],'A'] = 0



In [33]:

    
df.iat[0,1] = 0



In [34]:

    
df.loc[:,'D'] = np.array([5] * len(df))



In [35]:

    
df



In [36]:

    
df2 = df.copy()

df2[df2 > 0] = -df2

df2

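Missing Data

Missing data is represented by np.nan and is excluded from computations by default. The methods below drop, fill, or detect missing values.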



In [37]:

    
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

df1.loc[dates[0]:dates[1],'E'] = 1

df1



In [38]:

    
df1.dropna(how='any')



In [39]:

    
df1.fillna(value=5)



In [40]:

    
pd.isnull(df1)









    Out[40]:






  
    
      
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True
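To make the "excluded from computations by default" behaviour concrete, here is a minimal sketch on a three-element Series with made-up values. It relies on the skipna parameter of Series.mean, which defaults to True:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.mean())              # 2.0, the NaN is skipped by default
print(s.mean(skipna=False))  # nan, the NaN propagates when skipna is off
print(s.fillna(0).mean())    # mean after replacing NaN with 0
print(s.dropna().tolist())   # [1.0, 3.0]
```

Filling with 0 changes the mean because the filled value now counts as an observation, so the choice between dropping and filling depends on what the missing entries represent.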

Statistics



In [41]:

    
df.mean()  # mean of each column









    Out[41]:





A   -0.347310
B    0.147164
C    0.562013
D    5.000000
F    3.000000
dtype: float64



In [42]:

    
df.mean(axis=1)  # mean of each row









    Out[42]:





2013-01-01    1.193344
2013-01-02    0.951735
2013-01-03    1.265385
2013-01-04    1.841180
2013-01-05    2.287309
2013-01-06    2.133956
Freq: D, dtype: float64



In [44]:

    
s = pd.Series([1,3,5,np.nan,6,8],index=dates)  # unshifted, for comparison with the next cell

s









    Out[44]:





2013-01-01     1
2013-01-02     3
2013-01-03     5
2013-01-04   NaN
2013-01-05     6
2013-01-06     8
Freq: D, dtype: float64



In [45]:

    
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)

s









    Out[45]:





2013-01-01   NaN
2013-01-02   NaN
2013-01-03     1
2013-01-04     3
2013-01-05     5
2013-01-06   NaN
Freq: D, dtype: float64



In [46]:

    
df.sub(s,axis='index')
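The call df.sub(s, axis='index') aligns s with the row labels of df and subtracts it from every column; rows where s is NaN become NaN in the result. A deterministic sketch with made-up values (frame and offsets are illustrative names):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [10.0, 20.0, 30.0], 'B': [1.0, 2.0, 3.0]}, index=idx)
offsets = pd.Series([1.0, np.nan, 3.0], index=idx)

# Each row of frame has the matching entry of offsets subtracted from it.
result = frame.sub(offsets, axis='index')
print(result)
```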



In [49]:

    
df



In [47]:

    
df.apply(np.cumsum)



In [50]:

    
df.apply(lambda x: x.max() - x.min())









    Out[50]:





A    1.765789
B    2.714455
C    2.057881
D    0.000000
F    4.000000
dtype: float64



In [51]:

    
s = pd.Series(np.random.randint(0,7,size=10))

s









    Out[51]:





0    3
1    2
2    5
3    6
4    0
5    5
6    5
7    0
8    3
9    2
dtype: int64



In [52]:

    
s.value_counts()  # count how often each value occurs









    Out[52]:





5    3
3    2
2    2
0    2
6    1
dtype: int64



In [53]:

    
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

s.str.lower()









    Out[53]:





0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merge

concat



In [54]:

    
df = pd.DataFrame(np.random.randn(10, 4))

df



In [55]:

    
# break it into pieces

pieces = [df[:3], df[3:7], df[7:]]

pd.concat(pieces)

merge



In [56]:

    
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})



In [57]:

    
left



In [59]:

    
right



In [58]:

    
pd.merge(left, right, on='key')
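Because the key 'foo' appears twice on each side, the merge above pairs every left row with every right row that shares the key, giving four rows. The sketch below repeats the same two frames and checks the row count:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# Duplicate keys on both sides produce every left/right combination.
merged = pd.merge(left, right, on='key')

print(len(merged))  # 4
print(merged)
```

If the keys were unique on each side, the result would have at most one row per key.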

append



In [60]:

    
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])

df



In [61]:

    
s = df.iloc[3]

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)



In [62]:

    
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
        
df

Grouping



In [63]:

    
df.groupby('A')[['C', 'D']].sum()  # sum only the numeric columns



In [64]:

    
df.groupby(['A','B']).sum()
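Grouping splits the rows by key, applies a function to each group, and combines the results. Besides sum, several aggregations can be requested at once with agg. A sketch on a small made-up frame (sales and summary are illustrative names):

```python
import pandas as pd

sales = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

# One row per group, one column per requested aggregation.
summary = sales.groupby('A')['C'].agg(['sum', 'mean', 'count'])
print(summary)
```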

Stack



In [69]:

    
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
        
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]

df2



In [70]:

    
stacked = df2.stack()

stacked









    Out[70]:





first  second   
bar    one     A   -0.599411
               B    0.511343
       two     A    0.915159
               B    0.096236
baz    one     A   -0.235927
               B    0.837185
       two     A   -0.399887
               B   -0.393647
dtype: float64



In [71]:

    
stacked.unstack()



In [72]:

    
stacked.unstack(1)



In [73]:

    
stacked.unstack(0)
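stack moves the column labels into the innermost index level, and unstack moves an index level back into the columns (the innermost one by default). A sketch of the round trip on a tiny made-up frame (wide, long, and back are illustrative names):

```python
import pandas as pd

wide = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

long = wide.stack()    # Series indexed by (row label, column label)
back = long.unstack()  # the innermost level returns to the columns

print(long)
print(back)
```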



In [74]:

    
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
        
df



In [75]:

    
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
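pivot_table aggregates the values that fall into each (index, columns) cell; the default aggregation is the mean, and combinations with no data appear as NaN. A sketch with made-up records that makes the cell values easy to check by hand (records, mean_table, and sum_table are illustrative names):

```python
import pandas as pd

records = pd.DataFrame({'A': ['one', 'one', 'one', 'two'],
                        'C': ['foo', 'foo', 'bar', 'foo'],
                        'D': [1.0, 3.0, 5.0, 7.0]})

# Default aggregation: the mean of D in each (A, C) cell.
mean_table = pd.pivot_table(records, values='D', index='A', columns='C')

# The same layout with an explicit aggregation function.
sum_table = pd.pivot_table(records, values='D', index='A', columns='C',
                           aggfunc='sum')

print(mean_table)
print(sum_table)
```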

Time Series



In [7]:

    
rng = pd.date_range('1/1/2012', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample('5Min').sum()  # the how= argument was removed; call the aggregation as a method









    Out[7]:





2012-01-01    21469
Freq: 5T, dtype: int64
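The single row in the output above follows from the data: 100 one-second samples all fall inside the first 5-minute bin. A smaller sketch with constant values makes the binning easy to check by hand. It assumes the method-style aggregation of current pandas releases and the lowercase frequency aliases 's' and 'min'; the name per_minute is illustrative.

```python
import numpy as np
import pandas as pd

# 120 one-second samples, each with the value 1.
rng = pd.date_range('2012-01-01', periods=120, freq='s')
ts = pd.Series(np.ones(len(rng), dtype=int), index=rng)

# Downsample to one-minute bins and add up the samples in each bin.
per_minute = ts.resample('1min').sum()
print(per_minute)
```

Each of the two bins receives 60 samples, so both sums are 60.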



In [78]:

    
ts









    Out[78]:





2012-01-01 00:00:00    312
2012-01-01 00:00:01     67
2012-01-01 00:00:02    256
2012-01-01 00:00:03    486
2012-01-01 00:00:04    485
2012-01-01 00:00:05    210
2012-01-01 00:00:06    238
2012-01-01 00:00:07    484
2012-01-01 00:00:08    280
2012-01-01 00:00:09    223
2012-01-01 00:00:10     95
2012-01-01 00:00:11    254
2012-01-01 00:00:12    436
2012-01-01 00:00:13    234
2012-01-01 00:00:14    337
                      ... 
2012-01-01 00:01:25    176
2012-01-01 00:01:26     92
2012-01-01 00:01:27    355
2012-01-01 00:01:28    486
2012-01-01 00:01:29    270
2012-01-01 00:01:30    215
2012-01-01 00:01:31    165
2012-01-01 00:01:32    449
2012-01-01 00:01:33    373
2012-01-01 00:01:34    386
2012-01-01 00:01:35    271
2012-01-01 00:01:36     17
2012-01-01 00:01:37     89
2012-01-01 00:01:38    108
2012-01-01 00:01:39    131
Freq: S, dtype: int64



In [3]:

    
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
 
ts = pd.Series(np.random.randn(len(rng)), rng)

ts









    Out[3]:





2012-03-06    0.167706
2012-03-07   -1.507317
2012-03-08    0.382276
2012-03-09    2.126438
2012-03-10    0.415160
Freq: D, dtype: float64



In [4]:

    
ts_utc = ts.tz_localize('UTC')

ts_utc









    Out[4]:





2012-03-06 00:00:00+00:00    0.167706
2012-03-07 00:00:00+00:00   -1.507317
2012-03-08 00:00:00+00:00    0.382276
2012-03-09 00:00:00+00:00    2.126438
2012-03-10 00:00:00+00:00    0.415160
Freq: D, dtype: float64



In [5]:

    
ts_utc.tz_convert('US/Eastern')









    Out[5]:





2012-03-05 19:00:00-05:00    0.167706
2012-03-06 19:00:00-05:00   -1.507317
2012-03-07 19:00:00-05:00    0.382276
2012-03-08 19:00:00-05:00    2.126438
2012-03-09 19:00:00-05:00    0.415160
Freq: D, dtype: float64
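The two time zone calls above do different jobs: tz_localize attaches a zone to naive timestamps without changing the wall-clock time, while tz_convert keeps the same instants and changes how they are displayed. A sketch with made-up values (naive, utc, and eastern are illustrative names):

```python
import pandas as pd

naive = pd.Series([1, 2], index=pd.date_range('2012-03-06', periods=2, freq='D'))

utc = naive.tz_localize('UTC')          # attach a zone, wall-clock time unchanged
eastern = utc.tz_convert('US/Eastern')  # same instants, different wall-clock time

print(utc.index[0])
print(eastern.index[0])
```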



In [6]:

    
rng = pd.date_range('1/1/2012', periods=5, freq='M')

ts = pd.Series(np.random.randn(len(rng)), index=rng)

ts









    Out[6]:





2012-01-31   -0.110541
2012-02-29    0.558980
2012-03-31   -0.438710
2012-04-30   -1.489807
2012-05-31    0.175205
Freq: M, dtype: float64



In [7]:

    
ps = ts.to_period()

ps









    Out[7]:





2012-01   -0.110541
2012-02    0.558980
2012-03   -0.438710
2012-04   -1.489807
2012-05    0.175205
Freq: M, dtype: float64



In [8]:

    
ps.to_timestamp()









    Out[8]:





2012-01-01   -0.110541
2012-02-01    0.558980
2012-03-01   -0.438710
2012-04-01   -1.489807
2012-05-01    0.175205
Freq: MS, dtype: float64



In [9]:

    
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')

ts = pd.Series(np.random.randn(len(prng)), prng)

ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

ts.head()









    Out[9]:





1990-03-01 09:00   -2.013458
1990-06-01 09:00   -1.760266
1990-09-01 09:00   -1.266192
1990-12-01 09:00   -0.665547
1991-03-01 09:00    1.581657
Freq: H, dtype: float64

Categoricals



In [10]:

    
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

df



In [11]:

    
df["grade"] = df["raw_grade"].astype("category")

df["grade"]









    Out[11]:





0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]



In [12]:

    
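# Assigning to .cat.categories is no longer supported in pandas 2.0+; rename instead
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])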



In [13]:

    
df.sort_values(by="grade")



In [14]:

    
df.groupby("grade").size()









    Out[14]:





grade
very good    3
good         2
very bad     1
dtype: int64
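Sorting and grouping on a categorical column follow the order of the categories rather than the alphabetical order of the labels, which is why "very good" comes before "good" above. The sketch below, on a standalone Series with the same raw grades, also adds categories that have no observations yet; the extra labels 'bad' and 'medium' are assumptions made for illustration.

```python
import pandas as pd

grades = pd.Series(['a', 'b', 'b', 'a', 'a', 'e']).astype('category')
grades = grades.cat.rename_categories(['very good', 'good', 'very bad'])

# Fix the category order and add categories with no observations.
grades = grades.cat.set_categories(
    ['very bad', 'bad', 'medium', 'good', 'very good'])

# Sorting follows the category order, not the alphabet.
print(grades.sort_values().tolist())

# Unobserved categories are reported with a count of 0.
print(grades.value_counts(sort=False))
```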

Plotting



In [15]:

    
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot()









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f92ddc02090>



In [16]:

    
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
        
df = df.cumsum()

plt.figure(); df.plot(); plt.legend(loc='best')









    Out[16]:





<matplotlib.legend.Legend at 0x7f92dda3f990>






    





<matplotlib.figure.Figure at 0x7f92dda17650>



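Rendered DataFrame outputs for some of the cells above (tab-separated; values may come from a different run than the inline outputs):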

	A	B	C	D
2013-01-01	0.451654	-1.669159	-0.716677	-0.311599
2013-01-02	1.834961	-0.612790	-0.376212	1.239693
2013-01-03	-0.422785	0.663677	0.878228	-0.132077
2013-01-04	-1.201822	1.054161	0.641047	0.166984
2013-01-05	1.953751	1.072182	-0.477465	0.514392
2013-01-06	0.181696	-1.068304	1.676052	0.322696

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.004172	0.062143	0.562013	0.710640
std	1.236887	1.064482	0.783398	1.340351
min	-1.078374	-0.861493	-0.408337	-0.906926
25%	-0.963333	-0.726139	-0.046156	-0.052456
50%	-0.313470	-0.300037	0.637879	0.373259
75%	0.522687	0.562218	1.006432	1.544950
max	2.108893	1.852962	1.649544	2.683092

	A	B
2013-01-04	0.385315	1.002269
2013-01-05	-0.276139	1.422227

	A	C
2013-01-02	1.643882	0.586507
2013-01-03	-0.660116	-0.278159
2013-01-05	-0.276139	0.168013

	A	B	C	D
2013-01-02	1.643882	0.483249	0.586507	-1.525807
2013-01-03	-0.660116	0.444498	-0.278159	-0.492252

	A	B	C	D	E	F
0	1	2013-01-02	1	3	test	foo
1	1	2013-01-02	1	3	train	foo
2	1	2013-01-02	1	3	test	foo
3	1	2013-01-02	1	3	train	foo

	A	B	C	D
2013-01-01	2.108893	-0.510124	-0.226624	0.428265
2013-01-02	0.028505	-0.861493	-0.408337	2.683092
2013-01-03	-1.078374	-0.089949	0.495247	-0.906926
2013-01-04	-0.655445	0.779607	1.081739	1.917179
2013-01-05	-1.065963	1.852962	1.649544	0.318253

	B	C
2013-01-01	-1.166239	-1.229485
2013-01-02	0.483249	0.586507
2013-01-03	0.444498	-0.278159
2013-01-04	1.002269	-0.017041
2013-01-05	1.422227	0.168013
2013-01-06	-0.873137	-0.025599

	A	B	C	D
2013-01-01	NaN	NaN	NaN	0.453450
2013-01-02	1.643882	0.483249	0.586507	NaN
2013-01-03	NaN	0.444498	NaN	NaN
2013-01-04	0.385315	1.002269	NaN	2.023510
2013-01-05	NaN	1.422227	0.168013	NaN
2013-01-06	0.758207	NaN	NaN	0.845468

	A	B	C	D	E
2013-01-01	-0.463576	-1.166239	-1.229485	0.453450	one
2013-01-02	1.643882	0.483249	0.586507	-1.525807	one
2013-01-03	-0.660116	0.444498	-0.278159	-0.492252	two
2013-01-04	0.385315	1.002269	-0.017041	2.023510	three
2013-01-05	-0.276139	1.422227	0.168013	-0.116013	four
2013-01-06	0.758207	-0.873137	-0.025599	0.845468	three

	A	B	C	D	F
2013-01-01	0.000000	0.000000	-0.226624	5	NaN
2013-01-02	0.028505	-0.861493	-0.408337	5	1
2013-01-03	-1.078374	-0.089949	0.495247	5	2
2013-01-04	-0.655445	0.779607	1.081739	5	3
2013-01-05	-1.065963	1.852962	1.649544	5	4
2013-01-06	0.687415	-0.798144	0.780510	5	5

	A	B	C	D	F	E
2013-01-01	False	False	False	False	True	False
2013-01-02	False	False	False	False	False	False
2013-01-03	False	False	False	False	False	True
2013-01-04	False	False	False	False	False	True

	0	1	2	3
0	1.592374	0.655436	0.951883	-0.317604
1	0.998669	0.396345	0.554562	-0.213849
2	0.598832	-0.795983	-0.581768	0.241221
3	-0.081055	0.206142	1.143169	-0.749618
4	1.427974	-1.996137	-2.494941	-2.499584
5	-0.568733	-0.691511	-0.316464	0.149466
6	1.123894	-0.875845	2.322487	-0.676746
7	-2.862173	-0.679058	-1.556605	-1.727630
8	0.009725	-1.435217	-0.225359	2.155263
9	0.139717	0.580340	0.457227	0.132505

	A	B	C	D
0	1.247717	-1.612009	0.472290	0.201178
1	1.100179	0.161626	-0.321770	0.474538
2	-0.119958	-1.389366	0.494838	0.182669
3	-0.101885	-1.178154	0.109359	0.505889
4	-0.064535	0.084794	0.213919	-0.222954
5	-1.563538	1.374349	1.617262	1.060599
6	1.353898	-0.560971	0.625005	-0.024575
7	-0.349520	-0.441983	-2.094049	1.159827

	A	B	C	D
0	foo	one	-0.698394	0.176046
1	bar	one	0.643372	-0.026620
2	foo	two	1.127251	-0.321595
3	bar	three	1.290459	-0.010702
4	foo	two	0.836285	-0.346573
5	bar	two	0.981900	-1.203876
6	foo	one	1.527603	1.903233
7	foo	three	0.880489	0.347362

		A	B
first	second
bar	one	-0.599411	0.511343
bar	two	0.915159	0.096236
baz	one	-0.235927	0.837185
baz	two	-0.399887	-0.393647

	A	B	C	D	E
0	one	A	foo	-0.127958	1.228190
1	one	B	foo	0.149878	-0.197303
2	two	C	foo	-1.586735	0.635435
3	three	A	bar	-0.174615	0.014008
4	one	B	bar	0.226525	-0.045204
5	one	C	bar	-1.540367	-0.612553
6	two	A	foo	0.006047	-0.089030
7	three	B	foo	1.037414	-0.895084
8	one	C	foo	1.542884	-0.695924
9	one	A	bar	-2.256235	0.645135
10	two	B	bar	-0.128245	-0.686096
11	three	C	bar	0.751396	0.014718

	id	raw_grade	grade
0	1	a	very good
3	4	a	very good
4	5	a	very good
1	2	b	good
2	3	b	good
5	6	e	very bad