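The following is a brief introduction to the pandas library, aimed mainly at new users. See http://pandas.pydata.org/pandas-docs/dev/10min.html for more examples. If you want to study pandas in more depth, the book "Python for Data Analysis" is recommended.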
In [1]:
# Import commonly used libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 31
The most important pandas data structures are Series and DataFrame. This section introduces some of their basic operations.
Create a Series
In [4]:
s = pd.Series([1,3,5,np.nan,6,8])
s
Out[4]:
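A minimal sketch of what the Series created above holds. Because np.nan is a float, pandas stores the values as float64, and since no index is passed, a default integer index is assigned. The name `s_demo` is hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

# Same values as the Series above; s_demo is an illustrative name.
s_demo = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s_demo.dtype)         # dtype is float64 because np.nan is a float
print(list(s_demo.index))   # default integer labels 0 through 5
print(s_demo.isna().sum())  # number of missing values
```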
Create a DataFrame from a NumPy array, with a datetime index and column names.
In [2]:
dates = pd.date_range('20130101',periods=6)
dates
Out[2]:
In [3]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df
Out[3]:
Create a DataFrame from a dict of objects that can be converted to Series.
In [4]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2
Out[4]:
In [5]:
df2.dtypes
Out[5]:
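A small sketch of how the dict constructor above behaves, under the assumption that only some of the original columns are needed to show the point: scalar values are repeated to match the length of the Series and array entries, and each column keeps its own dtype. The name `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

# demo is an illustrative name; the columns mirror a subset of df2 above.
demo = pd.DataFrame({
    'A': 1.0,                                                  # scalar, repeated on every row
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),  # sets the row index 0..3
    'D': np.array([3] * 4, dtype='int32'),                     # array of matching length
    'F': 'foo',                                                # scalar string, repeated
})

print(demo.shape)   # four rows, four columns
print(demo.dtypes)  # one dtype per column
```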
Tab completion is supported. For example, type the following and press Tab:
In [ ]:
df2.
View the first few rows
In [10]:
df.head()
Out[10]:
View the last few rows
In [11]:
df.tail()
Out[11]:
In [12]:
df.index
Out[12]:
In [13]:
df.columns
Out[13]:
In [14]:
df.values
Out[14]:
In [15]:
df.describe()
Out[15]:
In [16]:
df.T
Out[16]:
In [17]:
df.sort_values(by='B')  # DataFrame.sort was removed; sort_values replaces it
Out[17]:
In [18]:
df['A']
Out[18]:
In [19]:
df[0:3]
Out[19]:
In [20]:
df['20130104':'20130106']
Out[20]:
In [21]:
df.loc[dates[0]]
Out[21]:
In [22]:
df.loc[:,['A','B']]
Out[22]:
In [23]:
df.loc['20130101':'20130103','A':'B']
Out[23]:
In [24]:
df.loc['20130101','A':'B']
Out[24]:
In [25]:
df.loc[dates[0],'A']
Out[25]:
In [26]:
df.at[dates[0],'A']
Out[26]:
In [27]:
df.iloc[3]
Out[27]:
In [22]:
df.iloc[3:5,0:2]
Out[22]:
In [23]:
df.iloc[[1,2,4],[0,2]]
Out[23]:
In [24]:
df.iloc[1:3,:]
Out[24]:
In [25]:
df.iloc[:,1:3]
Out[25]:
In [26]:
df.iloc[1,1]
Out[26]:
In [27]:
df[df.A>0]
Out[27]:
In [28]:
df[df>0]
Out[28]:
In [29]:
df2=df.copy()
df2['E']=['one', 'one','two','three','four','three']
df2
Out[29]:
In [30]:
df2[df2['E'].isin(['two','four'])]
Out[30]:
In [28]:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
s1
Out[28]:
In [31]:
df['F'] = s1
In [32]:
df.at[dates[0],'A'] = 0
In [33]:
df.iat[0,1] = 0
In [34]:
df.loc[:,'D'] = np.array([5] * len(df))
In [35]:
df
Out[35]:
In [36]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2
Out[36]:
Missing data is represented by np.nan and is excluded from computations by default. The methods below show how to handle missing data.
In [37]:
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
Out[37]:
In [38]:
df1.dropna(how='any')
Out[38]:
In [39]:
df1.fillna(value=5)
Out[39]:
In [40]:
pd.isnull(df1)
Out[40]:
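The missing-data methods above can be sketched on a small hand-made frame, which makes the results easy to check by eye. The frame and the name `frame` are assumptions for this illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# frame is a hypothetical three-row example with one NaN in each column.
frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [4.0, 5.0, np.nan]})

col_means = frame.mean()               # NaN is skipped by default
strict_means = frame.mean(skipna=False)  # NaN propagates when skipna=False
complete_rows = frame.dropna(how='any')  # drop rows containing any NaN
filled = frame.fillna(value=0)           # replace NaN with a constant
mask = frame.isna()                      # boolean mask of missing cells

print(col_means)
print(strict_means)
print(complete_rows)
print(filled)
print(mask)
```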
In [41]:
df.mean()  # mean of each column
Out[41]:
In [42]:
df.mean(axis=1)  # mean of each row
Out[42]:
In [44]:
s = pd.Series([1,3,5,np.nan,6,8],index=dates)#.shift(2)
s
Out[44]:
In [45]:
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
s
Out[45]:
In [46]:
df.sub(s,axis='index')
Out[46]:
In [49]:
df
Out[49]:
In [47]:
df.apply(np.cumsum)
Out[47]:
In [50]:
df.apply(lambda x: x.max() - x.min())
Out[50]:
In [51]:
s = pd.Series(np.random.randint(0,7,size=10))
s
Out[51]:
In [52]:
s.value_counts()  # count how often each value occurs
Out[52]:
In [53]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
Out[53]:
In [54]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[54]:
In [55]:
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Out[55]:
In [56]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [57]:
left
Out[57]:
In [59]:
right
Out[59]:
In [58]:
pd.merge(left, right, on='key')
Out[58]:
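A sketch of why the merge above returns four rows: both frames repeat the key 'foo', so every left row is paired with every right row that shares the key. The names `left_demo`, `right_demo`, and `merged` are hypothetical copies used only for this check.

```python
import pandas as pd

# Copies of the left/right frames above under illustrative names.
left_demo = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right_demo = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# Default inner join on 'key'; duplicate keys give all pairings.
merged = pd.merge(left_demo, right_demo, on='key')

print(merged)
print(len(merged))  # 2 left rows x 2 right rows
```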
In [60]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
Out[60]:
In [61]:
s = df.iloc[3]
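# DataFrame.append was removed in pandas 2.0; concatenate the row as a one-row frame instead
pd.concat([df, s.to_frame().T], ignore_index=True)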
Out[61]:
In [62]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df
Out[62]:
In [63]:
df.groupby('A')[['C', 'D']].sum()  # sum only the numeric columns
Out[63]:
In [64]:
df.groupby(['A','B']).sum()
Out[64]:
In [69]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
Out[69]:
In [70]:
stacked = df2.stack()
stacked
Out[70]:
In [71]:
stacked.unstack()
Out[71]:
In [72]:
stacked.unstack(1)
Out[72]:
In [73]:
stacked.unstack(0)
Out[73]:
In [74]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df
Out[74]:
In [75]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[75]:
In [7]:
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5min').sum()  # the how= argument was removed; call the aggregation on the resampler
Out[7]:
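A sketch of the resampling step with deterministic data, so the result can be verified: 100 one-second samples all fall inside the first five-minute bin, so the summed series has a single row. The constant values and the names `rng_demo`, `ts_demo`, and `five_min` are assumptions for this illustration; the original uses random integers.

```python
import numpy as np
import pandas as pd

# 100 timestamps at one-second spacing, starting at midnight.
rng_demo = pd.date_range('1/1/2012', periods=100, freq='s')

# Constant ones instead of random integers, so the sum is known.
ts_demo = pd.Series(np.ones(len(rng_demo), dtype=int), index=rng_demo)

# Downsample to five-minute bins and sum the values in each bin.
five_min = ts_demo.resample('5min').sum()

print(five_min)
```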
In [78]:
ts
Out[78]:
In [3]:
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
Out[3]:
In [4]:
ts_utc = ts.tz_localize('UTC')
ts_utc
Out[4]:
In [5]:
ts_utc.tz_convert('US/Eastern')
Out[5]:
In [6]:
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
Out[6]:
In [7]:
ps = ts.to_period()
ps
Out[7]:
In [8]:
ps.to_timestamp()
Out[8]:
In [9]:
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
Out[9]:
In [10]:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
Out[10]:
In [11]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
Out[11]:
In [12]:
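# Assigning to .cat.categories is no longer supported; rename_categories returns the renamed column
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])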
In [13]:
df.sort_values(by="grade")
Out[13]:
In [14]:
df.groupby("grade").size()
Out[14]:
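A minimal sketch of how a categorical column sorts: values are ordered by the position of their category, not alphabetically. The categories 'a', 'b', 'e' are renamed in that order, so 'very good' sorts first even though it is last alphabetically. The names `grades` and `sorted_grades` are hypothetical.

```python
import pandas as pd

# Same raw grades as above, as a standalone categorical Series.
grades = pd.Series(['a', 'b', 'b', 'a', 'a', 'e']).astype('category')

# Categories start as ['a', 'b', 'e']; rename them position by position.
grades = grades.cat.rename_categories(['very good', 'good', 'very bad'])

# Sorting follows category order, not lexical order.
sorted_grades = grades.sort_values()

print(list(grades.cat.categories))
print(sorted_grades.tolist())
```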
In [15]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Out[15]:
In [16]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
Out[16]: