由于许多潜在pandas用户已经熟悉SQL,这个页面旨在使用pandas给出SQL各种操作的例子。
如果你对pandas比较陌生,你可能需要通过10分钟先读一下pandas。
按照惯例,我们先导入pandas和numpy:
In [3]:
import pandas as pd
import numpy as np
In [4]:
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)
tips.head()
Out[4]:
在SQL中, 用逗号将所要选择的列分隔开,来进行选择。(或用*表示选择所有列):
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
在pandas中,通过給DataFrame传递列名进行列选择:
In [6]:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Out[6]:
如果没有说明列名,默认选择所有列
通过WHERE子句完成SQL过滤。
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
可以以多种方式筛选DataFrames;其中最直观的使用布尔索引。
In [7]:
tips[tips['time'] == 'Dinner'].head(5)
Out[7]:
上述声明是通过給DataFrame传递一系列的真/假对象,返回所有真实的行。
In [8]:
is_dinner = tips['time'] == 'Dinner'
In [9]:
is_dinner.value_counts()
Out[9]:
In [10]:
tips[is_dinner].head(5)
Out[10]:
就像SQL的or和and,通过使用|(or)和&(and),可以将多个条件传递给DataFrame。
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
In [11]:
# tips of more than $5.00 at Dinner meals
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]:
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
In [12]:
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
使用notnull()和isnull()方法进行NULL检查。
In [13]:
frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
'col2': ['F', np.NaN, 'G', 'H', 'I']})
In [14]:
frame
Out[14]:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]:
frame[frame['col2'].isnull()]
Out[15]:
SELECT * FROM frame WHERE col1 IS NOT NULL;
In [16]:
frame[frame['col1'].notnull()]
Out[16]:
在pandas中,执行SQL的GROUP BY操作时使用同样的命名groupby()方法。groupby()通常是指一个过程,我们想把数据集分成组,应用一些函数(通常聚合),然后结合分组在一起。一个常见的SQL操作可以得到数据集中每组记录的数量。例如:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
同理,The pandas是这样的:
In [19]:
tips.groupby('sex').size()
Out[19]:
注意,在pandas的代码中,我们使用size()并不是count()。这是因为count()函数适用于每一列,返回每个非空记录的数量。
In [20]:
tips.groupby('sex').count()
Out[20]:
或者,我们可以用count()方法应用于一个单独的列:
In [21]:
tips.groupby('sex')['total_bill'].count()
Out[21]:
多样函数也可以应用。例如,我们想看看tip数量在一个星期中的变化,gg()允许您传递一个字典給分组DataFrame,表明函数适用于特定列。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
In [22]:
tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
Out[22]:
传递一个列表給groupby()方法,完成多个列的分组。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
In [23]:
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out[23]:
使用join()或merge()进行连接。默认情况下,join()将加入DataFrames索引。每个方法有参数允许您指定连接类型(LEFT, RIGHT, INNER, FULL)或列的连接(列名或索引)。
In [24]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': np.random.randn(4)})
In [25]:
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
'value': np.random.randn(4)})
假设我们有两个与DataFrames同名同结构的数据库表。现在我们来讨论各种类型的连接
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
In [26]:
# merge performs an INNER JOIN by default
pd.merge(df1, df2, on='key')
Out[26]:
当你想把DataFrame的列与另一个DataFrame索引连接时,merge()函数给出了其他参数。
In [27]:
indexed_df2 = df2.set_index('key')
In [28]:
indexed_df2
Out[28]:
In [29]:
pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[29]:
-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
In [30]:
# show all records from df1
pd.merge(df1, df2, on='key', how='left')
Out[30]:
-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
In [31]:
# show all records from df2
pd.merge(df1, df2, on='key', how='right')
Out[31]:
pandas还允许全连接,显示两边的数据集,无论是否找到列匹配。全连接不支持所有RDBMS(MySQL)。
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
In [32]:
pd.merge(df1, df2, on='key', how='outer')
Out[32]:
使用concat()联合所有。
In [41]:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
'rank': [1, 4, 5]})
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
Chicago 2
New York 3
Chicago 1
Boston 4
Angeles 5
In [34]:
pd.concat([df1, df2])
Out[34]:
SQL’s UNION和UNION ALL相似,而UNION 删除重复行
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
Chicago 2
New York 3
Chicago 1
Boston 4
Angeles 5
在pandas中,使用drop_duplicates()删除重复行。
In [35]:
pd.concat([df1, df2]).drop_duplicates()
Out[35]:
pandas等价于一些SQL解析和聚合函数
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [37]:
tips.nlargest(10+5, columns='tip').tail(10)
Out[37]:
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [38]:
(tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
.groupby(['day'])
.cumcount() + 1)
.query('rn < 3')
.sort_values(['day','rn'])
)
Out[38]:
使用rank(method=’first’)函数
In [39]:
(tips.assign(rnk=tips.groupby(['day'])['total_bill']
.rank(method='first', ascending=False))
.query('rnk < 3')
.sort_values(['day','rnk'])
)
Out[39]:
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [40]:
tips.loc[tips['tip'] < 2, 'tip'] *= 2
DELETE FROM tips
WHERE tip > 9;
在pandas中,我们选择应该留下剩下的行,而不是删除它们
In [41]:
tips = tips.loc[tips['tip'] <= 9]