与SQL的比较

由于许多潜在pandas用户已经熟悉SQL,这个页面旨在使用pandas给出SQL各种操作的例子。

如果你对pandas比较陌生,你可能需要通过10分钟先读一下pandas。

按照惯例,我们先导入pandas和numpy:


In [3]:
import pandas as pd

import numpy as np

In [4]:
url = 'https://raw.github.com/pydata/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)
tips.head()


Out[4]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

选择

在SQL中, 用逗号将所要选择的列分隔开,来进行选择。(或用*表示选择所有列):

SELECT total_bill, tip, smoker, time

FROM tips

LIMIT 5;

在pandas中,通过給DataFrame传递列名进行列选择:


In [6]:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)


Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner

如果没有说明列名,默认选择所有列

WHERE

通过WHERE子句完成SQL过滤。

SELECT *

FROM tips

WHERE time = 'Dinner'

LIMIT 5;

可以以多种方式筛选DataFrames;其中最直观的使用布尔索引。


In [7]:
tips[tips['time'] == 'Dinner'].head(5)


Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

上述声明是通过給DataFrame传递一系列的真/假对象,返回所有真实的行。


In [8]:
is_dinner = tips['time'] == 'Dinner'

In [9]:
is_dinner.value_counts()


Out[9]:
True     176
False     68
Name: time, dtype: int64

In [10]:
tips[is_dinner].head(5)


Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

就像SQL的or和and,通过使用|(or)和&(and),可以将多个条件传递给DataFrame。

-- tips of more than $5.00 at Dinner meals

SELECT *

FROM tips

WHERE time = 'Dinner' AND tip > 5.00;


In [11]:
# tips of more than $5.00 at Dinner meals
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]


Out[11]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3

-- tips by parties of at least 5 diners OR bill total was more than $45

SELECT *

FROM tips

WHERE size >= 5 OR total_bill > 45;


In [12]:
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]


Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5

使用notnull()和isnull()方法进行NULL检查。


In [13]:
frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
                      'col2': ['F', np.NaN, 'G', 'H', 'I']})

In [14]:
frame


Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I

SELECT *

FROM frame

WHERE col2 IS NULL;


In [15]:
frame[frame['col2'].isnull()]


Out[15]:
col1 col2
1 B NaN

SELECT * FROM frame WHERE col1 IS NOT NULL;


In [16]:
frame[frame['col1'].notnull()]


Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I

分组

在pandas中,执行SQL的GROUP BY操作时使用同样的命名groupby()方法。groupby()通常是指一个过程,我们想把数据集分成组,应用一些函数(通常聚合),然后结合分组在一起。一个常见的SQL操作可以得到数据集中每组记录的数量。例如:

SELECT sex, count(*)

FROM tips

GROUP BY sex;

/*

Female 87

Male 157

*/

同理,The pandas是这样的:


In [19]:
tips.groupby('sex').size()


Out[19]:
sex
Female     87
Male      157
dtype: int64

注意,在pandas的代码中,我们使用size()并不是count()。这是因为count()函数适用于每一列,返回每个非空记录的数量。


In [20]:
tips.groupby('sex').count()


Out[20]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157

或者,我们可以用count()方法应用于一个单独的列:


In [21]:
tips.groupby('sex')['total_bill'].count()


Out[21]:
sex
Female     87
Male      157
Name: total_bill, dtype: int64

多样函数也可以应用。例如,我们想看看tip数量在一个星期中的变化,gg()允许您传递一个字典給分组DataFrame,表明函数适用于特定列。

SELECT day, AVG(tip), COUNT(*)

FROM tips

GROUP BY day;

/*

Fri 2.734737 19

Sat 2.993103 87

Sun 3.255132 76

Thur 2.771452 62

*/


In [22]:
tips.groupby('day').agg({'tip': np.mean, 'day': np.size})


Out[22]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62

传递一个列表給groupby()方法,完成多个列的分组。

SELECT smoker, day, COUNT(*), AVG(tip)

FROM tips

GROUP BY smoker, day;

/*

smoker day

No Fri 4 2.812500

   Sat     45  3.102889

   Sun     57  3.167895

   Thur    45  2.673778

Yes Fri 15 2.714000

   Sat     42  2.875476

   Sun     19  3.516842

   Thur    17  3.030000

*/


In [23]:
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})


Out[23]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000

连接

使用join()或merge()进行连接。默认情况下,join()将加入DataFrames索引。每个方法有参数允许您指定连接类型(LEFT, RIGHT, INNER, FULL)或列的连接(列名或索引)。


In [24]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                  'value': np.random.randn(4)})

In [25]:
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                  'value': np.random.randn(4)})

假设我们有两个与DataFrames同名同结构的数据库表。现在我们来讨论各种类型的连接

内连接

SELECT *

FROM df1

INNER JOIN df2

ON df1.key = df2.key;


In [26]:
# merge performs an INNER JOIN by default
pd.merge(df1, df2, on='key')


Out[26]:
key value_x value_y
0 B 1.536419 1.767244
1 D 1.973574 0.050838
2 D 1.973574 0.956527

当你想把DataFrame的列与另一个DataFrame索引连接时,merge()函数给出了其他参数。


In [27]:
indexed_df2 = df2.set_index('key')

In [28]:
indexed_df2


Out[28]:
value
key
B 1.767244
D 0.050838
D 0.956527
E -0.053910

In [29]:
pd.merge(df1, indexed_df2, left_on='key', right_index=True)


Out[29]:
key value_x value_y
1 B 1.536419 1.767244
3 D 1.973574 0.050838
3 D 1.973574 0.956527

左外连接

-- show all records from df1

SELECT *

FROM df1

LEFT OUTER JOIN df2

ON df1.key = df2.key;


In [30]:
# show all records from df1
pd.merge(df1, df2, on='key', how='left')


Out[30]:
key value_x value_y
0 A -0.675068 NaN
1 B 1.536419 1.767244
2 C -0.777839 NaN
3 D 1.973574 0.050838
4 D 1.973574 0.956527

右连接

-- show all records from df2

SELECT *

FROM df1

RIGHT OUTER JOIN df2

ON df1.key = df2.key;


In [31]:
# show all records from df2
pd.merge(df1, df2, on='key', how='right')


Out[31]:
key value_x value_y
0 B 1.536419 1.767244
1 D 1.973574 0.050838
2 D 1.973574 0.956527
3 E NaN -0.053910

全连接

pandas还允许全连接,显示两边的数据集,无论是否找到列匹配。全连接不支持所有RDBMS(MySQL)。

-- show all records from both tables

SELECT *

FROM df1

FULL OUTER JOIN df2

ON df1.key = df2.key;


In [32]:
pd.merge(df1, df2, on='key', how='outer')


Out[32]:
key value_x value_y
0 A -0.675068 NaN
1 B 1.536419 1.767244
2 C -0.777839 NaN
3 D 1.973574 0.050838
4 D 1.973574 0.956527
5 E NaN -0.053910

联合

使用concat()联合所有。


In [41]:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})

SELECT city, rank

FROM df1

UNION ALL

SELECT city, rank

FROM df2;

/*

     city  rank

  Chicago     1

  Chicago     2

 New York     3

  Chicago     1

   Boston     4

  Angeles     5

In [34]:
pd.concat([df1, df2])


Out[34]:
key value
0 A -0.675068
1 B 1.536419
2 C -0.777839
3 D 1.973574
0 B 1.767244
1 D 0.050838
2 D 0.956527
3 E -0.053910

SQL’s UNION和UNION ALL相似,而UNION 删除重复行

SELECT city, rank

FROM df1

UNION

SELECT city, rank

FROM df2;

-- notice that there is only one Chicago record this time

/*

     city  rank

  Chicago     1

  Chicago     2

 New York     3

  Chicago     1

   Boston     4

  Angeles     5

在pandas中,使用drop_duplicates()删除重复行。


In [35]:
pd.concat([df1, df2]).drop_duplicates()


Out[35]:
key value
0 A -0.675068
1 B 1.536419
2 C -0.777839
3 D 1.973574
0 B 1.767244
1 D 0.050838
2 D 0.956527
3 E -0.053910

pandas等价于一些SQL解析和聚合函数

带有偏移量前N行

-- MySQL

SELECT * FROM tips

ORDER BY tip DESC

LIMIT 10 OFFSET 5;


In [37]:
tips.nlargest(10+5, columns='tip').tail(10)


Out[37]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4

每组前n行

-- Oracle's ROW_NUMBER() analytic function

SELECT * FROM (

SELECT

t.*,

ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn

FROM tips t

)

WHERE rn < 3

ORDER BY day, rn;


In [38]:
(tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
                      .groupby(['day'])
                    .cumcount() + 1)
     .query('rn < 3')
        .sort_values(['day','rn'])
    )


Out[38]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2

使用rank(method=’first’)函数


In [39]:
(tips.assign(rnk=tips.groupby(['day'])['total_bill']
                       .rank(method='first', ascending=False))
       .query('rnk < 3')
     .sort_values(['day','rnk'])
   )


Out[39]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0

-- Oracle's RANK() analytic function

SELECT * FROM (

SELECT

t.*,

RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk

FROM tips t

WHERE tip < 2

)

WHERE rnk < 3

ORDER BY sex, rnk;

更新

UPDATE tips

SET tip = tip*2

WHERE tip < 2;


In [40]:
tips.loc[tips['tip'] < 2, 'tip'] *= 2

删除

DELETE FROM tips

WHERE tip > 9;

在pandas中,我们选择应该留下剩下的行,而不是删除它们


In [41]:
tips = tips.loc[tips['tip'] <= 9]