Why is Python Growing So Quickly?

Pandas vs SQL


In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)
tips.head()


Out[2]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

SELECT

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

pandas在DataFrame直接取字段名称即可


In [3]:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)


Out[3]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner

DataFrame取数时,如果不设置字段名称,就会取所有字段 (与 SQL 的 * 等效)

WHERE

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

DataFrames过滤方法很多,最常用的是布尔索引(boolean indexing)


In [4]:
tips[tips['time'] == 'Dinner'].head(5)


Out[4]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

这行代码是向DataFrame传递一个True/False对象的Series,返回所有带True的行。


In [5]:
is_dinner = tips['time'] == 'Dinner'

In [6]:
is_dinner.value_counts()


Out[6]:
True     176
False     68
Name: time, dtype: int64

In [7]:
tips[is_dinner].head(5)


Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

与SQL的OR/AND类似,DataFrame 用 | (OR) 与 & (AND)

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

In [8]:
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]


Out[8]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;

In [9]:
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]


Out[9]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5

用 notna() and isna() 方法检测NULL


In [10]:
frame = pd.DataFrame({
    'col1': ['A', 'B', np.NaN, 'C', 'D'],
    'col2': ['F', np.NaN, 'G', 'H', 'I']
})

In [11]:
frame


Out[11]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I

如果用SQL筛选col2是NULL:

SELECT *
FROM frame
WHERE col2 IS NULL;

pandas用isna方法


In [12]:
frame[frame['col2'].isna()]


Out[12]:
col1 col2
1 B NaN

用SQL筛选col1不是NULL:

SELECT *
FROM frame
WHERE col1 IS NOT NULL;

pandas用notna()方法


In [13]:
frame[frame['col1'].notna()]


Out[13]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I

GROUP BY

SQL的GROUP BY操作,在pandas中是groupby()方法。

例如用SQL统计不同性别客人数量:

SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female     87
Male      157
*/

pandas方法是:


In [14]:
tips.groupby('sex').size()


Out[14]:
sex
Female     87
Male      157
dtype: int64

这里用pandas的size()方法,不是count()方法。因为count()方法会自动应用到每一字段,返回所有字段的计数。


In [15]:
tips.groupby('sex').count()


Out[15]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157

也可以对一个字段用count()方法,实现同样效果:


In [16]:
tips.groupby('sex')['total_bill'].count()


Out[16]:
sex
Female     87
Male      157
Name: total_bill, dtype: int64

也可以一次使用多个聚合函数。假如,要统计每星期的日均小费金额,以及给小费的人数,用SQL如下:

SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/

pandas用agg()方法实现,参数是一个Python字典(字段名称: 方法名称, ...)


In [17]:
tips.groupby('day').agg({'tip': np.mean, 'day': np.size})


Out[17]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/

In [18]:
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})


Out[18]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000

JOIN

JOIN在pandas是join()或merge()方法。join()方法默认是按照DataFrames的索引值(indices)连接,两种方法都支持四种连接(LEFT, RIGHT, INNER, FULL),也可以按字段连接。


In [19]:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                     'value': np.random.randn(4)})

In [20]:
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
                    'value': np.random.randn(4)})

INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;

In [21]:
# 默认是INNER JOIN
pd.merge(df1, df2, on='key')


Out[21]:
key value_x value_y
0 B 0.042590 0.266005
1 D 0.330982 -0.020091
2 D 0.330982 -1.484913

merge()也可以连接一个表的字段与另一个表的索引


In [22]:
indexed_df2 = df2.set_index('key')

In [23]:
pd.merge(df1, indexed_df2, left_on='key', right_index=True)


Out[23]:
key value_x value_y
1 B 0.042590 0.266005
3 D 0.330982 -0.020091
3 D 0.330982 -1.484913

LEFT OUTER JOIN

SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;

In [24]:
pd.merge(df1, df2, on='key', how='left')


Out[24]:
key value_x value_y
0 A -0.674657 NaN
1 B 0.042590 0.266005
2 C 1.698921 NaN
3 D 0.330982 -0.020091
4 D 0.330982 -1.484913

RIGHT JOIN

SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;

In [25]:
pd.merge(df1, df2, on='key', how='right')


Out[25]:
key value_x value_y
0 B 0.042590 0.266005
1 D 0.330982 -0.020091
2 D 0.330982 -1.484913
3 E NaN -0.758915

FULL JOIN

pandas支持FULL JOINs, 绝大部分RDBMS都支持,HIVE支持,MySQL不支持,

SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;

In [26]:
pd.merge(df1, df2, on='key', how='outer')


Out[26]:
key value_x value_y
0 A -0.674657 NaN
1 B 0.042590 0.266005
2 C 1.698921 NaN
3 D 0.330982 -0.020091
4 D 0.330982 -1.484913
5 E NaN -0.758915

UNION

SQL的UNION ALL在pandas中是concat()


In [27]:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                     'rank': range(1, 4)})
 
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                     'rank': [1, 4, 5]})
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/

In [28]:
pd.concat([df1, df2])


Out[28]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5

SQL的UNION与UNION ALL类似,不过UNION会剔除重复行

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/

pandas需要用concat()+drop_duplicates()方法实现


In [29]:
pd.concat([df1, df2]).drop_duplicates()


Out[29]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5

每组Top N 问题

SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

In [30]:
tips.assign(rn=tips.sort_values(['total_bill'], ascending=False).groupby(
    ['day']).cumcount() + 1).query('rn < 3').sort_values(['day', 'rn'])


Out[30]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2

也可以用rank(method=’first’)函数


In [31]:
tips.assign(rnk=tips.groupby(['day'])['total_bill'].rank(
    method='first', ascending=False)).query('rnk < 3').sort_values(['day', 'rnk'])


Out[31]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0

找出每组消费低于2美金(tips < 2)的前两名(rank < 3),这里用rank(method='min')

SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;

In [32]:
tips[tips['tip'] < 2].assign(rnk_min=tips.groupby(['sex'])['tip'].rank(
    method='min')).query('rnk_min < 3').sort_values(['sex', 'rnk_min'])


Out[32]:
total_bill tip sex smoker day time size rnk_min
67 3.07 1.00 Female Yes Sat Dinner 1 1.0
92 5.75 1.00 Female Yes Fri Dinner 2 1.0
111 7.25 1.00 Female No Sat Dinner 1 1.0
236 12.60 1.00 Male Yes Sat Dinner 2 1.0
237 32.83 1.17 Male Yes Sat Dinner 2 2.0

UPDATE

UPDATE tips
SET tip = tip*2
WHERE tip < 2;

In [33]:
tips.loc[tips['tip'] < 2, 'tip'] *= 2

DELETE

DELETE FROM tips
WHERE tip > 9;

pandas选择需要的数据,创建新Dataframe,而不是删除数据


In [34]:
tips = tips.loc[tips['tip'] <= 9]

参考资料