Job Market with Pandas Part 2

Data reference

This part builds on Part 1 and records what I learned after reading the article.


In [1]:
import re
import pandas as pd
import numpy as np
df = pd.read_csv('data/job_market.csv')
df = df.drop(['link', 'description', 'page_number'], axis=1)
%matplotlib inline

Top N locations across all job postings


In [2]:
df.location.value_counts().head(10)


Out[2]:
London         203
Cambridge       42
Bristol         19
Reading         19
Manchester      16
Berkhamsted     15
Devon           15
Oxford          11
Hatfield         8
USA              8
dtype: int64

Average number of applications per posting in these locations


In [3]:
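# The ten most frequent locations, then only the postings in those locations
top_locations = df.location.value_counts().head(10)
df2 = df[df.location.isin(top_locations.index)]
grouped = df2.groupby('location')
# Total applications divided by the number of postings in each location
grouped.applications.sum() / grouped.id.count()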


Out[3]:
location
Berkhamsted    2.333333
Bristol        4.157895
Cambridge      2.523810
Devon          7.666667
Hatfield       3.000000
London         7.935961
Manchester     2.125000
Oxford         4.181818
Reading        3.631579
USA            5.375000
dtype: float64
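
The ratio in In [3] divides total applications by the number of rows per group. The sketch below uses made-up rows (not taken from job_market.csv) to illustrate an assumption worth checking: the ratio should agree with `groupby(...).mean()` only when `applications` has no missing values, because `sum()` and `mean()` skip NaN while `id.count()` still counts the row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the Oxford group has one missing applications value
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'location': ['London', 'London', 'Oxford', 'Oxford'],
    'applications': [4.0, 8.0, 6.0, np.nan],
})

grouped = toy.groupby('location')

# Same expression as In [3]: NaN is skipped in the sum but the row is still counted
ratio = grouped.applications.sum() / grouped.id.count()

# mean() ignores the NaN row in both numerator and denominator
mean = grouped.applications.mean()

print(ratio)
print(mean)
```

If the two results differ on the real data, some postings have a missing `applications` value.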

Mean of the maximum salary by location


In [4]:
df2.groupby('location').salary_max.mean()


Out[4]:
location
Berkhamsted    203714.285714
Bristol         52352.941176
Cambridge       46720.000000
Devon                    NaN
Hatfield        60000.000000
London          62846.376812
Manchester      74733.333333
Oxford          56666.666667
Reading         41642.857143
USA            154285.714286
Name: salary_max, dtype: float64
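
Devon shows NaN in Out[4]. That is what `mean()` returns when every `salary_max` in a group is missing, which is presumably the case for Devon (the source does not show those rows). A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: no Devon row has a salary, one Reading row does
toy = pd.DataFrame({
    'location': ['Devon', 'Devon', 'Reading', 'Reading'],
    'salary_max': [np.nan, np.nan, 40000.0, np.nan],
})

by_location = toy.groupby('location').salary_max

# mean() skips NaN; a group with nothing left to average yields NaN
means = by_location.mean()

# count() reports how many non-missing salaries each mean is based on
counts = by_location.count()

print(means)
print(counts)
```

Checking `count()` next to `mean()` also shows how few postings a given average rests on.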

Jobs with no applications


In [5]:
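# .copy() gives df3 its own data, so adding columns later does not raise SettingWithCopyWarning
df3 = df.query("applications == 0")[['title', 'location', 'salary_min', 'salary_max', 'found', 'published']].copy()
df3.head()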


Out[5]:
   title                                              location  salary_min  salary_max  found                       published
0  WDF, Python Developer Data Focused with Java, ...  London           NaN         NaN  2014-12-26T21:54:34.050102  2014-12-22T21:54:34.049654
2  Python Developer - Django / PostgreSQL / HTML ...  Crewe          25000       35000  2014-12-26T21:54:34.054577  2014-12-22T21:54:34.054197
5  BI Developer - Python - Disruptive Technology ...  London         45000       55000  2014-12-26T21:54:34.063716  2014-12-26T21:54:34.063350
6  Python Developer                                   London         40000       48000  2014-12-26T21:54:34.067126  2014-12-26T21:54:34.066736
9  BI Developer - Python - Disruptive Technology ...  London         45000       55000  2014-12-26T21:54:34.075229  2014-12-26T21:54:34.074823

In [6]:
df3.describe()


Out[6]:
          salary_min     salary_max
count      65.000000      65.000000
mean    51898.215385   66288.215385
std     39004.014627   44160.225622
min     20000.000000   28000.000000
25%     30000.000000   40000.000000
50%     45000.000000   55000.000000
75%     55000.000000   75000.000000
max    240000.000000  276000.000000
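
The `count` row of `describe()` reports non-missing values per column rather than the number of rows, which would explain why Out[6] reads 65 while Out[8] counts 101 postings. A minimal sketch on made-up salaries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle posting lists no salary
toy = pd.DataFrame({
    'salary_min': [25000.0, np.nan, 45000.0],
    'salary_max': [35000.0, np.nan, 55000.0],
})

summary = toy.describe()

# Number of rows versus number of rows that actually carry a salary
n_rows = len(toy)
n_with_salary = int(toy.salary_min.notna().sum())

print(summary.loc['count'])
print(n_rows, n_with_salary)
```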

In [7]:
# Parse the ISO timestamp strings, then subtract to get a timedelta per posting
df3['daysOn'] = pd.to_datetime(df3.found) - pd.to_datetime(df3.published)
df3.head()


Out[7]:
   title                                              location  salary_min  salary_max  found                       published                   daysOn
0  WDF, Python Developer Data Focused with Java, ...  London           NaN         NaN  2014-12-26T21:54:34.050102  2014-12-22T21:54:34.049654  4 days 00:00:00.000448
2  Python Developer - Django / PostgreSQL / HTML ...  Crewe          25000       35000  2014-12-26T21:54:34.054577  2014-12-22T21:54:34.054197  4 days 00:00:00.000380
5  BI Developer - Python - Disruptive Technology ...  London         45000       55000  2014-12-26T21:54:34.063716  2014-12-26T21:54:34.063350  0 days 00:00:00.000366
6  Python Developer                                   London         40000       48000  2014-12-26T21:54:34.067126  2014-12-26T21:54:34.066736  0 days 00:00:00.000390
9  BI Developer - Python - Disruptive Technology ...  London         45000       55000  2014-12-26T21:54:34.075229  2014-12-26T21:54:34.074823  0 days 00:00:00.000406

In [8]:
# Whole days from each timedelta, via the vectorized .dt accessor
df3['daysOnInt'] = df3['daysOn'].dt.days
df3.daysOnInt.describe()


Out[8]:
count    101.000000
mean      13.336634
std       12.210059
min        0.000000
25%        4.000000
50%        9.000000
75%       17.000000
max       46.000000
Name: daysOnInt, dtype: float64
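
A minimal sketch of the string-to-days conversion on its own, using two timestamp pairs copied from the rows shown above. `pd.to_datetime` and `Series.dt.days` are standard pandas calls; the `toy` frame and the `days_fractional` name are only for illustration.

```python
import pandas as pd

# Timestamp pairs taken from rows 0 and 5 of the output above
toy = pd.DataFrame({
    'found': ['2014-12-26T21:54:34.050102', '2014-12-26T21:54:34.063716'],
    'published': ['2014-12-22T21:54:34.049654', '2014-12-26T21:54:34.063350'],
})

# Parse the ISO strings and subtract to get one timedelta per row
days_on = pd.to_datetime(toy.found) - pd.to_datetime(toy.published)

# Whole days only; the sub-second remainder is dropped
days = days_on.dt.days

# Fractional days, if the remainder matters
days_fractional = days_on.dt.total_seconds() / 86400

print(days.tolist())
```

`.dt.days` drops everything below a whole day, so a posting found a few hours after publication counts as 0 days.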

In [9]:
df3.location.value_counts().head(10)


Out[9]:
London         30
Cambridge      13
Manchester      6
Southampton     3
Surrey          3
Bristol         2
Windsor         2
Hungerford      2
Leicester       2
Devon           2
dtype: int64