Job Market with Pandas Part 1

数据引用
本文是看这篇文章之前在 Titanic baby step for pandas 基础上写的
按照已有思路分析一次后再看文章内容,学习可以改进的地方/没想到的分析方法


In [1]:
import re
import pandas as pd
import numpy as np
df = pd.read_csv('data/job_market.csv')
%matplotlib inline

In [2]:
df.columns


Out[2]:
Index(['salary_min', 'description', 'title', 'salary_max', 'applications', 'page_number', 'location', 'published', 'link', 'found', 'id'], dtype='object')

In [3]:
df.head()


Out[3]:
salary_min description title salary_max applications page_number location published link found id
0 NaN WDF, Python Developer Data Focused with Java... WDF, Python Developer Data Focused with Java, ... NaN 0 1 London 2014-12-22T21:54:34.049654 http://www.reed.co.uk/jobs/wdf-python-develope... 2014-12-26T21:54:34.050102 26242024
1 22000 Drupal Developer – Git / JavaScript / jQuery... URGENT - Drupal Developer - JavaScript / Pytho... 30000 2 1 Leeds 2014-12-22T21:54:34.051957 http://www.reed.co.uk/jobs/urgent-drupal-devel... 2014-12-26T21:54:34.052319 26241429
2 25000 Python Developer Software House - Crewe ... Python Developer - Django / PostgreSQL / HTML ... 35000 0 1 Crewe 2014-12-22T21:54:34.054197 http://www.reed.co.uk/jobs/python-developer-dj... 2014-12-26T21:54:34.054577 26240860
3 18000 Graduate Software Developer (.NET, Python , ... Graduate Software Developer (.NET, Python, PHP... 22000 7 1 Colchester 2014-12-22T21:54:34.056743 http://www.reed.co.uk/jobs/graduate-software-d... 2014-12-26T21:54:34.057299 26014417
4 20000 Graduate Software Developer – Python / Jav... Software Developer - Python/Java - Bangor, Gwy... 22000 3 1 Bangor 2014-12-22T21:54:34.060309 http://www.reed.co.uk/jobs/software-developer-... 2014-12-26T21:54:34.060867 26238754

In [4]:
df.describe()


Out[4]:
salary_min salary_max applications page_number id
count 413.000000 413.000000 615.000000 615.000000 615.000000
mean 46540.220339 59689.428571 4.964228 12.804878 26087345.180488
std 37954.360722 43279.344537 9.716269 7.109725 218523.769873
min 27.000000 22000.000000 0.000000 1.000000 23220057.000000
25% 27000.000000 36000.000000 1.000000 7.000000 26036938.000000
50% 35000.000000 50000.000000 2.000000 13.000000 26126676.000000
75% 50000.000000 65000.000000 5.000000 19.000000 26206711.500000
max 240000.000000 276000.000000 100.000000 25.000000 26254367.000000

describe 表现

  • 总样本 615 个 只有 413 个职位提供了薪资数据
  • salary_min 最小为 27 远远小于 25%/50% 处的值 不和常理 需要去掉
  • salary_min/salary_max 最大值为 75% 处的 5 倍多
  • salary_max 在 min/25%/50%/75% 处的值没有太大起伏 数据正常
  • 每份工作的平均投递人数为 5, 但是标准差为 10 说明不同职位间有较大差别

分析方法

  • salary_min/salary_max 工资区间
  • description 信息量杂 抛弃
  • title 包含 Python/Data/JAVA 等关键字 可用来统计语言需求
  • applications 应聘人数
  • page_number 未知
  • location 地区 地区分布 地区和薪资的关系
  • published 发布时间
  • link/found/id 无用信息

工资区间


In [5]:
(df[['salary_min', 'salary_max']] / 1000).describe()


Out[5]:
salary_min salary_max
count 413.000000 413.000000
mean 46.540220 59.689429
std 37.954361 43.279345
min 0.027000 22.000000
25% 27.000000 36.000000
50% 35.000000 50.000000
75% 50.000000 65.000000
max 240.000000 276.000000

将工资小于 100 的去掉查看范围是否正常


In [6]:
(df[df['salary_min'] > 100]['salary_min'] / 1000).describe()


Out[6]:
count    412.000000
mean      46.653117
std       37.931013
min       15.000000
25%       27.000000
50%       35.000000
75%       50.000000
max      240.000000
Name: salary_min, dtype: float64

In [7]:
df['salary_min'] = df['salary_min'].apply(lambda x: x if x > 100 else None)

In [8]:
df[['salary_min', 'salary_max']].hist(bins=25, figsize=(20, 4))


Out[8]:
array([[<matplotlib.axes.AxesSubplot object at 0x7fcd9df33350>,
        <matplotlib.axes.AxesSubplot object at 0x7fcd9df07d90>]], dtype=object)

In [9]:
cols = ['salary_min', 'salary_max', 'title', 'description', 'applications', 'location']
df[df['salary_min'] > df['salary_min'].mean()].sort('salary_min', ascending=False)[:10][cols]


Out[9]:
salary_min salary_max title description applications location
537 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 2 Berkhamsted
540 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 2 Berkhamsted
538 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 1 Berkhamsted
536 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 3 Berkhamsted
534 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 0 Berkhamsted
533 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 1 Berkhamsted
532 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 1 Berkhamsted
531 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 0 Berkhamsted
530 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 1 Berkhamsted
539 240000 276000 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... 2 Berkhamsted

从这个结果看来,有有些重复数据,我们按照 title/description/location 聚合一下


In [10]:
grouped = df.groupby(['title', 'description', 'location'], as_index=False)
df2 = grouped.agg({
    'salary_min': 'mean',
    'salary_max': 'mean',
    'applications': 'sum',
})
df2['count'] = grouped.size().values
df2.head()


Out[10]:
title description location applications salary_max salary_min count
0 "Linux Systems Administrator" Linux Systems Administrator Opportunity This ... Cornwall 1 55000 40000 1
1 .NET Senior Developer - Bristol - £50,000 Ben... .NET Senior Developer - Bristol - £50,000 + ... Bristol 4 50000 35000 2
2 .NET Senior Developer - Bristol - £50,000 Ben... .NET Senior Developer - Bristol - £50,000 + ... Thornbury 2 50000 40000 1
3 AWS Cloud Engineer - Bleeding Edge Tech! AWS Cloud Engineer - Bleeding Edge Tech A f... Hatfield 2 60000 40000 1
4 AWS Cloud Engineer - Cutting Edge Technology! AWS Cloud Engineer - Cutting Edge Technology... Hatfield 1 60000 40000 1

重新看工资区间和工资高于平均值的职位信息


In [11]:
df2[['salary_min', 'salary_max']].hist(bins=25, figsize=(20, 4))


Out[11]:
array([[<matplotlib.axes.AxesSubplot object at 0x7fcd9dc2c190>,
        <matplotlib.axes.AxesSubplot object at 0x7fcd9ddb3790>]], dtype=object)

In [12]:
df3 = df2[df2['salary_min'] > df2['salary_min'].mean()].sort('salary_min', ascending=False)
df3[:10]


Out[12]:
title description location applications salary_max salary_min count
264 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... Berkhamsted 13 276000 240000 10
353 Senior Analytics Specialist Organisation Description There is no denying... Australia 1 170000 155000 1
364 Senior Data Scientist / Manager Senior Data Scientist / Manager New York /... USA 4 200000 150000 1
411 Software Developer - C & Linux/Unix - San Fran... Software Developer - C & Linux/Unix - San Fran... USA 8 150000 135000 3
37 C++ Developer. SQL. From UK relocate to New York C++ Developer / Programmer / Software Engineer... London 5 150000 130000 1
6 Algorithm Engineer Algorithm Engineer New York City, New York $... USA 2 150000 120000 1
106 E-Trading DMA Support, ORC, RTS, Linux, RedHat 2nd / 3rd Line E-Trading DMA / Market Data Su... London 0 138000 120000 1
419 Software Developer - iOS, C/C++ and/or Java - ... Software Developer - iOS, C/C++ and/or Java - ... USA 9 130000 120000 1
403 Software Architect - DevOps, Linux, Java, Pyth... My exclusive client are recruiting for a stro... Manchester 2 120000 108000 1
207 Linux Software Architect - Big Data Architect, Linux, DevOps, Java, C++, Python ... Manchester 1 120000 108000 1

In [13]:
df3.describe()


Out[13]:
applications salary_max salary_min count
count 119.000000 119.000000 119.000000 119.000000
mean 3.285714 82013.445378 66272.268908 1.235294
std 4.300459 33331.491000 29113.392935 0.988972
min 0.000000 50000.000000 43000.000000 1.000000
25% 1.000000 60000.000000 48000.000000 1.000000
50% 2.000000 70000.000000 55000.000000 1.000000
75% 5.000000 96000.000000 72000.000000 1.000000
max 30.000000 276000.000000 240000.000000 10.000000

In [14]:
df3['location'].describe()


Out[14]:
count        119
unique        34
top       London
freq          56
Name: location, dtype: object

In [15]:
df3.groupby('location').size().plot(kind='bar', figsize=(20, 4))


Out[15]:
<matplotlib.axes.AxesSubplot at 0x7fcd9ddc6cd0>

可以看到高薪职位大量集中在 London, Manchester/Cambridge 次之
工资同具体职位要求也有很大关系,我们试着对职位技术需求处理一下


In [16]:
keywords = """JAVA
C++
C#
C
iOS
Python
DevOps
Linux
Architect
Unix
Data Scientist
Manager
Big Data
Hadoop
JAVAScript
Senior
Ruby
Perl
PHP
Objective-C
Administrator
Full Stack
Test
Network
Automation
Consultant
Analytics
Git
jQuery
.NET"""
# 这里的关键字识别有误差,如 C++ 可能会被识别为 C
df3['keywords'] = df3['description'].apply(lambda x: list(filter(lambda k: k.lower() in x.lower(), keywords.splitlines())))
df3.head()


Out[16]:
title description location applications salary_max salary_min count keywords
264 Payroll Superstar! PAYROLL SUPERSTAR £20,000 - £23,000 Stuck in ... Berkhamsted 13 276000 240000 10 [C, Python]
353 Senior Analytics Specialist Organisation Description There is no denying... Australia 1 170000 155000 1 [C, Test, Analytics]
364 Senior Data Scientist / Manager Senior Data Scientist / Manager New York /... USA 4 200000 150000 1 [C, Data Scientist, Manager, Senior]
411 Software Developer - C & Linux/Unix - San Fran... Software Developer - C & Linux/Unix - San Fran... USA 8 150000 135000 3 [C, Linux, Unix]
37 C++ Developer. SQL. From UK relocate to New York C++ Developer / Programmer / Software Engineer... London 5 150000 130000 1 [C++, C]

将 keywords 与 salary 平均值做关联比较


In [17]:
df4 = pd.DataFrame(columns=['keyword', 'salary', 'location'])
df_cursor = 0
for i, row in df3.iterrows():
    # if not row['salary_min']:
    #     continue
    for k in row['keywords']:
        df4.loc[df_cursor] = [k, row['salary_min'], row['location']]
        df_cursor += 1
df4.boxplot('salary', by='keyword', figsize=(50, 4))


Out[17]:
<matplotlib.axes.AxesSubplot at 0x7fcd9d94f390>

用 boxplot 查看编程语言和最低工资的关系


In [18]:
languages = ['.NET', 'C', 'C++', 'C#', 'JAVA', 'JAVAScript', 'PHP', 'Perl', 'Python', 'Ruby', 'iOS']
df4.query("keyword in %s" % languages).boxplot('salary', by='keyword', figsize=(20, 6))


Out[18]:
<matplotlib.axes.AxesSubplot at 0x7fcd9d8ca310>

iOS/Ruby/C++/C 职位的平均工资是最高的

地区和工资的关系


In [19]:
df3.boxplot('salary_min', by='location', figsize=(50, 6))
# 图可以放大


Out[19]:
<matplotlib.axes.AxesSubplot at 0x7fcd9af756d0>

USA 的工资遥遥领先
applications 和 location/编程语言的关系也可以用同样的方法用 boxplot 表示出来

结论

  • 数据集中的 salary_min 有一处数据错误
  • 最低工资集中在 20-50k
  • 工资水平高于平均的职位基本都在 London
  • 编程语言中 iOS/Ruby/C++/C 的工资是最高的
  • USA 职位提供的工资超高,London 次之
  • ....

总结分析过程

  • 数据去重
  • 通过 describe 查看数据是否有异常
  • 通过 hist 查看单个字段的分布区间
  • 通过 boxplot 查看两个变量之间的分布关系

方法参考了 Titanic baby step for pandas Part 1 一文