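Data source
This notebook was written before reading the referenced article, building on Titanic baby step for pandas.
The plan is to run the analysis once with the approach I already know, then read the article and note what could be improved and which analysis methods I had not thought of.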
In [1]:
import re
import pandas as pd
import numpy as np
df = pd.read_csv('data/job_market.csv')
%matplotlib inline
In [2]:
df.columns
Out[2]:
In [3]:
df.head()
Out[3]:
In [4]:
df.describe()
Out[4]:
In [5]:
(df[['salary_min', 'salary_max']] / 1000).describe()
Out[5]:
Drop salaries that are not above 100 and check whether the remaining range looks reasonable
In [6]:
(df[df['salary_min'] > 100]['salary_min'] / 1000).describe()
Out[6]:
In [7]:
df['salary_min'] = df['salary_min'].apply(lambda x: x if x > 100 else None)
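The same cleanup can be written without a Python-level lambda. The following is a minimal sketch on made-up numbers (the `salaries` series is hypothetical, not taken from the dataset) using `Series.where`, which keeps values where the condition holds and fills NaN elsewhere.

```python
import pandas as pd

# Hypothetical salaries; values at or below 100 are treated as noise
salaries = pd.Series([0.0, 45.0, 100.0, 25000.0, 40000.0], name='salary_min')

# Keep values above 100, replace everything else with NaN
cleaned = salaries.where(salaries > 100)

# describe() skips NaN, so the summary only covers plausible salaries
print(cleaned.describe())
```

Because NaN values are ignored by `describe()` and `hist()`, the later cells need no extra filtering.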
In [8]:
df[['salary_min', 'salary_max']].hist(bins=25, figsize=(20, 4))
Out[8]:
In [9]:
cols = ['salary_min', 'salary_max', 'title', 'description', 'applications', 'location']
# DataFrame.sort was removed from pandas; sort_values is the replacement
df[df['salary_min'] > df['salary_min'].mean()].sort_values('salary_min', ascending=False)[:10][cols]
Out[9]:
This result shows that there are some duplicate rows, so we aggregate by title/description/location
In [10]:
grouped = df.groupby(['title', 'description', 'location'], as_index=False)
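df2 = grouped.agg({
    'salary_min': 'mean',
    'salary_max': 'mean',
    'applications': 'sum',
})
# Number of rows merged into each group. size() on a plain groupby returns a
# Series in the same sorted group order as df2 (with as_index=False, newer
# pandas versions return a DataFrame from size() instead).
df2['count'] = df.groupby(['title', 'description', 'location']).size().values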
df2.head()
Out[10]:
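As a possible alternative to adding the count column separately, named aggregation can produce the means, the sum and the group size in one call. This is only a sketch on a tiny hypothetical table (`jobs` and its values are invented for illustration), assuming a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Hypothetical postings; the first two rows are the same job listed twice
jobs = pd.DataFrame({
    'title': ['Dev', 'Dev', 'Analyst'],
    'description': ['Python dev', 'Python dev', 'SQL analyst'],
    'location': ['London', 'London', 'Leeds'],
    'salary_min': [40000.0, 50000.0, 30000.0],
    'applications': [3, 5, 2],
})

deduped = (
    jobs.groupby(['title', 'description', 'location'])
        .agg(
            salary_min=('salary_min', 'mean'),     # average the duplicated salaries
            applications=('applications', 'sum'),  # total applications per job
            count=('salary_min', 'size'),          # how many rows were merged
        )
        .reset_index()
)
print(deduped)
```

Computing `count` inside the same `agg` call avoids relying on two separate groupby results having the same row order.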
In [11]:
df2[['salary_min', 'salary_max']].hist(bins=25, figsize=(20, 4))
Out[11]:
In [12]:
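# sort_values replaces the removed DataFrame.sort; copy() lets us add columns to df3 later without a SettingWithCopyWarning
df3 = df2[df2['salary_min'] > df2['salary_min'].mean()].sort_values('salary_min', ascending=False).copy()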
df3[:10]
Out[12]:
In [13]:
df3.describe()
Out[13]:
In [14]:
df3['location'].describe()
Out[14]:
In [15]:
df3.groupby('location').size().plot(kind='bar', figsize=(20, 4))
Out[15]:
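High-paying positions are heavily concentrated in London, followed by Manchester and Cambridge
Salary also depends a lot on the specific job requirements, so let's try to extract the technical requirements from each posting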
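To read the ranking as numbers rather than from the bar chart, `value_counts` returns the counts already sorted in descending order. A small sketch on hypothetical locations (the values below are invented, not the real counts):

```python
import pandas as pd

# Hypothetical locations of high-paying postings
locations = pd.Series(
    ['London', 'London', 'Cambridge', 'London', 'Manchester', 'Manchester'],
    name='location',
)

# Counts per location, most frequent first
counts = locations.value_counts()
print(counts)
```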
In [16]:
keywords = """JAVA
C++
C#
C
iOS
Python
DevOps
Linux
Architect
Unix
Data Scientist
Manager
Big Data
Hadoop
JAVAScript
Senior
Ruby
Perl
PHP
Objective-C
Administrator
Full Stack
Test
Network
Automation
Consultant
Analytics
Git
jQuery
.NET"""
# Substring matching is imprecise: a description that mentions C++ also matches C
keyword_list = keywords.splitlines()
df3['keywords'] = df3['description'].apply(
    lambda text: [k for k in keyword_list if k.lower() in text.lower()])
df3.head()
Out[16]:
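One way to reduce the false matches is to require that a keyword is not glued to neighbouring word characters. The helper below is a sketch under my own assumptions, not part of the original analysis: `find_keywords` is a hypothetical name, and the choice of which neighbouring characters to reject (`+`, `#`, `-`) is a guess that fits this keyword list.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as standalone tokens."""
    found = []
    for k in keywords:
        # Guard the left side only when the keyword starts with a letter or
        # digit, so that ".NET" can still match inside "ASP.NET".
        left = r'(?<![\w-])' if k[0].isalnum() else ''
        # On the right, also reject '+' and '#' so "C" does not match "C++" or "C#".
        pattern = left + re.escape(k) + r'(?![\w+#])'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(k)
    return found

sample = "Senior C++ and JavaScript developer, ASP.NET a plus"
matched = find_keywords(sample, ['C', 'C++', 'C#', 'JAVA', 'JAVAScript', '.NET', 'Senior'])
print(matched)  # ['C++', 'JAVAScript', '.NET', 'Senior']
```

In the notebook this could be applied with something like `df3['description'].apply(lambda text: find_keywords(text, keywords.splitlines()))`. It is still heuristic: keywords such as Test or Manager will match in contexts that have nothing to do with the role.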
Relate each keyword to salary and compare the distributions
In [17]:
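# Build one (keyword, salary, location) record per keyword found in each posting.
# df3 only holds rows whose salary_min is above the mean, so no missing-salary
# check is needed here.
records = []
for _, row in df3.iterrows():
    for k in row['keywords']:
        records.append((k, row['salary_min'], row['location']))
# Creating the frame in one go keeps 'salary' numeric and avoids slow row-by-row growth
df4 = pd.DataFrame(records, columns=['keyword', 'salary', 'location'])
df4.boxplot('salary', by='keyword', figsize=(50, 4))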
Out[17]:
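The loop above turns a list-valued column into one row per keyword. Assuming pandas 0.25 or later, `DataFrame.explode` does the same reshaping directly. The sketch below uses a hypothetical `postings` table with invented values.

```python
import pandas as pd

# Hypothetical postings, each with a list of matched keywords
postings = pd.DataFrame({
    'salary_min': [50000.0, 70000.0, 60000.0],
    'location': ['London', 'Leeds', 'London'],
    'keywords': [['Python', 'Linux'], ['Python'], []],
})

long_form = (
    postings.explode('keywords')            # one row per list element
            .dropna(subset=['keywords'])    # an empty list becomes NaN; drop it
            .rename(columns={'keywords': 'keyword', 'salary_min': 'salary'})
            .reset_index(drop=True)
)

# Mean salary per keyword, as a numeric companion to the boxplot
mean_salary = long_form.groupby('keyword')['salary'].mean()
print(mean_salary)
```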
In [18]:
languages = ['.NET', 'C', 'C++', 'C#', 'JAVA', 'JAVAScript', 'PHP', 'Perl', 'Python', 'Ruby', 'iOS']
df4.query("keyword in %s" % languages).boxplot('salary', by='keyword', figsize=(20, 6))
Out[18]:
Positions mentioning iOS, Ruby, C++ or C have the highest average salaries
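A boxplot draws the median rather than the mean, so a claim about averages is easier to check against a small summary table. This sketch uses hypothetical keyword/salary pairs (not the real data) to show how the two statistics can rank keywords differently.

```python
import pandas as pd

# Hypothetical long-form table: one row per (keyword, salary) pair
pairs = pd.DataFrame({
    'keyword': ['Ruby', 'Ruby', 'PHP', 'PHP', 'PHP'],
    'salary': [60000.0, 80000.0, 40000.0, 45000.0, 95000.0],
})

summary = (
    pairs.groupby('keyword')['salary']
         .agg(['mean', 'median', 'count'])     # count shows how many postings back each figure
         .sort_values('median', ascending=False)
)
print(summary)
```

The `count` column matters here: a keyword with only a handful of postings can top the ranking by chance.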
In [19]:
df3.boxplot('salary_min', by='location', figsize=(50, 6))
# The figure can be enlarged
Out[19]:
Salaries in the USA are far ahead of the rest
The relationship between applications and location or programming language can be shown with boxplots in the same way
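For a tabular view of the same idea, `pivot_table` can summarise applications by location and keyword at once. This is a sketch on invented rows; the column names follow the long-form keyword table used earlier, and the `applications` values are hypothetical.

```python
import pandas as pd

# Hypothetical long-form rows: one per (location, keyword) occurrence
rows = pd.DataFrame({
    'location': ['London', 'London', 'London', 'Leeds'],
    'keyword': ['Python', 'Python', 'Ruby', 'Python'],
    'applications': [10, 20, 5, 4],
})

# Median applications for each location/keyword combination;
# combinations that never occur show up as NaN
table = rows.pivot_table(values='applications', index='location',
                         columns='keyword', aggfunc='median')
print(table)
```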