Exploring Job Data with Pandas


In [2]:
import pandas as pd
%matplotlib inline

In [4]:
job_data = pd.read_csv("../web-scraping/jobs-data.csv")
job_data.head()


Out[4]:
company_link company_name job_link job_location job_summary job_title
0 http://www.indeed.com/cmp/Allegheny-General-Ho... Allegheny General Hospital http://www.indeed.com/rc/clk?jk=805f6ed65b49dd... Pittsburgh, PA Allegheny Health Network’s clinical expertise ... Research Data Analyst - Neurosurgery
1 http://www.indeed.com/cmp/Marathon-Oil Marathon Petroleum Corporation http://www.indeed.com/rc/clk?jk=b28767d68e08b1... Findlay, OH The vision for the Advanced Analytics team is ... Data Scientist
2 http://www.indeed.com/cmp/Oracle Oracle http://www.indeed.com/rc/clk?jk=56e6eae3c307f9... United States The Cloud Data Curation team is looking for Sc... Data Scientist 5
3 http://www.indeed.com/cmp/Google Google http://www.indeed.com/rc/clk?jk=35f4fa4d806cc9... Mountain View, CA From creating experiments and prototyping impl... Research Scientist, Google Brain (United States)
4 http://www.indeed.com/cmp/Childrens-Hospital-L... Childrens Hospital Los Angeles http://www.indeed.com/rc/clk?jk=ff4d4e425d9c76... Los Angeles, CA The Data Scientist conducts research in medica... Data Scientist, VPICU

With the data loaded, we can inspect the column types and dimensions, then count postings by title, company, and location.


In [5]:
job_data.dtypes


Out[5]:
company_link    object
company_name    object
job_link        object
job_location    object
job_summary     object
job_title       object
dtype: object

In [ ]:
job_data.shape

In [ ]:
job_data['job_title'].value_counts(ascending=False)

In [ ]:
job_data['company_name'].value_counts(ascending=False)

In [ ]:
job_data['job_location'].value_counts(ascending=False)

In [ ]:
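# Capture the two-character code after ", " (e.g. "PA"); locations without one, such as "United States", become NaN.
# The raw string keeps \w from being treated as an invalid escape sequence.
job_data['state'] = job_data['job_location'].str.extract(r', (\w{2})', expand=False)
job_data.head()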
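
The pattern above keeps only the two characters that follow a comma and a space, so a location with no such suffix produces a missing value. The following is a minimal sketch on made-up locations (not rows from jobs-data.csv), assuming the real column follows the same "City, ST" shape:

```python
import pandas as pd

# Hypothetical sample locations, invented for illustration.
locations = pd.Series([
    "Pittsburgh, PA",
    "Findlay, OH",
    "United States",
    "Mountain View, CA 94043",
])

# expand=False returns a Series rather than a one-column DataFrame.
states = locations.str.extract(r', (\w{2})', expand=False)

# Rows without a ", XX" suffix come back as NaN, so count them separately.
missing = int(states.isna().sum())
print(states.tolist())
print("locations without a state:", missing)
```

Note that \w also matches digits and underscores; a stricter pattern such as r', ([A-Z]{2})' may be preferable if the location strings are noisy.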

In [ ]:
ax = job_data['state'].value_counts(ascending=True).plot(kind="barh", figsize=(10,10), xlim=(0,450))
# add counts as annotations
# http://stackoverflow.com/questions/23591254/python-pandas-matplotlib-annotating-labels-above-bar-chart-columns
for p in ax.patches:
    ax.annotate("%d" % p.get_width(), (p.get_x() + p.get_width(), p.get_y()), xytext=(0, 0), textcoords='offset points')

In [ ]:
job_data[job_data['job_location'].str.contains("Pittsburgh")]
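
Filtering with str.contains can fail when the column holds missing values, because the resulting mask may then contain NaN instead of True or False. A hedged sketch with invented rows, passing na=False so that a missing location counts as a non-match:

```python
import pandas as pd

# Invented rows for illustration only.
jobs = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Analyst", "Research Scientist"],
    "job_location": ["Pittsburgh, PA", None, "Mountain View, CA"],
})

# na=False treats a missing location as "does not match".
mask = jobs["job_location"].str.contains("Pittsburgh", na=False)
pittsburgh_jobs = jobs[mask]
print(pittsburgh_jobs)
```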

Analyzing Company Information


In [7]:
company_data = pd.read_csv("company-data.csv")
company_data.head()


Out[7]:
compensation_benefits_rating culture_rating js_advancement_rating management_rating overall_rating url wl_balanace_rating
0 3.9 3.5 3.3 3.4 3.9 http://www.indeed.com/cmp/Population-Council 3.8
1 3.6 3.8 3.2 3.9 4.0 http://www.indeed.com/cmp/Ensco,-Inc. 4.1
2 3.9 3.6 3.3 3.4 3.8 http://www.indeed.com/cmp/General-Dynamics-Inf... 3.7
3 3.2 3.8 2.9 3.4 3.5 http://www.indeed.com/cmp/Mintel 3.7
4 3.8 3.8 3.6 3.5 4.0 http://www.indeed.com/cmp/ADP 3.8

In [8]:
company_data.dtypes


Out[8]:
compensation_benefits_rating    float64
culture_rating                  float64
js_advancement_rating           float64
management_rating               float64
overall_rating                  float64
url                              object
wl_balanace_rating              float64
dtype: object

In [9]:
company_data.shape


Out[9]:
(399, 7)

In [10]:
company_data.describe()


Out[10]:
       compensation_benefits_rating  culture_rating  js_advancement_rating  management_rating  overall_rating  wl_balanace_rating
count                    399.000000      399.000000             399.000000         399.000000      399.000000          399.000000
mean                       3.701504        3.680201               3.342607           3.421805        3.883208            3.782206
std                        0.480941        0.486898               0.510532           0.502807        0.438297            0.488315
min                        2.000000        1.600000               1.500000           1.500000        1.700000            1.000000
25%                        3.400000        3.400000               3.100000           3.200000        3.700000            3.550000
50%                        3.800000        3.700000               3.400000           3.400000        4.000000            3.800000
75%                        4.000000        4.000000               3.600000           3.700000        4.200000            4.100000
max                        5.000000        5.000000               5.000000           5.000000        5.000000            5.000000
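
describe() reports quartiles but not how many companies fall into each rating band. As a sketch, the counts per band can be tabulated with pd.cut. The ratings and the one-point-wide bin edges below are invented for illustration and are not taken from company-data.csv:

```python
import pandas as pd

# Invented ratings on a 1-5 scale.
ratings = pd.DataFrame({
    "overall_rating": [3.9, 4.0, 3.8, 3.5, 4.0, 2.4],
    "culture_rating": [3.5, 3.8, 3.6, 3.8, 3.8, 1.9],
})

# Assumed bin edges; include_lowest=True keeps a rating of exactly 1.0.
edges = [1, 2, 3, 4, 5]
binned = {
    col: pd.cut(ratings[col], bins=edges, include_lowest=True).value_counts(sort=False)
    for col in ratings.columns
}

# One row per rating band, one column per rating type.
counts = pd.DataFrame(binned)
print(counts)
```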

In [11]:
company_data['overall_rating'].hist()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x107278f28>

In [12]:
company_data['culture_rating'].hist()


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x10757e668>

In [13]:
company_data['compensation_benefits_rating'].hist()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1076a4128>

In [14]:
company_data['management_rating'].hist()


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1076b5438>

In [ ]:
# company_data must already be loaded (see the read_csv cell above); running this first raises NameError.
company_data['js_advancement_rating'].hist()
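
The company_link column in job_data and the url column in company_data both appear to hold Indeed company page URLs, so the two tables could be joined on them. This is a sketch under that assumption, using made-up rows and ratings; whether the real files match cleanly is not verified here:

```python
import pandas as pd

# Made-up rows; the real CSV files are not read here.
jobs = pd.DataFrame({
    "company_name": ["Oracle", "Google", "Acme"],
    "company_link": [
        "http://www.indeed.com/cmp/Oracle",
        "http://www.indeed.com/cmp/Google",
        None,
    ],
    "job_title": ["Data Scientist 5", "Research Scientist", "Analyst"],
})
ratings = pd.DataFrame({
    "url": ["http://www.indeed.com/cmp/Oracle", "http://www.indeed.com/cmp/ADP"],
    "overall_rating": [3.8, 4.0],  # invented values
})

# A left join keeps every job posting; postings without a matching company get NaN.
merged = jobs.merge(ratings, how="left", left_on="company_link", right_on="url")
print(merged[["company_name", "job_title", "overall_rating"]])
```

A left join is used so that postings from companies without a ratings row are kept rather than silently dropped.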
