Here at the Data Analytics Club at MIT Sloan, we're big fans of the PyData stack. The MBA program brings in a lot of students with diverse backgrounds, including students with software development backgrounds and students with no coding experience at all. For students with software development experience, we've found that it's often more convenient to adopt a full-fledged programming language like Python and learn its growing data analysis capabilities. For students who are completely new to coding, Python is a great choice for a first language that also happens to have kick-ass data analysis capabilities.
So Python's low rank in Kaggle's 2011 chart of commonly used tools hurt our pride a bit, and we decided to dig in. In this notebook we'll look at:
- which languages are mentioned most often in Kaggle's winner interviews, and
- which tools Kaggle's top 1000 competitors list on their profiles.
We're going to do all the work necessary to answer these questions in this notebook. That includes scraping the raw data from Kaggle's website, mining it for tool mentions, and visualizing the results.
Here we're going to make use of Python's great web scraping capabilities to go out and get the raw data from Kaggle's website. If you don't care about scraping and want to go straight to the number crunching, then feel free to skip this section.
When I scrape data, I tend to do so in regular Python scripts rather than IPython notebooks, because I don't really need IPython's rich interactive features and I want to make sure that I save the data to my machine in case IPython crashes. However, for the purposes of this blog post, I'm going to show the scripts below as if I had written and run them in IPython.
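All of the scraping scripts below share the same basic pattern: fetch a page with requests, parse the HTML with BeautifulSoup, and pull out the elements we care about. Here's a minimal sketch of that pattern (the URL and tags are placeholders, not part of the actual pipeline):

import requests
from bs4 import BeautifulSoup

# fetch a page and parse the html (placeholder URL)
r = requests.get("http://example.com/some/page")
soup = BeautifulSoup(r.text)

# pull out the elements we care about, e.g. every link inside an h2
for h2 in soup.find_all("h2"):
    link = h2.find("a")
    if link is not None:
        print(link.get("href"), link.get_text())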
My data processing pipeline for extracting languages used by the winners consists of three scripts:
- file_saver.py identifies all the links to winner interviews from Kaggle's blog
- blog_miner.py reads in all the links found by file_saver.py and saves the text from each blog post
- keyword_counter.py reads in the json from blog_miner.py and keeps track of popular tools mentioned in each post
The data processing pipeline for looking at languages used by the top 1000 Kaggle users consists of the following steps:
- profile_finder.py finds the profile links for all kaggle competitors
- tool_extractor.py goes through a list of user ids (sorted by top rank first) and extracts the tools that each user has listed on their profile
Warning: some scripts take a long time to run and have to request a lot of data from Kaggle's servers... please don't run them unnecessarily!
In [ ]:
"""
file_saver.py
This script downloads pages of winner-interview links from
http://blog.kaggle.com/category/dojo/ and saves them as json.
Script 1 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import requests
from bs4 import BeautifulSoup
import json
import time

root_url = "http://blog.kaggle.com/category/dojo/page"
pages = range(1, 10)  # Currently there are 9 pages
elems = []
for page in pages:
    r = requests.get(root_url + "/" + str(page) + "/")
    soup = BeautifulSoup(r.text)
    elems += soup.find_all("h2", class_="entry-title")

# Convert to a list of dicts
link_dict = []
for i, elem in enumerate(elems):
    link = elem.find("a")
    temp_dict = {
        'id': i,
        'url': link.get('href'),
        'title': link.get('title'),
        'link_text': link.get_text()
    }
    link_dict.append(temp_dict)

# Save to json!
with open('links.json', 'w') as f:
    json.dump(link_dict, f)
In [ ]:
"""
blog_miner.py
Reads the list of blog post links and saves the text of each post.
Script 2 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import time
import json
from bs4 import BeautifulSoup
import requests
import re

# load up the list of blog posts
with open('links.json', 'r') as f:
    links = json.load(f)

# loop through the links and mine the text
new_links = []
for i, link in enumerate(links):
    print(i, link['url'])
    # Get the date information from the url
    date_search = re.search(r'\d{4}/\d{2}/\d{2}', link['url'])
    if date_search:
        link['date'] = date_search.group(0).replace('/', '-')
    else:
        link['date'] = None
    # Get the text of the blog post
    r = requests.get(link['url'])
    soup = BeautifulSoup(r.text)
    post = soup.find('div', class_='entry-content')
    link['text'] = post.text
    new_links.append(link)
    time.sleep(.1)

with open('links2.json', 'w') as f:
    json.dump(new_links, f)
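As a quick illustration, the date regex above just pulls the yyyy/mm/dd chunk out of a post's url (the url here is made up):

import re

url = "http://blog.kaggle.com/2013/05/06/some-winner-interview/"  # made-up url
date_search = re.search(r'\d{4}/\d{2}/\d{2}', url)
print(date_search.group(0).replace('/', '-'))  # -> 2013-05-06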
In [ ]:
"""
keyword_counter.py
Counts the number of times tools appear in Kaggle winner interviews.
Script 3 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import re
import json

tools = ['R', 'Matlab', 'SAS', 'Weka', 'SPSS',
         'Excel', 'C[+][+]', 'Mathematica', 'Stata', 'Java']
python_like = ['Python', '(?:sklearn)|(?:sci[-]kit)|(?:scikit)',
               'pandas', 'scipy', 'numpy']
other_tools = ['SAS', 'C',
               '(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)']
all_tools = tools + python_like + other_tools

# Make a list of all the tutorial ids, which we shouldn't count!
tutorial_ids = [35, 32, 29, 21, 20, 19, 11, 10]
print(len(tutorial_ids))

# open the data
with open('links2.json', 'r') as f:
    links = json.load(f)

new_links = []
# loop through all entries and extract keywords; each pattern is wrapped in a
# non-capturing group so the \s and \W boundaries apply to the whole alternation
for post in links:
    post['occurences'] = {k: len(re.findall(r'\s(?:' + k + r')\W', post['text'],
                                            flags=re.I)) for k in all_tools}
    # let's add some totals for convenience
    total_refs = sum([v for k, v in post['occurences'].items()
                      if k not in python_like])
    total_ref_classes = len([v for k, v in post['occurences'].items()
                             if v != 0 and k not in python_like])
    post['total_refs'] = total_refs
    post['total_ref_classes'] = total_ref_classes
    # tutorials don't count as winner interviews
    post['winner'] = post['id'] not in tutorial_ids
    new_links.append(post)

# save
with open('links3.json', 'w') as f:
    json.dump(new_links, f)

# now loop through all posts' occurences and build one big dict of counts
count_dict = {'total': {}, 'once': {},
              'proportion': {}, 'class_proportion': {}}
for post in new_links:
    if post['id'] in tutorial_ids:
        continue  # we only care about #WINNERS!
    for k, v in post['occurences'].items():
        count_dict['total'][k] = count_dict['total'].get(k, 0) + v
        if post['total_refs'] > 0:
            count_dict['proportion'][k] = count_dict['total'].get(k, 0) / post['total_refs']
        if v > 0:
            count_dict['once'][k] = count_dict['once'].get(k, 0) + 1
            if post['total_ref_classes'] > 0:
                count_dict['class_proportion'][k] = count_dict['once'].get(k, 0) / post['total_ref_classes']
        else:
            count_dict['once'][k] = count_dict['once'].get(k, 0)

# save
with open('counts.json', 'w') as f:
    json.dump(count_dict, f)
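As a quick sanity check on the counting regex, here's how it behaves on a made-up sentence (the sentence and counts are just an illustration):

import re

text = "We used Python and C++; Python's pandas library helped a lot."
for k in ['Python', 'C[+][+]', 'R']:
    print(k, len(re.findall(r'\s(?:' + k + r')\W', text, flags=re.I)))
# Python -> 2, C[+][+] -> 1, R -> 0.
# Note that the leading \s means a mention at the very start
# of the text won't be counted.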
Now I have all the data I need to investigate which languages are popular among the winners of Kaggle competitions. Next, I gather data on the tools used by Kaggle's top 1000 competitors.
In [ ]:
"""
profile_finder.py
Finds the profile links for all kaggle competitors!
Script 1 of 2 for finding top languages/tools used by kaggle competitors
"""
import requests
import time
from bs4 import BeautifulSoup
import json

root_url = "http://www.kaggle.com/users?page="
pages = range(1, 3656 + 1)
users = []
for page in pages:
    print(page)
    r = requests.get(root_url + str(page))
    soup = BeautifulSoup(r.text)
    users_content = soup.find("ul", class_="users-list")
    user_list = users_content.find_all("a", class_="profilelink")
    users += [u.get('href') for u in user_list]
    time.sleep(0.1)

# save!
with open('user_list.json', 'w') as f:
    json.dump(users, f)
In [ ]:
"""
tool_extractor.py
Goes through a list of user ids (sorted by top rank first)
and extracts the tools that each user has listed on their site.
I'll also save some of the text for possible later use.
Script 2 of 2 for finding top languages/tools used by kaggle competitors
"""
import requests
from bs4 import BeautifulSoup
import json
import time
import re

# load the sorted list of user ids saved by profile_finder.py
with open('user_list.json', 'r') as f:
    user_list = json.load(f)

# check
print(user_list[0])

root_url = "http://www.kaggle.com"
summary_url = root_url + "/knockout/profiles/"
users = []
for i, user in enumerate(user_list[:1000]):  # only get the top 1000 users
    print(i)  # this is going to take a while and I want to know the progress
    uid = re.search(r'(?<=/)\d+(?=/)', user).group(0)
    r = requests.get(summary_url + uid + "/summary")
    temp_user = json.loads(r.text)
    time.sleep(.05)
    # other data is available from the regular url
    r = requests.get(summary_url + uid)
    temp_add = json.loads(r.text)
    temp_user.update(temp_add)
    users.append(temp_user)
    time.sleep(0.05)

with open('user_data.json', 'w') as f:
    json.dump(users, f)
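The uid extraction above relies on a lookbehind/lookahead regex; here's how it behaves on a hypothetical profile href (the path format is an assumption based on the script above):

import re

user = "/users/12345/some-user"  # hypothetical profile href
# grab the digits that sit between two slashes
print(re.search(r'(?<=/)\d+(?=/)', user).group(0))  # -> 12345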
In [2]:
%pylab inline
rcParams['figure.figsize'] = 8, 6
In [3]:
# load up the saved data
import json
import pandas as pd

with open('data/counts.json', 'r') as f:
    all_counts = json.load(f)
print(all_counts)
In [4]:
# Now let's load this into a dataframe
data = pd.read_json('data/counts.json')
# undo some of my regex strings so the labels are readable
better_col_dict = {
    r'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 'all_python',
    r'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 'all_sklearn',
    r'C[+][+]': 'C++'
}
indices = data.index.values
better_indices = [better_col_dict[ix] if ix in better_col_dict else ix for ix in indices]
data.index = better_indices
# divide by the number of winner interviews (90) to get a fraction
data['once_percent'] = data['once'].apply(lambda x: x / 90.)
print(data)
In [5]:
data.sort('total', ascending=False)['total'].plot(kind='bar')
title('Total mentions in winner interviews')
show()
Python rises to the top in terms of total mentions. Now let's look at the percent of winner interviews that mention a language at least once.
In [7]:
# Set figure size
figure(figsize=(8, 6))
# Plot it!
data.sort('once', ascending=False)['once_percent'].plot(kind='bar')
title('Percentage of winner interviews which mention language')

# Do a bunch of matplotlib formatting to get the y axis in percent.
# Code stolen from http://matplotlib.org/examples/pylab_examples/histogram_percent_demo.html
from matplotlib.ticker import FuncFormatter

def to_percent(y, position):
    # Ignore the passed in position. This has the effect of scaling the
    # default tick locations.
    s = str(100 * y)[:2]
    # The percent symbol needs escaping in latex
    if rcParams['text.usetex']:
        return s + r'$\%$'
    else:
        return s + '%'

# Create the formatter using the function to_percent. This multiplies all the
# default labels by 100, making them all percentages
formatter = FuncFormatter(to_percent)
# Set the formatter
gca().yaxis.set_major_formatter(formatter)
show()
In [8]:
# load my raw data!
with open('data/links3.json', 'r') as f:
    raw_data = json.load(f)
print(raw_data[0]['date'], raw_data[0]['occurences'])
In [9]:
# Convert it into pandas dataframes
time_series = {}
time_series_once = {}
# ignore tutorials
for post in raw_data:
    if post['winner']:
        time_series[post['date']] = post['occurences']
        time_series_once[post['date']] = {k: int(v > 0) for k, v in post['occurences'].items()}

# Convert the data into dataframes
ts = pd.DataFrame.from_dict(time_series, orient='index')
tso = pd.DataFrame.from_dict(time_series_once, orient='index')

# Fix my ugly regexes
better_cols = [better_col_dict[ix] if ix in better_col_dict else ix for ix in ts.columns.values]
better_cols_o = [better_col_dict[ix] if ix in better_col_dict else ix for ix in tso.columns.values]
ts.columns = better_cols
tso.columns = better_cols_o

# Make the indices into datetimes, not strings
ts.index = pd.to_datetime(ts.index.values)
tso.index = pd.to_datetime(tso.index.values)
print(ts[['R', 'Python']].tail())
print(tso[['R', 'Python']].tail())
In [11]:
# Now let's plot it!
top_five = ['R', 'all_python', 'C', 'Matlab', 'Java']
tso.cumsum()[top_five].plot()
gca().yaxis.set_major_formatter(formatter)
legend(loc='best')
show()
In [17]:
# Not bad, but this is hard to visualize.
# Let's do a stacked area chart, or what matplotlib calls a stackplot
# http://matplotlib.org/examples/pylab_examples/stackplot_demo.html
from matplotlib.dates import date2num
cts = tso.cumsum()[top_five].astype(float)
dates = [date2num(date) for date in cts.index]
fig = figure()
p = fig.add_subplot(111)
p.stackplot(dates, cts['Java'], cts['Matlab'], cts['C'], cts['all_python'], cts['R'])
p.set_xticks(dates[::25])
p.set_xticklabels(pd.Series(cts.index.values[::25]).apply(lambda x: x.strftime('%Y-%m-%d')))
p.set_title('Language mentions by kaggle winners over time')
show()
It looks like the R and Python duopoly has been pretty consistent since the beginning of 2012, with Python gaining ground on R's and C's early popularity.
In [18]:
# Open the data
with open('data/user_data.json', 'r') as f:
    users = json.load(f)
print(json.dumps(users[0], indent=2)) # use json.dumps to pretty print!
So my strategy will be to loop through all the users and look at the most popular tools, skills, and techniques.
In [19]:
techniques_list = []
skills_list = []
tools_list = []
for u in users:
    # need to handle users with no data listed
    if 'techniques' in u: techniques_list += u['techniques']
    if 'skills' in u: skills_list += u['skills']
    if 'tools' in u: tools_list += u['tools']
print(techniques_list[:10])
print(skills_list[:10])
print(tools_list[:10])
In [20]:
# Let's standardize things. Capitalization is something which tends to vary a lot.
import re

standard_map = {
    r'(?:python)': 'Python',
    r'(?:sklearn)|(?:scikit)|(?:sci[-]kit)': 'sklearn',
    r'(?:C[+][+])|C|(?:C[#])': 'C family',
    r'R|(?:RStudio)': 'R'
}

def standardize(string):
    # wrap each pattern in a non-capturing group so the word boundaries
    # apply to the whole alternation, not just its first and last branches
    for search_string, standard in standard_map.items():
        if re.search(r'\b(?:' + search_string + r')\b', string, re.I):
            return standard
    return string.capitalize()

techniques_list = [standardize(t) for t in techniques_list]
tools_list = [standardize(t) for t in tools_list]
skills_list = [standardize(t) for t in skills_list]
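To make the mapping concrete, here's what standardize does to a few made-up profile entries (assuming the standard_map above):

# a few made-up examples; expected outputs noted in the comment below
for raw in ['python', 'scikit-learn', 'C++', 'RStudio', 'Random Forests']:
    print(raw, '->', standardize(raw))
# python -> Python, scikit-learn -> sklearn, C++ -> C family,
# RStudio -> R, Random Forests -> Random forests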
In [21]:
# Use Counter to add it all up
from collections import Counter
techniques = Counter(techniques_list)
skills = Counter(skills_list)
tools = Counter(tools_list)
print(tools)
In [22]:
def viz_top(counter, n, title=None):
    top_tools = counter.most_common(n)
    x = [tool for tool, count in top_tools]
    y = [count for tool, count in top_tools]
    plt.bar(np.arange(len(x)), y)
    plt.xticks(np.arange(len(x)) + 0.4, x)
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=90)
    if title is not None:
        plt.title(title)
    plt.show()

viz_top(tools, 20, title="Top 20 Kaggler Tools")
viz_top(skills, 20, title="Top 20 Kaggler Skills")
viz_top(techniques, 20, title="Top 20 Kaggler Techniques")
What's the difference between tools, skills, and techniques? I'm not completely sure, and it looks like Kagglers aren't either; but we can see that R and Python once again dominate as the most popular languages!