Kaggle winners use R and Python

=>

Here at the Data Analytics club at MIT Sloan, we're big fans of the PyData stack. The MBA program brings in a lot of students with diverse backgrounds, including students with software development backgrouns or no coding experience at all. For students with software development experience we've often found that it's often more convenient adopt a full fledged programming language like Python and learn its growing data analysis capabilities. For students who are completely new to coding, Python is a great choice for a first language that also happens to have kick-ass data analysis capabilities.

So Python's low rank in Kaggle's 2011 chart of commonly used tools hurt our pride a bit and we decided to dig in. In this notebook we'll look at:

  • What tools do Kaggle winners use?
  • What is an updated view of Kaggle's 2011 tools chart?

We're going to do all the work necessary to answer these questions in this notebook. That includes:

  • Part 1: Scraping and data wrangling
  • Part 2: Charting and looking at trends
  • Part 3: Predicting python for the win

Part 1: Scraping and data wrangling

Here we're going to make use of Python's great web scraping capabilities to go out and get the raw data from Kaggle's website. If you don't care about scraping and want to go straight to the number crunching, then feel free to skip this section.

When I scape data, I tend to do so in regular python scripts rather than iPython notebooks because I don't really need iPython's rich interactive features and I want to make sure that I save the data to my machine in case iPython crashes. However, for the purposes of this blog post, I'm going to show the scripts below as if I ran and wrote them in iPython.

My data processing pipeline for extracting languages used by the winners consists of three scripts:

  1. file_saver.py identifies all the links to winners interviews from Kaggle's blog
  2. blog_miner.py reads in all the links found by file_save.py and saves the text from each blog post
  3. keyword_counter.py reads in the json from blog_miner.py and keeps track of popular tools mentioned in each post

The data processing pipeline for looking at languages used by the top 1000 Kaggle users consists of the following steps:

  1. profile_finder.py finds the profile links for all kaggle competitors!
  2. tool_extractor.py Goes through a list of user ids (sorted by top rank first) and extracts the tools that each user has listed on their site.

Warning: some scripts take a long time to run and have to request a lot of data from Kaggle's servers... please don't run them unnecessarily!


In [ ]:
"""
file_saver.py
This script downloads pages from http://blog.kaggle.com/category/dojo/ and saves the html.
Script 1 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import requests
from bs4 import BeautifulSoup
import json
import time

root_url = "http://blog.kaggle.com/category/dojo/page"
pages = range(1, 10)  # Currently there are 9 pages

elems = []
for page in pages:
    r = requests.get(root_url + "/" + str(page) + "/")
    soup = BeautifulSoup(r.text)
    elems += soup.find_all("h2", class_="entry-title")

# Convert list of dicts
link_dict = []
for i, elem in enumerate(elems):
    link = elem.find("a")
    temp_dict = {
                 'id': i,
                 'url': link.get('href'), 
                 'title': link.get('title'),
                 'link_text': link.get_text()
                }
    link_dict.append(temp_dict)

# Save to json object!
with open('links.json', 'w') as f:
    json.dump(link_dict, f)

In [ ]:
"""
blog_miner.py
Reads a list of blogs and mines it for mentions of popular tools
Script 2 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import time
import json
from bs4 import BeautifulSoup
import requests
import re

# load up a list of blog posts
with open('links.json', 'r') as f:
    links = json.load(f)

# loop through links and mine the text
new_links = []
for i, link in enumerate(links):
    print(i, link['url'])

    # Get the date information
    date_search = re.search(r'\d{4}/\d{2}/\d{2}', link['url'])
    if date_search:
        link['date'] = date_search.group(0).replace('/', '-')
    else:
        link['date'] = None

    # Get the text of the blog post
    r = requests.get(link['url'])
    soup = BeautifulSoup(r.text)
    post = soup.find('div', class_='entry-content')
    link['text'] = post.text
    new_links.append(link)
    time.sleep(.1)


with open('links2.json', 'w') as f:
    json.dump(new_links, f)

In [ ]:
"""
keyword_counter.py
Counts the number of times tools appear in kaggle winner interviews.
Script 3 of 3 in the pipeline for getting languages used by Kaggle winners.
"""
import re
import json

tools = ['R', 'Matlab', 'SAS', 'Weka', 'SPSS',
         'Excel', 'C[+][+]', 'Mathematica', 'Stata', 'Java']
python_like = ['Python', '(?:sklearn)|(?:sci[-]kit)|(?:scikit)',
               'pandas', 'scipy', 'numpy']
other_tools = ['SAS', 'C',
               '(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)']
all_tools = tools + python_like + other_tools

# Make a list of all the tutorials ids, which we shouldn't count!
tutorial_ids = [35, 32, 29, 21, 20, 19, 11, 10, ]
print(len(tutorial_ids))

# open the data
with open('links2.json', 'r') as f:
    links = json.load(f)

new_links = []
# loop through all entries and extract keywords
for post in links:
    post['occurences'] = {k: len(re.findall(r'\s' + k + r'\W', post['text'],
                                 flags=re.I)) for k in all_tools}
    # let's add a total column for convenience
    total_refs = sum([v for k, v in post['occurences'].items()
                     if k not in python_like])
    total_ref_classes = len([v for k, v in post['occurences'].items()
                            if v != 0 and k not in python_like])
    post['total_refs'] = total_refs
    post['total_ref_classes'] = total_ref_classes
    if post['id'] in tutorial_ids:
        post['winner'] = False
    else:
        post['winner'] = True
    new_links.append(post)

# save
with open('links3.json', 'w') as f:
    json.dump(new_links, f)

# now loop through all posts, occurences, and make a massive dict
count_dict = {'total': {}, 'once': {},
              'proportion': {}, 'class_proportion': {}}
for post in new_links:
    if post['id'] in tutorial_ids:
        continue  # we only care about #WINNERS!
    else:
        for k, v in post['occurences'].items():
            count_dict['total'][k] = count_dict['total'].get(k, 0) + v
            if post['total_refs'] > 0:
                count_dict['proportion'][k] = count_dict['total'].get(k, 0) / post['total_refs']
            if v > 0:
                count_dict['once'][k] = count_dict['once'].get(k, 0) + 1
                if post['total_ref_classes'] > 0:
                    count_dict['class_proportion'][k] = count_dict['once'].get(k, 0) / post['total_ref_classes']
            else:
                count_dict['once'][k] = count_dict['once'].get(k, 0)

# save
with open('counts.json', 'w') as f:
    json.dump(count_dict, f)

Now I have all the data I need to investigate what are the popular languages used by winners of Kaggle competitions. Next, I look at what are the popular tools used by Kaggle's top 1000 competitors.


In [ ]:
"""
profile_finder.py
Finds the profile links for all kaggle competitors!
Script 1 of 2 for finding top languages/tools used by kaggle competitors
"""
import requests
import time
from bs4 import BeautifulSoup
import json

root_url = """http://www.kaggle.com/users?page="""
pages = range(1, 3656+1)

users = []
for page in pages:
    print(page)
    r = requests.get(root_url + str(page))
    soup = BeautifulSoup(r.text)
    users_content = soup.find("ul", class_="users-list")
    user_list = users_content.find_all("a", class_="profilelink")
    users += [u.get('href') for u in user_list]
    time.sleep(0.1)

# save!
with open('user_list.json', 'w') as f:
    json.dump(users, f)

In [ ]:
"""
tool_extractor.py
Goes through a list of user ids (sorted by top rank first)
and extracts the tools that each user has listed on their site.
I'll also save some of the text for possible later use.
Script 2 of 2 for finding top languages/tools used by kaggle competitors
"""
import requests
from bs4 import BeautifulSoup
import json
import time
import re

# load sorted list of user ids
with open('users.json', 'r') as f:
    user_list = json.load(f)

# check
print(user_list[0])

root_url = """http://www.kaggle.com"""
summary_url = root_url + """/knockout/profiles/"""

users = []
for i, user in enumerate(user_list[:1000]):  # only get top 1000 users
    print(i)  # this is going to take a while and I want to know the progress
    
    uid = re.search(r'(?<=/)\d+(?=/)', user).group(0)
    r = requests.get(summary_url + uid + "/summary")
    temp_user = json.loads(r.text)
    time.sleep(.05)
    
    # other data is available from the regular url
    r = requests.get(summary_url + uid)
    temp_add = json.loads(r.text)
    temp_user.update(temp_add)
    users.append(temp_user)
    time.sleep(0.05)

with open('user_data.json', 'w') as f:
    json.dump(users, f)

Now that we've got all of our data conveniently located in the data/ folder let's actually load it up in a convenient Pandas dataframe and start looking at it.


In [2]:
%pylab inline
rcParams['figure.figsize'] = 8, 6


Populating the interactive namespace from numpy and matplotlib

In [3]:
# load up the saved data
import json
import pandas as pd

with open('data/counts.json', 'r') as f:
    all_counts = json.load(f)
    
print(all_counts)


{u'total': {u'C': 40, u'Java': 11, u'Stata': 1, u'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 18, u'Python': 42, u'Mathematica': 1, u'Excel': 10, u'SPSS': 1, u'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 65, u'scipy': 0, u'Matlab': 27, u'C[+][+]': 10, u'SAS': 8, u'Weka': 6, u'R': 96, u'numpy': 2, u'pandas': 4}, u'proportion': {u'C': 5.714285714285714, u'Java': 1.5714285714285714, u'Stata': 0.14285714285714285, u'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 2.5714285714285716, u'Python': 6.0, u'Mathematica': 0.14285714285714285, u'Excel': 1.4285714285714286, u'SPSS': 0.14285714285714285, u'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 9.285714285714286, u'scipy': 0.0, u'Matlab': 3.857142857142857, u'C[+][+]': 1.4285714285714286, u'SAS': 1.1428571428571428, u'Weka': 0.8571428571428571, u'R': 13.714285714285714, u'numpy': 0.2857142857142857, u'pandas': 0.5714285714285714}, u'class_proportion': {u'C': 8.0, u'Java': 3.5, u'Stata': 1.0, u'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 10.0, u'Python': 9.333333333333334, u'Mathematica': 0.5, u'Excel': 3.5, u'SPSS': 0.25, u'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 10.0, u'R': 13.666666666666666, u'Matlab': 20.0, u'C[+][+]': 5.0, u'SAS': 1.0, u'Weka': 2.0, u'numpy': 0.4, u'pandas': 2.0}, u'once': {u'C': 24, u'Java': 7, u'Stata': 1, u'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 10, u'Python': 28, u'Mathematica': 1, u'Excel': 7, u'SPSS': 1, u'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 30, u'scipy': 0, u'Matlab': 20, u'C[+][+]': 10, u'SAS': 5, u'Weka': 4, u'R': 41, u'numpy': 2, u'pandas': 4}}

In [4]:
# Now let's load this into a dataframe
data = pd.read_json('data/counts.json')

# undo some of my regex strings
better_col_dict = {
                   r'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 'all_python',
                   r'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 'all_sklearn',
                   r'C[+][+]': 'C++'
                   }

indices = data.index.values
better_indices = [better_col_dict[ix] if ix in better_col_dict else ix for ix in indices]
data.index = better_indices
data['once_percent'] = data['once'].apply(lambda x:x/90.)

print(data)


             class_proportion  once  proportion  total  once_percent
all_python          10.000000    30    9.285714     65      0.333333
all_sklearn         10.000000    10    2.571429     18      0.111111
C                    8.000000    24    5.714286     40      0.266667
C++                  5.000000    10    1.428571     10      0.111111
Excel                3.500000     7    1.428571     10      0.077778
Java                 3.500000     7    1.571429     11      0.077778
Mathematica          0.500000     1    0.142857      1      0.011111
Matlab              20.000000    20    3.857143     27      0.222222
Python               9.333333    28    6.000000     42      0.311111
R                   13.666667    41   13.714286     96      0.455556
SAS                  1.000000     5    1.142857      8      0.055556
SPSS                 0.250000     1    0.142857      1      0.011111
Stata                1.000000     1    0.142857      1      0.011111
Weka                 2.000000     4    0.857143      6      0.044444
numpy                0.400000     2    0.285714      2      0.022222
pandas               2.000000     4    0.571429      4      0.044444
scipy                     NaN     0    0.000000      0      0.000000

In [5]:
data.sort('total', ascending=False)['total'].plot(kind='bar')
title('Total mentions in winner interviews')
show()


Python rises to the top in terms of total mentions. Now let's look at the percent of winner interviews that mention a language at least once.


In [7]:
# Set figure size
figure(figsize=(8, 6))

# Plot it!
data.sort('once', ascending=False)['once_percent'].plot(kind='bar')
title('Percentage of winner interviews which mention language')

# Do a bunch of matplotlib formatting to get y axis in percent
# Code stolen from http://matplotlib.org/examples/pylab_examples/histogram_percent_demo.html
from matplotlib.ticker import FuncFormatter

def to_percent(y, position):
    # Ignore the passed in position. This has the effect of scaling the default
    # tick locations.
    s = str(100 * y)[:2]

    # The percent symbol needs escaping in latex
    if rcParams['text.usetex'] == True:
        return s + r'$\%$'
    else:
        return s + '%'

# Create the formatter using the function to_percent. This multiplies all the
# default labels by 100, making them all percentages
formatter = FuncFormatter(to_percent)

# Set the formatter
gca().yaxis.set_major_formatter(formatter)

show()


Is Python trending?

When I was looking at some of the data, I thought I saw a trend that most of the recent users were using Python whereas way back in 2011 R seemed a lot more common. Let's visualize the data as a time series to see if this is true.


In [8]:
# load my raw data!
with open('data/links3.json', 'r') as f:
    raw_data = json.load(f)

print(raw_data[0]['date'], raw_data[0]['occurences'])


(u'2014-01-07', {u'C': 0, u'Java': 0, u'Stata': 0, u'(?:sklearn)|(?:sci[-]kit)|(?:scikit)': 1, u'Python': 2, u'Mathematica': 0, u'Excel': 0, u'SPSS': 0, u'(?:python)|(?:sklearn)|(?:scikit)|(?:sci[-]kit)|(?:pandas)': 4, u'scipy': 0, u'Matlab': 0, u'C[+][+]': 0, u'SAS': 0, u'Weka': 0, u'R': 0, u'numpy': 1, u'pandas': 1})

In [9]:
# Convert it into a pandas dataframe
time_series = {}
time_series_once = {}

# ignore tutorials
for post in raw_data:
    if post['winner']:
        time_series[post['date']] = post['occurences']
        time_series_once[post['date']] = {k:int(v > 0) for k, v in post['occurences'].items()}

# Convert the data into a dataframe    
ts = pd.DataFrame.from_dict(time_series, orient='index')
tso = pd.DataFrame.from_dict(time_series_once, orient='index')

# Fix my ugly regexes
better_cols = [better_col_dict[ix] if ix in better_col_dict else ix for ix in ts.columns.values]
better_cols_o = [better_col_dict[ix] if ix in better_col_dict else ix for ix in tso.columns.values]
ts.columns = better_cols
tso.columns = better_cols_o

# Make the indices into datetimes, not strings
ts.index = pd.to_datetime(ts.index.values)
tso.index = pd.to_datetime(tso.index.values)

print(ts[['R', 'Python']].tail())
print(tso[['R', 'Python']].tail())


            R  Python
2013-04-29  0       1
2013-05-06  0       2
2013-08-29  1       1
2013-12-23  1       0
2014-01-07  1       0
            R  Python
2013-04-29  0       1
2013-05-06  0       1
2013-08-29  1       1
2013-12-23  1       0
2014-01-07  1       0

In [11]:
# Now let's plot it!

top_five = ['R','all_python', 'C', 'Matlab', 'Java']

tso.cumsum()[top_five].plot()
gca().yaxis.set_major_formatter(formatter)
legend(loc='best')
show()



In [17]:
# Not bad, but this is hard to visualize.  
# Let's do a stacked area chart, or what matplotlib calls a stackplot
# http://matplotlib.org/examples/pylab_examples/stackplot_demo.html
from matplotlib.dates import date2num

cts = tso.cumsum()[top_five].astype(float)
dates = [date2num(date) for date in cts.index]

fig = figure()
p = fig.add_subplot(111)

p.stackplot(dates, cts['Java'], cts['Matlab'], cts['C'], cts['all_python'], cts['R'])
p.set_xticks(dates[::25])
p.set_xticklabels(pd.Series(cts.index.values[::25]).apply(lambda x: x.strftime('%Y-%m-%d')))
p.set_title('Language mentions by kaggle winners over time')
show()


It looks like the R and Python duopoly has been pretty consistent since the beginning of 2012 and that Python has been gaining on C and R's early popularity.

Refreshing the 2011 analysis

Okay, so Kaggle did their analysis of popular tools in 2011. Let's say we like their approach at looking at a lot of Kagglers and not just winners. Have things changed much?


In [18]:
# Open the data
with open('data/user_data.json', 'r') as f:
    users = json.load(f)
    
print(json.dumps(users[0], indent=2))  # use json.dumps to pretty print!


{
  "ranking": 1, 
  "twitter": null, 
  "linkedinUrl": "http://www.linkedin.com/pub/owen-zhang/51/aa0/363/", 
  "tools": [
    "RStudio"
  ], 
  "highestRanking": 1, 
  "id": 7756, 
  "techniques": [
    "Ensemble", 
    "luck"
  ], 
  "city": null, 
  "rankingText": "1st", 
  "tagline": "To ponder the mystery is, in itself, a gift.", 
  "canSendUserMessages": true, 
  "bio": null, 
  "registered": 1301940814867, 
  "tier": 10, 
  "slug": "owen", 
  "pointsText": "826,937.1", 
  "github": null, 
  "name": "Owen", 
  "skills": [
    "Ensemble", 
    "luck", 
    "RStudio"
  ], 
  "country": "United States", 
  "region": null, 
  "websiteUrl": null, 
  "points": 826937.125, 
  "gravatar": "e326b1164a340bf96a7def7af8c980d1", 
  "highestRankingText": "1st"
}

So my strategy will be to loop through all the users and look at the most popular tools, skills, and techniques.


In [19]:
techniques_list = []
skills_list = []
tools_list = []
for u in users:
    if 'techniques' in u: techniques_list += u['techniques']  # need to handle cases of no data
    if 'skills' in u: skills_list += u['skills']
    if 'tools' in u: tools_list += u['tools']
                   
print(techniques_list[:10])
print(skills_list[:10])
print(tools_list[:10])


[u'Ensemble', u'luck', u'Bagging', u'Boosting', u'Classification', u'clustering', u'ensembles', u'gam', u'GBM', u'Predictive Modeling']
[u'Ensemble', u'luck', u'RStudio', u'Bagging', u'Boosting', u'Classification', u'clustering', u'ensembles', u'gam', u'GBM']
[u'RStudio', u'pascal', u'R', u'.net', u'C#', u'C++', u'CUDA', u'Java', u'Javascript', u'Linux']

In [20]:
# Let's standardize things.  Capitalization is something which tends to vary a lot
import re

standard_map = {
                r'(?:python)': 'Python',
                r'(?:sklearn)|(?:scikit)|(?:sci[-]kit)': 'sklearn',
                r'(?:C[+][+])|C|(?:C[#])': 'C family',
                r'R|(?:RStudio)': 'R'
                }

def standardize(string):
    for search_string, standard in standard_map.items():
        if re.search(r'\b' + search_string + r'\b', string, re.I):
            return standard
    return string.capitalize()

techniques_list = [standardize(t) for t in techniques_list]
tools_list = [standardize(t) for t in tools_list]
skills_list = [standardize(t) for t in skills_list]

In [21]:
# Use counter to add it all up
from collections import Counter

techniques = Counter(techniques_list)
skills = Counter(skills_list)
tools = Counter(tools_list)

print(tools)


Counter({'R': 286, 'C family': 276, 'Python': 253, u'Matlab': 90, u'Java': 52, u'Sql': 49, u'Sas': 27, u'Pandas': 26, u'Hadoop': 17, u'Numpy': 17, 'sklearn': 15, u'Weka': 14, u'Perl': 13, u'Mysql': 13, u'Linux': 12, u'Sql server': 8, u'Postgresql': 8, u'Haskell': 8, u'Julia': 7, u'Nltk': 6, u'Vowpal wabbit': 6, u'Matplotlib': 5, u'Liblinear': 5, u'Mongodb': 5, u'Theano': 5, u'Ggplot2': 5, u'Visual studio': 4, u'Jmp': 4, u'Spss': 4, u'Latex': 4, u'Git': 3, u'Ibm spss modeler': 3, u'Bash': 3, u'.net': 3, u'Tableau': 3, u'Pylearn2': 3, u'Vb.net': 3, u'Php': 3, u'Vba': 3, u'Hive': 3, u'Sqlite': 3, u'Mpi': 2, u'Pig': 2, u'Stata': 2, u'Pybrain': 2, u'Awk': 2, u'Vim': 2, u'Libsvm': 2, u'Libfm': 2, u'Unix': 2, u'F#': 2, u'D3.js': 2, u'Eureqa': 1, u'Delphi': 1, u'Fortran': 1, u'Html': 1, u'Math': 1, u'Shell': 1, u'Mapinfo': 1, u'My own': 1, u'Mymedialite': 1, u'Tiberius (i wrote it so no bias there!)': 1, u'Pen and paper': 1, u'Blending': 1, u'Java mostly': 1, u'Mallet': 1, u'Pen': 1, u'Orange': 1, u'Unix tools': 1, u'Boost': 1, u'Text mining': 1, u'Myrrix': 1, u'Gams': 1, u'Vowpall wabbit': 1, u'Gbdt': 1, u'Pandas\xa0': 1, u'Model builder': 1, u'Jquery': 1, u'Knime': 1, u'Interweb': 1, u'Libfm svdfeature weka': 1, u'Go': 1, u'Sas enterprise miner': 1, u'Gretl': 1, u'Imagej': 1, u'Uniq': 1, u'Pylab': 1, u'Paper': 1, u'Neuroph': 1, u' t-sql': 1, u'Bliasoft kd': 1, u'Elf': 1, u'Mahout': 1, u'Paint': 1, u'(and human brain)': 1, u'Teradata': 1, u'As above': 1, u'\u0421++': 1, u'Vw ': 1, u'Stl': 1, u'Gbm': 1, u'Some stuff of my own': 1, u'Labview': 1, u'Neural networks': 1, u'Spss modeler': 1, u'Grapher': 1, u'Link analysis': 1, u'Opengl': 1, u'Asp.net': 1, u'Neural nets': 1, u'Erlang': 1, u'Sort': 1, u'Treeensemble': 1, u'Greenplum': 1, u'Groovy': 1, u'Lisp': 1, u'Netezza': 1, u'Matlab\xa0': 1, u'Photoshop': 1, u'Assembler': 1, u'Eviews': 1, u'Maple': 1})

In [22]:
def viz_top(counter, n, title=None):
    top_tools = counter.most_common(n)
    x = [tool for tool, count in top_tools]
    y = [count for tool, count in top_tools]
    
    plt.bar(np.arange(len(x)), y)
    plt.xticks(np.arange(len(x)) + 0.4, x)
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=90)
    if title is not None:
        plt.title(title)
    plt.show()
    
viz_top(tools, 20, title="Top 20 Kaggler Tools")
viz_top(skills, 20, title="Top 20 Kaggler Skills")
viz_top(techniques, 20, title="Top 20 Kaggler Techniques")


What's the difference between tools, skills, and techniques? I'm not completely sure and it looks like Kagglers aren't either; but we can see that R and Python again dominate the most popular languages!