In [1]:
import requests as rq
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bs4
import os

from tqdm import tqdm_notebook

from datetime import time
%matplotlib inline

In [2]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')


Table of Contents

Query Data

Grab schedule page:


In [3]:
base_url = "https://pydata.org"
r = rq.get(base_url + "/berlin2017/schedule/")
bs = bs4.BeautifulSoup(r.text, "html.parser")

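If the request ever fails (the page may move), it is better to fail loudly before parsing; a minimal guard, not part of the original run:


In [ ]:
r.raise_for_status()  # raise immediately on a non-2xx response
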
Let's query every talk description:


In [4]:
data = {}
for ahref in tqdm_notebook(bs.find_all("a")):
    url = ahref.get("href") or ""  # some anchors have no href attribute
    if 'schedule/presentation' not in url:
        continue
    resp = bs4.BeautifulSoup(rq.get(base_url + url).text, "html.parser")
    title = resp.find("h2").text
    resp = resp.find_all(attrs={'class': "container"})[1]

    when, who = resp.find_all("h4")
    date_info = when.string.split("\n")[1:]
    day_info = date_info[0].strip()
    time_inf = date_info[1].strip()
    room_inf = date_info[3].strip()[3:]  # drop the leading "in "
    speaker = who.find("a").text
    level = resp.find("dd").text
    abstract = resp.find(attrs={'class': 'abstract'}).text
    description = resp.find(attrs={'class': 'description'}).text
    data[url] = {
        'day_info': day_info,
        'title': title,
        'time_inf': time_inf,
        'room_inf': room_inf,
        'speaker': speaker,
        'level': level,
        'abstract': abstract,
        'description': description
    }

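As a quick sanity check (a hypothetical cell), the number of scraped presentation pages should match the 48 observations in the profile report below:


In [ ]:
len(data)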


Okay, make a dataframe and add some helpful columns:


In [5]:
df = pd.DataFrame.from_dict(data, orient='index')
df.reset_index(drop=True, inplace=True)

Show a profile report of the pandas DataFrame:


In [6]:
import pandas_profiling as pp
pfr = pp.ProfileReport(df)



In [7]:
from IPython.display import display, HTML
# demote the report's headings (h3 -> h4 first, then h2 -> h3, then h1 -> h2)
# so they do not clash with this notebook's own section headings
display(HTML(
    pfr.html.replace("<h3", "<h4").replace("<h2", "<h3").replace("<h1", "<h2")
))


Overview

Dataset info

Number of variables: 8
Number of observations: 48
Total missing (%): 0.0%
Total size in memory: 3.1 KiB
Average record size in memory: 65.5 B

Variable types

Numeric: 0
Categorical: 4
Date: 0
Text (Unique): 4
Rejected: 0

Variables

abstract
Categorical, Unique

First 3 values
Debugging is a daily activity of any programme...
A fast paced high-level overview of speed opti...
For every button on the webpage that is clicke...
Last 3 values
Representing words as vectors in a high-dimens...
Supervised models are trained on labelled data...
Test-driven data analysis fuses and builds upo...

day_info
Categorical

Distinct count: 3
Unique (%): 6.2%
Missing (%): 0.0%
Missing (n): 0

Value      Count  Frequency (%)
Sunday     20     41.7%
Saturday   19     39.6%
Friday     9      18.8%

description
Categorical, Unique

First 3 values
Shearlab is a Julia Library with toolbox for t...
Those folks in computer vision keep publishing...
Developed at MIT, with a focus on fast numeric...
Last 3 values
Parametric uncertainty is broadly difficult to...
The talk is going to present, with examples, h...
In todays world of online business, it is diff...


level
Categorical

Distinct count: 3
Unique (%): 6.2%
Missing (%): 0.0%
Missing (n): 0

Value         Count  Frequency (%)
Intermediate  33     68.8%
Novice        13     27.1%
Experienced   2      4.2%

room_inf
Categorical

Distinct count: 5
Unique (%): 10.4%
Missing (%): 0.0%
Missing (n): 0

Value         Count  Frequency (%)
A238          17     35.4%
A208          13     27.1%
D105 Audimax  13     27.1%
A130          4      8.3%
A239          1      2.1%

speaker
Categorical, Unique

First 3 values
Alexander Hendorf
Adrin Jalali
Aisha Bello
Last 3 values
Vincent D. Warmerdam
Emily Gorcenski
Stefan Otte


time_inf
Categorical

Distinct count: 17
Unique (%): 35.4%
Missing (%): 0.0%
Missing (n): 0

Value             Count  Frequency (%)
16:00–16:45       6      12.5%
15:15–16:00       5      10.4%
10:30–11:15       3      6.2%
11:15–12:00       3      6.2%
12:15–13:00       3      6.2%
17:00–17:45       3      6.2%
10:45–11:30       3      6.2%
12:00–12:45       3      6.2%
17:45–18:30       3      6.2%
14:30–15:15       3      6.2%
Other values (7)  13     27.1%

title
Categorical, Unique

First 3 values
Analysing user comments on news articels with ...
When the grassroots grow stronger - 2017 throu...
Introductory tutorial on data exploration and ...
Last 3 values
Data Science for Digital Humanities: Extractin...
Introduction to Machine Learning with H2O and ...
Machine Learning to moderate ads in real world...


Sample

day_info time_inf speaker room_inf title abstract level description
0 Friday 13:45–15:15 Alexandru Agachi A238 Introductory tutorial on data exploration and ... I would be happy to conduct an introductory le... Novice This tutorial will focus on analyzing a datase...
1 Friday 15:30–17:00 Adrin Jalali A130 The path between developing and serving machin... Whenever you have a machine learning module in... Experienced As a data scientist, one of the challenges aft...
2 Friday 9:00–10:30 David Higgins A130 Introduction to Julia for Scientific Computing... Julia is a new and exciting language, sponsore... Intermediate Developed at MIT, with a focus on fast numeric...
3 Friday 10:45–12:15 Gerrit Gruben A130 Leveling up your Jupyter notebook skills \nOverview of the Jupyter project + setup to g... Intermediate Most of us regularly work with Jupyter noteboo...
4 Friday 9:00–10:30 Alexander Hendorf A238 Introduction to Data-Analysis with Pandas Pandas is the Swiss-Multipurpose Knife for Dat... Novice Pandas is the Swiss-Multipurpose Knife for Dat...

Further Processing


In [8]:
# Tutorials on Friday
df.loc[df.day_info=='Friday', 'tutorial'] = True
df['tutorial'].fillna(False, inplace=True)

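A quick, hypothetical sanity cell (not part of the original run): the flag should mark exactly the nine Friday slots as tutorials:


In [ ]:
df.tutorial.value_counts()  # expect 39 False, 9 True
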
In [9]:
# time handling
df['time_from'], df['time_to'] = zip(*df.time_inf.str.split(u'\u2013'))  # u'\u2013' is the en dash used in the schedule
df.time_from = pd.to_datetime(df.time_from).dt.time
df.time_to = pd.to_datetime(df.time_to).dt.time
del df['time_inf']

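Note that the schedule separates start and end times with an en dash (U+2013), not an ASCII hyphen; a minimal illustration of the split above:


In [ ]:
u"16:00\u201316:45".split(u"\u2013")  # -> [u'16:00', u'16:45']
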
In [10]:
df.head(3)


Out[10]:
day_info speaker room_inf title abstract level description tutorial time_from time_to
0 Friday Alexandru Agachi A238 Introductory tutorial on data exploration and ... I would be happy to conduct an introductory le... Novice This tutorial will focus on analyzing a datase... True 13:45:00 15:15:00
1 Friday Adrin Jalali A130 The path between developing and serving machin... Whenever you have a machine learning module in... Experienced As a data scientist, one of the challenges aft... True 15:30:00 17:00:00
2 Friday David Higgins A130 Introduction to Julia for Scientific Computing... Julia is a new and exciting language, sponsore... Intermediate Developed at MIT, with a focus on fast numeric... True 09:00:00 10:30:00

In [11]:
# Example: all non-Novice talks on Sunday starting at 4 pm or later
tmp = df.query("(level != 'Novice') & (day_info == 'Sunday')")
tmp[tmp.time_from >= time(16)]


Out[11]:
day_info speaker room_inf title abstract level description tutorial time_from time_to
7 Sunday Hendrik Heuer D105 Audimax Data Science for Digital Humanities: Extractin... The focus of this talk is extracting meaning f... Intermediate Analyzing millions of images and enormous text... False 17:00:00 17:45:00
12 Sunday Alexey Grigorev A238 Large Scale Vandalism Detection in Knowledge B... Knowledge bases are an important source of inf... Intermediate Wikidata is a Knowledge Base where anybody can... False 17:00:00 17:45:00
14 Sunday Jonathan Ronen D105 Audimax Social Networks and Protest Participation: Evi... Pinning down the role of social ties in the de... Intermediate Data mining social networks for evidence of po... False 17:45:00 18:30:00
24 Sunday Daniele Rapati A208 Engage the Hyper-Python - a rattle-through man... A fast paced high-level overview of speed opti... Intermediate A fast paced high-level overview of speed opti... False 17:45:00 18:30:00
26 Sunday Vaibhav Singh A238 Machine Learning to moderate ads in real world... In an online classified's business, one may en... Intermediate In todays world of online business, it is diff... False 17:45:00 18:30:00
30 Sunday Oliver Eberle A208 Where are we looking? Prediciting human gaze u... Visual saliency models aim at describing human... Intermediate Which features in an image draw our focus to a... False 17:00:00 17:45:00
39 Sunday Lev Konstantinovskiy A208 Find the text similiarity you need with the ne... There are many ways to find similar words/docs... Intermediate What is the closest word to "king"? Is it "Can... False 16:00:00 16:45:00
45 Sunday Roelof Pieters D105 Audimax AI assisted creativity A new wave of creative applications of AI has ... Intermediate A new wave of creative applications of AI has ... False 16:00:00 16:45:00

Visualize Some Relations


In [12]:
ax = df.level.value_counts().plot.pie(figsize=(3,3), autopct="%1.1f %%")
ax.axis("equal")
ax.set_ylabel("")
ax.set_title("Levels of the talks")
plt.show()



In [13]:
ax = df.groupby("tutorial")['level'].value_counts().unstack(level=0).plot.pie(
    subplots=True, legend=False, autopct="%1.1f %%", startangle=90, labels=["","",""])
for axx in ax:
    axx.axis("equal")
    axx.set_ylabel("")
ax[0].set_xlabel("Not tutorial")
ax[1].set_xlabel("Tutorial")
plt.gcf().suptitle("Levels of the talks, split by tutorial or not", fontsize=16)
plt.tight_layout()
axx.legend(
    df.groupby("tutorial")['level'].value_counts().unstack(level=0).index.tolist(),
    loc='center left', bbox_to_anchor=(1, .8))
plt.show()



In [14]:
# treat a column as categorical if it has at most one distinct value per five rows
categorical_cols = [col for col in df.columns if len(df[col].unique()) <= len(df[col]) / 5]
f, ax = plt.subplots(2,2, figsize=(20,10), sharey="row")
for i, axx in enumerate(ax.flatten()):
    col = categorical_cols[i]
    df.groupby("tutorial")[col].value_counts().unstack(level=0).plot.bar(ax=axx, rot=0, stacked=True)
    axx.set_title(col)
    axx.set_facecolor("white")
    axx.grid(True, color="lightgrey")
f.suptitle("Number of talks grouped by day, level, is tutorial, room: ...", fontsize=20)
f.tight_layout()


Test Meadow

The following is just WIP crap - do not read this.


In [ ]:
df.title = df.title.str.replace(r"[.:,]", "")  # strip periods, colons and commas from the titles

In [ ]:
from itertools import chain

In [ ]:
foo = pd.DataFrame(
    pd.Series(list(chain(*df.title.str.split().apply(lambda x: np.unique(x)).tolist()))).value_counts(normalize=True)
).reset_index().rename(columns={0:'share', 'index': 'word'}).query("share >= 0.001")

In [ ]:
foo.head()

In [ ]:
foo['len'] = foo.word.str.len()

In [ ]:
from gensim import corpora, models, similarities

In [ ]:
documents = df.query("tutorial != True").title.tolist()

In [ ]:
stoplist = set('''
    for a of the and to in i be on with here we will an each its type as our their then apply them very would this
    make large talk, basic search is are there more than pages it can or that they how by have what from talk use you
    these using which but some not your do used at if like such has about - my one most those should between may good • why
    give way time been need many so does case when also all into lot build features new who often discuss building
    best text * was out
    '''.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
        for document in documents]

In [ ]:
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

In [ ]:
pd.DataFrame(dict(nr=dict(frequency))).sort_values('nr', ascending=False).query("nr>1")

In [ ]:
keywords = pd.DataFrame(dict(nr=dict(frequency))).sort_values('nr', ascending=False).query("nr>1").index.tolist()

In [ ]:
for keyword in keywords[::-1]:
    df.loc[df.title.str.lower().str.contains(keyword), 'color'] = keyword
df.color.fillna("UNKNOWN", inplace=True)

In [ ]:
df.head()

In [ ]:
pd.set_option("display.max_colwidth", 400)

In [ ]:
df['approach'] = np.where(df.tutorial, "tutorial", np.nan)
df.loc[df.title.str.lower().str.contains(r'\bai\b'), 'approach'] = 'artificial intelligence'  # word boundaries, so e.g. "chains" doesn't match
df.loc[df.title.str.lower().str.contains('artificial intelligence'), 'approach'] = 'artificial intelligence'
df.loc[df.title.str.lower().str.contains('pandas'), 'approach'] = 'pandas'
df.loc[df.title.str.lower().str.contains('jupyter'), 'approach'] = 'jupyter'
df.loc[df.title.str.lower().str.contains('data scien'), 'approach'] = 'data science'
df.loc[df.title.str.lower().str.contains('machine learn'), 'approach'] = 'machine learning'
df.loc[df.title.str.lower().str.contains('deep'), 'approach'] = 'deep learning'

In [ ]:
# attention: may change when querying again
df.loc[11, 'approach'] = 'R'
df.loc[13, 'approach'] = 'Julia'

In [ ]:
df.loc[df.title.apply(lambda x: any([y in x.lower() for y in ['question', 'text', 'natural language', 'nlp', 'doc2vec']])), 'type'] = 'text / NLP'
df.loc[df.title.apply(lambda x: any([y in x.lower() for y in ['creativ', 'image', 'signal process', ]])), 'type'] = 'image process'
df['type'] = np.where(df.tutorial, "tutorial", df.type)

In [ ]:
print('{')
for title in df.title:
    if df.loc[df.title == title, 'tutorial'].values[0]:
        continue
    print("\t'" + title + "': '',")
print('}')

In [ ]:
pd.DataFrame({'bla': rename_dict})  # rename_dict is defined a few cells below

In [ ]:
df['type'] = df.title.map(rename_dict)

In [ ]:
df.groupby(["type", "level", "approach"]).color.count().reset_index().head(1)

In [ ]:
gb = df.groupby(["type", "approach"]).color.count()#.reset_index()

In [ ]:
for col in ['type', 'level', 'approach']:
    df[col] = df[col].astype("category")

In [ ]:
cats = df.level.unique()

In [ ]:
df.level.astype("category")

In [ ]:
gb.unstack(level=1).plot.barh()
# sketch for a richer chart:
#   • x axis: mean level
#   • y axis:
#   • size: number of times
#   • color: type?!

In [ ]:
rename_dict = {
    'Introduction to Search': 'search',
    'Data Science for Digital Humanities: Extracting meaning from Images and Text': 'text / NLP / *2vec',
    'TNaaS - Tech Names as a Service': 'creativity',
    'Developments in Test-Driven Data Analysis': 'testing',
    'Analysing user comments on news articels with Doc2Vec and Machine Learning classification': 'text / NLP / *2vec',
    'Patsy: The Lingua Franca to and from R': 'R',
    'Large Scale Vandalism Detection in Knowledge Bases': 'network data',
    'Fast Multidimensional Signal Processing using Julia with Shearlabjl': 'images',
    'Social Networks and Protest Participation: Evidence from 130 Million Twitter Users': 'ethics & social',
    'Patterns for Collaboration between Data Scientists And Software Engineers': 'tools / frameworks',
    'Blockchains for Artificial Intelligence': 'blockchain',
    'Data Analytics and the new European Privacy Legislation': 'laws',
    'Building smart IoT applications with Python and Spark': 'iot',
    '“Which car fits my life?”  - mobilede’s approach to recommendations': 'recommendation',
    'Towards Pythonic Innovation in Recommender Systems': 'recommendation',
    'Gold standard data: lessons from the trenches': 'tools / frameworks',
    'Biases are bugs: algorithm fairness and machine learning ethics': 'ethics & social',
    'On Bandits, Bayes, and swipes: gamification of search': 'active learning',
    'Engage the Hyper-Python - a rattle-through many of the ways you can make a Python program faster': 'tools / frameworks',
    'Fairness and transparency in machine learning: Tools and techniques': 'ethics & social',
    "Machine Learning to moderate ads in real world classified's business": '',
    'Size Matters! A/B Testing When Not Knowing Your Number of Trials': '',
    'Is That a Duplicate Quora Question?': 'text / NLP / *2vec',
    'Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics': '',
    'Where are we looking? Prediciting human gaze using deep networks': '',
    'Spying on my Network for a Day: Data Analysis for Networks': 'network data',
    'Deep Learning for detection on a phone: how to stay sane and build a pipeline you can trust': '',
    'A word is worth a thousand pictures: Convolutional methods for text': 'text / NLP / *2vec',
    'Polynomial Chaos: A technique for modeling uncertainty': '',
    'Kickstarting projects with Cookiecutter': 'tools / frameworks',
    'What does it all mean? - Compositional distributional semantics for modelling natural language': 'text / NLP / *2vec',
    'When the grassroots grow stronger - 2017 through the eyes of German open data activists': 'ethics & social',
    'Finding Lane Lines for Self Driving Cars': 'images',
    'Find the text similiarity you need with the next generation of word embeddings in Gensim': 'text / NLP / *2vec',
    'Evaluating Topic Models': 'text / NLP / *2vec',
    'Best Practices for Debugging': 'tools / frameworks',
    'Data Science & Data Visualization in Python How to harness power of Python for social good?': 'ethics & social',
    'Conversational AI: Building clever chatbots': '',
    'AI assisted creativity': 'creativity',
}


In [ ]:
df[['title', 'type', 'approach']]

In [ ]:
df.groupby(["color", "day_info"])['level'].count().unstack(level=0).plot.bar(rot=0, cmap='inferno')
