In [1]:
import requests as rq
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bs4
import os
from tqdm import tqdm_notebook
from datetime import time
%matplotlib inline
In [2]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
In [3]:
base_url = "https://pydata.org"
r = rq.get(base_url + "/berlin2017/schedule/")
bs = bs4.BeautifulSoup(r.text, "html.parser")
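Before parsing, it is worth confirming the request actually succeeded (plain requests API):
In [ ]:
r.raise_for_status()  # raises an HTTPError if the schedule page did not return 2xx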
Let's query every talk description:
In [4]:
data = {}
for ahref in tqdm_notebook(bs.find_all("a")):
    # only follow links to individual talk pages; anchors without an
    # href would otherwise break the membership test
    url = ahref.get("href", "")
    if 'schedule/presentation' not in url:
        continue
    resp = bs4.BeautifulSoup(rq.get(base_url + url).text, "html.parser")
    title = resp.find("h2").text
    resp = resp.find_all(attrs={'class': "container"})[1]
    when, who = resp.find_all("h4")
    date_info = when.string.split("\n")[1:]
    day_info = date_info[0].strip()
    time_inf = date_info[1].strip()
    room_inf = date_info[3].strip()[3:]
    speaker = who.find("a").text
    level = resp.find("dd").text
    abstract = resp.find(attrs={'class': 'abstract'}).text
    description = resp.find(attrs={'class': 'description'}).text
    data[url] = {
        'day_info': day_info,
        'title': title,
        'time_inf': time_inf,
        'room_inf': room_inf,
        'speaker': speaker,
        'level': level,
        'abstract': abstract,
        'description': description,
    }
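A quick sanity check on the scraped data (the exact count depends on the live schedule page):
In [ ]:
print(len(data))           # number of talk pages found
next(iter(data.values()))  # one sample entry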
Okay, make a dataframe and add some helpful columns:
In [5]:
df = pd.DataFrame.from_dict(data, orient='index')
df.reset_index(drop=True, inplace=True)
In [6]:
import pandas_profiling as pp
pfr = pp.ProfileReport(df)
In [7]:
from IPython.display import display, HTML
# demote the report's headings so they nest under this notebook's TOC
display(HTML(
    pfr.html.replace("<h3", "<h4").replace("<h2", "<h3").replace("<h1", "<h2")
))
In [8]:
# Tutorials on Friday
df.loc[df.day_info=='Friday', 'tutorial'] = True
df['tutorial'].fillna(False, inplace=True)
In [9]:
# time handling: split the "start–end" slot string on the en dash into two columns
df['time_from'], df['time_to'] = zip(*df.time_inf.str.split(u'\u2013'))
df.time_from = pd.to_datetime(df.time_from).dt.time
df.time_to = pd.to_datetime(df.time_to).dt.time
del df['time_inf']
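To see what the zip-unpack above does, a minimal standalone illustration (the sample string is hypothetical, assuming slots are formatted as "start–end" with an en dash, \u2013):
In [ ]:
sample = pd.Series([u"10:30\u201312:00"])
start, end = zip(*sample.str.split(u'\u2013'))
pd.to_datetime(pd.Series(start)).dt.time[0], pd.to_datetime(pd.Series(end)).dt.time[0]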
In [10]:
df.head(3)
Out[10]:
In [11]:
# Example: query all non-novice talks on Sunday that start at 4 pm or later
tmp = df.query("(level!='Novice') & (day_info=='Sunday')")
tmp[tmp.time_from >= time(16)]
Out[11]:
In [12]:
ax = df.level.value_counts().plot.pie(figsize=(3,3), autopct="%1.1f %%")
ax.axis("equal")
ax.set_ylabel("")
ax.set_title("Levels of the talks")
plt.show()
In [13]:
ax = df.groupby("tutorial")['level'].value_counts().unstack(level=0).plot.pie(
    subplots=True, legend=False, autopct="%1.1f %%", startangle=90, labels=["", "", ""])
for axx in ax:
    axx.axis("equal")
    axx.set_ylabel("")
ax[0].set_xlabel("Not tutorial")
ax[1].set_xlabel("Tutorial")
plt.gcf().suptitle("Level of the talks (split by tutorial or not):", fontsize=16)
plt.tight_layout()
axx.legend(
    df.groupby("tutorial")['level'].value_counts().unstack(level=0).index.tolist(),
    loc='center left', bbox_to_anchor=(1, .8))
plt.show()
In [14]:
# treat columns with comparatively few distinct values as categorical
categorical_cols = [col for col in df.columns if len(df[col].unique()) <= len(df[col]) / 5]
f, ax = plt.subplots(2, 2, figsize=(20, 10), sharey="row")
for i, axx in enumerate(ax.flatten()):
    col = categorical_cols[i]
    df.groupby("tutorial")[col].value_counts().unstack(level=0).plot.bar(ax=axx, rot=0, stacked=True)
    axx.set_title(col)
    axx.set_facecolor("white")
    axx.grid(True, color="lightgrey")
f.suptitle("Number of talks grouped by day, level, is tutorial, room: ...", fontsize=20)
f.tight_layout()
In [ ]:
# strip periods from the titles (the rename_dict keys further down still
# contain colons and commas, so those characters are left untouched)
df.title = df.title.str.replace(".", "", regex=False)
In [ ]:
from itertools import chain
In [ ]:
# share of each unique word across all titles
foo = pd.DataFrame(
    pd.Series(list(chain(*df.title.str.split().apply(lambda x: np.unique(x)).tolist()))).value_counts(normalize=True)
).reset_index().rename(columns={0: 'share', 'index': 'word'}).query("share >= 0.001")
In [ ]:
foo.head()
In [ ]:
foo['len'] = foo.word.str.len()
In [ ]:
from gensim import corpora, models, similarities
In [ ]:
documents = df.query("tutorial != True").title.tolist()
In [ ]:
stoplist = set('''
for a of the and to in i be on with here we will an each its type as our their then apply them very would this
make large talk, basic search is are there more than pages it can or that they how by have what from talk use you
these using which but some not your do used at if like such has about - my one most those should between may good • why
give way time been need many so does case when also all into lot build features new who often discuss building
best text * was out
'''.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
In [ ]:
from collections import defaultdict

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# keep only tokens that occur more than once across all titles
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
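The gensim imports above are not used again in this notebook; for completeness, a minimal sketch of where this preprocessing would typically lead with the standard gensim workflow (not part of the original analysis):
In [ ]:
# hedged sketch: bag-of-words corpus, tf-idf weights, and a similarity query
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
sims = index[tfidf[corpus[0]]]  # similarity of every title to the first one
sorted(enumerate(sims), key=lambda x: -x[1])[:5]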
In [ ]:
pd.DataFrame(dict(nr=dict(frequency))).sort_values('nr', ascending=False).query("nr>1")
In [ ]:
keywords = pd.DataFrame(dict(nr=dict(frequency))).sort_values('nr', ascending=False).query("nr>1").index.tolist()
In [ ]:
# iterate from least to most frequent, so the most frequent matching
# keyword (assigned last) wins
for keyword in keywords[::-1]:
    df.loc[df.title.str.lower().str.contains(keyword), 'color'] = keyword
df.color.fillna("UNKNOWN", inplace=True)
In [ ]:
df.head()
In [ ]:
pd.set_option("display.max_colwidth", 400)
In [ ]:
df['approach'] = np.where(df.tutorial, "tutorial", np.nan)
df.loc[df.title.str.lower().str.contains('ai'), 'approach'] = 'artificial intelligence'
df.loc[df.title.str.lower().str.contains('artificial intelligence'), 'approach'] = 'artificial intelligence'
df.loc[df.title.str.lower().str.contains('pandas'), 'approach'] = 'pandas'
df.loc[df.title.str.lower().str.contains('jupyter'), 'approach'] = 'jupyter'
df.loc[df.title.str.lower().str.contains('data scien'), 'approach'] = 'data science'
df.loc[df.title.str.lower().str.contains('machine learn'), 'approach'] = 'machine learning'
df.loc[df.title.str.lower().str.contains('deep'), 'approach'] = 'deep learning'
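Note that str.contains('ai') is a plain substring match, so it also fires on titles containing words like "maintain" or "explain". If that over-matching matters, a word-boundary regex is the safer variant (a sketch, not what was used above):
In [ ]:
# hypothetical alternative: match 'ai' only as a standalone word
df.loc[df.title.str.contains(r'\bai\b', case=False), 'approach'] = 'artificial intelligence'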
In [ ]:
# attention: may change when querying again
df.loc[11, 'approach'] = 'R'
df.loc[13, 'approach'] = 'Julia'
In [ ]:
df.loc[df.title.apply(lambda x: any(y in x.lower() for y in ['question', 'text', 'natural language', 'nlp', 'doc2vec'])), 'type'] = 'text / NLP'
df.loc[df.title.apply(lambda x: any(y in x.lower() for y in ['creativ', 'image', 'signal process'])), 'type'] = 'image process'
df['type'] = np.where(df.tutorial, "tutorial", df.type)
In [ ]:
# print a skeleton dict of all non-tutorial titles, to be filled in by hand
print('{')
for title in df.title:
    if df.loc[df.title == title, 'tutorial'].values[0]:
        continue
    print("\t'" + title + "': '',")
print('}')
In [ ]:
# rename_dict is defined a few cells further down, filled in by hand from
# the skeleton printed above
pd.DataFrame({'bla': rename_dict})
In [ ]:
df['type'] = df.title.map(rename_dict)
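Titles missing from rename_dict (or mapped to an empty string) end up as NaN or '' in the new column; worth a quick check:
In [ ]:
# sanity check: titles that did not receive a type from the mapping
df.loc[df['type'].isna() | (df['type'] == ''), 'title']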
In [ ]:
df.groupby(["type", "level", "approach"]).color.count().reset_index().head(1)
In [ ]:
gb = df.groupby(["type", "approach"]).color.count()
In [ ]:
gb_flat = gb.reset_index()  # 'level' is not among this groupby's keys
for col in ['type', 'approach']:
    gb_flat[col] = gb_flat[col].astype("category")
In [ ]:
cats = df.level.unique()
In [ ]:
df.level.astype("category")
In [ ]:
gb.unstack(level=1).plot.barh()
In [ ]:
rename_dict = {
    'Introduction to Search': 'search',
    'Data Science for Digital Humanities: Extracting meaning from Images and Text': 'text / NLP / *2vec',
    'TNaaS - Tech Names as a Service': 'creativity',
    'Developments in Test-Driven Data Analysis': 'testing',
    'Analysing user comments on news articels with Doc2Vec and Machine Learning classification': 'text / NLP / *2vec',
    'Patsy: The Lingua Franca to and from R': 'R',
    'Large Scale Vandalism Detection in Knowledge Bases': 'network data',
    'Fast Multidimensional Signal Processing using Julia with Shearlabjl': 'images',
    'Social Networks and Protest Participation: Evidence from 130 Million Twitter Users': 'ethics & social',
    'Patterns for Collaboration between Data Scientists And Software Engineers': 'tools / frameworks',
    'Blockchains for Artificial Intelligence': 'blockchain',
    'Data Analytics and the new European Privacy Legislation': 'laws',
    'Building smart IoT applications with Python and Spark': 'iot',
    '“Which car fits my life?” - mobilede’s approach to recommendations': 'recommendation',
    'Towards Pythonic Innovation in Recommender Systems': 'recommendation',
    'Gold standard data: lessons from the trenches': 'tools / frameworks',
    'Biases are bugs: algorithm fairness and machine learning ethics': 'ethics & social',
    'On Bandits, Bayes, and swipes: gamification of search': 'active learning',
    'Engage the Hyper-Python - a rattle-through many of the ways you can make a Python program faster': 'tools / frameworks',
    'Fairness and transparency in machine learning: Tools and techniques': 'ethics & social',
    "Machine Learning to moderate ads in real world classified's business": '',
    'Size Matters! A/B Testing When Not Knowing Your Number of Trials': '',
    'Is That a Duplicate Quora Question?': 'text / NLP / *2vec',
    'Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics': '',
    'Where are we looking? Prediciting human gaze using deep networks': '',
    'Spying on my Network for a Day: Data Analysis for Networks': 'network data',
    'Deep Learning for detection on a phone: how to stay sane and build a pipeline you can trust': '',
    'A word is worth a thousand pictures: Convolutional methods for text': 'text / NLP / *2vec',
    'Polynomial Chaos: A technique for modeling uncertainty': '',
    'Kickstarting projects with Cookiecutter': 'tools / frameworks',
    'What does it all mean? - Compositional distributional semantics for modelling natural language': 'text / NLP / *2vec',
    'When the grassroots grow stronger - 2017 through the eyes of German open data activists': 'ethics & social',
    'Finding Lane Lines for Self Driving Cars': 'images',
    'Find the text similiarity you need with the next generation of word embeddings in Gensim': 'text / NLP / *2vec',
    'Evaluating Topic Models': 'text / NLP / *2vec',
    'Best Practices for Debugging': 'tools / frameworks',
    'Data Science & Data Visualization in Python How to harness power of Python for social good?': 'ethics & social',
    'Conversational AI: Building clever chatbots': '',
    'AI assisted creativity': 'creativity',
}
In [ ]:
df[['title', 'type', 'approach']]
In [ ]:
df.groupby(["color", "day_info"])['level'].count().unstack(level=0).plot.bar(rot=0, cmap='inferno')