Feb 17-20, 2015
San Jose, CA

Big Data Analytics Trends

Kyle Polich

DataScience, Inc

The Data Skeptic Podcast


My slides are available online at:

Thanks to...

Where I won a free conference badge
Who provided the pass and organized the conference
IBM and Alex Liu, who host this meetup

My High Level Take-aways

  • Hadoop is becoming to data what C/C++ are to programming
  • Spark is growing in support
  • Streaming and in-memory are strong themes
  • R continues to grow, notably with Spark support
  • Continued growth in deep learning applications and tools
  • Nothing exciting about text analysis (NLP winter?)
  • Data Cleansing becoming a more noteworthy topic
    • Trifacta
    • Paxata
    • OpenRefine
    • And more
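
To make the data cleansing point concrete, here is a minimal sketch of key-collision clustering, the general idea behind the "cluster and merge" features in tools of this kind. The `fingerprint` and `cluster` functions and the sample values are made up for illustration; they are not taken from any of the tools listed above.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so that trivially different spellings collide."""
    value = value.strip().lower()
    # Drop punctuation, then sort and de-duplicate the remaining tokens.
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


def cluster(values):
    """Group raw values that share the same fingerprint."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Only keys with more than one distinct raw value are worth reviewing.
    return {k: vs for k, vs in groups.items() if len(set(vs)) > 1}


# Hypothetical messy column values (illustrative only).
raw = ["DataScience, Inc", "datascience inc.", "Inc DataScience", "Cloudera"]
clusters = cluster(raw)
print(clusters)
```

A human would still review each cluster before merging; the fingerprint only proposes candidates.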

Thursday Keynotes

Amr Awadallah (Cloudera)

  • Hadoop enterprise data hub
  • Centralized place, built on Hadoop, to unify all data infrastructure
  • Enables a true 360-degree view and the ability to move applications to the data
  • Create a "data lake"

Lisa Hammitt (Salesforce)

Some interesting anecdotes and case studies

  • SCSF using Fitbit to study Alzheimer's disease
  • Puffer fish takes a 360-degree image with your phone and crowdsources wardrobe feedback

"Systems of engagement lead to systems of intelligence"

Eric Frenkiel (MemSQL)

  • MemSQL Spark connector announced, aimed at more real-time workloads
  • Pinterest identifying 10k events per second to find geographic trends with real-time search


DJ Patil

  • Announced as the first US Chief Data Scientist
  • "Data science is a team sport"
  • Medicine and medical records should be a focus
  • Data scientists should be focused on responsibility and making a difference
  • Commitment to open data

The Human-Data Interface: How to Design for Irrational Data Consumers

  • Cathy Tanimura (Okta)
  • Small schools fallacy
  • Random charts are convincing
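
The small schools fallacy can be sketched with a short simulation. Every student below is drawn from the same score distribution, so no school is truly better than another, yet the small schools dominate both ends of the ranking simply because their averages are noisier. The school sizes, counts, and score distribution are arbitrary assumptions chosen for illustration, not figures from the talk.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def school_mean(n_students):
    # Every student, in every school, is drawn from the same distribution.
    return statistics.mean(rng.gauss(500, 100) for _ in range(n_students))


# 200 small schools (20 students) and 200 large schools (500 students).
schools = [("small", school_mean(20)) for _ in range(200)]
schools += [("large", school_mean(500)) for _ in range(200)]

ranked = sorted(schools, key=lambda s: s[1], reverse=True)
top10 = [size for size, _ in ranked[:10]]
bottom10 = [size for size, _ in ranked[-10:]]

# Spread of the school averages, per school size.
spread = {
    size: statistics.stdev(m for s, m in schools if s == size)
    for size in ("small", "large")
}
print(top10.count("small"), bottom10.count("small"), spread)
```

Reading only the top of the ranking, one might conclude that small schools are better; the bottom of the same ranking tells the opposite story.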


The Two Cultures of People Science

  • Michelangelo D'Agostino (Civis Analytics)
  • Civis pioneered data science on political campaigns, starting with Obama 2012
  • Talked about the overlap and disconnect between data science and social science

Pro Bono Data Science in Action

  • Crisis Text Line was data-driven from the start
  • Noelle Sio (Pivotal) participated as part of Pivotal for Good via DataKind
  • Bob Filbin is Chief Data Scientist at Crisis Text Line
  • Basic analytics gave them an understanding of repeat users
  • Text analytics helped them detect the need for intervention early


Project Jupyter

  • Jupyter is the next evolution of the IPython Notebook
  • Provides language independence, similar to the JVM
  • Notebooks currently allow one language; this is unlikely to change
  • New versions will focus more on collaboration and hosted notebooks.
  • They are working on authentication and easy Docker integration
  • Multi-user server is currently in beta
  • Support for notebooks on Google Drive

  • Perhaps a demo at the end of my talk



  • Eddie Garcia (Cloudera) pleading for "secure by default"
  • Eden Medina - Project Cybersyn
  • Roman Shaposhnik (Pivotal) - Open Data Platform alliance to create unified Hadoop distributions

Matei Zaharia (Spark creator)

  • Growth of Spark is insane
  • 100 TB sort record set in 2014, in 1/3 the time on 1/10 the machines compared to the Hadoop record
  • In 2015:
    • better high-level interfaces for data science, abstract cluster as if single machine; more platform interfaces - plugin data sources and algos
    • DataFrames being added to Spark in 1.3!
    • Has an optimizer comparable to a SQL optimizer
    • Spark 1.4 introduces an R interface, joining Scala, Java, and Python

Spark is the most active top level Apache Project

Joseph Sirosh (Microsoft)

  • Connected cows - data into cloud for analysis
  • Able to predict estrus
  • Time of insemination affects success rate and gender

Jeffrey Heer (Trifacta)

  • Interactive Data Lab at the University of Washington
  • Part of the original team that created D3
  • People celebrate bespoke graphics, but most graphics use defaults
  • Studies of effectiveness of graphics
    • Accuracy of Visual Decoding
    • Comparing Quantities - best is position
    • Color is least accurate for quantitative data
    • Augmenting ambiguity with parallel presentation
  • Trifacta working on algos to propose the clearest presentation of data
  • Better data viz on penicillin data might have identified earlier an error made in the scientific community

Source: Animated Transitions in Statistical Data Graphics by Jeffrey Heer, George G. Robertson

Designing Delightful Data Products

  • Alonzo Canada (Interana)
  • Observations - watching, asking, doing, and reflecting
  • Start with a need
  • Defer judgement
  • Gather different voices
  • Conduct experiments to validate assumptions
  • Borrowing goodwill - how Splunk leveraged Google

Data Science vs. The Bad Guys

  • David Freeman - Fraud detection at LinkedIn
  • Their challenges include fake accounts, stolen accounts, and scraping
  • Precision/recall tradeoffs are a challenge
  • Online/offline tradeoff
  • Models are fairly simple Bayesian approaches; the key insight is activity abstraction
  • They often apply heuristics to get 90% of the way, then layer ML on top
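
The precision/recall tradeoff mentioned above can be illustrated with a few lines of Python. This is only a sketch: the scores and labels are made-up numbers, and `precision_recall` is a hypothetical helper, not anything described in the talk. It shows how moving the flagging threshold trades missed abuse (low recall) against false alarms on legitimate users (low precision).

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when flagging every score >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum((not f) and y for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up model scores (higher = more suspicious) and true labels (1 = abusive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.85, 0.55, 0.25):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

With this toy data, a strict threshold flags only sure cases (perfect precision, low recall), while a loose one catches every abusive account at the cost of flagging some legitimate ones.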

Thank you!
We are hiring