
Feb 17-20, 2015
San Jose, CA

Big Data Analytics Trends

Kyle Polich

DataScience, Inc

The Data Skeptic Podcast

@dataskeptic

kyle@datascience.com

My slides are available online at:
http://dataskeptic.com/talks/strata2015/

Thanks to...

	Where I won a free conference badge
	Who provided the pass and organized the conference
	IBM and Alex Liu who host this meetup

My High Level Take-aways

Hadoop is becoming to data what C/C++ are to programming
Spark is growing in support
Streaming and in-memory are strong themes
R continues to grow, notably with Spark support
Continued growth in deep learning applications and tools
Nothing exciting about text analysis (NLP winter?)
Data Cleansing becoming a more noteworthy topic
- Trifacta
- Paxata
- OpenRefine
- And more

Source: http://redmonk.com/dberkholz/2015/02/17/strata-2015-reaching-for-the-business-user/

Thursday Keynotes

Amr Awadallah (Cloudera)

Hadoop enterprise data hub
Centralized place, built on Hadoop, to unify all data infrastructure
Enables true 360 view and ability to move applications to the data.
Create a "data lake"

Lisa Hammitt (Salesforce)

Some interesting anecdotes and case studies

SCSF using Fitbit to study Alzheimer's disease
Puffer fish takes 360 image with your phone and crowdsources wardrobe feedback

"Systems of engagement lead to systems of intelligence"

Eric Frenkiel (MemSQL)

MemSQL Spark connector announced to support more real-time workloads
Pinterest handling 10k events per second to find geographic trends with real-time search

Source: http://blog.memsql.com/pinterest-apache-spark-use-case/

DJ Patil

Announced as the first U.S. Chief Data Scientist
"Data science is a team sport"
Medicine and medical records should be a focus
Data scientists should be focused on responsibility and making a difference
Commitment to open data

The Human-Data Interface: How to Design for Irrational Data Consumers

Cathy Tanimura (Okta)
Small schools fallacy
Random charts are convincing

Source: http://mic.com/articles/102546/want-to-make-change-someone-s-mind-just-show-them-a-random-chart
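The small schools fallacy can be illustrated with a short simulation. This is a minimal sketch, not material from the talk: the district, school sizes, and score distribution are all made up. Every student's score is drawn from the same distribution, so no school is actually better, yet small schools fill both ends of the ranking because their averages are noisier.

```python
import random


def simulate_school_means(sizes, seed=0):
    """Return (size, mean score) per school; all students share one distribution."""
    rng = random.Random(seed)
    means = []
    for n in sizes:
        # Hypothetical test scores: mean 500, standard deviation 100
        scores = [rng.gauss(500, 100) for _ in range(n)]
        means.append((n, sum(scores) / n))
    return means


# Hypothetical district: 200 small schools (50 students), 200 large (2000 students)
sizes = [50] * 200 + [2000] * 200
results = simulate_school_means(sizes)

# Rank schools by average score, best first
ranked = sorted(results, key=lambda r: r[1], reverse=True)
small_in_top = sum(1 for n, _ in ranked[:20] if n == 50)
small_in_bottom = sum(1 for n, _ in ranked[-20:] if n == 50)

print("small schools in top 20:", small_in_top)
print("small schools in bottom 20:", small_in_bottom)
```

Looking only at the top of the ranking suggests that small schools perform better. The bottom of the ranking shows the effect is sample size, not quality.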

The Two Cultures of People Science

Michelangelo D'Agostino (Civis Analytics)
Civis pioneered data science on political campaigns, starting with Obama 2012
Talked about the overlap and disconnect between data science and social science

Pro Bono Data Science in Action

Crisis Text Line was data-driven from the start
Noelle Sio (Pivotal) participated as part of Pivotal for Good via DataKind
Bob Filbin is Chief Data Scientist at Crisis Text Line
Basic analytics gave them an understanding of repeat users
Text analytics helped them detect the need for intervention early

Source: http://blog.pivotal.io/data-science-pivotal/features/pivotal-for-good-with-crisis-text-line-a-first-look

Project Jupyter

Jupyter is the next evolution of the IPython Notebook
Provides language independence, similar to the JVM
Notebooks currently allow one language; this is unlikely to change
New versions will focus more on collaboration and hosted notebooks.
They are working on authentication and easy Docker integration
Multi-user server is currently in beta
Support for notebooks on Google Drive
bit.ly/inature
Perhaps a demo at the end of my talk

FRIDAY

Keynotes

Eddie Garcia (Cloudera) - made a plea for secure-by-default systems
Eden Medina - Project Cybersyn
Roman Shaposhnik (Pivotal) - Open Data Platform alliance to create unified Hadoop distributions

Matei Zaharia (Spark creator)

Spark is growing extremely fast
100 TB sort record set in 2014: a third of the time on a tenth of the machines compared to the Hadoop record
In 2015:
- Better high-level interfaces for data science that abstract the cluster as if it were a single machine; more platform interfaces for plugging in data sources and algorithms
- Data frames being added to Spark in 1.3!
- Data frames have an optimizer comparable to a SQL optimizer
- Spark 1.4 introduces an interface to R, joining Scala, Java, and Python
http://databricks.com/moocs
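The two ratios quoted for the sort record can be combined into one efficiency figure. This is a back-of-the-envelope sketch that uses only the 1/3 and 1/10 ratios from the slide. It assumes both runs sorted the same 100 TB and that machine-time (machines multiplied by elapsed time) is a fair unit of cost, which ignores hardware differences between the clusters.

```python
from fractions import Fraction

# Ratios from the slide: Spark relative to the earlier Hadoop record
time_ratio = Fraction(1, 3)      # elapsed time
machine_ratio = Fraction(1, 10)  # number of machines

# Machine-time consumed = machines * elapsed time
machine_time_ratio = time_ratio * machine_ratio

# Same data sorted with less machine-time means more data per machine per unit time
per_machine_throughput_gain = 1 / machine_time_ratio

print("machine-time relative to Hadoop record:", machine_time_ratio)
print("per-machine throughput gain:", per_machine_throughput_gain)
```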

Spark is the most active top level Apache Project

Source: Joseph Rickert, http://www.r-bloggers.com/a-first-look-at-spark/

Joseph Sirosh (Microsoft)

Connected cows - data into cloud for analysis
Able to predict estrus
Time of insemination affects success rate and gender

Re that connected cow: scientists can determine gender of cow on timing of artificial insemination #StrataHadoop #IoT pic.twitter.com/CuI3mcc1PO
— Kathy Yu (@kathykmy) February 20, 2015

Jeffrey Heer (Trifacta)

Interactive data lab at University of Washington
Part of the original team that created D3
People celebrate bespoke graphics, but most graphics use defaults
Studies of effectiveness of graphics
- Accuracy of Visual Decoding
- Comparing Quantities - best is position
- Color is least accurate for quantitative data
- Augmenting ambiguity with parallel presentation
Trifacta working on algorithms to propose the clearest presentation of data
Better data viz on penicillin data might have identified an error made in the scientific community earlier

Source: Animated Transitions in Statistical Data Graphics by Jeffrey Heer, George G. Robertson

Designing Delightful Data Products

Alonzo Canda (Interana)
Observations - watching, asking, doing, and reflecting
Start with a need
Defer judgement
Gather different voices
Conduct experiments to validate assumptions
Borrowing goodwill - how Splunk leveraged Google

Data Science vs. The Bad Guys

David Freeman - Fraud detection at LinkedIn
Their challenges include fake accounts, stolen accounts, and scraping
Precision/recall tradeoffs are a challenge
Online/offline tradeoff
Models are fairly simple Bayesian approaches; the key insight is activity abstraction
They often apply heuristics to get to 90%, then ML on top

Source: David Freeman's slides
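The precision/recall tradeoff mentioned above can be shown with a toy example. This is a minimal sketch on made-up scores and labels, not LinkedIn's model or data: lowering the score threshold for flagging an account catches more fake accounts (higher recall) but also flags more legitimate ones (lower precision).

```python
def precision_recall(scores, labels, threshold):
    """Flag every account with score >= threshold and compare to the true labels."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)  # fakes caught
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)  # real users flagged
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)   # fakes missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical model scores and ground truth (1 = fake account, 0 = legitimate)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.85, 0.5, 0.25):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Which threshold is acceptable depends on the cost of each error: blocking a real member versus letting a fake account through.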

Thank you!

kyle@datascience.com We are hiring http://bit.ly/1LvAAVc		@DataSkeptic
