"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doign it ..." Dan Ariely, Duke University
How big is Big Data?
Big enough that traditional techniques can't handle it.
PostgreSQL can handle tables of up to 32 TB. Does that mean Big Data has to be bigger than this?
How big is Big Data?
Big enough that one computer can't store or process it.
The largest single-memory computer, built by HP in 2017, has 160 TB of RAM. Does that mean Big Data has to be bigger than this?
What is big data?
Big data problems are problems in which not only the processing power but also the size of the data is a limiting factor in finding a timely solution.
Big Data
Input data carry the characteristics of Big Data (the 4 Vs: volume, velocity, variety, veracity)
Computational process can be simple and straightforward, with minimal intermediate data being generated
Data Intensive Computing
Input data may or may not be big data
Computational process produces massive and complex intermediate data that needs to be analyzed during the process
How did Big Data come to be in science?
Thousands of years ago, science was empirical, describing natural phenomena.
In the last few hundred years, science developed a theoretical branch using models, equations, and generalizations.
In the last few decades, science has created a new computational branch that specializes in simulating complex phenomena based on theoretical models and equations.
Today, large-scale simulations and advanced technologies have enabled a fourth pillar of scientific discovery, data-driven science, in which scientists synthesize theory, experiment, and computation with statistics.
Scientific data are doubling every year, reaching the petabyte scale.
CERN reported 22 PB in 2012, 125 PB in 2015, and is currently at 200 PB.
Scientific data are being generated from different institutions.
The traditional HPC approach (separating computation from storage) slows down because of the I/O bottleneck.
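As a rough check on the "doubling every year" claim, the two dated CERN figures quoted above (22 PB in 2012 and 125 PB in 2015) can be turned into an implied doubling time. This is only a sketch: it assumes steady exponential growth between the two measurements, and the function name doubling_time_years is made up for illustration.

```python
import math


def doubling_time_years(size_start, size_end, years_elapsed):
    """Doubling time implied by two measurements, assuming exponential growth."""
    growth_factor = size_end / size_start
    return years_elapsed * math.log(2) / math.log(growth_factor)


# Figures quoted in the slide: 22 PB in 2012 and 125 PB in 2015.
cern_doubling = doubling_time_years(22, 125, 2015 - 2012)
print(f"Implied doubling time: {cern_doubling:.2f} years")
```

Under that assumption the data volume doubles about every 1.2 years, which is in line with the rule of thumb of doubling every year.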
How did Big Data come to be in industry?
Better, faster ways of generating and collecting data online (Google, Facebook, Amazon, eBay ...)
The emergence of embedded sensors across all industries (smart devices)
GE: By 2020, up to 50 billion devices (many of them industrial machines) will be connected to the internet.
Amazon: Uses big data analytics to analyze sales and provide recommendations to buyers ("Customers who bought this ...")
Smart cars:
More than 100 sensors
Can generate up to 25 GB of data per hour (McKinsey report)
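To get a feel for how the per-car figure scales, the arithmetic below multiplies it across a fleet. It is a back-of-the-envelope sketch, not a measurement: it reads the quoted figure as 25 gigabytes per hour, and the fleet size (one million cars) and driving time (one hour per day) are hypothetical values chosen only to make the numbers easy to follow.

```python
GB_PER_CAR_HOUR = 25    # upper bound from the slide, read as gigabytes
GB_PER_PB = 1_000_000   # decimal units: 1 PB = 1,000,000 GB


def fleet_data_pb_per_day(num_cars, driving_hours_per_day,
                          gb_per_hour=GB_PER_CAR_HOUR):
    """Upper-bound daily data volume of a fleet, in petabytes."""
    total_gb = num_cars * driving_hours_per_day * gb_per_hour
    return total_gb / GB_PER_PB


# Hypothetical fleet: one million cars, each driven one hour per day.
fleet_pb = fleet_data_pb_per_day(num_cars=1_000_000, driving_hours_per_day=1)
print(f"Up to {fleet_pb:.0f} PB per day")
```

With these assumed inputs the fleet produces up to 25 PB per day, so even a modest fleet passes the single-table and single-machine limits mentioned earlier within a few days.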
Big Data Applications
Consumer Services: web search, recommendation engines (Amazon, Netflix), social networks, video analytics (YouTube), Internet of Things (NEST, wearables, fitness trackers, connected vehicles …)
Industrial Manufacturing: Supply Chain and Logistics, Assembly Quality, Smart Machines