Big Data Overview

Linh B. Ngo

"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ..." Dan Ariely, Duke University

How big is Big Data?

  • Big enough that traditional techniques can't handle it.

PostgreSQL can handle tables up to 32 TB in size. Does that mean Big Data has to be bigger than this?

How big is Big Data?

  • Big enough that one computer can't store or process.

The largest single-memory computer, built by HP in 2017, has 160 TB of RAM. Does that mean Big Data has to be bigger than this?

What is big data?

Big data problems are problems in which not only the processing power but also the size of the data is a limiting factor in finding a timely solution.

Big Data

  • Input data carrying characteristics of Big Data (the 4 Vs: Volume, Velocity, Variety, Veracity)
  • Computational process can be simple and straightforward, with minimal intermediate data being generated

Data Intensive Computing

  • Input data may or may not be big data
  • Computational process produces massive and complex intermediate data that needs to be analyzed during the process
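The distinction above can be made concrete with a toy sketch (a hypothetical illustration, not from the slides): even a modest input can produce intermediate data that grows quadratically, e.g. when a computation must examine all pairs of records. In that case it is the intermediate data, not the input, that dominates the cost.

```python
from itertools import combinations

# Small input: 1,000 records.
items = list(range(1000))

# Intermediate data: every pairwise combination of records,
# n*(n-1)/2 = 499,500 pairs -- ~500x larger than the input.
pairs = list(combinations(items, 2))

print(len(items), len(pairs))  # 1000 499500
```

Here the input would not qualify as "big", but the computational process still becomes data intensive.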

How did Big Data come to be in science?

  • Thousands of years ago, science was empirical, describing natural phenomena.
  • Last few hundred years, science developed theoretical branch using models, equations, and generalizations.
  • Last few decades, science has created a new computational branch that specializes in simulating complex phenomena based on theoretical models and equations.
  • Today, large-scale simulations and advanced technologies have enabled the fourth pillar of scientific discovery, data-driven science, where scientists synthesize theory, experiment, and computation with statistics.
  • Scientific data are doubling every year, reaching petabytes of data.
    • CERN reported 22PB in 2012, 125PB in 2015, and is currently at 200PB.
  • Scientific data are being generated from different institutions.
  • The traditional HPC approach (separation of computation and storage) will slow down due to the I/O bottleneck.

How did Big Data come to be in industry?

  • Better, faster way of generating and collecting data online (Google, Facebook, Amazon, eBay ...)
  • The emergence of embedded sensors across all industry (smart devices)
  • GE: By 2020, up to 50 billion devices (many of them industrial machines) will be connected to the internet.
  • Amazon: Uses big data analytics to analyze sales and provide recommendations/suggestions to buyers ("Customers who bought this ...")
  • Smart cars:
    • More than 100 sensors
    • Can generate up to 25 GB of data per hour (McKinsey report)

Big Data Applications

  • Consumer Services: Web search, recommendation engines (Amazon, Netflix), social networks, video analytics (YouTube), Internet of Things (NEST, wearables, fitness trackers, connected vehicles ...)
  • Industrial Manufacturing: Supply Chain and Logistics, Assembly Quality, Smart Machines
  • Government: Census, Archiving, Image Surveillance, Situation Assessment
  • Sciences: Genome sequencing, astrophysics, particle physics

Programming Paradigm for Big Data

  • Multi-faceted challenges:
    • Require not only parallel computation but also parallel data processing
  • New computational tools and strategies
  • New data intensive scalable architectures
  • Science is moving increasingly from hypothesis-driven to data-driven discoveries
  • Industry is at a stage where big data infrastructures are integrated and big data sets are beginning to be analyzed to produce business insights

Data Intensive Approach

  • Scale “out”, not “up”
    • It is easier and cheaper to add nodes to an existing cluster than to build a faster cluster.
  • Move computation to the data
    • Reduce data movement.
  • Sequential processing, avoid random access
    • Reduce seek movement on disks.
  • Seamless scalability
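The principles above can be sketched with a toy map-reduce-style word count (a minimal illustration under assumed data, not a real distributed implementation): each "node" processes its own partition locally and sequentially, and only the small per-partition summaries are moved and merged.

```python
from collections import defaultdict

def map_partition(lines):
    """Count words within one partition, reading its lines sequentially
    (computation runs where the data lives)."""
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def reduce_counts(partial_counts):
    """Merge the small per-partition summaries into a global result
    (only summaries move, not the raw data)."""
    total = defaultdict(int)
    for counts in partial_counts:
        for word, n in counts.items():
            total[word] += n
    return dict(total)

# Two hypothetical "nodes", each holding a partition of the data locally.
partition_a = ["big data big compute", "data moves less"]
partition_b = ["compute moves to data"]

result = reduce_counts([map_partition(partition_a), map_partition(partition_b)])
print(result["data"])  # 3
```

Scaling "out" here means adding more partitions (nodes), not a faster machine: each new node contributes both storage and processing for its own slice of the data.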