Memory-centric Data Intensive Computing

Linh B. Ngo

Why do we want in-memory computing?

What can MapReduce Offer?

  • Pros:

    • Inherently parallel
    • Data locality
    • Resiliency and Error Recovery
  • Cons:

    • High job-startup overhead (e.g., launching fresh JVMs for every task)
    • Iterative jobs are awkward: there is no way to pass data directly between MR stages, so each iteration must write its output back to HDFS and read it in again (see the sketch below)
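
To make the iteration cost concrete, here is a minimal, hypothetical driver sketch in Python (the job jar, class name, and HDFS paths are made up). Every pass launches a brand-new MapReduce job, paying the startup cost again, and the only way to hand data to the next pass is a full write to and read from HDFS.

```python
import subprocess

NUM_ITERATIONS = 10                 # e.g., a k-means- or PageRank-style loop
current_input = "/data/iter_0"      # hypothetical HDFS path with the initial data

for i in range(1, NUM_ITERATIONS + 1):
    output = f"/data/iter_{i}"
    # Each call spins up a new MR job: task JVMs are launched from scratch,
    # and the iteration's result is persisted to HDFS before the next pass
    # can read it back in.
    subprocess.run(
        ["hadoop", "jar", "iterate.jar", "IterateJob", current_input, output],
        check=True,
    )
    current_input = output          # the next iteration starts from this HDFS output
```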

Do we not like MapReduce anymore?

  • “If all you have is a hammer, everything becomes a nail”
  • MapReduce is good for massive-scale, batch-oriented, single-iteration jobs (initial filtering, data cleaning, …)
  • Additional tools must be developed

What has been developed?

  • Apache Tez
    • Enables automated “chaining” of MR jobs via dataflow graphs
  • Apache Spark
    • Fast processing via read-only in-memory data cache
  • Tachyon/Alluxio
    • RAMdisk-based storage on top of persistent storage (HDFS) to support fast in-memory data access

Apache Tez: Overview

  • Extensible framework for building high-performance batch and interactive data processing applications
  • User-defined flows of data and tasks support complex query logic that would otherwise require multiple MR jobs
  • Current status: Apache top-level project, release 0.9.0 (https://tez.apache.org/releases/index.html)

Apache Tez: Architecture

  • Data processing is represented as a dataflow graph
  • Vertices represent the processing of data
  • Edges represent the movement of data between processing steps
  • For MR-style jobs, Tez can run a chain of Reduce stages (e.g., Map-Reduce-Reduce), whereas classic MapReduce allows only a single Reduce stage per job

Apache Tez: Building Blocks

Vertex: Combination of a set of Inputs, a Processor, and a set of Outputs

  • Input: A pipe through which a processor can accept input data from a data source
  • Processor: The entity responsible for consuming inputs and producing outputs
  • Output: A pipe through which a processor can generate output data

Apache Tez: Building Blocks

Edge: A logical entity that represents a number of physical connections between the tasks of two connected vertices

  • Logical: representing the logical edge between 2 vertices
  • Physical: representing the connection between a task of the Source vertex and a task of a Destination vertex
  • A Logical Edge can contain multiple Physical Connections

Apache Tez: WordCount DAG
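
The WordCount DAG can be described with the building blocks above. The sketch below is purely conceptual Python (the real Tez API is Java; the class and field names here are illustrative): two vertices, a tokenizer and a summation step, joined by one logical edge whose physical connections shuffle words to the tasks that count them.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    name: str
    processor: str                       # the processing logic applied at this vertex
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

@dataclass
class Edge:
    source: Vertex                       # logical edge between two vertices; at runtime
    destination: Vertex                  # it expands into task-to-task physical connections
    movement: str                        # how data moves, e.g. a shuffle-like "scatter-gather"

# WordCount as a two-vertex dataflow graph
tokenizer = Vertex("Tokenizer", "TokenProcessor", inputs=["text files on HDFS"])
summation = Vertex("Summation", "SumProcessor", outputs=["(word, count) pairs on HDFS"])

wordcount_dag = {
    "vertices": [tokenizer, summation],
    "edges": [Edge(tokenizer, summation, movement="scatter-gather")],
}
```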

Apache Spark: Overview

Apache Spark: Target Systems and Applications

  • Systems:

    • Commodity off-the-shelf (COTS) clusters for large-scale data analytics
    • Highly scalable
  • Applications:

    • Reuse intermediate results across multiple computations
    • Common in many iterative machine learning and graph algorithms

Apache Spark: Building Blocks

  • Resilient Distributed Datasets
  • An abstraction that enables efficient data reuse in a broad range of applications
  • Everything in Spark is built from RDDs

Apache Spark: Building Blocks

  • Read-only, partitioned collection of records
  • Created (aka written) through deterministic operations on data
    • In stable storage
    • From other RDDs
  • Coarse-grained operations: map, join, filter …
  • Do not need to be materialized at all times
  • Lost partitions are recoverable by recomputing them from their data lineage (see the sketch below)
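
A minimal PySpark sketch (assuming a local Spark installation and a hypothetical HDFS path) that exercises the properties above: an RDD created from stable storage, new RDDs derived through coarse-grained deterministic operations, lazy materialization until an action runs, and the lineage Spark records for recovery.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Created from data in stable storage (path is hypothetical)
lines = sc.textFile("hdfs:///data/logs.txt")

# Created from other RDDs via coarse-grained, deterministic operations
words = lines.flatMap(lambda line: line.split())
errors = lines.filter(lambda line: "ERROR" in line)

# Nothing is materialized yet: transformations are lazy, and cache() only
# marks the RDD for in-memory reuse once an action computes it.
errors.cache()
print(errors.count())            # action: materializes (and caches) 'errors'

# Lineage: the chain of operations Spark would replay to rebuild lost partitions
print(errors.toDebugString())

sc.stop()
```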

Apache Spark: Not a hammer for ...

  • Applications that make asynchronous, fine-grained updates to shared state are a poor fit for read-only, coarse-grained RDDs:
    • Storage system for a web application (traditionally the domain of standard databases)
    • Incremental web crawler

Apache Spark: Suitable for ...

  • Interactive Analytics/Pleasantly Parallel Applications
  • Examples (see the sketch below):
    • Logistic Regression
    • PageRank
    • Machine Learning

Tachyon/Alluxio: Overview

  • Reliable data sharing at memory speed across cluster computing frameworks (see the sketch below)
  • Versions up to 0.8: Tachyon
  • Versions from 1.0: Alluxio (1.6 at the time of writing)
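
Because Alluxio exposes a Hadoop-compatible file system, existing frameworks can share data through it simply by using alluxio:// paths. The PySpark sketch below is illustrative only: it assumes the Alluxio client jar is on Spark's classpath and that an Alluxio master runs at the hypothetical address alluxio-master:19998 (the default master RPC port).

```python
from pyspark import SparkContext

sc = SparkContext(appName="alluxio-demo")

# One job writes a dataset into Alluxio's in-memory layer ...
raw = sc.textFile("hdfs:///data/raw.txt")
raw.saveAsTextFile("alluxio://alluxio-master:19998/shared/raw")

# ... and a later job (possibly from a different framework) reads it back
# at memory speed, while Alluxio handles persistence to the under-store
# (e.g., HDFS) behind the scenes.
shared = sc.textFile("alluxio://alluxio-master:19998/shared/raw")
print(shared.count())

sc.stop()
```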

Alluxio: Key architectural details

  • Two layers
  • Master-slave Architecture
  • Lineage Overhead and Garbage Collection
  • Data Eviction
  • Master Fault-Tolerance
  • Check-pointing
  • Resource Allocation

Alluxio: Key architectural details

  • Two layers
    • Lineage Layer: In-memory Tachyon daemons
    • Persistent Layer: Replication-based storage systems (HDFS, S3, …)
  • Master-slave Architecture
    • Master: Lineage metadata and workflow manager
    • Workers: memory-mapped files in RAMdisk

Alluxio: Key architectural details

  • Lineage Overhead:
    • Lineage metadata
    • Job binaries
    • Both are garbage collected to keep the overhead bounded
  • Data Eviction: based on access frequency and temporal locality of access

Alluxio: Key architectural details

  • Master Fault-Tolerance
    • Standby secondary master
    • ZooKeeper elects a new master on failure
    • State is rebuilt from the lineage stored in the persistent layer
  • Check-pointing
    • Check-points cover the entire Tachyon system, not just a single application
    • Check-point key elements in the lineage chain
    • Check-point hot (frequently accessed) files
    • Avoid check-pointing temporary files

Alluxio: Key architectural details

  • Resource Allocation
    • How to balance resources between regular work and recomputation work
    • Priority compatibility: recomputation runs at the priority of the job that needs the data
    • Resource sharing: no fixed share is reserved for recomputation
    • Avoid cascading recomputation

Future Directions for the Hadoop Ecosystem