Big Data Analytic Tools

What is Hadoop?

Apache Hadoop is an open-source programming framework designed for storing, analyzing, and processing very large datasets across clusters of commodity hardware.

Created by Doug Cutting and Mike Cafarella in 2005

Moving computation to data: instead of shipping data to a central compute cluster, Hadoop ships code to the nodes where the data already lives (a minimal sketch follows the list below).

Commodity PCs: a key difference between an HPC cluster and a Hadoop cluster.

Compute nodes and storage nodes are the same machines, which gives:

  • Cheap architecture
  • Scalability
  • Reliability (handling node failures)
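
To make "moving computation to data" concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts reading stdin and writing stdout. The file names mapper.py and reducer.py are just conventions for this sketch, not part of Hadoop itself.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for each word on stdin.
# Hadoop Streaming runs a copy of this script on the nodes that
# hold blocks of the input file: the code moves, not the data.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums counts per word. The framework sorts mapper
# output by key, so all lines for a word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this would be submitted with the Hadoop Streaming jar, along the lines of `hadoop jar hadoop-streaming-*.jar -input <hdfs-in> -output <hdfs-out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the exact jar path depends on the installation).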

Data analysis limitation: this framework lets us work with massive datasets, but the kind of analysis changes. In machine learning you may run very complex analyses on a relatively small dataset; Hadoop lets us use tremendous amounts of data, but imposes a limit on the complexity of the analysis.

HBASE: uses HDFS and MapReduce to organize data by columns

HIVE: uses a metastore to apply schema information and provide a SQL-like interface

PIG: a scripting language, providing more flexibility than SQL

SPARK: a more general engine for cluster programming, with an interface to HDFS, replacing MapReduce (see the sketch after this list)

SPLUNK: focused on machine generated data
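
As an illustration of the SPARK entry, a minimal PySpark sketch that reads directly from HDFS and counts words; SparkSession, textFile, flatMap, and reduceByKey are real PySpark APIs, but the app name and HDFS paths are assumptions.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session; the app name is illustrative.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

# Read from HDFS (paths are assumptions) and count words with the
# same map/reduce shape as above, expressed as RDD transformations.
lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/wordcount-output")
spark.stop()
```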

HBASE

Reads from and writes to HDFS; also supports direct reads and writes from client applications

Key-value store; the length of values can vary
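
A minimal sketch of those direct reads and writes using the happybase Python client; the table name, the column family cf, and the Thrift host are all assumptions, and it presumes an HBase Thrift server is running.

```python
import happybase

# Connect to the HBase Thrift server (host is an assumption).
connection = happybase.Connection("localhost")
table = connection.table("web_pages")  # hypothetical table

# Write: a row key plus column-family:qualifier -> value pairs.
# Cell values are plain byte strings and can vary in length.
table.put(b"com.example/index", {
    b"cf:title": b"Example Domain",
    b"cf:body": b"...arbitrary-length content...",
})

# Read the row back directly, without going through MapReduce.
row = table.row(b"com.example/index")
print(row[b"cf:title"])
```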

HIVE

A higher-level data processing language

Manages structured and unstructured data

Processes queries in a SQL-like language

First developed at Facebook, now an Apache project

Features:

* OLAP (online analytical processing)
* SQL-like language for queries (HiveQL or HQL)
* Provides data warehouse infrastructure for Hadoop


Different data units (see the sketch after this list):

  1. Tables: map to HDFS directories
  2. Partitions: map to sub-directories under tables
  3. Buckets: map to files under each partition
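
A hedged sketch of how these units appear in HiveQL DDL, submitted here through the PyHive client (one common Python driver; the connection details, table name, and columns are all assumptions):

```python
from pyhive import hive

# Connect to a HiveServer2 instance (host/port are assumptions).
cur = hive.Connection(host="localhost", port=10000).cursor()

# Table -> an HDFS directory; each partition value -> a
# sub-directory (e.g. .../page_views/dt=2024-01-01/);
# buckets -> a fixed number of files per partition directory.
cur.execute("""
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
""")
```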

HIVE data structures (see the sketch after this list):

  • Associative arrays (map<key-type, value-type>)
  • Lists (array<element-type>)
  • Structs (struct<field-name: field-type, ...>)
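
These correspond to HiveQL's complex column types; a minimal sketch with hypothetical table and column names, using the same assumed HiveServer2 connection as above:

```python
from pyhive import hive

cur = hive.Connection(host="localhost", port=10000).cursor()

# One column of each complex type from the list above.
cur.execute("""
    CREATE TABLE users (
        name       STRING,
        properties MAP<STRING, STRING>,
        emails     ARRAY<STRING>,
        address    STRUCT<city: STRING, zip: STRING>
    )
""")

# Complex fields are accessed with ['key'], [index], and dot syntax.
cur.execute("SELECT properties['lang'], emails[0], address.city FROM users")
print(cur.fetchall())
```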
