Apache Hadoop is an open-source programming framework designed for storing, analyzing, and processing very large datasets on a cluster of commodity hardware.
Created by Doug Cutting and Mike Cafarella in 2005
Moving computation to the data: code is shipped to the nodes that store the blocks, instead of pulling data across the network
Commodity PCs: a key difference between an HPC cluster and a Hadoop cluster
Compute nodes and storage nodes are the same
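To make "moving computation to the data" concrete, the cell below is a minimal Hadoop Streaming word-count sketch in Python: the same small script ships to every node holding a block of the input and runs there as the mapper or the reducer. The script name `wordcount.py` and the HDFS paths are placeholders, and the invocation is only a sketch, not the exact classpath on any given cluster.

In [ ]:

#!/usr/bin/env python3
# Hadoop Streaming word count (sketch). Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/text -output /data/counts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    # Emit (word, 1) pairs; Hadoop sorts them by key between the two phases.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by word, so all counts for one word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()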
Data Analysis Limitation: the framework lets us work with massive datasets, but the type of analysis changes. In machine learning you might run very complex analyses over a relatively small dataset; Hadoop lets us use a tremendous amount of data, but imposes a limit on how complex the analysis can be.
HBASE: uses HDFS and MapReduce to organize data by columns
HIVE: uses a metastore to apply schema information and provide an SQL-like interface
PIG: a scripting language, providing more flexibility than SQL
SPARK: a more general processing engine with an interface to HDFS, increasingly replacing MapReduce (see the sketch after this list)
SPLUNK: focused on machine-generated data
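As a point of comparison with the hand-written MapReduce script above, here is a minimal PySpark sketch of the same word count, reading directly from HDFS. It assumes a Spark installation configured against the cluster; the HDFS path is a placeholder.

In [ ]:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark reads HDFS blocks directly; no hand-written map/reduce phases needed.
spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///data/text")  # one row per input line, column "value"
counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .where(F.col("word") != "")
          .groupBy("word")
          .count())
counts.show()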
A higher-level data processing language
Manages structured and unstructured data
Processes queries in an SQL-like language
First developed at Facebook; now an Apache project
Features:
* OLAP (online analytical processing)
* SQL-like language for queries (HiveQL or HQL)
* Provides data warehouse infrastructure for Hadoop
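The cell below is a small sketch of what a HiveQL query looks like, issued here from Python through Spark's Hive integration (one common way to query Hive programmatically; the Hive CLI or Beeline work just as well). The table and column names (`web_logs`, `status`, `ts`) are hypothetical.

In [ ]:

from pyspark.sql import SparkSession

# enableHiveSupport() points Spark at the Hive metastore for schema lookup.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL is declarative and SQL-like; Hive plans it as batch scans (OLAP-style).
spark.sql("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    WHERE ts >= '2015-01-01'
    GROUP BY status
    ORDER BY hits DESC
""").show()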
Different data units:
* Databases: namespaces that keep tables separate
* Tables: homogeneous units of data sharing one schema
* Partitions: divide a table by the values of a partition column
* Buckets: hash-split each partition's data into a fixed number of files, useful for sampling and joins
HIVE data structures:
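The cell below sketches these data units as HiveQL DDL, run through the same Hive-enabled SparkSession as above; the database, table, and column names (`logs_db`, `web_logs`) are hypothetical.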
In [ ]:
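from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Database: a namespace for tables.
spark.sql("CREATE DATABASE IF NOT EXISTS logs_db")

# Table: schema lives in the metastore, rows live as files in HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS logs_db.web_logs (
        ip      STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (dt STRING)            -- partition: one HDFS subdirectory per date
    CLUSTERED BY (ip) INTO 32 BUCKETS     -- bucket: hash-split files within a partition
""")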