Contents:

  1. HDFS setup
  2. Alluxio (Tachyon) Setup
  3. Spark Streaming Basics
  4. Exercise 2 Solutions

All Hadoop-related functionality in PySpark

http://spark.apache.org/docs/latest/api/python/search.html?q=Hadoop&check_keywords=yes&area=default

All functionality in Scala Spark

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package

1. HDFS

Download

http://hadoop.apache.org/releases.html

Download the 2.7.x binary tarball and follow the installation instructions below. (Written on Apr 5 2017)

Installation steps

1. Download the tar file and extract it to a directory of your choice
    tar -xzvf hadoop-2.7.3.tar.gz
2. Now set the following environment variables for Hadoop (adjust the path to where you extracted it)
    export HADOOP_INSTALL=/home/hadoop-x.y.z
    export PATH=$PATH:$HADOOP_INSTALL/bin
3. Now check in a terminal that the path is set, using the following command
    hadoop version
    If the above command does not work, check your JAVA_HOME setting.
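The environment-variable mechanics in step 2 can be sanity-checked even without a real Hadoop install. The sketch below uses a stub `hadoop` script (purely hypothetical, for illustration) in place of the extracted tarball, to show how PATH resolution works:

```shell
# Hypothetical stand-in: a stub 'hadoop' script instead of a real install.
# With a real install, HADOOP_INSTALL points at the extracted tarball.
HADOOP_INSTALL=$(mktemp -d)
mkdir -p "$HADOOP_INSTALL/bin"
printf '#!/bin/sh\necho "Hadoop 2.7.3"\n' > "$HADOOP_INSTALL/bin/hadoop"
chmod +x "$HADOOP_INSTALL/bin/hadoop"

# Same export as in step 2; the shell can now resolve the command.
export PATH="$PATH:$HADOOP_INSTALL/bin"
hadoop version
```

With a real installation, `hadoop version` prints the build details instead of the stub's one-liner; if it fails there, JAVA_HOME is the usual culprit.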

Reference: http://archive.oreilly.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html

Explore the Hadoop folder:

1. All runnable scripts can be found here (similar to Spark)
    /hadoop-2.7.3/sbin
2. Configuration settings can be found here
    /hadoop-2.7.3/etc/hadoop
3. Now change two configuration files,
    etc/hadoop/core-site.xml
        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>
    etc/hadoop/hdfs-site.xml
        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>
4. Now format the file system (first-time setup only)
    bin/hdfs namenode -format
5. Start your HDFS,
    sbin/start-dfs.sh
6. By default the NameNode web interface runs at the following address
    http://localhost:50070/
Hint:
    While starting the daemons, you will be asked for the localhost password. If you are using UKKO, use your SSH key passphrase, not the CS password.
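The two config changes in step 3 above can be scripted. The sketch below writes both single-property files into a scratch directory (a stand-in for the real hadoop-2.7.3/etc/hadoop), so the exact XML is easy to copy:

```shell
# Writes the two config files from step 3 into a scratch directory
# standing in for hadoop-2.7.3/etc/hadoop.
CONF=$(mktemp -d)

cat > "$CONF/core-site.xml" <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF

cat > "$CONF/hdfs-site.xml" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF

ls "$CONF"
```

fs.defaultFS is what makes `hdfs://localhost:9000/...` paths resolve later from the Spark shell; dfs.replication=1 is appropriate only for a single-node setup.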

Reference: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

Getting started: https://wiki.apache.org/hadoop/GettingStartedWithHadoop

Steps to load data into HDFS:

1. Download and extract the carat CSV file to a local directory
2. Now create a new directory in HDFS to hold the CSV file
    hdfs dfs -mkdir /input/
3. Copy the carat CSV file to the new location
    hdfs dfs -put /cs/home/user_name/carat/carat-context-factors-percom.csv /input/carat-context-factors-percom.csv
4. Now start your Spark shell, and use the following command to load the CSV file.
    sc.textFile("hdfs://localhost:9000/input/carat-context-factors-percom.csv")
5. After using the Spark shell, shut down your HDFS using the following command
    sbin/stop-dfs.sh
6. Check that all services are down using the following command,
    ps aux | grep namenode
http://stackoverflow.com/questions/28213116/hadoop-copy-a-local-file-system-folder-to-hdfs


Examples

http://www.ccs.neu.edu/home/cbw/spark.html

http://stackoverflow.com/questions/27478096/cannot-read-a-file-from-hdfs-using-spark

2. Alluxio (Tachyon)

Not supported on Windows yet

Source code:

https://github.com/Alluxio/alluxio

Website:

http://www.alluxio.org/

Official Documentation:

http://www.alluxio.org/docs/1.4/en/Getting-Started.html


Running Spark on Alluxio

http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html

Start from here - steps to install

http://www.alluxio.org/docs/1.4/en/Getting-Started.html
http://www.alluxio.org/docs/1.4/en/Running-Alluxio-Locally.html

Connecting to Spark

http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html
http://www.alluxio.org/docs/1.0/en/Running-Spark-on-Alluxio.html

Remember to set the SPARK_CLASSPATH and ALLUXIO_HOME values.
Then set up the core-site.xml file,
    cp core-site.xml.template core-site.xml

For example,
Moving a file into the Alluxio file system (this requires your local Alluxio instance to be running),

./alluxio fs copyFromLocal /Users/mohanprasanth/Documents/alluxio-1.4.0/LICENSE /LICENSE

Test it using a Spark example,

In [ ]: