Spark API documentation:
http://spark.apache.org/docs/latest/api/python/search.html?q=Hadoop&check_keywords=yes&area=default
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Download
http://hadoop.apache.org/releases.html
Download the 2.7.x binary tarball and follow the installation instructions below. (Written on Apr 5, 2017)
Steps to install Hadoop:
1. Download the tarball and extract it into a directory of your choice
tar -xzvf hadoop-2.7.3.tar.gz
2. Set the following environment variables for Hadoop (adjust HADOOP_INSTALL to the directory you extracted into)
export HADOOP_INSTALL=/home/hadoop-x.y.z
export PATH=$PATH:$HADOOP_INSTALL/bin
3. In a terminal, check that the path is set correctly using the following command
hadoop version
If the above command does not work, check your JAVA_HOME setting (it can also be set explicitly in etc/hadoop/hadoop-env.sh).
References: http://archive.oreilly.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html
Explore the Hadoop folder:
1. All the scripts to run Hadoop services can be found here (similar to Spark)
/hadoop-2.7.3/sbin
2. Configuration settings can be found here
/hadoop-2.7.3/etc/hadoop
3. Change the following two configuration files (a quick way to verify this setup from the Spark shell is sketched after these steps)
etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
4. Format the filesystem (first-time setup only)
bin/hdfs namenode -format
5. Start HDFS
sbin/start-dfs.sh
6. By default, the NameNode web interface runs at the following address
http://localhost:50070/
Hint:
When starting the daemons, you will be asked for the localhost password. If you are using UKKO, use your SSH key passphrase and not the CS password.
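Once HDFS is up, a quick way to verify the fs.defaultFS setting above is to list the filesystem root from the Spark shell (the Spark distribution bundles the Hadoop client). This is a minimal sketch, assuming the hdfs://localhost:9000 address configured above:
// In spark-shell: set fs.defaultFS explicitly in case Hadoop's
// core-site.xml is not on Spark's classpath, then list the HDFS root.
import org.apache.hadoop.fs.{FileSystem, Path}
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://localhost:9000")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))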
References: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
Getting started: https://wiki.apache.org/hadoop/GettingStartedWithHadoop
Steps to load a file into HDFS:
1. Download and extract the Carat CSV file to a local directory
2. Create a new directory in HDFS to hold the CSV file
hdfs dfs -mkdir /input/
3. Copy the Carat CSV file to the new location
hdfs dfs -put /cs/home/user_name/carat/carat-context-factors-percom.csv /input/carat-context-factors-percom.csv
4. Start your Spark shell and use the following command to load the CSV file (an expanded sketch follows after these steps)
sc.textFile("hdfs://localhost:9000/input/carat-context-factors-percom.csv")
5. When you are done with the Spark shell, shut down HDFS using the following command
sbin/stop-dfs.sh
6. Check that all services are down using the following command (you can grep for datanode and secondarynamenode in the same way)
ps aux | grep namenode
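For reference, step 4 can be expanded into a small end-to-end check in the Spark shell. This is a sketch rather than part of the original instructions; in particular, the ';' delimiter used for splitting is an assumption about the Carat CSV, so inspect the file's first lines and adjust it:
// Load the Carat CSV from HDFS and run a few cheap sanity checks on it.
val lines = sc.textFile("hdfs://localhost:9000/input/carat-context-factors-percom.csv")
println(s"line count: ${lines.count()}")
lines.take(3).foreach(println) // eyeball the format before parsing
// Hypothetical parse: the ';' delimiter is an assumption, check the file first.
val fields = lines.map(_.split(";"))
println(s"columns in first row: ${fields.first().length}")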
References:
http://stackoverflow.com/questions/28213116/hadoop-copy-a-local-file-system-folder-to-hdfs
http://www.ccs.neu.edu/home/cbw/spark.html
http://stackoverflow.com/questions/27478096/cannot-read-a-file-from-hdfs-using-spark
Alluxio (not supported on Windows yet)
Source code:
https://github.com/Alluxio/alluxio
Website:
http://www.alluxio.org/
Official Documentation:
http://www.alluxio.org/docs/1.4/en/Getting-Started.html
Running Spark on Alluxio
http://www.alluxio.org/docs/1.4/en/Running-Spark-on-Alluxio.html
http://www.alluxio.org/docs/1.4/en/Running-Alluxio-Locally.html
Remember to set the SPARK_CLASSPATH and ALLUXIO_HOME values.
Then set up the core-site.xml file from the template
cp core-site.xml.template core-site.xml
For example, moving a file into the Alluxio filesystem (this requires Alluxio to be running locally)
./alluxio fs copyFromLocal /Users/mohanprasanth/Documents/alluxio-1.4.0/LICENSE /LICENSE
Test it using a Spark example
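The following is a minimal sketch along the lines of the official Running-Spark-on-Alluxio guide: it reads the LICENSE file copied above through the alluxio:// scheme, doubles each line, and writes the result back. It assumes a local Alluxio master on its default port 19998:
// Read from Alluxio, transform, and write back (creates /LICENSE2 in Alluxio).
val s = sc.textFile("alluxio://localhost:19998/LICENSE")
val double = s.map(line => line + line)
double.saveAsTextFile("alluxio://localhost:19998/LICENSE2")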