Mahout

K-means clustering with Mahout

Options:
- -i: input directory of document vectors (either tf-vectors or tfidf-vectors)
- -o: output directory
- -k: number of clusters
- -x: maximum number of k-means iterations
- -c: directory of initial centroids (if -k is specified, a random set of k input points is selected and written to this directory)
- -dm: distance measure, given as a fully qualified class name
- -cl: after the final iteration, assigns the input documents to clusters and writes the results to the clusteredPoints directory under the output directory
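
To make the roles of these options concrete, here is a minimal single-machine sketch of k-means in plain Python. It is not Mahout's implementation (Mahout runs the iterations as Hadoop jobs); the function names, the seed parameter, and the toy points are assumptions made for illustration only.

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest(point, centroids, distance):
    """Index of the centroid closest to point under the given measure."""
    return min(range(len(centroids)), key=lambda c: distance(point, centroids[c]))

def kmeans(points, k, max_iter, distance=euclidean, seed=0):
    """Toy k-means (illustrative only); returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Like -k: choose k random input points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):  # like -x: upper bound on iterations
        # Assignment step: nearest centroid under the chosen measure (like -dm).
        assign = [nearest(p, centroids, distance) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:  # nothing moved: stop before max_iter
            break
        centroids = updated
    # Like -cl: final assignment of every input point to a cluster.
    return centroids, [nearest(p, centroids, distance) for p in points]

# Hypothetical toy data: two well-separated groups of points.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, assign = kmeans(points, k=2, max_iter=10)
```

The distance function is passed in as a parameter, in the same way that -dm lets a different measure be swapped in without changing the algorithm.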

Document clustering with Mahout

Getting ready

Upload the data onto HDFS

hadoop dfs -copyFromLocal email /user/hduser/email/input

Step 1: Create a sequence file

mahout seqdirectory -i /user/hduser/email/input -o email/seqdir -ow -c UTF-8

Display the contents of the sequence file

mahout seqdumper -s /user/hduser/email/seqdir/chunk-0

Step 2: Generate sparse feature vectors

mahout seq2sparse -i email/seqdir -o email/sparse -nv -ow

List the sparse directory

hadoop dfs -ls /user/hduser/email/sparse

Display the dictionary of feature vectors

mahout seqdumper -s email/sparse/dictionary.file-0

View the TFIDF vector

mahout seqdumper -s email/sparse/tfidf-vectors/part-r-00000
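
As a rough picture of what the dictionary and the TF-IDF vectors contain, the sketch below builds both for three made-up documents. It uses the textbook weight tf * log(N / df); the exact weighting formula and tokenization that seq2sparse applies may differ, and the function name and sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (dictionary, vectors) for a list of text documents.

    dictionary maps each term to an integer index; each vector is a
    sparse {index: weight} mapping for one document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Dictionary: every distinct term gets an integer index.
    terms = sorted({t for d in tokenized for t in d})
    dictionary = {t: i for i, t in enumerate(terms)}
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in tokenized for t in set(d))
    n = len(docs)
    vectors = []
    for d in tokenized:
        tf = Counter(d)  # raw term counts within this document
        # Textbook weighting (an assumption, not necessarily Mahout's formula).
        vectors.append({dictionary[t]: tf[t] * math.log(n / df[t]) for t in tf})
    return dictionary, vectors

# Hypothetical sample documents standing in for the emails.
docs = ["meeting agenda attached", "meeting moved to friday", "lunch on friday"]
dictionary, vectors = tfidf(docs)
```

Terms that occur in fewer documents receive larger weights, which is why the tfidf-vectors are usually a better clustering input than raw term counts.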

Step 3: Apply k-means clustering

mahout kmeans -i email/sparse/tfidf-vectors -c email/centroids -o email/output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 2 -x 10 -ow -cl
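
The cosine distance selected with -dm compares the direction of two document vectors rather than their length, so a long and a short email on the same topic end up close together. A minimal sketch, assuming sparse vectors stored as {index: weight} dictionaries (the function name and the handling of all-zero vectors are assumptions, not Mahout's API):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two sparse {index: weight} vectors."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption for this sketch: treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (numerically) zero.
same_topic = cosine_distance({0: 1.0, 1: 2.0}, {0: 2.0, 1: 4.0})
# No terms in common: distance is one.
unrelated = cosine_distance({0: 1.0}, {1: 1.0})
```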


