Document clustering with Mahout
Getting ready
- Upload the data onto HDFS
hadoop dfs -copyFromLocal email /user/hduser/email/input
Step 1: Create a sequence file
mahout seqdirectory -i /user/hduser/email/input -o email/seqdir -ow -c UTF-8
- Inspect the generated sequence file
mahout seqdumper -s /user/hduser/email/seqdir/chunk-0
Step 2: Generate sparse feature vectors
mahout seq2sparse -i email/seqdir -o email/sparse -nv -ow
- List the sparse directory
hadoop dfs -ls /user/hduser/email/sparse
- Display the dictionary (the mapping from each term to its feature index)
mahout seqdumper -s email/sparse/dictionary.file-0
- Display the TF-IDF feature vectors
mahout seqdumper -s email/sparse/tfidf-vectors/part-r-00000
Step 3: Apply k-means clustering
mahout kmeans -i email/sparse/tfidf-vectors -c email/centroids -o email/output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 2 -x 10 -ow -cl
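The upload and the three steps above can be chained into one script. The following is only a sketch: BASE, DRY_RUN, run and pipeline are illustrative names and not part of Mahout or Hadoop, and the relative HDFS paths assume the job runs as hduser, so that email/... resolves to /user/hduser/email/.... With DRY_RUN left at 1 the script only prints the commands it would run, which makes it easy to check the paths before touching the cluster.

```shell
#!/bin/sh
# Sketch only: BASE, DRY_RUN, run and pipeline are hypothetical helper names.
BASE="${BASE:-email}"      # HDFS working directory, relative to the user's HDFS home
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the commands only, 0 = actually execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same commands and flags as in the steps above; && stops at the first failure.
pipeline() {
    run hadoop dfs -copyFromLocal email "$BASE/input" &&
    run mahout seqdirectory -i "$BASE/input" -o "$BASE/seqdir" -ow -c UTF-8 &&
    run mahout seq2sparse -i "$BASE/seqdir" -o "$BASE/sparse" -nv -ow &&
    run mahout kmeans -i "$BASE/sparse/tfidf-vectors" -c "$BASE/centroids" \
        -o "$BASE/output" \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -k 2 -x 10 -ow -cl
}

pipeline
```

To run the commands for real, set DRY_RUN=0 in the environment before starting the script.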