Mahout

K-means clustering with Mahout

  • Options:
    • -i: input directory (can use tf-vectors or tfidf-vectors)
    • -o: output directory
    • -k: number of clusters
    • -x: maximum number of iterations to execute k-means
    • -c: directory of initial centroids (if -k is specified, a random set of k input points is selected and written to this directory)
    • -dm: distance measure
    • -cl: assigns the input documents to clusters at the end of the process and writes the results to the outputdir/clusteredPoints directory
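
To make the -dm option more concrete, here is a small sketch of what a cosine distance measure computes for two term vectors: 1 minus the cosine of the angle between them, so vectors pointing in the same direction are at distance 0 and vectors with no terms in common are at distance 1. The helper name cosine_distance and the toy vectors are assumptions for illustration only. This is plain shell and awk, not Mahout code.

```shell
#!/bin/sh
# cosine_distance is a hypothetical helper, not part of Mahout.
# It takes two space-separated vectors of equal length and prints
#   d(a, b) = 1 - (a . b) / (|a| * |b|)
cosine_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, " ")
        m = split(b, y, " ")
        if (n != m || n == 0) {
            print "vectors must have the same non-zero length" > "/dev/stderr"
            exit 1
        }
        dot = 0; na = 0; nb = 0
        for (i = 1; i <= n; i++) {
            dot += x[i] * y[i]   # dot product
            na  += x[i] * x[i]   # squared norm of a
            nb  += y[i] * y[i]   # squared norm of b
        }
        if (na == 0 || nb == 0) {
            print "cosine distance is undefined for a zero vector" > "/dev/stderr"
            exit 1
        }
        printf "%.4f\n", 1 - dot / (sqrt(na) * sqrt(nb))
    }'
}

# Toy weight vectors over a three-term vocabulary (made-up numbers).
cosine_distance "1 0 0" "0 1 0"   # no shared terms: prints 1.0000
cosine_distance "3 4 0" "6 8 0"   # same direction: prints 0.0000
```

The second pair shows why cosine distance suits documents of different lengths: doubling every term weight leaves the distance unchanged.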

Document clustering with Mahout

Getting ready

  • Upload the data onto HDFS
hadoop dfs -copyFromLocal email /user/hduser/email/input

Step 1: Create a sequence file

mahout seqdirectory -i /user/hduser/email/input -o email/seqdir -ow -c UTF-8
  • Display the sequence
mahout seqdumper -s /user/hduser/email/seqdir/chunk-0

Step 2: Generate sparse feature vectors

mahout seq2sparse -i email/seqdir -o email/sparse -nv -ow
  • List the sparse directory
hadoop dfs -ls /user/hduser/email/sparse
  • Display the dictionary of feature vectors
mahout seqdumper -s email/sparse/dictionary.file-0
  • View the TFIDF vector
mahout seqdumper -s email/sparse/tfidf-vectors/part-r-00000

Step 3: Apply k-means clustering

mahout kmeans -i email/sparse/tfidf-vectors -c email/centroids -o email/output -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k 2 -x 10 -ow -cl
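
The number of clusters is usually something to experiment with, so it can help to generate the Step 3 command for several values of -k. The sketch below only prints the command lines and does not run Mahout. The helper name kmeans_cmd and the per-k directory names (email/centroids-k2, email/output-k2, and so on) are assumptions made for illustration; every flag is taken from the command above.

```shell
#!/bin/sh
# kmeans_cmd is a hypothetical helper: it prints a mahout kmeans command
# line for the given k and maximum iteration count, without executing it.
# The "-k<N>" directory suffixes are an assumed naming scheme that keeps
# the runs from overwriting each other.
kmeans_cmd() {
    k="$1"
    max_iter="$2"
    printf 'mahout kmeans -i email/sparse/tfidf-vectors -c email/centroids-k%s -o email/output-k%s -dm org.apache.mahout.common.distance.CosineDistanceMeasure -k %s -x %s -ow -cl\n' \
        "$k" "$k" "$k" "$max_iter"
}

# Print one command per candidate cluster count.
for k in 2 3 5; do
    kmeans_cmd "$k" 10
done
```

After reviewing the printed commands, they can be run by piping the output to sh. Because -cl is set, each run should leave its assignments under the matching output directory's clusteredPoints subdirectory.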
