0. Load the test dataset
POST http://localhost:5001/api/v0/datasets/treclegal09_2k_subset
1.a. Load dataset and initialize feature extraction
POST http://localhost:5001/api/v0/feature-extraction
=> received ['filenames', 'id']
=> dsid = 906d6a5e4b634fb882bd64d0d975e66f
1.b. Run feature extraction
POST http://localhost:5001/api/v0/feature-extraction/906d6a5e4b634fb882bd64d0d975e66f
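The two POST requests of steps 1.a and 1.b could be built roughly as in the sketch below. It uses only the Python standard library and constructs the requests without sending them. The helper name `build_request`, the `data_dir` payload field, and the placeholder path are assumptions for illustration, not the documented API.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:5001/api/v0"  # server address taken from the log


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request against BASE_URL.

    Hypothetical helper: the payload schema is an assumption.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(f"{BASE_URL}/{path}", data=data, headers=headers, method=method)


# 1.a: create a feature-extraction job (payload field is an assumption)
create = build_request("POST", "feature-extraction", {"data_dir": "/path/to/data"})

# 1.b: run it, using the dataset id returned by step 1.a
dsid = "906d6a5e4b634fb882bd64d0d975e66f"
run = build_request("POST", f"feature-extraction/{dsid}")
```

Passing either object to `urllib.request.urlopen` would send it; the response body would then need to be decoded as JSON to read fields such as `id`.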
1.d. Check the parameters of the extracted features
GET http://localhost:5001/api/v0/feature-extraction/906d6a5e4b634fb882bd64d0d975e66f
- n_features: 30001
- max_df: 0.75
- ngram_range: [1, 1]
- chunk_size: 2000
- sublinear_tf: True
- n_samples_processed: 2465
- analyzer: word
- stop_words: english
- binary: False
- norm: l2
- use_hashing: False
- n_samples: 2465
- use_idf: True
- n_jobs: -1
- data_dir: /shared/code/wking_code/freediscovery_shared/treclegal09_2k_subset/data
- min_df: 4.0
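Several of these parameters describe a TF-IDF weighting. Assuming they carry the same meaning as the identically named options of scikit-learn's `TfidfVectorizer` (the log does not name the library, so this is an assumption), `sublinear_tf: True`, `use_idf: True` and `norm: l2` would correspond roughly to the pure-Python sketch below. The function name, the smoothed IDF formula, and the toy counts are hypothetical.

```python
import math


def tfidf_row(term_counts, doc_freq, n_docs):
    """Weight one document's term counts (counts must be >= 1).

    term_counts: {term: count in this document}
    doc_freq:    {term: number of documents containing the term}
    """
    weights = {}
    for term, tf in term_counts.items():
        tf_sub = 1.0 + math.log(tf)  # sublinear_tf: dampen repeated terms
        # use_idf: smoothed inverse document frequency (smoothing is assumed)
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        weights[term] = tf_sub * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))  # norm: l2
    return {term: w / norm for term, w in weights.items()}


# Toy document: counts and document frequencies are made up for illustration
row = tfidf_row({"enron": 3, "agreement": 1},
                {"enron": 1200, "agreement": 150},
                n_docs=2465)
```

In this toy case the rarer term ends up with the larger weight even though it occurs less often in the document, which is the intended effect of combining sublinear term frequency with IDF.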
2.a. Document clustering (LSI + K-means)
POST http://localhost:5001/api/v0/clustering/k-mean/
=> model id = 7f19bf164a4a47408519e3bebcc3e964
2.b. Computing cluster labels
POST http://localhost:5001/api/v0/clustering/k-mean/7f19bf164a4a47408519e3bebcc3e964
.. computed in 2.1s
   N_documents  cluster_names
4          486  ['enron', 'energy', 'trading', 'services', 'co...
3          482  ['shackleton', 'test', 'recipients', 'group', ...
2          425  ['tenet', 'test', 'oct', 'nov', 'tue', 'wed']
5          311  ['ect', 'hou', 'nemec', 'shackleton', 'enron_d...
1          225  ['ect', 'recipients', 'group', 'haedicke', 'ad...
9          178  ['teneo', 'recipients', 'administrative', 'ric...
0          135  ['shall', 'party', 'agreement', 'transaction',...
7          102  ['sanders', 'nov', 'ect', 'test', 'meeting', '...
8           64  ['migration', 'outlook', 'team', 'mtg', 'oct',...
6           57  ['rewrite', 'server', 'address', 'smtp', 'mail...
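A quick sanity check on the K-means output is that the cluster sizes add up to the number of processed documents. The sketch below copies the sizes from the table above and compares their sum with `n_samples`; the variable names are chosen here for illustration.

```python
# Cluster id -> N_documents, copied from the K-means table above
kmeans_sizes = {4: 486, 3: 482, 2: 425, 5: 311, 1: 225,
                9: 178, 0: 135, 7: 102, 8: 64, 6: 57}
n_samples = 2465  # n_samples reported by the feature-extraction step

total = sum(kmeans_sizes.values())
assert total == n_samples, "every document should land in exactly one cluster"

# Share of the corpus held by each cluster, largest first
ordered = sorted(kmeans_sizes.items(), key=lambda kv: kv[1], reverse=True)
shares = {cid: size / n_samples for cid, size in ordered}
for cid, share in shares.items():
    print(f"cluster {cid}: {share:.1%}")
```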
3.a. Document clustering (LSI + Ward HC)
POST http://localhost:5001/api/v0/clustering/ward_hc/
=> model id = 1cbfeea563c7431d8c17072f8e65b84a
3.b. Computing cluster labels
POST http://localhost:5001/api/v0/clustering/ward_hc/1cbfeea563c7431d8c17072f8e65b84a
.. computed in 3.4s
   N_documents  cluster_names
5          443  ['tenet', 'test', 'oct', 'nov', 'tue', 'mon']
1          423  ['recipients', 'administrative', 'group', 'tes...
2          398  ['enron', 'energy', 'power', 'trade', 'market'...
0          393  ['ect', 'hou', 'tana', 'group', 'recipients', ...
6          342  ['shackleton', 'ect', 'test', 'group', 'recipi...
4          166  ['shall', 'party', 'agreement', 'transaction',...
8           95  ['sanders', 'nov', 'ect', 'test', 'lunch', 'me...
3           85  ['enron_development', 'ect', 'shackleton', 'ho...
9           64  ['migration', 'outlook', 'team', 'mtg', 'oct',...
7           56  ['rewrite', 'server', 'address', 'smtp', 'mail...
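The two methods recover broadly similar topics, and one simple way to quantify that is the Jaccard similarity of the top terms of matching clusters. The sketch below compares the only pair of rows whose term lists are printed in full (K-means cluster 2 and Ward cluster 5) and also checks that the Ward cluster sizes sum to the 2465 processed documents. The `jaccard` helper is introduced here for illustration and is not part of the API.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty term lists: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top terms copied from the tables above (untruncated rows only)
kmeans_2 = ['tenet', 'test', 'oct', 'nov', 'tue', 'wed']
ward_5 = ['tenet', 'test', 'oct', 'nov', 'tue', 'mon']

similarity = jaccard(kmeans_2, ward_5)  # 5 shared terms out of 7 distinct

# Ward cluster sizes, copied from the table above
ward_sizes = [443, 423, 398, 393, 342, 166, 95, 85, 64, 56]
ward_total = sum(ward_sizes)
```

The truncated rows are left out on purpose, since their full term lists are not visible in the output.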