| notebook.community

notebook.community



In [1]:

    
%autosave 10









    














    



Autosaving every 10 seconds

Inputs

Want one, canonical ID for every unique company.
Fields
- Account name
- Contact name
- Premises address
- Billing address
Companies are duplicated, with different premises and/or billing addresses.
What about companies with same account name, in different places?
What about edge cases - almost same company names, different places - same or different company?

Tools

Just 1 week work, basic pre-processing to help interns.
Want to do clustering (are these different rows actually the same company?)

Input rows -> nltk.PunktTokeniser. -> Cleaned, parsed, tokensized strings.
- Sometimes need to preserve non-alphanumeric characters (e.g. "203-205 Upper Street").
-> sklearn.TfidVectorizer -> TF-IDF 2D matrix
-> sklearn.TruncatedSVD -> SVD N-D matrix

Now, we can:

Suggest similar accounts t- be grouped into manageable chunks for people to look at.
- sklearn.MiniBatchKmeans (ridiculously fast, coupled with grid search for hyperparameters)
- sklearn.AffinityPropagation
Human validation + verification
Incorporate and propagate valid groupings
- sklearn.RadiusNeighborsClassifier
- Supervised, passing through human knowledge

Results

93% accuracy, compared to human experts who validated the results
nltk and scikit-learn allowed rapid development and testing
Human input is important - no ML problem is an island.