Inputs

  • Want one canonical ID for every unique company.
  • Fields
    • Account name
    • Contact name
    • Premises address
    • Billing address
  • Companies are duplicated, with different premises and/or billing addresses.
  • What about companies with the same account name in different places?
  • What about edge cases: nearly identical company names in different places. Same company or different?

Tools

  • Just one week of work: basic pre-processing to help interns.
  • Want to do clustering (are these different rows actually the same company?)
  1. Input rows -> nltk Punkt tokenizer -> cleaned, parsed, tokenized strings.
    • Sometimes need to preserve non-alphanumeric characters (e.g. "203-205 Upper Street").
  2. -> sklearn.feature_extraction.text.TfidfVectorizer -> TF-IDF 2D matrix (rows x terms)
  3. -> sklearn.decomposition.TruncatedSVD -> SVD N-D matrix (rows x N components)
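
The vectorizing steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: the sample rows are invented, and a regex `token_pattern` stands in for the nltk tokenizing step so that hyphenated ranges such as "203-205" survive as a single token. The choice of 3 SVD components is also an arbitrary assumption.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical account rows: account name and premises address joined into one string.
rows = [
    "Acme Widgets Ltd 203-205 Upper Street London",
    "ACME Widgets Limited 203-205 Upper St London",
    "Acme Widgets Ltd 12 High Road Leeds",
    "Borough Bakery 7 Market Lane Manchester",
    "Borough Bakery Ltd 7 Market Ln Manchester",
]

# Assumption: a regex tokenizer replaces the nltk step here. The pattern keeps
# alphanumeric runs joined by hyphens together, so "203-205" is one token.
vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*",
)
tfidf = vectorizer.fit_transform(rows)  # sparse matrix, one row per account

# Reduce the sparse TF-IDF matrix to a small dense matrix.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(tfidf)  # dense array, shape (n_rows, 3)

print(tfidf.shape, reduced.shape)
```

TruncatedSVD accepts the sparse TF-IDF matrix directly, which is why it fits this step better than a PCA that would need a dense, centered input.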

Now, we can:

  1. Suggest similar accounts to be grouped into manageable chunks for people to look at.
    • sklearn.cluster.MiniBatchKMeans (ridiculously fast, coupled with grid search for hyperparameters)
    • sklearn.cluster.AffinityPropagation
  2. Human validation + verification
  3. Incorporate and propagate valid groupings
    • sklearn.neighbors.RadiusNeighborsClassifier
    • Supervised, passing through human knowledge
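
A rough sketch of the three steps above, under stated assumptions: the 2-D points stand in for the reduced vectors, the canonical IDs 101 and 202 are invented reviewer decisions, and the cluster count, radius, and outlier label are illustrative values rather than the ones used in the project.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import RadiusNeighborsClassifier

# Hypothetical reduced vectors (stand-ins for the SVD output), two obvious groups.
X = np.array([
    [0.00, 0.00], [0.10, 0.00], [0.00, 0.10],  # rows that look like one company
    [5.00, 5.00], [5.10, 5.00], [5.00, 5.10],  # rows that look like another
])

# Step 1: suggest groupings for people to review.
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
suggested = kmeans.fit_predict(X)

# Step 2: hypothetical canonical IDs confirmed by human reviewers.
validated_ids = np.array([101, 101, 101, 202, 202, 202])

# Step 3: propagate the validated IDs to unseen rows that fall within a fixed
# radius of validated ones. Rows with no neighbour in range get outlier_label.
clf = RadiusNeighborsClassifier(radius=0.5, outlier_label=-1)
clf.fit(X, validated_ids)

new_rows = np.array([[0.05, 0.05], [5.05, 5.05], [2.50, 2.50]])
propagated = clf.predict(new_rows)
print(propagated)  # the third row has no validated neighbour within the radius
```

Using a radius rather than a fixed number of neighbours means a row far from every validated account is flagged for human review instead of being forced into the nearest group.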

Results

  • 93% accuracy, compared to human experts who validated the results
  • nltk and scikit-learn allowed rapid development and testing
  • Human input is important - no ML problem is an island.