What is Conversocial

  • Aggregates social data for large companies such as Sainsbury's.
  • Web platform that makes it easy for large teams to collaborate.
  • Know where to focus efforts, both short and long term.
  • Diverse customer base.
    • One solution for everyone doesn't (cannot) exist.
    • Learn what data each particular person (and business) finds important, autonomously.
    • What do they tend to read and respond to? Surface those types of items, or raise their importance.
    • Still need heuristics to establish the priority of items; ranking by most popular doesn't always work either.

Process

  • Iterative.
    • Constantly seek feedback from customers, monitor performance.
    • This is useful across customers; particular concerns, when addressed, improve the experience for others.
    • Plan, Do, Check, Act, Plan, ...
  • Customers have metrics
    • Given a customer tweet, how long does it take the company to respond to it?
    • Need to minimise this to prevent crises.
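A minimal sketch of the response-time metric above, assuming each conversation is stored as a (customer post time, company reply time) pair. The data, the `response_minutes` helper and the choice of the median as the summary are illustrative assumptions, not anything shown in the talk.

```python
from datetime import datetime
from statistics import median

# Hypothetical (customer_post_time, company_reply_time) pairs.
# None means the company has not replied yet.
conversations = [
    (datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 6, 9, 12)),
    (datetime(2014, 1, 6, 9, 30), datetime(2014, 1, 6, 11, 30)),
    (datetime(2014, 1, 6, 10, 0), None),
]


def response_minutes(conversations):
    """Minutes between each customer post and its reply, skipping unanswered posts."""
    return [
        (reply - posted).total_seconds() / 60
        for posted, reply in conversations
        if reply is not None
    ]


minutes = response_minutes(conversations)
median_minutes = median(minutes)
unanswered = sum(1 for _, reply in conversations if reply is None)
```

Tracking the unanswered count alongside the median keeps posts that never got a reply from silently dropping out of the metric.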

How

  • Feature engineering
    • Term Frequency-Inverse Document Frequency (TF-IDF), bag of words
    • Strip out stopwords
    • ngrams
    • length of text
    • contains a URL
    • contains question mark
    • ...
  • Convert to binary matrix (0/1)
  • Singular Value Decomposition (SVD)
    • Reduces size of matrix you're working on
    • !!AI feature reduction, kind of like Principal Component Analysis (PCA).
  • Feature selection same for all customers
  • But different machine learning models for each customer.
  • Use a read-only slave copy of the data, away from production; safer.
    • Experiment with different models
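The feature engineering and binary-matrix steps above can be sketched roughly as follows. This is a stdlib-only illustration, not the speaker's code: the stopword list, the `<has_url>`-style column names, the 60-character cut-off used to turn text length into a 0/1 flag, and all function names are assumptions. TF-IDF weighting and the SVD step are not shown.

```python
import re

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"a", "an", "the", "is", "to", "my", "and", "of", "it"}
URL_RE = re.compile(r"https?://\S+")


def tokenize(text):
    """Lowercase, drop URLs, split into word tokens and strip stopwords."""
    text = URL_RE.sub(" ", text.lower())
    return [t for t in re.findall(r"[a-z0-9']+", text) if t not in STOPWORDS]


def ngrams(tokens, n_max=2):
    """Return all 1..n_max grams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_binary_matrix(texts, long_text_chars=60):
    """Map each text to a 0/1 row: one column per n-gram plus three handcrafted flags."""
    grams_per_text = [set(ngrams(tokenize(t))) for t in texts]
    vocab = sorted(set().union(*grams_per_text))
    columns = vocab + ["<has_url>", "<has_question>", "<is_long>"]
    rows = []
    for text, grams in zip(texts, grams_per_text):
        row = [1 if g in grams else 0 for g in vocab]
        row.append(1 if URL_RE.search(text) else 0)        # contains a URL
        row.append(1 if "?" in text else 0)                # contains question mark
        row.append(1 if len(text) > long_text_chars else 0)  # length of text, thresholded
        rows.append(row)
    return columns, rows


texts = [
    "Where is my order? http://example.com/track",
    "Love the new store!",
]
columns, matrix = build_binary_matrix(texts)
```

Because the feature columns are the same for every customer, rows built this way could be passed to a reducer such as scikit-learn's `TruncatedSVD` and then to a separate model per customer.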

Training schedule

  • Daily training with queue
  • Low priority; production already has good models, just trying to find if better ones can be found.

Models in production

  • Dedicated servers for classifying live data
  • Chef, AWS.
  • In-memory as much as possible.

Validation

  • Compare to crowd-sourced data to confirm models are useful.
  • True positive rate vs false positive rate.
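For reference, the two rates above can be computed from a confusion count like this. The labels and predictions are made-up illustrative data, with 1 standing for "important" and the crowd-sourced labels treated as ground truth; the `rates` helper is hypothetical.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # share of real positives caught
    fpr = fp / (fp + tn) if fp + tn else 0.0  # share of real negatives flagged
    return tpr, fpr


# Hypothetical crowd-sourced labels and model predictions.
crowd_labels = [1, 1, 1, 0, 0, 0, 0, 1]
model_preds = [1, 0, 1, 0, 1, 0, 0, 1]

tpr, fpr = rates(crowd_labels, model_preds)
```

Sweeping the model's decision threshold and plotting the resulting (FPR, TPR) pairs gives the usual ROC curve.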

Questions

  • What models and libraries do you use?

    • Bayesian, SVMs, linear regression.
    • scikit-learn and nltk
    • Not very mysterious, obvious methods.
  • Stack, how to scale nltk to production?

    • Yes, there are problems with scaling; when they hit one, they switch to something else.
    • redis didn't scale, switching to rabbitmq.
    • !!AI shame, no specific feedback on scaling nltk.
  • What customer feedback specifically?

    • Heuristic - customer who posts, then posts again, defaults to high priority.
    • Some customers like this, but "super fans" tend to post multiple times just because, not because the issue is high priority.
    • Other customers need to filter out super fans.
  • Is Python a bottleneck for high throughput?

    • Yes, systems did collapse at peak throughput sometimes.
    • Didn't have DevOps or monitoring to warn about these events, but now do. Part of natural lifecycle of a company.
  • Do you do picture analysis?

    • No, not yet.
    • Lots of startups got acquired that did picture analysis, but went quiet.
  • Have you considered feature reduction using word2vec (https://code.google.com/p/word2vec/)?

    • No. Use SVD for feature reduction for now.
    • word2vec is used for feature engineering: turning raw text into numeric arrays.
  • Time series analysis? Seasonality, old documents?

    • No.
  • Disambiguation of language, slang?

    • Use heuristics, manual dictionary, to convert words to standard vocabulary. This is a vital part of feature engineering.
    • Use stopword lists to filter out uninformative words.
    • Rely on SVD to do the rest.
  • Do any categorisation or sentiment analysis?

    • No, don't do sentiment analysis. Customers think manual tagging is more accurate and don't want automatic tagging.
    • No, don't do automatic tagging, but are looking into it. Need to confirm exactly what customers want: variable tags, or a fixed set of customer-defined tags.
    • Different customers want different things, still at product stage.
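The repeat-poster heuristic from the customer feedback answer above, with an optional per-customer super-fan filter, might look roughly like this. The `prioritise` function, the `superfan_threshold` parameter and the sample posts are hypothetical, and counting each author's posts over the whole batch is a simplification made for the sketch.

```python
from collections import Counter


def prioritise(posts, superfan_threshold=None):
    """Tag each (author, text) post as 'high' or 'normal' priority.

    A post is 'high' when the author has already posted earlier in the batch.
    If superfan_threshold is set, authors with at least that many posts in the
    batch are treated as super fans and never escalated.
    """
    totals = Counter(author for author, _ in posts)
    seen = Counter()
    tagged = []
    for author, text in posts:
        seen[author] += 1
        is_repeat = seen[author] > 1
        is_superfan = (
            superfan_threshold is not None and totals[author] >= superfan_threshold
        )
        priority = "high" if is_repeat and not is_superfan else "normal"
        tagged.append((author, text, priority))
    return tagged


# Hypothetical posts in arrival order.
posts = [
    ("alice", "Order missing"),
    ("bob", "Love it"),
    ("alice", "Still missing!"),
    ("bob", "Love it again"),
    ("bob", "So good"),
    ("bob", "Yay"),
]

default_priorities = [p for _, _, p in prioritise(posts)]
filtered_priorities = [p for _, _, p in prioritise(posts, superfan_threshold=4)]
```

Leaving `superfan_threshold` unset reproduces the default behaviour; customers who need to filter out super fans would set it to a value that suits their audience.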
