What is Conversocial

Aggregate social data for large companies, e.g. Sainsbury's.
Web platform that makes it easy for large teams to collaborate.
Know where to focus efforts, both short and long term.
Diverse customer base.
- One solution for everyone doesn't (cannot) exist.
- Learn what data each particular person (and business) finds important, autonomously.
- What do they tend to read and respond to? Return or raise importance of those types of elements.
- Still need heuristics to establish priority of items; ranking by most popular doesn't always work either.

Process

Iterative.
- Constantly seek feedback from customers, monitor performance.
- This is useful across customers; particular concerns, when addressed, improve the experience for others.
- Plan, Do, Check, Act, Plan, ...
Customers have metrics
- Given a customer tweet, how long does it take the company to respond to it?
- Need to minimise this to prevent crises.
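
A minimal sketch of the response-time metric described above. The talk did not say how customers compute it; the median, the function names and the sample timestamps are all assumptions for illustration.

```python
from datetime import datetime
from statistics import median

def response_seconds(tweet_time, reply_time):
    """Seconds between a customer tweet and the company's reply."""
    return (reply_time - tweet_time).total_seconds()

def median_response_minutes(pairs):
    """Median response time in minutes over (tweet_time, reply_time) pairs."""
    return median(response_seconds(t, r) for t, r in pairs) / 60.0

# Hypothetical sample data: (customer tweet time, company reply time).
pairs = [
    (datetime(2014, 1, 6, 9, 0), datetime(2014, 1, 6, 9, 12)),
    (datetime(2014, 1, 6, 9, 30), datetime(2014, 1, 6, 10, 30)),
    (datetime(2014, 1, 6, 11, 0), datetime(2014, 1, 6, 11, 5)),
]
print(median_response_minutes(pairs))  # 12.0
```

The median is used here rather than the mean so that one very slow reply does not dominate the figure; that choice is mine, not the speaker's.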

How

Feature engineering
- Term Frequency, Inverse Document Frequency (TF-IDF), bag of words
- Strip out stopwords
- ngrams
- length of text
- contains a URL
- contains question mark
- ...
Convert to binary matrix (0/1)
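
The feature list above could be turned into a 0/1 matrix roughly as follows. This is a plain-Python sketch, not the speaker's code: the stopword list, the 80-character length threshold, the feature names and the sample texts are assumptions.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "my", "i", "it", "of", "and"}
URL_RE = re.compile(r"https?://\S+")
LONG_TEXT_CHARS = 80  # assumed threshold for the "length of text" flag

def extract_features(text):
    """Return the set of binary features that are 'on' for one message."""
    features = set()
    if URL_RE.search(text):
        features.add("has_url")
    if "?" in text:
        features.add("has_question_mark")
    if len(text) > LONG_TEXT_CHARS:
        features.add("is_long")
    # Bag of words with stopwords stripped, plus word bigrams.
    words = re.findall(r"[a-z']+", URL_RE.sub(" ", text.lower()))
    words = [w for w in words if w not in STOPWORDS]
    features.update("w=" + w for w in words)
    features.update("bi=" + a + "_" + b for a, b in zip(words, words[1:]))
    return features

def to_binary_matrix(texts):
    """Build a 0/1 matrix: one row per message, one column per feature."""
    rows = [extract_features(t) for t in texts]
    columns = sorted(set().union(*rows))
    matrix = [[1 if c in row else 0 for c in columns] for row in rows]
    return columns, matrix

# Hypothetical messages.
texts = ["Where is my order?", "Love the new store http://example.com"]
columns, matrix = to_binary_matrix(texts)
for name, column in zip(columns, zip(*matrix)):
    print(name, column)
```

Each message becomes one row, and a cell is 1 only when that feature is present in the message.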
Singular Value Decomposition (SVD)
- Reduces size of matrix you're working on
- !!AI feature reduction, kind of like Principal Component Analysis (PCA).
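
A sketch of SVD-based reduction on a small binary matrix. The talk mentions scikit-learn; this example calls `numpy.linalg.svd` directly to stay short, and the matrix and the choice of k=2 are made up. Unlike PCA, no mean-centring is applied before the decomposition.

```python
import numpy as np

def reduce_with_svd(matrix, k):
    """Project the rows of `matrix` onto its top-k singular directions."""
    X = np.asarray(matrix, dtype=float)
    # X = U @ diag(s) @ Vt, with singular values in s sorted largest first.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the first k columns of U, each scaled by its singular value.
    return U[:, :k] * s[:k]

# Hypothetical data: 4 messages x 5 binary features.
X = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
reduced = reduce_with_svd(X, k=2)
print(reduced.shape)  # (4, 2)
```

The downstream model then trains on the k-column matrix instead of the full feature matrix.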
Feature selection same for all customers
But different machine learning models for each customer.
Use a read-only slave copy of data, away from production; safer.
- Experiment with different models
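
One way to picture "same features, different model per customer" is sketched below. The speaker said Bayesian models are among those used but gave no implementation details; the hand-written Bernoulli naive Bayes, the customer names and the training rows here are assumptions chosen to keep the example self-contained.

```python
import math

class BernoulliNB:
    """Minimal Bernoulli naive Bayes over 0/1 feature rows (illustrative)."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        n_features = len(rows[0])
        self.log_prior = {}
        self.p_on = {}
        for c in self.classes:
            members = [r for r, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(rows))
            # Laplace smoothing keeps every probability strictly in (0, 1).
            self.p_on[c] = [
                (sum(r[j] for r in members) + 1) / (len(members) + 2)
                for j in range(n_features)
            ]
        return self

    def predict(self, row):
        def score(c):
            total = self.log_prior[c]
            for x, p in zip(row, self.p_on[c]):
                total += math.log(p if x else 1 - p)
            return total
        return max(self.classes, key=score)

def train_per_customer(data_by_customer):
    """Same feature columns for everyone, one fitted model per customer."""
    return {
        customer: BernoulliNB().fit(rows, labels)
        for customer, (rows, labels) in data_by_customer.items()
    }

# Hypothetical data. Columns: [has_question_mark, has_url]; label 1 = important.
data = {
    "customer_a": ([[1, 0], [1, 0], [0, 1], [0, 1]], [1, 1, 0, 0]),
    "customer_b": ([[1, 0], [1, 0], [0, 1], [0, 1]], [0, 0, 1, 1]),
}
models = train_per_customer(data)
print(models["customer_a"].predict([1, 0]))  # 1
print(models["customer_b"].predict([1, 0]))  # 0
```

The same message (a question with no URL) is important for one customer and not for the other, because each model only sees that customer's history.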

Training schedule

Daily training with queue
Low priority; production already has good models, this just checks whether better ones can be found.
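
The talk did not describe how a newly trained model replaces the one in production. A simple assumption is to promote a candidate only when it scores strictly higher on held-out data, as in this sketch; the function name, model names and scores are hypothetical.

```python
def pick_model(production, candidates, score):
    """Return the best-scoring model, keeping production on ties."""
    best, best_score = production, score(production)
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Hypothetical validation scores keyed by model name.
scores = {"prod_v1": 0.81, "nightly_a": 0.79, "nightly_b": 0.84}
chosen = pick_model("prod_v1", ["nightly_a", "nightly_b"], scores.get)
print(chosen)  # nightly_b
```

Keeping production on ties avoids churn when the nightly run finds nothing better.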

Models in production

Dedicated servers for classifying live data.
Chef, AWS.
In-memory as much as possible.

Validation

Compare to crowd-sourced data to confirm models are useful.
True positive rate vs false positive rate.
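
For reference, the two rates can be computed from model output and crowd-sourced labels like this. The label lists are invented for illustration; 1 means "important".

```python
def tpr_fpr(predicted, actual):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Hypothetical labels.
crowd_labels = [1, 1, 1, 0, 0, 0, 0, 1]
model_output = [1, 1, 0, 0, 1, 0, 0, 1]
print(tpr_fpr(model_output, crowd_labels))  # (0.75, 0.25)
```

A useful model has a true positive rate well above its false positive rate; equal rates would mean it does no better than guessing.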

Questions

What models and libraries do you use?
- Bayesian, SVMs, linear regression.
- scikit-learn and nltk
- Not very mysterious, obvious methods.
Stack, how to scale nltk to production?
- Yes, there are problems with scaling; when they hit one, they switch to something else.
- redis didn't scale, switching to rabbitmq.
- !!AI shame, no specific feedback on scaling nltk.
What customer feedback specifically?
- Heuristic - customer who posts, then posts again, defaults to high priority.
- Some customers like this, but "super fans" tend to post multiple times just because, not because the issue is high priority.
- Other customers need to filter out super fans.
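
A sketch of the repeat-post heuristic with an optional super-fan filter. The threshold of 5 posts, the per-batch counting and all names are assumptions; the talk only described the behaviour in words.

```python
from collections import Counter

def prioritise(posts, filter_super_fans=False, super_fan_threshold=5):
    """Label each (author, text) post 'high' or 'normal'.

    A repeat post from the same author defaults to high priority.
    With filter_super_fans=True, authors with at least
    super_fan_threshold posts in the batch stay at normal priority.
    """
    totals = Counter(author for author, _ in posts)
    seen = Counter()
    labels = []
    for author, _ in posts:
        seen[author] += 1
        is_repeat = seen[author] > 1
        is_super_fan = (
            filter_super_fans and totals[author] >= super_fan_threshold
        )
        labels.append("high" if is_repeat and not is_super_fan else "normal")
    return labels

# Hypothetical batch of posts.
posts = [("alice", "Where is my order?"), ("alice", "Still waiting!"),
         ("bob", "Nice shop")] + [("fan", "Love you guys")] * 5

print(prioritise(posts))
print(prioritise(posts, filter_super_fans=True))
```

Making the filter a per-customer setting matches the point that some customers like the default and others need super fans removed.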
Is Python a bottleneck for high throughput?
- Yes, systems did collapse at peak throughput sometimes.
- Didn't have DevOps or monitoring to warn about these events, but now do. Part of natural lifecycle of a company.
Do you do picture analysis?
- No, not yet.
- Lots of startups got acquired that did picture analysis, but went quiet.
Have you considered feature reduction using word2vec (https://code.google.com/p/word2vec/)?
- No. Use SVD for feature reduction for now.
- word2vec used for feature engineering raw text into numeric arrays.
Time series analysis? Seasonality, old documents?
- No.
Disambiguation of language, slang?
- Use heuristics, manual dictionary, to convert words to standard vocabulary. This is a vital part of feature engineering.
- Use stopword lists to filter out uninformative words.
- Rely on SVD to do the rest.
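
The dictionary and stopword steps might look like the sketch below. The slang dictionary and stopword list are invented examples; the real lists are maintained by hand and were not shown.

```python
import re

# Hypothetical hand-maintained dictionary mapping slang to standard words.
SLANG = {"u": "you", "pls": "please", "plz": "please",
         "thx": "thanks", "gr8": "great"}
# Tiny illustrative stopword list.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "me", "my"}

def normalise(text):
    """Lowercase, map slang to standard vocabulary, then drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [SLANG.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

print(normalise("Thx for the gr8 service, can u help me pls?"))
# ['thanks', 'for', 'great', 'service', 'can', 'help', 'please']
```

Slang is mapped before stopwords are removed, so that "u" becomes "you" and is then filtered like any other stopword.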
Do any categorisation or sentiment analysis?
- No, don't do sentiment analysis. Customers think manual tagging is more accurate and don't want automatic tagging.
- No, don't do automatic tagging, but are looking into it. Need to confirm exactly what customers want; variable tags, fixed set of customer-defined tags.
- Different customers want different things, still at product stage.