Discuss your motivations and reasons for choosing this project, especially any background or research interests that may have influenced your decision.
Our interest is primarily motivated by our access to a unique dataset: up-to-the-second TV viewership data. Because the data is collected from streaming online TV, it is far more granular than Nielsen and Comscore data. Ryan worked with this dataset in a previous role at Philo, and he wants to better understand how models can be built around cable TV in the same way that Netflix built models around on-demand video.
In addition, Jody’s interest in consumer flows across the television network space stems from her background analyzing consumer behavior online for a web analytics company. She wants to better understand how content drives consumer attention.
David and Vladimir are interested in the general insights that can be gained from observing social signals like TV viewership patterns.
What are the scientific and inferential goals for this project? What would you like to learn and accomplish? List the benefits.
This project seeks to answer the following research question:
How are TV viewership patterns affected by commercials?
We seek to understand the effects of commercials on the flow of users across the television space. In particular, we want to identify what type of content is “sticky” and what content is not. Generally, this type of research is of interest to advertisers, because they can benefit by choosing the “stickier” shows on which to advertise. Networks can benefit from understanding which content is the stickiest both in choosing their programming and in selling advertisements. In addition, show stickiness is a useful feature that could be incorporated in a larger recommendation system.
These are features or calculations without which you would consider your project to be a failure.
- A measure of ‘stickiness’ of a show (i.e., users’ attention and contiguous time watching a show; see the sketch after this list)
- A report of stickiness by genre and by show
- A measure of similarity among shows
- A visualization of user flows across shows
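As a concrete illustration of the first item, here is a minimal sketch of one possible stickiness measure: mean contiguous viewing time per session. The session records, column names, and pandas representation are our assumptions for illustration, not the actual schema of the dataset.

```python
import pandas as pd

# Hypothetical session records: one row per contiguous viewing session.
# Column names are illustrative, not the actual schema of our dataset.
sessions = pd.DataFrame({
    "show": ["Drama A", "Drama A", "News B", "News B", "News B"],
    "session_seconds": [2400, 1800, 300, 450, 600],
})

# One candidate stickiness measure: mean contiguous seconds watched per session.
stickiness = (sessions.groupby("show")["session_seconds"]
              .mean()
              .sort_values(ascending=False))
print(stickiness)
```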
Those features or calculations which you consider would be nice to have, but not critical.
- A precise model for determining the start of commercials on shows based on flows across channels (and ideally, a similar model for detecting the end of commercials, though that is a much harder problem)
- A measure of user-specific show stickiness (as opposed to global stickiness)
- Inclusion of an additional demographic component in the visualization and accompanying model (e.g., gender, locale, etc.)
- Development of a Markov model from edge probabilities to determine which show a user will most likely stay on (see the sketch after this list)
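To make the Markov idea concrete, the sketch below estimates transition probabilities from a toy sequence of consecutive shows and reads off the most likely next show; the show names and the simple counting scheme are placeholders, not our final model.

```python
from collections import Counter, defaultdict

# Toy sequence of consecutive shows watched by one user (placeholder data).
history = ["A", "A", "B", "A", "C", "A", "A"]

# Count observed transitions (edges) between consecutive shows.
counts = defaultdict(Counter)
for prev, nxt in zip(history, history[1:]):
    counts[prev][nxt] += 1

# Normalize counts into transition probabilities.
transitions = {
    show: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
    for show, ctr in counts.items()
}

# The most likely next show after "A"; a large self-transition
# probability is itself a signal that "A" is sticky.
print(max(transitions["A"], key=transitions["A"].get))
```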
From where and how are you collecting your data?
We already have our dataset, which Ryan acquired during his work as an analyst at Philo. We have permission from the company to use this data for our final project. (Note: Philo isn't sure whether they will allow us to publicly disclose the source of the data, so please keep this data confidential.)
The data includes:
This data is captured every 5 seconds for a 3-month period in early 2012 (64,000 data points).
List the statistical and computational methods you plan to use.
We could use reinforcement learning to find the stickiness of each show based on user viewership. We would train the model on the first month of data and then reinforce it with later data. Although we have never used this method before, we believe it may be the right approach here.
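Since we have not used reinforcement learning before, the sketch below only illustrates the flavor we have in mind: an incremental, bandit-style running estimate of per-show stickiness that is reinforced as later data arrives. The learning rate, update rule, and stickiness values are assumptions for illustration, not a committed design.

```python
# Incremental (bandit-style) running estimate of per-show stickiness.
# ALPHA and the update rule are illustrative assumptions.
ALPHA = 0.1

def update(estimates, show, observed):
    """Nudge the running estimate for `show` toward the newly observed value."""
    old = estimates.get(show, observed)
    estimates[show] = old + ALPHA * (observed - old)

estimates = {}
update(estimates, "Drama A", 0.8)  # first month of data
update(estimates, "Drama A", 0.6)  # reinforced with later data
print(estimates["Drama A"])
```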
We intend to use viewing times on commonly watched shows to determine which shows are most like one another, using a latent-factor model (for example, fit with a Gibbs sampler) or another approach, but not KNN, which Ryan already tried and which gave poor results.
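A full Gibbs sampler is beyond a quick sketch, but a simple latent-factor baseline could look like the following, assuming a hypothetical user-by-show matrix of viewing times (the matrix values and the choice of two latent factors are illustrative):

```python
import numpy as np

# Hypothetical user-by-show matrix of hours watched (rows: users, cols: shows).
viewing = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 5.0, 1.0],
    [0.0, 1.0, 6.0],
])

# A truncated SVD gives each show a low-dimensional latent representation.
U, s, Vt = np.linalg.svd(viewing, full_matrices=False)
show_factors = Vt[:2].T  # two latent factors per show

# Cosine similarity between shows in latent space.
unit = show_factors / np.linalg.norm(show_factors, axis=1, keepdims=True)
similarity = unit @ unit.T
print(np.round(similarity, 2))
```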
How will you verify your project's results? In other words, how do you know that your project does well?
One way to verify our project's results is to split our data by time: one time frame will serve as training data and another as test data. If the flows of users in the test data move around our predicted commercial times, we will have evidence that our model generalizes.
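A minimal sketch of this time-based split, assuming the events carry a timestamp column (the column names and cutoff date are illustrative):

```python
import pandas as pd

# Hypothetical event log; column names and values are illustrative.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-15", "2012-02-10", "2012-03-05"]),
    "user": [1, 2, 1],
    "show": ["A", "B", "A"],
})

# Train on the first month; hold out everything later as test data.
cutoff = pd.Timestamp("2012-02-01")
train = events[events["timestamp"] < cutoff]
test = events[events["timestamp"] >= cutoff]
```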
In addition, as TV viewers ourselves, we have good intuitions about expected outcomes. As a coarse sanity check, we will compare the shows our model scores as highly sticky against the shows we expect to be sticky based on our own experience; for example, we expect dramas to be stickier than news.
How will you visualize and communicate your results?
We plan to visualize the data as a time-series force-directed graph layout in D3.js. Similarity between shows will be conveyed by two-dimensional position in the graph, and users will be shown flowing across the edges of the graph.
The visualization will employ linked views: a timeline will enable users to scrub to a particular time period and fast-forward through historical data.
Make sure that you plan your work so that you can avoid a big rush right before the final project deadline, and delegate different modules and responsibilities among your team members. Write this in terms of weekly deadlines.
We form a great team because our skills are very complementary: David has strong experience with D3 and visualization, Ryan is intimately familiar with the dataset, and Jody has a strong analytics background.
We have already created a private Github repository to coordinate our work (https://github.com/DavidChouinard/flowexplorer; we are happy to provide access to TFs).
Week 1: Focus on exploratory data analysis. Ryan, Vladimir, Jody: Experiment with reinforcement learning and different stickiness measures. David: Begins visualization work on user flows across networks.
Week 2: Ryan: Develops the Markov model on training data to determine which show a user will stay on. David: Works with Ryan to begin incorporating predictions into the visualization. Vladimir and Jody: Implement the chosen stickiness algorithm and build reports.
Week 3: Verification of results. Ryan: Finalizes predictions of user behavior on test data. David: Adds Ryan's prediction data to the visualization and finalizes its look; possibly adds a demographic element. Vladimir and Jody: Finalize rankings of networks, genres, and programs by stickiness, and create this element of the website.
Week 4: Screencast. Ryan, Vladimir, and Jody: Record the screencast and finalize the process book. David: Writes the public project website and puts the final touches on the visualization.