Data Science Technical Interview

Important: Please ask and discuss!

If, at any point, you feel that important information in these questions is missing, then ask! These tasks are realistic in that they are not necessarily completely specified. Sometimes the answer might be "It depends" - in these cases, discuss the potential solutions with me.

Supervised Learning: Choice of Approach

Let's assume you are given a task to predict some kind of variable given fully observed supervised training examples. Please discuss the different choices of models and/or algorithms you would consider or try given the following scenarios. Don't go into details, just tell me what your initial thoughts are on how to approach these.

  • Predicting the gender using a combination of height, weight and eye-colour, given a training dataset consisting of 4 million examples.
  • Predicting height from gender, weight and eye-colour, given a training dataset consisting of 4 million examples.
  • Learning an arbitrary, possibly noisy function mapping two binary variables to a single binary outcome given 4 million examples.
  • Learning to discover a set of 4 object classes in medium-resolution images, given 100,000 example images, 25,000 per class.
  • Learning to locate objects of 1000 categories in high-res images given 2000 example images per category.
  • Predicting the category (one of 20 categories) of a text from a webpage, given a total of 10,000 training examples.
  • Predicting the price of a product given 10,000 potential features (some numeric, some binary) given 1,000,000 training examples.
  • Classifying the topic of short Twitter messages into one of 4 categories, given a continuous stream of these short natural language messages.

Graph Completion

Let's assume you are given a large social graph from a social network site such as Facebook or LinkedIn. The data contains information about who is already connected to whom, and about how long each person has been a member of the site. There might be additional information about members, such as stated hobbies, location and so on.

An assumption is that people who have been members for a long time have had time to connect with their true contacts, i.e. their part of the graph is more complete. Newer members need help to find contacts they would like to have. Someone has proposed developing a connection recommendation engine.

  • Given the task to develop software to solve this problem, what would your first steps be?
  • Which kinds of algorithmic approaches might be suited to solving this problem?
  • Which kind of software/hardware infrastructure might be appropriate?

Conditional Probabilities

Let's assume you have been randomly chosen to participate in a study for a test for a rare disease. This study is well paid, and since all that's required from you is a small blood sample and you get a free medical diagnosis, you agree to participate.

You are told that the rare disease affects 1 in 100,000 people in the general populace, and that you are by no means in any special kind of risk or no-risk group. The test has a true positive rate of 99.9%, i.e. there is only a 0.1% chance that the test tells you that you are healthy if you do in fact have the disease.

What would you say your chances of actually being sick are, if your test was positive and one of the following holds:

  • The false positive rate of the test is known to be 2%
  • The false positive rate of the test is known to be 5%, but you have been tested positive twice.
  • The false positive rate of the test is unknown, but of the 50,000 random people tested in the study, you are the only one who was tested positive (once)

You can use the Python Console below as a small inline calculator.


In [2]:
a = 100000.5
b = 1.5 # This is a comment
print a/b


66667.0
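
If it helps, a small Bayes'-rule helper along these lines could be typed into the console; the function name and the numbers in the example call are illustrative placeholders, not the values from the questions above.

In [ ]:
def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """Return P(disease | positive test) via Bayes' rule."""
    p_positive_and_sick = prior * true_positive_rate
    p_positive_and_healthy = (1.0 - prior) * false_positive_rate
    return p_positive_and_sick / (p_positive_and_sick + p_positive_and_healthy)

# Placeholder numbers for illustration only
print(posterior_given_positive(prior=0.001, true_positive_rate=0.99, false_positive_rate=0.05))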

Bonus Question

How would you go about it if you were asked to calculate some kind of 95% confidence interval for the unknown false positive rate, assuming that 50,000 people were tested and you were tested positive?

The only prior assumption you should make here is that the false positive rate lies somewhere between 0.0% and 100.0%.

Now, let's assume you have to make a hard decision: you were the only one tested positive, but the false positive rate is unknown. There is a treatment for the disease, but the following things are true:

  • If you have the disease, you have a 20% probability of dying from it.
  • If you have the disease and take the treatment, your probability of dying is reduced to 0.5%.
  • If you don't have the disease but take the treatment nevertheless, your probability of dying from the treatment is 0.5%.

You don't get any re-tests, the decision has to be made immediately, otherwise the possibility of the treatment is gone.

You don't need to calculate this through, but how would you approach this decision if you try to be completely rational about it?

Expectation Maximization Algorithm

Expectation maximization (EM) uses iterative optimization along with a latent variable model to obtain maximum likelihood estimates for models whose parameters are difficult to estimate directly. It may not be intuitive how introducing latent (missing) elements to a problem will facilitate its solution, but it works essentially by breaking the optimization into two steps:

  1. generating an expectation over the missing variable(s) based on current estimates of parameters
  2. maximizing the log-likelihood from the expectation step, thereby generating updated estimates of parameters

(Borrowed from Christopher Fonnesbeck's excellent course on Advanced Statistical Computing.)
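
In the usual notation, with observed data $X$, latent variables $Z$ and parameters $\theta$ (symbols introduced here just for illustration), one EM iteration can be written as

$$Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big] \qquad \text{(E-step)}$$

$$\theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)}) \qquad \text{(M-step)}$$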

EM is particularly suited to estimating the parameters of mixture models, where we do not know from which component each observation is derived.

EM is also one of the most popular learning algorithms for models with only partially observed data. Generally and loosely speaking, we iterate between filling in the missing data with our best guess (in the case of EM: its expected value) given the current best model we have, and then updating (training) our model on this completed data set, i.e. using it as training data as in completely supervised learning. In the case of EM, this training step means finding the model parameters which maximize the likelihood of the completed training data.

The approach is very general and not limited to using expectations and maximizing likelihoods. EM is just one of several iterative algorithms which proceed like this; other important ones are Alternating Least Squares (ALS) and the more general Concave-Convex Procedure (CCCP), which generalizes EM and several other algorithms.

It is important to note that EM and other algorithms which proceed like this only find a local maximum. The most common remedy is to run the algorithm multiple times from random starting points and then to select the best solution found.
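
To make this concrete, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture; the generated data, the component count and all variable names are illustrative assumptions, not part of any task in this interview.

In [ ]:
import numpy as np

# Illustrative data: samples drawn from a mixture of two 1-D Gaussians
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# Initial (deliberately rough) parameter guesses
weights = np.array([0.5, 0.5])   # mixing proportions
means = np.array([-1.0, 1.0])    # component means
stds = np.array([1.0, 1.0])      # component standard deviations

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

for _ in range(50):
    # E-step: "fill in" the missing component assignments with their expected values
    # (the responsibilities), given the current parameter estimates.
    joint = np.stack([weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)], axis=1)
    resp = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters to maximize the likelihood of the "completed" data.
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / nk)

print(weights, means, stds)

Note that a different random starting point may converge to a different (possibly worse) local maximum, which is exactly why multiple restarts are commonly used.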

Classification of Web Documents given incomplete Training Data

So let's assume you have the task of classifying web documents given their text contents. The problem is that paying people to manually classify a large number of web documents as training examples is expensive. But we also don't want a completely unsupervised clustering of the documents. We have a pretty good idea of some of the clusters we would like to see, for example because there is a pre-existing system for categorizing websites.

So we have a fairly sparse set of maybe 200 example websites in English which have been classified into 10 categories, and we would like to use these to find a general way to classify all English-language websites on the web.

  • Which approaches might be plausible for this system?

  • What if the only parallel processing platform available to you were a large but old-style (pre-YARN) Hadoop cluster which allows you to run Map/Reduce jobs?

  • Can you think of an online algorithm for this, i.e. an algorithm which (after some initial offline training) is fed a continuous stream of websites, returns a classification for each, and is expected to get better and better over time?

Java Code Analysis

Please look through the code below and tell me what you consider to be problematic about it. Go for actual bugs first. There are at least two bugs in the code. Once you have found them, tell me what else you would consider to be bad practice in the code, and how you would change it.

The code has never been compiled, so please ignore problems that a compiler or IDE would catch. The intention of the code should be clear. Where it is not, please ask.

You might want to look at the Javadocs of ConcurrentLinkedQueue and Timer.


In [ ]:
/**
 Class to log an incoming (asynchronous) stream 
 of events to a single file. Decouples event reception from the writing, which is done in a single timer thread.
 Should be able to handle thousands of asynchronous requests per second.
*/
public class EventLogger  {
 
    private File logFile;
    private Timer timer = null;
    private Writer logWriter = null; 
    private ConcurrentQueue<Event> queue = null;
    public static final int QUEUE_CAPACITY = 100000;

    /**
      * Create a new EventLogger instance
      * @param logFile Logfile to write to. Has to be writable
      */
    public EventLogger(File logFile) {
         this.logFile = logFile;
         this.queue = new ConcurrentQueue<Event>();
         this.timer = new Timer(true);
         this.timer.schedule(new TimerTask() {
              try {
                 logQueuedEvents();
               } catch(Exception ex) {
                 throw new RuntimeException(ex);
               }
         }, 100);

    }
 
   /**
    * Called on arrival of a new Event.
    * Threadsafe: May safely be called at a high rate from
    * multiple threads. Uses a threadsafe queue internally.
    * @param e Event to process
    * @throws java.io.IOException on I/O error
    * @returns true on success, false if the event queue is full (size exceeds QUEUE_CAPACITY)
    */
   public boolean onLogEvent(Event e) throws java.io.IOException {
         ensureOpenWriter(); // We don't want to create empty logfiles, so we open it here..
         if (this.queue.size()>QUEUE_CAPACITY) {
              return false;  
         }
         this.queue.add(e);
         return true;
   }
   
   /**
     * Logs all queued events. This is *NOT* Thread safe. Don't call from multiple threads
     */
   private void logQueuedEvents() throws java.io.IOException {
        Event e = this.queue.poll();
        while (e!=null) {
             this.logWriter.println(e.toString());
             e = this.queue.poll();
        }
  }

   /**
     * Ensure that the logfile is open.
     */
   private void ensureOpenWriter() throws java.io.IOException {
         if (this.logWriter==null) {
             this.logWriter = new BufferedWriter(new FileWriter(this.logFile));
             this.logWriter.println("-------------- Event Log ("+System.currentTimeMillis() + " ---------------");
         }
   }
   
  
}

Code and data reading

Below we analyze some simple stock market data. Please read the code. It is in Python, but should be somewhat understandable. Tell me what you see and read, and how you would interpret the results (i.e. the graph below).


In [19]:
from wakaridata.yahoofinance import Stocks
import matplotlib.pylab as plt
import pandas

stocks = Stocks()
# Get list of dataset keys
keys = stocks.keys()

def stock_correlation_info(symbol1, symbol2, stocks):
    
    d1 = stocks[symbol1]
    d2 = stocks[symbol2]

    # Get NumPy array of all records
    d1_df = pandas.DataFrame(data=d1[:], columns=d1.field_names)
    d1_df['change'] = d1_df['Close'] - d1_df['Open']

    d2_df = pandas.DataFrame(data=d2[:], columns=d2.field_names)
    d2_df['change'] = d2_df['Close'] - d2_df['Open']

    i1 = 'change_'+symbol1
    i2 = 'change_'+symbol2
    
    df = pandas.merge(d1_df, d2_df, how='left', on=['Date'],
          suffixes=('_'+symbol1,'_'+symbol2), copy=True)

    change = df[[i1,i2]]

    print "Correlation Coefficients of Intraday change between %s and %s" % (symbol1, symbol2)
    print change.corr()

    pandas.scatter_matrix( change, figsize=(12,8), diagonal='hist', hist_kwds={ "bins" : 30 })
    return (df, change)

df1, change1 = stock_correlation_info('AMZN', 'GOOGL', stocks)
print "---"
df2, change2 = stock_correlation_info('AAPL', 'GFJ.DE', stocks)
print "---"
df2, change2 = stock_correlation_info('AAPL', 'AA', stocks)


Correlation Coefficients of Intraday change between AMZN and GOOGL
              change_AMZN  change_GOOGL
change_AMZN      1.000000      0.450576
change_GOOGL     0.450576      1.000000
---
Correlation Coefficients of Intraday change between AAPL and GFJ.DE
               change_AAPL  change_GFJ.DE
change_AAPL       1.000000       0.069131
change_GFJ.DE     0.069131       1.000000
---
Correlation Coefficients of Intraday change between AAPL and AA
             change_AAPL  change_AA
change_AAPL     1.000000   0.117036
change_AA       0.117036   1.000000

Risk

If you are trying to minimize your risk, which combination of stocks would be preferable, if you assume all of them are independently equally profitable:

  • Google (GOOGL) and Amazon (AMZN)
  • Apple (AAPL) and Gagfah (GFJ.DE)

In [ ]: