If, at any point, you feel that important information in these questions is missing, then ask! These tasks are realistic in that they are not necessarily completely specified. Sometimes the answer might be "It depends" - in these cases, discuss the potential solutions with me.
Let's assume you are given the task of predicting some kind of variable given fully observed supervised training examples. Please discuss the different choices of models and/or algorithms you would consider or try given the following scenarios. Don't go into details, just tell me what your initial thoughts are on how to approach these.
Let's assume you are given a large social graph from a social network site such as Facebook or LinkedIn. The data contains information about who is already connected to whom, and about how long each person has been a member of the site. There might be additional information about members, such as stated hobbies, location, and so on.
An assumption is that people who have been members for a long time have had time to connect with their true contacts, i.e. their graphs are more complete. Newer members need help finding contacts they would like to have. Someone has proposed developing a connection recommendation engine.
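To make the setting concrete, here is one very simple baseline one might start from: neighborhood-based link prediction. This is only a sketch under assumptions - it uses networkx, and the graph and member names below are toy placeholders, not part of the problem statement.
In [ ]:
import networkx as nx

# Toy stand-in for the real social graph
G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
                  ("bob", "dave"), ("carol", "dave"), ("dave", "erin")])

new_member = "erin"
# Score every non-neighbor of the new member with a common-neighbor heuristic
# (Adamic-Adar); the more complete neighborhoods of long-time members are what
# makes these scores informative.
candidates = [(u, v) for u, v in nx.non_edges(G) if u == new_member or v == new_member]
scores = sorted(nx.adamic_adar_index(G, candidates), key=lambda t: -t[2])
for u, v, score in scores[:3]:
    print((u, v, score))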
Let's assume you have been randomly chosen to participate in a study for a test for a rare disease. This study is well paid, and since all that's required from you is a small blood sample and you get a free medical diagnosis, you agree to participate.
You are told that the rare disease affects 1 in 100,000 people in the general population, and that you are not in any particular risk or low-risk group. The test has a true positive rate of 99.9%, i.e. there is only a 0.1% chance that the test tells you you are healthy if you in fact have the disease.
What do you think your chances of actually being sick are if your test comes back positive, and either:
You can use the Python Console below as a small inline calculator.
In [2]:
a = 100000.5
b = 1.5 # This is a comment
print a/b
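As an example of the kind of calculation the console is meant for, here is the Bayes'-rule posterior. Note that the false positive rate is not given in the problem, so the 0.1% used below is purely an assumption for illustration.
In [ ]:
prior = 1.0 / 100000         # P(disease) = 1 in 100,000
sensitivity = 0.999          # P(positive | disease), as given in the text
false_positive_rate = 0.001  # ASSUMED P(positive | healthy); not given in the problem

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
p_disease_given_positive = sensitivity * prior / p_positive
print(p_disease_given_positive)  # roughly 0.01, i.e. about a 1% chance of being sick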
How would you go about it if you were asked to calculate some kind of 95% confidence interval for the unknown false positive rate, assuming that 50,000 people were tested and you were tested positive?
The only prior assumption you should make here is that the false positive rate is somewhere between 0.0% and 100.0%.
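One possible sketch, assuming scipy is available and, as a simplification, treating the single observed positive as a false positive (given the 1-in-100,000 prevalence this is only approximately right):
In [ ]:
from scipy.stats import beta

n = 50000   # people tested
k = 1       # positives observed (all treated as false positives here)

# Exact (Clopper-Pearson) 95% confidence interval for the false positive rate
lower = beta.ppf(0.025, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(0.975, k + 1, n - k)
print((lower, upper))

# Bayesian alternative matching the flat prior on [0%, 100%]: with a Beta(1, 1)
# prior the posterior is Beta(k + 1, n - k + 1), and a 95% credible interval is
print((beta.ppf(0.025, k + 1, n - k + 1), beta.ppf(0.975, k + 1, n - k + 1)))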
Now, let's assume you have to make a hard decision: you were the only one tested positive, but the false positive rate is unknown. There is a treatment for the disease. But the following things are true:
You don't get any re-tests; the decision has to be made immediately, otherwise the option of treatment is gone.
You don't need to calculate this through, but how would you approach this decision if you try to be completely rational about it?
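Without calculating it through, the structure of a rational approach can be sketched as an expected-utility comparison between treating and not treating. The posterior probability and all utilities below are made-up placeholders; only the shape of the calculation matters.
In [ ]:
p_sick = 0.01  # ASSUMED posterior probability of actually being sick

# ASSUMED utilities of each (action, true state) combination
utility = {
    ('treat',    'sick'):    -10.0,    # treated in time, but treatment has costs
    ('treat',    'healthy'): -10.0,    # unnecessary treatment
    ('no_treat', 'sick'):    -1000.0,  # untreated disease
    ('no_treat', 'healthy'):  0.0,
}

def expected_utility(action, p_sick):
    return p_sick * utility[(action, 'sick')] + (1.0 - p_sick) * utility[(action, 'healthy')]

for action in ('treat', 'no_treat'):
    print((action, expected_utility(action, p_sick)))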
Expectation maximization (EM) uses iterative optimization along with a latent variable model to obtain maximum likelihood estimates for models whose parameters are difficult to estimate directly. It may not be intuitive how introducing latent (missing) elements to a problem will facilitate its solution, but it works essentially by breaking the optimization into two steps:
(Borrowed from Christopher Fonnesbeck's excellent course on Advanced Statistical Computing.)
EM is particularly suited to estimating the parameters of mixture models, where we do not know from which component each observation is derived.
EM is also one of the most popular learning algorithms for models with only partially observed data. Generally and loosely speaking, we iterate between filling in the missing data with our best guess given the current best model we have (in the case of EM: the expected value), and then updating (training) the model on this completed data set (i.e. we use it as training data as in completely supervised learning). In the case of EM, this training step means finding the model parameters which maximize the likelihood of the completed training data.
The approach is very general and not limited to using expectations and maximizing likelihoods. EM is just one of several iterative algorithms which proceed like this; other important ones are Alternating Least Squares (ALS) and the more general Concave-Convex Procedure, which generalizes EM and several other algorithms.
It is important to note that EM and other algorithms which proceed like this only find a local maximum. The most common remedy is to run multiple restarts of the algorithm from random starting points and then select the best solution found.
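A minimal sketch of EM for a two-component one-dimensional Gaussian mixture, with random restarts as described above. It uses only numpy and synthetic data; it is an illustration of the E/M loop, not a production implementation.
In [ ]:
import numpy as np

np.random.seed(0)
# Synthetic data: two Gaussian components, component labels are "missing"
x = np.concatenate([np.random.normal(-2.0, 1.0, 300),
                    np.random.normal( 3.0, 1.5, 700)])

def em_gmm_1d(x, n_iter=100):
    # Random initialization (one of possibly several restarts)
    mu = np.random.choice(x, 2)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: soft ("expected") component memberships under current parameters
        dens = np.array([pi[k] / (sigma[k] * np.sqrt(2 * np.pi)) *
                         np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                         for k in range(2)])
        resp = dens / dens.sum(axis=0)
        loglik = np.log(dens.sum(axis=0)).sum()  # log-likelihood at the last E-step
        # M-step: maximum likelihood parameters given the completed (soft-labeled) data
        nk = resp.sum(axis=1)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / len(x)
    return loglik, mu, sigma, pi

# Multiple random restarts; keep the best local maximum found
best = max((em_gmm_1d(x) for _ in range(10)), key=lambda r: r[0])
print(best)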
So let's assume you have the task of classifying web documents given their text contents. The problem is that paying people to manually classify a large number of web documents as training examples is expensive. But we also don't want a completely unsupervised clustering of documents. We have a pretty good idea of some clusters we would like to see, for example because there is a pre-existing system for categorizing websites.
So we have a pretty sparse selection of maybe 200 pre-classified example websites in English, which have been classified into 10 categories, and we would like to use these to find a general way to classify all English websites on the web.
Which approaches might be plausible for this system?
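One plausible baseline is simple self-training: train a supervised text classifier on the labeled sites, pseudo-label the unlabeled sites the classifier is confident about, and retrain. A minimal sketch assuming scikit-learn; the texts, labels and the 0.9 confidence threshold below are toy stand-ins for the real data.
In [ ]:
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins (the real system would have ~200 labeled and millions of unlabeled sites)
labeled_texts = ["cheap flights to rome", "hotel booking deals",
                 "python machine learning tutorial", "numpy array broadcasting",
                 "premier league results", "tennis grand slam draw"]
labels = ["travel", "travel", "programming", "programming", "sports", "sports"]
unlabeled_texts = ["last minute flights and hotels", "pandas dataframe tutorial",
                   "football world cup schedule", "learn python for data analysis"]

vectorizer = TfidfVectorizer(max_features=50000)
X_lab = vectorizer.fit_transform(labeled_texts)
y_lab = np.array(labels)
X_unlab = vectorizer.transform(unlabeled_texts)

clf = LogisticRegression()
for _ in range(5):                       # a few rounds of self-training
    clf.fit(X_lab, y_lab)
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.9  # only trust very confident predictions
    if not confident.any():
        break
    # Add confidently pseudo-labeled documents to the training set...
    X_lab = vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
    # ...and drop them from the unlabeled pool
    X_unlab = X_unlab[~confident]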
What if the only parallel processing platform available to you were a large but old-style (pre-YARN) Hadoop cluster which only allows you to run Map/Reduce jobs?
Can you think of an online algorithm for this, i.e. an algorithm which (after some initial offline training) is fed a continuous stream of websites, returns a classification for each of them, and is expected to get better and better over time?
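On such a cluster one option is Hadoop Streaming with plain mapper and reducer scripts, e.g. to aggregate the per-class word counts a Naive Bayes classifier needs. A sketch under assumptions: the file names and the "label<TAB>text" input format are my placeholders, not part of the problem.
In [ ]:
# mapper.py -- emits "<label>,<word>\t1" for every word of every labeled document
import sys

for line in sys.stdin:
    label, _, text = line.rstrip("\n").partition("\t")
    for word in text.lower().split():
        print("%s,%s\t1" % (label, word))

# reducer.py -- sums counts per (label, word) key; Hadoop sorts by key between
# map and reduce, so equal keys arrive consecutively.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))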
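A sketch of one possible online setup, assuming scikit-learn: a stateless HashingVectorizer (so no vocabulary ever has to be refit) feeding a linear model trained incrementally with partial_fit. The categories, seed documents and feedback mechanism below are toy placeholders.
In [ ]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

classes = ["news", "shop", "blog"]                  # placeholder category set
vectorizer = HashingVectorizer(n_features=2 ** 20)  # stateless feature extraction
clf = SGDClassifier()                               # linear model trained by SGD

# Initial offline training on a small labeled seed set (toy data)
seed_texts = ["election results tonight", "buy shoes with free shipping",
              "my thoughts on travelling"]
seed_labels = ["news", "shop", "blog"]
clf.partial_fit(vectorizer.transform(seed_texts), seed_labels, classes=classes)

# Online phase: classify each incoming page, fold in feedback whenever it arrives
def handle_page(text, feedback_label=None):
    x = vectorizer.transform([text])
    prediction = clf.predict(x)[0]
    if feedback_label is not None:  # e.g. an occasional human correction
        clf.partial_fit(x, [feedback_label])
    return prediction

print(handle_page("discount sneakers in our online store", feedback_label="shop"))
print(handle_page("breaking news about the election"))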
Please look through the code below and tell me what you consider to be problematic about it. Go for actual bugs first. There are at least two bugs in the code. Once you have found them, tell me what else you would consider to be bad practice in the code, and how you would change it.
The code has never been compiled, so please ignore problems that a compiler or IDE would catch. The intention of the code should be clear. Where it is not, please ask.
You might want to look at the Javadocs of ConcurrentLinkedQueue and Timer.
In [ ]:
/**
 * Class to log an incoming (asynchronous) stream of events to a single file.
 * Decouples event reception from the writing, which is done in a single timer thread.
 * Should be able to handle thousands of asynchronous requests per second.
 */
public class EventLogger {

    private File logFile;
    private Timer timer = null;
    private Writer logWriter = null;
    private ConcurrentQueue<Event> queue = null;

    public static final int QUEUE_CAPACITY = 100000;

    /**
     * Create a new EventLogger instance
     * @param logFile Logfile to write to. Has to be writable
     */
    public EventLogger(File logFile) {
        this.logFile = logFile;
        this.queue = new ConcurrentQueue<Event>();
        this.timer = new Timer(true);
        this.timer.schedule(new TimerTask() {
            try {
                logQueuedEvents();
            } catch(Exception ex) {
                throw new RuntimeException(ex);
            }
        }, 100);
    }

    /**
     * Called on arrival of a new Event.
     * Threadsafe: May safely be called at a high rate from
     * multiple threads. Uses a threadsafe queue internally.
     * @param e Event to process
     * @throws java.io.IOException on I/O error
     * @returns true on success, false if the event queue is full (size exceeds QUEUE_CAPACITY)
     */
    public boolean onLogEvent(Event e) throws java.io.IOException {
        ensureOpenWriter(); // We don't want to create empty logfiles, so we open it here..
        if (this.queue.size()>QUEUE_CAPACITY) {
            return false;
        }
        this.queue.add(e);
        return true;
    }

    /**
     * Logs all queued events. This is *NOT* Thread safe. Don't call from multiple threads
     */
    private void logQueuedEvents() throws java.io.IOException {
        Event e = this.queue.poll();
        while (e!=null) {
            this.logWriter.println(e.toString());
            e = this.queue.poll();
        }
    }

    /**
     * Ensure that the logfile is open.
     */
    private void ensureOpenWriter() throws java.io.IOException {
        if (this.logWriter==null) {
            this.logWriter = new BufferedWriter(new FileWriter(this.logFile));
            this.logWriter.println("-------------- Event Log ("+System.currentTimeMillis() + " ---------------");
        }
    }
}
In [19]:
from wakaridata.yahoofinance import Stocks
import matplotlib.pylab as plt
import pandas
stocks = Stocks()
# Get list of dataset keys
keys = stocks.keys()
def stock_correlation_info(symbol1, symbol2, stocks):
    """Compare the intraday change (Close - Open) of two symbols:
    print their correlation matrix and draw a scatter matrix."""
    d1 = stocks[symbol1]
    d2 = stocks[symbol2]
    # Get NumPy array of all records and compute the intraday change per day
    d1_df = pandas.DataFrame(data=d1[:], columns=d1.field_names)
    d1_df['change'] = d1_df['Close'] - d1_df['Open']
    d2_df = pandas.DataFrame(data=d2[:], columns=d2.field_names)
    d2_df['change'] = d2_df['Close'] - d2_df['Open']
    # Names of the change columns after the merge below adds per-symbol suffixes
    i1 = 'change_'+symbol1
    i2 = 'change_'+symbol2
    # Align both series on the trading date
    df = pandas.merge(d1_df, d2_df, how='left', on=['Date'],
                      suffixes=('_'+symbol1, '_'+symbol2), copy=True)
    change = df[[i1, i2]]
    print "Correlation Coefficients of Intraday change between %s and %s" % (symbol1, symbol2)
    print change.corr()
    pandas.scatter_matrix(change, figsize=(12, 8), diagonal='hist', hist_kwds={"bins": 30})
    return (df, change)
df1, change1 = stock_correlation_info('AMZN', 'GOOGL', stocks)
print "---"
df2, change2 = stock_correlation_info('AAPL', 'GFJ.DE', stocks)
print "---"
df2, change2 = stock_correlation_info('AAPL', 'AA', stocks)
In [ ]: