The amount of information that is being generated daily (minutely!) is pretty astounding. So, whether we like it or not, we have entered the realm of "Big Data".

The question is "What kinds of things can we learn from this data and how do we do it?"

What we can learn depends on what you might hope to learn. But how we can interact with Big Data is what Data Mining and Machine Learning are all about. Data mining is exactly what it sounds like: sifting through piles of data in order to find something useful---like digging rock from the ground and extracting metal ore from it. Machine learning is about using computers to do that sifting, leveraging our ability to extract useful (hopefully!) information from the data.

Who does data mining and uses machine learning? Seems like everyone these days:

- Amazon and Target to predict things that you might buy or ads that may be of interest
- Google for everything, but most interesting these days (I think) is self-driving cars: http://dataconomy.com/how-data-science-is-driving-the-driverless-car/
- Insurance companies to predict how much of a risk it is to insure you
- Financial institutions to predict the future prices of their investments
- Governments for all sorts of things
- Election prognosticators, e.g., http://fivethirtyeight.com/, http://election.princeton.edu
- Netflix to predict what movies you are likely to want to watch: https://en.wikipedia.org/wiki/Netflix_Prize
- Sports teams, e.g., https://en.wikipedia.org/wiki/Moneyball
- Face recognition tools
- and, of course, physicists!

So there are definitely some interesting things to be learned and some money to be made if you have knowledge of the basics of data mining and machine learning.

But don't get too cocky: things can go wrong, the world is made of people and not data, and machine learning by itself doesn't intrinsically make our lives better.

My interest in the topic stems from the new generation of astronomical sky surveys like the Large Synoptic Survey Telescope (LSST), of which Drexel is a member institution. While the Big Data examples above are impressive, in many ways, projects like LSST define Big Data.

What do I mean by that? Well, LSST is a project that will generate about 200 PB of data by the end of its 10-year mission. During that time, it will have measured a hundred or more properties for some 40 billion objects---every 3 nights.

To put that into perspective, there are 7.4 billion people on the Earth today. So, LSST will be equivalent to collecting dozens of pieces of information per day for every person on the Earth. Google would go crazy with that!
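To make those numbers concrete, here is a quick back-of-envelope calculation using only the rough figures quoted above (not official LSST project numbers):

```python
# Back-of-envelope check of the LSST data rates quoted in the text.
# All inputs are the rough figures given above, not official numbers.
n_objects = 40e9          # objects measured
n_properties = 100        # properties per object
cadence_days = 3          # the full set remeasured every ~3 nights
population = 7.4e9        # people on Earth

measurements_per_day = n_objects * n_properties / cadence_days
per_person_per_day = measurements_per_day / population   # O(100) per person

total_bytes = 200e15                       # ~200 PB over the survey
bytes_per_day = total_bytes / (10 * 365)   # averaged over the 10-year mission

print(f"{per_person_per_day:.0f} measurements per person per day")
print(f"{bytes_per_day / 1e12:.0f} TB per day")
```

So on average that's tens of terabytes of data arriving every single day, for a decade.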

Even that is small potatoes for particle physicists. In astronomy (at least up to now), we at least keep all of the data. In particle physics, they throw away most of the data because there is so much. Instead they have the notion of a trigger, which is basically a series of "if-then" statements that decide whether or not an "event" is worth saving for future analysis.
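In spirit, a trigger is just a chain of cuts applied to each event as it arrives. Here's a toy sketch; the event fields and thresholds are invented for illustration, not taken from any real experiment:

```python
# Toy sketch of a particle-physics "trigger": a chain of if-then cuts
# that decides whether an event is worth keeping. The fields and
# thresholds below are made up for illustration.

def trigger(event):
    """Return True if the event should be saved for later analysis."""
    if event["total_energy"] < 50.0:   # too little energy deposited
        return False
    if event["n_tracks"] < 2:          # need at least two charged tracks
        return False
    if event["max_pt"] < 20.0:         # require one high-momentum track
        return False
    return True

events = [
    {"total_energy": 120.0, "n_tracks": 5, "max_pt": 35.0},  # passes all cuts
    {"total_energy": 30.0,  "n_tracks": 8, "max_pt": 45.0},  # fails energy cut
]
kept = [e for e in events if trigger(e)]
```

The real decision logic is vastly more sophisticated (and runs in hardware and firmware), but the idea is the same: most events never make it to disk.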

What the class is NOT

  • A Statistics Class
  • A Math Methods Class
  • A Computer Science Class
  • A Programming Class

There are other classes for those things and I am NOT a Stats or CS professor. There will definitely be some things that we are learning together rather than me transferring my knowledge to you! Moreover, we only have 10 weeks, so I can't cover everything.

What the class IS

  • An introduction to Big Data and (python-based) Machine Learning algorithms needed to handle it
  • Whatever stats we need to get there
  • Biased towards astronomy by my field of expertise and the book, but I'll try to make it as general as possible

In just 10 weeks we won't be able to go into as much detail as you might really need to do "real" data mining/machine learning. My assumption is that if you need this, you'll follow through and make sure to understand the details later (otherwise, I'm just creating monsters!).

This figure from drewconway.com nicely illustrates where we are trying to get: somewhere close to the middle!

At some level, you don't need this class at all. You could just go through the book on your own (but how often does that happen without someone making you?) But I found that it isn't entirely self-contained and that you really need other resources too. So, what I am trying to do with this class is to bring those resources together in one place.

A group project is intended to help you delve into details for one algorithm/method (and learn from your classmates about others). See the syllabus for more information on it.

Almost everything that we will do can be categorized along one of two different pairs of things.

  • Supervised learning vs. unsupervised learning

    • (Where it is the learning process, not you, that is supervised or not.)
    • Supervised learning means that we know the "truth" for (some of) the data that we are analyzing. It is this "truth" that is "supervising" the learning process.
    • Unsupervised learning means that we do NOT know the "right" answer. We are trying to find it. Or, generally speaking, not so much the right answer as any answer that increases our knowledge.
  • Classification vs. Regression

    • Classification means that we are trying to put our data into different discrete categories
    • Regression is the limit where the classification "bins" become continuous.

These can be combined together:

  • supervised classification
  • unsupervised classification (aka clustering)
  • supervised regression
  • unsupervised regression (aka dimensional reduction)

We'll talk about this in more detail when I introduce the scikit-learn package.
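As a preview, here is a minimal sketch of all four combinations on toy data, assuming you have scikit-learn and NumPy installed (the dataset and model choices are just convenient examples, not the ones we'll necessarily use in class):

```python
# One toy example of each of the four combinations above,
# using scikit-learn's uniform fit/predict interface.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier  # supervised classification
from sklearn.cluster import KMeans                  # unsupervised classification (clustering)
from sklearn.linear_model import LinearRegression   # supervised regression
from sklearn.decomposition import PCA               # unsupervised regression (dim. reduction)

# 100 points in 2-D, drawn from 3 well-separated blobs
X, y = make_blobs(n_samples=100, centers=3, random_state=0)

# Supervised classification: the known labels y "supervise" the fit
clf = KNeighborsClassifier().fit(X, y)

# Unsupervised classification (clustering): no labels used at all
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised regression: predict a continuous value from known examples
reg = LinearRegression().fit(X[:, [0]], X[:, 1])

# Unsupervised regression (dimensionality reduction): compress 2-D -> 1-D
X1 = PCA(n_components=1).fit_transform(X)
```

Note how every estimator shares the same `fit` interface; that uniformity is a big part of why we'll use scikit-learn.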

Graphically, the course can be represented as a tour of the following flowchart (the linked version has links to each of these algorithms):

Lastly, here are some reading resources that you might find useful:

Ball & Brunner 2010: astro-specific, but serves as a good general introduction too

Bloom & Richards 2011, where that Richards is not yours truly (in this case).