A long, long time ago, long before anyone can even remember, people started thinking about the idea of randomness. We wondered how the things in our world change and fluctuate. Eventually, someone decided we could keep track of those changes numerically. We could then track what happens as we observe multiple occurrences of an event. How do the numbers fluctuate? What if we fix certain parameter values to be true? How do the numbers fluctuate around that parameter value? This is the core idea of frequentist statistics: 'what happens to our values in the long run'. It is a fine and dandy way to keep track of how information might fluctuate; however, it assumes that our parameter values are true, non-fluctuating quantities, and that our data are observed values generated with those underlying parameter values holding true.
We began wondering: what values of a parameter do we think are possible? We observe this data, but what is the true value that underlies it? What is the probability of observing this data if the true underlying value is some particular number? We advanced further: we could simulate values under some assumed true value of a parameter, then see how well our data match the simulated values to understand how plausible that parameter value is. Frequentist statistical inference is conducted under this paradigm: "how probable is our data (under large-sample and independence assumptions) to occur under the value specified by our null hypothesis?"
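Here is a minimal sketch of that simulation idea: fix a parameter value (the null), simulate many datasets under it, and ask how often the simulated results are at least as extreme as what we actually observed. The coin-flip setup and the observed count are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_flips = 1000          # hypothetical sample size
observed_heads = 540    # hypothetical observed result
null_p = 0.5            # the parameter value we fix as "true" under the null

# Simulate many experiments under the null hypothesis.
simulated_heads = rng.binomial(n=n_flips, p=null_p, size=100_000)

# Approximate two-sided p-value: how unusual is the observed result if the null were true?
diff = abs(observed_heads - n_flips * null_p)
p_value = np.mean(np.abs(simulated_heads - n_flips * null_p) >= diff)
print(f"Approximate p-value under the null p = {null_p}: {p_value:.4f}")
```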
When we perform hypothesis tests and build confidence intervals, we are answering questions about where population parameters 'live'. From the earlier post, you might remember that population parameters are numeric summaries of a population. Therefore, hypothesis tests and confidence intervals do not aim to provide us the ability to make inference at the level of a specific individual. Rather, these techniques aim to help us understand an aggregate value in our population of interest.
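A small sketch of that aggregate-versus-individual point, using simulated data: a confidence interval for the population mean narrows as the sample grows, but the spread of individual values does not.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=500)   # hypothetical sample

mean = heights.mean()
se = heights.std(ddof=1) / np.sqrt(len(heights))

# 95% confidence interval for the *population mean* (an aggregate quantity).
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")

# Contrast with the range covering ~95% of *individual* observations.
lo, hi = np.percentile(heights, [2.5, 97.5])
print(f"Middle 95% of individuals: ({lo:.1f}, {hi:.1f})")
```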
The above is not to say we should throw out statistical inference because it does not allow us the ability to make predictions at an individual level, but it is to say that statistical inference has its place. Randomized, well-designed experiments lend themselves nicely to comparing groups in terms of underlying parameters, and in turn to hypothesis testing and confidence intervals. However, there are more advanced techniques for determining the drivers of sales or click-through rates than hypothesis tests on slope coefficients. Obtaining p-values to assess these relationships can be a first step, but the relationships are far too complicated for p-values alone to serve as all the 'statistical evidence' necessary to be convinced that one variable drives another.
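For concreteness, this is roughly what "hypothesis testing on slope coefficients" looks like in practice; the data and column names (ad_spend, price, sales) are invented for illustration, not taken from any real analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "ad_spend": rng.uniform(0, 100, n),
    "price": rng.uniform(5, 50, n),
})
# Synthetic response with known slopes plus noise.
df["sales"] = 50 + 0.8 * df["ad_spend"] - 1.5 * df["price"] + rng.normal(0, 20, n)

X = sm.add_constant(df[["ad_spend", "price"]])
model = sm.OLS(df["sales"], X).fit()
print(model.summary())   # slope p-values: a first step, not the whole story
```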
Many of the advances in statistical modeling and methods are not related to frequentist hypothesis testing. You might not know that, though, considering that many traditional graduate programs in analytics and statistics do little to discuss new areas of statistical modeling, how they relate to techniques with a longer history, and why one might be used over another. Machine learning with scikit-learn and caret is all the rage, but what about generalized linear models, Markov processes, and other methods of prediction? When might one of these be the ideal choice? Why do "cool" machine learning methods frequently outperform these traditional methods in terms of prediction? And why do you need to go beyond being able to throw an xgboost algorithm at your data to become really effective?
Now more than ever before, we have the means to collect tons of data and analyze that data to draw conclusions at an individual level. When collecting data about human genomes, Internet behavior, or what people say, write, or do, it doesn't make sense to take a traditional frequentist approach to these problems. Do we want to understand the drivers? - yes, of course! But is understanding which disease occurs in an individual and why, or which product a consumer buys and why, going to be determined by p-values and confidence intervals? - no, almost certainly not. The interactions are complex, the result we desire is not an aggregate for a group, and our traditional frequentist approaches are not aimed at making predictions at an individual level.
Do we have approaches that (at least potentially) give us the ability to understand these complex interactions, as well as the driving inputs? - yes, we do. The most powerful and promising of these techniques is likely neural networks. However, tree-based methods such as random forests and gradient boosting also have a ton of potential (and are currently much easier to train). All of these methods allow for complex interactions among our inputs that would be difficult to model with one of the more traditional parametric modeling techniques. There are also ways to understand the drivers of a particular outcome from these methods (more on this in a future post, with a small sketch below), which can take away some of the 'black box-i-ness': which variables are important for making predictions, and how are they related to the response (positively, negatively, convexly, concavely, etc.)?
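As a teaser of the "understanding the drivers" idea, here is a minimal sketch: fit a tree ensemble and inspect which inputs matter most for its predictions. Permutation importance is one generic way to do this (not the only one, and not necessarily the approach of the future post); the dataset here is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, only 3 of which actually drive the response.
X, y = make_regression(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Higher importance = bigger drop in performance when that feature is shuffled.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```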
Needless to say, the application of these methods to real problems is just beginning, and so is our ability to turn data into action (and by action I do not mean colorful dashboards, but more on this later).
A post on the same topic can be found here. I would argue with this statement:
"Using statistics we model the data and try to understand the underlying processes. In machine learning we don’t need to understand the processes underlying the data. We just send the data to an algorithm and it predicts, learning form the data we trained it with."
We do have techniques for understanding the processes underlying machine learning methods, which is really what makes them the driving force for the future. However, people haven't done much to spread excitement about understanding the internals of these techniques, since the focus in industry has been on how to scale them to more data and make faster decisions on streams of data.
However, I agree with the following:
“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools” - Leo Breiman
And, while I have some quibbles with the following, I feel it at least provides a reasonable picture of the relationship between mathematics and statistics, and between statistics and machine learning:
“Machine Learning is to statistics what statistics is to mathematics, a more practical abstraction”. - Anonymous
Additional sources:
I will leave you with one last graphic from the famous Tibshirani of Stanford: