Tools Used

Unlike most books, this mega-tutorial consistently makes use of a variety of tools. The advantage of this strategy is that it lets us work more quickly once we know how the tools work. The major disadvantage is that it requires learning several different libraries and tools rather than just NumPy and scikit-learn.

The Benefits of Using a Single Dataset

Throughout this mega-tutorial I make use of only a single dataset. Focusing on just one dataset the entire time allows us to work the way data scientists do in the real world, where they typically spend a relatively long period of time inspecting several different aspects of one dataset.

Why the Yelp Dataset?

Good Size

A Note on Terminology and Notation

Analytics has become such an interdisciplinary subject over the years that it seems like every single concept has at least five names. In the literature, the $\vec{x}$ in $p(y|\vec{x})$ can be referred to by any of the following names:

Regressors
Features (machine learning)
Input variables (machine learning)
Independent variables
Exogenous variables
Predictor variables
Explanatory variables

As I am an econometrician by training, I will usually refer to $\vec{x}$ by the name used in econometrics. I believe that practitioners in the field will inevitably have to put up with the mixed-up jargon for some time to come.

See the glossary in this mega-tutorial for synonyms of the terms and acronyms used in this book.
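
To make the notation concrete, here is a minimal sketch of what $\vec{x}$ and $y$ look like in code. It is only an illustration: the toy values, the names `X`, `y`, `true_beta`, and `beta_hat`, and the choice of ordinary least squares as the model are all assumptions made for this example, not something taken from the analysis in this mega-tutorial. Each row of `X` is one observation's $\vec{x}$, whichever of the names above you prefer for it.

```python
import numpy as np

# Hypothetical toy data (values invented for illustration):
# 5 observations, each with 2 regressors / features / predictors.
X = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])

# Assumed "true" coefficients, used only to generate a noiseless response.
true_beta = np.array([2.0, -1.0])

# y is the response (also called the dependent, endogenous, or target variable).
y = X @ true_beta

# Ordinary least squares: one simple way to model the conditional mean of y given x.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("shape of X (observations, regressors):", X.shape)
print("estimated coefficients:", beta_hat)
```

Because the response here is generated without noise, the estimated coefficients should match `true_beta` up to floating-point error. With real data they would only approximate whatever relationship exists.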

The Appendix

The appendix covers topics that are, strictly speaking, not necessary to apply the modeling techniques implemented in this tutorial.

They are important, however, for those who use these modeling techniques in settings where the results matter.