The aim of these labs is to reinforce the concepts that we discuss in each module's videos. These labs will revolve around the use case of food recipes, and together, we will walk through the process that data scientists usually follow when trying to solve a problem. Let's get started!
In this lab, we will start learning about the data science methodology, and focus on the Business Understanding and the Analytic Approach stages.
This is the Data Science Methodology, a flowchart that begins with business understanding.
Suppose we are interested in automating the process of figuring out the cuisine of a given dish or recipe. Let's apply the business understanding stage to this problem.
At this point, we realize that automating the process of determining the cuisine of a given dish is not a straightforward problem, because we need an approach that is robust to the many cuisines and their variations.
As you may have guessed, determining the cuisine of a given dish based on its ingredients seems like a viable solution, since some ingredients are unique to particular cuisines. For example:
When we talk about American cuisines, the first ingredient that comes to one's mind (or at least to my mind =D) is beef or turkey.
When we talk about British cuisines, the first ingredient that comes to one's mind is haddock or mint sauce.
When we talk about Canadian cuisines, the first ingredient that comes to one's mind is bacon or poutine.
When we talk about French cuisines, the first ingredient that comes to one's mind is bread or butter.
When we talk about Italian cuisines, the first ingredient that comes to one's mind is tomato or ricotta.
When we talk about Japanese cuisines, the first ingredient that comes to one's mind is seaweed or soy sauce.
When we talk about Chinese cuisines, the first ingredient that comes to one's mind is ginger or garlic.
When we talk about Indian cuisines, the first ingredient that comes to one's mind is masala or chillis.
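To make the idea concrete, here is a minimal sketch of an ingredient-based guess. The lookup table simply restates the examples listed above, and the names `SIGNATURE_INGREDIENTS` and `guess_cuisines` are hypothetical, introduced only for illustration. It is not a real model, just a way to see why ingredients carry a signal about cuisine.

```python
# Hypothetical lookup that restates the examples from the text; not real data.
SIGNATURE_INGREDIENTS = {
    "American": {"beef", "turkey"},
    "British": {"haddock", "mint sauce"},
    "Canadian": {"bacon", "poutine"},
    "French": {"bread", "butter"},
    "Italian": {"tomato", "ricotta"},
    "Japanese": {"seaweed", "soy sauce"},
    "Chinese": {"ginger", "garlic"},
    "Indian": {"masala", "chillis"},
}


def guess_cuisines(recipe_ingredients):
    """Return the cuisines that share the most signature ingredients with the recipe.

    Returns an empty list when no signature ingredient matches.
    """
    recipe = {item.lower() for item in recipe_ingredients}
    scores = {
        cuisine: len(signature & recipe)
        for cuisine, signature in SIGNATURE_INGREDIENTS.items()
    }
    best = max(scores.values())
    if best == 0:
        return []
    # Several cuisines can tie, so return all of them in alphabetical order.
    return sorted(cuisine for cuisine, score in scores.items() if score == best)


print(guess_cuisines(["Seaweed", "rice", "soy sauce"]))
print(guess_cuisines(["garlic", "tomato"]))
```

A recipe that mixes ingredients from several cuisines produces a tie, which hints at why a hand-written lookup is unlikely to be enough on its own.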
Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem. This step entails expressing the problem in the context of statistical and machine-learning techniques, so that the entity or stakeholders with the problem can identify the most suitable techniques for the desired outcome.
This is a decision tree that a naive person might create manually. Starting at the top with all the recipes for all the cuisines in the world, if a recipe contains rice, then this decision tree classifies it as Japanese cuisine. Otherwise, it classifies it as not Japanese cuisine.
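The single-rule tree described here can be written as a one-line function. The sketch below does that and runs it on three made-up recipes (the recipes and their labels are assumptions for illustration, not data from the lab) to show where such a rule goes wrong.

```python
def naive_tree(ingredients):
    """Single decision node: a recipe containing rice is labeled Japanese."""
    items = {item.lower() for item in ingredients}
    return "Japanese" if "rice" in items else "Not Japanese"


# Hypothetical recipes, each paired with the cuisine we would expect.
recipes = {
    "sushi roll": (["rice", "seaweed", "wasabi"], "Japanese"),
    "risotto": (["rice", "butter", "parmesan"], "Italian"),
    "miso soup": (["miso", "tofu", "seaweed"], "Japanese"),
}

for name, (ingredients, expected) in recipes.items():
    print(f"{name}: predicted {naive_tree(ingredients)}, expected {expected}")
```

Under these assumed recipes, the rule labels the risotto as Japanese and the miso soup as not Japanese, so one split on rice is clearly too coarse.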
The following decision tree would be considered the optimal model for classifying the data into a node for green triangles and a node for red circles.
Each leaf node is completely pure; that is, each leaf node contains only datapoints that belong to the same class.
On the other hand, the following decision tree is an example of the worst-case scenario that the model could output.
Each leaf node contains datapoints belonging to both classes, resulting in many datapoints ultimately being misclassified.
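Purity can be quantified. The lab does not name a measure, so the sketch below assumes one common choice, Gini impurity, which is 0 for a pure leaf and grows as classes mix. The function name `gini_impurity` and the example leaves are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a leaf: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0  # an empty leaf has nothing to misclassify
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# A pure leaf, as in the best-case tree: every datapoint is a green triangle.
pure_leaf = ["triangle"] * 4
# A mixed leaf, as in the worst-case tree: half triangles, half red circles.
mixed_leaf = ["triangle", "circle", "triangle", "circle"]

print(gini_impurity(pure_leaf))   # 0.0
print(gini_impurity(mixed_leaf))  # 0.5
```

With two classes, 0.5 is the largest value this measure can take, which matches the intuition that an evenly mixed leaf is the worst case.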
Now let's put what we learned about decision trees to use. Let's try and build a much better version of the decision tree for our recipe problem.
I hope you agree that the above decision tree is a much better version than the previous one. Although we are still using Rice as the ingredient in the first decision node, recipes now get divided into Asian Food and Non-Asian Food. Asian Food is then further divided into Japanese and Not Japanese based on the Wasabi ingredient. This process of splitting leaf nodes continues until each leaf node is pure, i.e., contains recipes belonging to only one cuisine.
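The first two levels of this improved tree can be sketched as nested conditions. This is only an illustration of the splits named in the text (rice, then wasabi); the function name and the returned labels are assumptions, and a real tree would keep splitting the remaining leaves.

```python
def improved_tree(ingredients):
    """Two-level tree: split on rice first, then on wasabi within Asian food."""
    items = {item.lower() for item in ingredients}
    if "rice" not in items:
        return "Non-Asian"
    if "wasabi" in items:
        return "Japanese"
    return "Asian, not Japanese"


print(improved_tree(["rice", "seaweed", "wasabi"]))
print(improved_tree(["rice", "ginger", "garlic"]))
print(improved_tree(["bread", "butter"]))
```

The "Asian, not Japanese" and "Non-Asian" leaves still mix several cuisines, so they would need further splits on other ingredients before every leaf is pure.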
Accordingly, decision trees are a suitable technique for our recipe case study.
This notebook was created by Alex Aklson. I hope you found this lab session interesting. Feel free to contact me if you have any questions!
This notebook is part of a course on Coursera called Data Science Methodology. If you accessed this notebook outside the course, you can take this course online by clicking here.