What is exploratory data analysis?

Why we EDA

Sometimes the consumer of your analysis won't understand why you need the time for EDA and will want results NOW! Here are some of the reasons you can give to convince them it's a good use of time for everyone involved.

Reasons for the analyst

  • Identify patterns and develop hypotheses.
  • Test technical assumptions.
  • Inform model selection and feature engineering.
  • Build an intuition for the data.

Reasons for the consumer of the analysis

  • Ensures delivery of technically sound results.
  • Ensures the right question is being asked.
  • Tests business assumptions.
  • Provides context necessary for maximum applicability and value of results.
  • Leads to insights that would otherwise not be found.

Things to keep in mind

  • You're never done with EDA. With every analytical result, you want to return to EDA, make sure the result makes sense, and test other questions that come up because of it.
  • Stay open-minded. You're supposed to be challenging your assumptions and those of the stakeholder you're performing the analysis for.
  • Repeat EDA for every new problem. Just because you've done EDA on a dataset before doesn't mean you shouldn't do it again for the next problem. You need to look at the data through the lens of the problem at hand, and you will likely have different areas of investigation.

The game plan

Exploratory data analysis consists of the following major tasks. We present them linearly here because each task makes little sense without the ones before it. In reality, however, you are going to jump constantly from step to step. You may want to do all the steps for a subset of the variables first. Often, an observation will raise a question you want to investigate, and you'll branch off to answer it before returning to the main path of exhaustive EDA.

  1. Form hypotheses/develop investigation themes to explore
  2. Wrangle data
  3. Assess data quality and profile
  4. Explore each individual variable in the dataset
  5. Assess the relationship between each variable and the target
  6. Assess interactions between variables
  7. Explore data across many dimensions

Throughout the entire analysis you want to:

  • Capture a list of hypotheses and questions that come up for further exploration.
  • Record things to watch out for or be aware of in future analyses.
  • Show intermediate results to colleagues. Don't do EDA in a bubble! Get feedback throughout, especially from people removed from the problem and/or with relevant domain knowledge, so you gain a fresh perspective.
  • Position visuals and results together. EDA relies on your natural pattern-recognition abilities, so maximize what you'll find by putting visualizations and results in close proximity.

1. Brainstorm areas of investigation

Yes, you're exploring, but that doesn't mean it's a free-for-all.

  • What do you need to understand the question you're trying to answer?
  • List your areas of investigation before diving in, and update the list throughout the analysis.

2. Wrangle the data

  • Make your data tidy.
    1. Each variable forms a column
    2. Each observation forms a row
    3. Each type of observational unit forms a table
  • Transform data
    • Log
    • Binning
    • Aggregation into higher-level categories
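
The wrangling steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library so that it runs anywhere; in practice you would likely reach for a dataframe library. The table, the bin thresholds, and the store-to-region mapping are all made up for the example.

```python
import math

# Hypothetical "wide" table: one row per store, one column per year.
wide = [
    {"store": "A", "2021": 120.0, "2022": 150.0},
    {"store": "B", "2021": 900.0, "2022": 40.0},
]

# Tidy it: one row per observation (store, year), one column per variable.
tidy = [
    {"store": row["store"], "year": int(year), "sales": value}
    for row in wide
    for year, value in row.items()
    if year != "store"
]

def sales_bin(sales):
    """Bin a continuous value into coarse bands (thresholds are assumptions)."""
    if sales < 100:
        return "low"
    if sales < 500:
        return "medium"
    return "high"

# Hypothetical mapping used to roll stores up into higher-level categories.
region_of = {"A": "north", "B": "south"}

for obs in tidy:
    obs["log_sales"] = math.log(obs["sales"])   # log transform (needs sales > 0)
    obs["sales_bin"] = sales_bin(obs["sales"])  # binning
    obs["region"] = region_of[obs["store"]]     # aggregation to a higher level

for obs in tidy:
    print(obs)
```

Keeping the original `sales` column next to each transformed column makes it easy to compare the raw and transformed views later.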

3. Assess data quality and profile

  • What data isn’t there?
  • Is the data that is there right?
  • Is the data being generated in the way you think?
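
One way to turn these three questions into quick checks is sketched below, again with only the standard library. The records, the plausible age range of 0 to 120, and the expectation that `id` is unique are assumptions made for illustration; your own data will need its own rules.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": None, "country": "US"},
    {"id": 3, "age": 212, "country": None},
    {"id": 3, "age": 212, "country": None},  # same id recorded twice
]
fields = ["id", "age", "country"]

# What data isn't there? Count missing values per field.
missing = {f: sum(r[f] is None for r in records) for f in fields}

# Is the data that is there right? Flag values outside a plausible range.
implausible_ages = sorted(
    {r["age"] for r in records if r["age"] is not None and not 0 <= r["age"] <= 120}
)

# Is the data generated the way you think? If ids should be unique,
# repeated ids suggest double logging or a misunderstood grain.
id_counts = Counter(r["id"] for r in records)
repeated_ids = [i for i, n in id_counts.items() if n > 1]

print("missing per field:", missing)
print("implausible ages:", implausible_ages)
print("repeated ids:", repeated_ids)
```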

4. Explore each individual variable in the dataset

  • What does each field in the data look like?
  • How can each variable be described by a few key values?
  • Are the assumptions often made in modeling valid?
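
As a rough sketch of describing one variable with a few key values, the snippet below summarizes a made-up numeric field and a made-up categorical field. Comparing the mean with the median is only one crude check of a common modeling assumption (rough symmetry); it is not a substitute for plotting the distribution.

```python
import statistics
from collections import Counter

# Hypothetical fields for illustration.
values = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 40.0]
categories = ["a", "b", "a", "c", "a", "b"]

# A few key values that describe the numeric field.
summary = {
    "n": len(values),
    "min": min(values),
    "median": statistics.median(values),
    "mean": statistics.fmean(values),
    "max": max(values),
    "stdev": statistics.stdev(values),
}

# A mean far from the median hints at skew or outliers, which matters
# for methods that assume roughly symmetric data.
mean_minus_median = summary["mean"] - summary["median"]

# For a categorical field, the frequency of each level is the key summary.
level_counts = Counter(categories).most_common()

print(summary)
print("mean - median:", mean_minus_median)
print(level_counts)
```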

5. Assess the relationship between each variable and the target

How does each variable interact with the target variable?

Assess each relationship’s:

  • Linearity
  • Direction
  • Rough size
  • Strength
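
The four properties above can be approximated numerically. The sketch below is one possible approach, not the only one: the least-squares slope stands in for direction and rough size, the Pearson correlation for linear strength, and the gap between a rank (Spearman-style) correlation and the Pearson correlation as a hint of nonlinearity. The data is invented, and the rank helper assumes there are no tied values.

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def ranks(vs):
    """Ranks 1..n; assumes no ties to keep the sketch short."""
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    result = [0] * len(vs)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

# Hypothetical variable and target: monotonic but curved.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [1.0, 4.0, 9.0, 16.0, 25.0]

size = slope(x, target)                           # rough size: target units per unit of x
direction = "positive" if size > 0 else "negative"
r_linear = pearson(x, target)                     # strength of the linear relationship
r_rank = pearson(ranks(x), ranks(target))         # strength of the monotonic relationship

print(direction, size, round(r_linear, 3), round(r_rank, 3))
```

When the rank correlation is noticeably higher than the linear one, as it is here, the relationship is probably monotonic but not a straight line, and a plot or a transformation is worth a look.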

6. Assess interactions between the variables

  • How do the variables interact with each other?
  • What is the linearity, direction, rough size, and strength of the relationships between pairs of variables?
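
A simple starting point for pairwise interactions is to compute a correlation for every pair of numeric variables and look at the strongest ones first. The sketch below does this with the standard library; the column names and values are hypothetical, and Pearson correlation only captures linear relationships.

```python
from itertools import combinations
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical numeric columns, one list per variable.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [52.0, 58.0, 69.0, 77.0, 90.0],
    "commute_min": [30.0, 10.0, 45.0, 20.0, 30.0],
}

# Correlation for every unordered pair of variables.
pairwise = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

# The pair with the largest absolute correlation is a good first place to look.
strongest = max(pairwise, key=lambda pair: abs(pairwise[pair]))

for pair, r in sorted(pairwise.items(), key=lambda item: -abs(item[1])):
    print(pair, round(r, 3))
```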

7. Explore data across many dimensions

Are there patterns across many of the variables?
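
One illustrative way to look across several dimensions at once is to group by combinations of variables rather than one variable at a time. In the made-up data below, the conversion rate looks identical for each region when viewed alone, and the pattern only appears when region and device are considered together. The field names are assumptions for the example.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical observations with two dimensions and one outcome.
rows = [
    {"region": "north", "device": "mobile", "converted": 1},
    {"region": "north", "device": "mobile", "converted": 1},
    {"region": "north", "device": "desktop", "converted": 0},
    {"region": "north", "device": "desktop", "converted": 0},
    {"region": "south", "device": "mobile", "converted": 0},
    {"region": "south", "device": "mobile", "converted": 0},
    {"region": "south", "device": "desktop", "converted": 1},
    {"region": "south", "device": "desktop", "converted": 1},
]

def rate_by(rows, *dims):
    """Mean of 'converted' for each combination of the given dimensions."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in dims)].append(row["converted"])
    return {key: fmean(vals) for key, vals in groups.items()}

by_region = rate_by(rows, "region")                    # one dimension: looks flat
by_region_device = rate_by(rows, "region", "device")   # two dimensions: pattern appears

print(by_region)
print(by_region_device)
```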

Our objectives for this tutorial

Our objectives for this tutorial are to help you:

  • Develop the EDA mindset
    • Questions to consider while exploring
    • Things to look out for
  • Learn basic methods for effective EDA
    • Slicing and dicing
    • Calculating summary statistics
    • Basic plotting
    • Basic mapping
    • Using widgets for interactive exploration

The actual exploration you do in this tutorial is yours. We have no answers or set of conclusions we think you should come to about the datasets. Our goal is simply to aid in making your exploration as effective as possible.