Unsupervised Learning - Principal Components Analysis

Timothy Helton



NOTE:
This notebook uses code found in the k2datascience.pca module. To execute all the cells, do one of the following (a minimal sys.path sketch follows this list):

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHONPATH environment variable.
  • Create a link to the pca.py file in the same directory as this notebook.
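
If the package is not installed, the import path can also be extended at runtime. This is a minimal sketch; the clone location ~/k2datascience is an assumption and should be adjusted to match your checkout.

import sys
from pathlib import Path

# Hypothetical clone location -- adjust to wherever the repository lives.
repo = Path.home() / 'k2datascience'
sys.path.insert(0, str(repo))

from k2datascience import pca  # should now resolve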


Imports


In [ ]:
from k2datascience import pca

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Exercise 1 - Crowdedness at the Campus Gym

The dataset consists of 26,000 people counts (about every 10 minutes) over the last year. In addition, I gathered extra information, such as weather and semester-specific data, that might affect how crowded the gym is. The label is the number of people, which I'd like to predict given some subset of the features.

Label:

  • Number of people

Features:

  • timestamp (int; number of seconds since beginning of day)
  • day_of_week (int; 0 - 6)
  • is_weekend (int; 0 or 1)
  • is_holiday (int; 0 or 1)
  • apparent_temperature (float; degrees Fahrenheit)
  • temperature (float; degrees Fahrenheit)
  • is_start_of_semester (int; 0 or 1)

Based on the Kaggle dataset.

Task - We are going to apply Principal Component Analysis to the given dataset using scikit-learn (bonus points if you use your own optimized Python version). We want to find the components with the maximum variance. Features with little or no variance are dropped, and machine learning models are then trained on the transformed dataset. A standalone scikit-learn sketch follows the task list.

  1. Read in the gym dataset.
  2. Explore the data and the summary statistics, and identify any strong positive or negative correlations between the features.
  3. Convert temperature and apparent temperature from Fahrenheit to Celsius.
  4. Extract the features to a new dataframe. The column you would eventually predict is number_people.
  5. Make a heatmap of the correlation.
  6. Run PCA on the feature dataframe, and plot the explained variance ratio of the principal components.
  7. Which components would you drop and why?
  8. Re-run PCA on the feature dataframe, restricting it to the number of principal components you want and plot the explained variance ratios again.
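
The cells below use the pca.Gym wrapper from the k2datascience package. For reference, here is a minimal sketch of the same workflow using scikit-learn directly; the file name gym.csv is an assumption, and the column names follow the feature list above.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Read in the gym dataset (file name is an assumption).
data = pd.read_csv('gym.csv')

# 3. Fahrenheit -> Celsius: C = (F - 32) * 5 / 9
for col in ('temperature', 'apparent_temperature'):
    data[col] = (data[col] - 32) * 5 / 9

# 4. Keep only the numeric features; number_people is the label.
features = (data.drop(columns='number_people')
            .select_dtypes(include='number'))

# 5. Correlation heatmap (seaborn.heatmap is a common alternative).
plt.matshow(features.corr())
plt.colorbar()

# 6. PCA on standardized features; plot the explained variance ratios.
model = PCA().fit(StandardScaler().fit_transform(features))
plt.figure()
plt.bar(range(1, model.n_components_ + 1), model.explained_variance_ratio_)
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.show()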

In [ ]:
gym = pca.Gym(label_column='people')
gym.data_name = 'Gym'
gym.feature_columns = [
    'day_number',
    'weekend',
    'holiday',
    'apparent_temp',
    'temp',
    'start_of_semester',
    'seconds',
]

header = '#' * 25
print(f'\n\n{header}\n### Data Head\n{header}')
gym.data.head()

print(f'\n\n{header}\n### Data Overview\n{header}')
gym.data.info()

print(f'\n\n{header}\n### Summary Statistics\n{header}')
gym.data.describe()

print(f'\n\n{header}\n### Absolute Correlation\n{header}')
(gym.data.corr()
 .people
 .abs()
 .sort_values(ascending=False))

In [ ]:
gym.plot_correlation_heatmap()

In [ ]:
gym.plot_correlation()

Findings

  • The two temperature variables show a weak correlation to the number of people in the gym.
  • The following variables show minimal correlation to number of people in the gym.
    • day_number
    • weekend
    • start_of_semester
  • The holiday variable shows no correlation.

Run PCA

In [ ]:
gym.plot_variance(fig_size=(14,4))

In [ ]:
gym.scree_plot()

Findings

  • Based on the PCA results, the last two principal components will be neglected (a quick cumulative-variance check follows below).
    • For initial investigations, neglecting the last three principal components would be justifiable.
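
A quick way to sanity-check this choice is to print the cumulative explained variance per component count; this sketch assumes a fitted scikit-learn PCA object named model, as in the sketch after the task list.

import numpy as np

# Cumulative explained variance for each number of retained components.
cumulative = np.cumsum(model.explained_variance_ratio_)
for n, total in enumerate(cumulative, start=1):
    print(f'{n} components -> {total:.1%} of the variance')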

In [ ]:
gym.n_components = 5
gym.calc_components()
gym.plot_variance()
gym.scree_plot()

Exercise 2 - IMDB Movie Data

How can we tell the greatness of a movie before it is released in cinemas?

This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to gather a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.

To answer this question, I scraped 5000+ movies from the IMDB website using a Python library called "scrapy".

The scraping process took 2 hours to finish. In the end, I was able to obtain all 28 needed variables for 5043 movies and 4906 posters (998 MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. Below are the 28 variables:

"movie_title" "color" "num_critic_for_reviews" "movie_facebook_likes" "duration" "director_name" "director_facebook_likes" "actor_3_name" "actor_3_facebook_likes" "actor_2_name" "actor_2_facebook_likes" "actor_1_name" "actor_1_facebook_likes" "gross" "genres" "num_voted_users" "cast_total_facebook_likes" "facenumber_in_poster" "plot_keywords" "movie_imdb_link" "num_user_for_reviews" "language" "country" "content_rating" "budget" "title_year" "imdb_score" "aspect_ratio"

Based on the Kaggle dataset.

Task - We are going to apply Principal Component Analysis to the given dataset using scikit-learn (bonus points if you use your own optimized Python version). We want to find the components with the maximum variance. Features with little or no variance are dropped, and machine learning models are then trained on the transformed dataset. A standalone scikit-learn sketch of the preprocessing and PCA steps follows the task list.

  1. Read in the movie dataset.
  2. Explore the data and the summary statistics, and identify any strong positive or negative correlations between the features.
  3. Some columns contain numbers, while others contain words. Do some filtering to extract only the numbered columns and not the ones with words into a new dataframe.
  4. Remove null values and standardize the values.
  5. Create hexbin visualizations to get a feel for how the correlations between different features compare to one another. Can you draw any conclusions about the features?
  6. Create a heatmap of the Pearson correlation of the movie features. Detail your observations.
  7. Perform PCA on the dataset, and plot the individual and cumulative explained variance superimposed on the same graph.
  8. How many components do you want to use? Implement PCA and transform the dataset.
  9. Create 2D and 3D scatter plots of the first two and the first three components, respectively.
  10. Do you notice any distinct clusters in the plots? (For future clustering assignment)
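
As in Exercise 1, the cells below rely on a wrapper class (pca.Movies). A minimal scikit-learn sketch of steps 3-7 follows; the file name movie_metadata.csv matches the usual Kaggle download but is an assumption here.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

raw = pd.read_csv('movie_metadata.csv')

# 3. Keep only the numeric columns.
numeric = raw.select_dtypes(include='number')

# 4. Remove null values and standardize.
X_std = StandardScaler().fit_transform(numeric.dropna())

# 7. Individual and cumulative explained variance on the same axes.
model = PCA().fit(X_std)
ratios = model.explained_variance_ratio_
ticks = np.arange(1, ratios.size + 1)
plt.bar(ticks, ratios, alpha=0.5, label='individual')
plt.step(ticks, np.cumsum(ratios), where='mid', label='cumulative')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.show()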

In [ ]:
movie = pca.Movies(label_column='imdb_score')
movie.data_name = 'Movie'
movie.feature_columns = movie.data_numeric.columns

header = '#' * 25
print(f'\n\n{header}\n### Data Head\n{header}')
movie.data.head()

print(f'\n\n{header}\n### Data Overview\n{header}')
movie.data.info()

print(f'\n\n{header}\n### Summary Statistics\n{header}')
movie.data.describe()

print(f'\n\n{header}\n### Absolute Correlation\n{header}')
(movie.data.corr()
 .imdb_score
 .abs()
 .sort_values(ascending=False))

In [ ]:
movie.top_correlation_joint_plots()

Findings

  • From the top four correlation joint plots, it appears that the IMDB score does not have any dominant drivers.

In [ ]:
movie.plot_correlation_heatmap()

In [ ]:
movie.plot_variance(fig_size=(14,4))

In [ ]:
movie.scree_plot()

Findings

  • For this dataset, the first ten principal components will be retained (see the sketch below).
    • This captures just shy of 90% of the variance.
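
Rather than hard-coding the component count, scikit-learn's PCA accepts a float between 0 and 1 for n_components, in which case it keeps the smallest number of components whose cumulative explained variance reaches that fraction. A short sketch, assuming the standardized matrix X_std from the sketch above:

from sklearn.decomposition import PCA

# Keep as many components as needed to explain 90% of the variance.
model = PCA(n_components=0.90)
X_reduced = model.fit_transform(X_std)
print(model.n_components_)  # components actually kept (per the findings, ~10)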

In [ ]:
movie.n_components = 10
movie.calc_components()
movie.plot_variance()
movie.scree_plot()

In [ ]:
movie.plot_component_2_vs_1()

In [ ]:
movie.plot_componets_1_2_3()
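
For reference, the 2D and 3D scatter plots produced by the wrapper methods above can be sketched with plain matplotlib; this assumes the transformed matrix X_reduced from the earlier sketch.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

fig = plt.figure(figsize=(12, 5))

# First two components.
ax2d = fig.add_subplot(121)
ax2d.scatter(X_reduced[:, 0], X_reduced[:, 1], s=5)
ax2d.set_xlabel('PC 1')
ax2d.set_ylabel('PC 2')

# First three components.
ax3d = fig.add_subplot(122, projection='3d')
ax3d.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], s=5)
ax3d.set_xlabel('PC 1')
ax3d.set_ylabel('PC 2')
ax3d.set_zlabel('PC 3')

plt.show()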

Findings

  • Possible clusters appear to exist for movies with low and high IMDB scores.
  • The middle region of the data does not exhibit an identifiable pattern.