Timothy Helton
NOTE:
This notebook uses code found in the k2datascience.pca module.
To execute all the cells do one of the following items:
In [ ]:
from k2datascience import pca
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
The dataset consists of 26,000 people counts (taken about every 10 minutes) over the last year. In addition, I gathered extra information, including weather and semester-specific details, that might affect how crowded the gym is. The label is the number of people, which I'd like to predict given some subset of the features.
Label:
Features:
Task - We are going to apply Principal Component Analysis (PCA) to the given dataset using scikit-learn (bonus points if you use your own optimized Python version). We want to find the components with the maximum variance. Components (transformed features) with little or no variance are dropped, and machine learning models are then trained on the transformed dataset.
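As a rough illustration of the task above, here is a minimal scikit-learn sketch. It does not use the k2datascience.pca module or the gym data. The synthetic array `X`, the name `pca_model`, and the choice of three components are all assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 samples, 4 features,
# where the last feature is nearly a copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])

# Standardize so each feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and inspect the variance explained by each.
pca_model = PCA()
pca_model.fit(X_scaled)
explained = pca_model.explained_variance_ratio_

# Keep only the leading components and project the data onto them.
reduced = PCA(n_components=3).fit_transform(X_scaled)
print(explained.round(3))
print(reduced.shape)
```

Because the fourth feature is almost redundant, the last component should carry close to zero variance, which is the kind of component the task says to drop.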
number_people.
In [ ]:
gym = pca.Gym(label_column='people')
gym.data_name = 'Gym'
gym.feature_columns = [
'day_number',
'weekend',
'holiday',
'apparent_temp',
'temp',
'start_of_semester',
'seconds',
]
header = '#' * 25
print(f'\n\n{header}\n### Data Head\n{header}')
gym.data.head()
print(f'\n\n{header}\n### Data Overview\n{header}')
gym.data.info()
print(f'\n\n{header}\n### Summary Statistics\n{header}')
gym.data.describe()
print(f'\n\n{header}\n### Absolute Correlation\n{header}')
(gym.data.corr()
.people
.abs()
.sort_values(ascending=False))
In [ ]:
gym.plot_correlation_heatmap()
In [ ]:
gym.plot_correlation()
In [ ]:
gym.plot_variance(fig_size=(14,4))
In [ ]:
gym.scree_plot()
In [ ]:
gym.n_components = 5
gym.calc_components()
gym.plot_variance()
gym.scree_plot()
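The cell above fixes the number of components after reading the variance and scree plots. One common alternative is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 95% threshold and uses made-up variance ratios. The helper `components_for_variance` is hypothetical and is not part of k2datascience.pca.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained variance ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted finds the first index where cumulative >= threshold.
    index = int(np.searchsorted(cumulative, threshold, side='left'))
    # Cap at the total number of components if the threshold is never met.
    return min(index + 1, len(cumulative))

# Hypothetical ratios, for illustration only (not taken from the gym data).
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.03, 0.01])
n_keep = components_for_variance(ratios, threshold=0.95)
print(n_keep)
```

scikit-learn's `PCA` offers a similar shortcut: passing a float between 0 and 1 as `n_components` with `svd_solver='full'` selects the number of components needed to explain that fraction of the variance.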
How can we tell how great a movie is before it is released in cinemas?
This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to accumulate a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.
To answer this question, I scraped data on 5000+ movies from the IMDB website using a Python library called "scrapy".
The scraping process took 2 hours to finish. In the end, I was able to obtain all 28 needed variables for 5043 movies and 4906 posters (998 MB), spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. Below are the 28 variables:
"movie_title" "color" "num_critic_for_reviews" "movie_facebook_likes" "duration" "director_name" "director_facebook_likes" "actor_3_name" "actor_3_facebook_likes" "actor_2_name" "actor_2_facebook_likes" "actor_1_name" "actor_1_facebook_likes" "gross" "genres" "num_voted_users" "cast_total_facebook_likes" "facenumber_in_poster" "plot_keywords" "movie_imdb_link" "num_user_for_reviews" "language" "country" "content_rating" "budget" "title_year" "imdb_score" "aspect_ratio"
Task - We are going to apply Principal Component Analysis (PCA) to the given dataset using scikit-learn (bonus points if you use your own optimized Python version). We want to find the components with the maximum variance. Components (transformed features) with little or no variance are dropped, and machine learning models are then trained on the transformed dataset.
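For the "own Python version" mentioned in the task, one possible approach is PCA through a singular value decomposition in NumPy. This is only a sketch under stated assumptions: the function `pca_svd` and the synthetic data are hypothetical, and it is not the implementation in k2datascience.pca.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its top principal components using SVD."""
    # Center each column; PCA is defined on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Variance along each direction, using the n - 1 sample convention.
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    # Scores are the centered data expressed in the leading directions.
    scores = X_centered @ Vt[:n_components].T
    return scores, ratio[:n_components]

# Synthetic data (hypothetical) with very different column scales.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
scores, ratio = pca_svd(X, n_components=2)
print(scores.shape)
print(ratio.round(3))
```

This sketch only centers the data. If the columns are on very different scales, as the movie variables are likely to be, they would usually be standardized first so that no single large-valued column dominates the components.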
In [ ]:
movie = pca.Movies(label_column='imdb_score')
movie.data_name = 'Movie'
movie.feature_columns = movie.data_numeric.columns
header = '#' * 25
print(f'\n\n{header}\n### Data Head\n{header}')
movie.data.head()
print(f'\n\n{header}\n### Data Overview\n{header}')
movie.data.info()
print(f'\n\n{header}\n### Summary Statistics\n{header}')
movie.data.describe()
print(f'\n\n{header}\n### Absolute Correlation\n{header}')
(movie.data.corr()
.imdb_score
.abs()
.sort_values(ascending=False))
In [ ]:
movie.top_correlation_joint_plots()
In [ ]:
movie.plot_correlation_heatmap()
In [ ]:
movie.plot_variance(fig_size=(14,4))
In [ ]:
movie.scree_plot()
In [ ]:
movie.n_components = 10
movie.calc_components()
movie.plot_variance()
movie.scree_plot()
In [ ]:
movie.plot_component_2_vs_1()
In [ ]:
movie.plot_componets_1_2_3()