This is the sum total of all graduate program knowledge... Compared to what? According to whom?
Ingest a csv file as pure text, then as list of lines, and then as a data frame.
Right now, the file is in a 'narrow' format. In other words, several interesting bits are collapsed into a single field. Let's attempt to make the data frame a 'wide' format. All the collapsed items expanded horizontally.
Wide data sets are good for exploration, but 'long' data sets are good for training. Let's attempt to expand all the collapsed field vertically instead of horizontally. Does this result in data duplication? What do you think about that? Yes and No are both correct -- but what's the context?
In [1]:
import csv
with open('tmdb_5000_movies.csv',newline='') as csvfile:
filereader = csv.reader(csvfile, delimiter=',', quotechar='"')
# for row in filereader:
# print(', '.join(row))
In [2]:
f = open('tmdb_5000_movies.csv', "r", encoding="utf8")
line=f.readline()
while line:
# print(line)
line = f.readline()
f.close()
In [ ]:
In [4]:
# 1. Read csv file into a dataframe
#
import pandas as pd
df_movies = pd.read_csv('tmdb_5000_movies.csv')
In [5]:
# 2. Print a few rows to check it out.
#
df_movies.head()
Out[5]:
Here is the plan:
First, I wanted to know how many rows were in the dataframe, because I would have to iterate through them.
Second, I was going to determine what all of the unique genres were, because I would have to add columns for each of them. Each column would store a value where "1" means the movie belongs to that genre. A "0" value means that it does not.
This is accomplished by expanding each movie's list of genres, then accumulating a list of all of them. Then, I'll get rid of the duplicates, so that I have a list of all the unique occurences. This turns out to be 20, so I will create a dataframe that is 4803 rows x 20 columns (df_movie_genre_3)
I will also create a dataframe that has each movie and seven columns to hold the genres that are assigned to it. If a movie only has 3 genres, the other columns have a value of 'None.'(df_movie_genre_4)
Then I will iterate through df_movie_genre_4 and put a 1 in the column whose name matches the genre (ex. Action genre puts a '1' in the Action column.)
In [6]:
# 3. Count how many rows you're dealing with.
#
x = df_movies.genres
len(x)
movie_cnt = len(x)
movie_cnt
Out[6]:
In [7]:
# 4. Iterate through indices of genre array to create a list of lists of genre names.
# There are 4803 lists
#
import json
genresAll = []
for k,x in enumerate(df_movies.genres):
genreList = []
genre = json.loads(df_movies.genres[k])
for i,val in enumerate(genre):
genreList.append(genre[i]['name'])
genresAll.append(genreList)
#genresAll
In [8]:
# 5. Use itertools to break up lists into individual genres.
# For example ([1,2,3],[4,5,6]) becomes (1,2,3,4,5,6)
#
from itertools import *
glist = []
for i in chain(genresAll):
for g in i:
# print(g)
# yield g
glist.append(g)
genres = glist
In [11]:
# 6. This gets rid of all the duplicates by making it a set.
#
unique_genres = set(genres)
unique_genres
Out[11]:
In [12]:
#7.
#
df_movie_genre_1 = pd.DataFrame(df_movies.title)
df_movie_genre_1.head()
Out[12]:
In [13]:
# 8. This is the dataframe that is 4803 rows x 20 columns, one for each genre.
#
df_movie_genre_2 = pd.DataFrame(df_movie_genre_1,columns=(unique_genres))
df_movie_genre_2.fillna('0',inplace=True)
df_movie_genre_2.head()
Out[13]:
In [14]:
# 9. This is the dataframe that has the movie title and 20 columns for genre
# indicators.
#
df_movie_genre_3 = pd.concat([df_movie_genre_1, df_movie_genre_2], axis = 1)
df_movie_genre_3.head()
Out[14]:
In [15]:
# 10. This is the dataframe that has up to seven genres per movie.
#
df_movie_genre_4 = pd.DataFrame.from_records(genresAll)
df_movie_genre_4.head()
Out[15]:
In [17]:
# 11. This is the logic for iterating through the two dataframes
# and setting the indicator in the column to '1'.
# I could not think of a clever way to determine that max = 7.
max = 7
#print(max)
for i in range (0,movie_cnt):
# print('i=',i)
for j in range (0,max):
# print('j=',j)
# g = df_movie_genre_4.loc[i][j]
# print(g)
if df_movie_genre_4.loc[i][j] != None:
df_movie_genre_3.loc[i][df_movie_genre_4.loc[i][j]] = '1'
else:
break
df_movie_genre_3.head()
Out[17]:
In [18]:
# 12. The melt function does some pretty cool stuff. Here, it takes the wide
# dataframe and makes it 'long'. In other words, it creates many rows from
# one, repeating the title on each genre for that title.
#
long = pd.melt(df_movie_genre_3, id_vars = 'title', var_name = 'genre')
long.head()
Out[18]:
In [19]:
# 13. This selects only the columns with value = '1', and deletes
# the value column.
long2 = long[long.value == '1']
del long2['value']
long2.head()
Out[19]:
In [20]:
# 14. This final step confirms that the 'Avatar' data worked correctly!
#
long_Avatar = long2[long2.title == 'Avatar']
long_Avatar.head()
Out[20]: