Introduction to Data Structures

  • MSDS5023 - Philip Best
  • Elizabeth Mullen
  • 10/12/2017

Goal One

NightThree

This is the sum total of all graduate program knowledge... Compared to what? According to whom?

Goal One

Ingest a csv file as pure text, then as list of lines, and then as a data frame.

Goal Two

Right now, the file is in a 'narrow' format. In other words, several interesting bits are collapsed into a single field. Let's attempt to make the data frame a 'wide' format. All the collapsed items expanded horizontally.

Goal Three

Wide data sets are good for exploration, but 'long' data sets are good for training. Let's attempt to expand all the collapsed field vertically instead of horizontally. Does this result in data duplication? What do you think about that? Yes and No are both correct -- but what's the context?

Read as text


In [1]:
import csv
with open('tmdb_5000_movies.csv',newline='') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='"')
#    for row in filereader:
#        print(', '.join(row))

Read in Line by line


In [2]:
f = open('tmdb_5000_movies.csv', "r", encoding="utf8")
line=f.readline()

while line:
#    print(line)
    line = f.readline()
f.close()

In [ ]:

Import as pandas dataframe


In [4]:
# 1. Read csv file into a dataframe
#
import pandas as pd

df_movies = pd.read_csv('tmdb_5000_movies.csv')

In [5]:
# 2. Print a few rows to check it out.
#
df_movies.head()


Out[5]:
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
3 250000000 [{"id": 28, "name": "Action"}, {"id": 80, "nam... http://www.thedarkknightrises.com/ 49026 [{"id": 849, "name": "dc comics"}, {"id": 853,... en The Dark Knight Rises Following the death of District Attorney Harve... 112.312950 [{"name": "Legendary Pictures", "id": 923}, {"... [{"iso_3166_1": "US", "name": "United States o... 2012-07-16 1084939099 165.0 [{"iso_639_1": "en", "name": "English"}] Released The Legend Ends The Dark Knight Rises 7.6 9106
4 260000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://movies.disney.com/john-carter 49529 [{"id": 818, "name": "based on novel"}, {"id":... en John Carter John Carter is a war-weary, former military ca... 43.926995 [{"name": "Walt Disney Pictures", "id": 2}] [{"iso_3166_1": "US", "name": "United States o... 2012-03-07 284139100 132.0 [{"iso_639_1": "en", "name": "English"}] Released Lost in our world, found in another. John Carter 6.1 2124

Goal Two - Expand the data horizontally.

Here is the plan:

First, I wanted to know how many rows were in the dataframe, because I would have to iterate through them.

Second, I was going to determine what all of the unique genres were, because I would have to add columns for each of them. Each column would store a value where "1" means the movie belongs to that genre. A "0" value means that it does not.

This is accomplished by expanding each movie's list of genres, then accumulating a list of all of them. Then, I'll get rid of the duplicates, so that I have a list of all the unique occurences. This turns out to be 20, so I will create a dataframe that is 4803 rows x 20 columns (df_movie_genre_3)

I will also create a dataframe that has each movie and seven columns to hold the genres that are assigned to it. If a movie only has 3 genres, the other columns have a value of 'None.'(df_movie_genre_4)

Then I will iterate through df_movie_genre_4 and put a 1 in the column whose name matches the genre (ex. Action genre puts a '1' in the Action column.)


In [6]:
# 3. Count how many rows you're dealing with.
#
x = df_movies.genres
len(x)
movie_cnt = len(x)
movie_cnt


Out[6]:
4803

In [7]:
# 4. Iterate through indices of genre array to create a list of lists of genre names.
#    There are 4803 lists 
#
import json

genresAll = []
for k,x in enumerate(df_movies.genres):
    genreList = []
    genre = json.loads(df_movies.genres[k])
    for i,val in enumerate(genre):
        genreList.append(genre[i]['name'])
    genresAll.append(genreList)
#genresAll

In [8]:
# 5. Use itertools to break up lists into individual genres.
#    For example ([1,2,3],[4,5,6]) becomes (1,2,3,4,5,6)
#
from itertools import *
glist = []
for i in chain(genresAll):
    for g in i:
#        print(g)
#        yield g
        glist.append(g)
genres = glist

In [11]:
# 6. This gets rid of all the duplicates by making it a set.
#
unique_genres = set(genres)
unique_genres


Out[11]:
{'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western'}

In [12]:
#7. 
#
df_movie_genre_1 = pd.DataFrame(df_movies.title)
df_movie_genre_1.head()


Out[12]:
title
0 Avatar
1 Pirates of the Caribbean: At World's End
2 Spectre
3 The Dark Knight Rises
4 John Carter

In [13]:
# 8. This is the dataframe that is 4803 rows x 20 columns, one for each genre.
#
df_movie_genre_2 = pd.DataFrame(df_movie_genre_1,columns=(unique_genres))
df_movie_genre_2.fillna('0',inplace=True)
df_movie_genre_2.head()


Out[13]:
Documentary Fantasy Drama Thriller Western Comedy Romance Horror Animation History Adventure Foreign Action Science Fiction War Music Crime Family TV Movie Mystery
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [14]:
# 9. This is the dataframe that has the movie title and 20 columns for genre
#      indicators.
#
df_movie_genre_3 = pd.concat([df_movie_genre_1, df_movie_genre_2], axis = 1)
df_movie_genre_3.head()


Out[14]:
title Documentary Fantasy Drama Thriller Western Comedy Romance Horror Animation ... Adventure Foreign Action Science Fiction War Music Crime Family TV Movie Mystery
0 Avatar 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 Pirates of the Caribbean: At World's End 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Spectre 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 The Dark Knight Rises 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 John Carter 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 21 columns


In [15]:
# 10. This is the dataframe that has up to seven genres per movie.
#
df_movie_genre_4 = pd.DataFrame.from_records(genresAll)
df_movie_genre_4.head()


Out[15]:
0 1 2 3 4 5 6
0 Action Adventure Fantasy Science Fiction None None None
1 Adventure Fantasy Action None None None None
2 Action Adventure Crime None None None None
3 Action Crime Drama Thriller None None None
4 Action Adventure Science Fiction None None None None

In [17]:
# 11. This  is the logic for iterating through the two dataframes
#     and setting the indicator in the column to '1'. 
#     I could not think of a clever way to determine that max = 7.

max = 7

#print(max)
for i in range (0,movie_cnt):
#    print('i=',i)
    for j in range (0,max):
#        print('j=',j)
#        g = df_movie_genre_4.loc[i][j]
#        print(g)
        if df_movie_genre_4.loc[i][j] != None:
            df_movie_genre_3.loc[i][df_movie_genre_4.loc[i][j]] = '1'
        else:
            break
    
df_movie_genre_3.head()


Out[17]:
title Documentary Fantasy Drama Thriller Western Comedy Romance Horror Animation ... Adventure Foreign Action Science Fiction War Music Crime Family TV Movie Mystery
0 Avatar 0 1 0 0 0 0 0 0 0 ... 1 0 1 1 0 0 0 0 0 0
1 Pirates of the Caribbean: At World's End 0 1 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 0 0 0 0
2 Spectre 0 0 0 0 0 0 0 0 0 ... 1 0 1 0 0 0 1 0 0 0
3 The Dark Knight Rises 0 0 1 1 0 0 0 0 0 ... 0 0 1 0 0 0 1 0 0 0
4 John Carter 0 0 0 0 0 0 0 0 0 ... 1 0 1 1 0 0 0 0 0 0

5 rows × 21 columns

Goal 3


In [18]:
# 12. The melt function does some pretty cool stuff. Here, it takes the wide
#     dataframe and makes it 'long'. In other words, it creates many rows from
#     one, repeating the title on each genre for that title.
#
long = pd.melt(df_movie_genre_3, id_vars = 'title', var_name = 'genre')
long.head()


Out[18]:
title genre value
0 Avatar Documentary 0
1 Pirates of the Caribbean: At World's End Documentary 0
2 Spectre Documentary 0
3 The Dark Knight Rises Documentary 0
4 John Carter Documentary 0

In [19]:
# 13. This selects only the columns with value = '1', and deletes
#     the value column.
long2 = long[long.value == '1']
del long2['value']
long2.head()


Out[19]:
title genre
489 Oceans Documentary
719 This Is It Documentary
1730 Winged Migration Documentary
2093 Jackass 3D Documentary
2191 Capitalism: A Love Story Documentary

In [20]:
# 14. This final step confirms that the 'Avatar' data worked correctly!
#
long_Avatar = long2[long2.title == 'Avatar']
long_Avatar.head()


Out[20]:
title genre
4803 Avatar Fantasy
48030 Avatar Adventure
57636 Avatar Action
62439 Avatar Science Fiction

The End