Introduction to Data Structures

MSDS5023 - Philip Best
Elizabeth Mullen
10/12/2017

Goal One

NightThree

This is the sum total of all graduate program knowledge... Compared to what? According to whom?

Goal One

Ingest a csv file as pure text, then as list of lines, and then as a data frame.

Goal Two

Right now, the file is in a 'narrow' format. In other words, several interesting bits are collapsed into a single field. Let's attempt to make the data frame a 'wide' format. All the collapsed items expanded horizontally.

Goal Three

Wide data sets are good for exploration, but 'long' data sets are good for training. Let's attempt to expand all the collapsed field vertically instead of horizontally. Does this result in data duplication? What do you think about that? Yes and No are both correct -- but what's the context?

Read as text



In [1]:

    
import csv
with open('tmdb_5000_movies.csv',newline='') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='"')
#    for row in filereader:
#        print(', '.join(row))

Read in Line by line



In [2]:

    
f = open('tmdb_5000_movies.csv', "r", encoding="utf8")
line=f.readline()

while line:
#    print(line)
    line = f.readline()
f.close()



In [ ]:

Import as pandas dataframe



In [4]:

    
# 1. Read csv file into a dataframe
#
import pandas as pd

df_movies = pd.read_csv('tmdb_5000_movies.csv')



In [5]:

    
# 2. Print a few rows to check it out.
#
df_movies.head()









    Out[5]:







  
    
      
      budget
      genres
      homepage
      id
      keywords
      original_language
      original_title
      overview
      popularity
      production_companies
      production_countries
      release_date
      revenue
      runtime
      spoken_languages
      status
      tagline
      title
      vote_average
      vote_count
    
  
  
    
      0
      237000000
      [{"id": 28, "name": "Action"}, {"id": 12, "nam...
      http://www.avatarmovie.com/
      19995
      [{"id": 1463, "name": "culture clash"}, {"id":...
      en
      Avatar
      In the 22nd century, a paraplegic Marine is di...
      150.437577
      [{"name": "Ingenious Film Partners", "id": 289...
      [{"iso_3166_1": "US", "name": "United States o...
      2009-12-10
      2787965087
      162.0
      [{"iso_639_1": "en", "name": "English"}, {"iso...
      Released
      Enter the World of Pandora.
      Avatar
      7.2
      11800
    
    
      1
      300000000
      [{"id": 12, "name": "Adventure"}, {"id": 14, "...
      http://disney.go.com/disneypictures/pirates/
      285
      [{"id": 270, "name": "ocean"}, {"id": 726, "na...
      en
      Pirates of the Caribbean: At World's End
      Captain Barbossa, long believed to be dead, ha...
      139.082615
      [{"name": "Walt Disney Pictures", "id": 2}, {"...
      [{"iso_3166_1": "US", "name": "United States o...
      2007-05-19
      961000000
      169.0
      [{"iso_639_1": "en", "name": "English"}]
      Released
      At the end of the world, the adventure begins.
      Pirates of the Caribbean: At World's End
      6.9
      4500
    
    
      2
      245000000
      [{"id": 28, "name": "Action"}, {"id": 12, "nam...
      http://www.sonypictures.com/movies/spectre/
      206647
      [{"id": 470, "name": "spy"}, {"id": 818, "name...
      en
      Spectre
      A cryptic message from Bond’s past sends him o...
      107.376788
      [{"name": "Columbia Pictures", "id": 5}, {"nam...
      [{"iso_3166_1": "GB", "name": "United Kingdom"...
      2015-10-26
      880674609
      148.0
      [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...
      Released
      A Plan No One Escapes
      Spectre
      6.3
      4466
    
    
      3
      250000000
      [{"id": 28, "name": "Action"}, {"id": 80, "nam...
      http://www.thedarkknightrises.com/
      49026
      [{"id": 849, "name": "dc comics"}, {"id": 853,...
      en
      The Dark Knight Rises
      Following the death of District Attorney Harve...
      112.312950
      [{"name": "Legendary Pictures", "id": 923}, {"...
      [{"iso_3166_1": "US", "name": "United States o...
      2012-07-16
      1084939099
      165.0
      [{"iso_639_1": "en", "name": "English"}]
      Released
      The Legend Ends
      The Dark Knight Rises
      7.6
      9106
    
    
      4
      260000000
      [{"id": 28, "name": "Action"}, {"id": 12, "nam...
      http://movies.disney.com/john-carter
      49529
      [{"id": 818, "name": "based on novel"}, {"id":...
      en
      John Carter
      John Carter is a war-weary, former military ca...
      43.926995
      [{"name": "Walt Disney Pictures", "id": 2}]
      [{"iso_3166_1": "US", "name": "United States o...
      2012-03-07
      284139100
      132.0
      [{"iso_639_1": "en", "name": "English"}]
      Released
      Lost in our world, found in another.
      John Carter
      6.1
      2124

Goal Two - Expand the data horizontally.

Here is the plan:

First, I wanted to know how many rows were in the dataframe, because I would have to iterate through them.

Second, I was going to determine what all of the unique genres were, because I would have to add columns for each of them. Each column would store a value where "1" means the movie belongs to that genre. A "0" value means that it does not.

This is accomplished by expanding each movie's list of genres, then accumulating a list of all of them. Then, I'll get rid of the duplicates, so that I have a list of all the unique occurences. This turns out to be 20, so I will create a dataframe that is 4803 rows x 20 columns (df_movie_genre_3)

I will also create a dataframe that has each movie and seven columns to hold the genres that are assigned to it. If a movie only has 3 genres, the other columns have a value of 'None.'(df_movie_genre_4)

Then I will iterate through df_movie_genre_4 and put a 1 in the column whose name matches the genre (ex. Action genre puts a '1' in the Action column.)



In [6]:

    
# 3. Count how many rows you're dealing with.
#
x = df_movies.genres
len(x)
movie_cnt = len(x)
movie_cnt









    Out[6]:





4803



In [7]:

    
# 4. Iterate through indices of genre array to create a list of lists of genre names.
#    There are 4803 lists 
#
import json

genresAll = []
for k,x in enumerate(df_movies.genres):
    genreList = []
    genre = json.loads(df_movies.genres[k])
    for i,val in enumerate(genre):
        genreList.append(genre[i]['name'])
    genresAll.append(genreList)
#genresAll



In [8]:

    
# 5. Use itertools to break up lists into individual genres.
#    For example ([1,2,3],[4,5,6]) becomes (1,2,3,4,5,6)
#
from itertools import *
glist = []
for i in chain(genresAll):
    for g in i:
#        print(g)
#        yield g
        glist.append(g)
genres = glist



In [11]:

    
# 6. This gets rid of all the duplicates by making it a set.
#
unique_genres = set(genres)
unique_genres









    Out[11]:





{'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western'}



In [12]:

    
#7. 
#
df_movie_genre_1 = pd.DataFrame(df_movies.title)
df_movie_genre_1.head()









    Out[12]:







  
    
      
      title
    
  
  
    
      0
      Avatar
    
    
      1
      Pirates of the Caribbean: At World's End
    
    
      2
      Spectre
    
    
      3
      The Dark Knight Rises
    
    
      4
      John Carter



In [13]:

    
# 8. This is the dataframe that is 4803 rows x 20 columns, one for each genre.
#
df_movie_genre_2 = pd.DataFrame(df_movie_genre_1,columns=(unique_genres))
df_movie_genre_2.fillna('0',inplace=True)
df_movie_genre_2.head()









    Out[13]:







  
    
      
      Documentary
      Fantasy
      Drama
      Thriller
      Western
      Comedy
      Romance
      Horror
      Animation
      History
      Adventure
      Foreign
      Action
      Science Fiction
      War
      Music
      Crime
      Family
      TV Movie
      Mystery
    
  
  
    
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0



In [14]:

    
# 9. This is the dataframe that has the movie title and 20 columns for genre
#      indicators.
#
df_movie_genre_3 = pd.concat([df_movie_genre_1, df_movie_genre_2], axis = 1)
df_movie_genre_3.head()









    Out[14]:







  
    
      
      title
      Documentary
      Fantasy
      Drama
      Thriller
      Western
      Comedy
      Romance
      Horror
      Animation
      ...
      Adventure
      Foreign
      Action
      Science Fiction
      War
      Music
      Crime
      Family
      TV Movie
      Mystery
    
  
  
    
      0
      Avatar
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      Pirates of the Caribbean: At World's End
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      Spectre
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      The Dark Knight Rises
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      John Carter
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 21 columns



In [15]:

    
# 10. This is the dataframe that has up to seven genres per movie.
#
df_movie_genre_4 = pd.DataFrame.from_records(genresAll)
df_movie_genre_4.head()









    Out[15]:







  
    
      
      0
      1
      2
      3
      4
      5
      6
    
  
  
    
      0
      Action
      Adventure
      Fantasy
      Science Fiction
      None
      None
      None
    
    
      1
      Adventure
      Fantasy
      Action
      None
      None
      None
      None
    
    
      2
      Action
      Adventure
      Crime
      None
      None
      None
      None
    
    
      3
      Action
      Crime
      Drama
      Thriller
      None
      None
      None
    
    
      4
      Action
      Adventure
      Science Fiction
      None
      None
      None
      None



In [17]:

    
# 11. This  is the logic for iterating through the two dataframes
#     and setting the indicator in the column to '1'. 
#     I could not think of a clever way to determine that max = 7.

max = 7

#print(max)
for i in range (0,movie_cnt):
#    print('i=',i)
    for j in range (0,max):
#        print('j=',j)
#        g = df_movie_genre_4.loc[i][j]
#        print(g)
        if df_movie_genre_4.loc[i][j] != None:
            df_movie_genre_3.loc[i][df_movie_genre_4.loc[i][j]] = '1'
        else:
            break
    
df_movie_genre_3.head()









    Out[17]:







  
    
      
      title
      Documentary
      Fantasy
      Drama
      Thriller
      Western
      Comedy
      Romance
      Horror
      Animation
      ...
      Adventure
      Foreign
      Action
      Science Fiction
      War
      Music
      Crime
      Family
      TV Movie
      Mystery
    
  
  
    
      0
      Avatar
      0
      1
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      0
      1
      1
      0
      0
      0
      0
      0
      0
    
    
      1
      Pirates of the Caribbean: At World's End
      0
      1
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      0
      1
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      Spectre
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
    
      3
      The Dark Knight Rises
      0
      0
      1
      1
      0
      0
      0
      0
      0
      ...
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
    
      4
      John Carter
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      1
      0
      1
      1
      0
      0
      0
      0
      0
      0
    
  

5 rows × 21 columns

Goal 3



In [18]:

    
# 12. The melt function does some pretty cool stuff. Here, it takes the wide
#     dataframe and makes it 'long'. In other words, it creates many rows from
#     one, repeating the title on each genre for that title.
#
long = pd.melt(df_movie_genre_3, id_vars = 'title', var_name = 'genre')
long.head()









    Out[18]:







  
    
      
      title
      genre
      value
    
  
  
    
      0
      Avatar
      Documentary
      0
    
    
      1
      Pirates of the Caribbean: At World's End
      Documentary
      0
    
    
      2
      Spectre
      Documentary
      0
    
    
      3
      The Dark Knight Rises
      Documentary
      0
    
    
      4
      John Carter
      Documentary
      0



In [19]:

    
# 13. This selects only the columns with value = '1', and deletes
#     the value column.
long2 = long[long.value == '1']
del long2['value']
long2.head()









    Out[19]:







  
    
      
      title
      genre
    
  
  
    
      489
      Oceans
      Documentary
    
    
      719
      This Is It
      Documentary
    
    
      1730
      Winged Migration
      Documentary
    
    
      2093
      Jackass 3D
      Documentary
    
    
      2191
      Capitalism: A Love Story
      Documentary



In [20]:

    
# 14. This final step confirms that the 'Avatar' data worked correctly!
#
long_Avatar = long2[long2.title == 'Avatar']
long_Avatar.head()









    Out[20]:







  
    
      
      title
      genre
    
  
  
    
      4803
      Avatar
      Fantasy
    
    
      48030
      Avatar
      Adventure
    
    
      57636
      Avatar
      Action
    
    
      62439
      Avatar
      Science Fiction

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "nam...	http://www.thedarkknightrises.com/	49026	[{"id": 849, "name": "dc comics"}, {"id": 853,...	en	The Dark Knight Rises	Following the death of District Attorney Harve...	112.312950	[{"name": "Legendary Pictures", "id": 923}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2012-07-16	1084939099	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://movies.disney.com/john-carter	49529	[{"id": 818, "name": "based on novel"}, {"id":...	en	John Carter	John Carter is a war-weary, former military ca...	43.926995	[{"name": "Walt Disney Pictures", "id": 2}]	[{"iso_3166_1": "US", "name": "United States o...	2012-03-07	284139100	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124

	Documentary	Fantasy	Drama	Thriller	Western	Comedy	Romance	Horror	Animation	History	Adventure	Foreign	Action	Science Fiction	War	Music	Crime	Family	TV Movie	Mystery
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

	0	1	2	3	4	5	6
0	Action	Adventure	Fantasy	Science Fiction	None	None	None
1	Adventure	Fantasy	Action	None	None	None	None
2	Action	Adventure	Crime	None	None	None	None
3	Action	Crime	Drama	Thriller	None	None	None
4	Action	Adventure	Science Fiction	None	None	None	None

	title	genre
0	Avatar	Documentary
1	Pirates of the Caribbean: At World's End	Documentary
2	Spectre	Documentary
3	The Dark Knight Rises	Documentary
4	John Carter	Documentary

	title	genre
489	Oceans	Documentary
719	This Is It	Documentary
1730	Winged Migration	Documentary
2093	Jackass 3D	Documentary
2191	Capitalism: A Love Story	Documentary

	title	genre
4803	Avatar	Fantasy
48030	Avatar	Adventure
57636	Avatar	Action
62439	Avatar	Science Fiction

	Documentary	Fantasy	Drama	Thriller	Western	Comedy	Romance	Horror	Animation	History	Adventure	Foreign	Action	Science Fiction	War	Music	Crime	Family	TV Movie	Mystery
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

Introduction to Data Structures

Goal One

NightThree

Goal One

Goal Two

Goal Three

Read as text

Read in Line by line

Import as pandas dataframe

Goal Two - Expand the data horizontally.

Goal 3

The End

	Documentary	Fantasy	Drama	Thriller	Western	Comedy	Romance	Horror	Animation	History	Adventure	Foreign	Action	Science Fiction	War	Music	Crime	Family	TV Movie	Mystery
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0