In [1]:
import csv
datafile = open('/Users/kra7830/Desktop/MSDS_School/Info_Structures/dev/NightThree/tmdb_5000_movies.csv', 'r')
myreader = csv.reader(datafile)
In [338]:
#for i in myreader:
#    print i
##### this prints a lot of text, so it is left commented out
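A lighter-weight preview (a sketch): slice the reader with itertools so only the first few rows print.
import itertools
for row in itertools.islice(myreader, 3):
    print row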
In [2]:
import pandas as pd
# Read the CSV into a pandas data frame (df)
df = pd.read_csv('/Users/kra7830/Desktop/MSDS_School/Info_Structures/dev/NightThree/tmdb_5000_movies.csv', delimiter=',')
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
"""
a budget
b genres -> embedded lists
c homepage
d id
e keywords -> embedded lists
f original_language
g original_title
h overview
i popularity
j production_companies -> embedded lists
k production_countries -> embedded lists
l release_date
m revenue
n runtime
o spoken_languages -> embedded lists
p status
q tagline
r title
s vote_average
t vote_count
"""
Out[2]:
In [3]:
import json
import pandas as pd
lst = df['genres'].values.tolist()
print lst[1]     # the raw JSON string of genres for the second movie
print type(lst)  # confirm we have a plain Python list
f1 is a dataframe that contains a non-JSON variable and an embedded JSON variable. The first test will be to enumerate through genres to unpack the JSON and loop over two lists simultaneously.
http://treyhunner.com/2016/04/how-to-loop-with-indexes-in-python/
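As the linked article explains, enumerate hands you an index to reach into a parallel list, while zip pairs the lists directly. A minimal sketch of both styles, with made-up data:
budgets = [237000000, 300000000]
genre_strings = ['[{"id": 28, "name": "Action"}]', '[{"id": 12, "name": "Adventure"}]']
# index style: use the position from enumerate to look up the parallel list
for i, g in enumerate(genre_strings):
    print budgets[i], g
# zip style: pair the two lists directly, no manual indexing
for b, g in zip(budgets, genre_strings):
    print b, g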
In [4]:
f1 = df[['budget','genres']]
f1.head()
Out[4]:
Testing the unpacking of JSON with enumeration...
In [5]:
### GOAL 2 / TEST 1: Unpack Collapsed Variables
ids = []
names = []
mel = []
dic = lst                    # list of JSON strings, one per movie
budget = list(f1['budget'])  # corresponding list of budgets, aligned by position
for i, d in enumerate(dic):
    d_lst = json.loads(d)    # parse the JSON string into a list of dicts
    budg = budget[i]         # budget for this movie
    for entry in d_lst:      # one entry per genre; the inner loop gets its own variable
        mel.append(budg)
        ids.append(entry['id'])
        names.append(entry['name'])
In [6]:
gf = pd.DataFrame({'Budget':mel, 'ID': ids, 'name': names})
print gf.head()
Looks like it worked... and the variables are unpacked into individual rows.
Organizing a little bit...
In [8]:
"""
a budget
b genres -> embedded lists
c homepage
d id
e keywords -> embedded lists
f original_language
g original_title
h overview
i popularity
j production_companies -> embedded lists
k production_countries -> embedded lists
l release_date
m revenue
n runtime
o spoken_languages -> embedded lists
p status
q tagline
r title
s vote_average
t vote_count
"""
df_new = df[['title','budget', 'homepage', 'id', 'original_language', 'original_title', 'overview', 'popularity',
'release_date', 'revenue', 'runtime', 'status', 'tagline', 'vote_average', 'vote_count' ]]
#df_new is the dataframe that has all variables EXCLUDING the JSON data...
Making lists of all the JSON variables...
In [9]:
import numpy as np
b_list = df['genres'].values.tolist()
e_list = df['keywords'].values.tolist()
j_list = df['production_companies'].values.tolist()
k_list = df['production_countries'].values.tolist()
o_list = df['spoken_languages'].values.tolist()
##thought: join each unpacked value to its movie title, make title the primary key, then left-merge the data frames by title
def unpack_json(x):
    global bf, ef, jf, kf, of  # global dataframes saved for further use
    ids = []
    names = []
    movie_key = []
    r_list = list(df_new['title'])  # titles, aligned by position with the collapsed data
    n = json.loads(x[1])
    xn = n[0].keys()  # infer the two key names (e.g. 'id' and 'name') from a sample row
    for i, row in enumerate(x):
        movie_title = r_list[i]  # movie title is the key
        js = json.loads(row)     # parse the JSON string for this movie
        for entry in js:
            movie_key.append(movie_title)
            ids.append(entry[xn[0]])
            names.append(entry[xn[1]])
    # identity checks decide which global dataframe the lists belong to
    if x is e_list:  # keywords
        ef = pd.DataFrame({'Title': movie_key, 'Keyword_ID': ids, 'Keyword_name': names})
        print 'Success'
    elif x is b_list:  # genres
        bf = pd.DataFrame({'Title': movie_key, 'Genres_ID': ids, 'Genres_name': names})
        print 'Success'
    elif x is j_list:  # production_companies
        # remember: dict.keys() order is arbitrary here, so ID and name may come out swapped
        jf = pd.DataFrame({'Title': movie_key, 'ProdComp_ID': ids, 'ProdComp_name': names})
        print 'Success'
    elif x is k_list:  # production_countries
        kf = pd.DataFrame({'Title': movie_key, 'ProdCty_ID': ids, 'ProdCty_name': names})
        print 'Success'
    elif x is o_list:  # spoken_languages
        of = pd.DataFrame({'Title': movie_key, 'Lang_ID': ids, 'Lang_name': names})
        print of.head()
        print 'Success'
    else:
        print "NOPE"
unpack_json(e_list)
unpack_json(b_list)
unpack_json(j_list)
unpack_json(k_list)
unpack_json(o_list)
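As an aside, a more idiomatic variant of the same unpacking (a sketch, not a drop-in replacement: it returns the frame instead of assigning to a global, and assumes the keys are literally 'id' and 'name'):
def unpack_json_v2(json_strings, titles, id_col, name_col):
    rows = []
    for title, s in zip(titles, json_strings):
        for entry in json.loads(s):
            rows.append({'Title': title, id_col: entry['id'], name_col: entry['name']})
    return pd.DataFrame(rows)
# e.g. bf_alt = unpack_json_v2(b_list, list(df_new['title']), 'Genres_ID', 'Genres_name')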
"Wide data sets are good for exploration, but 'long' data sets are good for training. Let's attempt to expand all the collapsed field vertically instead of horizontally. Does this result in data duplication? What do you think about that? Yes and No are both correct -- but what's the context?"
Yes, it duplicated a lot of values. Yes, that duplication can increase signal for some classes of models. However, wide data is also what many statistical techniques expect, such as logistic regression.
-- Now that I have the data unpacked, I will left-join it to the dataframe df_new on movie title. This elongates the data vertically, adding a row for each unique unpacked value.
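A tiny illustration (made-up data) of why the left join elongates: one movie row fans out into one row per genre, and the movie-level values are duplicated across them.
left = pd.DataFrame({'title': ['Avatar'], 'budget': [237000000]})
right = pd.DataFrame({'Title': ['Avatar', 'Avatar'], 'Genres_name': ['Action', 'Adventure']})
print left.join(right.set_index('Title'), on='title', how='left')
# two rows for Avatar; the budget is duplicated on both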
In [10]:
### JOIN to make long vertical Data Set
#caller.join(other.set_index('key'), on='key')
r = df_new.join(bf.set_index('Title'), on='title', how='left') #bf
r1 = r.join(ef.set_index('Title'), on='title', how='left') #ef
r2 = r1.join(jf.set_index('Title'), on='title', how='left') #jf
r3 = r2.join(kf.set_index('Title'), on='title', how='left') #kf
final_long_df = r3.join(of.set_index('Title'), on='title', how='left') #of = Last join, so final
In [11]:
print final_long_df.head(2)
#print final_long_df.count()
Looking at counts...
In [12]:
print str("#this is OG data with no JSON" "\n"), df_new.count()
In [13]:
print bf['Genres_name'].count()
print "\n"
print "#this is genres unpacked by movie\n", bf.head()
print "\n"
print "#this is genres unpacked by movie, counts\n", bf.count()
print "\n"
#print r.count()
In [14]:
pd.set_option('display.max_rows', 500)
In [15]:
#Getting all the unique values (the 'name' keys) to store for later use as features
def store_keys(x):
    global key_un, gen_un, prod_comp_un, prod_cty_un, lang_un
    ident = []
    n = json.loads(x[1])
    xn = n[0].keys()  # infer key names from a sample row
    for row in x:
        js = json.loads(row)  # parse the JSON string for this movie
        for entry in js:
            ident.append(entry[xn[1]])  # collect only the 'name' values
    if x is e_list:  # keywords
        key_un = pd.DataFrame(np.unique(ident))
        print key_un.head()
        print 'Success'
    elif x is b_list:  # genres
        gen_un = pd.DataFrame(np.unique(ident))
        print 'Success'
    elif x is j_list:  # production_companies
        prod_comp_un = pd.DataFrame(np.unique(ident))
        print 'Success'
    elif x is k_list:  # production_countries
        prod_cty_un = pd.DataFrame(np.unique(ident))
        print 'Success'
    elif x is o_list:  # spoken_languages
        lang_un = pd.DataFrame(np.unique(ident))
        print 'Success'
    else:
        print "NOPE"
store_keys(e_list)
store_keys(b_list)
store_keys(j_list)
store_keys(k_list)
store_keys(o_list)
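A possible shortcut, as an aside (assuming the bf/ef/jf/kf/of frames from unpack_json are still in scope): the unique names can come straight from the already-unpacked frames. Note that Series.unique preserves encounter order, whereas np.unique above sorts.
gen_un_alt = pd.DataFrame(bf['Genres_name'].unique())
key_un_alt = pd.DataFrame(ef['Keyword_name'].unique())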
Testing to see if the keys were captured... These will eventually be used as features in the wide data set.
In [16]:
lang_un[0].head()
Out[16]:
In [17]:
df_genres = pd.get_dummies(bf['Genres_name'])
df_keyword = pd.get_dummies(ef['Keyword_name'])
df_prod_name = pd.get_dummies(jf['ProdComp_ID'])
df_prod_country = pd.get_dummies(kf['ProdCty_name'])
df_lang = pd.get_dummies(of['Lang_name'])
In [18]:
df_prod_name.head() #nice function... very simple
Out[18]:
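What get_dummies is doing, shown on a toy series: one 0/1 indicator column per distinct value.
s = pd.Series(['Action', 'Comedy', 'Action'])
print pd.get_dummies(s)
#    Action  Comedy
# 0       1       0
# 1       0       1
# 2       1       0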
Concat directly back to the unpacked dataframes...
In [19]:
genres_wide = pd.concat([bf, df_genres], axis=1)
keyword_wide = pd.concat([ef, df_keyword], axis=1)
prod_name_wide = pd.concat([jf, df_prod_name], axis=1)
prod_country_wide = pd.concat([kf, df_prod_country], axis=1)
lang_wide = pd.concat([of, df_lang], axis=1)
genres_wide.head()
Out[19]:
Removing names and IDs...
In [20]:
df_gr = genres_wide.drop(['Genres_name', 'Genres_ID'], 1)
df_ky = keyword_wide.drop(['Keyword_ID', 'Keyword_name'], 1)
df_pn = prod_name_wide.drop(['ProdComp_ID', 'ProdComp_name'], 1)
df_pc = prod_country_wide.drop(['ProdCty_name', 'ProdCty_ID'], 1)
df_ln = lang_wide.drop(['Lang_ID', 'Lang_name'], 1)
In [21]:
df_ln.head()
Out[21]:
In [22]:
### create pivot tables as DFs via to_records, then rename Title for a simpler join...
def multi_table_creation(x):
    global gr_table, ky_table, pn_table, pc_table, ln_table
    if x is df_gr:
        gr_table = pd.pivot_table(x, index=['Title'], aggfunc=np.sum)
        gr_table = pd.DataFrame(gr_table.to_records())
        gr_table = gr_table.rename(index=str, columns={"Title": "title"})
        print "OK"
    elif x is df_ky:
        ky_table = pd.pivot_table(x, index=['Title'], aggfunc=np.sum)
        ky_table = pd.DataFrame(ky_table.to_records())
        ky_table = ky_table.rename(index=str, columns={"Title": "title"})
    elif x is df_pn:
        pn_table = pd.pivot_table(x, index=['Title'], aggfunc=np.sum)
        pn_table = pd.DataFrame(pn_table.to_records())
        pn_table = pn_table.rename(index=str, columns={"Title": "title"})
    elif x is df_pc:
        pc_table = pd.pivot_table(x, index=['Title'], aggfunc=np.sum)
        pc_table = pd.DataFrame(pc_table.to_records())
        pc_table = pc_table.rename(index=str, columns={"Title": "title"})
    elif x is df_ln:
        ln_table = pd.pivot_table(x, index=['Title'], aggfunc=np.sum)
        ln_table = pd.DataFrame(ln_table.to_records())
        ln_table = ln_table.rename(index=str, columns={"Title": "title"})
multi_table_creation(df_gr)
multi_table_creation(df_ky)
multi_table_creation(df_pn)
multi_table_creation(df_pc)
multi_table_creation(df_ln)
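An equivalent, arguably simpler recipe per table (a sketch): groupby + sum collapses the one-row-per-genre dummies back down to one row per title.
gr_table_alt = df_gr.groupby('Title').sum().reset_index().rename(columns={'Title': 'title'})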
In [23]:
ky_table.head() #title is now a column name, not a pivoted var...
Out[23]:
Now we have tables keyed on unique movie titles, each with a wide set of features.
In [24]:
t = df_new.join(gr_table.set_index('title'), on='title', how='left') #bf
In [27]:
t.head()
Out[27]:
The dummy values changed to float: the left join introduces NaNs for titles with no matching genre rows, and an integer column cannot hold NaN, so pandas upcasts it to float64.
In [27]:
t.iloc[:,15].head() ### Values changed to float64... see below for fix
Out[27]:
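A minimal demo of that upcasting behavior:
s = pd.Series([1, 0])
print s.dtype                     # int64
print s.reindex([0, 1, 2]).dtype  # float64: the new label has no value, and NaN forces a float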
In [29]:
#gr_table, ky_table, pn_table, pc_table, ln_table
t = df_new.join(gr_table.set_index('title'), on='title', how='left') #bf
ky_table = ky_table.sort_values(by = 'title')
t1 = t.join(ky_table.set_index('title'), on='title', how='left', lsuffix='_left', rsuffix='_right')
t2 = t1.join(pn_table.set_index('title'), on='title', how='left', lsuffix='_left', rsuffix='_right') #bf
t3 = t2.join(pc_table.set_index('title'), on='title', how='left', lsuffix='_left', rsuffix='_right') #bf
t4 = t3.join(ln_table.set_index('title'), on='title', how='left', lsuffix='_left', rsuffix='_right') #bf
In [30]:
t4.head()
Out[30]:
Filling NaNs before INT conversion...
In [31]:
t4.fillna(0, inplace=True)
In [32]:
#t['History'] = t['History'].apply(int)
#key_un, gen_un, prod_comp_un, prod_cty_un, lang_un  # the keys stored earlier
for i in gen_un[0]:
    t4[i] = t4[i].apply(int)
#for i in key_un[0]:
#    t4[i] = t4[i].apply(int)
#for i in prod_comp_un[0]:
#    t4[i] = t4[i].apply(int)
#for i in prod_cty_un[0]:
#    t4[i] = t4[i].apply(int)
#for i in lang_un[0]:
#    t4[i] = t4[i].apply(int)
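A vectorized alternative (a sketch, assuming every genre name survived the join as a column of t4): cast all genre columns in one call instead of looping.
genre_cols = list(gen_un[0])
t4[genre_cols] = t4[genre_cols].astype(int)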
In [51]:
f = pd.DataFrame(t4.iloc[0]) # for the example, select just a single row of the data
In [290]:
pd.set_option('display.max_columns', 100) # expand to see more variables
In [291]:
f.T
Out[291]:
Importing some stats packages...
In [71]:
import scipy.stats as stats
In [140]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
In [177]:
log_test = t4[['title','budget', 'original_language','runtime','vote_average','Action', 'Adventure',
'Comedy','Fantasy', 'Foreign', 'Family', 'History', 'Horror', 'Music', 'Thriller', 'War']]
In [320]:
results = smf.glm('Action ~ Adventure',
data=log_test, family=sm.families.Binomial()).fit()
#results = smf.glm('Action ~ Adventure + Comedy + Fantasy + Foreign + Family + History + Horror + Thriller + War',
# data=log_test, family=sm.families.Binomial()).fit()
In [321]:
print(results.summary())
In [322]:
results.params
Out[322]:
Unadjusted odds ratio example. Logistic regression is one technique that could be used to predict / analyze patterns across all 15,015 features in the wide data set.
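For reference, the odds ratio is just the exponential of the fitted coefficient; with a hypothetical coefficient of -1.2:
print np.round(np.exp(-1.2), 2)  # 0.3: membership in that genre multiplies the odds of 'Action' by 0.3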
In [332]:
for i, coef in enumerate(results.params):
    if i == 0:
        continue  # skip the intercept (the original bare 'next' was a no-op)
    print results.model.data.param_names[i], "multiplies the odds of also being an Action movie by", np.round(np.exp(coef), 1)