Feature extraction from GoodReads API

This notebook contains the step-by-step code to enrich the books dataset, using each book's ISBN as the key to pull information from the GoodReads API. The GoodReads API can be accessed through the 'goodreads' Python library.
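If the 'goodreads' library is not already installed, it can be installed from PyPI; a minimal sketch, assuming the package name matches the import used below:

In [ ]:
!pip install goodreads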


Importing libraries and tools


In [1]:
import requests
from goodreads import client
import pandas as pd

In [ ]:
# This is the URL for the isbn_to_id endpoint (shown with two sample ISBNs); the API key is appended after 'key='

url_prefix = 'https://www.goodreads.com/book/isbn_to_id/0441172717,0739467352?key='
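The prefix above is never used in the executed cells. As a hedged sketch, the endpoint could also be queried directly with requests; this assumes the `key` loaded in the client-setup cell below, and that isbn_to_id returns the GoodReads book id(s) as plain text:

In [ ]:
# Hypothetical direct call to the isbn_to_id endpoint; `key` is the API key
# loaded in the next cell, and the response body is assumed to be plain text
response = requests.get(url_prefix + key)
if response.ok:
    print(response.text)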

In [2]:
# Setting up the GoodReads API client

# Read the API key and secret (one per line) from the credentials file
with open('goodreads_credentials') as file:
    key, secret = [element.strip() for element in file.readlines()]

gc = client.GoodreadsClient(key, secret)
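A quick sanity check that the client and credentials work; a hypothetical cell using the first of the two ISBNs embedded in url_prefix above:

In [ ]:
# Look up a single known ISBN to confirm the client is set up correctly
book = gc.book(isbn='0441172717')
print(book.title)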

Importing the data file

This data file contains the original data that needs to be enriched. It holds the ISBN for each book, which we will use to extract additional information from the API.


In [16]:
df = pd.read_csv('Combine.csv',index_col=0)

In [17]:
all_isbn = df.isbn.unique()

In [18]:
isbn_df = pd.DataFrame(all_isbn,columns=['isbn'])
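As a quick hypothetical check, not part of the original run, we can count how many distinct ISBNs will be sent to the API:

In [ ]:
# Number of unique ISBNs to enrich
print(len(all_isbn))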

In [8]:
# Counter to track progress across API calls
c = 0

def get_details(isbn):
    """Fetch a book by ISBN and return its title, description and page count."""
    global c
    c += 1
    if c % 100 == 0:
        print(c)  # progress indicator, printed every 100 lookups

    try:
        b = gc.book(isbn=isbn)
        return pd.Series({'title': b.title, 'description': b.description, 'num_pages': b.num_pages})
    except Exception:
        # 'none' is a sentinel for failed lookups; later cells filter on it
        return pd.Series({'title': 'none', 'description': 'none', 'num_pages': 'none'})
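GoodReads' API terms ask clients to stay at or under one request per second. A minimal sketch of a throttled wrapper, assuming a fixed one-second delay is acceptable; this hypothetical helper is not the call pattern used in the run below:

In [ ]:
import time

def get_details_throttled(isbn):
    # Sleep first so consecutive calls stay under one request per second
    time.sleep(1)
    return get_details(isbn)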

In [21]:
# The target column order must match the Series returned by get_details,
# since pandas assigns the resulting DataFrame to these columns positionally
isbn_df[['title','description','num_pages']] = isbn_df.isbn.apply(get_details)


100
200
...
4900

In [22]:
isbn_df.to_pickle('isbn_features_full.pickle')

In [3]:
isbn_df = pd.read_pickle('isbn_features_full.pickle')

Retrying the API lookup for records with no data


In [23]:
# Work on a copy of the slice so that writing results back does not
# trigger SettingWithCopyWarning
dfx = isbn_df[isbn_df['title'] == 'none'].copy()

In [26]:
dfx.head()


Out[26]:
            isbn title description num_pages
38    0060010800  none        none      none
749   0263211053  none        none      none
2344  0578124114  none        none      none
2459  0671032976  none        none      none
2512  0739438069  none        none      none

In [33]:
# Column order again matches the Series returned by get_details
dfx[['title','description','num_pages']] = dfx.isbn.apply(get_details)


1300

In [40]:
dfx[dfx['title'] == 'none'].shape


Out[40]:
(34, 4)

There are 34 records that still have no information from the API. We will remove these records from our dataset, as they form a negligible portion of our sample of books; a sketch of the removal appears after the final save step below.

Merging dfx with isbn_df


In [19]:
# Write the retried rows back into isbn_df in one aligned assignment
# (equivalent to the original row-by-row iterrows loop)
isbn_df.loc[dfx.index] = dfx

In [42]:
# Checking if the newly created dataset contains the same number of empty records as in dfx

isbn_df[isbn_df['title'] == 'none'].shape == dfx[dfx['title'] == 'none'].shape


Out[42]:
True

Saving the data file


In [39]:
isbn_df.to_pickle('isbn_features_new_batch.pickle')
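
Removing records with no API data

The 34 records for which the API returned nothing can now be dropped, as noted above. A minimal sketch of that removal, not shown in the original notebook; 'none' is the sentinel string written by get_details on failure:

In [ ]:
# Keep only rows for which the API lookup succeeded
isbn_df = isbn_df[isbn_df['title'] != 'none'].reset_index(drop=True)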