Feature extraction from GoodReads API

This notebook contains the step-by-step code to enrich the books dataset, using each book's ISBN as the key to pull information from the GoodReads API. The GoodReads API can be accessed through the 'goodreads' Python library.
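If the 'goodreads' library is not already installed, it can be installed from PyPI; a minimal sketch, assuming the package name matches the import used below:

In [ ]:
!pip install goodreads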


Importing libraries and tools


In [1]:
import requests
from goodreads import client
import pandas as pd

In [ ]:
# This is the URL for the isbn_to_id endpoint (shown with two sample ISBNs); the API key is appended after 'key='

url_prefix = 'https://www.goodreads.com/book/isbn_to_id/0441172717,0739467352?key='
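The prefix above is never used in the executed cells. As a hedged sketch, the endpoint could also be queried directly with requests; this assumes the `key` loaded in the client-setup cell below, and that isbn_to_id returns the GoodReads book id(s) as plain text:

In [ ]:
# Hypothetical direct call to the isbn_to_id endpoint; `key` is the API key
# loaded in the next cell, and the response body is assumed to be plain text
response = requests.get(url_prefix + key)
if response.ok:
    print(response.text)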

In [2]:
# Setting up the GoodReads API client

# Read the API key and secret (one per line) from the credentials file
with open('goodreads_credentials') as file:
    key, secret = [element.strip() for element in file.readlines()]

gc = client.GoodreadsClient(key, secret)
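A quick sanity check that the client and credentials work; a hypothetical cell using the first of the two ISBNs embedded in url_prefix above:

In [ ]:
# Look up a single known ISBN to confirm the client is set up correctly
book = gc.book(isbn='0441172717')
print(book.title)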

Importing the data file

This data file contains the original data that needs to be enriched. It holds the ISBN for each book, which we will use to extract additional information from the API.


In [16]:
df = pd.read_csv('Combine.csv',index_col=0)

In [17]:
all_isbn = df.isbn.unique()

In [18]:
isbn_df = pd.DataFrame(all_isbn,columns=['isbn'])
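As a quick hypothetical check, not part of the original run, we can count how many distinct ISBNs will be sent to the API:

In [ ]:
# Number of unique ISBNs to enrich
print(len(all_isbn))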

In [8]:
# Counter to track progress across API calls
c = 0

def get_details(isbn):
    """Fetch a book by ISBN and return its title, description and page count."""
    global c
    c += 1
    if c % 100 == 0:
        print(c)  # progress indicator, printed every 100 lookups

    try:
        b = gc.book(isbn=isbn)
        return pd.Series({'title': b.title, 'description': b.description, 'num_pages': b.num_pages})
    except Exception:
        # 'none' is a sentinel for failed lookups; later cells filter on it
        return pd.Series({'title': 'none', 'description': 'none', 'num_pages': 'none'})
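GoodReads' API terms ask clients to stay at or under one request per second. A minimal sketch of a throttled wrapper, assuming a fixed one-second delay is acceptable; this hypothetical helper is not the call pattern used in the run below:

In [ ]:
import time

def get_details_throttled(isbn):
    # Sleep first so consecutive calls stay under one request per second
    time.sleep(1)
    return get_details(isbn)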

In [21]:
# The target column order must match the Series returned by get_details,
# since pandas assigns the resulting DataFrame to these columns positionally
isbn_df[['title','description','num_pages']] = isbn_df.isbn.apply(get_details)


100
200
...
4900

In [22]:
isbn_df.to_pickle('isbn_features_full.pickle')

In [3]:
isbn_df = pd.read_pickle('isbn_features_full.pickle')

Retrying the API lookup for records with no data


In [23]:
# Work on a copy of the slice so that writing results back does not
# trigger SettingWithCopyWarning
dfx = isbn_df[isbn_df['title'] == 'none'].copy()

In [26]:
dfx.head()


Out[26]:
            isbn title description num_pages
38    0060010800  none        none      none
749   0263211053  none        none      none
2344  0578124114  none        none      none
2459  0671032976  none        none      none
2512  0739438069  none        none      none

In [33]:
# Column order again matches the Series returned by get_details
dfx[['title','description','num_pages']] = dfx.isbn.apply(get_details)


1300

In [40]:
dfx[dfx['title'] == 'none'].shape


Out[40]:
(34, 4)

There are 34 records that still have no information from the API. We will remove these records from our dataset, as they form a negligible portion of our sample of books; a sketch of the removal appears after the final save step below.

Merging dfx with isbn_df


In [19]:
# Write the retried rows back into isbn_df in one aligned assignment
# (equivalent to the original row-by-row iterrows loop)
isbn_df.loc[dfx.index] = dfx

In [42]:
# Checking if the newly created dataset contains the same number of empty records as in dfx

isbn_df[isbn_df['title'] == 'none'].shape == dfx[dfx['title'] == 'none'].shape


Out[42]:
True

Saving the data file


In [39]:
isbn_df.to_pickle('isbn_features_new_batch.pickle')
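
Removing records with no API data

The 34 records for which the API returned nothing can now be dropped, as noted above. A minimal sketch of that removal, not shown in the original notebook; 'none' is the sentinel string written by get_details on failure:

In [ ]:
# Keep only rows for which the API lookup succeeded
isbn_df = isbn_df[isbn_df['title'] != 'none'].reset_index(drop=True)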