We are going to analyze two related data sets from the Yelp Phoenix data set in order to find patterns that can reveal valuable information. First we will deal with the tips data set.
For each record in the data set we are going to extract the text of the tip and process it using TF-IDF; finally, we are going to perform some clustering on the resulting data.
To extract the data we just have to parse the JSON file; the json package makes this straightforward.
In [11]:
import json


def load_json_file(file_path):
    """
    Builds a list of dictionaries from a JSON file
    :type file_path: string
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the data from the files
    """
    records = [json.loads(line) for line in open(file_path)]
    return records
In [12]:
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')


class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
Once we have defined this class, we can go on and create a matrix which will contain the term frequency-inverse document frequency (TF-IDF) of each tip.
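As a quick illustration of what the stemmed analyzer does (the sample sentence below is made up), word variants such as "restaurants" and "restaurant" are collapsed into the same stem, so they end up sharing a single column in the resulting matrix:

# Minimal sketch: inspect the analyzer of the stemmed vectorizer on a
# made-up sentence. Plural and singular forms map to the same stem.
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
analyze = vectorizer.build_analyzer()
print(list(analyze("Great restaurant, one of the best restaurants in Phoenix")))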
In [13]:
def tf_idf_tips(file_path):
    records = load_json_file(file_path)
    data = [record['text'] for record in records]
    vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
    vectorized = vectorizer.fit_transform(data)
    num_samples, num_features = vectorized.shape
    print("#samples: %d, #features: %d" % (
        num_samples, num_features))
    return vectorized
The matrix we have created in the previous step serves as input for a clustering algorithm. Once the data is clustered, we can use the clusters to obtain tips that are similar to a given tip, which can be useful for a search engine.
We are going to define a function that will help us cluster the data, and another function to evaluate how good the clusters are using the silhouette score.
In [14]:
import nltk
import numpy
import sklearn.cluster as skcluster
import sklearn.metrics as skmetrics
import time
import scipy.cluster as scicluster
import matplotlib.pyplot as plt


def k_means_scikit(matrix):
    k_means = skcluster.KMeans(n_clusters=50, init='k-means++',
                               n_init=1, verbose=0)
    k_means.fit(matrix)
    return k_means.labels_


def evaluate_performance(data, labels, metric='euclidean'):
    score = skmetrics.silhouette_score(data, labels, metric=metric)
    print('Score:', score)
    return score


def cluster_and_evaluate_data(matrix, metric='euclidean'):
    print('Clustering started')
    clustering_start = time.time()
    labels = k_means_scikit(matrix)
    clustering_total = time.time() - clustering_start
    print('Clustering time:', clustering_total)
    evaluation_start = time.time()
    score = evaluate_performance(matrix, labels, metric)
    evaluation_total = time.time() - evaluation_start
    print('Evaluation time:', evaluation_total)
    total = time.time() - clustering_start
    print('Total time:', total)
Given the above functions we can now perform clustering and evaluate how good the clusters are using the silhouette score. The closer the score is to 1 the better; a value close to -1 is very bad, and a value close to 0 means that the clusters are overlapping.
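As a quick illustration of how to read the score (using synthetic data, not the Yelp data set), well-separated blobs get a score close to 1 while heavily overlapping ones drift towards 0:

# Toy example (synthetic data) of how the silhouette score behaves for
# well-separated versus overlapping clusters.
from sklearn.datasets import make_blobs

separated, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                          random_state=0)
overlapping, _ = make_blobs(n_samples=300, centers=3, cluster_std=5.0,
                            random_state=0)
k_means = skcluster.KMeans(n_clusters=3, init='k-means++', n_init=10)
labels_separated = k_means.fit(separated).labels_
labels_overlapping = k_means.fit(overlapping).labels_
print('Separated:', skmetrics.silhouette_score(separated, labels_separated))
print('Overlapping:', skmetrics.silhouette_score(overlapping, labels_overlapping))
# The first score should be much closer to 1 than the second, which tends
# towards 0 because the clusters overlap.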
We are going to use just 100 samples because we are executing this example in the browser; be sure to remove the [:100] when executing this on your machine. For the whole data set, the k-means algorithm executed successfully, but the silhouette score ran out of memory.
In [15]:
tip_file_path = 'data/yelp_academic_dataset_tip.json'
tip_matrix = tf_idf_tips(tip_file_path)
cluster_and_evaluate_data(tip_matrix[:100])
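As mentioned above, the cluster labels can be used to retrieve tips that are similar to a given one. Here is a minimal sketch, assuming we rebuild the list of raw tip texts from the records (tf_idf_tips as written does not return it) and cluster the same 100-sample subset:

# Minimal sketch: use the cluster labels to find tips that are similar to a
# given tip. Assumes the same 100-sample subset used above.
tip_records = load_json_file(tip_file_path)
tip_texts = [record['text'] for record in tip_records][:100]
tip_labels = k_means_scikit(tip_matrix[:100])


def similar_tips(tip_index, labels, texts):
    """Returns the texts of the tips that fall in the same cluster as the
    tip at the given index."""
    cluster = labels[tip_index]
    return [texts[i] for i, label in enumerate(labels)
            if label == cluster and i != tip_index]


print(similar_tips(0, tip_labels, tip_texts))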
For each record in the data set we are going to extract the categories associated with that particular business, and transform all of the data into a binary matrix of N rows and M columns, where N is the number of businesses and M is the number of categories. If business i belongs to category j then the cell at (i, j) will contain a 1, otherwise it will contain a 0.
We will also create a list of sets that contains the categories of each business. These sets will be very useful when clustering using Jaccard similarity.
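As a quick illustration (with made-up categories), the Jaccard similarity between two of these sets is the size of their intersection divided by the size of their union:

# Toy example of the Jaccard similarity between two category sets.
# The businesses and categories below are made up.
def jaccard_similarity(set_a, set_b):
    """Returns |intersection| / |union| for two sets."""
    return len(set_a & set_b) / float(len(set_a | set_b))


business_a = {'Restaurants', 'Mexican', 'Bars'}
business_b = {'Bars', 'Nightlife'}
print(jaccard_similarity(business_a, business_b))  # 1 shared out of 4 -> 0.25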
First we are going to create the binary matrix; in order to do that we have to define several auxiliary functions.
In [16]:
def drop_fields(fields, dictionary_list):
    """
    Removes the specified fields from every dictionary in the list records
    :rtype : void
    :param fields: a list of strings, which contains the fields that are
    going to be removed from every dictionary in the list records
    :param dictionary_list: a list of dictionaries
    """
    for record in dictionary_list:
        for field in fields:
            del (record[field])


def filter_records(records, field, values):
    """
    Returns the records whose value for the given field is one of the
    given values
    """
    filtered_records = [record for record in records if
                        record[field] in values]
    return filtered_records


def add_transpose_list_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified field among all the
    dictionaries in the field, leaving 1 for the values that are present in
    the dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.
    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set |= set(dictionary[field])
    for dictionary in dictionary_list:
        for value in values_set:
            if value in dictionary[field]:
                dictionary[value] = 1
            else:
                dictionary[value] = 0
    return dictionary_list
def add_transpose_single_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified field among all the
    dictionaries in the field, leaving 1 for the values that are present in
    the dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.
    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set.add(dictionary[field])
    for dictionary in dictionary_list:
        for value in values_set:
            # Compare for equality: the field holds a single value, so a
            # membership test would do substring matching on strings
            if dictionary[field] == value:
                dictionary[value] = 1
            else:
                dictionary[value] = 0
    return dictionary_list
def drop_unwanted_fields(dictionary_list):
    """
    Drops fields that are not useful for data analysis in the business
    data set
    :rtype : void
    :param dictionary_list: the list of dictionaries containing the data
    """
    unwanted_fields = [
        'attributes',
        'business_id',
        'categories',
        'city',
        'full_address',
        'latitude',
        'longitude',
        'hours',
        'name',
        'neighborhoods',
        'open',
        'review_count',
        'stars',
        'state',
        'type'
    ]
    drop_fields(unwanted_fields, dictionary_list)
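To make the behaviour of these helpers concrete, here is a small example with made-up businesses showing what add_transpose_list_column does to the 'categories' field:

# Toy example (made-up data): every distinct category becomes a new binary
# field on each dictionary. Field order in the printed output may vary.
toy_businesses = [
    {'name': 'A', 'categories': ['Restaurants', 'Mexican']},
    {'name': 'B', 'categories': ['Bars']},
]
add_transpose_list_column('categories', toy_businesses)
print(toy_businesses[0])
# {'name': 'A', 'categories': ['Restaurants', 'Mexican'],
#  'Restaurants': 1, 'Mexican': 1, 'Bars': 0}
print(toy_businesses[1])
# {'name': 'B', 'categories': ['Bars'],
#  'Restaurants': 0, 'Mexican': 0, 'Bars': 1}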
Now we can define our function to create the matrix.
In [17]:
def create_category_matrix(file_path):
    """
    Creates a matrix with all the categories for businesses that are
    contained in the Yelp Phoenix Business data set. Each column of the
    matrix represents a category, and each row a business. This is a binary
    matrix that contains a 1 at the position i,j if the business i contains
    the category j, and a 0 otherwise.
    :rtype : numpy array matrix
    :param file_path: the path for the file that contains the businesses
    data
    :return: a numpy array binary matrix
    """
    records = load_json_file(file_path)
    # Now we obtain the categories for all the businesses
    records = add_transpose_list_column('categories', records)
    drop_unwanted_fields(records)
    matrix = numpy.array(
        [numpy.array(record.values()) for record in records])
    return matrix
We will also define another function to create a list of sets containing the categories of each business.
In [18]:
def create_category_sets(file_path):
    """
    Creates a numpy array of sets in which each set contains the
    categories of one business in the Yelp Phoenix Business data set
    :rtype : numpy array
    :param file_path: the path for the file that contains the businesses
    data
    :return: a numpy array of sets with the categories that each
    business has, for example [{'Restaurant', 'Mexican', 'Bar'},
    {'Bar', 'Disco'}]
    """
    records = load_json_file(file_path)
    sets = numpy.array(
        [set(record['categories']) for record in records])
    return sets
In [19]:
business_file_path = 'data/yelp_academic_dataset_business.json'
categories_matrix = create_category_matrix(business_file_path)
cluster_and_evaluate_data(categories_matrix[:100])
In [20]:
import nltk
import numpy
import sklearn.cluster as skcluster
import sklearn.metrics as skmetrics
import time
import scipy.cluster as scicluster
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

__author__ = 'franpena'


class Clusterer:
    def __init__(self):
        pass

    # OK
    @staticmethod
    def k_means_scikit(matrix):
        k_means = skcluster.KMeans(n_clusters=50, init='k-means++',
                                   n_init=1, verbose=1)
        k_means.fit(matrix)
        return k_means.labels_

    # OK
    # With cosine distance the algorithm doesn't converge
    @staticmethod
    def k_means_nltk(matrix):
        clusterer = nltk.KMeansClusterer(50, nltk.euclidean_distance,
                                         avoid_empty_clusters=True,
                                         conv_test=10)
        labels = numpy.array(clusterer.cluster(matrix, True, trace=True))
        return labels

    # OK
    @staticmethod
    def gaac(matrix):
        clusterer = nltk.GAAClusterer()
        labels = numpy.array(clusterer.cluster(matrix, False, trace=True))
        dendrogram = clusterer.dendrogram()
        # dendrogram.show()
        return labels

    # OK
    @staticmethod
    def k_means_scipy(matrix):
        centroids, distortion = scicluster.vq.kmeans(matrix, 50, thresh=0.1)
        print('Centroids:', centroids)
        print('Distortion:', distortion)

    @staticmethod
    def linkage(matrix):
        linkage_matrix = scicluster.hierarchy.linkage(matrix)
        print('Linkage matrix:', linkage_matrix)
        dendrogram = scicluster.hierarchy.dendrogram(linkage_matrix)
        ax = plt.gca()
        xlbls = ax.get_xmajorticklabels()
        plt.show()
        leaves = dendrogram['leaves']
        print(leaves)

    @staticmethod
    def mean_shift(matrix):
        mean_shift = skcluster.MeanShift()
        mean_shift.fit(matrix)
        labels = mean_shift.labels_
        # Number of clusters in labels, ignoring noise if present.
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        print('Estimated number of clusters:', n_clusters_)
        return labels

    @staticmethod
    def ward(matrix):
        ward = skcluster.Ward(n_clusters=50, compute_full_tree=False)
        ward.fit(matrix)
        return ward.labels_

    @staticmethod
    def dbscan(matrix):
        dbscan = skcluster.DBSCAN(eps=0.3, min_samples=50, metric='euclidean')
        # dbscan = skcluster.DBSCAN(eps=0.3, min_samples=50,
        #                           metric=nltk.cosine_distance)
        dbscan.fit(matrix)
        labels = dbscan.labels_
        # Number of clusters in labels, ignoring noise if present.
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        print('Estimated number of clusters:', n_clusters_)
        return labels

    # OK
    @staticmethod
    def evaluate_performance(data, labels, metric='euclidean'):
        score = skmetrics.silhouette_score(data, labels, metric=metric)
        print('Labels:', labels)
        print('Score:', score)
        return score

    @staticmethod
    def cluster_data(matrix, algorithm):
        if algorithm == 'k-means-scikit':
            return Clusterer.k_means_scikit(matrix)
        elif algorithm == 'k-means-nltk':
            return Clusterer.k_means_nltk(matrix)
        # elif algorithm == 'k-means-scipy':
        #     return BusinessClusterer.k_means_scipy(matrix)
        elif algorithm == 'mean-shift':
            return Clusterer.mean_shift(matrix)
        elif algorithm == 'ward':
            return Clusterer.ward(matrix)
        elif algorithm == 'dbscan':
            return Clusterer.dbscan(matrix)

    @staticmethod
    def cluster_and_evaluate_data(matrix, algorithm, metric='euclidean'):
        print('Clustering started using ' + algorithm)
        clustering_start = time.time()
        labels = Clusterer.cluster_data(matrix, algorithm)
        clustering_total = time.time() - clustering_start
        print('Clustering time:', clustering_total)
        evaluation_start = time.time()
        score = Clusterer.evaluate_performance(matrix, labels, metric)
        evaluation_total = time.time() - evaluation_start
        print('Evaluation time:', evaluation_total)
        total = time.time() - clustering_start
        print('Total time:', total)
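As a usage sketch (not a cell executed in this notebook), runs like the ones summarised below could be launched on the full categories matrix as follows; for DBSCAN, min_samples is hard-coded inside the dbscan method, so it has to be edited there to reproduce both configurations:

# Sketch of how the results below could be produced. Running on the full
# categories matrix is slow, and the silhouette evaluation is memory hungry.
Clusterer.cluster_and_evaluate_data(categories_matrix, 'dbscan')
Clusterer.cluster_and_evaluate_data(categories_matrix, 'k-means-scikit')
Clusterer.cluster_and_evaluate_data(categories_matrix, 'k-means-nltk')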
The following are some results after performing clustering on the binary categories matrix:
Clustering using dbscan (min_samples=10): estimated number of clusters: 474, clustering time: 269.36 s, silhouette score: 0.6933, evaluation time: 639.64 s, total time: 909.01 s
Clustering using dbscan (min_samples=50): estimated number of clusters: 48, clustering time: 236.92 s, silhouette score: 0.1529, evaluation time: 124.56 s, total time: 361.48 s
Clustering using k-means-scikit: clustering time: 11.54 s, silhouette score: 0.2838, evaluation time: 107.42 s, total time: 118.97 s
Clustering using k-means-nltk: clustering time: 12.88 s, silhouette score: 0.2456, evaluation time: 120.07 s, total time: 132.95 s
Also to note: in order to have clearer and more interpretable results, a testing framework should be developed that reports the score of each clustering algorithm together with the parameter values used.
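A minimal sketch of such a framework, using DBSCAN and a made-up parameter grid as an example, could loop over the parameter combinations and record the silhouette score obtained with each one:

# Minimal sketch of a testing framework that records the silhouette score of
# each run together with the parameter values used. The parameter grid passed
# in is only an example.
import itertools


def run_dbscan_experiments(matrix, eps_values, min_samples_values):
    results = []
    for eps, min_samples in itertools.product(eps_values, min_samples_values):
        dbscan = skcluster.DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit(matrix).labels_
        num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # The silhouette score is only defined for two or more clusters
        score = (skmetrics.silhouette_score(matrix, labels)
                 if num_clusters > 1 else None)
        results.append({'eps': eps, 'min_samples': min_samples,
                        'clusters': num_clusters, 'score': score})
    for result in results:
        print(result)
    return results


# Example call with a made-up grid:
# run_dbscan_experiments(categories_matrix, [0.3, 0.5], [10, 50])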