Business & Tip Analysis from the Yelp Phoenix Data Set Using Clustering

We are going to analyze two data sets that are related from the Yelp Phoenix data set in order to find patterns that can reveal us valuable informantion. Firstly we will deal with the data from the tips data set

Tip Analysis

For each record in the data set we are going to extract the text of the tip and then process it using TF-IDF, finally we are going to perform some clustering on the resulting data.

ETL

To extract the data, we just have to parse the JSON file, the JSON package will help us to achieve this, its pretty straightforward.


In [11]:
import json

def load_json_file(file_path):
    """
    Builds a list of dictionaries from a JSON file

    :type file_path: string
    :param file_path: the path for the file that contains the businesses
    data
    :return: a list of dictionaries with the data from the files
    """
    records = [json.loads(line) for line in open(file_path)]

    return records

TF-IDF

To apply TF-IDF on the data we are going to use the sklearn package, specifically the class TfidfVectorizer, we will use this class along a Stemmer which will help us group words that have the same root, for instance the words tall, taller and tallest will all be grouped into tall.


In [12]:
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')


class StemmedTfidfVectorizer(TfidfVectorizer):

    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

Once we have defined this class, we can go on and create a matrix which will contain the term frequency-inverse document frequency of each tip.


In [13]:
def tf_idf_tips(file_path):
    records = load_json_file(file_path)
    data = [record['text'] for record in records]
    vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
    vectorized = vectorizer.fit_transform(data)
    num_samples, num_features = vectorized.shape
    print("#samples: %d, #features: %d" % (
        num_samples, num_features))

    return vectorized

Clustering

The matrix we have created in the previous step serves as an input for a clustering algorithm. If we cluster the data we can use it to obtain tips that are similar to a given tip, which can be useful for a search engine.

We are going to define a function that will help us to cluster, and other function to evaluate how good the clusters are using Silhouette score


In [14]:
import nltk
import numpy
import sklearn.cluster as skcluster
import sklearn.metrics as skmetrics
import time
import scipy.cluster as scicluster
import matplotlib.pyplot as plt

def k_means_scikit(matrix):
    k_means = skcluster.KMeans(n_clusters=50, init='k-means++',
                               n_init=1, verbose=0)
    k_means.fit(matrix)

    return k_means.labels_

def evaluate_performance(data, labels, metric='euclidean'):
    score = skmetrics.silhouette_score(data, labels, metric=metric)
    print('Score:', score)

    return score

def cluster_and_evaluate_data(matrix, metric='euclidean'):
    print('Clustering started')
    clustering_start = time.time()
    labels = k_means_scikit(matrix)
    clustering_total = time.time() - clustering_start
    print('Clustering time:', clustering_total)

    evaluation_start = time.time()
    score = evaluate_performance(matrix, labels, metric)
    evaluation_total = time.time() - evaluation_start
    print('Evaluation time:', evaluation_total)
    total = time.time() - clustering_start
    print('Total time:', total)

Given the above functions we can now perform clustering and evaluate how good the clusters are using Silhouette score. The closer the value of the score to 1 the better, if its close to -1 is very bad, an if its close to 0 it means that the clusters are overlapping.

We are going to use just 100 samples because we are executing this example in a navigator, be sure to remove the [:100] when executing this in your machine. For the whole data set, the k-means algorithm was executed successfully but the silhouette score ran out of memory.


In [15]:
tip_file_path = 'data/yelp_academic_dataset_tip.json'
tip_matrix = tf_idf_tips(tip_file_path)
cluster_and_evaluate_data(tip_matrix[:100])


#samples: 113993, #features: 24571
Clustering started
('Clustering time:', 0.20326685905456543)
('Score:', 0.011462435659864297)
('Evaluation time:', 0.04374504089355469)
('Total time:', 0.24753999710083008)

Business Analysis

For each record in the data set we are going to extract the categories associated to that particular business, we will transform all of the data into a binary matrix of N rows and M columns, where N is the number of businesses and M the number of categories. If the business i is in the category j then the cell at i,j will have a 1, otherwise it will have a 0.

We will also create a list of sets that contains the categories of each business. This set will be very useful when clustering using Jaccard similarity.

ETL

First we are going to create the binary matrix, in order to do that we will have to define several auxiliary functions first


In [16]:
def drop_fields(fields, dictionary_list):
    """
    Removes the specified fields from every dictionary in the list records

    :rtype : void
    :param fields: a list of strings, which contains the fields that are
    going to be removed from every dictionary in the list records
    :param dictionary_list: a list of dictionaries
    """
    for record in dictionary_list:
        for field in fields:
            del (record[field])

def filter_records(records, field, values):
    filtered_records = [record for record in records if
                        record[field] in values]
    return filtered_records

def add_transpose_list_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified field among all the
    dictionaries in the field, leaving 1 for the values that are present in
    the dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.

    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """
    values_set = set()
    for dictionary in dictionary_list:
        values_set |= set(dictionary[field])

    for dictionary in dictionary_list:
        for value in values_set:
            if value in dictionary[field]:
                dictionary[value] = 1
            else:
                dictionary[value] = 0

    return dictionary_list

def add_transpose_single_column(field, dictionary_list):
    """
    Takes a list of dictionaries and adds to every dictionary a new field
    for each value contained in the specified field among all the
    dictionaries in the field, leaving 1 for the values that are present in
    the dictionary and 0 for the values that are not. It can be seen as
    transposing the dictionary matrix.

    :param field: the field which is going to be transposed
    :param dictionary_list: a list of dictionaries
    :return: the modified list of dictionaries
    """

    values_set = set()
    for dictionary in dictionary_list:
        values_set.add(dictionary[field])

    for dictionary in dictionary_list:
        for value in values_set:
            if value in dictionary[field]:
                dictionary[value] = 1
            else:
                dictionary[value] = 0

    return dictionary_list

def drop_unwanted_fields(dictionary_list):
    """
    Drops fields that are not useful for data analysis in the business
    data set

    :rtype : void
    :param dictionary_list: the list of dictionaries containing the data
    """
    unwanted_fields = [
        'attributes',
        'business_id',
        'categories',
        'city',
        'full_address',
        'latitude',
        'longitude',
        'hours',
        'name',
        'neighborhoods',
        'open',
        'review_count',
        'stars',
        'state',
        'type'
    ]

    drop_fields(unwanted_fields, dictionary_list)

Now we can define our function to create the matrix


In [17]:
def create_category_matrix(file_path):
    """
    Creates a matrix with all the categories for businesses that are
    contained in the Yelp Phoenix Business data set. Each column of the
    matrix represents a category, and each row a business. This is a binary
    matrix that contains a 1 at the position i,j if the business i contains
    the category j, and a 0 otherwise.

    :rtype : numpy array matrix
    :param file_path: the path for the file that contains the businesses
    data
    :return: a numpy array binary matrix
    """
    records = load_json_file(file_path)

    # Now we obtain the categories for all the businesses
    records = add_transpose_list_column('categories', records)
    drop_unwanted_fields(records)
    matrix = numpy.array(
        [numpy.array(record.values()) for record in records])

    return matrix

We will also define another function to create a list of sets containing the categories for each business


In [18]:
def create_category_sets(file_path):
    """
    Creates an array of arrays in which each sub-array contains the
    categories of each business in the Yelp Phoenix Business data set

    :rtype : numpy array matrix
    :param file_path: the path for the file that contains the businesses
    data
    :return: a numpy array of numpy arrays with the categories that each
    business has, for example [['Restaurant', 'Mexican', 'Bar'],
    ['Bar', 'Disco']]
    """
    records = load_json_file(file_path)
    sets = numpy.array(
        [set(record['categories']) for record in records])

    return sets

Clustering

Now that we have a function that creates the binary matrix that serves as an input to the clustering function, we can go ahead and perform some clustering


In [19]:
business_file_path = 'data/yelp_academic_dataset_business.json'
categories_matrix = create_category_matrix(business_file_path)
cluster_and_evaluate_data(tip_matrix[:100])


Clustering started
('Clustering time:', 0.15648913383483887)
('Score:', 0.01920154263570531)
('Evaluation time:', 0.04501914978027344)
('Total time:', 0.2019340991973877)

Other Clustering Techniques

Among the packages nltk, sklearn and scipy.cluster, there are several other techniques than can be used for clustering. We have gathered some of them in the above class. The fastest algorithm is k-means.


In [20]:
import nltk
import numpy
import sklearn.cluster as skcluster
import sklearn.metrics as skmetrics
import time
import scipy.cluster as scicluster
import matplotlib.pyplot as plt
import scipy.spatial.distance as distance

__author__ = 'franpena'


class Clusterer:
    def __init__(self):
        pass

    # OK
    @staticmethod
    def k_means_scikit(matrix):
        k_means = skcluster.KMeans(n_clusters=50, init='k-means++',
                                   n_init=1, verbose=1)
        k_means.fit(matrix)

        return k_means.labels_

    # OK
    # With cosine distance the algorithm doesn't converge
    @staticmethod
    def k_means_nltk(matrix):
        clusterer = nltk.KMeansClusterer(50, nltk.euclidean_distance,
                                         avoid_empty_clusters=True,
                                         conv_test=10)
        labels = numpy.array(clusterer.cluster(matrix, True, trace=True))

        return labels

    # OK
    @staticmethod
    def gaac(matrix):
        clusterer = nltk.GAAClusterer()
        labels = numpy.array(clusterer.cluster(matrix, False, trace=True))
        dendrogram = clusterer.dendrogram()
        # dendrogram.show()

        return labels

    # OK
    @staticmethod
    def k_means_scipy(matrix):
        centroids, distortion = scicluster.vq.kmeans(matrix, 50, thresh=0.1)

        print('Centroids:', centroids)
        print('Distortion:', distortion)

    @staticmethod
    def linkage(matrix):
        linkage_matrix = scicluster.hierarchy.linkage(matrix)

        print('Linkage matrix:', linkage_matrix)

        dendrogram = scicluster.hierarchy.dendrogram(linkage_matrix)
        ax = plt.gca()
        xlbls = ax.get_xmajorticklabels()
        plt.show()

        leaves = dendrogram['leaves']
        print(leaves)

    @staticmethod
    def mean_shift(matrix):
        mean_shift = skcluster.MeanShift()
        mean_shift.fit(matrix)

        labels = mean_shift.labels_
        # Number of clusters in labels, ignoring noise if present.
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        print('Estimated number of clusters:', n_clusters_)

        return labels

    @staticmethod
    def ward(matrix):
        ward = skcluster.Ward(n_clusters=50, compute_full_tree=False)
        ward.fit(matrix)

        return ward.labels_

    @staticmethod
    def dbscan(matrix):
        dbscan = skcluster.DBSCAN(eps=0.3, min_samples=50, metric='euclidean')
        # dbscan = skcluster.DBSCAN(eps=0.3, min_samples=50,
        #                           metric=nltk.cosine_distance)
        dbscan.fit(matrix)

        labels = dbscan.labels_
        # Number of clusters in labels, ignoring noise if present.
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
        print('Estimated number of clusters:', n_clusters_)

        return labels

    # OK
    @staticmethod
    def evaluate_performance(data, labels, metric='euclidean'):
        score = skmetrics.silhouette_score(data, labels, metric=metric)
        print('Labels:', labels)
        print('Score:', score)

        return score

    @staticmethod
    def cluster_data(matrix, algorithm):

        if algorithm == 'k-means-scikit':
            return Clusterer.k_means_scikit(matrix)
        elif algorithm == 'k-means-nltk':
            return Clusterer.k_means_nltk(matrix)
        # elif algorithm == 'k-means-scipy':
        #     return BusinessClusterer.k_means_scipy(matrix)
        elif algorithm == 'mean-shift':
            return Clusterer.mean_shift(matrix)
        elif algorithm == 'ward':
            return Clusterer.ward(matrix)
        elif algorithm == 'dbscan':
            return Clusterer.dbscan(matrix)

    @staticmethod
    def cluster_and_evaluate_data(matrix, algorithm, metric='euclidean'):
        print('Clustering started using ' + algorithm)
        clustering_start = time.time()
        labels = Clusterer.cluster_data(matrix, algorithm)
        clustering_total = time.time() - clustering_start
        print('Clustering time:', clustering_total)

        evaluation_start = time.time()
        score = Clusterer.evaluate_performance(matrix, labels, metric)
        evaluation_total = time.time() - evaluation_start
        print('Evaluation time:', evaluation_total)
        total = time.time() - clustering_start
        print('Total time:', total)

The following are some results after performing clustering on the binary categories matrix:

Clustering started using dbscan min_samples=10 ('Estimated number of clusters:', 474) ('Clustering time:', 269.36252093315125) ('Score:', 0.69331718210204796) ('Evaluation time:', 639.6426520347595) ('Total time:', 909.005243062973)

Clustering started using dbscan min_samples=50 ('Estimated number of clusters:', 48) ('Clustering time:', 236.92400288581848) ('Score:', 0.15291211349926839) ('Evaluation time:', 124.55789494514465) ('Total time:', 361.48195600509644)

Clustering started using k-means-scikit ('Clustering time:', 11.542982816696167) ('Score:', 0.28384880845352284) ('Evaluation time:', 107.42438888549805) ('Total time:', 118.9674379825592)

Clustering started using k-means-nltk ('Clustering time:', 12.875365018844604) ('Labels:', array([42, 24, 32, ..., 10, 45, 7])) ('Score:', 0.24563920653172072) ('Evaluation time:', 120.07143902778625) ('Total time:', 132.9468560218811)

Also to note:

  • k-means using cosine distance does not converge
  • mean shift algorithm didn't converged
  • There are some clusters algorithms for hierarchy clustering such as ward, linkage and gcaa (uses cosine distance), but the number of clusters they generate is very big and therefore the results are hard to interpret
  • Using dbscan with cosine produced just one cluster
  • None of the algorithms worked using jaccard similarity

In order to have clearer and more interpretable results, a testing framework should be developed, which presents the scores of the clustering algorithms and the values for each parameter.