Place similarity



In [2]:

    
import graphlab

Load data after pre-processing



In [3]:

    
# Read Airport_Code as a string so that codes are never parsed as another type.
hotel = graphlab.SFrame.read_csv('hotels.csv', column_type_hints={'Airport_Code': str})
dest = graphlab.SFrame.read_csv('dest.csv', column_type_hints={'Airport_Code': str})









    



PROGRESS: Finished parsing file /home/anil/Downloads/metripping/hotels.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.686312 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/hotels.csv
PROGRESS: Parsing completed. Parsed 84327 lines in 0.340961 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/dest.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.544257 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/dest.csv
PROGRESS: Parsing completed. Parsed 88069 lines in 0.37135 secs.



In [4]:

    
hotel.dtype()









    Out[4]:





[str, int, str, int, int, float, float]

View the first few rows of the destination data



In [5]:

    
dest.head(5)









    Out[5]:





    
    Airport_Code  Category          Sub_Category          Total_Reviews  Star_Rating
    IAD           Sights Landmarks  Historic Sites        1              5.0
    DCA           Sights Landmarks  Historic Sites        1              5.0
    IAD           Shopping          Gift Specialty Shops  4              4.5
    DCA           Shopping          Gift Specialty Shops  4              4.5
    IAD           Nightlife         Bars Clubs            2              4.0
    

[5 rows x 5 columns]

Show the most popular places in the dataset



In [6]:

    
graphlab.canvas.set_target('ipynb')
dest['Airport_Code'].show()



In [7]:

    
# Combine category and sub-category into a single text field per row.
dest['Total_Category'] = dest['Category'] + " " + dest['Sub_Category']



In [8]:

    
dest['word_count'] = graphlab.text_analytics.count_words(dest['Total_Category'])
dest.remove_columns(['Category', 'Sub_Category'])









    Out[8]:





    
    Airport_Code  Total_Reviews  Star_Rating  Total_Category                       word_count
    IAD           1              5.0          Sights Landmarks Historic Sites ...  {'landmarks': 1, 'historic': 1, 'sights': ...
    DCA           1              5.0          Sights Landmarks Historic Sites ...  {'landmarks': 1, 'historic': 1, 'sights': ...
    IAD           4              4.5          Shopping Gift Specialty Shops ...    {'shops': 1, 'shopping': 1, 'specialty': 1, ...
    DCA           4              4.5          Shopping Gift Specialty Shops ...    {'shops': 1, 'shopping': 1, 'specialty': 1, ...
    IAD           2              4.0          Nightlife Bars Clubs                 {'clubs': 1, 'bars': 1, 'nightlife': 1} ...
    DCA           2              4.0          Nightlife Bars Clubs                 {'clubs': 1, 'bars': 1, 'nightlife': 1} ...
    IAD           4              5.0          Concerts Shows Theaters              {'theaters': 1, 'concerts': 1, 'shows': ...
    DCA           4              5.0          Concerts Shows Theaters              {'theaters': 1, 'concerts': 1, 'shows': ...
    IAD           388            4.0          Concerts Shows Arenas Stadiums ...   {'stadiums': 1, 'arenas': 1, 'concerts': 1, ...
    DCA           388            4.0          Concerts Shows Arenas Stadiums ...   {'stadiums': 1, 'arenas': 1, 'concerts': 1, ...
    

[88069 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
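
The word_count column above is a bag-of-words dictionary built from Total_Category. The following is a minimal pure-Python sketch of that idea, not GraphLab's implementation: it assumes tokens are lowercased and split on whitespace, which agrees with the lowercase keys shown in the output. The function name count_words_sketch is hypothetical.

```python
from collections import Counter


def count_words_sketch(text):
    """Lowercase the text, split it on whitespace, and count each token.

    Assumption: this mirrors only the visible behavior of the word_count
    column (lowercase keys, one count per word); it is not GraphLab code.
    """
    return dict(Counter(text.lower().split()))


# Same concatenation as the Total_Category column, on one sample row.
category = "Nightlife"
sub_category = "Bars Clubs"
total_category = category + " " + sub_category

print(count_words_sketch(total_category))
```

A dictionary per row keeps the representation sparse, which matters when most categories use only a handful of distinct words.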



In [9]:

    
dest.head(5)









    Out[9]:





    
    Airport_Code  Total_Reviews  Star_Rating  Total_Category                       word_count
    IAD           1              5.0          Sights Landmarks Historic Sites ...  {'landmarks': 1, 'historic': 1, 'sights': ...
    DCA           1              5.0          Sights Landmarks Historic Sites ...  {'landmarks': 1, 'historic': 1, 'sights': ...
    IAD           4              4.5          Shopping Gift Specialty Shops ...    {'shops': 1, 'shopping': 1, 'specialty': 1, ...
    DCA           4              4.5          Shopping Gift Specialty Shops ...    {'shops': 1, 'shopping': 1, 'specialty': 1, ...
    IAD           2              4.0          Nightlife Bars Clubs                 {'clubs': 1, 'bars': 1, 'nightlife': 1} ...
    

[5 rows x 5 columns]

Find similar places without using the hotel data



In [10]:

    
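# Arguments are (observation_data, user_id, item_id): each category string
# acts as a "user" and each airport as an "item".
m = graphlab.recommender.ranking_factorization_recommender.create(
    dest,
    user_id='Total_Category',
    item_id='Airport_Code')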









    



PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 88069 observations with 185 users and 353 items.
PROGRESS:     Data prepared in: 1.00394s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | adagrad  |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | binary_target                  | Assume Binary Targets                            | True     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 25       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 11008 / 88069 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 10                | Not Viable                               |
PROGRESS: | 1       | 2.5               | Not Viable                               |
PROGRESS: | 2       | 0.625             | Not Viable                               |
PROGRESS: | 3       | 0.15625           | 0.684611                                 |
PROGRESS: | 4       | 0.078125          | 1.17557                                  |
PROGRESS: | 5       | 0.0390625         | 1.32726                                  |
PROGRESS: | 6       | 0.0195312         | No Decrease (1.411 >= 1.38644)           |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.15625           | 0.684611                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Initial | 222us        | 1.38642           | 0.693089                          |             |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | 1       | 700.801ms    | 0.32705           | 0.143126                          | 0.15625     |
PROGRESS: | 2       | 1.44s        | 0.227982          | 0.106854                          | 0.15625     |
PROGRESS: | 3       | 2.20s        | 0.202419          | 0.0963944                         | 0.15625     |
PROGRESS: | 4       | 3.07s        | 0.182567          | 0.0872886                         | 0.15625     |
PROGRESS: | 5       | 3.89s        | 0.16963           | 0.0819986                         | 0.15625     |
PROGRESS: | 6       | 4.53s        | 0.158894          | 0.0768928                         | 0.15625     |
PROGRESS: | 10      | 7.18s        | 0.131314          | 0.0638484                         | 0.15625     |
PROGRESS: | 11      | 7.81s        | 0.126126          | 0.0614172                         | 0.15625     |
PROGRESS: | 15      | 10.42s       | 0.113745          | 0.0556437                         | 0.15625     |
PROGRESS: | 20      | 13.76s       | 0.102346          | 0.0503294                         | 0.15625     |
PROGRESS: | 25      | 17.05s       | 0.0935713         | 0.0456336                         | 0.15625     |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training Predictive Error.
PROGRESS:        Final objective value: 0.517848
PROGRESS:        Final training Predictive Error: 0.0439802



In [247]:

    
# Destinations most similar to 'IAD'
m.get_similar_items(['IAD'])









    



PROGRESS: Getting similar items completed in 0.000989






    Out[247]:





    
    Airport_Code  similar  distance       rank
    IAD           AUH      1.45231065154  1
    IAD           VLI      1.4506803751   2
    IAD           CXR      1.403313905    3
    IAD           ADD      1.38968846202  4
    IAD           KWI      1.38657107949  5
    IAD           CEB      1.37112155557  6
    IAD           SKP      1.36449471116  7
    IAD           KBV      1.35761326551  8
    IAD           PMV      1.34921184182  9
    IAD           ADB      1.34175089002  10
    

[10 rows x 4 columns]
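
As a cross-check on the factorization model, place similarity can also be estimated directly from the category words. The sketch below is a swapped-in baseline, not what get_similar_items computes: it sums the word counts of every row for an airport into one profile and ranks other airports by cosine similarity. The helper names and the airport code "XXX" are hypothetical; the IAD and DCA rows are taken from the sample output above.

```python
import math
from collections import Counter, defaultdict


def airport_profiles(rows):
    """Sum the word counts of all (airport, category_text) rows per airport."""
    profiles = defaultdict(Counter)
    for airport, text in rows:
        profiles[airport].update(text.lower().split())
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(profiles, airport, k=10):
    """Return up to k (airport, similarity) pairs, most similar first."""
    target = profiles[airport]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in profiles.items() if other != airport]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


rows = [
    ("IAD", "Sights Landmarks Historic Sites"),
    ("IAD", "Nightlife Bars Clubs"),
    ("DCA", "Sights Landmarks Historic Sites"),
    ("XXX", "Shopping Gift Specialty Shops"),  # hypothetical airport code
]

profiles = airport_profiles(rows)
print(most_similar(profiles, "IAD"))
```

Here a higher value means more similar, with 1.0 for identical category mixes and 0.0 for airports that share no category words.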






In [249]:

    
hotel.head(5)









    Out[249]:





    
    Airport_Code  Hotel_ID  Property_Type    Star_Ranking  Total_Reviews  Hotel_Score  Average_Price
    DXB           275       Apartment Hotel  4             3403           8.0          27318.2575
    DXB           276       Resort           5             4321           8.5          130560.0875
    DXB           277       Hotel            4             243            6.3          28930.3675
    DXB           278       Hotel            4             1010           7.7          21239.75
    DXB           279       Hotel            4             1857           7.6          26382.0
    

[5 rows x 7 columns]



In [250]:

    
hotel['word_count'] = graphlab.text_analytics.count_words(hotel['Property_Type'])



In [251]:

    
hotel['Reviews_Trust_Label'] = (hotel['Total_Reviews']/hotel['Hotel_Score'])
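
The ratio above is undefined for a hotel whose Hotel_Score is zero or missing. Whether such rows exist in hotels.csv is an assumption; nothing in this notebook confirms it. A minimal sketch of a guarded version of the same ratio, with a hypothetical helper name:

```python
def reviews_trust_label(total_reviews, hotel_score):
    """Return total_reviews / hotel_score, or None when the score is unusable.

    Assumption: zero or missing scores may occur in the data; returning None
    marks those rows so they can be filtered before model training.
    """
    if hotel_score is None or hotel_score == 0:
        return None
    return total_reviews / float(hotel_score)


# Sample row from the hotel data: 3403 reviews with a score of 8.0.
print(reviews_trust_label(3403, 8.0))

# A zero score yields None instead of an error or an infinite value.
print(reviews_trust_label(120, 0))
```

Returning None rather than a sentinel number keeps an unusable row from looking like a valid feature value.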



In [252]:

    
hotel.head(2)









    Out[252]:





    
    Airport_Code  Hotel_ID  Property_Type    Star_Ranking  Total_Reviews  Hotel_Score  Average_Price
    DXB           275       Apartment Hotel  4             3403           8.0          27318.2575
    DXB           276       Resort           5             4321           8.5          130560.0875

    word_count                        Reviews_Trust_Label
    {'apartment': 1, 'hotel': 1} ...  425.375
    {'resort': 1}                     508.352941176
    

[2 rows x 9 columns]



In [253]:

    
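# Hotels act as "users" and airports as "items"; dest is passed as item side data.
mm = graphlab.recommender.ranking_factorization_recommender.create(
    hotel,
    user_id='Hotel_ID',
    item_id='Airport_Code',
    item_data=dest)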









    



PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 84327 observations with 84327 users and 354 items.
PROGRESS:     Data prepared in: 1.50649s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | adagrad  |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | binary_target                  | Assume Binary Targets                            | True     |
PROGRESS: | side_data_factorization        | Assign Factors for Side Data                     | True     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 25       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 10540 / 84327 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 3.84615           | Not Viable                               |
PROGRESS: | 1       | 0.961538          | Not Viable                               |
PROGRESS: | 2       | 0.240385          | Not Viable                               |
PROGRESS: | 3       | 0.0600962         | Not Viable                               |
PROGRESS: | 4       | 0.015024          | Not Viable                               |
PROGRESS: | 5       | 0.00375601        | Not Viable                               |
PROGRESS: | 6       | 0.000939002       | Not Viable                               |
PROGRESS: | 7       | 0.000234751       | Not Viable                               |
PROGRESS: | 8       | 5.86877e-05       | Not Viable                               |
PROGRESS: | 9       | 1.46719e-05       | Not Viable                               |
PROGRESS: | 10      | 3.66798e-06       | Not Viable                               |
PROGRESS: | 11      | 9.16995e-07       | Not Viable                               |
PROGRESS: | 12      | 2.29249e-07       | Not Viable                               |
PROGRESS: | 13      | 5.73122e-08       | Not Viable                               |
PROGRESS: | 14      | 1.4328e-08        | Not Viable                               |
PROGRESS: | 15      | 3.58201e-09       | Not Viable                               |
PROGRESS: | 16      | 8.95502e-10       | Not Viable                               |
PROGRESS: | 17      | 2.23876e-10       | Not Viable                               |
PROGRESS: | 18      | 5.59689e-11       | Not Viable                               |
PROGRESS: | 19      | 1.39922e-11       | Not Viable                               |
PROGRESS: | 20      | 3.49806e-12       | Not Viable                               |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.005             | Unknown                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: WARNING: Having difficulty finding viable stepsize; Model may be at optimum. Continuing with small step size.
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Initial | 207us        | 1.79769e+308      | 1.79769e+308                      |             |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | 1       | 3.681ms      | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 2       | 13.549ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 3       | 23.681ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 4       | 27.055ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: Optimization Complete: Convergence on objective within bounds.
PROGRESS: Computing final objective value and training Predictive Error.
PROGRESS:        Final objective value: 1.79769e+308
PROGRESS:        Final training Predictive Error: 1.79769e+308
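
The log above reports 1.79769e+308 for both the objective and the training error, and no step size was found viable. That figure is the largest finite double-precision value printed to six significant digits, so it is most likely a placeholder rather than a real loss. A minimal sketch of a check for this case, assuming the value has been copied from the log by hand; the function name and the tolerance are hypothetical choices:

```python
import sys


def objective_is_usable(value, rel_tol=1e-5):
    """Return False for NaN, infinities, or values at the double-precision maximum.

    rel_tol allows for the rounding in the printed log, where the maximum
    double appears as 1.79769e+308.
    """
    if value != value:  # NaN is the only value not equal to itself
        return False
    if value in (float("inf"), float("-inf")):
        return False
    return value < sys.float_info.max * (1.0 - rel_tol)


# Final objective of the first model, which trained normally.
print(objective_is_usable(0.517848))

# Final objective of the second model, as printed in the log.
print(objective_is_usable(float("1.79769e+308")))
```

When the check fails, the inputs are worth inspecting before the model's similarity output is relied on, for example for non-finite or very large feature values.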



In [254]:

    
# Places most similar to "DXB" after adding the hotel data
mm.get_similar_items(["DXB"])









    



PROGRESS: Getting similar items completed in 0.007532






    Out[254]:





    
    Airport_Code  similar  distance       rank
    DXB           BIO      1.70740884542  1
    DXB           CEI      1.61200469732  2
    DXB           TPE      1.61046296358  3
    DXB           AMS      1.59781980515  4
    DXB           LHE      1.51933383942  5
    DXB           GRU      1.47608801723  6
    DXB           PTY      1.42839789391  7
    DXB           NAN      1.38500064611  8
    DXB           AAN      1.37855488062  9
    DXB           BUF      1.37443852425  10
    

[10 rows x 4 columns]



In [255]:

    
# Places most similar to "SEZ"
mm.get_similar_items(["SEZ"])









    



PROGRESS: Getting similar items completed in 0.001321






    Out[255]:





    
    Airport_Code  similar  distance       rank
    SEZ           GDL      1.53258872032  1
    SEZ           CUN      1.51108670235  2
    SEZ           AKL      1.4861626327   3
    SEZ           ACC      1.45462328196  4
    SEZ           LAX      1.42601782084  5
    SEZ           DKR      1.40009027719  6
    SEZ           BTH      1.38972005248  7
    SEZ           CDG      1.38469335437  8
    SEZ           ABJ      1.38048645854  9
    SEZ           OPO      1.37913474441  10
    

[10 rows x 4 columns]



In [256]:

    
# Places most similar to "MXP"
mm.get_similar_items(['MXP'])









    



PROGRESS: Getting similar items completed in 0.000954






    Out[256]:





    
    Airport_Code  similar  distance       rank
    MXP           CEB      1.70715749264  1
    MXP           MFM      1.58472329378  2
    MXP           SAN      1.57058191299  3
    MXP           VTE      1.54280465841  4
    MXP           JDH      1.46457794309  5
    MXP           LIM      1.46417695284  6
    MXP           KTM      1.45444214344  7
    MXP           CPT      1.44507962465  8
    MXP           FAO      1.42304974794  9
    MXP           AUA      1.39572271705  10
    

[10 rows x 4 columns]
