Place similarity


In [2]:
import graphlab

Load data after pre-processing


In [3]:
hotel = graphlab.SFrame.read_csv('hotels.csv', column_type_hints={'Airport_Code': str})
dest = graphlab.SFrame.read_csv('dest.csv', column_type_hints={'Airport_Code': str})


PROGRESS: Finished parsing file /home/anil/Downloads/metripping/hotels.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.686312 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/hotels.csv
PROGRESS: Parsing completed. Parsed 84327 lines in 0.340961 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/dest.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.544257 secs.
PROGRESS: Finished parsing file /home/anil/Downloads/metripping/dest.csv
PROGRESS: Parsing completed. Parsed 88069 lines in 0.37135 secs.

In [4]:
hotel.dtype()


Out[4]:
[str, int, str, int, int, float, float]

View a few rows of the destination data


In [5]:
dest.head(5)


Out[5]:
Airport_Code Category Sub_Category Total_Reviews Star_Rating
IAD Sights Landmarks Historic Sites 1 5.0
DCA Sights Landmarks Historic Sites 1 5.0
IAD Shopping Gift Specialty Shops 4 4.5
DCA Shopping Gift Specialty Shops 4 4.5
IAD Nightlife Bars Clubs 2 4.0
[5 rows x 5 columns]

In [6]:
graphlab.canvas.set_target('ipynb')
dest['Airport_Code'].show()



In [7]:
dest['Total_Category'] = dest['Category'] +" "+ dest['Sub_Category']

In [8]:
dest['word_count'] = graphlab.text_analytics.count_words(dest['Total_Category'])
dest.remove_columns(['Category', 'Sub_Category'])


Out[8]:
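+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+
| Airport_Code | Total_Reviews | Star_Rating | Total_Category                      | word_count                                      |
+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+
| IAD          | 1             | 5.0         | Sights Landmarks Historic Sites ... | {'landmarks': 1, 'historic': 1, 'sights': ...   |
| DCA          | 1             | 5.0         | Sights Landmarks Historic Sites ... | {'landmarks': 1, 'historic': 1, 'sights': ...   |
| IAD          | 4             | 4.5         | Shopping Gift Specialty Shops ...   | {'shops': 1, 'shopping': 1, 'specialty': 1, ... |
| DCA          | 4             | 4.5         | Shopping Gift Specialty Shops ...   | {'shops': 1, 'shopping': 1, 'specialty': 1, ... |
| IAD          | 2             | 4.0         | Nightlife Bars Clubs                | {'clubs': 1, 'bars': 1, 'nightlife': 1} ...     |
| DCA          | 2             | 4.0         | Nightlife Bars Clubs                | {'clubs': 1, 'bars': 1, 'nightlife': 1} ...     |
| IAD          | 4             | 5.0         | Concerts Shows Theaters             | {'theaters': 1, 'concerts': 1, 'shows': ...     |
| DCA          | 4             | 5.0         | Concerts Shows Theaters             | {'theaters': 1, 'concerts': 1, 'shows': ...     |
| IAD          | 388           | 4.0         | Concerts Shows Arenas Stadiums ...  | {'stadiums': 1, 'arenas': 1, 'concerts': 1, ... |
| DCA          | 388           | 4.0         | Concerts Shows Arenas Stadiums ...  | {'stadiums': 1, 'arenas': 1, 'concerts': 1, ... |
+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+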
[88069 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
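
To make the `word_count` column concrete, here is a minimal pure-Python sketch that builds dictionaries of the same shape. It assumes the text is lowercased and split on whitespace, which is consistent with the lowercase keys shown above. `count_words_sketch` is a hypothetical helper written for illustration, not GraphLab's implementation.

```python
from collections import Counter


def count_words_sketch(text):
    """Return a {word: count} dict for one string (hypothetical helper)."""
    # Assumption: lowercase first, then split on whitespace.
    return dict(Counter(text.lower().split()))


counts = count_words_sketch("Sights Landmarks Historic Sites")
print(counts)
```

Each distinct word in the combined category string becomes a key, and a repeated word would get a count above 1.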


In [9]:
dest.head(5)


Out[9]:
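+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+
| Airport_Code | Total_Reviews | Star_Rating | Total_Category                      | word_count                                      |
+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+
| IAD          | 1             | 5.0         | Sights Landmarks Historic Sites ... | {'landmarks': 1, 'historic': 1, 'sights': ...   |
| DCA          | 1             | 5.0         | Sights Landmarks Historic Sites ... | {'landmarks': 1, 'historic': 1, 'sights': ...   |
| IAD          | 4             | 4.5         | Shopping Gift Specialty Shops ...   | {'shops': 1, 'shopping': 1, 'specialty': 1, ... |
| DCA          | 4             | 4.5         | Shopping Gift Specialty Shops ...   | {'shops': 1, 'shopping': 1, 'specialty': 1, ... |
| IAD          | 2             | 4.0         | Nightlife Bars Clubs                | {'clubs': 1, 'bars': 1, 'nightlife': 1} ...     |
+--------------+---------------+-------------+-------------------------------------+-------------------------------------------------+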
[5 rows x 5 columns]

Find the similarity between places without adding the hotel data


In [10]:
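# Each combined category string acts as a "user" and each airport code as an "item".
m = graphlab.recommender.ranking_factorization_recommender.create(
    dest,
    user_id='Total_Category',
    item_id='Airport_Code')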


PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 88069 observations with 185 users and 353 items.
PROGRESS:     Data prepared in: 1.00394s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | adagrad  |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | binary_target                  | Assume Binary Targets                            | True     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 25       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 11008 / 88069 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 10                | Not Viable                               |
PROGRESS: | 1       | 2.5               | Not Viable                               |
PROGRESS: | 2       | 0.625             | Not Viable                               |
PROGRESS: | 3       | 0.15625           | 0.684611                                 |
PROGRESS: | 4       | 0.078125          | 1.17557                                  |
PROGRESS: | 5       | 0.0390625         | 1.32726                                  |
PROGRESS: | 6       | 0.0195312         | No Decrease (1.411 >= 1.38644)           |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.15625           | 0.684611                                 |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Initial | 222us        | 1.38642           | 0.693089                          |             |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | 1       | 700.801ms    | 0.32705           | 0.143126                          | 0.15625     |
PROGRESS: | 2       | 1.44s        | 0.227982          | 0.106854                          | 0.15625     |
PROGRESS: | 3       | 2.20s        | 0.202419          | 0.0963944                         | 0.15625     |
PROGRESS: | 4       | 3.07s        | 0.182567          | 0.0872886                         | 0.15625     |
PROGRESS: | 5       | 3.89s        | 0.16963           | 0.0819986                         | 0.15625     |
PROGRESS: | 6       | 4.53s        | 0.158894          | 0.0768928                         | 0.15625     |
PROGRESS: | 10      | 7.18s        | 0.131314          | 0.0638484                         | 0.15625     |
PROGRESS: | 11      | 7.81s        | 0.126126          | 0.0614172                         | 0.15625     |
PROGRESS: | 15      | 10.42s       | 0.113745          | 0.0556437                         | 0.15625     |
PROGRESS: | 20      | 13.76s       | 0.102346          | 0.0503294                         | 0.15625     |
PROGRESS: | 25      | 17.05s       | 0.0935713         | 0.0456336                         | 0.15625     |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: Optimization Complete: Maximum number of passes through the data reached.
PROGRESS: Computing final objective value and training Predictive Error.
PROGRESS:        Final objective value: 0.517848
PROGRESS:        Final training Predictive Error: 0.0439802

In [247]:
# most similar destinations to 'IAD'
m.get_similar_items(['IAD'])


PROGRESS: Getting similar items completed in 0.000989
Out[247]:
Airport_Code similar distance rank
IAD AUH 1.45231065154 1
IAD VLI 1.4506803751 2
IAD CXR 1.403313905 3
IAD ADD 1.38968846202 4
IAD KWI 1.38657107949 5
IAD CEB 1.37112155557 6
IAD SKP 1.36449471116 7
IAD KBV 1.35761326551 8
IAD PMV 1.34921184182 9
IAD ADB 1.34175089002 10
[10 rows x 4 columns]
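
The notebook does not say how `get_similar_items` scores pairs of airports. One common way to compare items in a factorization model is the cosine similarity of their learned factor vectors, and the sketch below illustrates that idea under that assumption. The item names and 3-dimensional vectors are made up (the model above uses 32 factors), and the sketch is not claimed to reproduce the values in the `distance` column, which exceed 1 and so are not plain cosine similarities.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, factors, k=10):
    """Rank every other item by cosine similarity to `query` (hypothetical helper)."""
    scores = [(other, cosine_similarity(factors[query], vec))
              for other, vec in factors.items() if other != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up factor vectors for placeholder items, for illustration only.
toy_factors = {
    "AAA": [1.0, 0.0, 0.0],
    "BBB": [0.9, 0.1, 0.0],
    "CCC": [0.0, 1.0, 0.0],
    "DDD": [-1.0, 0.0, 0.0],
}
ranking = most_similar("AAA", toy_factors, k=3)
print(ranking)
```

Items whose vectors point in nearly the same direction as the query rank first, and items pointing the opposite way rank last.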


In [249]:
hotel.head(5)


Out[249]:
Airport_Code Hotel_ID Property_Type Star_Ranking Total_Reviews Hotel_Score Average_Price
DXB 275 Apartment Hotel 4 3403 8.0 27318.2575
DXB 276 Resort 5 4321 8.5 130560.0875
DXB 277 Hotel 4 243 6.3 28930.3675
DXB 278 Hotel 4 1010 7.7 21239.75
DXB 279 Hotel 4 1857 7.6 26382.0
[5 rows x 7 columns]


In [250]:
hotel['word_count'] = graphlab.text_analytics.count_words(hotel['Property_Type'])

In [251]:
hotel['Reviews_Trust_Label'] = (hotel['Total_Reviews']/hotel['Hotel_Score'])

In [252]:
hotel.head(2)


Out[252]:
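+--------------+----------+-----------------+--------------+---------------+-------------+---------------+
| Airport_Code | Hotel_ID | Property_Type   | Star_Ranking | Total_Reviews | Hotel_Score | Average_Price |
+--------------+----------+-----------------+--------------+---------------+-------------+---------------+
| DXB          | 275      | Apartment Hotel | 4            | 3403          | 8.0         | 27318.2575    |
| DXB          | 276      | Resort          | 5            | 4321          | 8.5         | 130560.0875   |
+--------------+----------+-----------------+--------------+---------------+-------------+---------------+
+----------------------------------+---------------------+
| word_count                       | Reviews_Trust_Label |
+----------------------------------+---------------------+
| {'apartment': 1, 'hotel': 1} ... | 425.375             |
| {'resort': 1}                    | 508.352941176       |
+----------------------------------+---------------------+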
[2 rows x 9 columns]
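
As a worked check of `Reviews_Trust_Label`, the sketch below recomputes the two values shown above from `Total_Reviews` and `Hotel_Score`. The guard for a zero or missing score is an added assumption, since the notebook does not show how such rows are handled; `reviews_trust_label` is a hypothetical helper.

```python
def reviews_trust_label(total_reviews, hotel_score):
    """Reviews per score point; None when the score is zero or missing."""
    if not hotel_score:
        # Assumption: skip rows that would otherwise divide by zero.
        return None
    return total_reviews / float(hotel_score)


# (Total_Reviews, Hotel_Score) for hotels 275 and 276 from the table above.
rows = [(3403, 8.0), (4321, 8.5)]
labels = [reviews_trust_label(n, s) for n, s in rows]
print(labels)
```

Because the label divides a review count by a score, a hotel with many reviews and a low score gets a larger value than one with the same reviews and a higher score.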


In [253]:
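# Each hotel acts as a "user" and each airport code as an "item";
# the destination table is passed as item side data.
mm = graphlab.recommender.ranking_factorization_recommender.create(
    hotel,
    user_id='Hotel_ID',
    item_id='Airport_Code',
    item_data=dest)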


PROGRESS: Recsys training: model = ranking_factorization_recommender
PROGRESS: Preparing data set.
PROGRESS:     Data has 84327 observations with 84327 users and 354 items.
PROGRESS:     Data prepared in: 1.50649s
PROGRESS: Training ranking_factorization_recommender for recommendations.
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | Parameter                      | Description                                      | Value    |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS: | num_factors                    | Factor Dimension                                 | 32       |
PROGRESS: | regularization                 | L2 Regularization on Factors                     | 1e-09    |
PROGRESS: | solver                         | Solver used for training                         | adagrad  |
PROGRESS: | linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
PROGRESS: | binary_target                  | Assume Binary Targets                            | True     |
PROGRESS: | side_data_factorization        | Assign Factors for Side Data                     | True     |
PROGRESS: | max_iterations                 | Maximum Number of Iterations                     | 25       |
PROGRESS: +--------------------------------+--------------------------------------------------+----------+
PROGRESS:   Optimizing model using SGD; tuning step size.
PROGRESS:   Using 10540 / 84327 points for tuning the step size.
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Attempt | Initial Step Size | Estimated Objective Value                |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | 0       | 3.84615           | Not Viable                               |
PROGRESS: | 1       | 0.961538          | Not Viable                               |
PROGRESS: | 2       | 0.240385          | Not Viable                               |
PROGRESS: | 3       | 0.0600962         | Not Viable                               |
PROGRESS: | 4       | 0.015024          | Not Viable                               |
PROGRESS: | 5       | 0.00375601        | Not Viable                               |
PROGRESS: | 6       | 0.000939002       | Not Viable                               |
PROGRESS: | 7       | 0.000234751       | Not Viable                               |
PROGRESS: | 8       | 5.86877e-05       | Not Viable                               |
PROGRESS: | 9       | 1.46719e-05       | Not Viable                               |
PROGRESS: | 10      | 3.66798e-06       | Not Viable                               |
PROGRESS: | 11      | 9.16995e-07       | Not Viable                               |
PROGRESS: | 12      | 2.29249e-07       | Not Viable                               |
PROGRESS: | 13      | 5.73122e-08       | Not Viable                               |
PROGRESS: | 14      | 1.4328e-08        | Not Viable                               |
PROGRESS: | 15      | 3.58201e-09       | Not Viable                               |
PROGRESS: | 16      | 8.95502e-10       | Not Viable                               |
PROGRESS: | 17      | 2.23876e-10       | Not Viable                               |
PROGRESS: | 18      | 5.59689e-11       | Not Viable                               |
PROGRESS: | 19      | 1.39922e-11       | Not Viable                               |
PROGRESS: | 20      | 3.49806e-12       | Not Viable                               |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: | Final   | 0.005             | Unknown                                  |
PROGRESS: +---------+-------------------+------------------------------------------+
PROGRESS: WARNING: Having difficulty finding viable stepsize; Model may be at optimum. Continuing with small step size.
PROGRESS: Starting Optimization.
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Iter.   | Elapsed Time | Approx. Objective | Approx. Training Predictive Error | Step Size   |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | Initial | 207us        | 1.79769e+308      | 1.79769e+308                      |             |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: | 1       | 3.681ms      | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 2       | 13.549ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 3       | 23.681ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: | 4       | 27.055ms     | 1.79769e+308      | 1.79769e+308                      | 0.005       |
PROGRESS: +---------+--------------+-------------------+-----------------------------------+-------------+
PROGRESS: Optimization Complete: Convergence on objective within bounds.
PROGRESS: Computing final objective value and training Predictive Error.
PROGRESS:        Final objective value: 1.79769e+308
PROGRESS:        Final training Predictive Error: 1.79769e+308
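
In the log above no step size is ever viable and the objective stays at 1.79769e+308, the largest finite double, so this second model does not appear to have trained. One possible cause, which is an assumption and not something the notebook confirms, is that large unscaled numeric columns such as `Average_Price` are used by the model as additional features. The sketch below shows min-max scaling on the five `Average_Price` values from `hotel.head(5)`; `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values):
    """Rescale numbers linearly into the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    span = float(hi - lo)
    return [(v - lo) / span for v in values]


# Average_Price for the first five hotels shown earlier in the notebook.
average_price = [27318.2575, 130560.0875, 28930.3675, 21239.75, 26382.0]
scaled = min_max_scale(average_price)
print(scaled)
```

After scaling, the cheapest hotel maps to 0.0 and the most expensive to 1.0, which keeps the feature on a similar scale to the ratings and scores.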

In [254]:
# most similar places to "DXB" after adding hotel data
mm.get_similar_items(["DXB"])


PROGRESS: Getting similar items completed in 0.007532
Out[254]:
Airport_Code similar distance rank
DXB BIO 1.70740884542 1
DXB CEI 1.61200469732 2
DXB TPE 1.61046296358 3
DXB AMS 1.59781980515 4
DXB LHE 1.51933383942 5
DXB GRU 1.47608801723 6
DXB PTY 1.42839789391 7
DXB NAN 1.38500064611 8
DXB AAN 1.37855488062 9
DXB BUF 1.37443852425 10
[10 rows x 4 columns]


In [255]:
# most similar places to "SEZ"
mm.get_similar_items(["SEZ"])


PROGRESS: Getting similar items completed in 0.001321
Out[255]:
Airport_Code similar distance rank
SEZ GDL 1.53258872032 1
SEZ CUN 1.51108670235 2
SEZ AKL 1.4861626327 3
SEZ ACC 1.45462328196 4
SEZ LAX 1.42601782084 5
SEZ DKR 1.40009027719 6
SEZ BTH 1.38972005248 7
SEZ CDG 1.38469335437 8
SEZ ABJ 1.38048645854 9
SEZ OPO 1.37913474441 10
[10 rows x 4 columns]


In [256]:
# most similar places to "MXP"
mm.get_similar_items(['MXP'])


PROGRESS: Getting similar items completed in 0.000954
Out[256]:
Airport_Code similar distance rank
MXP CEB 1.70715749264 1
MXP MFM 1.58472329378 2
MXP SAN 1.57058191299 3
MXP VTE 1.54280465841 4
MXP JDH 1.46457794309 5
MXP LIM 1.46417695284 6
MXP KTM 1.45444214344 7
MXP CPT 1.44507962465 8
MXP FAO 1.42304974794 9
MXP AUA 1.39572271705 10
[10 rows x 4 columns]