Pair Distances

It is common practice to combine data sets from different sources for further analysis. For example, the files stations1.csv and stations2.csv list different sets of stations, each with latitude and longitude information. Assume that the stations in stations1.csv measure air quality, while the stations in stations2.csv measure weather.

One might assume that the air quality measured by a station in the first data set correlates strongly with the weather conditions recorded by its closest station in the second. To combine the two data sets, we need to determine, for each station in stations1.csv, the closest station in stations2.csv.


In [1]:
# Add the parent directory to the import path (needed for script.preprocess below)
import sys
sys.path.append("../")
import pandas as pd

In [2]:
df1 = pd.read_csv("stations1.csv")
df2 = pd.read_csv("stations2.csv")

In [3]:
df1.head(10)


Out[3]:
station_id latitude longitude
0 dongsi 39.929 116.417
1 tiantan 39.886 116.407
2 guanyuan 39.929 116.339
3 wanshouxigong 39.878 116.352
4 aotizhongxin 39.982 116.397
5 nongzhanguan 39.937 116.461
6 wanliu 39.987 116.287
7 beibuxinqu 40.090 116.174
8 zhiwuyuan 40.002 116.207
9 fengtaihuayuan 39.863 116.279

In [4]:
df2.head(10)


Out[4]:
station_id latitude longitude
0 shunyi 40.126667 116.615278
1 hadian 39.986944 116.290556
2 yanqing 40.449444 115.968889
3 miyun 40.377500 116.864167
4 huairou 40.357778 116.626944
5 shangdianzi 40.658889 117.111667
6 pinggu 40.169444 117.117778
7 tongzhou 39.847500 116.756667
8 chaoyang 39.952500 116.500833
9 pingchang 40.223333 116.211667

Before looking for the closest stations, it helps to understand which stations appear in each data set. The sets_grps function from preprocess.py is designed for this purpose.


In [5]:
from script.preprocess import sets_grps

In [6]:
sets_grps(df1.station_id, df2.station_id)


Common elements in both sets:
{'miyun', 'pingchang', 'huairou', 'tongzhou', 'shunyi', 'mentougou', 'daxing', 'pinggu', 'fangshan'} 9

Elements of set1 not in set2:
{'dongsihuan', 'yufa', 'guanyuan', 'miyunshuiku', 'yungang', 'beibuxinqu', 'donggaocun', 'qianmen', 'zhiwuyuan', 'nansanhuan', 'badaling', 'gucheng', 'wanshouxigong', 'yongledian', 'yongdingmennei', 'liulihe', 'dongsi', 'yanqin', 'aotizhongxin', 'tiantan', 'nongzhanguan', 'xizhimenbei', 'fengtaihuayuan', 'wanliu', 'yizhuang', 'dingling'} 26

Elements of set2 not in set1:
{'zhaitang', 'hadian', 'fengtai', 'beijing', 'shijingshan', 'chaoyang', 'shangdianzi', 'xiayunling', 'yanqing'} 9

The summary above shows that the two data sets share 9 stations, while 26 stations appear only in the first data set and 9 only in the second.
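The source of sets_grps is not shown here, but the same kind of summary can be reproduced with Python's built-in set operations. The following is a minimal sketch under that assumption; summarize_sets is a hypothetical name, not the function from preprocess.py, and the station lists are a small illustrative subset.

```python
def summarize_sets(a, b):
    """Return the common and exclusive elements of two iterables (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    return {
        "common": set_a & set_b,        # elements found in both
        "only_first": set_a - set_b,    # elements of a that are not in b
        "only_second": set_b - set_a,   # elements of b that are not in a
    }

# Small illustrative subset of the station names shown above
first = ["dongsi", "tiantan", "miyun", "shunyi"]
second = ["miyun", "shunyi", "hadian"]

summary = summarize_sets(first, second)
for label, members in summary.items():
    print(label, sorted(members), len(members))
```

With pandas Series such as df1.station_id and df2.station_id, the same helper should work unchanged, since set() accepts any iterable.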

To evaluate the distances between selected features of the two data sets, the pair_dist function from preprocess.py comes in handy.

Pass in the two dataframes of interest, each with a key-value pair (a Python dict) that selects its features. The key is the group label (station_id in this example), and the value lists the features (latitude and longitude) used for the distance evaluation. The distance calculation is based on the cdist function from the scipy package.

The pair_dist function expects the two key-value pairs to list the same number of features, because the distance calculation must refer to a consistent set of features. The returned dataframe has the group labels of the first data set as its index and the group labels of the second data set as its columns.
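To make the description concrete, here is a rough sketch of how such a function could be built on scipy's cdist. It is not the pair_dist implementation from preprocess.py: pair_dist_sketch is a hypothetical name, it takes the label and feature list directly instead of a dict, and it assumes cdist's default Euclidean metric.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def pair_dist_sketch(df_a, df_b, label, features):
    """Pairwise distances between the rows of df_a and df_b (hypothetical helper).

    Rows of the result are labelled by df_a[label], columns by df_b[label].
    """
    # cdist uses the Euclidean metric by default
    dist = cdist(df_a[features].to_numpy(), df_b[features].to_numpy())
    return pd.DataFrame(dist,
                        index=df_a[label].to_numpy(),
                        columns=df_b[label].to_numpy())

# Two stations from each data set, using the coordinates listed earlier
a = pd.DataFrame({"station_id": ["dongsi", "wanliu"],
                  "latitude": [39.929, 39.987],
                  "longitude": [116.417, 116.287]})
b = pd.DataFrame({"station_id": ["hadian", "chaoyang"],
                  "latitude": [39.986944, 39.952500],
                  "longitude": [116.290556, 116.500833]})

dist_df = pair_dist_sketch(a, b, "station_id", ["latitude", "longitude"])
print(dist_df.round(6))
```

Because both dataframes are indexed with the same feature list, the two coordinate arrays always have the same number of columns, which is the consistency requirement described above.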


In [7]:
from script.preprocess import pair_dist

In [8]:
pair_dist(df1, 
          df2, 
          {"station_id": ["latitude", "longitude"]}, 
          {"station_id": ["latitude", "longitude"]})


Out[8]:
shunyi hadian yanqing miyun huairou shangdianzi pinggu tongzhou chaoyang pingchang zhaitang mentougou beijing shijingshan fengtai daxing fangshan xiayunling
dongsi 0.279975 0.139089 0.686779 0.633333 0.477417 1.007621 0.740880 0.349307 0.087065 0.358879 0.726167 0.263851 0.133612 0.212152 0.181485 0.219492 0.271980 0.705502
tiantan 0.318277 0.154107 0.713730 0.671248 0.520528 1.045903 0.765210 0.351780 0.115008 0.389806 0.720161 0.250617 0.101398 0.209485 0.162485 0.175446 0.240945 0.684777
guanyuan 0.339708 0.075528 0.638627 0.690617 0.516490 1.062898 0.815051 0.425544 0.163531 0.320696 0.648334 0.187206 0.179213 0.134402 0.110599 0.210955 0.212827 0.631103
wanshouxigong 0.362147 0.125077 0.687985 0.715412 0.552975 1.089441 0.819363 0.405814 0.166438 0.372758 0.666709 0.195855 0.137700 0.160274 0.107001 0.159408 0.189538 0.629429
aotizhongxin 0.261866 0.106559 0.633864 0.612099 0.440549 0.984341 0.744752 0.383993 0.107943 0.304286 0.704824 0.258402 0.190224 0.195749 0.188418 0.266805 0.291203 0.703651
nongzhanguan 0.244489 0.177611 0.710474 0.597146 0.452318 0.971849 0.696698 0.308916 0.042743 0.379676 0.769662 0.308562 0.131161 0.255781 0.225805 0.242998 0.313174 0.749977
wanliu 0.356754 0.003556 0.561293 0.696858 0.503029 1.063724 0.850575 0.489946 0.216599 0.248050 0.594922 0.164025 0.256918 0.093053 0.123955 0.276733 0.233217 0.604455
beibuxinqu 0.442799 0.155582 0.413849 0.747654 0.526178 1.096747 0.947116 0.631115 0.354579 0.138552 0.495572 0.202988 0.409732 0.150780 0.230994 0.412904 0.317585 0.564337
zhiwuyuan 0.426887 0.084901 0.506856 0.756881 0.550392 1.118000 0.926042 0.570967 0.297974 0.221383 0.515545 0.124933 0.327490 0.059525 0.137171 0.319451 0.229304 0.540659
fengtaihuayuan 0.427321 0.124482 0.663390 0.779186 0.604872 1.151856 0.893004 0.477918 0.239208 0.366570 0.597164 0.125090 0.198760 0.108421 0.034499 0.162911 0.123639 0.554962
yungang 0.558416 0.217824 0.650038 0.906711 0.718489 1.276539 1.031351 0.611119 0.377384 0.404696 0.477892 0.064618 0.323939 0.132499 0.109534 0.233572 0.070110 0.416514
gucheng 0.480861 0.129132 0.577039 0.823079 0.627008 1.189716 0.968087 0.576515 0.319164 0.310568 0.495411 0.038079 0.305153 0.035567 0.075277 0.259284 0.141311 0.480637
fangshan 0.614553 0.289629 0.726914 0.966482 0.787533 1.338884 1.070792 0.629569 0.421205 0.487245 0.500711 0.147197 0.339552 0.212131 0.168514 0.219693 0.065938 0.395671
daxing 0.460051 0.291892 0.851077 0.804173 0.677510 1.177312 0.844560 0.375691 0.253706 0.540698 0.756377 0.300226 0.109757 0.299818 0.219957 0.049559 0.216936 0.663529
yizhuang 0.349205 0.288546 0.846632 0.683805 0.575627 1.055053 0.717273 0.256106 0.157585 0.519713 0.833208 0.361712 0.038207 0.334948 0.271372 0.169718 0.312605 0.768318
tongzhou 0.245353 0.385882 0.894013 0.531075 0.473154 0.893677 0.535876 0.101270 0.175272 0.563467 0.974748 0.506614 0.209394 0.461196 0.418018 0.351035 0.482246 0.935775
shunyi 0.039724 0.390430 0.758102 0.326345 0.232477 0.701035 0.464720 0.297416 0.232847 0.453679 0.974876 0.553028 0.370676 0.486097 0.483507 0.507065 0.581071 0.997458
pingchang 0.395726 0.237892 0.349585 0.654162 0.421169 0.986206 0.889051 0.643357 0.378564 0.019396 0.590176 0.337351 0.475566 0.275611 0.347059 0.513690 0.445388 0.691433
mentougou 0.543449 0.191194 0.530470 0.876845 0.669654 1.237937 1.038135 0.656793 0.395137 0.305208 0.415419 0.070441 0.386295 0.099430 0.154435 0.330784 0.186148 0.420685
pinggu 0.484997 0.824350 1.171888 0.332577 0.519530 0.516021 0.031865 0.452988 0.628722 0.891958 1.417899 0.977517 0.714909 0.916912 0.897178 0.857881 0.978465 1.421199
huairou 0.201735 0.479779 0.670206 0.241298 0.029796 0.586021 0.514803 0.497429 0.396449 0.429288 1.000537 0.645145 0.545443 0.572105 0.596646 0.667973 0.704397 1.070899
miyun 0.325852 0.663245 0.866760 0.033029 0.205419 0.402082 0.349130 0.527903 0.532895 0.637436 1.206647 0.830053 0.670386 0.758642 0.770691 0.807692 0.873598 1.265951
yanqin 0.721318 0.564522 0.004725 0.895356 0.661830 1.158115 1.180343 0.991127 0.728124 0.331944 0.554818 0.594538 0.816037 0.561274 0.643619 0.828004 0.715320 0.760464
dingling 0.428462 0.313109 0.296387 0.649816 0.412226 0.964198 0.906104 0.696844 0.440599 0.069170 0.616234 0.409197 0.546178 0.349810 0.422479 0.588940 0.519587 0.739780
badaling 0.671029 0.484217 0.086580 0.876256 0.638985 1.161463 1.146577 0.926636 0.658145 0.264757 0.490359 0.506059 0.737662 0.475096 0.557622 0.743034 0.626820 0.682803
miyunshuiku 0.475483 0.804458 0.943414 0.130214 0.317224 0.256577 0.389055 0.669530 0.683300 0.751704 1.327087 0.971098 0.821624 0.898741 0.915684 0.958520 1.020218 1.401228
donggaocun 0.505426 0.837114 1.202983 0.377435 0.556375 0.558951 0.069480 0.442456 0.636493 0.916668 1.433336 0.986704 0.713858 0.928183 0.904384 0.855297 0.981866 1.428565
yongledian 0.447302 0.564000 1.098454 0.670431 0.664366 1.002308 0.566861 0.138035 0.370754 0.766736 1.121776 0.650799 0.327374 0.622007 0.560533 0.428607 0.591990 1.042577
yufa 0.683699 0.467040 0.986662 1.026445 0.899313 1.398524 1.044289 0.561961 0.476855 0.708859 0.758557 0.394822 0.332522 0.432988 0.354527 0.205938 0.274295 0.597073
liulihe 0.823050 0.500026 0.870001 1.175921 0.998998 1.549130 1.263674 0.802559 0.624172 0.677260 0.499876 0.345231 0.521061 0.416588 0.380029 0.380584 0.273809 0.298993
qianmen 0.316788 0.136539 0.696103 0.670134 0.514077 1.044530 0.771718 0.365315 0.118587 0.372563 0.706757 0.238875 0.119039 0.194645 0.152452 0.184892 0.237057 0.676262
yongdingmennei 0.334362 0.151689 0.713833 0.687429 0.535138 1.062055 0.781002 0.363785 0.131399 0.392283 0.708572 0.237903 0.102841 0.200096 0.148832 0.162283 0.224791 0.669861
xizhimenbei 0.317360 0.067090 0.624459 0.666895 0.490193 1.038522 0.798396 0.421348 0.151841 0.302326 0.657079 0.203677 0.190730 0.144182 0.133296 0.235452 0.238148 0.648849
nansanhuan 0.366615 0.152132 0.715169 0.719822 0.564653 1.094381 0.812659 0.388760 0.164186 0.399217 0.685984 0.213984 0.113048 0.184284 0.123550 0.138056 0.192608 0.640246
dongsihuan 0.229600 0.198327 0.724475 0.581008 0.442826 0.955752 0.675313 0.288558 0.022367 0.393023 0.791547 0.330603 0.133578 0.277744 0.247456 0.255143 0.333110 0.771678
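The values in this table are consistent with a plain Euclidean distance on the raw coordinates (for dongsi and chaoyang, the latitude and longitude differences of about 0.0235 and 0.0838 give roughly 0.0871), so they are expressed in degrees rather than kilometres. That is enough for ranking nearby stations. If physical distances are needed, a great-circle formula could be substituted. The sketch below uses the haversine formula; haversine_km and the mean Earth radius of 6371 km are illustrative assumptions and not part of preprocess.py.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# dongsi (first data set) to chaoyang (second data set)
d = haversine_km(39.929, 116.417, 39.952500, 116.500833)
print(round(float(d), 2), "km")
```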

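With the distance table in hand, finding the closest stations is straightforward: calling idxmin with axis=1 returns, for each row (a station from the first data set), the column label (a station from the second data set) with the smallest distance.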

This Jupyter notebook is available on my GitHub page as Pair-Distances.ipynb, and it is part of the jqlearning repository.


In [9]:
station_pairs_df = pair_dist(df1, 
                             df2, 
                             {"station_id": ["latitude", "longitude"]}, 
                             {"station_id": ["latitude", "longitude"]})
station_pairs_df.idxmin(axis=1)


Out[9]:
dongsi               chaoyang
tiantan               beijing
guanyuan               hadian
wanshouxigong         fengtai
aotizhongxin           hadian
nongzhanguan         chaoyang
wanliu                 hadian
beibuxinqu          pingchang
zhiwuyuan         shijingshan
fengtaihuayuan        fengtai
yungang             mentougou
gucheng           shijingshan
fangshan             fangshan
daxing                 daxing
yizhuang              beijing
tongzhou             tongzhou
shunyi                 shunyi
pingchang           pingchang
mentougou           mentougou
pinggu                 pinggu
huairou               huairou
miyun                   miyun
yanqin                yanqing
dingling            pingchang
badaling              yanqing
miyunshuiku             miyun
donggaocun             pinggu
yongledian           tongzhou
yufa                   daxing
liulihe              fangshan
qianmen              chaoyang
yongdingmennei        beijing
xizhimenbei            hadian
nansanhuan            beijing
dongsihuan           chaoyang
dtype: object
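To actually combine the two data sets, the nearest-station labels can be turned into a lookup table and merged with the second data set. The sketch below is one possible way to do this with pandas. The small station_pairs_df is a stand-in for the table returned by pair_dist, and the weather dataframe with its temperature column is made up purely for illustration.

```python
import pandas as pd

# Tiny stand-in for the distance table returned by pair_dist
station_pairs_df = pd.DataFrame(
    [[0.139089, 0.087065],
     [0.003556, 0.216599]],
    index=["dongsi", "wanliu"],
    columns=["hadian", "chaoyang"],
)

# For each station of the first data set: the closest station and its distance
nearest = pd.DataFrame({
    "nearest_station": station_pairs_df.idxmin(axis=1),
    "distance": station_pairs_df.min(axis=1),
}).rename_axis("station_id").reset_index()

# Hypothetical weather readings keyed by station of the second data set
weather = pd.DataFrame({"station_id": ["hadian", "chaoyang"],
                        "temperature": [21.5, 22.1]})

# Attach the weather of the closest station to each air quality station
combined = nearest.merge(
    weather.rename(columns={"station_id": "nearest_station"}),
    on="nearest_station",
    how="left",
)
print(combined)
```

Keeping the distance column alongside the label makes it easy to flag matches that are too far apart to be meaningful. It also highlights near-zero distances between differently spelled names, such as yanqin and yanqing (0.004725 in the table above), which may be the same site under two spellings.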