It is common practice to combine data sets from different sources for further analysis. For example, the files stations1.csv and stations2.csv contain different sets of stations, each with latitude and longitude information.
Assume that stations1.csv holds air quality information while stations2.csv holds weather information, each measured by the stations listed in the respective file.
One might assume that the air quality measured by a station in the first data set correlates strongly with the weather conditions recorded by its closest station in the second. To combine the two data sets, we need to determine which stations in stations2.csv are closest to the stations in stations1.csv.
In [1]:
# Import functions
import sys
sys.path.append("../")
import pandas as pd
In [2]:
df1 = pd.read_csv("stations1.csv")
df2 = pd.read_csv("stations2.csv")
In [3]:
df1.head(10)
Out[3]:
In [4]:
df2.head(10)
Out[4]:
Before looking for the closest stations, it helps to understand how the sets of stations in the two data sets overlap. The sets_grps function from the preprocess.py file is designed for this purpose.
In [5]:
from script.preprocess import sets_grps
In [6]:
sets_grps(df1.station_id, df2.station_id)
The summary, as shown above, suggests that both data sets have 9 common stations.
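The implementation of sets_grps is not shown here, but the kind of comparison it reports can be sketched with plain Python sets. The helper name compare_ids and the toy IDs below are assumptions made for illustration; they are not part of preprocess.py.

```python
def compare_ids(left, right):
    """Hypothetical helper: summarise how two collections of IDs overlap."""
    left_set, right_set = set(left), set(right)
    return {
        "common": sorted(left_set & right_set),
        "only_in_first": sorted(left_set - right_set),
        "only_in_second": sorted(right_set - left_set),
    }

# Toy station IDs, invented for illustration only
ids1 = ["A", "B", "C", "D"]
ids2 = ["C", "D", "E"]

summary = compare_ids(ids1, ids2)
for name, members in summary.items():
    print(f"{name}: {len(members)} -> {members}")
```

Because set() accepts any iterable, the same helper should also work when passed pandas columns such as df1.station_id and df2.station_id.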
To evaluate the distances between selected features of two data sets, the pair_dist function from the preprocess.py file comes in handy.
It takes the two dataframes of interest and, for each of them, the selected features as a key-value pair (Python dict). The key is the group label (station_id in this example), and the value lists the features (latitude and longitude) used for the distance evaluation. The distance calculation is based on the cdist function from the scipy package.
The pair_dist function expects the two key-value pairs to be of the same size, because the distance calculation relies on a consistent set of features. The returned dataframe has the group labels of the first data set as its index and the group labels of the second data set as its columns.
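To make the shape of the result concrete, here is a minimal sketch of a labelled distance table built directly with cdist. It is not the actual pair_dist code: the function name pairwise_distance_table, the toy coordinates, and the choice of the Euclidean metric (cdist's default) are all assumptions.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def pairwise_distance_table(left, right, label, features):
    """Hypothetical stand-in for pair_dist: Euclidean distances between rows."""
    distances = cdist(
        left[features].to_numpy(),
        right[features].to_numpy(),
        metric="euclidean",
    )
    # Rows are labelled by the first data set, columns by the second
    return pd.DataFrame(
        distances,
        index=left[label].to_numpy(),
        columns=right[label].to_numpy(),
    )

# Toy coordinates, invented for illustration only
left = pd.DataFrame({"station_id": ["A", "B"],
                     "latitude": [0.0, 3.0],
                     "longitude": [0.0, 4.0]})
right = pd.DataFrame({"station_id": ["X", "Y", "Z"],
                      "latitude": [0.0, 3.0, 6.0],
                      "longitude": [0.0, 4.0, 8.0]})

table = pairwise_distance_table(left, right, "station_id", ["latitude", "longitude"])
print(table)
```

Note that a Euclidean distance on raw latitude and longitude values is only an approximation of the distance on the ground; it is usually adequate for ranking nearby stations within a small region.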
In [7]:
from script.preprocess import pair_dist
In [8]:
pair_dist(df1,
          df2,
          {"station_id": ["latitude", "longitude"]},
          {"station_id": ["latitude", "longitude"]})
Out[8]:
It is straightforward to find the closest stations by using the idxmin function.
This Jupyter notebook is available on my GitHub page: Pair-Distances.ipynb, and it is part of the repository jqlearning.
In [9]:
station_pairs_df = pair_dist(df1,
                             df2,
                             {"station_id": ["latitude", "longitude"]},
                             {"station_id": ["latitude", "longitude"]})
station_pairs_df.idxmin(axis=1)
Out[9]:
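When combining the two data sets, it can be useful to keep the distance to the closest station alongside its label, so that poor matches can be spotted. The sketch below pairs idxmin with min on a toy distance table; the values and the column names closest_station and distance are invented for illustration.

```python
import pandas as pd

# Toy distance table, invented for illustration: rows are stations from the
# first data set, columns are stations from the second.
distances = pd.DataFrame(
    [[0.5, 2.0, 3.5],
     [4.0, 1.5, 0.2]],
    index=["A", "B"],
    columns=["X", "Y", "Z"],
)

# For each row, record the label of the closest column and the distance to it
nearest = pd.DataFrame({
    "closest_station": distances.idxmin(axis=1),
    "distance": distances.min(axis=1),
})
print(nearest)
```

The resulting table is indexed by the stations of the first data set, so it could serve as a lookup when joining weather records onto air quality records.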