Example

The following example demonstrates how to the use the pymatch package to match Lending Club Loan Data. Follow the link to download the dataset from Kaggle (you'll have to create an account, it's fast and free!).

Here we match Lending Club users that fully paid off loans (control) to those that defaulted (test). The example is contrived, however a use case for this could be that we want to analyze user sentiment with the platform. Users that default on loans may have worse sentiment because they are predisposed to a bad situation--influencing their perception of the product. Before analyzing sentiment, we can match users that paid their loans in full to users that defaulted based on the characteristics we can observe. If matching is successful, we could then make a statetment about the causal effect defaulting has on sentiment if we are confident our samples are sufficiently balanced and our model is free from omitted variable bias.

This example, however, only goes through the matching procedure, which can be broken down into the following steps:

Data Preparation
Fit Propensity Score Models
Predict Propensity Scores
Tune Threshold
Match Data
Assess Matches

Data Prep



In [1]:

    
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from pymatch.Matcher import Matcher
import pandas as pd
import numpy as np

Load the dataset (loan.csv), which is the original loan.csv sampled as follows:

path = 'misc/loan_full.csv'
fields = [
    "loan_amnt",
    "funded_amnt",
    "funded_amnt_inv",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "loan_status"
]
data = pd.read_csv(path)[fields]

# Treat long late as Defaulted
data.loc[data['loan_status'] == 'Late (31-120 days)', 'loan_status'] = 'Default'

# Sample 20K records from Fully paid and 2K from default
df = data[data.loan_status == 'Fully Paid'].sample(20000, random_state=42) \
    .append(data[data.loan_status == 'Default'].sample(2000, random_state=42))

df.to_csv('misc/loan.csv', index=False)



In [2]:

    
path = "misc/loan.csv"
data = pd.read_csv(path)

Create test and control groups and reassign loan_status to be a binary treatment indicator. This is our reponse in the logistic regression model(s) used to generate propensity scores.



In [3]:

    
test = data[data.loan_status == "Default"]
control = data[data.loan_status == "Fully Paid"]
test['loan_status'] = 1
control['loan_status'] = 0

`Matcher`

Initalize the Matcher object.

Note that:

Upon intialization, Matcher prints the formula used to fit logistic regression model(s) and the number of records in the majority/minority class.
- The regression model(s) are used to generate propensity scores. In this case, we are using the covariates on the right side of the equation to estimate the probability of defaulting on a loan (loan_status= 1).
Matcher will use all covariates in the dataset unless a formula is specified by the user. Note that this step is only fitting model(s), we assign propensity scores later.
Any covariates passed to the (optional) exclude parameter will be ignored from the model fitting process. This parameter is particularly useful for unique identifiers like a user_id.



In [4]:

    
m = Matcher(test, control, yvar="loan_status", exclude=[])









    



Formula:
loan_status ~ funded_amnt+funded_amnt_inv+grade+installment+int_rate+loan_amnt+sub_grade+term
n majority: 20000
n minority: 2000

There is a significant imbalance in our data--the majority group (fully-paid loans) having many more records than the minority group (defaulted loans). We account for this by setting balance=True when calling Matcher.fit_scores() below. This tells Matcher to sample from the majority group when fitting the logistic regression model(s) so that the groups are of equal size. When undersampling this way, it is highly recommended that nmodels is explictly assigned to a integer much larger than 1. This ensure is that more of the majority group is contributing to the generation of propensity scores. The value of this integer should depend on the severity of the imbalance; here we use nmodels=100.



In [5]:

    
# for reproducibility
np.random.seed(20170925)

m.fit_scores(balance=True, nmodels=100)









    



Fitting Models on Balanced Samples: 100\100
Average Accuracy: 66.06%

The average accuracy of our 100 models is 66.06%, suggesting that there's separability within our data and justifiying the need for the matching procedure. It's worth noting that we don't pay much attention to these logistic models since we are using them as a feature extraction tool (generation of propensity scores). The accuracy is a good way to detect separability at a glance, but we shouldn't spend time tuning and tinkering with these models. If our accuracy was close to 50%, that would suggest we cannot detect much separability in our groups given the features we observe and that matching is probably not necessary (or more features should be included if possible).

Predict Scores



In [6]:

    
m.predict_scores()



In [7]:

    
m.plot_scores()

The plot above demonstrates the separability present in our data. Test profiles have a much higher propensity, or estimated probability of defaulting given the features we isolated in the data.

Tune Threshold

The Matcher.match() method matches profiles that have propensity scores within some threshold.

i.e. for two scores s1 and s2, |s1 - s2| <= threshold

By default matches are found from the majority group for the minority group. For example, if our test group contains 1,000 records and our control group contains 20,000, Matcher will iterate through the test (minority) group and find suitable matches from the control (majority) group. If a record in the minority group has no suitable matches, it is dropped from the final matched dataset. We need to ensure our threshold is small enough such that we get close matches and retain most (or all) of our data in the minority group.

Below we tune the threshold using method="random". This matches a random profile that is within the threshold as there could be many. This is much faster than the alternative method "min", which finds the closest match for every minority record.



In [8]:

    
m.tune_threshold(method='random')

It looks like a threshold of 0.0005 retains enough information in our data. Let's proceed with matching using this threshold.

Match Data

Below we match one record from the majority group to each record in the minority group. This is done with replacement, meaning a single majority record can be matched to multiple minority records. Matcher assigns a unique record_id to each record in the test and control groups so this can be addressed after matching. If susequent modelling is planned, one might consider weighting models using a weight vector of 1/f for each record, f being a record's frequency in the matched dataset. Thankfully Matcher can handle all of this for you :).



In [9]:

    
m.match(method="min", nmatches=1, threshold=0.0005)



In [10]:

    
m.record_frequency()

It looks like the bulk of our matched-majority-group records occur only once, 68 occur twice, ... etc. We can preemptively generate a weight vector using Matcher.assign_weight_vector()



In [11]:

    
m.assign_weight_vector()

Let's take a look at our matched data thus far. Note that in addition to the weight vector, Matcher has also assigned a match_id to each record indicating our (in this cased) paired matches since we use nmatches=1. We can verify that matched records have scores within 0.0001 of each other.



In [12]:

    
m.matched_data.sort_values("match_id").head(6)









    Out[12]:







  
    
      
      record_id
      weight
      funded_amnt
      funded_amnt_inv
      grade
      installment
      int_rate
      loan_amnt
      loan_status
      sub_grade
      term
      scores
      match_id
    
  
  
    
      0
      0
      1.0
      12000
      12000.0
      C
      270.53
      12.59
      12000
      1
      C2
      60 months
      0.563669
      0
    
    
      2414
      6069
      1.0
      3200
      3200.0
      C
      112.54
      16.02
      3200
      0
      C5
      36 months
      0.563635
      0
    
    
      1
      1
      1.0
      15000
      15000.0
      B
      489.31
      10.75
      15000
      1
      B4
      36 months
      0.358750
      1
    
    
      3931
      21076
      1.0
      15000
      15000.0
      B
      489.31
      10.75
      15000
      0
      B4
      36 months
      0.358750
      1
    
    
      2
      2
      1.0
      30000
      30000.0
      A
      919.34
      6.49
      30000
      1
      A2
      36 months
      0.240780
      2
    
    
      3612
      18095
      1.0
      20000
      20000.0
      A
      617.46
      6.99
      20000
      0
      A2
      36 months
      0.240748
      2

Assess Matches

We must now determine if our data is "balanced". Can we detect any statistical differences between the covariates of our matched test and control groups? Matcher is configured to treat categorical and continouous variables separately in this assessment.

Discrete

For categorical variables, we look at plots comparing the proportional differences between test and control before and after matching.

For example, the first plot shows:

prop_test - prop_control for all possible term values---prop_test and prop_control being the proportion of test and control records with a given term value, respectively. We want these (orange) bars to be small after matching.
Results (pvalue) of a Chi-Square Test for Independence before and after matching. After matching we want this pvalue to be > 0.05, resulting in our failure to reject the null hypothesis that the frequecy of the enumerated term values are independent of our test and control groups.



In [13]:

    
categorical_results = m.compare_categorical(return_table=True)



In [14]:

    
categorical_results

Looking at the plots and test results, we did a pretty good job balancing our categorical features! The p-values from the Chi-Square tests are all > 0.05 and we can verify by observing the small proportional differences in the plots.

Continuous

For continous variables we look at Empirical Cumulative Distribution Functions (ECDF) for our test and control groups before and after matching.

For example, the first plot pair shows:

ECDF for test vs ECDF for control before matching (left), ECDF for test vs ECDF for control after matching(right). We want the two lines to be very close to each other (or indistiguishable) after matching.
Some tests + metrics are included in the chart titles.
- Tests performed:
  - Kolmogorov-Smirnov Goodness of fit Test (KS-test) This test statistic is calculated on 1000 permuted samples of the data, generating an imperical p-value. See pymatch.functions.ks_boot() This is an adaptation of the ks.boot() method in the R "Matching" package https://www.rdocumentation.org/packages/Matching/versions/4.9-2/topics/ks.boot
  - Chi-Square Distance: Similarly this distance metric is calculated on 1000 permuted samples. See pymatch.functions.grouped_permutation_test()
- Other included Stats:
  - Standarized mean and median differences. How many standard deviations away are the mean/median between our groups before and after matching i.e. abs(mean(control) - mean(test)) / std(control.union(test))



In [15]:

    
cc = m.compare_continuous(return_table=True)



In [16]:

    
cc









    Out[16]:







  
    
      
      var
      ks_before
      ks_after
      grouped_chisqr_before
      grouped_chisqr_after
      std_median_diff_before
      std_median_diff_after
      std_mean_diff_before
      std_mean_diff_after
    
  
  
    
      0
      funded_amnt
      0.0
      0.521
      0.000
      0.003
      0.342448
      0.000000
      0.323858
      -0.000013
    
    
      1
      funded_amnt_inv
      0.0
      0.513
      0.037
      0.000
      0.342227
      0.000000
      0.325860
      -0.000114
    
    
      2
      installment
      0.0
      0.336
      0.000
      0.886
      0.262462
      -0.016480
      0.275534
      0.020788
    
    
      3
      int_rate
      0.0
      0.010
      0.000
      0.004
      0.563311
      0.001942
      0.643545
      0.055132
    
    
      4
      loan_amnt
      0.0
      0.495
      0.000
      0.001
      0.342277
      0.000000
      0.322846
      -0.000013

We want the pvalues from both the KS-test and the grouped permutation of the Chi-Square distance after matching to be > 0.05, and they all are! We can verify by looking at how close the ECDFs are between test and control.

Conclusion

We saw a very "clean" result from the above procedure, achieving balance among all the covariates. In my work at Mozilla, we see much hairier results using the same procedure, which will likely be your experience too. In the case that certain covariates are not well balanced, one might consider tinkering with the parameters of the matching process (nmatches>1) or adding more covariates to the formula specified when we initialized the Matcher object. In any case, in subsequent modelling, you can always control for variables that you haven't deemed "balanced".

	record_id	weight	funded_amnt	funded_amnt_inv	grade	installment	int_rate	loan_amnt	loan_status	sub_grade	term	scores	match_id
0	0	1.0	12000	12000.0	C	270.53	12.59	12000	1	C2	60 months	0.563669	0
2414	6069	1.0	3200	3200.0	C	112.54	16.02	3200	0	C5	36 months	0.563635	0
1	1	1.0	15000	15000.0	B	489.31	10.75	15000	1	B4	36 months	0.358750	1
3931	21076	1.0	15000	15000.0	B	489.31	10.75	15000	0	B4	36 months	0.358750	1
2	2	1.0	30000	30000.0	A	919.34	6.49	30000	1	A2	36 months	0.240780	2
3612	18095	1.0	20000	20000.0	A	617.46	6.99	20000	0	A2	36 months	0.240748	2

	var	ks_after	grouped_chisqr_before	grouped_chisqr_after	std_median_diff_before	std_median_diff_after	std_mean_diff_before	std_mean_diff_after
0	funded_amnt	0.521	0.000	0.003	0.342448	0.000000	0.323858	-0.000013
1	funded_amnt_inv	0.513	0.037	0.000	0.342227	0.000000	0.325860	-0.000114
2	installment	0.336	0.000	0.886	0.262462	-0.016480	0.275534	0.020788
3	int_rate	0.010	0.000	0.004	0.563311	0.001942	0.643545	0.055132
4	loan_amnt	0.495	0.000	0.001	0.342277	0.000000	0.322846	-0.000013