"Tabular prediction with AutoGluon"

"Train and evaluate an AutoGluon tabular classifier on the Adult Census Income dataset in a few lines of code"

  • toc: false
  • branch: master
  • badges: true
  • comments: true
  • categories: [fastpages, jupyter]
  • image: images/some_folder/your_image.png
  • hide: false
  • search_exclude: true

In [1]:
import autogluon as ag
from autogluon import TabularPrediction as task

In [2]:
train_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = train_data.head(500) # subsample 500 data points for faster demo
print(train_data.head())


Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073
   age   workclass  fnlwgt   education  education-num       marital-status  \
0   25     Private  178478   Bachelors             13        Never-married   
1   23   State-gov   61743     5th-6th              3        Never-married   
2   46     Private  376789     HS-grad              9        Never-married   
3   55           ?  200235     HS-grad              9   Married-civ-spouse   
4   36     Private  224541     7th-8th              4   Married-civ-spouse   

           occupation    relationship    race      sex  capital-gain  \
0        Tech-support       Own-child   White   Female             0   
1    Transport-moving   Not-in-family   White     Male             0   
2       Other-service   Not-in-family   White     Male             0   
3                   ?         Husband   White     Male             0   
4   Handlers-cleaners         Husband   White     Male             0   

   capital-loss  hours-per-week  native-country   class  
0             0              40   United-States   <=50K  
1             0              35   United-States   <=50K  
2             0              15   United-States   <=50K  
3             0              50   United-States    >50K  
4             0              40     El-Salvador   <=50K  

In [3]:
label_column = 'class'
print("Summary of class variable: \n", train_data[label_column].describe())


Summary of class variable: 
 count        500
unique         2
top        <=50K
freq         394
Name: class, dtype: object
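
The summary shows that 394 of the 500 training rows carry the majority label ` <=50K`. That gives a useful baseline: any trained model should beat the accuracy of always predicting the majority class. A quick check in plain Python, using the counts printed above:

```python
# Majority-class baseline, using the counts from the class summary above:
# 394 of 500 training rows are ' <=50K'.
n_rows = 500
majority_freq = 394
baseline_accuracy = majority_freq / n_rows
print(f"Always predicting ' <=50K' yields {baseline_accuracy:.1%} training accuracy")
```

Keep this 78.8% figure in mind when reading the validation and test accuracies below.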

In [4]:
dir = 'agModels-predictClass' # folder where trained models are stored

predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)


Beginning AutoGluon training ...
Preprocessing data ...
Here are the first 10 unique label values in your data:  [' <=50K' ' >50K']
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Data preprocessing and feature engineering runtime = 0.16s ...
AutoGluon will gauge predictive performance using evaluation metric: accuracy
To change this, specify the eval_metric argument of fit()
/usr/local/Cellar/python/3.7.4/Frameworks/Python.framework/Versions/3.7/lib/python3.7/imp.py:342: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return _load(spec)
Fitting model: RandomForestClassifierGini ...
	0.69s	 = Training runtime
	0.9	 = Validation accuracy score
Fitting model: RandomForestClassifierEntr ...
	0.7s	 = Training runtime
	0.9	 = Validation accuracy score
Fitting model: ExtraTreesClassifierGini ...
	0.51s	 = Training runtime
	0.87	 = Validation accuracy score
Fitting model: ExtraTreesClassifierEntr ...
	0.48s	 = Training runtime
	0.86	 = Validation accuracy score
Fitting model: KNeighborsClassifierUnif ...
	0.02s	 = Training runtime
	0.8	 = Validation accuracy score
Fitting model: KNeighborsClassifierDist ...
	0.01s	 = Training runtime
	0.77	 = Validation accuracy score
Fitting model: LightGBMClassifier ...
	0.74s	 = Training runtime
	0.88	 = Validation accuracy score
Fitting model: CatboostClassifier ...
	1.11s	 = Training runtime
	0.9	 = Validation accuracy score
Fitting model: NeuralNetClassifier ...
	7.53s	 = Training runtime
	0.87	 = Validation accuracy score
Fitting model: LightGBMClassifierCustom ...
	1.13s	 = Training runtime
	0.89	 = Validation accuracy score
Fitting model: weighted_ensemble_l1 ...
	0.59s	 = Training runtime
	0.9	 = Validation accuracy score
AutoGluon training complete, total runtime = 16.0s ...
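
To see at a glance which models did best, the validation scores from the log above can be collected into a dictionary (a small sketch; the names and scores are copied verbatim from the log, not queried from AutoGluon):

```python
# Validation accuracies copied from the AutoGluon training log above.
val_scores = {
    "RandomForestClassifierGini": 0.90,
    "RandomForestClassifierEntr": 0.90,
    "ExtraTreesClassifierGini": 0.87,
    "ExtraTreesClassifierEntr": 0.86,
    "KNeighborsClassifierUnif": 0.80,
    "KNeighborsClassifierDist": 0.77,
    "LightGBMClassifier": 0.88,
    "CatboostClassifier": 0.90,
    "NeuralNetClassifier": 0.87,
    "LightGBMClassifierCustom": 0.89,
    "weighted_ensemble_l1": 0.90,
}
best_score = max(val_scores.values())
best_models = [name for name, s in val_scores.items() if s == best_score]
print(f"Best validation accuracy {best_score}: {best_models}")
```

The weighted ensemble ties the strongest single models at 0.90 validation accuracy here, which is expected on only 500 training rows; ensembling typically pays off more on larger datasets.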

In [5]:
test_data = task.Dataset(file_path='https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label_column]  # values to predict
test_data_nolab = test_data.drop(labels=[label_column],axis=1) # delete label column to prove we're not cheating
print(test_data_nolab.head())


Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
   age          workclass  fnlwgt      education  education-num  \
0   31            Private  169085           11th              7   
1   17   Self-emp-not-inc  226203           12th              8   
2   47            Private   54260      Assoc-voc             11   
3   21            Private  176262   Some-college             10   
4   17            Private  241185           12th              8   

        marital-status        occupation relationship    race      sex  \
0   Married-civ-spouse             Sales         Wife   White   Female   
1        Never-married             Sales    Own-child   White     Male   
2   Married-civ-spouse   Exec-managerial      Husband   White     Male   
3        Never-married   Exec-managerial    Own-child   White   Female   
4        Never-married    Prof-specialty    Own-child   White     Male   

   capital-gain  capital-loss  hours-per-week  native-country  
0             0             0              20   United-States  
1             0             0              45   United-States  
2             0          1887              60   United-States  
3             0             0              30   United-States  
4             0             0              20   United-States  

In [6]:
predictor = task.load(dir) # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  ", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


Evaluation: accuracy on test data: 0.824854
Evaluations on test data:
{
    "accuracy": 0.8248541304125294,
    "accuracy_score": 0.8248541304125294,
    "balanced_accuracy_score": 0.7104318244165013,
    "matthews_corrcoef": 0.47480025693977573,
    "f1_score": 0.8248541304125294
}
Predictions:   [' <=50K' ' <=50K' ' <=50K' ... ' <=50K' ' <=50K' ' <=50K']
Detailed (per-class) classification report:
{
    " <=50K": {
        "precision": 0.8546712802768166,
        "recall": 0.928197557374849,
        "f1-score": 0.8899182911921766,
        "support": 7451
    },
    " >50K": {
        "precision": 0.6809779367918902,
        "recall": 0.4926660914581536,
        "f1-score": 0.5717146433041302,
        "support": 2318
    },
    "accuracy": 0.8248541304125294,
    "macro avg": {
        "precision": 0.7678246085343534,
        "recall": 0.7104318244165013,
        "f1-score": 0.7308164672481534,
        "support": 9769
    },
    "weighted avg": {
        "precision": 0.8134571160636874,
        "recall": 0.8248541304125294,
        "f1-score": 0.8144145491710391,
        "support": 9769
    }
}
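
The aggregate metrics reported earlier can be re-derived from the per-class numbers in this report, which is a handy consistency check: balanced accuracy is the unweighted mean of the per-class recalls, and the weighted-average F1 is the support-weighted mean of the per-class F1 scores. A sketch, with the values copied from the report above:

```python
# Per-class values copied from the classification report above.
report = {
    " <=50K": {"recall": 0.928197557374849,  "f1": 0.8899182911921766, "support": 7451},
    " >50K":  {"recall": 0.4926660914581536, "f1": 0.5717146433041302, "support": 2318},
}
total = sum(c["support"] for c in report.values())  # 9769 test rows

# Balanced accuracy = unweighted mean of per-class recalls.
balanced_acc = sum(c["recall"] for c in report.values()) / len(report)

# Weighted-average F1 = support-weighted mean of per-class F1 scores.
weighted_f1 = sum(c["f1"] * c["support"] for c in report.values()) / total

print(round(balanced_acc, 6), round(weighted_f1, 6))
```

Both values match the `balanced_accuracy_score` (0.7104...) and weighted-avg `f1-score` (0.8144...) printed above. The gap between the 0.82 plain accuracy and the 0.71 balanced accuracy reflects the class imbalance: the model recalls only about 49% of the ` >50K` minority class.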

In [ ]: