End-to-End Data Science in Python

Introduction

This is the workbook for the "End-to-End Data Analysis in Python" workshop at the Open Data Science Conference 2015, in beautiful San Francisco. This notebook contains starter code only; the goal is that we will fill in the gaps together as we progress through the workshop. If, however, you're doing this asynchronously or you get stuck, you can reference the solutions workbook.

The objective is to complete the "Pump it Up: Mining the Water Table" challenge on drivendata.org; the objective here is to predict African wells that are non-functional or in need of repair. Per the rules of the competition, you should register for an account with drivendata.org, at which point you can download the training set values and labels. We will be working with those datasets during this workshop. You should download those files to the directory in which this notebook lives, and name them wells_features.csv and wells_labels.csv (to be consistent with our nomenclature). You are also encouraged to continue developing your solution after this workshop, and/or to enter your solution in the competition on the drivendata website!

Code requirements

Here's the environment you'll need to work with this code base:

python 3 (2.x may work with minor changes, but no guarantees)
pandas
scikit-learn
numpy

First Draft of an Analysis



In [47]:

    
import pandas as pd
import numpy as np
  
features_df = pd.DataFrame.from_csv("well_data.csv")
labels_df   = pd.DataFrame.from_csv("well_labels.csv")  
print( labels_df.head(20) )









    



                  status_group
id                            
69572               functional
8776                functional
34310               functional
67743           non functional
19728               functional
9944                functional
19816           non functional
54551           non functional
53934           non functional
46144               functional
49056               functional
50409               functional
36957               functional
50495               functional
53752               functional
61848               functional
48451           non functional
58155           non functional
34169  functional needs repair
18274               functional

One nice feature of ipython notebooks is it's easy to make small changes to code and then re-execute quickly, to see how things change. For example, printing the first 5 lines of the labels dataframe (which is the default) isn't really ideal here, since there's a label ("functional needs repair") which doesn't appear in the first five lines. Type 20 in the parentheses labels_df.head(), so it now reads labels_df.head(20), and press shift-enter to rerun the code. See the difference?

Now take a quick look at the features, again by calling .head() (set up for you in the code box below, or add your own code to the code box above). You can print or as few rows as you like. Take a quick look at the data--approximately how many features are there? Are they all numeric, or will you have to do work to transform non-numeric features into numbers?



In [3]:

    
print( features_df.head() )









    



       amount_tsh date_recorded        funder  gps_height     installer  \
id                                                                        
69572        6000       3/14/11         Roman        1390         Roman   
8776            0        3/6/13       Grumeti        1399       GRUMETI   
34310          25       2/25/13  Lottery Club         686  World vision   
67743           0       1/28/13        Unicef         263        UNICEF   
19728           0       7/13/11   Action In A           0       Artisan   

       longitude   latitude              wpt_name  num_private  \
id                                                               
69572  34.938093  -9.856322                  none            0   
8776   34.698766  -2.147466              Zahanati            0   
34310  37.460664  -3.821329           Kwa Mahundi            0   
67743  38.486161 -11.155298  Zahanati Ya Nanyumbu            0   
19728  31.130847  -1.825359               Shuleni            0   

                         basin          ...          payment_type  \
id                                      ...                         
69572               Lake Nyasa          ...              annually   
8776             Lake Victoria          ...             never pay   
34310                  Pangani          ...            per bucket   
67743  Ruvuma / Southern Coast          ...             never pay   
19728            Lake Victoria          ...             never pay   

      water_quality  quality_group      quantity quantity_group  \
id                                                                
69572          soft           good        enough         enough   
8776           soft           good  insufficient   insufficient   
34310          soft           good        enough         enough   
67743          soft           good           dry            dry   
19728          soft           good      seasonal       seasonal   

                     source           source_type source_class  \
id                                                               
69572                spring                spring  groundwater   
8776   rainwater harvesting  rainwater harvesting      surface   
34310                   dam                   dam      surface   
67743           machine dbh              borehole  groundwater   
19728  rainwater harvesting  rainwater harvesting      surface   

                   waterpoint_type waterpoint_type_group  
id                                                        
69572           communal standpipe    communal standpipe  
8776            communal standpipe    communal standpipe  
34310  communal standpipe multiple    communal standpipe  
67743  communal standpipe multiple    communal standpipe  
19728           communal standpipe    communal standpipe  

[5 rows x 39 columns]

Transforming string labels into integers

The machine learning algorithms downstream are not going to handle it well if the class labels used for training are strings; instead, we'll want to use integers. The mapping that we'll use is that "non functional" will be transformed to 0, "functional needs repair" will be 1, and "functional" becomes 2.

There are a number of ways to do this; the framework below uses applymap() in pandas. Here's the documentation for applymap(); in the code below, you should fill in the function body for label_map(y) so that if y is "functional", label_map returns 2; if y is "functional needs repair" then it should return 1, and "non functional" is 0. There's a print statement there to help you confirm that the label transformation is working properly.

As an aside, you could also use apply() here if you like. The difference between apply() and applymap() is that applymap() operates on a whole dataframe while apply() operates on a series (or you can think of it as operating on one column of your dataframe). Since labels_df only has one column (aside from the index column), either one will work here.



In [48]:

    
def label_map(y):
    if y=="functional":
        return 2
    elif y=="functional needs repair":
        return 1
    else:
        return 0
labels_df = labels_df.applymap(label_map)
print( labels_df.head() )









    



       status_group
id                 
69572             2
8776              2
34310             2
67743             0
19728             2

Transforming string features into integers

Now that the labels are ready, we'll turn our attention to the features. Many of the features are categorical, where a feature can take on one of a few discrete values, which are not ordered. Fill in the function body of transform_feature( df, column ) below so that it takes our features_df and the name of a column in that dataframe, and returns the same dataframe but with the indicated feature encoded with integers rather than strings.

We've provided code to wrap your transformer function in a loop iterating through all the columns that should be transformed.

Last, add a line of code at the bottom of the block below that removes the date_recorded column from features_df. Time-series information like dates and times need special treatment, which we won't be going into today.



In [49]:

    
def transform_feature( df, column_name ):
    unique_values = set( df[column_name].tolist() )
    transformer_dict = {}
    for ii, value in enumerate(unique_values):
        transformer_dict[value] = ii

    def label_map(y):
        return transformer_dict[y]
    df[column_name] = df[column_name].apply( label_map )
    return df

### list of column names indicating which columns to transform; 
### this is just a start!  Use some of the print( labels_df.head() )
### output upstream to help you decide which columns get the
### transformation
names_of_columns_to_transform = ["funder", "installer", "wpt_name", "basin", "subvillage",
                    "region", "lga", "ward", "public_meeting", "recorded_by",
                    "scheme_management", "scheme_name", "permit",
                    "extraction_type", "extraction_type_group",
                    "extraction_type_class",
                    "management", "management_group",
                    "payment", "payment_type",
                    "water_quality", "quality_group", "quantity", "quantity_group",
                    "source", "source_type", "source_class",
                    "waterpoint_type", "waterpoint_type_group"]
for column in names_of_columns_to_transform:
    features_df = transform_feature( features_df, column )
    
print( features_df.head() )
    
### remove the "date_recorded" column--we're not going to make use
### of time-series data today
features_df.drop("date_recorded", axis=1, inplace=True)

print(features_df.columns.values)









    



       amount_tsh date_recorded  funder  gps_height  installer  longitude  \
id                                                                          
69572        6000       3/14/11     614        1390        685  34.938093   
8776            0        3/6/13    1206        1399       1252  34.698766   
34310          25       2/25/13     878         686        555  37.460664   
67743           0       1/28/13     920         263       2059  38.486161   
19728           0       7/13/11     906           0        164  31.130847   

        latitude  wpt_name  num_private  basin          ...            \
id                                                      ...             
69572  -9.856322     15454            0      5          ...             
8776   -2.147466     19453            0      8          ...             
34310  -3.821329     10040            0      0          ...             
67743 -11.155298     19651            0      7          ...             
19728  -1.825359      7904            0      8          ...             

       payment_type  water_quality  quality_group  quantity  quantity_group  \
id                                                                            
69572             0              2              3         4               4   
8776              2              2              3         2               2   
34310             6              2              3         4               4   
67743             2              2              3         3               3   
19728             2              2              3         1               1   

       source  source_type  source_class  waterpoint_type  \
id                                                          
69572       3            0             2                3   
8776        8            4             0                3   
34310       7            3             0                2   
67743       0            2             2                2   
19728       8            4             0                3   

       waterpoint_type_group  
id                            
69572                      2  
8776                       2  
34310                      2  
67743                      2  
19728                      2  

[5 rows x 39 columns]
['amount_tsh' 'funder' 'gps_height' 'installer' 'longitude' 'latitude'
 'wpt_name' 'num_private' 'basin' 'subvillage' 'region' 'region_code'
 'district_code' 'lga' 'ward' 'population' 'public_meeting' 'recorded_by'
 'scheme_management' 'scheme_name' 'permit' 'construction_year'
 'extraction_type' 'extraction_type_group' 'extraction_type_class'
 'management' 'management_group' 'payment' 'payment_type' 'water_quality'
 'quality_group' 'quantity' 'quantity_group' 'source' 'source_type'
 'source_class' 'waterpoint_type' 'waterpoint_type_group']

Ok, a couple last steps to get everything ready for sklearn. The features and labels are taken out of their dataframes and put into a numpy.ndarray and list, respectively.



In [9]:

    
X = features_df.as_matrix()
y = labels_df["status_group"].tolist()

Predicting well failures with logistic regression

The cheapest and easiest way to train on one portion of your dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score(). This splits your data into 3 equal portions, trains on two of them, and tests on the third. This process repeats 3 times. That's why 3 numbers get printed in the code block below.

You don't have to add anything to the code block, it's ready to go already. However, use it for reference in the next part of the tutorial, where you will be looking at other sklearn algorithms.

Heads up: it can be a little slow. This took a minute or two to evaluate on my MacBook Pro.



In [33]:

    
import sklearn.linear_model
import sklearn.cross_validation

clf = sklearn.linear_model.LogisticRegression()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )









    



[ 0.65363636  0.6569697   0.6560101 ]

Comparing logistic regression to tree-based methods

We have a baseline logistic regression model for well failures. Let's compare to a couple of other classifiers, a decision tree classifier and a random forest classifier, to see which one seems to do the best.

Code this up on your own. You can use the code in the box above as a kind of template, and just drop in the new classifiers. The sklearn documentation might also be helpful:

We will talk about all three of these models more in the next part of the tutorial.



In [38]:

    
import sklearn.tree
import sklearn.ensemble

clf = sklearn.tree.DecisionTreeClassifier()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )

clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print( score )









    



[ 0.73590909  0.73691919  0.73005051]
[ 0.78777778  0.7889899   0.78409091]
['amount_tsh' 'funder' 'gps_height' 'installer' 'longitude' 'latitude'
 'wpt_name' 'num_private' 'basin' 'subvillage' 'region' 'region_code'
 'district_code' 'lga' 'ward' 'population' 'public_meeting' 'recorded_by'
 'scheme_management' 'scheme_name' 'permit' 'construction_year'
 'extraction_type' 'extraction_type_group' 'extraction_type_class'
 'management' 'management_group' 'payment' 'payment_type' 'water_quality'
 'quality_group' 'quantity' 'quantity_group' 'source' 'source_type'
 'source_class' 'waterpoint_type' 'waterpoint_type_group']

Congratulations! You have a working data science setup, in which you have:

read in data
transformed features and labels to make the data amenable to machine learning
made a train/test split (this was done implicitly when you called cross_val_score)
evaluated several models for identifying wells that are failed or in danger of failing

Paying down technical debt and tuning the models

We got things running really fast, which is great, but at the cost of being a little quick-and-dirty about some details. First, we got the features encoded as integers, but they really should be dummy variables. Second, it's worth going through the models a little more thoughtfully, to try to understand their performance and if there's any more juice we can get out of them.

One-hot encoding to make dummy variables

A problem with representing categorical variables as integers is that integers are ordered, while categories are not. The standard way to deal with this is to use dummy variables; one-hot encoding is a very common way of dummying. Each possible category becomes a new boolean feature. For example, if our dataframe looked like this:

index country 1 "United States" 2 "Mexico" 3 "Mexico" 4 "Canada" 5 "United States" 6 "Canada"

then after dummying it will look something like this:

index country_UnitedStates country_Mexico country_Canada 1 1 0 0 2 0 1 0 3 0 1 0 4 0 0 1 5 1 0 0 6 0 0 1

Hopefully the origin of the name is clear--each variable is now encoded over several boolean columns, one of which is true (hot) and the others are false.

Now we'll write a hot-encoder function that takes the data frame and the title of a column, and returns the same data frame but one-hot encoding performed on the indicated feature.

Protip: sklearn has a one-hot encoder function available that will be your friend here.



In [50]:

    
import sklearn.preprocessing

def hot_encoder(df, column_name):
    column = df[column_name].tolist()
    column = np.reshape( column, (len(column), 1) )  ### needs to be an N x 1 numpy array
    enc = sklearn.preprocessing.OneHotEncoder()
    enc.fit( column )
    new_column = enc.transform( column ).toarray()
    column_titles = []
    ### making titles for the new columns, and appending them to dataframe
    for ii in range( len(new_column[0]) ):
        this_column_name = column_name+"_"+str(ii)
        df[this_column_name] = new_column[:,ii]
    return df

Now we'll take the to_transform list that you populated above with categorical variables, and use that to loop through columns that will be one-hot encoded.

One note before you code that up: one-hot encoding comes with the baggage that it makes your dataset bigger--sometimes a lot bigger. In the countries example above, one column that encoded the country has now been expanded out to three columns. You can imagine that this can sometimes get really, really big (imagine a column encoding all the counties in the United States, for example).

There are some columns in this example that will really blow up the dataset, so we'll remove them before proceeding with the one-hot encoding.



In [51]:

    
print(features_df.columns.values)


features_df.drop( "funder", axis=1, inplace=True )
features_df.drop( "installer", axis=1, inplace=True )
features_df.drop( "wpt_name", axis=1, inplace=True )
features_df.drop( "subvillage", axis=1, inplace=True )
features_df.drop( "ward", axis=1, inplace=True )

names_of_columns_to_transform.remove("funder")
names_of_columns_to_transform.remove("installer")
names_of_columns_to_transform.remove("wpt_name")
names_of_columns_to_transform.remove("subvillage")
names_of_columns_to_transform.remove("ward")

for feature in names_of_columns_to_transform:
    features_df = hot_encoder( features_df, feature )
    
print( features_df.head() )









    



['amount_tsh' 'funder' 'gps_height' 'installer' 'longitude' 'latitude'
 'wpt_name' 'num_private' 'basin' 'subvillage' 'region' 'region_code'
 'district_code' 'lga' 'ward' 'population' 'public_meeting' 'recorded_by'
 'scheme_management' 'scheme_name' 'permit' 'construction_year'
 'extraction_type' 'extraction_type_group' 'extraction_type_class'
 'management' 'management_group' 'payment' 'payment_type' 'water_quality'
 'quality_group' 'quantity' 'quantity_group' 'source' 'source_type'
 'source_class' 'waterpoint_type' 'waterpoint_type_group']
       amount_tsh  gps_height  longitude   latitude  num_private  basin  \
id                                                                        
69572        6000        1390  34.938093  -9.856322            0      5   
8776            0        1399  34.698766  -2.147466            0      8   
34310          25         686  37.460664  -3.821329            0      0   
67743           0         263  38.486161 -11.155298            0      7   
19728           0           0  31.130847  -1.825359            0      8   

       region  region_code  district_code  lga           ...             \
id                                                       ...              
69572      13           11              5   75           ...              
8776        1           20              2   76           ...              
34310      17           21              4   41           ...              
67743      19           90             63   82           ...              
19728       2           18              1   48           ...              

       waterpoint_type_3  waterpoint_type_4  waterpoint_type_5  \
id                                                               
69572                  1                  0                  0   
8776                   1                  0                  0   
34310                  0                  0                  0   
67743                  0                  0                  0   
19728                  1                  0                  0   

       waterpoint_type_6  waterpoint_type_group_0  waterpoint_type_group_1  \
id                                                                           
69572                  0                        0                        0   
8776                   0                        0                        0   
34310                  0                        0                        0   
67743                  0                        0                        0   
19728                  0                        0                        0   

       waterpoint_type_group_2  waterpoint_type_group_3  \
id                                                        
69572                        1                        0   
8776                         1                        0   
34310                        1                        0   
67743                        1                        0   
19728                        1                        0   

       waterpoint_type_group_4  waterpoint_type_group_5  
id                                                       
69572                        0                        0  
8776                         0                        0  
34310                        0                        0  
67743                        0                        0  
19728                        0                        0  

[5 rows x 3031 columns]

Now that the features are a little fixed up, I'd invite you to rerun the models, and see if the cross_val_score goes up as a result. It is also a great chance to take some of the theory discussion from the workshop and play around with the parameters of your models, and see if you can increase their scores that way. There's a blank code box below where you can play around.



In [53]:

    
X = features_df.as_matrix()
y = labels_df["status_group"].tolist()

clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.cross_validation.cross_val_score( clf, X, y )
print(score)









    



[ 0.78060606  0.78308081  0.77585859]

End-to-end workflows using Pipeline and GridSearchCV

So far we have made a nice workflow using a few ideas assembled in a script-like workflow. A few spots remain where we can tighten things up though:

the best model, the random forest, has a lot of parameters that we'd have to work through if we really wanted to tune it
after dummying, we have lots of features, probably only a subset of which are really offering any discriminatory power (this is a version of the bias-variance tradeoff)
maybe there's a way to make the code more streamlined (hint: there is)

We will solve all these with two related and lovely tools in sklearn: Pipeline and GridSearchCV.

Pipeline in sklearn is a tool for chaining together multiple pieces of a workflow into a single coherent analysis. In our example, we will chain together a tool for feature selection, to will address the second point, which then feeds our optimized feature set into the random forest model, all in a few lines of code (which addresses the third point).

To get to the first point, about finding the best parameters--that's where the magic of GridSearchCV comes in. But first we need to get the feature selector and pipeline up and running, so let's do that now.

In sklearn.feature_selection there is a useful tool, SelectKBest (link) that you should use. By default, this will select the 10 best features; that seems like it might be too few features to do well on this problem, so change the number of features to 100.



In [54]:

    
import sklearn.feature_selection

select = sklearn.feature_selection.SelectKBest(k=100)
selected_X = select.fit_transform(X, y)

print( selected_X.shape )









    



(59400, 100)






    



/Users/civisemployee/.py34/lib/python3.4/site-packages/sklearn/feature_selection/univariate_selection.py:111: UserWarning: Features [ 12 191] are constant.
  UserWarning)

Pipeline

After selecting the 100 best features, the natural next step would be to run our random forest again to see if it does a little better with fewer features. So we would have SelectKBest doing selection, with the output of that process going straight into a classifier. A Pipeline packages the transformation step of SelectKBest with the estimation step of RandomForestClassifier into a coherent workflow.

Why might you want to use Pipeline instead of keeping the steps separate?

makes code more readable
don't have to worry about keeping track data during intermediate steps, for example between transforming and estimating
makes it trivial to move ordering of the pipeline pieces, or to swap pieces in and out
Allows you to do GridSearchCV on your workflow

This last point is, in my opinion, the most important. We will get to it very soon, but first let's get a pipeline up and running that does SelectKBest followed by RandomForestClassifier.

In the code box below, I've also set up a slightly better training/testing structure, where I am explicitly splitting the data into training and testing sets which we'll use below. The training/testing split before was handled automatically in cross_val_score, but we'll be using a different evaluation metric from here forward, the classification report, which requires us to handle the train/test split ourselves.

Note: when you do SelectKBest, you might see a warning about a bunch of features that are constant. This isn't a problem. It's giving you a heads up that the indicated features don't show any variation, which could be a signal that something is wrong or that SelectKBest might be doing something unexpected. I



In [56]:

    
import sklearn.pipeline

select = sklearn.feature_selection.SelectKBest(k=100)
clf = sklearn.ensemble.RandomForestClassifier()

steps = [('feature_selection', select),
         ('random_forest', clf)]

pipeline = sklearn.pipeline.Pipeline(steps)

X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y, test_size=0.33, random_state=42)

### fit your pipeline on X_train and y_train
pipeline.fit( X_train, y_train )
### call pipeline.predict() on your X_test data to make a set of test predictions
y_prediction = pipeline.predict( X_test )
### test your predictions using sklearn.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )
### and print the report
print(report)









    



             precision    recall  f1-score   support

          0       0.80      0.56      0.66      5007
          1       0.12      0.00      0.01       942
          2       0.69      0.92      0.79      7119

avg / total       0.69      0.72      0.68     13068







    



/Users/civisemployee/.py34/lib/python3.4/site-packages/sklearn/feature_selection/univariate_selection.py:111: UserWarning: Features [  12  191  214  219  262  267  274  277  284  291  312  322  334  336  346
  372  376  384  397  411  437  460  474  475  479  483  511  525  536  559
  562  586  606  610  650  652  658  659  704  737  738  741  762  774  785
  796  807  810  830  851  878  884  891  917  931  960  970  989 1003 1045
 1057 1058 1060 1061 1083 1097 1130 1153 1154 1177 1178 1226 1228 1231 1241
 1256 1261 1275 1281 1295 1302 1327 1330 1331 1392 1401 1403 1409 1422 1461
 1498 1502 1581 1587 1602 1604 1627 1630 1645 1650 1655 1664 1680 1684 1689
 1693 1702 1704 1714 1756 1784 1798 1802 1827 1833 1854 1909 1910 1939 1940
 1968 2017 2087 2105 2120 2133 2138 2170 2271 2283 2318 2351 2360 2376 2380
 2390 2397 2401 2406 2424 2429 2433 2464 2470 2477 2482 2483 2509 2523 2527
 2528 2531 2535 2561 2588 2596 2601 2628 2630 2637 2669 2674 2708 2726 2737
 2759 2771 2804 2815 2832 2852 2862 2889 2894 2900] are constant.
  UserWarning)

Reading the classification report

A brief aside--we've switched from cross_val_score to classification_report for evaluation, mostly to show you two different ways to evaluating a model. The classification report has the advantage of giving you a lot more information, and if (for example) one class is more important to get right than the others (say you're trying to zero in on non-functional wells, so finding those correctly is more important than getting the functional wells right).

For more information, the sklearn docs on classification_report are, like all the sklearn docs, incredibly helpful. For interpreting the various metrics, this page may also help.

GridSearchCV

We're in the home stretch now. When we decided to select the 100 best features, setting that number to 100 was kind of a hand-wavey decision. Similarly, the RandomForestClassifier that we're using right now has all its parameters set to their default values, which might not be optimal.

So, a straightforward thing to do now is to try different values of k and any RandomForestClassifier parameters we want to tune (for the sake of concreteness, let's play with n_estimators and min_samples_split). Trying lots of values for each of these free parameters is tedious, and there can sometimes be interactions between the choices you make in one step and the optimal value for a downstream step. In other words, to avoid local optima, you should try all the combinations of parameters, and not just vary them independently. So if you want to try 5 different values each for k, n_estimators and min_samples_split, that means 5 x 5 x 5 = 125 different combinations to try. Not something you want to do by hand.

GridSearchCV allows you to construct a grid of all the combinations of parameters, tries each combination, and then reports back the best combination/model.



In [66]:

    
import sklearn.grid_search
#import warnings
#warnings.filterwarnings("ignore")


parameters = dict(feature_selection__k=[100, 200], 
              random_forest__n_estimators=[50],
              random_forest__min_samples_split=[4])

cv = sklearn.grid_search.GridSearchCV(pipeline, param_grid=parameters)
print(pipeline.named_steps)
cv.fit(X_train, y_train)
y_predictions = cv.predict(X_test)
report = sklearn.metrics.classification_report( y_test, y_predictions )
### and print the report
print(report)









    



{'random_forest': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False), 'feature_selection': SelectKBest(k=20, score_func=<function f_classif at 0x1119a4598>)}
[0 2 2 2 2 0 2 0 0 2 2 2 0 2 2 2 2 2 0 0]



In [ ]: