13 - Model Deployment

by Alejandro Correa Bahnsen

version 0.1, May 2016

Part of the class Machine Learning for Security Informatics

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Agenda:

  1. Creating and saving a model
  2. Running the model in batch
  3. Exposing the model as an API

Part 1: Phishing Detection

Phishing is the act of defrauding an online user to obtain personal information by posing as a trustworthy institution or entity. Users often struggle to tell legitimate sites from malicious ones because phishing sites are built to look just like the originals. Better tools are therefore needed to combat attackers.


In [1]:
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)

In [2]:
data.head()


Out[2]:
url phishing
0 http://www.subalipack.com/contact/images/sampl... 1
1 http://fasc.maximecapellot-gypsyjazz-ensemble.... 1
2 http://theotheragency.com/confirmer/confirmer-... 1
3 http://aaalandscaping.com/components/com_smart... 1
4 http://paypal.com.confirm-key-21107316126168.s... 1

In [4]:
data.phishing.value_counts()


Out[4]:
1    20000
0    20000
Name: phishing, dtype: int64

Creating features


In [92]:
data.url[data.phishing==1].sample(50, random_state=1).tolist()


Out[92]:
['http://dothan.com.co/gold/austspark/index.htm\n',
 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\n',
 'http://verify95.5gbfree.com/coverme2010/\n',
 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\n',
 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\n',
 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\n',
 'http://senevi.com/confirmation/\n',
 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\n',
 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\n',
 'http://alen.co/docs/cleaner\n',
 'http://rattanhouse.co/Atualizacao_Bradesco/cadastro2013.php?2MAS2XACUJPI3U8D9ZDDG2G9YJICVABQ3K73KWDKYK0NA0AWWWCOUEDUJRXHRKPNMUYLDV89RA6OCG2MQUS0TAUXX9IOGJUEIXPDS5B0RM18OF1H860UAMJOY6ICUR81VSEKKJFPBYNLYGUXBGJ1HEHKOMLTM01P658M\n',
 'http://steamcommunily.co/p.php?login=true\n',
 'http://www.nyyg.com/Bradesco/5W9SQ394.html\n',
 'http://wp.tipografiacentral.com.co/sparkde/index.html\n',
 'http://www.entrerev.com/component/.secure.wpa/.www.paypal.com.returnUrl=/cgi-bin/5RF3S6y0K349/PayPal.co.uk/dispute_centre/sotmks/npsw&st.payment.decline.centre/ipoi/secure-codes.paypal.account4738154login.complete-infrmations.login.accountSecure26/securities/\n',
 'http://x.co/SecurCent\n',
 'http://dejatequerer.co/united.com/index.html\n',
 'http://www.speakeasymovies.com/components/com_wrapper/.amazon.co.uk/\n',
 'http://www.culturaespanola.com.br/bt/www.paypal.com/paypal.com.com/index-new.php\n',
 'http://www.agroassistance.com/components/com_content/c05354aa285b6a932a57086ba13762a1/\n',
 'http://www.estranetsrl.com.ar/bbvacambios.html\n',
 'http://osfsw.cba.pl/content/classic/html/ibpf/bradesco/?UOREEIYGQTERIRVSJTUHMVMZJWWYSVNYQOFSPWVFTEJEEKMJWHFERRYTFRWPSYYWGFIGJUPLZMZLTNSKOGMQQSHSXPLMXILVSM\n',
 'http://bitcrush.co/~geetha5/natwest/natwest/ibcarregister-natwst.html\n',
 'http://cannot-hide-from-PhishTank.zenith-services.com/controllare/auth/\n',
 'http://nova.pymesonline.co/fr.php\n',
 'http://comococino.com/wp-content/uploads/2013/01/paypal.com/us/cgi-bin/webscr.htm?\n',
 'http://www.fundacionchwinqlal.com.gt/imgs/Notas/img/_New/Agencias_Bradesco/Public_201133.php?KSR6YOU359CY1USIRMSBI8CFJF7TVREFJ6KIUFKZNXXNRP7JBYVU79APNGJI8YYR5I0YXUXLRU0JKF4WEYQL81BUGVDOTBFXUPVSKSEBNNU84X4IWT54UFYABCY5OE3J5XBOQQ1EDVMHTPZPJ4TEJSOU5NZS32B8ZNWQ\n',
 'http://flightripe.com/confirmation/update/billing/9a523c6017caa3406af9d5c2c0cb1854/\n',
 'http://accademiazerootto.it/templates/zerootto-new/html/com_content/category/bompreco.php\n',
 'http://santanderseguranca.zapto.org/Clientesx/\n',
 'http://www.muttico.com/components/com_media/p3rs0na4l/53f8b14c76c890e1806b8f9d97f12f80/\n',
 'http://us.fxlhtvf.ml/login/en/login.html.asp?refhttp:%2F%2Futddirect.com%2Fcomponents%2Fcom_content%2Fviews%2Fcategories%2Fmenu.html\n',
 'http://conferencistainternacional.com.co/urruirrhyttjk/Index.htm\n',
 'http://www.creativesovereign.com/components/com_newsfeeds/views/.../perfil/\n',
 'http://villamarina.com.co/administrator/servers/BankofAmerica/security-update/SecMeasure/account-overview.cgi/presentation/jskeys/sas/signonScreen.do/\n',
 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php\n',
 'http://www.enoxia.fr/components/com_content/tamfidelidade01.php\n',
 'http://gobbva.com/bb/empresa/index.php?tarjeta=\n',
 'http://paypal-com-confim.sharmikelectric.com/s4575234bf5055889415\n',
 'http://paypal.com.au.au.webapps.mpp.homes.konyadosemeciler.com/confirm/login.australia/au/webapps/mpp/home/initthi.php?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=%5C%5C%5C%5C\n',
 'http://www.bbvabancocontinental.ya.st\n',
 'http://www.giannielectric.com/company/components/com_poll/assets/a/a5643cded2383f7568719482a943e1a5\n',
 'http://cooperativasanjose.com.co/plugins/josetta_ext/k2category/section/first.php\n',
 'http://appleid-apple-com-confirm-oyns-uattw6w61x3oka3pq.scientificcollectables.com/3c43e3d92e0b8a48f09f5fbb25d008a9/index1.php?cmd=https://connect.paypal.com/WebObjects/iTunesConnect.woa?login-processing=t&login_access=13409884065d3a174c294a9bf21bf71c23a3\n',
 'http://consultoriojuridico.co/pp/www.paypal.com/\n',
 'http://lovetodo.in.th/administrator/components/com_content/models/key/\n',
 'http://lnk.co/io6u45y45?erydh?mario.Carelli@poste.it\n',
 'http://www2.bancobbvacontnental.com/Centroll/informe/03/14/datitarlz/WUJFQ0VSUkFATVVOSVpMQVcuQ09N\n',
 'http://lfcintl.com/components/com_user/zzxc/bpd.com.do/app/do/personas/289302294350311363178310441412402464323394411438376403437407/banco.popular.php?Personal\n',
 'http://procuraduria.videoteca.com.co/update/apple.com/.cgi-bin/WebObjects/MyAppleIdwoa/wa/sign_in.html?appId=4129.returnURL=DaHR0cDovL3N0b3JlLmFwcGxlLmNvbS91c3wxYW9zZmU4OGZjNWIyNThhYWVhOTM5MzVjZjI2NTk1OGE3MWUwY2Y0MmI2OA%26r%3DSDHCD9JUYKX777H9KT\n']

Keyword features: does the URL contain any of the following?

  • https
  • login
  • .php
  • .html
  • @
  • sign
  • ?

In [26]:
keywords = ['https', 'login', '.php', '.html', '@', 'sign']

In [31]:
for keyword in keywords:
    data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)
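A detail worth knowing about the loop above: `str.contains` treats its pattern as a regular expression by default, so the `.` in `.php` and `.html` matches any character. The sketch below uses two made-up URLs to contrast the default with a literal match via `regex=False`. It is only an illustration; the features in this notebook were built with the default behaviour. A bare `?` is not a valid regular expression, so adding it to `keywords` would require `regex=False`.

```python
import pandas as pd

# Two made-up URLs: the second contains "php" but not the literal ".php"
urls = pd.Series([
    'http://example.com/index.php',
    'http://example.com/graphphp/',
])

# Default: the pattern is a regex, so "." matches any character (e.g. "aphp")
as_regex = urls.str.contains('.php').tolist()

# Literal substring match: only the first URL contains ".php"
as_literal = urls.str.contains('.php', regex=False).tolist()

print(as_regex)    # [True, True]
print(as_literal)  # [True, False]
```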

Other features:

  • Length of the URL
  • Length of the domain
  • Is the domain an IP address?
  • Number of occurrences of "com"

In [35]:
data['lenght'] = data.url.str.len() - 2

In [38]:
domain = data.url.str.split('/', expand=True).iloc[:, 2]

In [41]:
data['lenght_domain'] = domain.str.len()

In [44]:
domain.head(12)


Out[44]:
0                                    www.subalipack.com
1             fasc.maximecapellot-gypsyjazz-ensemble.nl
2                                    theotheragency.com
3                                    aaalandscaping.com
4     paypal.com.confirm-key-21107316126168.securepp...
5                              lcthomasdeiriarte.edu.co
6                                       livetoshare.org
7                                            www.i-m.co
8                                     manuelfernando.co
9                                www.bladesmithnews.com
10                                      www.rasbaek.com
11                                      199.231.190.160
Name: 2, dtype: object

In [67]:
# A domain counts as an IP address if only digits remain once the dots are removed
data['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
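The check above accepts any domain made up only of digits and dots, so a string such as `999.999` would also count as an IP address. If stricter validation were wanted, one option (not what this notebook does) is Python's standard `ipaddress` module. The helper name `is_ip` is hypothetical.

```python
import ipaddress


def is_ip(domain):
    """Return 1 if `domain` parses as an IPv4/IPv6 address, else 0."""
    try:
        ipaddress.ip_address(domain)
        return 1
    except ValueError:
        # Raised for host names, malformed addresses, and host:port strings
        return 0


print(is_ip('199.231.190.160'))  # 1
print(is_ip('www.i-m.co'))       # 0
print(is_ip('999.999'))          # 0
```

Such a helper could be applied to the whole column with `domain.map(is_ip)`, at the cost of a model whose `isIP` feature differs slightly from the one trained here.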

In [68]:
data['count_com'] = data.url.str.count('com')

In [69]:
data.sample(15, random_state=4)


Out[69]:
url phishing keyword_sign keyword_https keyword_login keyword_.php keyword_.html keyword_@ count_com lenght lenght_domain isIP
28607 http://pennstatehershey.org/web/ibd/home/event... 0 0 0 0 0 0 0 0 80 20 0
3689 http://guiadesanborja.com/multiprinter/muestra... 1 0 0 1 1 0 0 1 81 18 0
6405 http://paranaibaweb.com/faleconosco/accounting... 1 0 0 0 0 1 0 1 65 16 0
35355 http://courts.delaware.gov/Jury%20Services/Hel... 0 0 0 0 0 0 0 0 94 19 0
16520 http://erpa.co/tmp/getproductrequest.htm\n 1 0 0 0 0 0 0 0 39 7 0
16196 http://pulapulapipoca.com/components/com_media... 1 0 0 1 1 0 0 4 239 18 0
3810 http://www.dag.or.kr/zboard/icon/visa/img/Atua... 1 0 0 0 0 0 0 0 62 13 0
3005 http://www.amazingdressup.com/wp-content/theme... 1 0 0 0 0 1 0 1 94 22 0
9003 http://web.indosuksesfutures.com/content_file/... 1 0 0 0 0 0 0 1 80 25 0
34704 http://www.nutritionaltree.com/subcat.aspx?cid... 0 0 0 0 0 0 0 1 69 23 0
12561 http://www.formation-continue-loiret.fr/compon... 1 0 0 0 0 0 0 5 122 32 0
10885 http://191.91.128.205/httpss/bancolombiaa.olb.... 1 0 1 0 1 1 0 2 451 14 1
2633 http://www.sternies-hp.de/components/com_conte... 1 0 0 0 0 0 0 2 85 18 0
22253 http://www.silive.com/northshore/index.ssf/200... 0 0 0 0 0 1 0 1 85 14 0
4720 http://www.dineo.co.za/components/com_content/... 1 0 0 0 1 0 0 3 172 15 0

Create Model


In [70]:
X = data.drop(['url', 'phishing'], axis=1)

In [71]:
y = data.phishing

In [72]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [73]:
clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)

In [74]:
cross_val_score(clf, X, y, cv=10)


Out[74]:
array([ 0.80625,  0.81175,  0.8085 ,  0.79475,  0.8025 ,  0.816  ,
        0.80375,  0.80525,  0.80175,  0.794  ])
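The ten fold scores are easier to compare across models once summarized. A small sketch using only the standard library, with the scores copied from the output above:

```python
from statistics import mean, stdev

# Fold accuracies copied from Out[74]
scores = [0.80625, 0.81175, 0.8085, 0.79475, 0.8025,
          0.816, 0.80375, 0.80525, 0.80175, 0.794]

# Mean accuracy and sample standard deviation across the 10 folds
print('accuracy: %.4f +/- %.4f' % (mean(scores), stdev(scores)))
```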

In [75]:
clf.fit(X, y)


Out[75]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Save model


In [76]:
import joblib

In [79]:
joblib.dump(clf, '22_clf_rf.pkl', compress=3)


Out[79]:
['22_clf_rf.pkl']
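A quick way to confirm that a dumped model can be restored is to load it back and compare predictions. The sketch below does this on a tiny made-up dataset and a temporary file (`toy_clf.pkl`, a hypothetical name), not on the phishing model itself.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up dataset: two features, binary label
X_toy = [[0, 10], [1, 80], [0, 12], [1, 95]]
y_toy = [0, 1, 0, 1]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clf.pkl')
    joblib.dump(model, path, compress=3)
    restored = joblib.load(path)

# The restored model should reproduce the original probabilities exactly
same = bool((model.predict_proba(X_toy) == restored.predict_proba(X_toy)).all())
print(same)
```

A pickled model is generally only safe to load with the same library versions that produced it, so it helps to record those versions alongside the file.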

Part 2: Model in batch

See m22_model_deployment.py, the module imported below.


In [131]:
from m22_model_deployment import predict_proba

In [132]:
predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')


Out[132]:
0.89000000000000001
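The script itself is not reproduced in this notebook. As a rough idea of what such a batch module could contain, here is a minimal sketch that rebuilds the training-time features for one URL and scores it with the saved model. It is not the author's file: `build_features`, the `model_path` argument, and the command-line handling are assumptions made for illustration.

```python
import sys

import joblib
import pandas as pd

KEYWORDS = ['https', 'login', '.php', '.html', '@', 'sign']


def build_features(url):
    """Rebuild the notebook's feature columns for a single URL."""
    df = pd.DataFrame([url], columns=['url'])
    for keyword in KEYWORDS:
        df['keyword_' + keyword] = df.url.str.contains(keyword).astype(int)
    df['lenght'] = df.url.str.len() - 2
    domain = df.url.str.split('/', expand=True).iloc[:, 2]
    df['lenght_domain'] = domain.str.len()
    df['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
    df['count_com'] = df.url.str.count('com')
    return df.drop('url', axis=1)


def predict_proba(url, model_path='22_clf_rf.pkl'):
    """Return the predicted probability that `url` is phishing."""
    clf = joblib.load(model_path)  # assumes the file saved in Part 1 exists
    return clf.predict_proba(build_features(url))[0, 1]


if __name__ == '__main__' and len(sys.argv) > 1:
    # Hypothetical usage: python m22_model_deployment.py <url>
    print(predict_proba(sys.argv[1]))
```

Keeping feature construction in one function means the batch script and the API can share it, which avoids the two drifting apart.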

Part 3: API

Flask is often considered more Pythonic than Django because Flask application code tends to be more explicit. It is also easy for beginners to get started with, since a simple app needs very little boilerplate.
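To make the point about boilerplate concrete, here is a minimal Flask app, exercised with Flask's built-in test client instead of a running server. The `/ping` route, its payload, and the name `demo_app` are made up for this sketch.

```python
from flask import Flask, jsonify

demo_app = Flask(__name__)


@demo_app.route('/ping')
def ping():
    # Return a small JSON document
    return jsonify(status='ok')


# The test client sends requests to the app without opening a network port
with demo_app.test_client() as client:
    response = client.get('/ping')
    status_code = response.status_code
    payload = response.get_json()

print(status_code, payload)  # 200 {'status': 'ok'}
```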

First, install the required library:

pip install flask-restplus

Load Flask


In [87]:
from flask import Flask
from flask_restplus import Api, Resource, fields
import joblib
import pandas as pd

Create api


In [128]:
app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='Phishing Prediction API',
    description='Phishing Prediction API')

ns = api.namespace('predict', 
     description='Phishing Classifier')
   
parser = api.parser()

parser.add_argument(
    'URL', 
    type=str, 
    required=True, 
    help='URL to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})

Load the model and create a function that makes a prediction for a URL


In [129]:
clf = joblib.load('22_clf_rf.pkl') 

@ns.route('/')
class PhishingApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.predict_proba(args)

        return result, 200

    def predict_proba(self, args):
        url = args['URL']
        
        url_ = pd.DataFrame([url], columns=['url'])
        
        # Create features
        keywords = ['https', 'login', '.php', '.html', '@', 'sign']
        for keyword in keywords:
            url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)
        
        url_['lenght'] = url_.url.str.len() - 2
        domain = url_.url.str.split('/', expand=True).iloc[:, 2]
        url_['lenght_domain'] = domain.str.len()
        # Use the domain, as in training; the full URL always contains letters
        url_['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
        url_['count_com'] = url_.url.str.count('com')

        # Make prediction
        p1 = clf.predict_proba(url_.drop('url', axis=1))[0,1]

        print('url=', url,'| p1=', p1)

        return {
         "result": p1
        }

Run API


In [ ]:
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)
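Once the server is running, the endpoint defined above should be reachable at `/predict/` with the target passed in the `URL` query parameter. Because the target is itself a URL, it has to be percent-encoded. A minimal client sketch using only the standard library follows; the helper names and the `localhost:5000` base address are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_request_url(target, base='http://localhost:5000/predict/'):
    """Percent-encode `target` into the API's URL query parameter."""
    return base + '?' + urlencode({'URL': target})


def query_api(target):
    """Call the running API and return the value of its `result` field."""
    with urlopen(build_request_url(target)) as resp:
        return json.load(resp)['result']


# Building the request URL does not need the server to be up
request_url = build_request_url('http://example.com/login.php?id=1&next=2')
print(request_url)
```

Since the response model declares `result` as `fields.String`, the probability should come back as a string and may need converting with `float()` on the client side.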