version 0.1, May 2016
This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License]
Phishing is the act of defrauding an online user to obtain personal information by posing as a trustworthy institution or entity. Users often have a hard time telling legitimate and malicious sites apart because phishing sites are made to look exactly like the originals, so there is a need for better tools to combat attackers.
In [1]:
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)
In [2]:
data.head()
Out[2]:
In [4]:
data.phishing.value_counts()
Out[4]:
In [92]:
data.url[data.phishing==1].sample(50, random_state=1).tolist()
Out[92]:
Create binary features indicating whether the URL contains any of the following keywords:
In [26]:
keywords = ['https', 'login', '.php', '.html', '@', 'sign']
In [31]:
for keyword in keywords:
    # regex=False so that '.php' and '.html' are matched literally
    data['keyword_' + keyword] = data.url.str.contains(keyword, regex=False).astype(int)
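As a quick sanity check (an extra inspection step, not part of the original cells), we can peek at the new binary columns on a few rows:
# Inspect the keyword indicator columns alongside the raw URL
data[['url'] + ['keyword_' + k for k in keywords]].head()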
In [35]:
data['length'] = data.url.str.len() - 2
In [38]:
# The domain is the third element when splitting on '/', e.g. 'http:', '', 'domain.com', ...
domain = data.url.str.split('/', expand=True).iloc[:, 2]
In [41]:
data['length_domain'] = domain.str.len()
In [44]:
domain.head(12)
Out[44]:
In [67]:
# Flag URLs whose domain is a bare IP address (dot escaped because str.replace treats the pattern as a regex)
data['isIP'] = domain.str.replace(r'\.', '').str.isnumeric().astype(int)
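To see what the isIP flag captures, here is a minimal check on two made-up values (hypothetical examples, not rows from the dataset):
# '192.168.0.1' becomes purely numeric once the dots are stripped; 'www.google.com' does not
pd.Series(['www.google.com', '192.168.0.1']).str.replace(r'\.', '').str.isnumeric()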
In [68]:
data['count_com'] = data.url.str.count('com')
In [69]:
data.sample(15, random_state=4)
Out[69]:
In [70]:
X = data.drop(['url', 'phishing'], axis=1)
In [71]:
y = data.phishing
In [72]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
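Note that sklearn.cross_validation was deprecated in scikit-learn 0.18 and later removed; on newer versions the equivalent import is:
# Equivalent import for scikit-learn >= 0.18
from sklearn.model_selection import cross_val_score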
In [73]:
clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)
In [74]:
cross_val_score(clf, X, y, cv=10)
Out[74]:
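A convenient way to summarize the ten fold scores (an extra step, not in the original cells):
import numpy as np

scores = cross_val_score(clf, X, y, cv=10)
print('accuracy: %0.3f (+/- %0.3f)' % (np.mean(scores), np.std(scores)))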
In [75]:
clf.fit(X, y)
Out[75]:
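Once the forest is fitted, it can be useful to see which engineered features it relies on most (an optional inspection step, not in the original notebook):
# Rank the features by importance in the fitted random forest
pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)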
In [76]:
from sklearn.externals import joblib
In [79]:
joblib.dump(clf, '22_clf_rf.pkl', compress=3)
Out[79]:
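sklearn.externals.joblib has since been removed from scikit-learn (as of 0.23); with newer versions the standalone joblib package is used instead:
# Equivalent persistence with the standalone joblib package (scikit-learn >= 0.23)
import joblib
joblib.dump(clf, '22_clf_rf.pkl', compress=3)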
In [131]:
from m22_model_deployment import predict_proba
In [132]:
predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')
Out[132]:
First, we need to install some libraries:
pip install flask-restplus
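flask-restplus is no longer maintained; its drop-in fork flask-restx exposes the same API, so an alternative (if the imports below are adapted accordingly) would be:
pip install flask-restx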
Load Flask
In [87]:
from flask import Flask
from flask_restplus import Api, Resource, fields
from sklearn.externals import joblib
import pandas as pd
Create the API
In [128]:
app = Flask(__name__)

api = Api(
    app,
    version='1.0',
    title='Phishing Prediction API',
    description='Phishing Prediction API')

ns = api.namespace('predict',
                   description='Phishing Classifier')

parser = api.parser()
parser.add_argument(
    'URL',
    type=str,
    required=True,
    help='URL to be analyzed',
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})
Load the model and create a resource that predicts whether a URL is phishing
In [129]:
clf = joblib.load('22_clf_rf.pkl')

@ns.route('/')
class PhishingApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.predict_proba(args)
        return result, 200

    def predict_proba(self, args):
        url = args['URL']
        url_ = pd.DataFrame([url], columns=['url'])

        # Create features (same names and order as in training)
        keywords = ['https', 'login', '.php', '.html', '@', 'sign']
        for keyword in keywords:
            url_['keyword_' + keyword] = url_.url.str.contains(keyword, regex=False).astype(int)

        url_['length'] = url_.url.str.len() - 2

        domain = url_.url.str.split('/', expand=True).iloc[:, 2]
        url_['length_domain'] = domain.str.len()

        # Use the domain (as in training), with the dot escaped for the regex replace
        url_['isIP'] = domain.str.replace(r'\.', '').str.isnumeric().astype(int)

        url_['count_com'] = url_.url.str.count('com')

        # Make prediction
        p1 = clf.predict_proba(url_.drop('url', axis=1))[0, 1]
        print('url=', url, '| p1=', p1)

        return {
            "result": p1
        }
Run the API
In [ ]:
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)
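With the server running, the endpoint can be queried over HTTP; a minimal client-side sketch, assuming the API runs on localhost and using a hypothetical URL to score:
import requests

# 'localhost:5000' assumes the server started in the cell above; the URL below is a made-up example
r = requests.get('http://localhost:5000/predict/',
                 params={'URL': 'http://www.example.com/login.php'})
print(r.json())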