07 - Model Deployment

by Alejandro Correa Bahnsen & Iván Torroledo

version 1.2, Feb 2018

Part of the class Machine Learning for Risk Management

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Agenda:

Creating and saving a model
Running the model in batch
Exposing the model as an API

Part 1: Phishing Detection

Phishing is the act of defrauding an online user to obtain personal information by posing as a trustworthy institution or entity. Users usually have a hard time telling legitimate and malicious sites apart because the malicious ones are made to look exactly like the originals. There is therefore a need for better tools to combat attackers.



In [1]:

    
import pandas as pd
import zipfile
with zipfile.ZipFile('../datasets/model_deployment/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)



In [2]:

    
data.head()









    Out[2]:







  
    
      
                                                 url  phishing
0  http://www.subalipack.com/contact/images/sampl...         1
1  http://fasc.maximecapellot-gypsyjazz-ensemble....         1
2  http://theotheragency.com/confirmer/confirmer-...         1
3  http://aaalandscaping.com/components/com_smart...         1
4  http://paypal.com.confirm-key-21107316126168.s...         1



In [3]:

    
data.tail()









    Out[3]:







  
    
      
                                                     url  phishing
39995  http://www.diaperswappers.com/forum/member.php...         0
39996  http://posting.bohemian.com/northbay/Tools/Ema...         0
39997  http://www.tripadvisor.jp/Hotel_Review-g303832...         0
39998  http://www.baylor.edu/content/services/downloa...         0
39999  http://www.phinfever.com/forums/viewtopic.php?...         0



In [4]:

    
data.phishing.value_counts()









    Out[4]:





1    20000
0    20000
Name: phishing, dtype: int64

Creating features



In [5]:

    
data.url[data.phishing==1].sample(50, random_state=1).tolist()









    Out[5]:





['http://dothan.com.co/gold/austspark/index.htm\n',
 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\n',
 'http://verify95.5gbfree.com/coverme2010/\n',
 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\n',
 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\n',
 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\n',
 'http://senevi.com/confirmation/\n',
 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\n',
 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\n',
 'http://alen.co/docs/cleaner\n',
 'http://rattanhouse.co/Atualizacao_Bradesco/cadastro2013.php?2MAS2XACUJPI3U8D9ZDDG2G9YJICVABQ3K73KWDKYK0NA0AWWWCOUEDUJRXHRKPNMUYLDV89RA6OCG2MQUS0TAUXX9IOGJUEIXPDS5B0RM18OF1H860UAMJOY6ICUR81VSEKKJFPBYNLYGUXBGJ1HEHKOMLTM01P658M\n',
 'http://steamcommunily.co/p.php?login=true\n',
 'http://www.nyyg.com/Bradesco/5W9SQ394.html\n',
 'http://wp.tipografiacentral.com.co/sparkde/index.html\n',
 'http://www.entrerev.com/component/.secure.wpa/.www.paypal.com.returnUrl=/cgi-bin/5RF3S6y0K349/PayPal.co.uk/dispute_centre/sotmks/npsw&st.payment.decline.centre/ipoi/secure-codes.paypal.account4738154login.complete-infrmations.login.accountSecure26/securities/\n',
 'http://x.co/SecurCent\n',
 'http://dejatequerer.co/united.com/index.html\n',
 'http://www.speakeasymovies.com/components/com_wrapper/.amazon.co.uk/\n',
 'http://www.culturaespanola.com.br/bt/www.paypal.com/paypal.com.com/index-new.php\n',
 'http://www.agroassistance.com/components/com_content/c05354aa285b6a932a57086ba13762a1/\n',
 'http://www.estranetsrl.com.ar/bbvacambios.html\n',
 'http://osfsw.cba.pl/content/classic/html/ibpf/bradesco/?UOREEIYGQTERIRVSJTUHMVMZJWWYSVNYQOFSPWVFTEJEEKMJWHFERRYTFRWPSYYWGFIGJUPLZMZLTNSKOGMQQSHSXPLMXILVSM\n',
 'http://bitcrush.co/~geetha5/natwest/natwest/ibcarregister-natwst.html\n',
 'http://cannot-hide-from-PhishTank.zenith-services.com/controllare/auth/\n',
 'http://nova.pymesonline.co/fr.php\n',
 'http://comococino.com/wp-content/uploads/2013/01/paypal.com/us/cgi-bin/webscr.htm?\n',
 'http://www.fundacionchwinqlal.com.gt/imgs/Notas/img/_New/Agencias_Bradesco/Public_201133.php?KSR6YOU359CY1USIRMSBI8CFJF7TVREFJ6KIUFKZNXXNRP7JBYVU79APNGJI8YYR5I0YXUXLRU0JKF4WEYQL81BUGVDOTBFXUPVSKSEBNNU84X4IWT54UFYABCY5OE3J5XBOQQ1EDVMHTPZPJ4TEJSOU5NZS32B8ZNWQ\n',
 'http://flightripe.com/confirmation/update/billing/9a523c6017caa3406af9d5c2c0cb1854/\n',
 'http://accademiazerootto.it/templates/zerootto-new/html/com_content/category/bompreco.php\n',
 'http://santanderseguranca.zapto.org/Clientesx/\n',
 'http://www.muttico.com/components/com_media/p3rs0na4l/53f8b14c76c890e1806b8f9d97f12f80/\n',
 'http://us.fxlhtvf.ml/login/en/login.html.asp?refhttp:%2F%2Futddirect.com%2Fcomponents%2Fcom_content%2Fviews%2Fcategories%2Fmenu.html\n',
 'http://conferencistainternacional.com.co/urruirrhyttjk/Index.htm\n',
 'http://www.creativesovereign.com/components/com_newsfeeds/views/.../perfil/\n',
 'http://villamarina.com.co/administrator/servers/BankofAmerica/security-update/SecMeasure/account-overview.cgi/presentation/jskeys/sas/signonScreen.do/\n',
 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php\n',
 'http://www.enoxia.fr/components/com_content/tamfidelidade01.php\n',
 'http://gobbva.com/bb/empresa/index.php?tarjeta=\n',
 'http://paypal-com-confim.sharmikelectric.com/s4575234bf5055889415\n',
 'http://paypal.com.au.au.webapps.mpp.homes.konyadosemeciler.com/confirm/login.australia/au/webapps/mpp/home/initthi.php?cmd=SignIn&co_partnerId=2&pUserId=&siteid=0&pageType=&pa1=&i1=&bshowgif=&UsingSSL=&ru=&pp=&pa2=&errmsg=&runame=%5C%5C%5C%5C\n',
 'http://www.bbvabancocontinental.ya.st\n',
 'http://www.giannielectric.com/company/components/com_poll/assets/a/a5643cded2383f7568719482a943e1a5\n',
 'http://cooperativasanjose.com.co/plugins/josetta_ext/k2category/section/first.php\n',
 'http://appleid-apple-com-confirm-oyns-uattw6w61x3oka3pq.scientificcollectables.com/3c43e3d92e0b8a48f09f5fbb25d008a9/index1.php?cmd=https://connect.paypal.com/WebObjects/iTunesConnect.woa?login-processing=t&login_access=13409884065d3a174c294a9bf21bf71c23a3\n',
 'http://consultoriojuridico.co/pp/www.paypal.com/\n',
 'http://lovetodo.in.th/administrator/components/com_content/models/key/\n',
 'http://lnk.co/io6u45y45?erydh?mario.Carelli@poste.it\n',
 'http://www2.bancobbvacontnental.com/Centroll/informe/03/14/datitarlz/WUJFQ0VSUkFATVVOSVpMQVcuQ09N\n',
 'http://lfcintl.com/components/com_user/zzxc/bpd.com.do/app/do/personas/289302294350311363178310441412402464323394411438376403437407/banco.popular.php?Personal\n',
 'http://procuraduria.videoteca.com.co/update/apple.com/.cgi-bin/WebObjects/MyAppleIdwoa/wa/sign_in.html?appId=4129.returnURL=DaHR0cDovL3N0b3JlLmFwcGxlLmNvbS91c3wxYW9zZmU4OGZjNWIyNThhYWVhOTM5MzVjZjI2NTk1OGE3MWUwY2Y0MmI2OA%26r%3DSDHCD9JUYKX777H9KT\n']

Does the URL contain any of the following?

https
login
.php
.html
@
sign
?



In [6]:

    
keywords = ['https', 'login', '.php', '.html', '@', 'sign']



In [7]:

    
for keyword in keywords:
    data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)
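
Note that `str.contains` treats the pattern as a regular expression by default, so '.php' matches any character followed by 'php', and a pattern such as '?' on its own is not a valid expression. The following is a minimal sketch of the difference, assuming literal substring matching is what is wanted; the sample URLs are made up for illustration.

```python
import pandas as pd

# Hypothetical sample URLs, for illustration only
sample = pd.Series([
    'http://example.com/index.php',
    'http://example.com/xphp/page',
    'http://example.com/search?q=1',
])

# Default (regex) matching: '.' matches any character, so 'xphp' also matches
regex_match = sample.str.contains('.php').astype(int).tolist()

# Literal matching: only the exact substring '.php' counts
literal_match = sample.str.contains('.php', regex=False).astype(int).tolist()

# Literal matching also makes '?' usable as a keyword
question_mark = sample.str.contains('?', regex=False).astype(int).tolist()

print(regex_match, literal_match, question_mark)
```

Switching to `regex=False` changes the feature values slightly, so a model trained on the regex version would need to be refit.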

Length of the URL
Length of the domain
Is the domain an IP address?
Number of times 'com' appears in the URL



In [8]:

    
data['lenght'] = data.url.str.len() - 2



In [9]:

    
domain = data.url.str.split('/', expand=True).iloc[:, 2]



In [10]:

    
data['lenght_domain'] = domain.str.len()



In [11]:

    
domain.head(12)









    Out[11]:





0                                    www.subalipack.com
1             fasc.maximecapellot-gypsyjazz-ensemble.nl
2                                    theotheragency.com
3                                    aaalandscaping.com
4     paypal.com.confirm-key-21107316126168.securepp...
5                              lcthomasdeiriarte.edu.co
6                                       livetoshare.org
7                                            www.i-m.co
8                                     manuelfernando.co
9                                www.bladesmithnews.com
10                                      www.rasbaek.com
11                                      199.231.190.160
Name: 2, dtype: object
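
Splitting on '/' works here because every URL starts with a scheme such as http://. As a hedged alternative (not the approach used in this notebook), the standard library's URL parser returns the same host part; the URLs below are taken from the phishing sample shown earlier.

```python
from urllib.parse import urlsplit

urls = [
    'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php',
    'http://x.co/SecurCent',
    'http://www.bbvabancocontinental.ya.st',
]

# Approach used above: third element after splitting on '/'
domains_split = [url.split('/')[2] for url in urls]

# Standard-library parser: netloc is the host part of the URL
domains_parsed = [urlsplit(url).netloc for url in urls]

print(domains_parsed)
print(domains_split == domains_parsed)
```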



In [12]:

    
# Drop the dots literally (regex=False) and check that only digits remain
data['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
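
The digits-only check above would also flag an all-numeric host name that is not a valid address. A stricter variant, sketched here as an assumption rather than the notebook's method, relies on the standard `ipaddress` module; `is_ip` is a hypothetical helper name and '12345' is a made-up host.

```python
import ipaddress

def is_ip(host):
    """Return 1 if `host` parses as an IPv4 or IPv6 address, else 0."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return 0
    return 1

hosts = ['199.231.190.160', 'www.i-m.co', '12345']

# Strict check with the ipaddress module
flags = [is_ip(h) for h in hosts]

# Digits-only check, as in the cell above
digits_only = [int(h.replace('.', '').isnumeric()) for h in hosts]

print(flags, digits_only)
```

On the `domain` series this could be applied with `domain.map(is_ip)`.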



In [13]:

    
data['count_com'] = data.url.str.count('com')



In [14]:

    
data.sample(15, random_state=4)









    Out[14]:







  
    
      
                                                     url  phishing  keyword_https  keyword_login  keyword_.php  keyword_.html  keyword_@  keyword_sign  lenght  lenght_domain  isIP  count_com
28607  http://pennstatehershey.org/web/ibd/home/event...         0              0              0             0              0          0             0      80             20     0          0
 3689  http://guiadesanborja.com/multiprinter/muestra...         1              0              1             1              0          0             0      81             18     0          1
 6405  http://paranaibaweb.com/faleconosco/accounting...         1              0              0             0              1          0             0      65             16     0          1
35355  http://courts.delaware.gov/Jury%20Services/Hel...         0              0              0             0              0          0             0      94             19     0          0
16520         http://erpa.co/tmp/getproductrequest.htm\n         1              0              0             0              0          0             0      39              7     0          0
16196  http://pulapulapipoca.com/components/com_media...         1              0              1             1              0          0             0     239             18     0          4
 3810  http://www.dag.or.kr/zboard/icon/visa/img/Atua...         1              0              0             0              0          0             0      62             13     0          0
 3005  http://www.amazingdressup.com/wp-content/theme...         1              0              0             0              1          0             0      94             22     0          1
 9003  http://web.indosuksesfutures.com/content_file/...         1              0              0             0              0          0             0      80             25     0          1
34704  http://www.nutritionaltree.com/subcat.aspx?cid...         0              0              0             0              0          0             0      69             23     0          1
12561  http://www.formation-continue-loiret.fr/compon...         1              0              0             0              0          0             0     122             32     0          5
10885  http://191.91.128.205/httpss/bancolombiaa.olb....         1              1              0             1              1          0             0     451             14     1          2
 2633  http://www.sternies-hp.de/components/com_conte...         1              0              0             0              0          0             0      85             18     0          2
22253  http://www.silive.com/northshore/index.ssf/200...         0              0              0             0              1          0             0      85             14     0          1
 4720  http://www.dineo.co.za/components/com_content/...         1              0              0             1              0          0             0     172             15     0          3

Create Model



In [15]:

    
X = data.drop(['url', 'phishing'], axis=1)



In [16]:

    
y = data.phishing



In [17]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score



In [18]:

    
clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)



In [19]:

    
cross_val_score(clf, X, y, cv=10)









    Out[19]:





array([0.80875, 0.80825, 0.804  , 0.79025, 0.80475, 0.81125, 0.80475,
       0.80675, 0.8045 , 0.78925])
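
The ten fold scores are easier to compare against other models once summarized. A small sketch using the values printed above (for a classifier, `cross_val_score` reports accuracy unless another `scoring` is given):

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above
scores = [0.80875, 0.80825, 0.804, 0.79025, 0.80475,
          0.81125, 0.80475, 0.80675, 0.8045, 0.78925]

mean_score = mean(scores)   # average accuracy across the 10 folds
std_score = stdev(scores)   # sample standard deviation across folds

print(f'accuracy: {mean_score:.4f} +/- {std_score:.4f}')
```

The mean is roughly 0.80, and the spread across folds is small.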



In [20]:

    
clf.fit(X, y)









    Out[20]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Save model



In [21]:

    
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package directly
import joblib



In [22]:

    
joblib.dump(clf, '../datasets/model_deployment/07_phishing_clf.pkl', compress=3)









    Out[22]:





['../datasets/model_deployment/07_phishing_clf.pkl']
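
Before deploying a saved model it is worth checking that it reloads and predicts exactly as the in-memory one. The following is a minimal sketch of that round trip on synthetic data; `clf_demo`, `X_demo` and the file name are illustrative and not part of the notebook.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_clf.pkl')
    joblib.dump(clf_demo, path, compress=3)   # same call pattern as above
    clf_loaded = joblib.load(path)

# The reloaded model should give identical probabilities
same_predictions = bool(
    (clf_loaded.predict_proba(X_demo) == clf_demo.predict_proba(X_demo)).all()
)
print(same_predictions)
```

Pickled scikit-learn models are not guaranteed to load across library versions, so the serving environment should generally use the same scikit-learn version as training.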

Part 2: Model in batch

See m07_model_deployment.py



In [23]:

    
from m07_model_deployment import predict_proba



In [24]:

    
predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')









    Out[24]:





0.6824999999999997
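
The module m07_model_deployment.py is not reproduced in this notebook. Below is a minimal sketch of what a batch scoring helper might look like, assuming it mirrors the feature steps from Part 1. `build_features` and `score_urls` are hypothetical names, and `clf` stands for the fitted classifier loaded with joblib.

```python
import pandas as pd

KEYWORDS = ['https', 'login', '.php', '.html', '@', 'sign']

def build_features(urls):
    """Recreate the feature columns from Part 1 for a list of URLs."""
    df = pd.DataFrame({'url': list(urls)})
    for keyword in KEYWORDS:
        df['keyword_' + keyword] = df.url.str.contains(keyword).astype(int)
    # Mirrors the notebook's definition of the length feature
    df['lenght'] = df.url.str.len() - 2
    domain = df.url.str.split('/', expand=True).iloc[:, 2]
    df['lenght_domain'] = domain.str.len()
    df['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
    df['count_com'] = df.url.str.count('com')
    return df.drop('url', axis=1)

def score_urls(clf, urls):
    """Return the phishing probability (class 1) for each URL."""
    return clf.predict_proba(build_features(urls))[:, 1]

# Feature rows for two example URLs (the first one is made up)
features = build_features(['http://191.91.128.205/login.php',
                           'http://x.co/SecurCent'])
print(features.T)
```

With the saved model, `score_urls(joblib.load(path), urls)` would return one probability per URL. Keeping a single feature function for training, batch scoring and the API helps the three stay in sync.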

Part 3: API

Flask is often considered more Pythonic than Django because Flask application code tends to be more explicit. It is also easy to get started with, since a simple app needs very little boilerplate code.

First, install the required library:

pip install flask-restplus

Load Flask



In [25]:

    
from flask import Flask
from flask_restplus import Api, Resource, fields
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import pandas as pd

Create the API



In [26]:

    
app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='Phishing Prediction API',
    description='Phishing Prediction API')

ns = api.namespace('predict', 
     description='Phishing Classifier')
   
parser = api.parser()

parser.add_argument(
    'URL', 
    type=str, 
    required=True, 
    help='URL to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.Float,  # the prediction is a probability, not text
})

Load the model and create a function that scores a URL



In [27]:

    
clf = joblib.load('../datasets/model_deployment/07_phishing_clf.pkl') 

@ns.route('/')
class PhishingApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.predict_proba(args)

        return result, 200

    def predict_proba(self, args):
        url = args['URL']
        
        url_ = pd.DataFrame([url], columns=['url'])
        
        # Create features
        keywords = ['https', 'login', '.php', '.html', '@', 'sign']
        for keyword in keywords:
            url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)
        
        url_['lenght'] = url_.url.str.len() - 2
        domain = url_.url.str.split('/', expand=True).iloc[:, 2]
        url_['lenght_domain'] = domain.str.len()
        # Check the domain (not the full URL), as in training
        url_['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
        url_['count_com'] = url_.url.str.count('com')

        # Make prediction
        p1 = clf.predict_proba(url_.drop('url', axis=1))[0,1]

        print('url=', url,'| p1=', p1)

        return {
         "result": p1
        }

Run API



In [28]:

    
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)









    



 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [07/Feb/2018 15:56:34] "GET /predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/ HTTP/1.1" 200 -






    



url= http://consultoriojuridico.co/pp/www.paypal.com/ | p1= 0.32507780944545644

Check using

http://localhost:5000/predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/
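
The example URL above has no query string of its own, so pasting it into the address works. A URL that contains '?' or '&' needs to be percent-encoded first, otherwise the server only receives part of it. The following is a minimal sketch of building the request address with the standard library, assuming the API runs on localhost:5000 as above; the target URL comes from the batch example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'http://localhost:5000/predict/'
target = ('http://www.vipturismolondres.com/com.br/'
          '?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')

# Unencoded: everything after '&' is read as a separate parameter
naive_url = base + '?URL=' + target
naive = parse_qs(urlsplit(naive_url).query)['URL'][0]

# Encoded: reserved characters are escaped, so the full URL survives
request_url = base + '?' + urlencode({'URL': target})
decoded = parse_qs(urlsplit(request_url).query)['URL'][0]

print(request_url)
print(naive == target, decoded == target)
```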
