Import data from web scrawlers

Here we first build a web scrawler to scrap all the trafic information from the official twitter accounts of RATP, SNCF



In [1]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# ---- Summary of the twitter accounts -----#
# RER_A
# RERB
# RERC_SNCF --< Infotrafic
# RERD_SNCF --< Infotrafic
# RERE_SNCF --< Infotrafic
# Ligne12_RATP --< from 1 to 14 lines

line_list = ['RER_A', 'RER_B', 'RER_C', 'RER_D', 'RER_E','Ligne1_RATP', 'Ligne2_RATP', 'Ligne3_RATP', 'Ligne4_RATP',
 'Ligne5_RATP', 'Ligne6_RATP', 'Ligne7_RATP', 'Ligne8_RATP', 'Ligne9_RATP', 'Ligne10_RATP', 'Ligne11_RATP', 'Ligne12_RATP',
 'Ligne13_RATP', 'Ligne14_RATP' ]

file_path = "data/"
line_dict = dict()
for item in line_list:
    line_dict[item] = pd.read_csv(file_path + item +'.csv', sep=';',error_bad_lines=False)









    



b'Skipping line 4313: expected 10 fields, saw 11\n'
b'Skipping line 1089: expected 10 fields, saw 11\nSkipping line 4918: expected 10 fields, saw 11\nSkipping line 7193: expected 10 fields, saw 11\n'
b'Skipping line 11823: expected 10 fields, saw 11\nSkipping line 12791: expected 10 fields, saw 11\nSkipping line 17202: expected 10 fields, saw 11\nSkipping line 17564: expected 10 fields, saw 11\nSkipping line 17567: expected 10 fields, saw 11\nSkipping line 17835: expected 10 fields, saw 11\nSkipping line 17848: expected 10 fields, saw 11\nSkipping line 18418: expected 10 fields, saw 11\nSkipping line 18430: expected 10 fields, saw 11\nSkipping line 18509: expected 10 fields, saw 11\nSkipping line 18599: expected 10 fields, saw 11\nSkipping line 18880: expected 10 fields, saw 11\nSkipping line 18920: expected 10 fields, saw 11\nSkipping line 19707: expected 10 fields, saw 11\n'
b'Skipping line 4081: expected 10 fields, saw 11\nSkipping line 10409: expected 10 fields, saw 11\nSkipping line 14147: expected 10 fields, saw 11\nSkipping line 16851: expected 10 fields, saw 11\nSkipping line 17961: expected 10 fields, saw 11\nSkipping line 18574: expected 10 fields, saw 11\nSkipping line 18658: expected 10 fields, saw 11\nSkipping line 21004: expected 10 fields, saw 11\nSkipping line 21895: expected 10 fields, saw 11\nSkipping line 23069: expected 10 fields, saw 11\n'
b'Skipping line 5159: expected 10 fields, saw 11\nSkipping line 7458: expected 10 fields, saw 11\nSkipping line 7672: expected 10 fields, saw 11\n'
b'Skipping line 428: expected 10 fields, saw 11\n'

First, have a look at the data



In [2]:

    
line_dict['RER_A'].sort_values(by='retweets',ascending = False).head()









    Out[2]:







  
    
      
      username
      date
      retweets
      favorites
      text
      geo
      mentions
      hashtags
      id
      permalink
    
  
  
    
      2987
      RER_A
      2017-05-24 12:41
      1996
      416
      Ce chien a été retrouvé ce matin, vers 7h, sur...
      NaN
      NaN
      #RERA
      8.673298e+17
      https://twitter.com/RER_A/status/8673297646919...
    
    
      107
      RER_A
      2017-10-05 17:52
      429
      145
      1 incident électrique au niveau de la caténair...
      NaN
      NaN
      #RERA
      9.159679e+17
      https://twitter.com/RER_A/status/9159679468995...
    
    
      784
      RER_A
      2017-08-28 12:02
      342
      409
      [ #TravauxEteRERA ] Les travaux sont terminés ...
      NaN
      NaN
      #TravauxEteRERA #RERA
      9.021090e+17
      https://twitter.com/RER_A/status/9021090016257...
    
    
      16061
      RER_A
      2015-01-29 06:23
      306
      19
      Arrêt de travail spontané en raison de l'agres...
      NaN
      NaN
      #RERA
      5.606696e+17
      https://twitter.com/RER_A/status/5606696327600...
    
    
      16046
      RER_A
      2015-01-29 08:01
      295
      26
      Depuis 7h52 fermeture des gares #RERA #RATP sa...
      NaN
      NaN
      #RERA #RATP
      5.606942e+17
      https://twitter.com/RER_A/status/5606941607020...

Check how many items in each sub data set



In [3]:

    
# Check how many items we have
for k,v in line_dict.items():
    print(k, v.shape)









    



RER_A (21028, 10)
RER_B (15540, 10)
RER_C (23086, 10)
RER_D (23306, 10)
RER_E (8497, 10)
Ligne1_RATP (7436, 10)
Ligne2_RATP (5518, 10)
Ligne3_RATP (5043, 10)
Ligne4_RATP (9091, 10)
Ligne5_RATP (5998, 10)
Ligne6_RATP (6977, 10)
Ligne7_RATP (5963, 10)
Ligne8_RATP (5903, 10)
Ligne9_RATP (5642, 10)
Ligne10_RATP (3950, 10)
Ligne11_RATP (3812, 10)
Ligne12_RATP (6335, 10)
Ligne13_RATP (6992, 10)
Ligne14_RATP (4927, 10)

Have a look at the data in each sub dataset

Since RER datasets may have different patterns comparing to metros.



In [4]:

    
line_dict['Ligne11_RATP'].head()









    Out[4]:







  
    
      
      username
      date
      retweets
      favorites
      text
      geo
      mentions
      hashtags
      id
      permalink
    
  
  
    
      0
      Ligne11_RATP
      2017-10-10 12:19
      0
      0
      Bonjour, la #Ligne11 fonctionne correctement, ...
      NaN
      NaN
      #Ligne11
      9.176961e+17
      https://twitter.com/Ligne11_RATP/status/917696...
    
    
      1
      Ligne11_RATP
      2017-10-09 14:22
      0
      1
      Demain, le trafic sera normal sur votre #ligne...
      NaN
      NaN
      #ligne11 #RATP #greve10octobre
      9.173647e+17
      https://twitter.com/Ligne11_RATP/status/917364...
    
    
      2
      Ligne11_RATP
      2017-10-08 12:58
      0
      1
      Retour à un trafic régulier sur l'ensemble de ...
      NaN
      NaN
      #Ligne11 #RATP
      9.169811e+17
      https://twitter.com/Ligne11_RATP/status/916981...
    
    
      3
      Ligne11_RATP
      2017-10-08 12:51
      1
      0
      12:51, Retour à un trafic régulier sur l'ensem...
      NaN
      NaN
      #Ligne11 #RATP
      9.169795e+17
      https://twitter.com/Ligne11_RATP/status/916979...
    
    
      4
      Ligne11_RATP
      2017-10-08 12:45
      1
      0
      12:43, le trafic reprend progressivement (coli...
      NaN
      NaN
      #RATP #Ligne11
      9.169778e+17
      https://twitter.com/Ligne11_RATP/status/916977...



In [5]:

    
line_dict['RER_B'].head()









    Out[5]:







  
    
      
      username
      date
      retweets
      favorites
      text
      geo
      mentions
      hashtags
      id
      permalink
    
  
  
    
      0
      RERB
      2017-10-12 19:16
      0
      0
      Toutes les missions sont rendues omnibus. Bonn...
      NaN
      NaN
      NaN
      9.185259e+17
      https://twitter.com/RERB/status/91852587036885...
    
    
      1
      RERB
      2017-10-12 18:51
      1
      0
      Bonsoir, le trafic reprend progressivement mai...
      NaN
      NaN
      NaN
      9.185196e+17
      https://twitter.com/RERB/status/91851956429560...
    
    
      2
      RERB
      2017-10-12 18:45
      1
      0
      Il est également possible de prendre un train ...
      NaN
      NaN
      NaN
      9.185180e+17
      https://twitter.com/RERB/status/91851797897883...
    
    
      3
      RERB
      2017-10-12 18:40
      0
      0
      Bonsoir, oui des trains circulent au départ d'...
      NaN
      NaN
      NaN
      9.185168e+17
      https://twitter.com/RERB/status/91851677975599...
    
    
      4
      RERB
      2017-10-12 18:36
      0
      1
      Bonsoir, l'incident est terminé, le trafic rep...
      NaN
      NaN
      NaN
      9.185157e+17
      https://twitter.com/RERB/status/91851566784590...

We find a fun fact by checking data from RER_A: the most popular twitter is about a refound dog???



In [6]:

    
# sort by retweets --> a dog is missing??? comes up first???
line_dict['RER_A'].sort_values(by='retweets',ascending = False)['text'].head()









    Out[6]:





2987     Ce chien a été retrouvé ce matin, vers 7h, sur...
107      1 incident électrique au niveau de la caténair...
784      [ #TravauxEteRERA ] Les travaux sont terminés ...
16061    Arrêt de travail spontané en raison de l'agres...
16046    Depuis 7h52 fermeture des gares #RERA #RATP sa...
Name: text, dtype: object

Import different station names from RATP API



In [7]:

    
# find all station names --> gares
df_st = pd.read_csv('data\gares.csv', delimiter=';')
gares = df_st.nomptar.str.split('(')
gares = [x[0].rstrip(' ') for x in gares] # la defense has a lagging space

Data Cleaning:

Build a data cleaning function.
Run the function over all datasets.



In [8]:

    
# change display to 200
## Step 1: delete all 
## Théo, Bonjour, @, Tom, Emma

import re
def clean_data(input):
    pd.options.display.max_colwidth = 200
    input['date'] = pd.to_datetime(input.date)
    input = input[input.date >= pd.to_datetime('2014-1-1')]
    
    # replace pte, chateau 
    input.text = input.text.str.replace('Pte|pte', 'Porte')
    input.text = input.text.str.replace('Chateau|chateau', 'Château')
    input.text = input.text.str.replace('électr.', 'électrique')
    input.text = input.text.str.replace('tvx.', 'travaux')    
    
    # in RER C, D, E, they published traffic information 
    # with a hashtag of "Infotrafic"
    if re.search('RER[CDE]_SNCF',input.username.iloc[0]):
        output = input[input.text.str.contains('Infotrafic', na=False)]
    else:
        # for all other lines, 
        # we drop the conservations data (see report for more details)
        to_drop = ["Bonjour", "@",'Théo', 'Emma','Bjr','Inès',
                   'Lana','vous','soirée','Oui','estimée', 
                   'Travaux prévus','journée','bonjour','rerb', 
                   'rerc','rerd', 'rere','Infotrafic'] # all about conversations 
        output = input[~input.text.str.contains('|'.join(to_drop), na=False)] 
    return output

Check how many items left in each sub dataset.



In [9]:

    
for k in line_dict.keys():
    line_dict[k] = clean_data(line_dict[k])
    print(k, line_dict[k].shape)









    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value






    



RER_A (8214, 10)
RER_B (6586, 10)
RER_C (2502, 10)
RER_D (2967, 10)
RER_E (1567, 10)
Ligne1_RATP (2965, 10)
Ligne2_RATP (2221, 10)
Ligne3_RATP (2073, 10)
Ligne4_RATP (4314, 10)
Ligne5_RATP (2671, 10)
Ligne6_RATP (3439, 10)
Ligne7_RATP (2851, 10)
Ligne8_RATP (3013, 10)
Ligne9_RATP (2546, 10)
Ligne10_RATP (1345, 10)
Ligne11_RATP (1285, 10)
Ligne12_RATP (2616, 10)
Ligne13_RATP (3104, 10)
Ligne14_RATP (1786, 10)



In [10]:

    
line_dict['RER_A'].sample(3).text









    Out[10]:





599                      09:31, le trafic est perturbé entre Etoile et Boissy/ Marne-la-Vallee (incident voyageur) #RERA
13900                       09:21, le trafic est ralenti de Etoile vers St-Germain-en-L. (incident voyageur) #RATP #RERA
16962    Notre personnel vient de prendre contact avec les services de police (colis suspect à Gare de Lyon) #RERA #RATP
Name: text, dtype: object

Using NLP to analyze tweets

Find the most frequent words
Find the most frequent gares
Find the most stations



In [11]:

    
# top 20 frequent words
import nltk
def words_freq(output):
    moby_tokens = nltk.word_tokenize(output.text.str.lower().str.cat(sep = ' '))
    text1 = nltk.Text(moby_tokens)
    nltk.FreqDist(text1).most_common(20)
    stopwords = nltk.corpus.stopwords.words('french')
    stopwords = stopwords + ['rera','rerb','rerc','rerd','rere',
                             'ratp','ligne','entre',
                             'http','les','vers','dir','trafic','gare']
    words_except_stop_dist = nltk.FreqDist(w for w in text1 if w not 
                                           in stopwords and w.isalpha() )
    return words_except_stop_dist



In [12]:

    
from collections import Counter
def gare_fq(output):
    gare_freq = Counter()
    for gare in gares:
        gare_freq[gare] = output.text.str.lower().str.contains(gare.lower()).sum()
    return gare_freq



In [13]:

    
# sometimes, cergy-le-haut, naterre may be due to their direction --> 
# result is true, many items are ignored entre XXX et XXX 
line_dict['RER_A'].text[line_dict['RER_A'].text.str.contains('Cergy-Le-Haut')].sample(10)









    Out[13]:





10569        18:02, le trafic est ralenti entre Cergy-Le-Haut et La Defense (Grande Arche) (panne de signalisation) #RATP #RERA
8849                 10:28, le trafic est interrompu entre Maisons-Laffitte et Cergy-Le-Haut (acte de malveillance) #RATP #RERA
5190                            18:16, le trafic est perturbé de Nation vers Cergy-Le-Haut (panne de signalisation) #RATP #RERA
9022         15:18, le trafic reprend progressivement entre Maisons-Laffitte et Cergy-Le-Haut (panne électriqueque) #RATP #RERA
11609                             09:35, le trafic est interrompu entre Cergy-Le-Haut et Cergy-S.C. (colis suspect) #RATP #RERA
10573              16:41, le trafic est ralenti entre Cergy-Le-Haut et Nanterre-Prefecture (panne de signalisation) #RATP #RERA
8088           20:48, le trafic est ralenti entre La Defense (Grande Arche) et Cergy-Le-Haut (voyageur sur la voie) #RATP #RERA
13584                   17:16, le trafic est interrompu entre Cergy-Le-Haut et Acheres-Ville (voyageur sur la voie) #RATP #RERA
13746    15:07, le trafic est ralenti entre La Defense (Grande Arche) et Cergy-Le-Haut (accident grave de voyageur) #RATP #RERA
18574       18:58, le trafic reprend progressivement entre Cergy-Le-Haut et Nanterre-Prefecture (incident voyageur) #RATP #RERA
Name: text, dtype: object

Check words_freq function on RER_B



In [14]:

    
## Now let's try RER B
output_b = line_dict['RER_B']
words_freq(output_b).most_common(20)









    Out[14]:





[('incident', 1417),
 ('perturbé', 1111),
 ('retour', 1044),
 ('terminé', 963),
 ('régulier', 921),
 ('voyageur', 861),
 ('aeroport', 688),
 ('colis', 678),
 ('suspect', 676),
 ('nord', 663),
 ('interrompu', 651),
 ('cdg', 631),
 ('panne', 555),
 ('ralenti', 476),
 ('mitry', 451),
 ('train', 446),
 ('robinson', 432),
 ('claye', 401),
 ('charles', 381),
 ('gaulle', 373)]

Check gare_fq function on RER_B



In [15]:

    
gare_fq(output_b).most_common(20)









    Out[15]:





[('Gare du Nord', 584),
 ('Robinson', 464),
 ('Palaiseau', 225),
 ('Les Halles', 213),
 ('Antony', 144),
 ('Bourg-la-Reine', 112),
 ('Mitry-Claye', 106),
 ('Châtelet', 98),
 ('Laplace', 90),
 ('Massy-Palaiseau', 84),
 ('Luxembourg', 83),
 ('Aulnay-sous-Bois', 65),
 ('Parc des Expositions', 60),
 ('Le Bourget', 57),
 ('Port Royal', 52),
 ('Cité', 41),
 ('La Plaine-Stade de France', 31),
 ('Orsay-Ville', 31),
 ('Denfert-Rochereau', 30),
 ('Sceaux', 29)]

Build incident_freq function



In [16]:

    
from collections import Counter
def incidient_reason(input):
    output = input
    incidents = ['malaise voyageur',"incident d'exploitation","incident technique",'Incident de signalisation',
                "colis suspect", "voyageur malade", "incident voyageur", 
                "divers incidents",'panne de signalisation','panne de matériel',
                'panne électrique','panne mécanique','panne de caténaire',
                "panne d'aiguillage",'panne matériel','panne éléctrique', 
                'panne sur un train','pannes de signalisation',"panne d'un train",
                "panne de train",'obstacle sur la voie', 'bagage abandonné','incident de passage', 
                 'accident de personne','feu aux abords','pb signalisation','acte de malveillance',
                 'jets de pierre','obstacle sur la voie','bagage oublié',
                 'personnes sur les voies','branche tombée','jet de projectile']
    incident_freq = Counter()
    for incident in incidents:
        incident_freq[incident] = output.text.str.lower().str.contains(incident.lower()).sum()
    return incident_freq

Test incident_reason function in RER_C



In [17]:

    
incidient_reason(line_dict['RER_C']).most_common()









    Out[17]:





[('bagage abandonné', 149),
 ('bagage oublié', 144),
 ('personnes sur les voies', 113),
 ("panne d'un train", 103),
 ('accident de personne', 78),
 ('malaise voyageur', 25),
 ('Incident de signalisation', 24),
 ('acte de malveillance', 24),
 ('obstacle sur la voie', 15),
 ('incident technique', 11),
 ('incident de passage', 9),
 ('feu aux abords', 8),
 ('branche tombée', 8),
 ('panne électrique', 6),
 ('jets de pierre', 5),
 ('panne de signalisation', 4),
 ('colis suspect', 3),
 ('panne sur un train', 3),
 ('pb signalisation', 3),
 ('jet de projectile', 3),
 ('incident voyageur', 2),
 ("panne d'aiguillage", 2),
 ('divers incidents', 1),
 ('panne de train', 1),
 ("incident d'exploitation", 0),
 ('voyageur malade', 0),
 ('panne de matériel', 0),
 ('panne mécanique', 0),
 ('panne de caténaire', 0),
 ('panne matériel', 0),
 ('panne éléctrique', 0),
 ('pannes de signalisation', 0)]

Consolidate all functions in one dashboard function



In [18]:

    
# what if we write a summary function 
def summary(input):
    output = input
    print()
    print ('The 20 most frequent words are: ')
    print(words_freq(output).most_common(20))
    print('\n')
    print('The 20 most frequent stations are: ')
    print(gare_fq(output).most_common(20))
    print('\n')
    print('The 20 most frequent reasons are: ')
    print(incidient_reason(output).most_common(20))
#summary(line_dict['RER_A'])

Merge all sub datasets into one



In [19]:

    
# concat all dataframe and clean data
def consol(data_dic):
    result = pd.DataFrame()
    for k, v in data_dic.items():
        result = pd.concat([result, v])
    result = result.sort_values(by='date')
    return result
df_consol = consol(line_dict)

How many tweets per Ligne/RER



In [20]:

    
# overall tweets
df_consol.username.value_counts()









    Out[20]:





RER_A           8214
RERB            6586
Ligne4_RATP     4314
Ligne6_RATP     3439
Ligne13_RATP    3104
Ligne8_RATP     3013
RERD_SNCF       2967
Ligne1_RATP     2965
Ligne7_RATP     2851
Ligne5_RATP     2671
Ligne12_RATP    2616
Ligne9_RATP     2546
RERC_SNCF       2502
Ligne2_RATP     2221
Ligne3_RATP     2073
Ligne14_RATP    1786
RERE_SNCF       1567
Ligne10_RATP    1345
Ligne11_RATP    1285
Name: username, dtype: int64

Numbers of Tweets per day



In [21]:

    
date_tweets = df_consol.date.apply(lambda x: x.date()).value_counts()
date_tweets.iloc[:10]









    Out[21]:





2015-01-29    278
2016-12-06    166
2015-06-30    161
2014-12-31    154
2017-06-24    144
2016-11-29    142
2016-08-14    134
2016-05-25    133
2016-03-29    129
2016-12-07    120
Name: date, dtype: int64

Plot a chart visualize the data



In [22]:

    
from matplotlib import pyplot as plt
%matplotlib inline
date_tweets.plot()









    Out[22]:





<matplotlib.axes._subplots.AxesSubplot at 0xc9712e8>

Export date, username, tweets count per date & consolidated data



In [23]:

    
# export date, username, tweets count
df_consol['date_new'] =  df_consol.date.apply(lambda x: x.date())
df_consol.groupby(['date_new', 'username']).size().to_csv('output/tweets_date.csv')
df_consol.to_csv('output/consol.csv')

Consolidate the data by hour



In [24]:

    
df_consol['hour'] = df_consol.date.apply(lambda x: x.hour)
df_consol.groupby(['hour','username']).size().to_csv('output/date_hour.csv')
df_consol.sort_values(by='retweets',ascending = False).head()









    Out[24]:







  
    
      
      username
      date
      retweets
      favorites
      text
      geo
      mentions
      hashtags
      id
      permalink
      date_new
      hour
    
  
  
    
      5167
      RERB
      2016-08-20 00:15:00
      1100
      286
      Alerte enlèvement 19.08.2016 pic.twitter.com/Glqtxno04Q
      NaN
      NaN
      NaN
      7.667604e+17
      https://twitter.com/RERB/status/766760426562519040
      2016-08-20
      0
    
    
      630
      Ligne11_RATP
      2016-04-01 07:54:00
      667
      410
      Ds une volonté de modernisation, la #RATP modifie le nom de la station Télégraphe en #Tweet http://www. ratp.fr/fr/ratp/v_1490 39/metro-rer-la-ratp-renomme-13-stations/ … pic.twitter.com/W3x8Np6pR9
      NaN
      NaN
      #RATP #Tweet
      7.157795e+17
      https://twitter.com/Ligne11_RATP/status/715779457173364737
      2016-04-01
      7
    
    
      1157
      Ligne2_RATP
      2016-04-01 07:51:00
      470
      258
      2 ǝuƃıl ǝɹʇoʌ ɹns sɹǝʌuɐ uoıʇɐʇs ɐl ǝp ıpǝɹpuǝʌ ǝɔ uoıʇɐɹnƃnɐuı http://www. ratp.fr/fr/ratp/v_1490 39/metro-rer-la-ratp-renomme-13-stations/ … pic.twitter.com/ekm1oychl4
      NaN
      NaN
      NaN
      7.157786e+17
      https://twitter.com/Ligne2_RATP/status/715778636557783040
      2016-04-01
      7
    
    
      107
      RER_A
      2017-10-05 17:52:00
      429
      145
      1 incident électriqueque au niveau de la caténaire s'est produit à La Défense, bruyant, ms ss danger pour les voyageurs. Trafic perturbé #RERA
      NaN
      NaN
      #RERA
      9.159679e+17
      https://twitter.com/RER_A/status/915967946899562500
      2017-10-05
      17
    
    
      328
      Ligne4_RATP
      2017-08-01 15:10:00
      366
      192
      La station Château Rouge rouvre aujourd'hui avec une salle d'accueil + spacieuse, un nouvel accès... ! #Ligne4 http:// ow.ly/XQnv30e4FPy pic.twitter.com/DrcL2GM9K7
      NaN
      NaN
      #Ligne4
      8.923720e+17
      https://twitter.com/Ligne4_RATP/status/892371982330851329
      2017-08-01
      15

Export Tweets Number, Incident Reason, and Ligne/RER



In [25]:

    
df_incident = pd.DataFrame()
for k, v in line_dict.items():
    print(k,'\n')
    df_inter = pd.DataFrame.from_dict(incidient_reason(v).most_common())
    df_inter['username'] = k
    df_incident = pd.concat([df_incident, df_inter])
df_incident.sort_values(by=1, ascending = False).head()









    



RER_A 

RER_B 

RER_C 

RER_D 

RER_E 

Ligne1_RATP 

Ligne2_RATP 

Ligne3_RATP 

Ligne4_RATP 

Ligne5_RATP 

Ligne6_RATP 

Ligne7_RATP 

Ligne8_RATP 

Ligne9_RATP 

Ligne10_RATP 

Ligne11_RATP 

Ligne12_RATP 

Ligne13_RATP 

Ligne14_RATP 







    Out[25]:







  
    
      
      0
      1
      username
    
  
  
    
      0
      colis suspect
      1171
      RER_A
    
    
      0
      colis suspect
      699
      Ligne4_RATP
    
    
      0
      colis suspect
      676
      RER_B
    
    
      0
      incident technique
      476
      Ligne1_RATP
    
    
      1
      malaise voyageur
      471
      RER_A

Then we find out that a few incidents should be classified in a same group



In [26]:

    
df_incident['group'] = df_incident.iloc[:,0]
rep = {'bagage oublié':'bagage abandonné', 'colis suspect':'bagage abandonné',
       'voyageur malade':'malaise voyageur',
       'pb signalisation':'panne de signalisation', 'jets de pierre':'acte de malveillance',
       'jets de pierre':'acte de malveillance','jet de projectile':'acte de malveillance'}
df_incident.group = df_incident.group.replace(rep)
df_incident.loc[df_incident[0].str.contains('bagage', na=False)].head()
df_incident.to_csv('output/df_incident.csv')



In [27]:

    
df_incident.head()









    Out[27]:







  
    
      
      0
      1
      username
      group
    
  
  
    
      0
      colis suspect
      1171
      RER_A
      bagage abandonné
    
    
      1
      malaise voyageur
      471
      RER_A
      malaise voyageur
    
    
      2
      incident technique
      381
      RER_A
      incident technique
    
    
      3
      incident voyageur
      245
      RER_A
      incident voyageur
    
    
      4
      panne de matériel
      233
      RER_A
      panne de matériel

Build a table for Machie Learning Algorithm

Import temperature data.
Import which arrondissements are passed through by which line.



In [28]:

    
df_temp = pd.read_csv('data/temperature.csv')
df_temp.head()
fil = ['Date','T_avg','V_avg','W_avg', 'rain','fog','snow','Thunderstorms']
df_temp_fil = df_temp[fil]
df_temp_fil.head()









    Out[28]:







  
    
      
      Date
      T_avg
      V_avg
      W_avg
      rain
      fog
      snow
      Thunderstorms
    
  
  
    
      0
      2012/1/1
      13
      10
      19
      1
      0
      1
      0
    
    
      1
      2012/1/2
      8
      9
      18
      1
      1
      0
      0
    
    
      2
      2012/1/3
      9
      10
      24
      1
      0
      0
      0
    
    
      3
      2012/1/4
      8
      10
      19
      0
      0
      0
      0
    
    
      4
      2012/1/5
      10
      9
      32
      1
      0
      0
      0



In [29]:

    
# import which arrondissements are passed through by which line.
df_arr = pd.read_csv('data/data_arrondissement.csv')
df_arr.head()









    Out[29]:







  
    
      
      username
      1
      2
      3
      4
      5
      6
      7
      8
      9
      ...
      19
      20
      91
      92
      93
      94
      95
      77
      78
      45
    
  
  
    
      0
      Ligne1_RATP
      5
      0
      0
      4
      0
      0
      0
      5
      0
      ...
      0
      1
      0
      4
      0
      3
      0
      0
      0
      0
    
    
      1
      Ligne2_RATP
      0
      0
      0
      0
      0
      0
      0
      7
      4
      ...
      4
      7
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      Ligne3_RATP
      0
      5
      4
      0
      0
      0
      0
      3
      2
      ...
      0
      3
      0
      3
      1
      0
      0
      0
      0
      0
    
    
      3
      Ligne4_RATP
      3
      3
      2
      2
      1
      7
      0
      0
      1
      ...
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
    
    
      4
      Ligne5_RATP
      0
      0
      1
      1
      1
      0
      0
      0
      0
      ...
      5
      0
      0
      0
      4
      0
      0
      0
      0
      0
    
  

5 rows × 29 columns



In [30]:

    
df_traffic = pd.read_csv('data/traffic-16.csv')
df_traffic.head()









    Out[30]:







  
    
      
      username
      Traffic-2016
    
  
  
    
      0
      Ligne1_RATP
      217255582
    
    
      1
      Ligne2_RATP
      125819494
    
    
      2
      Ligne3_RATP
      153695350
    
    
      3
      Ligne4_RATP
      215968806
    
    
      4
      Ligne5_RATP
      185094089



In [31]:

    
# build another conslidated dataframe, and forecast the reason
def df_class(input):
    # list all reasons of incidents
    incidents = ['malaise voyageur',"incident d'exploitation","incident technique",
                 'Incident de signalisation',
                "colis suspect", "voyageur malade", "incident voyageur", 
                "divers incidents",'panne de signalisation','panne de matériel',
                'panne électrique','panne mécanique','panne de caténaire',
                "panne d'aiguillage",'panne matériel','panne éléctrique', 
                'panne sur un train','pannes de signalisation',"panne d'un train",
                "panne de train",'obstacle sur la voie', 'bagage abandonné','incident de passage', 
                 'accident de personne','feu aux abords','pb signalisation','acte de malveillance',
                 'jets de pierre','obstacle sur la voie','bagage oublié',
                 'personnes sur les voies','branche tombée','jet de projectile',
                'grave de voyageur','animal sur la voie','défaut électrique',
                 'fin tardive de chantier',"Défaut d'alimentation électrique"]
    
    # clean data
    output = clean_data(input)          
    output = input[input.text.str.contains('|'.join(incidents),na=False)]
    filt = "(malaise voyageur|incident d'exploitation|incident technique|Incident de signalisation|colis suspect|voyageur malade|incident voyageur|divers incidents|panne de signalisation|panne de matériel|panne électrique|panne mécanique|panne de caténaire|panne d'aiguillage|panne matériel|panne éléctrique|panne sur un train|pannes de signalisation|panne d'un train|panne de train|obstacle sur la voie|bagage abandonné|incident de passage|accident de personne|feu aux abords|pb signalisation|acte de malveillance|jets de pierre|obstacle sur la voie|bagage oublié|personnes sur les voies|branche tombée|jet de projectile|grave de voyageur|animal sur la voie|défaut électrique|fin tardive de chantier|Défaut d'alimentation électrique)"
    output['reason'] = output.text.str.extract(filt)
    filt2 = ['username','date_new','reason']
    
    # extract incident reasons and create a new column of "reasons"
    output = output[filt2]
    
    # create quarter, month, year columns
    output.date_new = pd.to_datetime(output.date_new)
    df_temp_fil.Date = pd.to_datetime(df_temp_fil.Date)
    
    #merge temperature data, arrondissements data and traffic data
    output = output.merge(right=df_temp_fil, how='inner', left_on='date_new', right_on='Date')
    output = output.merge(right=df_arr, how='inner', left_on='username', right_on='username')
    output = output.merge(right=df_traffic, how='inner', left_on='username', right_on='username')
    
    output['Quarter'] = output.date_new.apply(lambda x: pd.to_datetime(x).quarter)
    output['Month'] = output.date_new.apply(lambda x: pd.to_datetime(x).day)
    output['Year'] = output.date_new.apply(lambda x: pd.to_datetime(x).year)
    output = output.drop(['date_new','Date'], axis=1)
    
    # standardize all incident reasons
    rep = {'bagage oublié':'bagage abandonné', 'colis suspect':'bagage abandonné',
       'voyageur malade':'malaise voyageur',  "Défaut d'alimentation électrique":'panne électrique',
        "panne d'un train":'panne de train','grave de voyageur':'incident voyageur',
           'Incident de signalisation':'pannes de signalisation',
           'panne de matériel':'panne matériel',
           'panne sur un train':'panne de train',
           'pb signalisation':'panne de signalisation', 'jets de pierre':'acte de malveillance',
       'jets de pierre':'acte de malveillance','jet de projectile':'acte de malveillance',
          'accident de personne':'incident voyageur','malaise voyageur':'incident voyageur',
           'pannes de signalisation':'panne de signalisation'}
    output.reason = output.reason.replace(rep)    
    
    # some rows from df_temp_fil contains '-' items
    output = output[output.T_avg != '-']
    output = output[output.V_avg != '-']   
    return output

Remember that we built a consolidated before, let's review this dataframe.



In [32]:

    
df_consol.head()









    Out[32]:







  
    
      
      username
      date
      retweets
      favorites
      text
      geo
      mentions
      hashtags
      id
      permalink
      date_new
      hour
    
  
  
    
      5754
      Ligne6_RATP
      2014-01-01 03:09:00
      1
      0
      Fin de la manifestation impactant la ligne #Ligne6 .
      NaN
      NaN
      #Ligne6
      4.182022e+17
      https://twitter.com/Ligne6_RATP/status/418202249354289152
      2014-01-01
      3
    
    
      4606
      Ligne9_RATP
      2014-01-01 03:09:00
      0
      0
      03:09, Le trafic est de nouveau normal sur l'ensemble de la ligne #Ligne9 . #RATP
      NaN
      NaN
      #Ligne9 #RATP
      4.182023e+17
      https://twitter.com/Ligne9_RATP/status/418202256325238784
      2014-01-01
      3
    
    
      4605
      Ligne9_RATP
      2014-01-01 03:09:00
      0
      0
      Fin de la manifestation impactant la ligne #Ligne9 .
      NaN
      NaN
      #Ligne9
      4.182023e+17
      https://twitter.com/Ligne9_RATP/status/418202266920050688
      2014-01-01
      3
    
    
      5049
      Ligne8_RATP
      2014-01-01 06:52:00
      1
      0
      06:49, le trafic est interrompu entre Maisons-Alfort-Les Juilliottes et Pointe du Lac (voyageur sur la voie) #RATP #Ligne8
      NaN
      NaN
      #RATP #Ligne8
      4.182584e+17
      https://twitter.com/Ligne8_RATP/status/418258368995024897
      2014-01-01
      6
    
    
      5048
      Ligne8_RATP
      2014-01-01 07:07:00
      0
      0
      07:05, le trafic reprend progressivement sur la ligne (voyageur sur la voie) #RATP #Ligne8
      NaN
      NaN
      #RATP #Ligne8
      4.182621e+17
      https://twitter.com/Ligne8_RATP/status/418262143814090752
      2014-01-01
      7

Split the dataset into X_train, X_test, y_train, y_test



In [33]:

    
df_class(df_consol).drop('reason', axis=1).sample(5)









    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
D:\Users\ThinkPad\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value






    Out[33]:







  
    
      
      username
      T_avg
      V_avg
      W_avg
      rain
      fog
      snow
      Thunderstorms
      1
      2
      ...
      93
      94
      95
      77
      78
      45
      Traffic-2016
      Quarter
      Month
      Year
    
  
  
    
      505
      Ligne14_RATP
      16
      11
      23
      0
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      2
      7
      2017
    
    
      454
      Ligne14_RATP
      5
      10
      8
      1
      1
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      10
      2017
    
    
      260
      Ligne14_RATP
      12
      11
      6
      0
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      4
      23
      2015
    
    
      242
      Ligne14_RATP
      13
      8
      8
      1
      1
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      3
      13
      2015
    
    
      66
      Ligne14_RATP
      20
      12
      5
      0
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      3
      5
      2014
    
  

5 rows × 40 columns



In [34]:

    
# let's run classification
from sklearn.model_selection import train_test_split
X = df_class(df_consol).drop('reason', axis=1)









    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
D:\Users\ThinkPad\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Check X data



In [35]:

    
X.head()









    Out[35]:







  
    
      
      username
      T_avg
      V_avg
      W_avg
      rain
      fog
      snow
      Thunderstorms
      1
      2
      ...
      93
      94
      95
      77
      78
      45
      Traffic-2016
      Quarter
      Month
      Year
    
  
  
    
      0
      Ligne14_RATP
      10
      11
      18
      1
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      9
      2014
    
    
      1
      Ligne14_RATP
      10
      11
      18
      1
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      9
      2014
    
    
      2
      Ligne14_RATP
      10
      11
      18
      1
      0
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      9
      2014
    
    
      3
      Ligne14_RATP
      2
      5
      8
      0
      1
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      12
      2014
    
    
      4
      Ligne14_RATP
      2
      5
      8
      0
      1
      0
      0
      2
      0
      ...
      0
      0
      0
      0
      0
      0
      143760624
      1
      12
      2014
    
  

5 rows × 40 columns



In [36]:

    
# convert all data into numeric values and scale the data
X.T_avg = pd.to_numeric(X.T_avg)
X.V_avg = pd.to_numeric(X.V_avg)
X.W_avg = pd.to_numeric(X.W_avg)
X = pd.get_dummies(X)
y = df_class(df_consol).reason
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)









    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
D:\Users\ThinkPad\Anaconda3\lib\site-packages\ipykernel_launcher.py:22: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
D:\Users\ThinkPad\Anaconda3\lib\site-packages\pandas\core\generic.py:3643: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Run different models

Decision Tree
Random Forest
KNN
Linear SVC
Logistic Regression



In [38]:

    
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import auc, roc_auc_score, accuracy_score, f1_score

tree = DecisionTreeClassifier()
tree.fit(X_train_scaled, y_train)
y_pred = tree.predict(X_test_scaled)
acc_tree = accuracy_score(y_test, y_pred)
f1_tree = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(accuracy_score(y_test, y_pred)))
print('F1 score is {}'.format(f1_score(y_test, y_pred, average = 'weighted')))









    



Accuracy is 0.7482014388489209
F1 score is 0.7424348602611966






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [40]:

    
from sklearn.metrics import confusion_matrix
y_predicted = tree.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_predicted)
df_cm = pd.DataFrame(confusion)
#sns.set(font_scale=1.4)#for label size
plt.figure(figsize = (10,7))
sns.heatmap(df_cm)# font size









    Out[40]:





<matplotlib.axes._subplots.AxesSubplot at 0xe9f3898>



In [41]:

    
# run knn
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

plt.figure()
scores = []
for n in range(1,50,20):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train_scaled, y_train)
    scores.append(knn.score(X_test_scaled, y_test))
plt.plot(range(1,50,20), scores)
plt.title('KNN Accuracy curve')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()  
acc_knn = max(scores)
n_knn = list(range(1,50,20))[scores.index(max(scores))]

# 
knn = KNeighborsClassifier(n_neighbors=n_knn)
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)
f1_knn = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(acc_knn))
print('Best parameter is n = {}'.format(n_knn))
print('F1 score is {}'.format(f1_knn))









    












    



Accuracy is 0.762589928057554
Best parameter is n = 1
F1 score is 0.7584480870931942






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1137: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)



In [42]:

    
# performance --> confusion matrix

from sklearn.metrics import confusion_matrix
y_predicted = knn.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_predicted)
import seaborn as sns
import matplotlib.pyplot as plt

df_cm = pd.DataFrame(confusion)
#sns.set(font_scale=1.4)#for label size
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, cmap="YlGnBu")# font size









    Out[42]:





<matplotlib.axes._subplots.AxesSubplot at 0xeaf7390>



In [43]:

    
# Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, accuracy_score, f1_score
scores = []
for n in range(1,200,20):    
    forest = RandomForestClassifier(n_estimators=n)
    forest.fit(X_train_scaled, y_train)
    y_pred = forest.predict(X_test_scaled)
    scores.append(accuracy_score(y_test, y_pred))
    #print(score)
plt.plot(range(1,200,20), scores)
plt.title('Random Forest accuracy curve')
plt.xlabel('n_estimators')
plt.ylabel('Accuracy score')
plt.show()  

acc_forest = max(scores)
n_forest = list(range(1,200,20))[scores.index(max(scores))]
# 
forest = RandomForestClassifier(n_estimators=n_forest)
forest.fit(X_train_scaled, y_train)
y_pred = forest.predict(X_test_scaled)
f1_forest = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(acc_forest))
print('Best parameter is n = {}'.format(n_forest))
print('F1 score is {}'.format(f1_forest))









    












    



Accuracy is 0.8057553956834532
Best parameter is n = 101
F1 score is 0.7835634563110526






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [44]:

    
from sklearn.metrics import confusion_matrix
y_predicted = forest.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_predicted)
df_cm = pd.DataFrame(confusion)
#sns.set(font_scale=1.4)#for label size
plt.figure(figsize = (10,7))
sns.heatmap(df_cm)# font size









    Out[44]:





<matplotlib.axes._subplots.AxesSubplot at 0xeb7c0f0>



In [45]:

    
from sklearn.ensemble import AdaBoostClassifier
scores = []
for n in [1,200,500,1000]:    
    boost = AdaBoostClassifier(n_estimators=n)
    boost.fit(X_train_scaled, y_train)
    y_pred = boost.predict(X_test_scaled)
    scores.append(accuracy_score(y_test, y_pred))
    #print(score)
plt.plot([1,200,500,1000], scores)
plt.title('AdaBoost accuracy curve')
plt.xlabel('n_estimators')
plt.ylabel('Accuracy score')
plt.show()  

acc_boost = max(scores)
n_boost = list(range(1,200,20))[scores.index(max(scores))]
# 
boost = AdaBoostClassifier(n_estimators=n_boost)
boost.fit(X_train_scaled, y_train)
y_pred = boost.predict(X_test_scaled)
f1_boost = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(acc_boost))
print('Best parameter is n = {}'.format(n_boost))
print('F1 score is {}'.format(f1_boost))









    












    



Accuracy is 0.35251798561151076
Best parameter is n = 1
F1 score is 0.1837593754783407






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [46]:

    
from sklearn.svm import LinearSVC
import numpy as np
scores = []
rng = [1,10,50,70,100]
for c in rng:    
    l_svc = LinearSVC(C=c)
    l_svc.fit(X_train_scaled, y_train)
    y_pred = l_svc.predict(X_test_scaled)
    scores.append(accuracy_score(y_test, y_pred))
    #print(score)
plt.plot(rng, scores)
plt.title('Linear SVC')
plt.xlabel('C')
plt.ylabel('Accuracy score')
plt.show()  
acc_svc = max(scores)
c_svc = rng[scores.index(max(scores))]
# 
l_svc = LinearSVC(C=c_svc)
l_svc.fit(X_train_scaled, y_train)
y_pred = l_svc.predict(X_test_scaled)
f1_svc = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(acc_svc))
print('Best parameter is c = {}'.format(c_svc))
print('F1 score is {}'.format(f1_svc))









    












    



Accuracy is 0.381294964028777
Best parameter is c = 70
F1 score is 0.29843931698891524






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [47]:

    
l_svc = LinearSVC(C=c_svc)
l_svc.fit(X_train_scaled, y_train)
y_predicted = l_svc.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_predicted)
df_cm = pd.DataFrame(confusion)
#sns.set(font_scale=1.4)#for label size
plt.figure(figsize = (10,7))
sns.heatmap(df_cm)# font size









    Out[47]:





<matplotlib.axes._subplots.AxesSubplot at 0xeaa1eb8>



In [48]:

    
from sklearn.linear_model import LogisticRegression
import numpy as np
scores = []
rng = [0.1,1,3,5,10,15]
for c in rng:    
    lr = LogisticRegression(C=c)
    lr.fit(X_train_scaled, y_train)
    y_pred = lr.predict(X_test_scaled)
    scores.append(accuracy_score(y_test, y_pred))
    #print(score)
plt.plot(rng, scores)
plt.title('Logistic Regression')
plt.xlabel('C')
plt.ylabel('Accuracy score')
plt.show()  
acc_lr = max(scores)
c_lr = rng[scores.index(max(scores))]
# 
lr = LinearSVC(C=c_svc)
lr.fit(X_train_scaled, y_train)
y_pred = lr.predict(X_test_scaled)
f1_lr = f1_score(y_test, y_pred, average = 'weighted')
print('Accuracy is {}'.format(acc_lr))
print('Best parameter is c = {}'.format(c_lr))
print('F1 score is {}'.format(f1_lr))









    












    



Accuracy is 0.38848920863309355
Best parameter is c = 0.1
F1 score is 0.27762407982021764






    



D:\Users\ThinkPad\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)



In [49]:

    
y_predicted = lr.predict(X_test_scaled)
confusion = confusion_matrix(y_test, y_predicted)
df_cm = pd.DataFrame(confusion)
#sns.set(font_scale=1.4)#for label size
plt.figure(figsize = (10,7))
sns.heatmap(df_cm)# font size









    Out[49]:





<matplotlib.axes._subplots.AxesSubplot at 0x10b4f0f0>



In [50]:

    
models = pd.DataFrame({
    'Model': ['Linear SVC', 'KNN', 'Random Forest', 'AdaBoost', 
              'Logistic Regression', 'Decision Tree'],
    'Score': [acc_svc,acc_knn, acc_forest, acc_boost, 
              acc_lr, acc_tree],
    'F1 Score':[f1_svc, f1_knn, f1_forest, f1_boost, 
               f1_lr, f1_tree]})

models.sort_values(by='Score', ascending=False)









    Out[50]:







  
    
      
      F1 Score
      Model
      Score
    
  
  
    
      2
      0.783563
      Random Forest
      0.805755
    
    
      1
      0.758448
      KNN
      0.762590
    
    
      5
      0.742435
      Decision Tree
      0.748201
    
    
      4
      0.277624
      Logistic Regression
      0.388489
    
    
      0
      0.298439
      Linear SVC
      0.381295
    
    
      3
      0.183759
      AdaBoost
      0.352518

	username	date	retweets	favorites	text	geo	mentions	hashtags	id	permalink
2987	RER_A	2017-05-24 12:41	1996	416	Ce chien a été retrouvé ce matin, vers 7h, sur...	NaN	NaN	#RERA	8.673298e+17	https://twitter.com/RER_A/status/8673297646919...
107	RER_A	2017-10-05 17:52	429	145	1 incident électrique au niveau de la caténair...	NaN	NaN	#RERA	9.159679e+17	https://twitter.com/RER_A/status/9159679468995...
784	RER_A	2017-08-28 12:02	342	409	[ #TravauxEteRERA ] Les travaux sont terminés ...	NaN	NaN	#TravauxEteRERA #RERA	9.021090e+17	https://twitter.com/RER_A/status/9021090016257...
16061	RER_A	2015-01-29 06:23	306	19	Arrêt de travail spontané en raison de l'agres...	NaN	NaN	#RERA	5.606696e+17	https://twitter.com/RER_A/status/5606696327600...
16046	RER_A	2015-01-29 08:01	295	26	Depuis 7h52 fermeture des gares #RERA #RATP sa...	NaN	NaN	#RERA #RATP	5.606942e+17	https://twitter.com/RER_A/status/5606941607020...

	username	date	retweets	favorites	text	geo	mentions	hashtags	id	permalink
0	Ligne11_RATP	2017-10-10 12:19	0	0	Bonjour, la #Ligne11 fonctionne correctement, ...	NaN	NaN	#Ligne11	9.176961e+17	https://twitter.com/Ligne11_RATP/status/917696...
1	Ligne11_RATP	2017-10-09 14:22	0	1	Demain, le trafic sera normal sur votre #ligne...	NaN	NaN	#ligne11 #RATP #greve10octobre	9.173647e+17	https://twitter.com/Ligne11_RATP/status/917364...
2	Ligne11_RATP	2017-10-08 12:58	0	1	Retour à un trafic régulier sur l'ensemble de ...	NaN	NaN	#Ligne11 #RATP	9.169811e+17	https://twitter.com/Ligne11_RATP/status/916981...
3	Ligne11_RATP	2017-10-08 12:51	1	0	12:51, Retour à un trafic régulier sur l'ensem...	NaN	NaN	#Ligne11 #RATP	9.169795e+17	https://twitter.com/Ligne11_RATP/status/916979...
4	Ligne11_RATP	2017-10-08 12:45	1	0	12:43, le trafic reprend progressivement (coli...	NaN	NaN	#RATP #Ligne11	9.169778e+17	https://twitter.com/Ligne11_RATP/status/916977...

	username	date	retweets	favorites	text	geo	mentions	hashtags	id	permalink
0	RERB	2017-10-12 19:16	0	0	Toutes les missions sont rendues omnibus. Bonn...	NaN	NaN	NaN	9.185259e+17	https://twitter.com/RERB/status/91852587036885...
1	RERB	2017-10-12 18:51	1	0	Bonsoir, le trafic reprend progressivement mai...	NaN	NaN	NaN	9.185196e+17	https://twitter.com/RERB/status/91851956429560...
2	RERB	2017-10-12 18:45	1	0	Il est également possible de prendre un train ...	NaN	NaN	NaN	9.185180e+17	https://twitter.com/RERB/status/91851797897883...
3	RERB	2017-10-12 18:40	0	0	Bonsoir, oui des trains circulent au départ d'...	NaN	NaN	NaN	9.185168e+17	https://twitter.com/RERB/status/91851677975599...
4	RERB	2017-10-12 18:36	0	1	Bonsoir, l'incident est terminé, le trafic rep...	NaN	NaN	NaN	9.185157e+17	https://twitter.com/RERB/status/91851566784590...

	username	date	retweets	favorites	text	geo	mentions	hashtags	id	permalink	date_new	hour
5167	RERB	2016-08-20 00:15:00	1100	286	Alerte enlèvement 19.08.2016 pic.twitter.com/Glqtxno04Q	NaN	NaN	NaN	7.667604e+17	https://twitter.com/RERB/status/766760426562519040	2016-08-20	0
630	Ligne11_RATP	2016-04-01 07:54:00	667	410	Ds une volonté de modernisation, la #RATP modifie le nom de la station Télégraphe en #Tweet http://www. ratp.fr/fr/ratp/v_1490 39/metro-rer-la-ratp-renomme-13-stations/ … pic.twitter.com/W3x8Np6pR9	NaN	NaN	#RATP #Tweet	7.157795e+17	https://twitter.com/Ligne11_RATP/status/715779457173364737	2016-04-01	7
1157	Ligne2_RATP	2016-04-01 07:51:00	470	258	2 ǝuƃıl ǝɹʇoʌ ɹns sɹǝʌuɐ uoıʇɐʇs ɐl ǝp ıpǝɹpuǝʌ ǝɔ uoıʇɐɹnƃnɐuı http://www. ratp.fr/fr/ratp/v_1490 39/metro-rer-la-ratp-renomme-13-stations/ … pic.twitter.com/ekm1oychl4	NaN	NaN	NaN	7.157786e+17	https://twitter.com/Ligne2_RATP/status/715778636557783040	2016-04-01	7
107	RER_A	2017-10-05 17:52:00	429	145	1 incident électriqueque au niveau de la caténaire s'est produit à La Défense, bruyant, ms ss danger pour les voyageurs. Trafic perturbé #RERA	NaN	NaN	#RERA	9.159679e+17	https://twitter.com/RER_A/status/915967946899562500	2017-10-05	17
328	Ligne4_RATP	2017-08-01 15:10:00	366	192	La station Château Rouge rouvre aujourd'hui avec une salle d'accueil + spacieuse, un nouvel accès... ! #Ligne4 http:// ow.ly/XQnv30e4FPy pic.twitter.com/DrcL2GM9K7	NaN	NaN	#Ligne4	8.923720e+17	https://twitter.com/Ligne4_RATP/status/892371982330851329	2017-08-01	15

	0	1	username
0	colis suspect	1171	RER_A
0	colis suspect	699	Ligne4_RATP
0	colis suspect	676	RER_B
0	incident technique	476	Ligne1_RATP
1	malaise voyageur	471	RER_A

	Date	T_avg	V_avg	W_avg	rain	fog	snow
0	2012/1/1	13	10	19	1	0	1
1	2012/1/2	8	9	18	1	1	0
2	2012/1/3	9	10	24	1	0	0
3	2012/1/4	8	10	19	0	0	0
4	2012/1/5	10	9	32	1	0	0

	username	1	2	3	4	5	6	8	9	...	19	20	92	93	94
0	Ligne1_RATP	5	0	0	4	0	0	5	0	...	0	1	4	0	3
1	Ligne2_RATP	0	0	0	0	0	0	7	4	...	4	7	0	0	0
2	Ligne3_RATP	0	5	4	0	0	0	3	2	...	0	3	3	1	0
3	Ligne4_RATP	3	3	2	2	1	7	0	1	...	0	0	1	0	0
4	Ligne5_RATP	0	0	1	1	1	0	0	0	...	5	0	0	4	0

	username	Traffic-2016
0	Ligne1_RATP	217255582
1	Ligne2_RATP	125819494
2	Ligne3_RATP	153695350
3	Ligne4_RATP	215968806
4	Ligne5_RATP	185094089

	username	date	retweets	text	geo	mentions	hashtags	id	permalink	date_new	hour
5754	Ligne6_RATP	2014-01-01 03:09:00	1	Fin de la manifestation impactant la ligne #Ligne6 .	NaN	NaN	#Ligne6	4.182022e+17	https://twitter.com/Ligne6_RATP/status/418202249354289152	2014-01-01	3
4606	Ligne9_RATP	2014-01-01 03:09:00	0	03:09, Le trafic est de nouveau normal sur l'ensemble de la ligne #Ligne9 . #RATP	NaN	NaN	#Ligne9 #RATP	4.182023e+17	https://twitter.com/Ligne9_RATP/status/418202256325238784	2014-01-01	3
4605	Ligne9_RATP	2014-01-01 03:09:00	0	Fin de la manifestation impactant la ligne #Ligne9 .	NaN	NaN	#Ligne9	4.182023e+17	https://twitter.com/Ligne9_RATP/status/418202266920050688	2014-01-01	3
5049	Ligne8_RATP	2014-01-01 06:52:00	1	06:49, le trafic est interrompu entre Maisons-Alfort-Les Juilliottes et Pointe du Lac (voyageur sur la voie) #RATP #Ligne8	NaN	NaN	#RATP #Ligne8	4.182584e+17	https://twitter.com/Ligne8_RATP/status/418258368995024897	2014-01-01	6
5048	Ligne8_RATP	2014-01-01 07:07:00	0	07:05, le trafic reprend progressivement sur la ligne (voyageur sur la voie) #RATP #Ligne8	NaN	NaN	#RATP #Ligne8	4.182621e+17	https://twitter.com/Ligne8_RATP/status/418262143814090752	2014-01-01	7

	username	T_avg	V_avg	W_avg	rain	fog	1	...	Traffic-2016	Quarter	Month	Year
505	Ligne14_RATP	16	11	23	0	0	2	...	143760624	2	7	2017
454	Ligne14_RATP	5	10	8	1	1	2	...	143760624	1	10	2017
260	Ligne14_RATP	12	11	6	0	0	2	...	143760624	4	23	2015
242	Ligne14_RATP	13	8	8	1	1	2	...	143760624	3	13	2015
66	Ligne14_RATP	20	12	5	0	0	2	...	143760624	3	5	2014

	F1 Score	Model	Score
2	0.783563	Random Forest	0.805755
1	0.758448	KNN	0.762590
5	0.742435	Decision Tree	0.748201
4	0.277624	Logistic Regression	0.388489
0	0.298439	Linear SVC	0.381295
3	0.183759	AdaBoost	0.352518