In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('passagierfrequenz.csv', delimiter=';')

Data set 1: Passengers frequence in the Swiss railway stations

Source and documentation: http://data.sbb.ch/explore/dataset/passagierfrequenz/

  • DTV = Durchschnittlicher täglicher Verkehr (Montag bis Sonntag) = average daily circulation (including the weekend)

  • DWV = Durchschnittlicher werktäglicher Verkehr (Montag bis Freitag) = average daily circulation Mo-Friday


In [3]:
print("Q1: Which Swiss railway station is the most frequented?")
print("A: The most frequented station is Zürich HB:")
df[['Station', 'DTV']].sort_values(by='DTV', ascending=False).head(1)


Q1: Which Swiss railway station is the most frequented?
A: The most frequented station is Zürich HB:
Out[3]:
Station DTV
97 Zürich HB 396300

In [4]:
print("Q2: Which stations have a higher average daily circulation on Saturday and Sunday?")
print("A: These 21 stations:")
df[df['DTV'] > df['DWV']]


Q2: Which stations have a higher average daily circulation on Saturday and Sunday?
A: These 21 stations:
Out[4]:
Code Station Year DTV DWV Owner Comments geopos
12 NOI Noiraigue 2014 350 330 CFF NaN 46.9549132858, 6.72179933612
42 LQ Landquart 2014 12400 12300 SBB NaN 46.9674389764, 9.55402859224
68 GOE Göschenen 2014 2100 2000 SBB NaN 46.6658163572, 8.58879659143
103 AI Airolo 2014 540 520 FFS NaN 46.5275264817, 8.6086051345
106 BRUE Brünig-Hasliberg 2014 520 470 ZB NaN 46.7577632267, 8.13864485952
108 CDM Champ-du-Moulin 2014 80 60 CFF NaN 46.95990299, 6.77498412319
139 EPS Epesses 2014 130 120 CFF NaN 46.4893183671, 6.74516355201
154 IO Interlaken Ost 2014 10800 10500 ZB Passagierfrequenzen: ohne BOB. 46.6905054601, 7.86900513744
264 NRRB Niederrickenbach LSE 2014 60 50 ZB NaN 46.9278346971, 8.39619411198
288 EBG Engelberg 2014 1800 1600 ZB NaN 46.8195231702, 8.40264298232
300 AST Alpnachstad 2014 460 410 ZB Passagierfrequenzen: ohne PB. 46.9550392879, 8.2778329759
363 LINB Linthal Braunwaldbahn 2014 300 250 SBB NaN 46.928393834, 9.00248305864
396 SLF Schloss Laufen am Rheinfall 2014 370 320 SBB NaN 47.6765598478, 8.61424692042
454 GEAP Genève-Aéroport 2014 13700 13500 CFF NaN 46.2325293452, 6.11198571992
478 LIG Ligerz 2014 370 340 SBB NaN 47.0837604425, 7.13514479553
630 TWN Twann 2014 550 500 SBB NaN 47.0936555927, 7.15649171643
639 VI Visp 2014 18800 18400 SBB NaN 46.2940248591, 7.88145816487
663 UNT Unterterzen 2014 370 310 SBB NaN 47.1138559734, 9.25500777055
665 STSA St-Saphorin 2014 140 120 CFF NaN 46.4725855624, 6.7970306112
694 LZVH Luzern-Verkehrshaus 2014 560 550 SBB NaN 47.053927868, 8.33728023957
712 RIV Rivaz 2014 150 140 CFF NaN 46.4741966711, 6.78458954595

In [5]:
print("Q3: Print a comma-separated list of all the comments in the Comments column. Escape them with the “\"” character and don't include any empty cell.")
comments_list = df[df['Comments'] == df['Comments']]['Comments'].tolist()
print("A: The comments are:", '"' + str.join('","',comments_list)  + '".')


Q3: Print a comma-separated list of all the comments in the Comments column. Escape them with the “"” character and don't include any empty cell.
A: The comments are: "Passagierfrequenzen: ohne RB.","Passagierfrequenzen: ohne MBC.","Passagierfrequenzen: ohne NstCM.","Passagierfrequenzen: inklusive Bahnverkehr MVR.","Passagierfrequenzen: ohne TRAVYS.","Passagierfrequenzen: ohne FW.","Passagierfrequenzen: ohne TPF.","Passagierfrequenzen: ohne SNCF.","Passagierfrequenzen: ohne RBS.","Passagierfrequenzen: ohne TPF.","Umfasst auch ZLOE, ZMUS und ZUSZ; Passagierfrequenzen: ohne Uetlibergbahn der SZU.","Passagierfrequenzen: ohne BDWM.","Passagierfrequenzen: ohne CJ und TRN.","Passagierfrequenzen: ohne BDWM.","Passagierfrequenzen: ohne BOB.","Passagierfrequenzen: ohne TRAVYS.","Passagierfrequenzen: inklusive FW.","Passagierfrequenzen: ohne ASM und RBS.","Passagierfrequenzen: ohne TPF.","Passagierfrequenzen: ohne TPF.","Passagierfrequenzen: ohne TSOL.","Passagierfrequenzen: ohne PB.","Passagierfrequenzen: ohne WSB.","Passagierfrequenzen: ohne TPC.","Passagierfrequenzen: ohne TRN.","Passagierfrequenzen: ohne FLP.","Passagierfrequenzen: ohne TPF.","Passagierfrequenzen: ohne ASM.","Passagierfrequenzen: ohne BRB.","Passagierfrequenzen: ohne TPC.","Passagierfrequenzen: ohne Regionalverkehr ÖBB.","Passagierfrequenzen: ohne CJ.","Passagierfrequenzen: ohne SNCF.","Passagierfrequenzen: ohne FART.","Passagierfrequenzen: ohne WB.","Passagierfrequenzen: ohne FB.","Passagierfrequenzen: inklusive Bahnverkehr MVR.","Passagierfrequenzen: ohne TRN.","Passagierfrequenzen: ohne CJ.","Passagierfrequenzen: ohne DB.","Passagierfrequenzen: ohne RBS".

In [6]:
print("Q4: How many rows contains another year than 2014?")
print("A: I counted", len(df[df['Year'] != 2014]), "rows containing another year than 2014.")


Q4: How many rows contains another year than 2014?
A: I counted 0 rows containing another year than 2014.

In [7]:
print("Q5: What is the size (rows, columns) of the data?")
print("A: There is", df.shape[0], "rows and", df.shape[1], "columns.")


Q5: What is the size (rows, columns) of the data?
A: There is 724 rows and 8 columns.

In [8]:
df[df['Station'] == 'Zürich HB']


Out[8]:
Code Station Year DTV DWV Owner Comments geopos
97 ZUE Zürich HB 2014 396300 441400 SBB Umfasst auch ZLOE, ZMUS und ZUSZ; Passagierfre... 47.3781765756, 8.54019221036

In [9]:
print("Q6: How many stations have a name starting with A?")
import re


a_stations = df[df['Station'].str.match('^A')]
print("A: There is", len(a_stations), "“A stations”. Here they are:")
a_stations


Q6: How many stations have a name starting with A?
A: There is 28 “A stations”. Here they are:
Out[9]:
Code Station Year DTV DWV Owner Comments geopos
0 GD Arth-Goldau 2014 12600 13100 SBB Passagierfrequenzen: ohne RB. 47.04915622, 8.54949116322
1 ABO Aarburg-Oftringen 2014 2000 2500 SBB NaN 47.3202667174, 7.90820373354
3 ALTD Altendorf 2014 800 940 SBB NaN 47.193966384, 8.82288898291
4 AUV Auvernier 2014 380 410 CFF NaN 46.9796312118, 6.87770603996
5 AV Avenches 2014 640 760 CFF NaN 46.8845963101, 7.04091658825
100 ALN Altnau 2014 400 410 SBB NaN 47.6213657537, 9.26589677408
101 AD Aadorf 2014 1700 2000 SBB NaN 47.4881178542, 8.90328450849
102 AE Aesch 2014 2100 2400 SBB NaN 47.4677359724, 7.60305476436
103 AI Airolo 2014 540 520 FFS NaN 46.5275264817, 8.6086051345
124 ARBS Arbon Seemossriet 2014 430 460 SBB NaN 47.5215190263, 9.42429272006
131 AUSG Au SG 2014 290 330 SBB NaN 47.4361134139, 9.64124454757
132 AX Arnex 2014 200 250 CFF NaN 46.6980662731, 6.51890326658
133 AU Au ZH 2014 1100 1200 SBB NaN 47.2468165268, 8.6437080619
200 AN Andelfingen 2014 2100 2500 SBB NaN 47.5937589886, 8.67797953564
201 ALL Allaman 2014 3000 3500 CFF NaN 46.4757395273, 6.39970400408
202 ARB Arbon 2014 1600 1700 SBB NaN 47.5105854479, 9.43333858591
203 ARD Ardon 2014 340 420 CFF NaN 46.2085842921, 7.27094418819
204 AL Altdorf 2014 410 440 SBB NaN 46.8757407268, 8.63156572016
300 AST Alpnachstad 2014 460 410 ZB Passagierfrequenzen: ohne PB. 46.9550392879, 8.2778329759
301 AA Aarau 2014 35900 42000 SBB Passagierfrequenzen: ohne WSB. 47.3913553369, 8.05125354274
302 AIG Aigle 2014 7000 7700 CFF Passagierfrequenzen: ohne TPC. 46.3168441199, 6.9636800936
303 AAT Aathal 2014 770 860 SBB NaN 47.3359563788, 8.76561022548
351 ARN Arnegg 2014 210 240 SBB NaN 47.4420020412, 9.25200601938
400 AF Affoltern am Albis 2014 7500 8700 SBB NaN 47.276064074, 8.44658239143
401 ALT Altstätten SG 2014 2100 2400 SBB NaN 47.3742291734, 9.55651096481
403 ADF Alpnach Dorf 2014 1400 1600 ZB NaN 46.9406471179, 8.27524412618
404 AP Ambri-Piotta 2014 50 50 FFS NaN 46.510616719, 8.69015147937
405 AW Amriswil 2014 3000 3300 SBB NaN 47.5504471564, 9.30221759023

In [10]:
print("Q7: Which are the least frequented stations during the work days? And the full week?")
print("A(a): During the work days:")
df[['Station', 'DWV']].sort_values(by='DWV').head(10)


Q7: Which are the least frequented stations during the work days? And the full week?
A(a): During the work days:
Out[10]:
Station DWV
723 Oron 50
353 Bodio 50
651 Ranzo-S. Abbondio 50
404 Ambri-Piotta 50
443 Ebligen 50
450 Lavorgo 50
77 Oppikon 50
348 Bressonnaz 50
525 Siviriez 50
628 Vaumarcus 50

In [11]:
print("A(b): During the full week:")
df[['Station', 'DTV']].sort_values(by='DTV').head(10)


A(b): During the full week:
Out[11]:
Station DTV
723 Oron 50
572 Roche VD 50
562 Pontenet 50
542 Sommerau 50
536 Trey 50
525 Siviriez 50
617 S. Nazzaro 50
70 Henniez 50
450 Lavorgo 50
443 Ebligen 50

In [12]:
print("Q8: Take the most frequented and the least frequented stations. How many times more passengers has the most frequented one?")

most_freq = df[['Station', 'DTV']].sort_values(by='DTV', ascending=False).head(1)
least_freq = df[['Station', 'DTV']].sort_values(by='DTV').head(1)
most_freq[['Station', 'DTV']]

print("A:", most_freq['Station'].tolist()[0], "has", most_freq['DTV'].tolist()[0], "average daily passengers and", least_freq['Station'].tolist()[0], str(least_freq['DTV'].tolist()[0]) + ".")

ratio = most_freq['DTV'].tolist()[0] / least_freq['DTV'].tolist()[0]
print("This means that Zurich HB has", ratio, "times more daily passengers than Oron.")


Q8: Take the most frequented and the least frequented stations. How many times more passengers has the most frequented one?
A: Zürich HB has 396300 average daily passengers and Oron 50.
This means that Zurich HB has 7926.0 times more daily passengers than Oron.

In [13]:
print("Q9: Which stations have far more passengers during work days than during the full week? Group them in a subset.")

work_days = df[df['DWV'] >= 1.35 * df['DTV']]
print("A: These", len(work_days), "stations have at least 35% more passengers during the work days:")
work_days


Q9: Which stations have far more passengers during work days than during the full week? Group them in a subset.
A: These 10 stations have at least 35% more passengers during the work days:
Out[13]:
Code Station Year DTV DWV Owner Comments geopos
109 BNG Bussnang 2014 400 560 THURBO NaN 47.5560796135, 9.08414102198
126 BLGK Baldegg Kloster 2014 450 610 SBB NaN 47.1806078267, 8.28057856573
151 GRS Grandson 2014 50 80 CFF NaN 46.8062699548, 6.6419368926
155 LAPR Lancy-Pont-Rouge 2014 1100 1600 CFF NaN 46.1888976308, 6.12533012554
159 BIBD Biel/Bienne Bözingenfeld/Champ 2014 300 410 SBB NaN 47.1583882845, 7.29458712956
172 MATT Hergiswil Matt 2014 70 100 ZB NaN 46.9936310359, 8.31271109334
282 GESE Genève-Sécheron 2014 730 990 CFF NaN 46.222435146, 6.14455938175
402 BAAN Baar Neufeld 2014 840 1200 SBB NaN 47.1884876743, 8.51775897551
426 CRET Crêt-du-Locle, Le 2014 300 410 CFF NaN 47.0775941385, 6.78517707886
504 ZIM Zimeysa 2014 730 1000 CFF NaN 46.2212609654, 6.06574042511

In [14]:
print("Q10: Find a crazy station name. Is its average frequency near to the mean average frequency of all stations?")

# Let's try to find a very long name...
longnames = df[df['Station'].str.match('.{25,}')]

longnames


Q10: Find a crazy station name. Is its average frequency near to the mean average frequency of all stations?
Out[14]:
Code Station Year DTV DWV Owner Comments geopos
159 BIBD Biel/Bienne Bözingenfeld/Champ 2014 300 410 SBB NaN 47.1583882845, 7.29458712956
197 SCHO Schöfflisdorf-Oberweningen 2014 1100 1300 SBB NaN 47.4979704691, 8.41179388261
304 GEC Geneveys-sur-Coffrane, Les 2014 350 420 CFF NaN 47.015085057, 6.85321349712
396 SLF Schloss Laufen am Rheinfall 2014 370 320 SBB NaN 47.6765598478, 8.61424692042
544 MSCH Münsterlingen-Scherzingen 2014 300 320 SBB NaN 47.6334235783, 9.2269983596

In [15]:
# … We'll pick “Geneveys-sur-Coffrane, Les”. This is an pretty long name.

meanDTV = df['DTV'].mean()
GeneveysDTV = df[df['Code'] == 'GEC']['DTV'].values

print("A: “Geneveys-sur-Coffran, Les” has an average daily frequency of", str(GeneveysDTV[0]) + ".")
print("This is far less than", str(meanDTV) + ", the mean average frequency of all stations.")
print("However, the _median_ frequency of all stations is only", str(df['DTV'].median()) + ".")


A: “Geneveys-sur-Coffran, Les” has an average daily frequency of 350.
This is far less than 4424.29558011, the mean average frequency of all stations.
However, the _median_ frequency of all stations is only 770.0.

In [16]:
print("Q11: Who else than the SBB CFF FFS (Federal Railways) owns stations? Make a list of them (remove any duplicate).")
other_owner = df[(df['Owner'] != 'CFF') & (df['Owner'] != 'SBB') & (df['Owner'] != 'FFS')]
list_owners = other_owner['Owner'].tolist()
print("A:", str.join(", ", set(list_owners)))


Q11: Who else than the SBB CFF FFS (Federal Railways) owns stations? Make a list of them (remove any duplicate).
A: THURBO, ZB

In [17]:
print("Q12: Print how many stations each owner has.")
print("A: Here is how many stations they have:\n" + str(df['Owner'].value_counts()))


Q12: Print how many stations each owner has.
A: Here is how many stations they have:
SBB       453
CFF       194
FFS        33
ZB         30
THURBO     14
Name: Owner, dtype: int64

Graphics


In [18]:
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('ggplot')

standard = df[(df['DWV'] > 300) & (df['DWV'] < 2300) ]

standard.plot(kind='scatter', x='DWV', y='DTV')
print("These are the stations in Q2 and Q3 and their average daily passengers during the full week vs. the work days:")


These are the stations in Q2 and Q3 and their average daily passengers during the full week vs. the work days:

In [19]:
plt.style.use('ggplot')
least_frequented = df.sort_values(by='DWV').head(20)
least_frequented.plot(kind='barh', x='Station', y='DTV').invert_yaxis()
print("These are the 20 least frequented stations, in average daily passengers:")


These are the 20 least frequented stations, in average daily passengers:

In [20]:
q1_freq = df[df['DWV'] <= 340]
q2_freq = df[(df['DWV'] <= 915) & (df['DWV'] > 340)]
q3_freq = df[(df['DWV'] <= 2700) & (df['DWV'] > 915)]

plt.scatter(y=q1_freq["DWV"], x=q1_freq["DTV"], c='c', alpha=0.75, marker='1')
plt.scatter(y=q2_freq["DWV"], x=q2_freq["DTV"], c='y', alpha=0.75, marker='2')
plt.scatter(y=q3_freq["DWV"], x=q3_freq["DTV"], c='m', alpha=0.75, marker='3')

print("Q1, Q2 and Q3 of average daily circulation; x axis = DTV, y axis = DWV")

plt.xlim(-15,2500)
plt.ylim(-30,2800)


Q1, Q2 and Q3 of average daily circulation; x axis = DTV, y axis = DWV
Out[20]:
(-30, 2800)

In [ ]: