Einführung

Dieses Notebook ist eine kleine Einführung in die Data Science Welt mit Python, dem Notebook-System Jupyter und dem Datenanalyse-Framework Pandas.

Die Aufgabe ist, aus dem Mitschnitt von Telefon-Metadaten herauszufinden, wo eine bestimmte Person wohnt. Das Notebook erklärt hierzu immer die Funktionsweisen und die Aufgaben der notwendigen Analyseschritte.

Importieren von Bibliotheken

Als erstes wollen wir über den in Python üblichen Weg das Modul "Pandas" einbinden.

Generell können sog. Module auf mehrere Arten eingebunden werden:

Einfaches Einbinden mittels import <modul>, z. B. import pandas
Einbinden mit separatem Namen mittels import <modul> as <name>, z. B. import pandas as pd
Einbinden eines Untermoduls mittels from <hauptmodul> import <modul>, z. B. from matplotlib import pyplot

Wir möchten gerne das Modul "Pandas" als pd einbinden.

Die Zelle können wir dann mit dem Tastenkürzel Shift + Enter ausführen.



In [23]:

    
import pandas as pd

Interaktive Hilfe

Ob das Importieren des Moduls auch wirklich funktioniert hat, können wir prüfen, in dem wir mit dem pd-Modul arbeiten. Dazu hängen wir an die pd-Variable den ? Operator an und führen die Zelle aus. Es erscheint die Dokumentation des Moduls im unteren Bereich des Notebooks. Diesen Bereich können wir durchlesen und mit der Taste ESC auch wieder verschwinden lassen.



In [20]:

    
pd?

Die weitere Funktionalität der Pandas-Bibliothek können wir erkunden, indem wir die Methoden von Pandas ansehen. Dazu verwenden wir pd. und nutzen die integrierte Autovervollständigung von Jupyter mittels der Tabulatortaste Tab, um zu sehen, welche Methoden uns Pandas bietet. Gehen wir dann mit der Pfeiltaste unten z. B. auf Categorical, drücken Enter und schließend Shift+ Tab, dann erscheint die Signatur des entsprechenden Funktionalität und der Ausschnitt der Hilfedokumentation. Bei zweimaligem Drücken von Shift + Tab erscheint die Hilfe vollständig.



In [24]:

    
pd.Categorical









    Out[24]:





pandas.core.categorical.Categorical

Laden von Daten

Als nächstes Laden wir Telefon-Metadaten (sog. CDRs: Call Data Records) aus einer Datei mit kommaseparierten Daten (einer "CSV"-Datei: "Comma-separated values").

Pandas bietet hierzu eine Methode namens read_csv. Diese nimmt als ersten Parameter einen Pfad entgegen. Diverse andere Argumente können angegeben werden, um spezielle Varianten von CSV-Dateien einlesen zu können. Z. B.

sep: Trennzeichen für Daten in einer Zeile wie etwa sep=";"
encoding: andere Zeichenkodierungen wie etwa encoding="latin-1"
dtype: Zuordnung von Spalten zu speziellen Datentypen wie etwa dtype={ 0:str }

Lesen wir erst einmal die Datei data/CDR_data.csv ohne jegliche Argumente ein. Diese Datei enthält die für 10 Mobilfunknutzer gesammelten Telefon-Metadaten über drei Jahre hinweg. Folgende Informationen sind darin enthalten:

In: Die anrufende Telefonnummer
Out: Die ausgehende Telefonnummer
Direction: Die Richtung des Telefonats
CallTimestamp: Der Zeitstempel des Anrufs
Duration: Die Dauer des Telefonats
TowerID: Die ID des ersten Mobilfunkmasten, der beim Anruf genutzt wurde

Wir lesen die Daten ein und weisen das Ergebnis der Variable cdr zu. Mittels der Methode head() können wir uns dann die ersten fünf Einträge anzeigen lassen.



In [18]:

    
cdr = pd.read_csv('data/CDR_data.csv')
cdr.head()









    Out[18]:







  
    
      
      In
      Out
      Direction
      CallTimestamp
      Duration
      TowerID
    
  
  
    
      0
      4638472273
      2666307251
      Incoming
      2010-12-25 07:16:24.736813
      0:02:41.741499
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
    
    
      1
      4638472273
      1755442610
      Incoming
      2010-12-25 21:18:30.053710
      0:02:47.108750
      aeaf8b43-8034-44fe-833d-31854a75acbf
    
    
      2
      4638472273
      5481755331
      Incoming
      2010-12-25 14:52:42.878016
      0:04:35.356341
      fadaa83f-6001-45fd-aa4a-17d6c6b7ec00
    
    
      3
      4638472273
      1755442610
      Incoming
      2010-12-25 16:02:09.001913
      0:02:23.498499
      fadaa83f-6001-45fd-aa4a-17d6c6b7ec00
    
    
      4
      4638472273
      2145623768
      Incoming
      2010-12-25 15:28:35.028554
      0:03:54.692497
      95d7920d-c3cd-4d20-a568-9a55800dc807



In [1]:

    
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

%matplotlib inline
matplotlib.style.use('ggplot') # Look Pretty



In [3]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53188 entries, 0 to 53187
Data columns (total 6 columns):
In               53188 non-null object
Out              53188 non-null object
Direction        53188 non-null object
CallTimestamp    53188 non-null object
Duration         53188 non-null object
TowerID          53188 non-null object
dtypes: object(6)
memory usage: 2.4+ MB



In [4]:

    
df.CallTimestamp = pd.to_datetime(df.CallTimestamp)
df.Duration = pd.to_timedelta(df.Duration)
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53188 entries, 0 to 53187
Data columns (total 6 columns):
In               53188 non-null object
Out              53188 non-null object
Direction        53188 non-null object
CallTimestamp    53188 non-null datetime64[ns]
Duration         53188 non-null timedelta64[ns]
TowerID          53188 non-null object
dtypes: datetime64[ns](1), object(4), timedelta64[ns](1)
memory usage: 2.4+ MB



In [16]:

    
phoneowners = pd.read_excel("data/phoneowners.xlsx")
phoneowners.head()









    Out[16]:







  
    
      
      Name
      Number
    
  
  
    
      0
      Renee Molina
      9204863989
    
    
      1
      Wendy Arnold
      6115205038
    
    
      2
      Daniel Ellis
      6707476992
    
    
      3
      Jason Bradshaw
      8177960907
    
    
      4
      Francisco Brown
      4443774214



In [9]:

    
phoneowners.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
Name      999 non-null object
Number    999 non-null object
dtypes: object(2)
memory usage: 15.7+ KB



In [10]:

    
#Join them
phoneowners.columns = ['Name', 'In']
joined = pd.merge(df, phoneowners, on='In')
joined.head()

#But where are those towers?
#No for the tower data
towers = pd.read_csv("data/towers.csv")
joined = pd.merge(joined, towers, on='TowerID')
joined.head()

# result = pd.concat([df1, df4], axis=1, join_axes=[df1.index])









    Out[10]:







  
    
      
      In
      Out
      Direction
      CallTimestamp
      Duration
      TowerID
      Name
      TowerLat
      TowerLon
    
  
  
    
      0
      04638472273
      02666307251
      Incoming
      2010-12-25 07:16:24.736813
      00:02:41.741499
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      1
      04638472273
      07922223441
      Missed
      2010-12-26 20:10:13.924779
      00:02:49.402318
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      2
      04638472273
      07756914135
      Incoming
      2010-12-28 19:41:43.122708
      00:05:08.460490
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      3
      04638472273
      04551601645
      Incoming
      2010-12-29 19:16:16.463142
      00:04:20.380243
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      4
      04638472273
      06250348072
      Incoming
      2010-12-29 19:29:54.814022
      00:03:18.870706
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417



In [11]:

    
joined['CallTimestamp'] = pd.to_datetime(joined['CallTimestamp'])
weekdays = joined['CallTimestamp'].dt.dayofweek.isin(Workweek) & joined['CallTimestamp'].dt.hour.isin(range(8,18))
dfweekdays = joined[weekdays]
dfweekdays.head()









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-11-59dd3a037e51> in <module>()
      1 joined['CallTimestamp'] = pd.to_datetime(joined['CallTimestamp'])
----> 2 weekdays = joined['CallTimestamp'].dt.dayofweek.isin(Workweek) & joined['CallTimestamp'].dt.hour.isin(range(8,18))
      3 dfweekdays = joined[weekdays]
      4 dfweekdays.head()

NameError: name 'Workweek' is not defined



In [174]:

    
targetname = 'John Doe'
user1 = joined[joined['Name'] == targetname]
#user1 = user1[weekdays]
user1 = user1.reset_index(drop=True)
user1.head()









    Out[174]:






  
    
      
      In
      Out
      Direction
      CallTimestamp
      Duration
      TowerID
      Name
      TowerLat
      TowerLon
    
  
  
    
      0
      04638472273
      02666307251
      Incoming
      2010-12-25 07:16:24.736813
      00:02:41.741499
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      1
      04638472273
      07922223441
      Missed
      2010-12-26 20:10:13.924779
      00:02:49.402318
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      2
      04638472273
      07756914135
      Incoming
      2010-12-28 19:41:43.122708
      00:05:08.460490
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      3
      04638472273
      04551601645
      Incoming
      2010-12-29 19:16:16.463142
      00:04:20.380243
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
    
    
      4
      04638472273
      06250348072
      Incoming
      2010-12-29 19:29:54.814022
      00:03:18.870706
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417



In [175]:

    
# INFO: Plot all the call locations
%matplotlib inline
#user1[['TowerLon', 'TowerLat']].plot.scatter()
user1.plot.scatter(x='TowerLon', y='TowerLat', c='purple', alpha=0.12, title='Call Locations', s = 30)
#showandtell()  # Comment this line out when you're ready to proceed









    Out[175]:





<matplotlib.axes._subplots.AxesSubplot at 0xb74cb70>



In [176]:

    
#
# INFO: The locations map above should be too "busy" to really wrap your head around. This is where domain expertise comes into play.
# Your intuition tells you that people are likely to behave differently on weekends:
#
# On Weekends:
#   1. People probably don't go into work
#   2. They probably sleep in late on Saturday
#   3. They probably run a bunch of random errands, since they couldn't during the week
#   4. They should be home, at least during the very late hours, e.g. 1-4 AM
#
# On Weekdays:
#   1. People probably are at work during normal working hours
#   2. They probably are at home in the early morning and during the late night
#   3. They probably spend time commuting between work and home everyday


#
# TODO: Add more filters to the user1 slice you created. Add bitwise logic so that you're only examining records that came in on
# weekends (sat/sun).
#
# .. your code here ..

user1['DOW'] = user1.CallTimestamp.dt.strftime("%a")

user1 = user1[(user1.DOW == 'Sat') | (user1.DOW == 'Sun')]
user1.head()
#
# TODO: Further filter it down for calls that are came in either before 6AM OR after 10pm (22:00:00). You can use < and > to compare
# the string times, just make sure you code them as military time strings, eg: "06:00:00", "22:00:00":
# https://en.wikipedia.org/wiki/24-hour_clock
#
# You might also want to review the Data Manipulation section for this. Once you have your filtered slice, print out its length:
#
# .. your code here ..

user1 = user1[(user1.CallTimestamp < "06:00:00") | (user1.CallTimestamp > "22:00:00")]
user1.head()









    Out[176]:






  
    
      
      In
      Out
      Direction
      CallTimestamp
      Duration
      TowerID
      Name
      TowerLat
      TowerLon
      DOW
    
  
  
    
      0
      04638472273
      02666307251
      Incoming
      2010-12-25 07:16:24.736813
      00:02:41.741499
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
      Sat
    
    
      1
      04638472273
      07922223441
      Missed
      2010-12-26 20:10:13.924779
      00:02:49.402318
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
      Sun
    
    
      5
      04638472273
      07756914135
      Missed
      2011-01-01 19:06:20.686967
      00:02:19.405029
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
      Sat
    
    
      14
      04638472273
      07841019020
      Incoming
      2011-01-30 18:56:08.650259
      00:02:22.357491
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
      Sun
    
    
      15
      04638472273
      05455269258
      Incoming
      2011-02-05 19:32:56.104492
      00:00:30.595089
      0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
      John Doe
      32.731611
      -96.709417
      Sat



In [183]:

    
#
# INFO: Visualize the dataframe with a scatter plot as a sanity check. Since you're familiar with maps, you know well that your
# X-Coordinate should be Longitude, and your Y coordinate should be the tower Latitude. Check the dataset headers for proper column
# feature names.
# https://en.wikipedia.org/wiki/Geographic_coordinate_system#Geographic_latitude_and_longitude
#
# At this point, you don't yet know exactly where the user is located just based off the cell phone tower position data; but
# considering the below are for Calls that arrived in the twilight hours of weekends, it's likely that wherever they are bunched up
# is probably near the caller's residence:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(user1.TowerLon,user1.TowerLat, c='g', marker='o', alpha=0.2)
ax.set_title('Weekend Calls (<6am or >10p)')
#showandtell()  # TODO: Comment this line out when you're ready to proceed

#
# TODO: Run K-Means with a K=1. There really should only be a single area of concentration. If you notice multiple areas that are
# "hot" (multiple areas the usr spends a lot of time at that are FAR apart from one another), then increase K=2, with the goal being
# that one of the centroids will sweep up the annoying outliers; and the other will zero in on the user's approximate home location.
# Or rather the location of the cell tower closest to their home.....
#
# Be sure to only feed in Lat and Lon coordinates to the KMeans algo, since none of the other data is suitable for your purposes.
# Since both Lat and Lon are (approximately) on the same scale, no feature scaling is required. Print out the centroid locations and
# add them onto your scatter plot. Use a distinguishable marker and color.
#
# Hint: Make sure you graph the CORRECT coordinates. This is part of your domain expertise.
#
# .. your code here ..









    Out[183]:





<matplotlib.text.Text at 0xdb43ba8>



In [205]:

    
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 2)
user1 = pd.concat([user1.TowerLon, user1.TowerLat], axis = 1)
labels = kmeans.fit_predict(user1)
centroids = kmeans.cluster_centers_
ax.scatter(x = centroids[:, 0], y = centroids[:, 1], c = 'r', marker = 'x', s = 100)
ax.figure
#showandtell()  # TODO: Comment this line out when you're ready to proceed









    Out[205]:



In [211]:

    
#coordinates = "" + centroids[0][1].to_string().split('.')[0] + "°" + centroids[0][1].split('.')[1][0]  + centroids[0][1].split('.')[1][1] + centroids[0][0]
#str.split(' ', 1 )
centroids

difference1 = centroids[0][1] - centroids[1][1]
difference2 = centroids[0][0] - centroids[1][0]

difference1 = 0.5 * difference1
difference2 = 0.5 * difference2

coordinate1 = centroids[0][1] + difference1
coordinate2 = centroids[0][0] + difference2

coordinates = str(coordinate1) + " " + str(coordinate2)
coordinates

#-96°90'92.4672"N  96°56'57.3"W









    Out[211]:





'32.9813029167 -97.0080512904'

	In	Out	Direction	CallTimestamp	Duration	TowerID
0	4638472273	2666307251	Incoming	2010-12-25 07:16:24.736813	0:02:41.741499	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec
1	4638472273	1755442610	Incoming	2010-12-25 21:18:30.053710	0:02:47.108750	aeaf8b43-8034-44fe-833d-31854a75acbf
2	4638472273	5481755331	Incoming	2010-12-25 14:52:42.878016	0:04:35.356341	fadaa83f-6001-45fd-aa4a-17d6c6b7ec00
3	4638472273	1755442610	Incoming	2010-12-25 16:02:09.001913	0:02:23.498499	fadaa83f-6001-45fd-aa4a-17d6c6b7ec00
4	4638472273	2145623768	Incoming	2010-12-25 15:28:35.028554	0:03:54.692497	95d7920d-c3cd-4d20-a568-9a55800dc807

	Name	Number
0	Renee Molina	9204863989
1	Wendy Arnold	6115205038
2	Daniel Ellis	6707476992
3	Jason Bradshaw	8177960907
4	Francisco Brown	4443774214

	In	Out	Direction	CallTimestamp	Duration	TowerID	Name	TowerLat	TowerLon
0	04638472273	02666307251	Incoming	2010-12-25 07:16:24.736813	00:02:41.741499	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec	John Doe	32.731611	-96.709417
1	04638472273	07922223441	Missed	2010-12-26 20:10:13.924779	00:02:49.402318	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec	John Doe	32.731611	-96.709417
2	04638472273	07756914135	Incoming	2010-12-28 19:41:43.122708	00:05:08.460490	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec	John Doe	32.731611	-96.709417
3	04638472273	04551601645	Incoming	2010-12-29 19:16:16.463142	00:04:20.380243	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec	John Doe	32.731611	-96.709417
4	04638472273	06250348072	Incoming	2010-12-29 19:29:54.814022	00:03:18.870706	0db53dd3-eb9c-4344-abc5-c2d74ebc3eec	John Doe	32.731611	-96.709417