Example of an exploratory data analysis: earthquakes of magnitude 2.5 and above from the past day (US Geological Survey)

Import of two standard packages for data analysis: NumPy for multidimensional arrays, pandas for the analysis of tabular data.


In [2]:
import pandas as pd
import numpy as np

Direct download from the USGS, recording of the download date, and automatic import into a pandas DataFrame


In [3]:
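# USGS summary feed as CSV: magnitude 2.5+ events of the past day
fileUrl = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv'
eData = pd.read_csv(fileUrl)
# IPython shell escape: captures the output of the system `date` command
dateDownloaded = !date
dateDownloaded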


Out[3]:
['Sat Dec 12 18:23:33 CET 2015']
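The `!date` line only works inside IPython and relies on a Unix-like `date` command. As a minimal sketch of a portable alternative, the download time can be recorded with the standard library. The names `sample_csv`, `e_sample` and `date_downloaded` are assumptions for illustration, and the two inline rows are copied from the feed output shown in this notebook so that the example runs without network access.

```python
import io
from datetime import datetime, timezone

import pandas as pd

# Two rows copied from the feed output in this notebook; in the real
# workflow pd.read_csv(fileUrl) would be used instead of this sample.
sample_csv = io.StringIO(
    "time,latitude,longitude,depth,mag,place\n"
    '2015-12-12T16:48:52.730Z,20.000999,-156.476334,0.00,2.61,"59km WNW of Kalaoa, Hawaii"\n'
    '2015-12-12T15:13:52.210Z,51.939000,178.413700,126.22,4.80,"6km W of Little Sitkin Island, Alaska"\n'
)
e_sample = pd.read_csv(sample_csv)

# Record the download time in UTC without relying on the shell.
date_downloaded = datetime.now(timezone.utc).isoformat()
print(date_downloaded)
print(e_sample.shape)
```

Storing the timestamp in UTC keeps it comparable with the `time` and `updated` columns of the feed, which also end in `Z`.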

1. Display as a pandas DataFrame

Display of the data set as a pandas DataFrame (table of the first and last 30 entries, number of rows and columns). Convention: the variables are the columns, the individual measurements are the rows.


In [4]:
eData


Out[4]:
time latitude longitude depth mag magType nst gap dmin rms net id updated place type
0 2015-12-12T16:48:52.730Z 20.000999 -156.476334 0.00 2.61 md 34 234.0 0.668600 0.18 hv hv61124976 2015-12-12T17:00:47.325Z 59km WNW of Kalaoa, Hawaii earthquake
1 2015-12-12T15:13:52.210Z 51.939000 178.413700 126.22 4.80 mb NaN 96.0 0.788000 0.56 us us20004grd 2015-12-12T15:36:50.040Z 6km W of Little Sitkin Island, Alaska earthquake
2 2015-12-12T15:08:58.170Z -35.009800 -71.962800 42.79 5.00 mb NaN 152.0 0.026000 1.08 us us20004grc 2015-12-12T16:07:22.067Z 54km NNW of Talca, Chile earthquake
3 2015-12-12T14:46:26.780Z -10.173300 161.156100 71.07 4.80 mb NaN 75.0 1.397000 1.00 us us20004gr8 2015-12-12T17:17:07.511Z 89km WNW of Kirakira, Solomon Islands earthquake
4 2015-12-12T14:29:46.770Z 6.742100 94.472300 17.66 4.40 mb NaN 145.0 2.884000 1.07 us us20004gr3 2015-12-12T17:09:50.853Z 132km NW of Sabang, Indonesia earthquake
5 2015-12-12T13:42:50.800Z 18.830400 -64.347900 26.00 2.80 Md 6 320.4 0.484192 0.10 pr pr15346002 2015-12-12T15:39:32.022Z 53km NNE of Road Town, British Virgin Islands earthquake
6 2015-12-12T13:25:13.000Z 61.130000 -150.605700 57.50 2.70 ml NaN NaN NaN 0.67 ak ak12241503 2015-12-12T13:57:29.417Z 39km WSW of Anchorage, Alaska earthquake
7 2015-12-12T12:39:57.290Z 36.281200 -97.473100 4.26 2.80 mb_lg NaN 59.0 0.468000 0.31 us us20004gpy 2015-12-12T12:49:40.283Z 16km W of Perry, Oklahoma earthquake
8 2015-12-12T12:14:39.100Z -12.802000 -14.678100 10.00 4.80 mb NaN 65.0 4.848000 0.77 us us20004gpw 2015-12-12T12:33:20.924Z Southern Mid-Atlantic Ridge earthquake
9 2015-12-12T11:34:38.740Z 16.309200 -96.768400 44.77 4.50 mb NaN 178.0 2.125000 1.22 us us20004gpr 2015-12-12T11:53:33.222Z 10km SE of San Vicente Coatlan, Mexico earthquake
10 2015-12-12T11:30:56.040Z 37.488998 -118.801498 8.42 2.65 ml 32 117.0 0.103200 0.04 nc nc72567171 2015-12-12T16:51:02.060Z 24km SE of Mammoth Lakes, California earthquake
11 2015-12-12T11:23:44.030Z 38.820499 -122.823166 2.34 2.65 md 41 49.0 0.009629 0.05 nc nc72567166 2015-12-12T11:53:04.525Z 7km NW of The Geysers, California earthquake
12 2015-12-12T08:34:46.210Z 37.791400 21.080400 10.00 4.50 mwr NaN 85.0 0.908000 0.97 us us20004gp7 2015-12-12T16:37:26.955Z 7km SSW of Ayios Nikolaos, Greece earthquake
13 2015-12-12T07:34:20.600Z 18.989700 -67.677800 55.00 3.50 Md 16 230.4 0.817467 0.47 pr pr15346001 2015-12-12T15:36:57.881Z 82km NW of San Antonio, Puerto Rico earthquake
14 2015-12-12T05:29:45.000Z 60.571600 -146.006000 18.30 2.60 ml NaN NaN NaN 0.94 ak ak12236580 2015-12-12T13:32:21.644Z 14km WNW of Cordova, Alaska earthquake
15 2015-12-12T04:44:36.850Z 38.819832 -122.823669 2.38 2.52 md 40 30.0 0.009461 0.05 nc nc72567041 2015-12-12T12:47:06.220Z 7km NW of The Geysers, California earthquake
16 2015-12-12T02:28:56.430Z 36.277000 -97.693500 5.00 2.60 mb_lg NaN 92.0 0.326000 0.44 us us20004gni 2015-12-12T04:51:08.341Z 18km E of Waukomis, Oklahoma earthquake
17 2015-12-12T01:36:12.030Z 36.270300 -97.562700 1.20 2.70 mb_lg NaN 64.0 0.412000 0.57 us us20004gnb 2015-12-12T04:53:42.214Z 24km W of Perry, Oklahoma earthquake
18 2015-12-12T01:33:18.890Z 5.635200 -77.590400 36.32 4.60 mb NaN 127.0 2.117000 1.53 us us20004gnc 2015-12-12T09:35:54.470Z 36km WSW of Nuqui, Colombia earthquake
19 2015-12-12T00:45:35.490Z -19.947500 -71.037300 15.41 4.50 mwr NaN 156.0 0.910000 0.66 us us20004gn3 2015-12-12T08:48:12.892Z 98km WNW of Iquique, Chile earthquake
20 2015-12-12T00:32:20.510Z 19.009400 -69.287800 24.35 4.50 mwr NaN 63.0 0.990000 0.70 us us20004gn1 2015-12-12T02:39:06.157Z 10km ENE of El Valle, Dominican Republic earthquake
21 2015-12-11T23:17:15.020Z -42.780500 174.182100 24.25 4.20 mb NaN 176.0 0.598000 0.69 us us20004gmq 2015-12-12T07:19:59.943Z 58km SE of Kaikoura, New Zealand earthquake
22 2015-12-11T23:14:03.190Z -19.935000 -71.045400 32.75 4.00 mwr NaN 179.0 0.921000 0.76 us us20004gmr 2015-12-12T07:16:41.848Z 99km WNW of Iquique, Chile earthquake
23 2015-12-11T22:10:33.600Z 19.443500 -65.286300 112.00 3.30 Md 11 273.6 1.128284 0.37 pr pr15345008 2015-12-12T06:13:12.389Z 126km NNE of Vieques, Puerto Rico earthquake
24 2015-12-11T22:09:39.300Z 29.534000 95.809800 24.63 4.30 mb NaN 139.0 4.080000 0.37 us us20004gm1 2015-12-12T06:12:08.046Z 33km S of Zhamog, China earthquake
25 2015-12-11T22:05:39.520Z -47.668300 85.712300 10.00 5.00 mb NaN 65.0 22.998000 0.94 us us20004gm0 2015-12-12T06:08:10.458Z Southeast Indian Ridge earthquake
26 2015-12-11T22:02:46.460Z 40.329834 -124.410835 8.54 2.57 md 7 303.0 0.097060 0.06 nc nc72566951 2015-12-12T06:05:27.511Z 30km SSW of Ferndale, California earthquake
27 2015-12-11T20:56:16.300Z 9.996900 -84.101700 13.38 3.30 ml NaN 134.0 0.010000 0.42 us us20004gll 2015-12-11T21:52:51.079Z 0km WNW of San Pablo, Costa Rica earthquake

Shorter display with head(): only the first 5 rows of the table


In [5]:
eData.head()


Out[5]:
time latitude longitude depth mag magType nst gap dmin rms net id updated place type
0 2015-12-12T16:48:52.730Z 20.000999 -156.476334 0.00 2.61 md 34 234 0.6686 0.18 hv hv61124976 2015-12-12T17:00:47.325Z 59km WNW of Kalaoa, Hawaii earthquake
1 2015-12-12T15:13:52.210Z 51.939000 178.413700 126.22 4.80 mb NaN 96 0.7880 0.56 us us20004grd 2015-12-12T15:36:50.040Z 6km W of Little Sitkin Island, Alaska earthquake
2 2015-12-12T15:08:58.170Z -35.009800 -71.962800 42.79 5.00 mb NaN 152 0.0260 1.08 us us20004grc 2015-12-12T16:07:22.067Z 54km NNW of Talca, Chile earthquake
3 2015-12-12T14:46:26.780Z -10.173300 161.156100 71.07 4.80 mb NaN 75 1.3970 1.00 us us20004gr8 2015-12-12T17:17:07.511Z 89km WNW of Kirakira, Solomon Islands earthquake
4 2015-12-12T14:29:46.770Z 6.742100 94.472300 17.66 4.40 mb NaN 145 2.8840 1.07 us us20004gr3 2015-12-12T17:09:50.853Z 132km NW of Sabang, Indonesia earthquake

Number of rows and columns with NumPy shape().


In [6]:
np.shape(eData)


Out[6]:
(28, 15)

Display of the individual column names with the attribute DataFrame.columns


In [7]:
eData.columns


Out[7]:
Index(['time', 'latitude', 'longitude', 'depth', 'mag', 'magType', 'nst',
       'gap', 'dmin', 'rms', 'net', 'id', 'updated', 'place', 'type'],
      dtype='object')

Data type of each variable with the attribute DataFrame.dtypes


In [8]:
eData.dtypes


Out[8]:
time          object
latitude     float64
longitude    float64
depth        float64
mag          float64
magType       object
nst          float64
gap          float64
dmin         float64
rms          float64
net           object
id            object
updated       object
place         object
type          object
dtype: object
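The dtypes output lists `time` and `updated` as `object`, meaning they are stored as plain strings. If time differences are needed later, they can be parsed into timestamps. The following is a minimal sketch on a two-row stand-in (`e_sample` is an assumed name, the values are taken from rows 0 and 1 above), not a step of the original analysis.

```python
import pandas as pd

# Small stand-in for eData with the two timestamp columns as strings.
e_sample = pd.DataFrame({
    "time": ["2015-12-12T16:48:52.730Z", "2015-12-12T15:13:52.210Z"],
    "updated": ["2015-12-12T17:00:47.325Z", "2015-12-12T15:36:50.040Z"],
    "mag": [2.61, 4.80],
})

# Parse the ISO 8601 strings; utc=True yields timezone-aware timestamps.
for col in ["time", "updated"]:
    e_sample[col] = pd.to_datetime(e_sample[col], utc=True)

print(e_sample.dtypes)
# Delay between the event and the last catalogue update.
print(e_sample["updated"] - e_sample["time"])
```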

2. Preparing the data set

Check whether the table contains NaN values with DataFrame.isnull().any()


In [9]:
eData.isnull().any()


Out[9]:
time         False
latitude     False
longitude    False
depth        False
mag          False
magType      False
nst           True
gap           True
dmin          True
rms          False
net          False
id           False
updated      False
place        False
type         False
dtype: bool

Removal of all rows (measurements) that contain NaNs with DataFrame.dropna()


In [10]:
eData = eData.dropna()
eData.head()


Out[10]:
time latitude longitude depth mag magType nst gap dmin rms net id updated place type
0 2015-12-12T16:48:52.730Z 20.000999 -156.476334 0.00 2.61 md 34 234.0 0.668600 0.18 hv hv61124976 2015-12-12T17:00:47.325Z 59km WNW of Kalaoa, Hawaii earthquake
5 2015-12-12T13:42:50.800Z 18.830400 -64.347900 26.00 2.80 Md 6 320.4 0.484192 0.10 pr pr15346002 2015-12-12T15:39:32.022Z 53km NNE of Road Town, British Virgin Islands earthquake
10 2015-12-12T11:30:56.040Z 37.488998 -118.801498 8.42 2.65 ml 32 117.0 0.103200 0.04 nc nc72567171 2015-12-12T16:51:02.060Z 24km SE of Mammoth Lakes, California earthquake
11 2015-12-12T11:23:44.030Z 38.820499 -122.823166 2.34 2.65 md 41 49.0 0.009629 0.05 nc nc72567166 2015-12-12T11:53:04.525Z 7km NW of The Geysers, California earthquake
13 2015-12-12T07:34:20.600Z 18.989700 -67.677800 55.00 3.50 Md 16 230.4 0.817467 0.47 pr pr15346001 2015-12-12T15:36:57.881Z 82km NW of San Antonio, Puerto Rico earthquake

In [11]:
eData.isnull().any()


Out[11]:
time         False
latitude     False
longitude    False
depth        False
mag          False
magType      False
nst          False
gap          False
dmin         False
rms          False
net          False
id           False
updated      False
place        False
type         False
dtype: bool
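Calling dropna() without arguments discards every row that has a NaN in any column, which here reduces the 28 downloaded rows to 8. A possible refinement, sketched below under the assumption that only some columns matter for a given question, is to count the missing values per column and restrict dropna() with `subset`. The frame `e_sample` is a stand-in built from rows 0, 1, 5 and 6 of the table above.

```python
import numpy as np
import pandas as pd

# Four rows taken from the table above (rows 0, 1, 5 and 6), reduced to a few columns.
e_sample = pd.DataFrame({
    "mag": [2.61, 4.80, 2.80, 2.70],
    "nst": [34, np.nan, 6, np.nan],
    "gap": [234.0, 96.0, 320.4, np.nan],
    "net": ["hv", "us", "pr", "ak"],
})

# Number of missing values per column.
print(e_sample.isnull().sum())

# Drop only rows lacking a value in the column the analysis needs.
kept = e_sample.dropna(subset=["gap"])
print(len(e_sample.dropna()), len(kept))
```

In this sample, the unrestricted call keeps 2 rows, while restricting to `gap` keeps 3.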

Check whether rows (measurements) occur more than once with DataFrame.duplicated()


In [12]:
eData.duplicated().any()


Out[12]:
False

So there are no duplicates. If there were, they could be removed with DataFrame.drop_duplicates().
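duplicated() compares entire rows, so two entries for the same event that differ in a single field are not flagged. The sketch below illustrates this with a stand-in frame; the third row and its `updated` value are hypothetical and were invented to create a repeated `id`.

```python
import pandas as pd

# Stand-in data: the second and third rows share an id (hypothetical duplicate).
e_sample = pd.DataFrame({
    "id": ["hv61124976", "us20004grd", "us20004grd"],
    "mag": [2.61, 4.80, 4.80],
    "updated": [
        "2015-12-12T17:00:47.325Z",
        "2015-12-12T15:36:50.040Z",
        "2015-12-12T16:00:00.000Z",  # hypothetical later update
    ],
})

# Whole-row comparison finds nothing because 'updated' differs.
print(e_sample.duplicated().any())

# Comparing on the event id alone flags the repeated event; keep the last entry.
deduped = e_sample.drop_duplicates(subset="id", keep="last")
print(deduped)
```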

3. Exploratory statistics

Statistical description of the numeric variables with DataFrame.describe() (count: number of measurements, mean: mean, std: standard deviation, min: minimum, 25%: 25th percentile, ...)


In [ ]:
eData.describe()


Out[ ]:
latitude longitude depth mag nst gap dmin rms
count 8.000000 8.000000 8.000000 8.00000 8.000000 8.000000 8.000000 8.000000
mean 29.090470 -105.330938 26.835000 2.82500 23.375000 194.675000 0.414737 0.165000
std 10.482432 34.793811 38.985401 0.36789 14.889474 113.995887 0.425432 0.165874
min 18.830400 -156.476334 0.000000 2.52000 6.000000 30.000000 0.009461 0.040000
25% 19.330050 -123.220461 2.370000 2.60000 10.000000 100.000000 0.075202 0.050000
50% 28.744999 -120.812332 8.480000 2.65000 24.000000 232.200000 0.293696 0.080000
75% 38.819999 -67.079925 33.250000 2.92500 35.500000 280.950000 0.705817 0.227500
max 40.329834 -64.347900 112.000000 3.50000 41.000000 320.400000 1.128284 0.470000

Scatter matrix for all numeric variables with pandas scatter_matrix():


In [ ]:
# In current pandas versions the function lives in pd.plotting
pd.plotting.scatter_matrix(eData, figsize=(14,14), marker='o');

4. Analysis of subsets

Access to the variable 'latitude':


In [ ]:
# Select a single column by its name (see eData.columns above)
eData['latitude']

Which earthquakes occurred above a latitude of 40 degrees?


In [ ]:
# Boolean Series: True for every row with latitude above 40 degrees
eData['latitude'] > 40.0
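The comparison above returns only a Series of True/False values. To see the matching earthquakes themselves, the boolean Series can be used as a row mask. A minimal sketch follows, with `e_sample`, `mask` and `north` as assumed names and three rows taken from the full table above that contain no missing values.

```python
import pandas as pd

# Three rows from the full table above (rows 0, 26 and 5).
e_sample = pd.DataFrame({
    "latitude": [20.000999, 40.329834, 18.830400],
    "place": [
        "59km WNW of Kalaoa, Hawaii",
        "30km SSW of Ferndale, California",
        "53km NNE of Road Town, British Virgin Islands",
    ],
})

mask = e_sample["latitude"] > 40.0   # boolean Series, one entry per row
north = e_sample[mask]               # keeps only the rows where mask is True
print(north)
```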

Were there any earthquakes at all above 40 degrees latitude?


In [ ]:
# any() is True if at least one row satisfies the condition
(eData['latitude'] > 40.0).any()

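There were. Do all recorded earthquakes have a latitude greater than 18 degrees?


In [ ]:
# all() is True only if every row satisfies the condition
(eData['latitude'] > 18.0).all()

A result of False would mean that earthquakes below 18 degrees are also recorded. For the cleaned data set described above the minimum latitude is 18.8304, so the check returns True; the events below 18 degrees in the original download (for example row 2 at -35.0098) were removed by dropna().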

All distinct values of the categorical variable 'magType' (magnitude type) with DataFrame['variable_name'].unique()


In [ ]:
# Distinct magnitude types present in the data
eData['magType'].unique()

Frequency of the different categories in 'magType' with DataFrame['variable_name'].value_counts():


In [ ]:
# Number of events per magnitude type, most frequent first
eData['magType'].value_counts()
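The feed shown above contains both `md` and `Md` in the `magType` column, which value_counts() treats as two categories. If the two spellings are meant to denote the same magnitude type (an assumption, not something the data confirms), they can be merged by lower-casing before counting. A sketch on the `magType` values of rows 0, 5, 10, 11 and 13:

```python
import pandas as pd

# magType values of rows 0, 5, 10, 11 and 13 from the table above.
mag_type = pd.Series(["md", "Md", "ml", "md", "Md"], name="magType")

print(mag_type.value_counts())              # 'md' and 'Md' counted separately
print(mag_type.str.lower().value_counts())  # merged after lower-casing
```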

Frequency of value pairs of the two categorical variables 'magType' and 'net' (contributing network) with pandas crosstab():


In [ ]:
# Rows: network, columns: magnitude type, cells: number of events
pd.crosstab(eData['net'], eData['magType'])
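To make the shape of a crosstab result concrete, here is a small sketch using the `net` and `magType` values of rows 0 to 5 of the full table above. The names `e_sample` and `table` are assumptions for illustration.

```python
import pandas as pd

# 'net' and 'magType' values of rows 0-5 from the table above.
e_sample = pd.DataFrame({
    "net": ["hv", "us", "us", "us", "us", "pr"],
    "magType": ["md", "mb", "mb", "mb", "mb", "Md"],
})

# Rows: reporting network, columns: magnitude type, cells: number of events.
table = pd.crosstab(e_sample["net"], e_sample["magType"])
print(table)
```

Combinations that never occur, such as `pr` with `mb`, appear as 0 in the resulting table.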

Display of the distribution of earthquake magnitudes for the different networks as a box plot with pandas boxplot():


In [ ]:
# Box plot of magnitude ('mag') grouped by contributing network ('net')
eData.boxplot(column='mag', by='net');
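The quantities a box plot draws (median, quartiles, extremes) can also be inspected as numbers. A minimal sketch with groupby() and describe(), using the `net` and `mag` values of rows 0 to 5 of the full table above; `e_sample` and `summary` are assumed names.

```python
import pandas as pd

# 'net' and 'mag' of rows 0-5 from the table above.
e_sample = pd.DataFrame({
    "net": ["hv", "us", "us", "us", "us", "pr"],
    "mag": [2.61, 4.80, 5.00, 4.80, 4.40, 2.80],
})

# Per-network summary statistics of the magnitude.
summary = e_sample.groupby("net")["mag"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```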
