This notebook is based on the first chapter sample from Machine Learning for Hackers, with some added features. I put it together to present Jupyter Notebook with Python 3 at a Tech Days event at my job.
The original download link is offline, so you need to get the file from the author's repository and place it in ../data relative to this notebook's directory.
I will assume the following questions need to be answered: which is the best place in the USA to spot UFO sightings, and which is the best month of the year for them?
In [1]:
import pandas as pd
import numpy as np
Here we are loading the dataset with pandas using a minimal set of options: the separator is the <TAB> special character, there is no header row, and every column is read as an object. Before starting to make assumptions about the data, I prefer to load it as objects and convert each column only after making sense of it. The data may also be corrupted, which would make an early cast impossible.
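To illustrate why an early cast can blow up, here is a minimal sketch with hypothetical inline data (not the UFO file): casting to int while reading fails on a corrupted value, while loading as object lets us convert afterwards with errors='coerce'.
In [ ]:
import io

# Hypothetical inline data with one corrupted numeric value ('x').
raw = "value\n1\n2\nx\n"

# Casting while reading raises ValueError on the bad row:
# pd.read_csv(io.StringIO(raw), dtype={'value': 'int64'})

# Loading as object and converting afterwards keeps control in our hands:
df = pd.read_csv(io.StringIO(raw), dtype=object)
df['value'] = pd.to_numeric(df['value'], errors='coerce')  # 'x' becomes NaN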
In [2]:
ufo = pd.read_csv(
    '../data/ufo_awesome.tsv',
    sep = "\t",              # tab-separated file
    header = None,           # the file has no header row
    dtype = object,          # load every column as object; cast later
    na_values = ['', 'NaN'], # treat empty strings and 'NaN' as missing
    error_bad_lines = False, # skip malformed lines...
    warn_bad_lines = False   # ...without printing warnings
)
With the data loaded into the ufo dataframe, let's check its composition and first set of rows.
In [3]:
ufo.describe()
Out[3]:
In [4]:
ufo.head()
Out[4]:
The dataframe's describe() shows us how many items (excluding NaN) each column has, how many are unique, which value is the most frequent, and how many times that value appears. head() simply shows us the first 5 rows (the first index is 0 in Python).
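If those four fields are not familiar, a tiny toy example (hypothetical data, not the UFO file) makes them concrete for an object column:
In [ ]:
# Toy object column to illustrate the describe() fields mentioned above.
toy = pd.DataFrame({'Shape': ['disk', 'light', 'disk', None]})
toy['Shape'].describe()
# count      3   <- items, NaN excluded
# unique     2   <- distinct values
# top     disk   <- most frequent value
# freq       2   <- how many times 'top' appears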
We need to handle the column names; to do so it is necessary to look at the data documentation. The table below shows the field details taken from the metadata:
Short name | Type | Description |
---|---|---|
sighted_at | Long | Date the event occurred (yyyymmdd) |
reported_at | Long | Date the event was reported |
location | String | City and State where event occurred |
shape | String | One word string description of the UFO shape |
duration | String | Event duration (raw text field) |
description | String | A long, ~20-30 line, raw text description |
To keep in sync with the R example, we will set the column names to the following values:
In [5]:
ufo.columns = [
'DateOccurred',
'DateReported',
'Location',
'Shape',
'Duration',
'LongDescription'
]
Now we have a good-looking dataframe with named columns.
In [6]:
ufo.head()
Out[6]:
To decide what to keep, let's get back to the questions to be answered. The first one is about the best place in the USA for UFO sightings; for this we will need the Location column, and at some point we will filter on it. The second question is about the best month for UFO sightings, which leads us to the DateOccurred column.
Based on this, the Shape and LongDescription columns can be stripped right now (their lack of relevance here is fairly obvious). But there are 2 other columns which may or may not be removed: DateReported and Duration.
I always keep in mind to retain, at least until a second pass, columns with some useful information for further data wrangling or for getting some statistical sense of them. DateReported holds a date (in YYYYMMDD format) and Duration a raw string that could yield something useful after treatment to convert it into a numeric format. For the purposes of this demo I am removing both: DateReported will not be used further (what matters is when the sighting occurred, not when it was registered), and Duration is really a mess; for an example to show on a Tech Day, the effort to decompose it is not worth it.
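Just to give a feeling of that effort, here is a rough sketch of a partial Duration parser; the helper and the patterns it handles are hypothetical, and the real field is far messier than this:
In [ ]:
import re

# Hypothetical helper: extract a duration in minutes from free text such as
# '5 minutes', '2 hrs' or '30 sec'; anything it cannot read becomes NaN.
def duration_to_minutes(text):
    match = re.search(r'(\d+(?:\.\d+)?)\s*(sec|min|hour|hr)', str(text).lower())
    if not match:
        return float('nan')
    value, unit = float(match.group(1)), match.group(2)
    return value * {'sec': 1 / 60, 'min': 1, 'hour': 60, 'hr': 60}[unit]

duration_to_minutes('about 2 hrs')  # -> 120.0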
The drop() call below takes the following parameters: labels, the list of columns to remove; axis = 1, so we drop columns instead of rows; and inplace = True, so the dataframe is modified in place.
In [7]:
ufo.drop(
labels = ['DateReported', 'Duration', 'Shape', 'LongDescription'],
axis = 1,
inplace = True
)
In [8]:
ufo.head()
Out[8]:
Now we are ready to start the data transformation: the DateOccurred column must be converted to Python datetime objects to allow time-series manipulation.
The first problem happens when trying to run this code, which uses pandas.to_datetime() to convert the strings:
ufo['DateOccurred'] = pd.Series([
pd.to_datetime(
date,
format = '%Y%m%d'
) for date in ufo['DateOccurred']
])
This will raise a series of errors (a stack trace) caused by this:
ValueError: time data '0000' does not match format '%Y%m%d' (match)
What happened here is bad data (welcome to the data science world: most data will come corrupted, missing, wrong, or with some other problem). Before proceeding we need to deal with the dates in the wrong format.
So what to do? Well, we can make the to_datetime() method ignore the errors, putting a NaT value in the field. Let's convert it and then see how the DateOccurred column looks.
In [9]:
ufo['DateOccurred'] = pd.Series([
pd.to_datetime(
date,
format = '%Y%m%d',
errors='coerce'
) for date in ufo['DateOccurred']
])
In [10]:
ufo.describe()
Out[10]:
The column is now a datetime object and has 60814 non-null elements against the original 61069, which shows the bad dates were coerced away (they are now NaT). The following code shows us how many elements were affected.
In [11]:
ufo['DateOccurred'].isnull().sum()
Out[11]:
There is no surprise that 60814 + 255 = 61069; we need to deal with these NaT values too.
So we have a DateOccurred field with some NaN values. At this point we need to make an important decision: get rid of the rows with NaN dates or fill them with something.
There is no universal guide for this; we could fill them with the mean of the column or copy the content of the DateReported column. But in this case the missing dates are less than 0.5% of the total, so for simplicity's sake we will simply drop all NaN values.
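For reference, the fill alternative could look like the toy sketch below (hypothetical data), assuming DateReported had been kept instead of dropped in In [7]:
In [ ]:
# Toy frame illustrating the fill strategy: missing occurrence dates
# are replaced by the corresponding reported dates.
toy = pd.DataFrame({
    'DateOccurred': pd.to_datetime(['19950612', None], format='%Y%m%d', errors='coerce'),
    'DateReported': pd.to_datetime(['19950613', '19950701'], format='%Y%m%d'),
})
toy['DateOccurred'] = toy['DateOccurred'].fillna(toy['DateReported'])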
In [12]:
ufo.isnull().sum()
Out[12]:
In [13]:
ufo.dropna(
axis = 0,
inplace = True
)
In [14]:
ufo.isnull().sum()
Out[14]:
In [15]:
ufo.describe()
Out[15]:
With the dataframe's dates clean, let's create another 2 columns to handle years and months separately. This will make some analyses easier (like discovering which month of the year is best to look for UFO sightings).
In [16]:
ufo['Year'] = pd.DatetimeIndex(ufo['DateOccurred']).year
ufo['Month'] = pd.DatetimeIndex(ufo['DateOccurred']).month
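As a side note, the same result can be obtained with the Series .dt accessor, which is the more common idiom in current pandas:
In [ ]:
ufo['Year'] = ufo['DateOccurred'].dt.year
ufo['Month'] = ufo['DateOccurred'].dt.month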
In [17]:
ufo.head()
Out[17]:
In [18]:
ufo['Month'].describe()
Out[18]:
In [19]:
ufo['Year'].describe()
Out[19]:
A funny thing about the years: the oldest sighting is from 1762! This dataset includes sightings from history.
How significant can this be? Well, to figure it out it's time to plot some charts. Humans are visual beings, and a picture is really worth much more than a bunch of numbers and words.
To do so we will use Python's matplotlib library (with seaborn on top) to build our graphs.
In [20]:
sightings_by_year = ufo.groupby('Year').size().reset_index()
sightings_by_year.columns = ['Year', 'Sightings']
sightings_by_year.describe()
Out[20]:
In [21]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
In [22]:
plt.style.use('seaborn-white')
In [23]:
%matplotlib inline
plt.xticks(rotation = 90)
sns.barplot(
data = sightings_by_year,
x = 'Year',
y = 'Sightings',
color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
We can see the number of sightings only becomes representative after around 1900, so we will filter the dataframe to keep all years from this threshold on.
In [30]:
ufo = ufo[ufo['Year'] >= 1900]
Now let's see how the graph behaves.
In [28]:
%matplotlib inline
new_sightings_by_year = ufo.groupby('Year').size().reset_index()
new_sightings_by_year.columns = ['Year', 'Sightings']
new_sightings_by_year.describe()
Out[28]:
In [29]:
%matplotlib inline
plt.xticks(rotation = 90)
sns.barplot(
data = new_sightings_by_year,
x = 'Year',
y = 'Sightings',
color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
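Finally, getting back to the first question (the best place in the USA for sightings), we split the Location column into separate City and State columns: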
In [27]:
locations = ufo['Location'].str.split(', ').apply(pd.Series)
ufo['City'] = locations[0]
ufo['State'] = locations[1]
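Locations that do not follow the 'City, ST' pattern will leave State as NaN after the split. A possible next step, assuming we only care about well-formed US locations, would be to keep only rows whose State looks like a two-letter abbreviation:
In [ ]:
# Keep only rows whose State field looks like a two-letter abbreviation.
ufo = ufo[ufo['State'].str.match(r'^[A-Za-z]{2}$', na=False)]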