USA UFO sightings (Python 3 version)

This notebook is based on the first chapter sample from Machine Learning for Hackers, with some added features. I did this to present Jupyter Notebook with Python 3 at a Tech Days event at my job.

The original link is offline, so you need to download the file from the author's repository and place it in ../data relative to the notebook directory.

I will assume the following questions need to be answered:

  • What is the best place to see UFO sightings in the USA?
  • What is the best month to see UFO sightings in the USA?

Loading the data

This first section handles loading the main data file using pandas.


In [1]:
import pandas as pd
import numpy as np

Here we load the dataset with pandas using a minimal set of options.

  • sep: since the file is in TSV format, the separator is the <TAB> special character;
  • na_values: the file has empty strings for NaN values;
  • header: do not treat the first line as a header, since the file lacks one;
  • dtype: load all columns as objects, avoiding automatic type inference¹;
  • error_bad_lines: skip lines with more fields than expected;
  • warn_bad_lines: set to False to avoid ugly warnings on the screen; enable it if you want to analyse the bad rows.

¹ Before making assumptions about the data, I prefer to load it as objects and convert it only after making sense of it. The data can also be corrupted, which would make casting impossible.


In [2]:
ufo = pd.read_csv(
    '../data/ufo_awesome.tsv', 
    sep = "\t",
    header = None,
    dtype = object,
    na_values = ['', 'NaN'],
    error_bad_lines = False,
    warn_bad_lines = False
)
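
A note on compatibility: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0, replaced by a single on_bad_lines option. On a newer pandas, a minimal equivalent of the cell above would be:

ufo = pd.read_csv(
    '../data/ufo_awesome.tsv',
    sep = '\t',
    header = None,
    dtype = object,
    na_values = ['', 'NaN'],
    on_bad_lines = 'skip'  # silently skip malformed lines
)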

With the data loaded in the ufo dataframe, let's check its composition and first set of rows.


In [3]:
ufo.describe()


Out[3]:
        0      1         2            3      4          5
count   61069  61069     61068        58556  58658      61067
unique  9220   5305      21851        31     9475       61043
top     0000   20050224  Seattle, WA  light  5 minutes  4 objects, light gray egg shaped.Clear sunny d...
freq    252    201       441          12140  2349       2

In [4]:
ufo.head()


Out[4]:
0 1 2 3 4 5
0 19951009 19951009 Iowa City, IA NaN NaN Man repts. witnessing &quot;flash, followed by...
1 19951010 19951011 Milwaukee, WI NaN 2 min. Man on Hwy 43 SW of Milwaukee sees large, bri...
2 19950101 19950103 Shelton, WA NaN NaN Telephoned Report:CA woman visiting daughter w...
3 19950510 19950510 Columbia, MO NaN 2 min. Man repts. son&apos;s bizarre sighting of smal...
4 19950611 19950614 Seattle, WA NaN NaN Anonymous caller repts. sighting 4 ufo&apos;s ...

The dataframe's describe() shows how many items (excluding NaN) each column has, how many are unique, which value is the most frequent, and how many times that value appears. head() simply shows the first 5 rows (indexing starts at 0 in Python).

Dealing with metadata and column names

We need to set the column names; to do so, it is necessary to look at the data documentation. The table below shows the field details taken from the metadata:

Short name Type Description
sighted_at Long Date the event occurred (yyyymmdd)
reported_at Long Date the event was reported
location String City and State where event occurred
shape String One word string description of the UFO shape
duration String Event duration (raw text field)
description String A long, ~20-30 line, raw text description

To keep in sync with the R example, we will set the columns names to the following values:

  • DateOccurred
  • DateReported
  • Location
  • Shape
  • Duration
  • LongDescription

In [5]:
ufo.columns = [
    'DateOccurred', 
    'DateReported', 
    'Location', 
    'Shape', 
    'Duration', 
    'LongDescription'
]
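
As an aside, the same naming can be done at load time through read_csv's names parameter, so no separate assignment is needed:

# Equivalent: supply the column names when loading the file
ufo = pd.read_csv(
    '../data/ufo_awesome.tsv',
    sep = '\t',
    header = None,
    names = ['DateOccurred', 'DateReported', 'Location',
             'Shape', 'Duration', 'LongDescription'],
    dtype = object,
    na_values = ['', 'NaN'],
    error_bad_lines = False,
    warn_bad_lines = False
)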

Now we have a good-looking dataframe with named columns.


In [6]:
ufo.head()


Out[6]:
DateOccurred DateReported Location Shape Duration LongDescription
0 19951009 19951009 Iowa City, IA NaN NaN Man repts. witnessing &quot;flash, followed by...
1 19951010 19951011 Milwaukee, WI NaN 2 min. Man on Hwy 43 SW of Milwaukee sees large, bri...
2 19950101 19950103 Shelton, WA NaN NaN Telephoned Report:CA woman visiting daughter w...
3 19950510 19950510 Columbia, MO NaN 2 min. Man repts. son&apos;s bizarre sighting of smal...
4 19950611 19950614 Seattle, WA NaN NaN Anonymous caller repts. sighting 4 ufo&apos;s ...

Data Wrangling

Now we start to transform our data into something we can analyse.

Keeping only necessary data

To decide what to keep, let's get back to the questions to be answered.

The first one is about the best place in the USA for UFO sightings; for this we will need the Location column, and at some point we will filter on it. The second question is about the best month for UFO sightings, which leads to the DateOccurred column.

Based on this, the Shape and LongDescription columns can be stripped right now (their lack of relevance here is fairly obvious). But there are two other columns that may or may not be removed: DateReported and Duration.

I always keep in mind to retain, for as long as possible, columns with potentially useful information, either for further data wrangling or to extract some statistical sense from them. These two columns hold a date (in YYYYMMDD format) and a string that could store something useful, given some treatment to convert it into a numeric form. For the purposes of this demo I am removing both: DateReported will not be used further (what matters is when the sighting occurred, not when it was registered), and Duration is a real mess; for an example to show at a Tech Day, the effort to decompose it is not worth it.

The drop() command below has the following parameters:

  • labels: columns to remove;
  • axis: set to 1 to remove columns;
  • inplace: set to True to modify the dataframe itself and return none.

In [7]:
ufo.drop(
    labels = ['DateReported', 'Duration', 'Shape', 'LongDescription'], 
    axis = 1,
    inplace = True
)
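
An equivalent approach, mostly a matter of taste, is to select the columns to keep rather than dropping the rest:

# Same effect as the drop() above: keep only what we need
ufo = ufo[['DateOccurred', 'Location']]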

In [8]:
ufo.head()


Out[8]:
DateOccurred Location
0 19951009 Iowa City, IA
1 19951010 Milwaukee, WI
2 19950101 Shelton, WA
3 19950510 Columbia, MO
4 19950611 Seattle, WA

Converting data

Now we are ready to start the data transformation: the date column must be converted to Python datetime objects to allow time series manipulation.

The first problem happens when trying to run this code, which uses pandas.to_datetime() to convert the strings:

ufo['DateOccurred'] = pd.Series([
    pd.to_datetime(
        date,
        format = '%Y%m%d'
    ) for date in ufo['DateOccurred']
])

This will raise a series of errors (a stack trace), caused by this:

ValueError: time data '0000' does not match format '%Y%m%d' (match)

What happened here is bad data (welcome to the data science world: most data will come corrupted, missing, wrong, or with some other problem). Before proceeding we need to deal with the dates in the wrong format.

So what to do? Well, we can make the to_datetime() method ignore the errors, putting NaT values in the field. Let's convert it and then see how the DateOccurred column looks.


In [9]:
ufo['DateOccurred'] = pd.Series([
    pd.to_datetime(
        date,
        format = '%Y%m%d',
        errors='coerce'
    ) for date in ufo['DateOccurred']
])
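
A side note: the list comprehension is not actually needed here, since to_datetime() accepts a whole Series and converts it in a vectorized (and much faster) way:

# Vectorized equivalent of the cell above
ufo['DateOccurred'] = pd.to_datetime(
    ufo['DateOccurred'],
    format = '%Y%m%d',
    errors = 'coerce'
)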

In [10]:
ufo.describe()


Out[10]:
DateOccurred Location
count 60814 61068
unique 9216 21851
top 1999-11-16 00:00:00 Seattle, WA
freq 191 441
first 1762-12-11 00:00:00 NaN
last 2010-08-30 00:00:00 NaN

The column is now a datetime object and describe() counts 60814 elements against the original 61069, which shows the bad dates became NaT (NaT is excluded from the count). The following code shows how many values were coerced.


In [11]:
ufo['DateOccurred'].isnull().sum()


Out[11]:
255

It is no surprise that 60814 + 255 = 61069; we need to deal with these values too.

So we have a DateOccurred field with some NaT values. At this point we need to make an important decision: get rid of the rows with missing dates, or fill them with something.

There is no universal guide for this; we could fill them with the mean of the column or copy the content of the DateReported column. But in this case the missing dates are less than 0.5% of the total, so for simplicity's sake we will simply drop every row with missing values.
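
A quick way to confirm that proportion:

# Fraction of unparseable dates: 255 / 61069, about 0.42%
ufo['DateOccurred'].isnull().mean()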


In [12]:
ufo.isnull().sum()


Out[12]:
DateOccurred    255
Location          1
dtype: int64

In [13]:
ufo.dropna(
    axis = 0,
    inplace = True
)
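
Note that dropna() as called above discards a row if any column is missing; with more columns around, the subset parameter restricts the check to the ones we care about:

# Drop rows only when these specific columns are missing
ufo.dropna(subset = ['DateOccurred', 'Location'], inplace = True)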

In [14]:
ufo.isnull().sum()


Out[14]:
DateOccurred    0
Location        0
dtype: int64

In [15]:
ufo.describe()


Out[15]:
DateOccurred Location
count 60813 60813
unique 9216 21763
top 1999-11-16 00:00:00 Seattle, WA
freq 191 440
first 1762-12-11 00:00:00 NaN
last 2010-08-30 00:00:00 NaN

With the dataframe dates clean, let's create another two columns to hold the year and month separately. This will make some analysis easier (like discovering which month of the year is best to look for UFO sightings).


In [16]:
ufo['Year'] = pd.DatetimeIndex(ufo['DateOccurred']).year
ufo['Month'] = pd.DatetimeIndex(ufo['DateOccurred']).month
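
An equivalent, arguably more idiomatic, way is the .dt accessor available on datetime columns:

# Same result via the .dt accessor
ufo['Year'] = ufo['DateOccurred'].dt.year
ufo['Month'] = ufo['DateOccurred'].dt.month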

In [17]:
ufo.head()


Out[17]:
DateOccurred Location Year Month
0 1995-10-09 Iowa City, IA 1995 10
1 1995-10-10 Milwaukee, WI 1995 10
2 1995-01-01 Shelton, WA 1995 1
3 1995-05-10 Columbia, MO 1995 5
4 1995-06-11 Seattle, WA 1995 6

In [18]:
ufo['Month'].describe()


Out[18]:
count    60813.000000
mean         6.767665
std          3.195859
min          1.000000
25%          4.000000
50%          7.000000
75%          9.000000
max         12.000000
Name: Month, dtype: float64

In [19]:
ufo['Year'].describe()


Out[19]:
count    60813.000000
mean      2000.597372
std         10.546298
min       1762.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2010.000000
Name: Year, dtype: float64

A funny thing about Year is that the oldest sighting is from 1762! This dataset includes sightings from history.

How significant can this be? Well, to figure it out, it's time to plot some charts. Humans are visual beings, and a picture is really worth much more than a bunch of numbers and words.

To do so we will use the matplotlib library, along with seaborn, to build our graphs.

Analysing the years

Before starting, let's count the sightings by year.

The commands below are equivalent to the following SQL code:

SELECT Year, count(*) AS Sightings
FROM ufo
GROUP BY Year

In [20]:
sightings_by_year = ufo.groupby('Year').size().reset_index()
sightings_by_year.columns = ['Year', 'Sightings']
sightings_by_year.describe()


Out[20]:
Year Sightings
count 98.000000 98.000000
mean 1954.000000 620.540816
std 44.355639 1313.199494
min 1762.000000 1.000000
25% 1935.250000 2.000000
50% 1961.500000 70.500000
75% 1985.750000 251.750000
max 2010.000000 5080.000000
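
The same aggregation can also be written with value_counts(), which may read more naturally for a single-column count:

# Equivalent: counts per year, ordered by year
sightings_by_year = (
    ufo['Year'].value_counts().sort_index()
               .rename_axis('Year').reset_index(name = 'Sightings')
)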

In [21]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

In [22]:
plt.style.use('seaborn-white')
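
If you are on matplotlib 3.6 or newer, the built-in seaborn styles were renamed, so the call above needs the versioned name:

# On matplotlib >= 3.6 the style was renamed
plt.style.use('seaborn-v0_8-white')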

In [23]:
%matplotlib inline

plt.xticks(rotation = 90)
sns.barplot(
    data = sightings_by_year, 
    x = 'Year', 
    y = 'Sightings',
    color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))


We can see that the number of sightings is more representative after around 1900, so we will filter the dataframe to keep all years from 1900 onwards.


In [30]:
ufo = ufo[ufo['Year'] >= 1900]

Now let's see how the graph behaves.


In [28]:
%matplotlib inline

new_sightings_by_year = ufo.groupby('Year').size().reset_index()
new_sightings_by_year.columns = ['Year', 'Sightings']
new_sightings_by_year.describe()


Out[28]:
Year Sightings
count 88.000000 88.000000
mean 1964.863636 690.943182
std 28.331852 1368.787250
min 1901.000000 1.000000
25% 1944.750000 10.750000
50% 1966.500000 146.000000
75% 1988.250000 261.500000
max 2010.000000 5080.000000

In [29]:
%matplotlib inline

plt.xticks(rotation = 90)
sns.barplot(
    data = new_sightings_by_year, 
    x = 'Year', 
    y = 'Sightings',
    color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))


Handling location

Here we will take two steps: first, split every location into city and state, for the USA only; second, load a dataset with the latitude and longitude of each USA city for a future merge.


In [27]:
locations = ufo['Location'].str.split(', ').apply(pd.Series)
ufo['City'] = locations[0]
ufo['State'] = locations[1]
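
Two caveats worth hedging here: a location with more than one comma spills into extra columns with this split, and non-USA entries end up with a NaN or invalid state. A sketch of a more defensive version, splitting only on the last comma and keeping rows whose state looks like a two-letter code:

# Split on the last ', ' only, so extra commas stay in the city part
locations = ufo['Location'].str.rsplit(', ', n = 1, expand = True)
ufo['City'] = locations[0]
ufo['State'] = locations[1]

# Keep rows whose state is exactly two characters (NaN compares as False)
ufo = ufo[ufo['State'].str.len() == 2]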

In [ ]: