This notebook is based on the first chapter sample from Machine Learning for Hackers, with some added features. I put it together to present Jupyter Notebook with Python 3 at a Tech Days event at my job.
The original download link is offline, so you need to get the file from the author's repository and place it in ../data relative to this notebook's directory.
I will assume the following questions need to be answered: which is the best place in the USA to spot UFO sightings, and which is the best month of the year for them?
In [1]:
import pandas as pd
import numpy as np
Here we are loading the dataset with pandas using a minimal set of options: the separator is the <TAB> special character, there is no header row, and every column is read as an object. Before starting to make assumptions about the data, I prefer to load it as objects and convert each column only after making sense of it. The data may also be corrupted, which would make an early cast impossible.
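To illustrate why an early cast can blow up, here is a minimal sketch with hypothetical inline data (not the UFO file): casting to int while reading fails on a corrupted value, while loading as object lets us convert afterwards with errors='coerce'.
In [ ]:
import io

# Hypothetical inline data with one corrupted numeric value ('x').
raw = "value\n1\n2\nx\n"

# Casting while reading raises ValueError on the bad row:
# pd.read_csv(io.StringIO(raw), dtype={'value': 'int64'})

# Loading as object and converting afterwards keeps control in our hands:
df = pd.read_csv(io.StringIO(raw), dtype=object)
df['value'] = pd.to_numeric(df['value'], errors='coerce')  # 'x' becomes NaN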
In [2]:
ufo = pd.read_csv(
    '../data/ufo_awesome.tsv',
    sep = "\t",              # tab-separated file
    header = None,           # the file has no header row
    dtype = object,          # load every column as object; cast later
    na_values = ['', 'NaN'], # treat empty strings and 'NaN' as missing
    error_bad_lines = False, # skip malformed lines...
    warn_bad_lines = False   # ...without printing warnings
)
With the data loaded into the ufo dataframe, let's check its composition and first set of rows.
In [3]:
ufo.describe()
Out[3]:
In [4]:
ufo.head()
Out[4]:
The dataframe's describe() shows us how many items (excluding NaN) each column has, how many are unique, which value is the most frequent, and how many times that value appears. head() simply shows us the first 5 rows (the first index is 0 in Python).
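If those four fields are not familiar, a tiny toy example (hypothetical data, not the UFO file) makes them concrete for an object column:
In [ ]:
# Toy object column to illustrate the describe() fields mentioned above.
toy = pd.DataFrame({'Shape': ['disk', 'light', 'disk', None]})
toy['Shape'].describe()
# count      3   <- items, NaN excluded
# unique     2   <- distinct values
# top     disk   <- most frequent value
# freq       2   <- how many times 'top' appears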
We need to handle the column names; to do so it is necessary to look at the data documentation. The table below shows the field details taken from the metadata:
Short name | Type | Description |
---|---|---|
sighted_at | Long | Date the event occurred (yyyymmdd) |
reported_at | Long | Date the event was reported |
location | String | City and State where event occurred |
shape | String | One word string description of the UFO shape |
duration | String | Event duration (raw text field) |
description | String | A long, ~20-30 line, raw text description |
To keep in sync with the R example, we will set the column names to the following values:
In [5]:
ufo.columns = [
'DateOccurred',
'DateReported',
'Location',
'Shape',
'Duration',
'LongDescription'
]
Now we have a good-looking dataframe with named columns.
In [6]:
ufo.head()
Out[6]:
To decide what to keep, let's get back to the questions to be answered. The first one is about the best place in the USA for UFO sightings; for this we will need the Location column, and at some point we will filter on it. The second question is about the best month for UFO sightings, which leads us to the DateOccurred column.
Based on this, the Shape and LongDescription columns can be stripped right now (their lack of relevance here is fairly obvious). But there are 2 other columns which may or may not be removed: DateReported and Duration.
I always keep in mind to retain, at least until a second pass, columns with some useful information for further data wrangling or for getting some statistical sense of them. DateReported holds a date (in YYYYMMDD format) and Duration a raw string that could yield something useful after treatment to convert it into a numeric format. For the purposes of this demo I am removing both: DateReported will not be used further (what matters is when the sighting occurred, not when it was registered), and Duration is really a mess; for an example to show on a Tech Day, the effort to decompose it is not worth it.
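Just to give a feeling of that effort, here is a rough sketch of a partial Duration parser; the helper and the patterns it handles are hypothetical, and the real field is far messier than this:
In [ ]:
import re

# Hypothetical helper: extract a duration in minutes from free text such as
# '5 minutes', '2 hrs' or '30 sec'; anything it cannot read becomes NaN.
def duration_to_minutes(text):
    match = re.search(r'(\d+(?:\.\d+)?)\s*(sec|min|hour|hr)', str(text).lower())
    if not match:
        return float('nan')
    value, unit = float(match.group(1)), match.group(2)
    return value * {'sec': 1 / 60, 'min': 1, 'hour': 60, 'hr': 60}[unit]

duration_to_minutes('about 2 hrs')  # -> 120.0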
The drop() call below takes the following parameters: labels, the list of columns to remove; axis = 1, so we drop columns instead of rows; and inplace = True, so the dataframe is modified in place.
In [7]:
ufo.drop(
labels = ['DateReported', 'Duration', 'Shape', 'LongDescription'],
axis = 1,
inplace = True
)
In [8]:
ufo.head()
Out[8]:
Now we are ready to start the data transformation: the DateOccurred column must be converted to Python datetime objects to allow time-series manipulation.
The first problem happens when trying to run this code, which uses pandas.to_datetime() to convert the strings:
ufo['DateOccurred'] = pd.Series([
pd.to_datetime(
date,
format = '%Y%m%d'
) for date in ufo['DateOccurred']
])
This will raise a series of errors (a stack trace) caused by this:
ValueError: time data '0000' does not match format '%Y%m%d' (match)
What happened here is bad data (welcome to the data science world: most data will come corrupted, missing, wrong, or with some other problem). Before proceeding we need to deal with the dates in the wrong format.
So what to do? Well, we can make the to_datetime() method ignore the errors, putting a NaT value in the field. Let's convert it and then see how the DateOccurred column looks.
In [9]:
ufo['DateOccurred'] = pd.Series([
pd.to_datetime(
date,
format = '%Y%m%d',
errors='coerce'
) for date in ufo['DateOccurred']
])
In [10]:
ufo.describe()
Out[10]:
The column is now a datetime object and has 60814 non-null elements against the original 61069, which shows the bad dates were coerced away (they are now NaT). The following code shows us how many elements were affected.
In [11]:
ufo['DateOccurred'].isnull().sum()
Out[11]:
There is no surprise that 60814 + 255 = 61069; we need to deal with these NaT values too.
So we have a DateOccurred field with some NaN values. At this point we need to make an important decision: get rid of the rows with NaN dates or fill them with something.
There is no universal guide for this; we could fill them with the mean of the column or copy the content of the DateReported column. But in this case the missing dates are less than 0.5% of the total, so for simplicity's sake we will simply drop all NaN values.
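For reference, the fill alternative could look like the toy sketch below (hypothetical data), assuming DateReported had been kept instead of dropped in In [7]:
In [ ]:
# Toy frame illustrating the fill strategy: missing occurrence dates
# are replaced by the corresponding reported dates.
toy = pd.DataFrame({
    'DateOccurred': pd.to_datetime(['19950612', None], format='%Y%m%d', errors='coerce'),
    'DateReported': pd.to_datetime(['19950613', '19950701'], format='%Y%m%d'),
})
toy['DateOccurred'] = toy['DateOccurred'].fillna(toy['DateReported'])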
In [12]:
ufo.isnull().sum()
Out[12]:
In [13]:
ufo.dropna(
axis = 0,
inplace = True
)
In [14]:
ufo.isnull().sum()
Out[14]:
In [15]:
ufo.describe()
Out[15]:
With the dataframe's dates clean, let's create another 2 columns to handle years and months separately. This will make some analyses easier (like discovering which month of the year is best to look for UFO sightings).
In [16]:
ufo['Year'] = pd.DatetimeIndex(ufo['DateOccurred']).year
ufo['Month'] = pd.DatetimeIndex(ufo['DateOccurred']).month
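As a side note, the same result can be obtained with the Series .dt accessor, which is the more common idiom in current pandas:
In [ ]:
ufo['Year'] = ufo['DateOccurred'].dt.year
ufo['Month'] = ufo['DateOccurred'].dt.month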
In [17]:
ufo.head()
Out[17]:
In [18]:
ufo['Month'].describe()
Out[18]:
In [19]:
ufo['Year'].describe()
Out[19]:
A funny thing about the years: the oldest sighting is from 1762! This dataset includes sightings from history.
How significant can this be? Well, to figure it out it's time to plot some charts. Humans are visual beings, and a picture is really worth much more than a bunch of numbers and words.
To do so we will use Python's matplotlib library (with seaborn on top) to build our graphs.
In [20]:
sightings_by_year = ufo.groupby('Year').size().reset_index()
sightings_by_year.columns = ['Year', 'Sightings']
sightings_by_year.describe()
Out[20]:
In [21]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
In [22]:
plt.style.use('seaborn-white')
In [23]:
%matplotlib inline
plt.xticks(rotation = 90)
sns.barplot(
data = sightings_by_year,
x = 'Year',
y = 'Sightings',
color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
We can see the number of sightings only becomes representative after around 1900, so we will filter the dataframe to keep all years from this threshold on.
In [30]:
ufo = ufo[ufo['Year'] >= 1900]
Now let's see how the graph behaves.
In [28]:
%matplotlib inline
new_sightings_by_year = ufo.groupby('Year').size().reset_index()
new_sightings_by_year.columns = ['Year', 'Sightings']
new_sightings_by_year.describe()
Out[28]:
In [29]:
%matplotlib inline
plt.xticks(rotation = 90)
sns.barplot(
data = new_sightings_by_year,
x = 'Year',
y = 'Sightings',
color= 'blue'
)
ax = plt.gca()
ax.xaxis.set_major_locator(ticker.MultipleLocator(base=5))
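Finally, getting back to the first question (the best place in the USA for sightings), we split the Location column into separate City and State columns: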
In [27]:
locations = ufo['Location'].str.split(', ').apply(pd.Series)
ufo['City'] = locations[0]
ufo['State'] = locations[1]
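Locations that do not follow the 'City, ST' pattern will leave State as NaN after the split. A possible next step, assuming we only care about well-formed US locations, would be to keep only rows whose State looks like a two-letter abbreviation:
In [ ]:
# Keep only rows whose State field looks like a two-letter abbreviation.
ufo = ufo[ufo['State'].str.match(r'^[A-Za-z]{2}$', na=False)]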