Unidentified Flying Objects (UFOs) have long been a topic of interest for enthusiasts, and people all over the United States report sightings online at the National UFO Reporting Center (NUFORC). Some of these reports are hoaxes, and among those that seem legitimate, there is currently no established method to confirm that they are indeed events related to flying objects from outer space. However, the database provides a wealth of information that can be exploited for various analyses and insights, such as social reporting, identifying real-time spatial events, and much more. We perform analysis to localize these time-series geospatial events and correlate them with known real-time events. This paper does not confirm the legitimacy of any alien activity, but rather attempts to gather information from likely legitimate reports of UFOs by studying the online reports. These events occur in geospatial clusters and are also time-based. We present a scheme consisting of feature extraction by filtering related datasets over a 24-hour time band, using multi-dimensional textual summaries along with geospatial information to determine the best clusters of UFO activity. We then use cluster density and data visualization to search the space of cluster realizations and decide the most probable clusters, which provide information about the proximity of such activity. A random forest classifier is also presented to separate true events from hoax events, using the best features available, such as region, week, time period, and duration. Lastly, we show the performance of the scheme on various days and discover interesting correlations with real-time events.
In [375]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pandas as pd
import numpy as np
import geocoder
import re
import math
The UFO database, as you can see below, has the following columns:
* Date/Time of the event.
* City where the event was reported.
* State of the city.
* Shape that the observer thought they saw.
* Duration of the event.
* Summary - description of the UFO event.
* Date posted by the UFO website.
In [376]:
ufo_data = pd.read_csv("data/ufo/ufo_data.csv", sep='\t')
ufo_data.head()
Out[376]:
The data used in this research is collected and made public by the National UFO Reporting Center, launched in 1974. The NUFORC site hosts an extensive database of UFO sighting reports that are submitted either online or through a 24-hour telephone hotline. The data undergoes an internal quality check by the NUFORC staff before being made public and, at the moment, presents one of the most comprehensive UFO report databases available online. It provides the following information: Date/Time, City, State, Shape, Duration, Summary, and Posting date. The data is occasionally used for local news reports as well as broader-level reporting.
The Date/Time column needs to be parsed to extract the date components. The datetime utility cannot easily be used, as the Date format does not come with padded zeros for single digits.
In [377]:
# Split the 'Date/Time' string into month, day, date, and time components.
ufo_data['Month'] = [int(r.split('/')[0]) for r in ufo_data['Date/Time']]
ufo_data['Day'] = [int(r.split('/')[1]) for r in ufo_data['Date/Time']]
ufo_data['Date'] = [(r.split(' ')[0]) for r in ufo_data['Date/Time']]
ufo_data['Time'] = [(r.split(' ')[-1]) for r in ufo_data['Date/Time']]
Let us bin the events by time, since the reported times are continuous and recorded by the minute. Binning the events helps us analyse statistics such as how many events occurred around noon.
In [378]:
def time_period(time):
    '''
    Convert times into periods 1, 2, ..., 12. If the time is hh:mm
    and 2i <= hh < 2(i+1), then hh:mm belongs to period i+1.
    For example, 6:30am has hh = 6; since 2*3 <= 6 < 2*4, it belongs to period 4.
    Args:
        time (iterable): Series of time strings in hh:mm format.
    Returns:
        periods (list): Period index (1-12) per entry, or -1 if the time cannot be parsed.
    '''
    periods = []
    for t in time:
        try:
            p = int(t.split(':')[0])
            for i in range(12):
                if (p >= 2*i) & (p < 2*(i+1)):
                    periods.append(i+1)
        except ValueError:
            periods.append(-1)
    return periods
ufo_data['TimePeriod'] = time_period(ufo_data['Time'])
ufo_data.head(1)
Out[378]:
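As a quick sanity check of the binning, here is a hypothetical snippet (not part of the original notebook):
In [ ]:
# "06:30" falls in period 4 (since 6 <= 6 < 8), "23:15" in period 12,
# and an unparseable string maps to -1.
print(time_period(["06:30", "23:15", "not a time"]))  # [4, 12, -1]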
Let us look at the events where the duration is reported as null.
In [379]:
null_data = ufo_data[ufo_data.Duration.isnull()]
null_data.head(1)
Out[379]:
In [380]:
def duration_sec(duration_text):
    '''
    Extract the duration in seconds by inferring the unit of measurement
    from the free-form duration text, so that all durations share a
    normalized unit (seconds).
    Args:
        duration_text (str): Duration text, e.g. "5 minutes".
    Returns:
        (int): Time duration in seconds, or None if it cannot be parsed.
    '''
    try:
        metric_text = ["second","s","Second","minute","m","min","Minute","hour","h","Hour"]
        metric_seconds = [1,1,1,60,60,60,60,3600,3600,3600]
        for m, st in zip(metric_text, metric_seconds):
            regex = r"\s*(\d+)\+?\s*{}s?".format(m)
            a = re.findall(regex, duration_text)
            if len(a) > 0:
                return int(int(a[0]) * st)
        # No unit matched anywhere in the text.
        return None
    except TypeError:
        # Missing (NaN) durations are handled later with fillna.
        return None
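A few hypothetical examples of what the extraction returns:
In [ ]:
print(duration_sec("about 5 minutes"))  # 300
print(duration_sec("2 hours"))          # 7200
print(duration_sec("a few moments"))    # None (no parseable unit)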
In [381]:
ufo_data["Duration_Sec"] = ufo_data["Duration"].apply(duration_sec)
ufo_data["Duration_Sec"] = ufo_data.Duration_Sec.fillna(int(ufo_data.Duration_Sec.mean()))
ufo_data['Duration_Sec'].unique()
Out[381]:
From the above, you can see that there are no longer null values in Duration_Sec.
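As a quick check (a hypothetical snippet, not in the original notebook), the remaining null count can be verified directly:
In [ ]:
# Should print 0 after the fillna above.
print(ufo_data['Duration_Sec'].isnull().sum())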
Let us look at the unique states and see which states report the most UFO events.
In [382]:
sns.set(style="darkgrid")
plt.figure(figsize=(8, 12))
sns.countplot(y="State", data=ufo_data, palette="Greens_d");
Let us look at how many of these are in the United States.
In [383]:
all_states = ufo_data['State'].value_counts(dropna=False)
US = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE",
"FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY",
"LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT",
"NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH",
"OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT",
"VT", "VA", "WA", "WV", "WI", "WY"]
print([state for state in all_states.index if state not in US])
We can see from the above that many state codes fall outside of the United States. Since we want to study events that occurred in the United States, let us create a column "US".
In [384]:
def is_US(state):
    '''
    Check if the state is in the United States or not.
    Args:
        state (str): State reported in the UFO data.
    Returns:
        (boolean): True if the state is in the US, False otherwise.
    '''
    return state in US
In [385]:
ufo_data['US'] = ufo_data['State'].apply(is_US)
ufo_data.head(1)
Out[385]:
Let us see how much data we stand to lose by restricting to events inside the US, by plotting a count plot with a hue.
In [386]:
plt.figure(figsize=(8, 12))
g = sns.countplot(y="State", hue="US", data=ufo_data)
From the count plot we can see that only a small percentage of the data is lost by ignoring states outside of the United States. Let us keep only the states within the US for our initial analysis.
In [387]:
ufo_data = ufo_data[ufo_data.US == 1]
Let us explore the City column. It is important to inspect many values in the dataset to check for anomalies or noise. For example, the city entries contain noise such as additional information within parentheses.
In [388]:
#print(ufo_data['City'].unique()[0:1000])
Transform the City column to exclude all irrelevant text entries (e.g., additional comments).
In [389]:
def clean_city_data(city_name):
    '''
    Cleans the city string of additional comments and irrelevant data.
    Args:
        city_name (str): Name of the city.
    Returns:
        (str): Cleaned city name.
    '''
    try:
        city_name = city_name.split('/')[0]
        city_name = city_name.split('(')[0]
        city_name = city_name.split(',')[0]
        city_name = city_name.split('?')[0]
        return city_name
    except AttributeError:
        # Non-string (e.g., NaN) entries.
        return 'Unknown'
ufo_data['City'] = ufo_data['City'].apply(clean_city_data)
ufo_data['City'].head()
Out[389]:
These events are reported in various cities. We need latitude and longitude information to perform geospatial analysis. The process of converting an address to a latitude and longitude is called forward geocoding. The geopy library is useful for forward geocoding: it connects to a network service, looks up the address, and returns the latitude and longitude. The code below was used to determine the coordinates. We have already run it and generated a file, so do not run this portion of the code; it also takes a while to look up all the addresses.
In [390]:
# from geopy.geocoders import ArcGIS
# from geopy.exc import GeocoderTimedOut
# Latitude = []
# Longitude = []
# geolocator = ArcGIS()
# fails = []
# for i in range(len(df)):
#     try:
#         location = geolocator.geocode(df.iloc[i,1] + ',' + df.iloc[i,2])
#         df.iloc[i,-2] = location.latitude
#         df.iloc[i,-1] = location.longitude
#         print(i, location.address, df.iloc[i,-2], df.iloc[i,-1])
#         df.to_csv('data_coord_1.csv', sep='\t', encoding='utf-8', index=False)
#     except (AttributeError, GeocoderTimedOut) as e:
#         df.to_csv('data_coord_1.csv', sep='\t', encoding='utf-8', index=False)
#         print('exception:', i)
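For reference, here is a minimal sketch of forward geocoding a single address with geopy's ArcGIS geocoder. It assumes network access, and the "Seattle, WA" query is just an illustrative example:
In [ ]:
# Look up one address and print its coordinates.
from geopy.geocoders import ArcGIS
geolocator = ArcGIS()
location = geolocator.geocode("Seattle, WA")
print(location.latitude, location.longitude)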
Now that we have cleaned and processed the data, let us extract the relevant columns from the dataframe; these will be our features for modeling.
In [391]:
features = ['Date', 'Month', 'Day', 'Time', 'TimePeriod', 'City', 'State', 'Lat', 'Long',
            'Shape', 'Duration', 'Duration_Sec', 'Summary', 'Posted', 'US']
ufo_data = pd.DataFrame(ufo_data, columns = features)
ufo_data.head(1)
Out[391]:
In [392]:
# Load the pre-geocoded data generated by the (commented-out) geocoding step above,
# and keep only rows with valid coordinates.
ufo_data = pd.read_csv("data/ufo/data_coord.csv", sep='\t')
ufo_data = ufo_data[(~ufo_data.Long.isnull()) & (~ufo_data.Lat.isnull())]
In [393]:
ufo_data['Date'] = pd.to_datetime(ufo_data['Date'])
ufo_data = ufo_data.reset_index(drop=True)
# Derive calendar features from the parsed date.
ufo_data['WeekDay'] = ufo_data['Date'].dt.dayofweek
ufo_data['Week'] = ufo_data['Date'].dt.weekofyear
ufo_data['Quarter'] = ufo_data['Date'].dt.quarter
ufo_data['Year'] = ufo_data['Date'].dt.year
In [394]:
ufo_data.columns
Out[394]:
In order to have a better understanding of the UFO reports, let us add external state-level data sources (population and active-duty military counts):
In [395]:
ufo_stats = pd.read_excel("data/ufo/stats2.xlsx")
ufo_stats.head(2)
Out[395]:
We can now merge the two datasets on state abbreviations.
In [396]:
ufo_stats['State'] = ufo_stats['state abbr']
ufo_data = pd.merge(ufo_data, ufo_stats, on='State')
ufo_data.columns
Out[396]:
In [397]:
# Note: '2014 Popualtion' is the column name as spelled in the source spreadsheet.
ufo_data.loc[ufo_data['Year'] == 2014, 'Pop'] = ufo_data['2014 Popualtion'][ufo_data['Year'] == 2014]
ufo_data.loc[ufo_data['Year'] == 2015, 'Pop'] = ufo_data['2015 Population'][ufo_data['Year'] == 2015]
# Military share per state; 2015 reuses the 2014 active-duty counts, presumably the most recent available.
ufo_data.loc[ufo_data['Year'] == 2014, 'Milit_Share'] = ufo_data['Number of Active Duty members 2014'][ufo_data['Year'] == 2014]/ufo_data['2014 Popualtion'][ufo_data['Year'] == 2014]
ufo_data.loc[ufo_data['Year'] == 2015, 'Milit_Share'] = ufo_data['Number of Active Duty members 2014'][ufo_data['Year'] == 2015]/ufo_data['2015 Population'][ufo_data['Year'] == 2015]
The inevitable presence of identified flying object (IFO) reports in the dataset can, in fact, be considered an added value, since the non-UFO reports are still indicative of actual events taking place. Therefore, our analysis focuses on the events that are reported as UFOs, regardless of whether they represent alien activity or are later recognized as IFOs. In addition to general reporting trends, the analysis of NUFORC data can offer insight into UFO perception and report validity, as some reports are labeled as hoaxes by NUFORC.
In [398]:
# Add a Validity column derived from the Summary column:
# 0 if NUFORC flagged the report (HOAX / NUFORC Note), 1 otherwise.
pattern = '|'.join(["HOAX", "NUFORC Note"])
ufo_data['Validity'] = ufo_data.Summary.str.contains(pattern)
def binary_convert(value):
    if value == True:
        return 0
    else:
        return 1
ufo_data['Validity'] = ufo_data['Validity'].apply(binary_convert)
In [399]:
ufo_data_hoax = ufo_data[['Summary','Validity']]
ufo_data_hoax[ufo_data_hoax.Validity==0].shape
Out[399]:
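The abstract mentions a random forest classifier for separating likely-true reports from hoaxes. Below is a minimal sketch of such a classifier using scikit-learn, with features available at this point (TimePeriod, Week, Duration_Sec, and a categorical encoding of State as a region proxy). The feature list is an assumption; adjust the names to match ufo_data.columns, and note that the paper's actual feature set and hyperparameters may differ.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Hypothetical feature set; adjust to the columns actually present.
clf_features = ['TimePeriod', 'Week', 'Duration_Sec']
model_df = ufo_data.dropna(subset=clf_features + ['Validity']).copy()
# Encode the state as a categorical integer code (region proxy).
model_df['StateCode'] = model_df['State'].astype('category').cat.codes
X = model_df[clf_features + ['StateCode']]
y = model_df['Validity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Held-out accuracy:", rf.score(X_test, y_test))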
In [400]:
# sns.set(style="white", palette="muted", color_codes=True)
g = sns.distplot(ufo_data.groupby(['State'])['Time'].count().nlargest(10), color="r")
In [401]:
ufo_data['WeekEnd'] = "WeekDay"
ufo_data.loc[ufo_data['WeekDay'] > 4, 'WeekEnd'] = "WeekEnd"
Let us look at the states that reported the most UFO events and examine their violin plots.
In [438]:
ufo_states = ufo_data.groupby('State')['Year'].count().nlargest(10)
In [443]:
plt.figure(figsize=(10, 10))
sns.set(style="whitegrid", palette="pastel", color_codes=True)
sns.violinplot(y="State", x="Validity", hue="WeekEnd",
data=ufo_data.loc[ufo_data['State'].isin(ufo_states.index.tolist())],
split=True, inner="quart", palette={"WeekDay": "b", "WeekEnd": "y"})
Out[443]:
Let us look at the most frequently reported shapes.
In [447]:
ufo_shapes = ufo_data.groupby('Shape')['Year'].count().nlargest(10)
print(ufo_shapes)
In [451]:
sns.set(style="whitegrid")
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 8))
# Draw a violinplot with a narrower bandwidth than the default
sns.violinplot(y="Validity", x="Shape", hue="WeekEnd", data=ufo_data.loc[ufo_data['Shape'].isin(ufo_shapes.index.tolist())],
split=True, inner="quart", palette="Set3", bw=.2, cut=2, linewidth=1)
# Finalize the figure
sns.despine(left=True, bottom=True)
Remove the rows where State or City is unknown. Alternatively, the missing values could be imputed.
In [ ]:
ufo_data = ufo_data[~ufo_data.State.isnull() & ~ufo_data.City.isnull()]
ufo_data.isnull().sum()
In [ ]:
import folium
# Get a basic world map.
UFOmap = folium.Map(location=[30, 0], zoom_start=2)
# Draw a circle marker for each sighting.
for name, row in ufo_data.iterrows():
    folium.CircleMarker(location=[row["Lat"], row["Long"]],
                        radius=2, popup=row["City"]).add_to(UFOmap)
# Save and display the map.
UFOmap.save('UFOmap.html')
UFOmap