What is the likelihood of a gap?

Our goal is to develop a statistical model the tells us, for a given vessel, the chance that the satellites will fail to receive a message in a given period of time because the density of vessels in the surrounding area is too high. We will then look for the very unlikely gaps and classify these as events that were likely due to the AIS being shut off.

This notebook focuses on finding the relationship between density of vessels and the number of positions the satellite sees from a given vessel.

A few challenges:

  • The odds will change depending on the number of satellites in the sky, and how well the satellites are operating. Thus we'll have to modify the model as more satellites are launched.
  • Satellites don't cover the globe uniformly. They will spend more time at different latitudes, meaning that there is likely going to be a function of latitude in this equation.
  • Some vessels appear to have weak AIS signals. It would be useful to identify the poorly operating devices.
  • AIS devices broadcast more frequently if a vessel is moving more quickly or changing its course. This has to be accounted for somehow.
  • There are probably other weird stuff going on too, such as the satellites more easily downloading data in some parts of the world than others.
  • We are using Orbcomm data, which isn't as good as ExactEarth. Also, ExactEarth satellites are supposedly much better at this interference problem, meaning that the best strategy is probably just to wait until we have ExactEarth data.

Initial Assumptions about AIS Interference

We assume that at a given time T, the satellite can hear S signals, each vessel broadcasts one signal, and there are N other vessels in the area. Then the chance that your vessels' signal will not interfere with any other vessels is ((C-1)/C)^N. For instance, if N=1, and there are 500 channels, then the odds are 499/500 that your signal will go through -- only one out of every 500 times will your signal and another vessel's overlap. (AIS is, I think, smart about not broadcasting at the same time when the vessels are in line of sight with each other -- about 100km radius -- but the satellites see an area of the earth ~1500km radius, so we can approximate that the vessels can't see each other.)

So, the odds that you'll be heard in a given time T are proportional to ((C-1)/C)^N, which is the same as (1-1/C)^N.

If the satellite stayed direclty overhead, you'd see that positions/day = K (1-1/C)^N, where K is some constant, N is the number of vessels in the footprint of the satellite. But, the satellites move. And you can't just average the number of vessels -- you have to average this function.

Sometimes the vessel you care about will be at the corner of the satellite's radius, and sometimes the satellite will be directly above the vessel. The total number of vessels seen by the satellite (N) could be very different in those two cases, and you can't just average them. So, we need to calculate the number of vessels the satellite sees at every position of its orbit, and how frequently the satellite is in each position, and which positions the satellite can see a given vessel. For simplicity, let's say the satellite visits every lat and lon equally (later we can correct for latitude).

So, for each lat and lon, we calculate N, the number of vessels within the satellite radius. Then, for each lat and lon, we calculate (1-1/C)^n, and average this across the satellite radius (to get every position of the satellite that will see the vessels.

The Frequency of AIS Brodcast

The movement of the vessel affects how frequently it will try to broadcast its location. Based on http://www.milltechmarine.com/faq.htm:

Class A: Ships Dynamic Conditions | Dual Channel Receiver | Single Channel Receiver
Ship at anchor or moored | 3 min | 6 min
SOG 0-14 knots | 10 sec | 20 sec
SOG 0-14 knots, changing course | 3.3 sec | 6.6 sec
SOG 14-23 knots | 6 sec | 12 sec
SOG 14-23 knots, changing course| 2 sec | 4 sec
SOG >23 knots | 2 sec | 4 sec
Ship Static Information | 6 min | 12 min

Class B:
Ships Dynamic Conditions | Dual Channel Receiver | Single Channel Receiver SOG < 2 knots | 3 min | 6 min
SOG > 2 knots | 10 sec | 20 sec
Ship Static Information | 6 min | 12 min

Message Types Class A: 1,2,3 Class B: 18,19

For a first cut, we will analyze the positions of a vessel that never goes slower than 2 knots if it is Class B, or is never 0 knots if it is Class A, as these messages are supposed to be ~20 to ~60 times less frequent. Otherwise we'll ignore the effects of speed and course changes on positions.


In [5]:
import time
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib import colors,colorbar
%matplotlib inline
import csv 
import math
from math import radians, cos, sin, asin, sqrt
from scipy import stats
import math
import cPickle

In [6]:
import bq  
client = bq.Client.Get()

In [7]:
def Query(q):
    t0 = time.time()
    answer = client.ReadTableRows(client.Query(q)['configuration']['query']['destinationTable'])
    print 'Query time: ' + str(time.time() - t0) + ' seconds.'
    return answer

In [8]:
def make_map(raster,title=None, colorbar_label=None, maximum=100, 
             minimum=1, scale_type ="log",cutoff=0,savefig=0, 
             png="temp.png", continent_color = '#111111' ):
    plt.clf()
    plt.rcParams["figure.figsize"] = [12,7]
    firstlat = 90-cutoff
    lastlat = -90+cutoff
    firstlon = -180
    lastlon = 180
    scale = cellsize
    truncated_grid = raster[cutoff/cellsize:(180/cellsize)-cutoff/cellsize][:]

    numlats = int((firstlat-lastlat)/scale+.5)
    numlons = int((lastlon-firstlon)/scale+.5)
    lat_boxes = np.linspace(lastlat,firstlat,num=numlats,endpoint=False)
    lon_boxes = np.linspace(firstlon,lastlon,num=numlons,endpoint=False)
    fig = plt.figure()
    m = Basemap(llcrnrlat=lastlat, urcrnrlat=firstlat,
              llcrnrlon=lastlon, urcrnrlon=firstlon, lat_ts=0, projection='robin',resolution="h", lon_0=0)
    m.drawmapboundary(fill_color='#111111')
    # m.drawcoastlines(linewidth=.2)
    m.fillcontinents(continent_color,lake_color=continent_color)#, lake_color, ax, zorder, alpha)
    x = np.linspace(-180, 180, 360/cellsize)
    y = np.linspace(lastlat, firstlat, (firstlat-lastlat)/cellsize)
    x, y = np.meshgrid(x, y)
    converted_x, converted_y = m(x, y)
    from matplotlib import colors,colorbar
    if scale_type == 'log':
        norm = colors.LogNorm(vmin=minimum, vmax=maximum)
    else:
        norm = colors.Normalize(vmin=minimum, vmax=maximum)

    m.pcolormesh(converted_x, converted_y, truncated_grid, norm=norm, vmin=minimum, vmax=maximum, cmap = plt.get_cmap('viridis'))
    if title == None:
        title = "A Raster of Some Sorts"
    plt.title(title, color = "#000000", fontsize=18)

    ax = fig.add_axes([0.2, 0.1, 0.4, 0.02]) 
    if scale_type == "log":
        lvls = np.logspace(np.log10(minimum),np.log10(maximum),num=8)
        cb = colorbar.ColorbarBase(ax,norm = norm, orientation='horizontal', ticks=lvls, cmap = plt.get_cmap('viridis'))
        the_labels = []
        for l in lvls:
            if l>=1:
                l = int(l)
            the_labels.append(l)
        cb.ax.set_xticklabels(the_labels, fontsize=10, color = "#000000")
    else:
        cb = colorbar.ColorbarBase(ax,norm = norm, orientation='horizontal', cmap = plt.get_cmap('viridis'))
        
    cb.set_label(colorbar_label,labelpad=-40, y=0.45, color = "#000000")
    ax.text(1.7, -0.5, 'Data Source: Orbcomm\nMap by Global Fishing Watch',
            verticalalignment='bottom', horizontalalignment='right',
            transform=ax.transAxes, color='#000000', fontsize=6)
    if savefig:
        plt.savefig(png,bbox_inches='tight',dpi=300,transparent=True,pad_inches=.1, facecolor="#000000")
    plt.show()

In [39]:
q = '''
SELECT
  integer(FLOOR(first_lat*10)) lat_bin,
  integer(FLOOR(first_lon*10)) lon_bin,
  integer(FLOOR(avg_lat*10)) lat_bin_avg,
  integer(FLOOR(avg_lon*10)) lon_bin_avg,
  satellite_positions sat_positions,
  terrestrial_positions terrestrial_positions,
  positions_weighted,
  avg_speed,
  slow_pings,
  mmsi
FROM
(SELECT
  mmsi,
  SUM( CASE WHEN speed = 0 OR (speed<=2 AND type IN (18, 19)) THEN 180 
     WHEN (speed > 0 AND speed <14 AND type IN (1,2,3)AND turn = 0 )
         OR (speed>2 AND type IN (18,19)) THEN 10 
      when speed>0 and speed<14 and type in (1,2,3) and turn !=0 then 3.3
      when speed>=14 and speed<23 and type in (1,2,3) and turn = 0 then 6
      when type in (1,2,3) and (speed>=23 or (speed>=14 and turn !=0)) then 2
      END) positions_weighted,
   first(lat) first_lat,
   first(lon) first_lon,
   avg(lat) avg_lat,
   avg(lon) avg_lon,
   max(lat) max_lat,
   min(lat) min_lat,
   max(lon) max_lon,
   min(lon) min_lon,
   avg(speed) avg_speed,
   sum(if( (speed=0 and type in (1,2,3)) or (speed<2 and type in (18,19)),1,0 )) slow_pings,
   sum( if(REGEXP_REPLACE(tagblock_station, 'u', '') IN ('rORBCOMM000',
        'rORBCOMM01',
        'rORBCOMM008',
        'rORBCOMM009',
        'rORBCOMM010'),1,0)) terrestrial_positions,
   sum( if(REGEXP_REPLACE(tagblock_station, 'u', '') not IN ('rORBCOMM000',
        'rORBCOMM01',
        'rORBCOMM008',
        'rORBCOMM009',
        'rORBCOMM010'),1,0)) satellite_positions,       
FROM
  [pipeline_normalize.20150103]
WHERE
  type IN (1,2,3,18,19) and lat is not null and lon is not null and speed is not null and turn is not null
group by mmsi
)
  where
  max_lat - min_lat <5
  AND (max_lon - min_lon < 10
    OR first_lon > 170
    OR first_lon < -170)
  AND mmsi IN (select mmsi from
[scratch_david_gapanalysis.good_mmsi_2015_1000pings])
'''

positions = Query(q)


Waiting on bqjob_r7e1770a08e992c2e_00000153a4f3e393_3 ... (15s) Current status: DONE   
Query time: 46.7837259769 seconds.

In [40]:
cPickle.dump(positions, open('../../data/density/20150103_v2_vessels.p', 'wb'))

In [41]:
positions = cPickle.load(open('../../data/density/20150103_v2_vessels.p', 'rb'))

In [27]:
# first calculate a raster of vessel locations from the query
cellsize = 2
num_lons = 360/cellsize
num_lats = 180/cellsize

vessels = np.zeros(shape=(num_lats,num_lons)) 
for row in positions:
    lat = int(row[0])
    lon = int(row[1])
    if lat<900 and lat>-900 and lon>-1800 and lon<1800:
        lat_index = (lat+900)/(cellsize*10)
        lon_index = (lon+1800)/(cellsize*10)
        vessels[lat_index][lon_index] += 1 # one vessel

In [28]:
make_map(vessels,title="Number of Vessels, Jul 1 2015",
             colorbar_label="Number of Vessels in each 2 by 2 degreee cell", maximum=vessels.max(), minimum=1)


<matplotlib.figure.Figure at 0x18d5cdd10>

In [29]:
# functions used in calculating satellite footprint

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

def distance_between_boxes(lat_box_0,lon_box_0,lat_box_1,lon_box_1):
    """
    Convert our lat and lon index to the lat and lon of the middle of
    a cell
    """
    lat1 = lat_box_0*cellsize-90+cellsize/2.
    lat2 = lat_box_1*cellsize-90+cellsize/2.
    lon1 = lon_box_0*cellsize-180+cellsize/2.
    lon2 = lon_box_1*cellsize-180+cellsize/2.
    return haversine(lon1, lat1, lon2, lat2)

In [30]:
# Now we have the problem of averaging on a sphere. There are geospatial
# functions to do this, but I'll just write my own here
# grid_for_average[lat_box][lon_box] is a list of indices representing
# the satellite's footprint when it is at an index lat_box, lon_box
# on the grid.

satellite_radius = 2700 # measured in km, ~1450 nautical miles
grid_for_average = [ [[] for i in range(num_lons)] for k in range(num_lats)]
for lat_box in range(num_lats):
    for lon_box in range(num_lons):
        # lat_above and lat_below are so that we don't have to 
        # loop through the entire grid
        lat_above = int(lat_box + satellite_radius/110. + 1)
        if lat_above > num_lats: lat_above = num_lats
        lat_below = int(lat_box - satellite_radius/110. - 1)
        if lat_below < 0: lat_below = 0
        for i in range(lat_below,lat_above):
            for k in range (num_lons):
                dist = distance_between_boxes(i,k,lat_box,lon_box)
                if dist < satellite_radius:
                    grid_for_average[lat_box][lon_box].append([i,k])

In [31]:
# Now we add count the number of vessels that are in the satellite's footprint
# when the satellite is at a given lat and lon
sat_footprints = np.zeros(shape=(num_lats,num_lons)) 
for i in range(num_lats):
    for j in range(num_lons):
        for item in grid_for_average[i][j]:
            sat_footprints[i][j] += vessels[item[0]][item[1]]

In [32]:
make_map(sat_footprints,title="Number of Vessels Seen by Satellite \nat each Lat and Lon, Jul 1 2015",
             colorbar_label="Number of Vessels", maximum=sat_footprints.max(), minimum=1)


<matplotlib.figure.Figure at 0x1a4351690>

In [33]:
#estimate the pings effect for each point
sat_pings_footprint = np.zeros(shape=(num_lats,num_lons))
for i in range(num_lats):
    for j in range(num_lons):
        # note that I don't know if this number should be 2500 -- I just guessed
        # with a few different numbers to see what worked best.
        sat_pings_footprint[i][j] = (1-1/2500.)**sat_footprints[i][j]

In [34]:
# Now, average again, because the satellite is moving. 
# This assumes that 
estimated_positions = np.zeros(shape=(num_lats,num_lons))
density_averaged = np.zeros(shape=(num_lats,num_lons))
for i in range(num_lats):
    for j in range(num_lons):
        a = 0
        for item in grid_for_average[i][j]:
            # crap, have to adjust by area 
            latitude = (i-45)*2+1 # i goes from 0 to 90, map to -91 to 89, every two degrees
            area = math.cos(latitude*3.1416/180.)
            estimated_positions[i][j] += sat_pings_footprint[item[0]][item[1]]*area
            density_averaged[i][j] +=  sat_footprints[item[0]][item[1]]/area
            a += area
        estimated_positions[i][j] = (157/a)*estimated_positions[i][j]
        # so, the a is just to make sure that the footprint area is the same at 
        # all latitudes, and the 157 is just a constant to get the predicted
        # number of positions roughly correct. I played around with different values
        # here until I got one that seemed the clossest

In [35]:
# what are the actual positions per day?
actual_positions = np.zeros(shape=(num_lats,num_lons))
number_of_vessels = np.zeros(shape=(num_lats,num_lons))

for row in positions:
    lat = int(row[2])
    lon_ave = int(row[3])
    lon_f = int(row[1])
    if abs(lon_f - lon_ave > 50): # use average, except near the dateline
        lon = lon_f
    else:
        lon = lon_ave
    sat_pos = int(row[4])
    all_pos = int(row[5])+int(row[4])
    speed = float(row[7])
    slow_pos = int(row[8])
    # must only be satellite positions
    if sat_pos == all_pos and lat<900 and lat>-900 and lon>-1800 and lon<1800 and speed>2 and slow_pos == 0:
    # and abs(lat)<300:
        lat_index = (lat+900)/(cellsize*10)
        lon_index = (lon+1800)/(cellsize*10)
        actual_positions[lat_index][lon_index] += sat_pos
        number_of_vessels[lat_index][lon_index] += 1
        
for i in range(num_lats):
    for j in range(num_lons):
        if number_of_vessels[i][j]:
            actual_positions[i][j] = float(actual_positions[i][j])/number_of_vessels[i][j]

actual_positions[actual_positions==0] = np.nan

In [36]:
make_map(estimated_positions,title="Predicted Number of Positions",scale_type="linear",
             colorbar_label="Predicted Number of Positions", maximum=estimated_positions.max(), minimum=0)


<matplotlib.figure.Figure at 0x1a43510d0>

In [37]:
make_map(actual_positions, title="Actual Number of Positions",scale_type="linear",
             colorbar_label="Number of Positions", maximum=estimated_positions.max(), minimum=1)


<matplotlib.figure.Figure at 0x1696d1a10>

In [38]:
# Plot the relationship for all vessels

den = []
pos = []
predicted_pos = []
pos_weighted = []

for row in positions:
    lat = int(row[2])
    lon_ave = int(row[3])
    lon_f = int(row[1])
    if abs(lon_f - lon_ave > 50): # use average, except near the dateline
        lon = lon_f
    else:
        lon = lon_ave
    sat_pos = int(row[4])
    all_pos = int(row[5])+int(row[4])
    speed = float(row[7])
    slow_pos = int(row[8])
    # must only be satellite positions, with a resaonable position, and not be stopped at all during the day
    if sat_pos == all_pos and lat<900 and lat>-900 and lon>-1800 and lon<1800 and speed>2 and slow_pos == 0:
        lat_index = (lat+900)/(cellsize*10)
        lon_index = (lon+1800)/(cellsize*10)
        den.append(sat_footprints[lat_index][lon_index])
        predicted_pos.append(estimated_positions[lat_index][lon_index])
        pos_weighted.append(float(row[6]))
        pos.append(sat_pos)


fig = plt.figure()
plt.rcParams["figure.figsize"] = [12,7]

ax = fig.add_subplot(1,1,1)
plt.scatter(predicted_pos, pos,  alpha=.05)
slope, intercept, r_value, p_value, std_err = stats.linregress(predicted_pos,pos)
print r_value, std_err, intercept, slope
y = [slope*x + intercept for x in predicted_pos]

plt.plot(predicted_pos,y, "r")
plt.title("Satellite positions per day versus Predicted")
plt.xlabel('Predicted Positions')
plt.ylabel('Actual Positions')
plt.xlim([0,250])
plt.ylim([0,250])
thetext = "slope: "+ str(round(slope,2))+"\nr_squared: "+str(round(r_value**2,2))+"\nstandard error: "+str(round(std_err,3))
ax.text(200, 150, thetext, fontsize=10)
plt.show()


0.68226855858 0.00896366750821 11.7998501373 1.1277801328

In [ ]:


In [ ]: