Satellite Characterization

EE-558, A Network Tour of Data Science
Lucas Gauchoux, Laurier Loiselle, Antoine Weber

Abstract

Satellites are a key part of our everyday life, from meteorology to telecommunications and localization. Nevertheless, it is hard to grasp the vast number of satellites above our heads. This project aims to characterize those satellites by exploiting two large databases (space-track.org and SATCAT).

To do so, we will provide an analysis of two main components of satellites: their orbits and their launch sites.

Fig. 1: Example of orbits around earth

This analysis will use graph-processing tools to extract meaning from the datasets we use. Our goals are twofold:

  1. Find correlations between satellite orbits to reveal their patterns and to distinguish between the different orbit types.
  2. From orbital information and qualitative data, attempt to determine the launch sites of a subset of satellites.

The first goal addresses the need for a better understanding of satellite orbits by showing how subsets of satellites have very similar orbits, and by showing the scale of the satellite cluster orbiting above us.

The second goal builds on the fact that a critical part of space mission design, in part due to the cost of orbital maneuvers, is to place satellites directly at or close to their desired orbit. Because of this, we will show that, when classified by their orbital parameters, satellites launched by the same institutions, mainly NASA and US-allied foreign countries, can be traced back to their spaceport location.

Imports, global names and helper functions


In [1]:
%matplotlib inline

from tqdm import tqdm
import json

from numpy import pi
import numpy as np
import pandas as pd
from scipy import sparse, stats, spatial, linalg

import networkx as nx
from networkx.algorithms import community
from pygsp import graphs, filters, plotting

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import reverse_geocoder as rg

import random

plt.rcParams['figure.figsize'] = (17, 5)
plotting.BACKEND = 'matplotlib'

from helpers import *
from helpers_satcat import *

Analysis of the satellite orbits

Data Acquisition

The dataset comes from the website space-track.org. To extract the data, it is necessary to create an account. Once the account was created, we were able to download datasets in bulk.

From the website, we extracted the datasets that we found meaningful for the task at hand. Those datasets are as follows:

  1. Satellite current TLE: The Two-Line Element (TLE) format describes the main satellite information. It provides the current orbital status and position. It also gives the satellite designator, which can be used to extract more information about the satellite. The extracted file gives the current TLE for 16731 satellites or debris in orbit. The file is given in a text format, but it is standardized, so it is easily scraped with regexes. A more extensive TLE dataset is provided on the website; we hit rate limits when downloading it, but theoretically it contains the TLE of all satellites since the beginning of the space age, with various degrees of precision. For now we will not extract the complete historical TLE dataset, but it is an option if we want to do a more complete/complex analysis.

  2. Satellite in orbit info: This file is given in a CSV format. It lists all the satellites currently in orbit. It gives the NORAD_CAT_ID, so it can easily be linked to the TLE. It also gives the satellite name, the launch date and the country of launch, which will be valuable in our analysis.

  3. Satellite statistics by country: This file is given in a CSV format. It gives information per country on various metrics: the number of satellites in orbit, the number of debris, the number of rocket bodies, the same information for decayed objects, and totals per country. This is interesting for a per-country view of the space programs without having to scrape the other (big) datasets.

Data Exploration

As mentioned above, we retrieved three different files from the SpaceTrack website. The most important one is the file containing the two-line elements, so we start by exploring it.

Two-Line Elements

A two-line element is a data format containing a list of orbital parameters. This information allows us to compute the position of a satellite at a given time. It looks like this:

- 0 VANGUARD 2
- 1    11U 59001A   17329.92226964 +.00000367 +00000-0 +19707-3 0  9992
- 2    11 032.8697 323.5388 1468703 016.5820 347.7999 11.85506786164031

The first line is the name of the satellite and the following lines are the so-called two-line element for this satellite. The full description of each parameter can be found here. However, we are interested in the position of satellites. In order to convert this raw data into a location, we will use the pyephem library.
We thus compute the longitude, latitude and elevation for each satellite of the text file "satellite_tle_current".
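
To illustrate the conversion, here is a minimal sketch of what our compute_tle_informations helper does for a single satellite (the code below is a simplified assumption, not the helper's exact body; the TLE lines are the VANGUARD 2 example above):

import ephem

# Single-satellite version of the bulk conversion done by our helper
name = "VANGUARD 2"
line1 = "1    11U 59001A   17329.92226964 +.00000367 +00000-0 +19707-3 0  9992"
line2 = "2    11 032.8697 323.5388 1468703 016.5820 347.7999 11.85506786164031"

sat = ephem.readtle(name, line1, line2)
sat.compute(ephem.now())  # propagate the orbit to the current time

print(sat.sublong / ephem.degree)  # longitude of the sub-satellite point [°]
print(sat.sublat / ephem.degree)   # latitude of the sub-satellite point [°]
print(sat.elevation / 1000)        # elevation above sea level, converted to [km]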


In [2]:
lines_tot = read_tle('Dataset/satellite_tle_current.txt')
assert lines_tot.count('\n') == 0

names,long,lat,elevation_km = compute_tle_informations(lines_tot)
data = pd.DataFrame(np.transpose([names,long,lat,elevation_km]))
data.columns = ['OBJECT_NAME','Longitude [°]','Latitude [°]','Elevation [km]']


Index number 5361 is not compatible to perform computation
Index number 49563 is not compatible to perform computation

As we can see from the debug output above, some TLEs were not in the right format and their position is not computable. The retrieved dataset now contains the position of all satellites that had the right format.


In [3]:
data.head()


Out[3]:
OBJECT_NAME Longitude [°] Latitude [°] Elevation [km]
0 VANGUARD 1 -63.01315328238157 -26.65173623575367 950.1053125
1 VANGUARD 2 33.607191075144456 -32.25819456312109 2825.77825
2 VANGUARD R/B 158.90670062542114 1.4622271746404283 746.4226875
3 VANGUARD R/B -117.00344908625789 12.395302654786787 2056.953375
4 VANGUARD 3 -14.979412519707317 -30.811335100669776 3271.14

In [4]:
data.OBJECT_NAME.is_unique


Out[4]:
False

As we can see, the satellite names are not unique and, by a quick verification, the duplicates don't share the same TLE. This means that, for two satellites with the same name, we get two different positions, which is inconsistent. Most of the objects having the same name carry the extension "DEB", which means that the object is a debris of the original satellite, resulting from an explosion or a collision with another object. It makes sense to keep track of those objects, to predict future collisions for example. However, we are interested in "real" satellites. Therefore we remove all the satellites whose object name has the extension "DEB". We also removed all the remaining duplicates in order to use the satellite names as indexes.


In [5]:
no_deb = data[["DEB" not in x for x in data.OBJECT_NAME ]].drop_duplicates(subset = "OBJECT_NAME")
no_deb.head()


Out[5]:
OBJECT_NAME Longitude [°] Latitude [°] Elevation [km]
0 VANGUARD 1 -63.01315328238157 -26.65173623575367 950.1053125
1 VANGUARD 2 33.607191075144456 -32.25819456312109 2825.77825
2 VANGUARD R/B 158.90670062542114 1.4622271746404283 746.4226875
4 VANGUARD 3 -14.979412519707317 -30.811335100669776 3271.14
5 EXPLORER 7 165.9903034740197 9.845133707399345 517.82565625

We can now visualize the distribution of the elevation, longitude and latitude values.


In [6]:
#no_deb["Elevation [km]"].astype("float").plot(kind="bar", bins= 30)
plt.hist(no_deb["Elevation [km]"].astype("float"), bins = 30, log=True)
plt.xlabel("Elevation [km]")
plt.ylabel("#Satellites")
plt.show()


We can observe two main peaks: one roughly representing satellites orbiting below ~4500 km and one for those orbiting around 35'000 km. These correspond to LEO (Low Earth Orbit) and GEO/GSO (Geosynchronous/Geostationary Orbit). Between the two peaks, there are still a lot of satellites, in the so-called MEO (Medium Earth Orbit). The remaining satellites, higher than GEO/GSO, are in HEO (High Earth Orbit).


In [7]:
plt.hist(no_deb["Longitude [°]"].astype("float"), bins = 50, log=False)
plt.xlabel("Longitude [°]")
plt.ylabel("#Satellites")
plt.show()



In [8]:
plt.hist(no_deb["Latitude [°]"].astype("float"), bins = 50, log=False)
plt.xlabel("Latitude [°]")
plt.ylabel("#Satellites")
plt.show()


As we can see, there is a nice spread over all the possible longitude values. There are no peaks, meaning that the satellites are uniformly distributed over the longitudes. However, it's not the case for the latitudes. Indeed, we can observe a huge peak at 0° latitude, which corresponds to the equator.
The equator is where geostationary satellites are located. Geostationary satellites have the same period as the Earth's rotation and always keep the same position relative to the ground.

Satellites in Orbit Info

The second file that we retrieved contains additional information for each satellite, as we can see below.


In [9]:
sioi = pd.read_csv("Dataset/satellites_in_orbit_info.csv")
sioi.head()


Out[9]:
OBJECT_ID OBJECT_NAME NORAD_CAT_ID COUNTRY PERIOD INCLINATION APOGEE PERIGEE RCS_SIZE RCSVALUE LAUNCH COMMENT
0 2017-075D CZ-2C R/B 43031 PRC 95.19 35.21 599.0 457.0 NaN 0 2017-11-24 NaN
1 2017-075C YAOGAN-30 F 43030 PRC 96.62 35.00 604.0 590.0 NaN 0 2017-11-24 NaN
2 2017-075B YAOGAN-30 E 43029 PRC 96.62 35.00 604.0 590.0 NaN 0 2017-11-24 NaN
3 2017-075A YAOGAN-30 D 43028 PRC 96.62 35.00 603.0 590.0 NaN 0 2017-11-24 NaN
4 1998-067NL OSIRIS-3U 43027 US 92.59 51.64 405.0 398.0 NaN 0 1998-11-20 NaN

We start by cleaning the dataframe before visualizing it. We first drop the columns COMMENT, RCSVALUE and RCS_SIZE because they contain, most of the time, little or no information at all.


In [10]:
sioi = sioi.drop(["COMMENT", "RCSVALUE", "RCS_SIZE"], axis = 1)

We then remove the satellites that are duplicates or that have the DEB extension, to be consistent with our previous processing of the TLE file. We also remove rows that still contain NaN values.


In [11]:
sioi = sioi[["DEB" not in x for x in sioi.OBJECT_NAME ]].drop_duplicates(subset = "OBJECT_NAME")
sioi = sioi.dropna(axis = 0)
sioi.head()


Out[11]:
OBJECT_ID OBJECT_NAME NORAD_CAT_ID COUNTRY PERIOD INCLINATION APOGEE PERIGEE LAUNCH
0 2017-075D CZ-2C R/B 43031 PRC 95.19 35.21 599.0 457.0 2017-11-24
1 2017-075C YAOGAN-30 F 43030 PRC 96.62 35.00 604.0 590.0 2017-11-24
2 2017-075B YAOGAN-30 E 43029 PRC 96.62 35.00 604.0 590.0 2017-11-24
3 2017-075A YAOGAN-30 D 43028 PRC 96.62 35.00 603.0 590.0 2017-11-24
4 1998-067NL OSIRIS-3U 43027 US 92.59 51.64 405.0 398.0 1998-11-20

We can now plot the apogee and expect a distribution similar to the elevation feature of the TLE file.


In [12]:
plt.hist(sioi.APOGEE, bins=50, log=True)
plt.show()


As we can see, the two files don't seem to contain the same set of satellites. We can plot the satellites that have an apogee lower than 175'000 km to compare the two plots more precisely.


In [13]:
plt.hist(sioi.APOGEE[sioi.APOGEE < 175000], bins=50, log=True)
plt.show()


We can observe a distribution similar to the elevation one. Apparently this file mainly contains information about LEO and GEO/GSO satellites. In order to use both files in the exploitation part of the project, we will have to merge them.

We can also plot the launch-date histogram of this file.


In [14]:
sioi["LAUNCH"]= pd.to_datetime(sioi["LAUNCH"]).apply(lambda x: x.strftime('%Y'))
sioi.groupby("LAUNCH").size().plot(kind="bar", figsize=(20,10))
plt.show()


There is clearly an increasing trend throughout the years. The exponential behaviour in the latest years could be explained by the democratization of cube satellite launches for commercial use.

Satellites Statistics per Country

The last file contains various statistics per country.


In [15]:
sspc = pd.read_csv("Dataset/sat_statistics_per_country.csv")
sspc.head()


Out[15]:
COUNTRY SPADOC_CD ORBITAL_TBA ORBITAL_PAYLOAD_COUNT ORBITAL_ROCKET_BODY_COUNT ORBITAL_DEBRIS_COUNT ORBITAL_TOTAL_COUNT DECAYED_PAYLOAD_COUNT DECAYED_ROCKET_BODY_COUNT DECAYED_DEBRIS_COUNT DECAYED_TOTAL_COUNT COUNTRY_TOTAL
0 ARAB SATELLITE COMMUNICATIONS ORGANIZATION AB 0 13 0 0 13 1 0 0 1 14
1 ASIASAT CORP AC 0 8 0 0 8 0 0 0 0 8
2 ALGERIA ALG 0 5 0 0 5 0 0 0 0 5
3 ARGENTINA ARGN 0 17 0 0 17 2 0 0 2 19
4 AUSTRIA ASRA 0 3 0 0 3 0 0 0 0 3

This file will not be used in the exploitation part. However, for the sake of curiosity, we can still plot the distribution of satellites per country.


In [16]:
sspc = sspc[["COUNTRY", "ORBITAL_TOTAL_COUNT"]].set_index("COUNTRY")
sspc = sspc.drop("ALL")
sspc.plot(kind='bar', figsize = (30, 10), logy=True)
plt.show()


Not surprisingly, the most active countries are the USA, China and Russia (more precisely the Commonwealth of Independent States).

Data Exploitation

We will combine the first two files from the exploratory part, namely the TLE file and the "Satellites in Orbit Info" file. As we saw before, they don't contain exactly the same data. In order to use data present in both files, we will merge the two dataframes on their OBJECT_NAME column.


In [17]:
data_m = pd.merge(no_deb,sioi,how='inner', on="OBJECT_NAME")
data_m.head()


Out[17]:
OBJECT_NAME Longitude [°] Latitude [°] Elevation [km] OBJECT_ID NORAD_CAT_ID COUNTRY PERIOD INCLINATION APOGEE PERIGEE LAUNCH
0 VANGUARD 1 -63.01315328238157 -26.65173623575367 950.1053125 1958-002B 5 US 132.75 34.24 3834.0 650.0 1958
1 VANGUARD 2 33.607191075144456 -32.25819456312109 2825.77825 1959-001A 11 US 121.47 32.87 2940.0 553.0 1959
2 VANGUARD R/B 158.90670062542114 1.4622271746404283 746.4226875 1958-002A 16 US 137.32 34.28 4223.0 655.0 1958
3 VANGUARD 3 -14.979412519707317 -30.811335100669776 3271.14 1959-007A 20 US 124.62 33.35 3263.0 510.0 1959
4 EXPLORER 7 165.9903034740197 9.845133707399345 517.82565625 1959-009A 22 US 96.36 50.28 683.0 486.0 1959

In [18]:
data_final = data_m 
np.shape(data_final)


Out[18]:
(4347, 12)

As one can observe, we have a rich amount of data. The goal of our first investigation was to find a way to group different satellite categories. To do so, we decided to cluster the satellites with respect to their orbit, a task for which a graph is a natural tool. Moreover, we saw on this website that we should find a principal orbit oscillating around the equator, which is a welcome hint to assess whether our results are coherent.


In [19]:
dict_tle = {}
for i in range(0,len(lines_tot)-1,3):
    dict_tle[lines_tot[i][2:].strip('\n')] = (lines_tot[i+1],lines_tot[i+2])
    
keys = list(dict_tle.keys())
for i in keys:
    if i not in data_final['OBJECT_NAME'].values:
        del dict_tle[i]

As said before, we want to solve our problem using a graph. To do so, we had to think about a metric to compare the different satellites.
We decided to create a binary map for each satellite showing its trajectory during a single period. First, we sampled each satellite's period with a fixed number of samples (in our case, 200) and computed the satellite positions at each time sample using the PyEphem library and the TLE file. The results were plotted on an array of size 180x360, where 180 and 360 represent the spread of the latitude and longitude respectively.
This amounts to projecting the satellite positions onto a 2D plane representing the Earth's surface. Projecting each sample onto a single discrete point was too precise in our case and would make the comparison between satellites impossible. To solve this problem, we projected each satellite position onto a square of fixed side length.
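
As an illustration, here is a minimal sketch of this rasterization (the actual work is done by our satellite_orbit and compute_grids helpers; the function name and the half-width of the square below are hypothetical):

import numpy as np

def rasterize_orbit(latitudes, longitudes, half_side=2):
    """Project sampled (latitude, longitude) positions onto a 180x360 binary map."""
    grid = np.zeros((180, 360), dtype=bool)
    for lat, lon in zip(latitudes, longitudes):
        row = int(lat) + 90    # latitude  in [-90, 90]   -> row in [0, 180]
        col = int(lon) + 180   # longitude in [-180, 180] -> col in [0, 360]
        # Mark a small square around the sample instead of a single pixel,
        # so that two close trajectories share pixels
        grid[max(row - half_side, 0):row + half_side + 1,
             max(col - half_side, 0):col + half_side + 1] = True
    return grid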


In [20]:
res = []
for i in (dict_tle.keys()):
    res.append(satellite_orbit(i,sioi,dict_tle))

In [21]:
orbit_grids = compute_grids(res)

Typical results of our mapping are shown in the next figure.


In [22]:
fig, axes = plt.subplots(2,2)
axes[0,0].spy(orbit_grids[0])
axes[0,1].spy(orbit_grids[5])
axes[1,0].spy(orbit_grids[200])
axes[1,1].spy(orbit_grids[150])


Out[22]:
<matplotlib.image.AxesImage at 0x1c17f1cc88>

We can already observe from the first two plots that most of the satellites have close trajectories with different phases.
One can observe a recurrent gap between the two tails of each plot. This is due to the approximation of the movement function computed by PyEphem and the period presumably not being totally accurate.
The third plot is a typical example of the movement of a geostationary or geosynchronous satellite during one entire period. Indeed, as said before, a geostationary satellite always covers the same area over time.
The fourth image shows some flaws of our representation. With a fixed sampling rate and a square representation of the Earth's surface, the speed of the satellites is not constant in our reference frame; hence, the sampled positions are not equidistant. Moreover, the top and bottom rows of the grid correspond to the poles. At these positions, varying the longitude does not make sense, as it remains the same point. In our 2D projection, this appears as discontinuities.
Moreover, after a single period, some satellites undergo a large phase shift, as can be seen on the fourth plot. This is the principal reason why we did not compute the trajectory of the satellites over more than one period: at some point, a satellite would have covered most of the map, and this would falsify our results.

Let's have a look at all our satellites on a world map.


In [23]:
plot_map(data_final)


Without any processing, satellites cover most of the globe, so this raw data cannot be analyzed directly.
However, the latter plot already illustrates some of the clusters we want to highlight. The most remarkable is the geostationary cluster, the dense line along the equator. One pattern moving down the Atlantic Ocean and reaching up over Australia and the Philippine Sea can also be noticed.

Now we had to define a metric to compare our different grid maps. We decided to apply a logical "and" between the binary maps and take the sum of the resulting matrix as the weight between two satellites. This gives a high coefficient for satellites sharing the same trajectory. Since a high coefficient directly corresponds to a high weight, we did not need another mathematical tool, such as a decreasing exponential, to turn small coefficients into high weights.
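
A minimal sketch of this weight computation follows (our compute_adja helper implements it; the function name below is hypothetical). Note that a logical "and" of two binary maps followed by a sum is simply the number of shared pixels:

import numpy as np

def pairwise_overlap(orbit_grids):
    """Weight (i, j) = number of pixels shared by trajectories i and j."""
    n = len(orbit_grids)
    overlap = np.zeros((n, n), dtype=np.int32)
    for i in range(n):
        for j in range(i + 1, n):
            # logical "and" of the binary maps, then sum of the result
            overlap[i, j] = overlap[j, i] = np.sum(orbit_grids[i] & orbit_grids[j])
    return overlap

The same computation can also be vectorized as a dot product of the flattened binary maps, which is much faster for our 4347 satellites.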


In [24]:
adja = compute_adja(orbit_grids)
plt.spy(adja)


Out[24]:
<matplotlib.image.AxesImage at 0x1ca69783c8>

The resulting blank lines probably come from the geostationary satellites, which have only a small number of pixels in common with other trajectories.
The next step is to filter out pairs of satellites having a small number of pixels in common. To do so, we applied a simple threshold.


In [25]:
threshold = 8000
final = adja*(adja>threshold)
plt.spy(final)


Out[25]:
<matplotlib.image.AxesImage at 0x1c14369b38>

The resulting matrix is very sparse. For such a huge matrix (4347x4347), sparsifying it was necessary to perform a coherent clustering without too much noise.

Let's have a look at the graph with the PyGSP library.


In [26]:
G = graphs.Graph(final)
G.set_coordinates('spring')
G.plot()


The spring coordinates allow us to identify a central pattern. Let's apply a DBSCAN algorithm to these coordinates to isolate it.
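
dbscan_func is one of our helper functions; a minimal sketch of the idea, assuming it wraps scikit-learn's DBSCAN applied to the spring-layout coordinates (the variant with extra arguments restricts the fit to the nodes of a previously found cluster):

from sklearn.cluster import DBSCAN

def dbscan_on_coordinates(G, eps, min_samples):
    # G.coords holds the 2D spring-layout coordinates computed above
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(G.coords)
    return labels  # -1 marks noise points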


In [27]:
labels1 = dbscan_func(G,0.3,100)


In order to assess which color corresponds to which label, we plot the graph using the labels vector as a signal.


In [28]:
G.plot_signal(labels1)


We are interested in the points having label 1. Let's plot the corresponding satellites on a world map.


In [29]:
plot_map(data_final,labels1,1)


We can observe that this central pattern is really meaningful with respect to our goal: the plotted satellites share the same orbit with different phase shifts. The next step is to cluster these different phases. To do so, let's run another DBSCAN on these satellites.


In [30]:
labels2 = dbscan_func(G,0.2,5,labels1,1)



In [31]:
indices = np.where(labels1==1)
labels3 = np.ones(len(labels1))*(-2)
labels3[indices] = labels2
G.plot_signal(labels3)


This latter processing allows us to cluster the trajectories with respect to their phase shift. The first plot below shows the principal trajectory that we already observed in the first world map plot; the previous processing allowed us to isolate it in a single cluster. Moreover, the second plot confirms that the clustering has been done with respect to the phase shift.
There is some noise on the first plot. To avoid it, we could simply have been more strict during the second DBSCAN. However, this would result in a lot of outliers, which is not optimal.


In [32]:
plot_map(data_final,labels3,1)



In [33]:
plot_map(data_final,labels3,4)


As said before, geostationary satellites always cover the same area of the Earth's surface over time. That is why, with our metric, we cannot cluster them.
However, there is a simpler way to observe them, without using any graph. As the area they cover is very small, we can apply a threshold directly on the metric's coefficients. This isolates the satellites covering small areas: since all the other satellites oscillate, their area must be bigger than the geostationary ones'.


In [34]:
summ = np.zeros([len(orbit_grids),1])
for n,i in enumerate(orbit_grids):
    summ[n] = np.sum(i)
summ = summ*(summ<1000)
mask = summ>0
mask = (mask==True).T[0]
plot_map(data_final,mask.T,True)


The simple threshold worked without any noise: we were able to highlight the geostationary satellites without any graph processing.
The precise constant line over the equator corresponds to the geostationary satellites. The other trajectory, oscillating with a very small amplitude, is the geosynchronous orbit: these satellites are at the same height as the geostationary ones, but with a small inclination angle, resulting in a small vertical oscillation.
To separate these two orbits, one could simply tighten or relax the threshold.

SATCAT launch site analysis

In this analysis, we will attempt to identify the launch site of each satellite from the information provided by the SATCAT database, such as its orbital parameters, country of origin, nature, etc.

The goal is to identify satellites' launch sites by using only a subset of labeled satellites. We want to see what precision we can obtain with a graph-based analysis.

This analysis provides another angle of approach compared to the previous orbit analysis. Before, we tried to match satellites by their orbits. Now, the goal is to link satellites to where they were launched, using not only the orbit data but also other discrete information.

Note: Various helper functions have been used for this section. They can be found here.

SATCAT data extraction

To have a cleaner notebook, we extracted the data from the SATCAT dataset using an external script.

This script generates a json file containing the formatted SATCAT data and more complete information on the files.

Extract Celestrak SATCAT data


In [35]:
with open("Dataset/satcat_info.json") as f:
    satcat_info_json = json.load(f)

In [36]:
# Extract the main dataset and put it in a dataframe
sat_info_array = satcat_info_json["sat_data"]
satcat_df = pd.DataFrame(sat_info_array)
satcat_df = satcat_df.set_index("NORAD")
display(satcat_df.head(5))


apogee decay_date inclination launch_date launch_nbr launch_piece launch_site launch_year multiple_name_flag name operational_status orbital_period orbital_status payload_flag perigee radar_cross_section source
NORAD
1 938.0 1957-12-01 65.1 1957-10-04 1 A TYMSC 1957 False SL-1 R/B Decayed 96.2 EA False 214.0 20.420 CIS
2 945.0 1958-01-03 65.0 1957-10-04 1 B TYMSC 1957 False SPUTNIK 1 Decayed 96.1 EA True 227.0 NaN CIS
3 1659.0 1958-04-14 65.3 1957-11-03 2 A TYMSC 1957 False SPUTNIK 2 Decayed 103.7 EA True 211.0 0.080 CIS
4 215.0 1970-03-31 33.2 1958-02-01 1 A AFETR 1958 False EXPLORER 1 Decayed 88.5 EA True 183.0 NaN US
5 3834.0 34.3 1958-03-17 2 B AFETR 1958 False VANGUARD 1 Unknown 132.8 EA True 650.0 0.122 US

In [37]:
# Extract complementary information dictionary
operational_status_dict = satcat_info_json["operational_status"]
launch_site_full_name_dict = satcat_info_json["launch_site"]
source_full_name_dict = satcat_info_json["source"]

Fill NaN in satcat data


In [38]:
satcat_df = satcat_df.fillna(value=0)

Data Exploration

In this section, we want to see how the different parameters of the satellites are distributed.

We used a systematic approach: for each parameter of interest, we performed a short analysis.

Launch site


In [39]:
# Create dict to transform launch site to numerical value
num_launch_site_dict = {}
for index,site in enumerate(satcat_df.launch_site.unique()):
    num_launch_site_dict[site] = index

In [40]:
num_launch_site = satcat_df.launch_site.map(lambda x:num_launch_site_dict[x])
num_launch_site.name = "num_launch_site"
satcat_df = pd.concat([satcat_df,num_launch_site], axis=1)

In [41]:
# How many launches of payloads (functioning satellites) per site
launches_per_site = satcat_df.loc[satcat_df.payload_flag == True].launch_site.value_counts()
display(launches_per_site.head(5))


PLMSC    2088
TYMSC    1882
AFETR     892
AFWTR     851
FRGUI     472
Name: launch_site, dtype: int64

In [42]:
sns.distplot(launches_per_site.values,kde=False, rug=True)
plt.show()


We can see from this figure that there are many small launch sites, but very few larger ones. To do data analysis, we want relatively big datasets, so this encourages us to analyze the bigger launch sites.

In the end, we chose to analyze the larger United States launch sites, but a similar analysis could be done for the Russian (historically USSR) launch sites.


In [43]:
bigger_launch_sites = ["TYMSC", "PLMSC"]
print("Russian launch sites:")
for index in range(2):
    print("\t{}) {} ({})".format(index+1, launch_site_full_name_dict[bigger_launch_sites[index]], bigger_launch_sites[index]))


Russian launch sites:
	1) Tyuratam Missile and Space Center, Kazakhstan (Also known as Baikonur Cosmodrome) (TYMSC)
	2) Plesetsk Missile and Space Complex, Russia (PLMSC)

On another note, one might think that the discrepancy in the number of launches means that the Russians launched more satellites than the United States. In fact, this difference could be caused by the USA classified satellites, which didn't make it into the list.

Country of origin


In [44]:
# Create dict to transform source to numerical value
num_source_dict = {}
for index,site in enumerate(satcat_df.source.unique()):
    num_source_dict[site] = index

In [45]:
# How many launch per sources
launches_per_source = satcat_df.source.value_counts()
display(launches_per_source.head(5))


CIS    21491
US     11387
PRC     5192
FR      1263
IND      536
Name: source, dtype: int64

In [46]:
sns.distplot(launches_per_source.values,kde=False, rug=True)
plt.show()


Operational Status


In [47]:
num_operational_status_dict = {}
for index, status in enumerate(satcat_df.operational_status.unique()):
    num_operational_status_dict[status] = index
print(num_operational_status_dict)


{'Decayed': 0, 'Unknown': 1, 'Nonoperational': 2, 'Operational': 3, 'Partially Operational': 4, 'Backup/Standby': 5, 'Spare': 6, 'Extended Mission': 7}

In [48]:
# How many satellites per operational status
operational_status_count = satcat_df.operational_status.value_counts()
display(operational_status_count)


Decayed                  23970
Unknown                  15041
Operational               1754
Nonoperational            1070
Partially Operational       53
Backup/Standby              17
Spare                        2
Extended Mission             2
Name: operational_status, dtype: int64

In [49]:
sns.distplot(operational_status_count.values,kde=False, rug=True)
plt.show()


Most satellites decay over time, and there are actually not that many active ones. Studying only active ones might be too limiting, so we will include historic satellites in our analysis.

Orbital status

See: Orbital Status Code for the meaning of the code


In [50]:
orbital_statuses = satcat_df.orbital_status.unique()
print(len(orbital_statuses), "\n", orbital_statuses)


8 
 ['EA' 'NEA' 'EA2' 'MO2' 'MA0' 'EL0' 'DOC' 'ISS']

In [51]:
num_orbital_status_dict = {}
for index, status in enumerate(orbital_statuses):
    num_orbital_status_dict[status] = index
print(num_orbital_status_dict)


{'EA': 0, 'NEA': 1, 'EA2': 2, 'MO2': 3, 'MA0': 4, 'EL0': 5, 'DOC': 6, 'ISS': 7}

In [52]:
# How many satellites per orbital status
orbital_status_count = satcat_df.orbital_status.value_counts()
display(orbital_status_count)


EA     41832
NEA       68
ISS        4
DOC        1
MA0        1
EA2        1
MO2        1
EL0        1
Name: orbital_status, dtype: int64

In [53]:
sns.distplot(orbital_status_count.values,kde=False, rug=True)
plt.show()


Almost all satellites are "EA", which means that they are in Earth orbit. Because of this, the orbital status has limited interest for our analysis.

Orbital Parameters

For each orbital parameter, we want to do a distribution analysis by launch site.


In [54]:
ORBITAL_PARAMETERS_COLS = ["apogee", "inclination", "launch_year", "orbital_period", "perigee"]

In [55]:
fig, axes = plt.subplots(len(ORBITAL_PARAMETERS_COLS), 1, figsize=(10,20))

for index, parameter in enumerate(ORBITAL_PARAMETERS_COLS):
    axes[index].hist(satcat_df[parameter].values, bins=100);
    axes[index].set_xscale("log")
    axes[index].set_yscale("log")
    axes[index].set_title(parameter)
plt.show()


We can see that there is a wide distribution for each orbital parameter. Because we have limited astrophysics knowledge, it would be hard to differentiate those parameters manually.

The next sections will attempt to use those parameters, without specific knowledge of their meaning, to classify the satellites.

Data reduction

To do clustering, we want to reduce the complexity of the target clusters.

To do so, we will first reduce the number of launch sites and/or sources.

We chose to reduce the number of launch sites by selecting two of the launch sites with the most satellite launches.

Those are the main launch sites for the United States of America, the Air Force Eastern Test Range, Florida, USA (AFETR), which is linked to the Cape Canaveral Air Force Station and the Kennedy Space Center, and the Air Force Western Test Range, California, USA (AFWTR), which is linked to the Vandenberg Air Force Base.

We used those specific launch sites because they are the 3rd and 4th largest in terms of satellites sent, and they have very similar numbers of satellites sent (AFETR: 892, AFWTR: 851). In addition, because they are located in different parts of the same country, there is a good chance that they were used for different purposes. Since differences between the launch sites' satellites are key to our analysis, this is probably an important reason why we got good results.

Note: In addition, we created a function to reduce the number of launch sites by their relative importance. The interested reader can look into the function "data_reduction_per_percentile" for more information.


In [56]:
LAUNCH_SITES = ["AFETR", "AFWTR"]

In [57]:
reduced_satcat_df = data_reduction_per_launch_site(satcat_df, LAUNCH_SITES)

In [58]:
print(reduced_satcat_df.source.unique())
display(reduced_satcat_df.head(5))


['US' 'UK' 'ITSO' 'ESRO' 'NATO' 'CA' 'FGER' 'INDO' 'ESA' 'JPN' 'IT' 'IND'
 'ARGN' 'GER' 'AB' 'MEX' 'AUS' 'RP' 'SWED' 'NOR' 'IM' 'EUTE' 'NETH' 'GREC'
 'BRAZ' 'SES' 'SKOR' 'ABS' 'PAKI' 'GLOB' 'CIS' 'ROC' 'SPN' 'NICO' 'AC'
 'ORB' 'THAI' 'TMMC' 'BUL' 'FR' 'DEN' 'SAFR']
apogee decay_date inclination launch_date launch_nbr launch_piece launch_site launch_year multiple_name_flag name operational_status orbital_period orbital_status payload_flag perigee radar_cross_section source num_launch_site
NORAD
4 215.0 1970-03-31 33.2 1958-02-01 1 A AFETR 1958 False EXPLORER 1 Decayed 88.5 EA True 183.0 0.000 US 1
5 3834.0 34.3 1958-03-17 2 B AFETR 1958 False VANGUARD 1 Unknown 132.8 EA True 650.0 0.122 US 1
6 1739.0 1958-06-28 33.5 1958-03-26 3 A AFETR 1958 False EXPLORER 3 Decayed 103.6 EA True 117.0 0.000 US 1
9 585.0 1959-10-23 50.3 1958-07-26 5 A AFETR 1958 False EXPLORER 4 Decayed 92.8 EA True 239.0 0.000 US 1
10 1187.0 1959-01-21 32.3 1958-12-18 6 A AFETR 1958 False SCORE Decayed 98.2 EA True 159.0 0.000 US 1

We can see that many countries/groups used those launch sites. This is good, because they likely chose them selectively for their specific launch properties. Because of this, we will keep the various sources in the dataset.

Reduce by payload or operational

To reduce the size of the dataset even further, we can choose to keep only payloads and/or operational satellites.

We chose to keep only payloads, because the dataset has many other types, such as debris and launchers, which are not really interesting for our analysis.

We chose to use all historic satellites to have more data to work with. Indeed, we showed above that relatively few satellites remain in orbit compared to the number launched historically. Also, historic satellites are likely to follow orbits similar to those of modern satellites.


In [59]:
ONLY_PAYLOAD=True
ONLY_OPERATIONAL=False
print("Length before further reduction: ", len(reduced_satcat_df))
if ONLY_PAYLOAD:
    reduced_satcat_df = reduced_satcat_df.loc[reduced_satcat_df.payload_flag == True]
if ONLY_OPERATIONAL:
    reduced_satcat_df = reduced_satcat_df.loc[reduced_satcat_df.operational_status == "Operational"]
    
print("Length after reduction: ", len(reduced_satcat_df))


Length before further reduction:  10253
Length after reduction:  1743

In [60]:
# We can now delete satcat_df, it will not be needed again
del satcat_df

Extract Features and Labels

Create Features DataFrame

We now want to extract the features for each satellite. At first, we want to keep as much data as possible; later on, we might drop some features.


In [61]:
display(reduced_satcat_df.head(5))


apogee decay_date inclination launch_date launch_nbr launch_piece launch_site launch_year multiple_name_flag name operational_status orbital_period orbital_status payload_flag perigee radar_cross_section source num_launch_site
NORAD
4 215.0 1970-03-31 33.2 1958-02-01 1 A AFETR 1958 False EXPLORER 1 Decayed 88.5 EA True 183.0 0.000 US 1
5 3834.0 34.3 1958-03-17 2 B AFETR 1958 False VANGUARD 1 Unknown 132.8 EA True 650.0 0.122 US 1
6 1739.0 1958-06-28 33.5 1958-03-26 3 A AFETR 1958 False EXPLORER 3 Decayed 103.6 EA True 117.0 0.000 US 1
9 585.0 1959-10-23 50.3 1958-07-26 5 A AFETR 1958 False EXPLORER 4 Decayed 92.8 EA True 239.0 0.000 US 1
10 1187.0 1959-01-21 32.3 1958-12-18 6 A AFETR 1958 False SCORE Decayed 98.2 EA True 159.0 0.000 US 1

In [62]:
display(reduced_satcat_df.orbital_status.unique())


array(['EA', 'NEA', 'EA2', 'ISS'], dtype=object)

In [63]:
# Note: because this function is key to understanding how we create the features, we kept it in the notebook

def get_feature_dataframe(reduced_satcat_df, only_payload=True, only_operational=False):
    """Function to create the feature dataframe"""
    # We keep all the features that could have an impact on the clustering
    # Note: We do this in a general fashion, so we keep features that could have been reduced just in case

    # Numeric Features : "apogee", "inclination", "launch_year", "orbital_period", "perigee", "radar_cross_section"
    # Boolean Features : "payload_flag"
    # List Features    : "operational_status", "orbital_status", "source"
    numeric_features = ["apogee", "inclination", "launch_year", "orbital_period", "perigee", "radar_cross_section"]
    boolean_features = ["payload_flag"]
    list_features = ["operational_status", "orbital_status", "source"]
    features_columns = numeric_features + boolean_features + list_features

    # Numeric features don't require special management
    features_df = reduced_satcat_df[numeric_features]
    display(features_df.head(5))

    # Transform boolean features to numeric
    num_payload_flag = reduced_satcat_df.payload_flag.map (\
                                lambda x : 1 if x else 0
                       )
    if not only_payload:
        features_df = features_df.assign(payload_flag = num_payload_flag)

    # We need to transform the List features in a numerical form, we will use the unique value index to do so
    # We previously created indexes to be able to find them easily
    # "operational_status" : num_operational_status_dict
    # "orbital_status" : num_orbital_status_dict
    # "source": num_source_dict
    num_operational_status = reduced_satcat_df.operational_status.map( \
                                    lambda x : num_operational_status_dict[x] \
                             )
    num_orbital_status = reduced_satcat_df.orbital_status.map( \
                                    lambda x : num_orbital_status_dict[x] \
                             )
    num_source = reduced_satcat_df.source.map( \
                                    lambda x : num_source_dict[x] \
                             )
    if not only_operational:
        features_df = features_df.assign(operational_status = num_operational_status)
    # We could use orbital status, but as we have seen above, it provides very little 
    # information because most of the statuses are the same
    #features_df = features_df.assign(orbital_status = num_orbital_status)
    features_df = features_df.assign(source = num_source)
    
    return features_df

In [64]:
features_df = get_feature_dataframe(reduced_satcat_df, ONLY_PAYLOAD, ONLY_OPERATIONAL)
display(features_df.head(5))


apogee inclination launch_year orbital_period perigee radar_cross_section
NORAD
4 215.0 33.2 1958 88.5 183.0 0.000
5 3834.0 34.3 1958 132.8 650.0 0.122
6 1739.0 33.5 1958 103.6 117.0 0.000
9 585.0 50.3 1958 92.8 239.0 0.000
10 1187.0 32.3 1958 98.2 159.0 0.000
apogee inclination launch_year orbital_period perigee radar_cross_section operational_status source
NORAD
4 215.0 33.2 1958 88.5 183.0 0.000 0 1
5 3834.0 34.3 1958 132.8 650.0 0.122 1 1
6 1739.0 33.5 1958 103.6 117.0 0.000 0 1
9 585.0 50.3 1958 92.8 239.0 0.000 0 1
10 1187.0 32.3 1958 98.2 159.0 0.000 0 1

Normalize features

We want to normalize the features so that they have equivalent weight in the rest of our analysis.
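
normalize_features is a helper function; judging from the output below, a minimal equivalent is a per-column standardization (this sketch is an assumption, not the helper's exact code):

def normalize_features_sketch(features_df):
    # Standardize each column to zero mean and unit variance,
    # so that no single feature dominates the distance computation
    return (features_df - features_df.mean()) / features_df.std()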


In [65]:
features_df = normalize_features(features_df)
display(features_df.head(5))


apogee inclination launch_year orbital_period perigee radar_cross_section operational_status source
NORAD
4 -0.390693 -0.766486 -1.44112 -0.192732 -0.501090 -0.171139 -0.943187 -0.312776
5 -0.285957 -0.734247 -1.44112 -0.176636 -0.473941 -0.168288 -0.147073 -0.312776
6 -0.346588 -0.757693 -1.44112 -0.187246 -0.504927 -0.171139 -0.943187 -0.312776
9 -0.379985 -0.265311 -1.44112 -0.191170 -0.497834 -0.171139 -0.943187 -0.312776
10 -0.362563 -0.792864 -1.44112 -0.189208 -0.502485 -0.171139 -0.943187 -0.312776

Create Labels


In [66]:
num_launch_site = reduced_satcat_df.num_launch_site
label_df = pd.DataFrame(num_launch_site)
display(label_df.head(5))


num_launch_site
NORAD
4 1
5 1
6 1
9 1
10 1

Graph Analysis

Weight matrix calculation

Distances

Using the features we calculated, we want to compute the distance between the feature vectors.

We used the Bray-Curtis distance. Because it is built from element-wise differences of the feature vectors, it allows comparing parameters that are not numeric, such as the list_features above, by giving each its relative importance. This is not well achieved by the Euclidean distance, for example.
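
For reference, the Bray-Curtis distance computed by scipy.spatial.distance.pdist between two feature vectors u and v is:

import numpy as np

def braycurtis(u, v):
    """d(u, v) = sum(|u_i - v_i|) / sum(|u_i + v_i|)"""
    u, v = np.asarray(u), np.asarray(v)
    return np.abs(u - v).sum() / np.abs(u + v).sum()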


In [67]:
distance_metric = "braycurtis"
distances = spatial.distance.squareform(spatial.distance.pdist(features_df, metric=distance_metric))

In [68]:
plt.hist(distances.reshape(-1), bins=100)
plt.show()



In [69]:
print('{} distances equal exactly zero.'.format(np.sum(distances == 0)))
print(len(distances))


2027 distances equal exactly zero.
1743

Weights

With the distances between the different feature vectors, we want to create a weight matrix that will form the basis of our graph analysis.
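
get_weights_from_distance is a helper function; a minimal sketch, assuming a Gaussian kernel whose width is taken as a percentile of the observed (nonzero) distances:

import numpy as np

def weights_from_distance_sketch(distances, width_percentile):
    kernel_width = np.percentile(distances[distances > 0], width_percentile)
    weights = np.exp(-distances**2 / kernel_width**2)  # close features -> high weight
    np.fill_diagonal(weights, 0)                       # no self-loops
    return weights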


In [70]:
KERNEL_WIDTH_PERCENTILE = 0.5
weights = get_weights_from_distance(distances, KERNEL_WIDTH_PERCENTILE)
# Validate that all the weights on the diagonal are null
print(np.sum(np.diagonal(weights)==0))
print(len(weights))


1743
1743

In [71]:
# Plot the resulting weight value
fix, axes = plt.subplots(1, 2, figsize=(17, 8))
plot_weight_hist(weights, axes[:], name="satcat_base_weight_hist")
plt.show()


Sparsify

We can see in the plot above that most weights are very close to zero. We want to limit the connectivity of the graph and therefore need to avoid edges with very low values.

To reduce the connectivity, we will keep only the neighbors with the strongest connections.

Then, we will filter the remaining weights to remove those too close to zero, by replacing all weights smaller than epsilon with the value 0.
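
sparse_weights is a helper function; a minimal sketch of the two steps, assuming a strongest-k-neighbors selection followed by the epsilon threshold (re-symmetrizing at the end keeps the graph undirected):

import numpy as np

def sparse_weights_sketch(weights, neighbors, epsilon):
    w = weights.copy()
    for row in w:
        row[np.argsort(row)[:-neighbors]] = 0  # keep the k strongest edges per node
    w[w < epsilon] = 0                         # drop the near-zero weights
    return np.maximum(w, w.T)                  # keep the matrix symmetric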


In [72]:
NEIGHBORS = 100
epsilon = 1e-8

weights = sparse_weights(weights, NEIGHBORS, epsilon)
fix, axes = plt.subplots(1, 2, figsize=(10, 8))
plot_weight_hist(weights, axes[:], name="satcat_sparse_weight_hist")


100%|██████████| 1743/1743 [00:02<00:00, 718.11it/s]
100%|██████████| 1743/1743 [00:02<00:00, 720.57it/s] 
100%|██████████| 1743/1743 [00:03<00:00, 493.46it/s]

We can see that the resulting matrix is much more sparse than the previous one. Also, we can clearly see clusters emerging in the dataset.

This is a good sign that graph analysis will be able to extract information from this weight matrix.

Graph analysis

To identify the clusters we seem to observe in the weight matrix, we need a more complete graph analysis. We will use networkx as the main tool for this step.


In [73]:
graph = create_graph_from_weights(weights, reduced_satcat_df)

We attributed to each node its NORAD ID number, so we can track it even if the node numbering changes, which will happen when we subdivide the graph into subgraphs.


In [74]:
print(graph.node[0])


{'NORAD': 4}

Draw the network

Let's first get an overview of the graph structure by visualizing it. We colored the two different launch sites in the graph to help with the analysis.

Note: Although the graph is colored, the actual algorithm has no notion of the colors; the coloring is simply there to help with the analysis.


In [75]:
fig, axes = plt.subplots(1,1, figsize=(10,5))
draw_graph(graph, axes, reduced_satcat_df, name="base_graph")
plt.show()


Filter the graph degree

We will first remove nodes with a small number of edges, because we can see that they don't really provide information on the clusters.

In the final analysis, we realized that it was better to only filter out "lonely nodes", i.e. nodes with no edges, because some nodes with only one edge are actually part of bigger clusters, so they carry meaningful information.


In [76]:
# Filter out lonely nodes
MINIMUM_DEGREE = 0 
print(len(graph.nodes()))
graph = remove_lonely_nodes(graph, minimum_degree = MINIMUM_DEGREE)
print(len(graph.nodes()))
# Do it again to remove resulting nodes with degree of 0
# Only if the minimum degree is greater than 0, otherwise 
# it would simply repeat the same thing
if MINIMUM_DEGREE > 0:
    graph = remove_lonely_nodes(graph, minimum_degree = 0)
print(len(graph.nodes()))


100%|██████████| 1743/1743 [00:00<00:00, 300937.38it/s]
1743
1631
1631


In [77]:
fig, axes = plt.subplots(1,1, figsize=(10,5))
draw_graph(graph, axes, reduced_satcat_df, name="cleaned_base_graph")
plt.show()


We can see that there are clearly some major clusters appearing in the graph. The next step will be to attempt to segregate those clusters and see if we can identify which launch site they came from, and with what precision.

Separate the graph into connected subgraphs

Although the graph seems connected in the figures, the weight matrix above tells us another story: there are clearly some subgraphs in there, even if we cannot see them.

In this step, we will separate the main disconnected graph into connected subgraphs.


In [78]:
connected_subgraphs = []
for subgraph in  nx.connected_component_subgraphs(graph):
    connected_subgraphs.append(nx.Graph(subgraph))
print(len(connected_subgraphs))


57

In [79]:
fig, axes = plt.subplots(1, 1, figsize=(10,5))
print_subgraphs_nodes_dist(connected_subgraphs, axes)
plt.show()


We note that there are a lot of small graphs, without many nodes.

For the visual analysis, we will only use a subset of the graphs, otherwise there would be too many plots.

We will keep the small graphs for the label analysis; this is why we save them in the "small_subgraphs" variable.


In [80]:
nodes_nbr = get_nodes_nbr(connected_subgraphs)
MINIMUM_NBR_OF_NODES = np.percentile(nodes_nbr,
                                     90)
small_subgraphs = [connected_subgraphs[index] for index, val in enumerate(nodes_nbr) if val < MINIMUM_NBR_OF_NODES]
kept_subgraphs = [connected_subgraphs[index] for index, val in enumerate(nodes_nbr) if val >= MINIMUM_NBR_OF_NODES]
print(len(kept_subgraphs))


6

In [81]:
fig, axes = plt.subplots(1, 1, figsize=(5,3))
print_subgraphs_nodes_dist(kept_subgraphs, axes)
plt.title("Kept subgraphs number of nodes")
plt.show()



In [82]:
print("Lets visualize the bigger subgraphs we kept")
print_subgraphs_network(kept_subgraphs, reduced_satcat_df, name = "subgraph_networks_before_clique")


Let's visualize the bigger subgraphs we kept
Separate the biggest graph into subgraphs

We see from the result above that most subgraphs are well separated, except for the largest one.

To improve the results, we will separate the bigger subgraphs into smaller ones.

To do so, we will split the bigger graphs into cliques.
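
get_graph_cliques is a helper function; a minimal sketch, assuming it keeps the maximal cliques above the size threshold (networkx provides a maximal-clique enumerator):

import networkx as nx

def graph_cliques_sketch(graph, smallest_clique):
    return [nx.Graph(graph.subgraph(clique))
            for clique in nx.find_cliques(graph)  # maximal cliques
            if len(clique) >= smallest_clique]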


In [83]:
MAXIMUM_SUBGRAPH_NODES_PERCENT = 0.20 # How many nodes you need to be a big subgraph,
                                      # relative to the initial graph number of nodes
maximum_subgraph_nodes = int(len(graph.nodes())*MAXIMUM_SUBGRAPH_NODES_PERCENT)

# Get the index of the big subgraphs
big_subgraphs_index = get_big_subgraphs_index(kept_subgraphs,
                                              maximum_subgraph_nodes
                                             )
print(big_subgraphs_index)


[0]

In [84]:
# We will not need the initial graph anymore
del graph

We see that there is only one big subgraph in this case, though our methodology would also apply if there were more.


In [85]:
# Segment the big subgraphs into cliques
SIZE_OF_SMALLEST_CLIQUE = 20           # Determine the minimum number of nodes in the smallest clique
                                       # i.e. no clique can have less than 20 nodes
clique_subgraphs = []
for subgraph_index in big_subgraphs_index:
    current_subgraph = kept_subgraphs[subgraph_index]
    current_subgraphs = get_graph_cliques(current_subgraph,
                                          smallest_clique=SIZE_OF_SMALLEST_CLIQUE
                                         )
    clique_subgraphs += current_subgraphs
print(get_nodes_nbr(clique_subgraphs))


[74, 357, 31, 24, 25, 29, 23, 21, 20, 22, 20]

In [86]:
print("Lets show the resulting clique subgraphs, after the big subgraph division")
print_subgraphs_network(clique_subgraphs[:6], reduced_satcat_df, name = "subgraph_networks_after_clique")


Let's show the resulting clique subgraphs, after the big subgraph division

We can see that nodes with different labels were effectively separated by the division into cliques.

Create a unified subgraph list

Now that we have many subgraphs, from the initial connected components and the clique division, we want to create a unified subgraph list that we will be able to use in the labeling.


In [87]:
# Construct a new subgraph list without the subgraphs that were
# divided into cliques, but with the cliques
kept_subgraphs_no_big = [subgraph for index, subgraph in enumerate(kept_subgraphs)
                         if not index in big_subgraphs_index]
print(get_nodes_nbr(kept_subgraphs))
print(get_nodes_nbr(kept_subgraphs_no_big))


[670, 136, 169, 96, 43, 131]
[136, 169, 96, 43, 131]

In [88]:
# Merge the subgraphs
subgraphs = clique_subgraphs + kept_subgraphs_no_big + small_subgraphs
print(get_nodes_nbr(subgraphs))
print(sum(get_nodes_nbr(subgraphs)))
print(len(subgraphs))


[74, 357, 31, 24, 25, 29, 23, 21, 20, 22, 20, 136, 169, 96, 43, 131, 14, 6, 6, 5, 2, 6, 2, 6, 2, 2, 5, 40, 7, 2, 2, 2, 4, 2, 23, 39, 5, 2, 2, 2, 7, 2, 31, 28, 2, 2, 3, 2, 2, 2, 17, 4, 2, 4, 5, 3, 4, 8, 5, 6, 4, 27, 5, 6, 3, 10, 4]
1607
67

Labeling of the nodes

Although we have used coloring in the graphs to show the launch site of each node, the algorithm has no information on the actual label to give to each node.

In this section, we will provide the algorithm with a subset of the nodes that have an actual label. From there, we will estimate the probability that each subgraph represents satellites launched from a particular launch site.

We will then identify the nodes of each subgraph in bulk, using the most probable launch site. If the probabilities are equal or unknown (i.e. no node in the subgraph has a label), we will mark the nodes' launch site as "unknown".

Finally, we will compare the results with the actual labels.

Create the labeled subset of nodes

First, we create an array with all the nodes, their label, and whether they were part of the labeled subset.

If a node is unidentified, its label will be -1.

The label is the numerical value given by num_launch_site.
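
create_labeled_df is a helper function; a minimal sketch, assuming it reveals the true label of a random fraction of the nodes and hides the others behind -1:

import numpy as np
import pandas as pd

def create_labeled_df_sketch(reduced_satcat_df, percent_of_labeled):
    real_label = reduced_satcat_df.num_launch_site
    is_labeled = np.random.rand(len(real_label)) < percent_of_labeled / 100
    return pd.DataFrame({"is_labeled": is_labeled.astype(int),
                         "label": np.where(is_labeled, real_label, -1),
                         "real_label": real_label},
                        index=real_label.index)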


In [89]:
PERCENT_OF_LABELED = 10
label_df = create_labeled_df(reduced_satcat_df, PERCENT_OF_LABELED) 
print(label_df.head(5))
print("Number of labeled sats: {}".format(label_df.is_labeled.sum()))
print("Total number of sats: {}".format(len(label_df)))


       is_labeled  label  real_label
NORAD                               
4               0     -1           1
5               0     -1           1
6               0     -1           1
9               0     -1           1
10              0     -1           1
Number of labeled sats: 174
Total number of sats: 1743

Calculate launch site probability of each subgraph

Depending on how many nodes were labeled in each subgraph, we give the subgraph a probability of containing a specific label. Those probabilities change depending on how many nodes were labeled.

Note: Subgraphs containing no identified nodes will have an empty probability dictionary.
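
get_label_probs is a helper function; a minimal sketch, assuming it counts the revealed labels inside a subgraph and normalizes the counts:

from collections import Counter

def label_probs_sketch(label_df, subgraph):
    norads = [subgraph.node[n]["NORAD"] for n in subgraph.nodes()]
    known = label_df.loc[norads]
    counts = Counter(known.label[known.is_labeled == 1])
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}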


In [90]:
subgraph_probabilities = [get_label_probs(label_df, subgraph) for subgraph in subgraphs]
print(subgraph_probabilities)


[{1: 1.0}, {2: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {2: 1.0}, {1: 1.0}, {2: 1.0}, {1: 1.0}, {2: 1.0}, {1: 1.0}, {}, {1: 1.0}, {1: 1.0}, {}, {1: 1.0}, {}, {}, {}, {}, {1: 1.0}, {1: 1.0}, {1: 1.0}, {}, {}, {}, {}, {}, {1: 1.0}, {1: 1.0}, {}, {}, {1: 1.0}, {}, {1: 1.0}, {}, {1: 1.0}, {}, {}, {}, {}, {}, {}, {}, {1: 1.0}, {}, {}, {}, {}, {}, {2: 1.0}, {2: 1.0}, {}, {2: 1.0}, {}, {2: 1.0}, {}, {2: 1.0}, {}, {2: 1.0}, {2: 1.0}]

Identify the nodes of each subgraph

Now we identify the nodes. The subgraphs are identified in bulk, i.e. every node of a subgraph gets the same label: the most probable one in the subgraph_probabilities dictionary.


In [91]:
label_df = identify_nodes_from_prob(label_df, subgraphs)
display(label_df.head())


is_labeled label real_label
NORAD
4 1 1 1
5 1 1 1
6 1 1 1
9 0 -1 1
10 1 1 1

In [92]:
nodes_nbr = [len(graph.nodes()) for graph in subgraphs]
sum(nodes_nbr)
print(len(nodes_nbr))
print(len(label_df))


67
1743

In [93]:
sns.distplot(label_df.label, kde=False, rug=True)
plt.title("Node label distribution, -1 is unknown")
plt.show()


Calculate the error of the labels

In the graph above, we can see that for a fixed fraction of labeled nodes, some nodes remain unknown while others are identified.

We want to know how those distributions behave depending on the fraction of labels we provide.

Also, we want to measure the error in the labels we obtained.


In [94]:
error = {}
values = range(0,101,1)
for percent_labeled in tqdm(values):
    labeled_df = get_labeled_df(reduced_satcat_df, subgraphs, percent_labeled)
    error[percent_labeled] = get_error_properties(labeled_df, reduced_satcat_df, show=False)

key_of_interest = [0, 1, 2, 5, 10, 20, 40, 60, 80, 100]
for key in key_of_interest:
    print("Fraction of labelized data: {} %, error: {}, unidentified:{}".format(key,
                                                                                error[key]["good_classification_percent"],
                                                                                error[key]["total_unknown_label"]))


100%|██████████| 101/101 [01:24<00:00,  1.20it/s]
Fraction of labeled data: 0 %, accuracy: 0.042328042328042326, unidentified: 1448
Fraction of labeled data: 1 %, accuracy: 0.6316137566137566, unidentified: 547
Fraction of labeled data: 2 %, accuracy: 0.8159630606860159, unidentified: 265
Fraction of labeled data: 5 %, accuracy: 0.8764783180026281, unidentified: 174
Fraction of labeled data: 10 %, accuracy: 0.9302022178734507, unidentified: 94
Fraction of labeled data: 20 %, accuracy: 0.939922480620155, unidentified: 83
Fraction of labeled data: 40 %, accuracy: 0.9769182782283219, unidentified: 28
Fraction of labeled data: 60 %, accuracy: 0.9926784624771202, unidentified: 6
Fraction of labeled data: 80 %, accuracy: 0.9964391691394658, unidentified: 2
Fraction of labeled data: 100 %, accuracy: 1.0, unidentified: 0


In [95]:
print_error_graph(error, "satcat_error_graph")


We can see that we obtain very few errors even with a low fraction of labeled nodes: from around 10% of labels the accuracy is above 90%, and from around 20% it is above 95%.

We can see that the "Unknown" and "Bad" labels decrease exponentially as we increase the fraction of labeled nodes.

In addition, the Total Label parameter gives the number of nodes that were labeled. It is not 100%, because a node is only labeled if it is part of the random selection of pre-labeled nodes or part of one of the subgraphs; nodes with no edges can therefore end up unlabeled.

Overall, this is a solid result and shows that the clustering methodology we used was efficient in this context.

Conclusion

This project allowed us to apply different tools learned during the course A Network Tour of Data Science to a concrete problem, performing analyses on acquired raw data. All the main steps of a typical data science project, such as data acquisition, data exploration and data exploitation, have been performed.
In the first part of the project, we succeeded in clustering the different satellite orbits. We mainly observed the Low Earth Orbit, geosynchronous and geostationary trajectories.
In the second part, we succeeded in identifying, using minimal labels, from which launch site, between the Air Force Eastern Test Range, Florida, USA and the Air Force Western Test Range, California, USA, satellites were launched.

If you want more information on the subject of space missions, a great course on the topic is given at EPFL during the spring semester by the astronaut Claude Nicollier.