Radio Frequency Interference: Identifying Nearby Sources

In the world of radio astronomy, observations of signals from extraterrestrial origins are often compounded by human-made sources. Many electronical sources, such as radio stations, cell phone towers, WiFi, even GPS satellites and your personal microwave oven, can interfere with radio astronomy. This tutorial combines usage of hardware with some code to identify nearby interference sources. This analysis is part of the larger practice of filtering out any human-made and Earth-based radio signals to better identify potential signals of interest from space.

For my experiment, I used a Realtek RTL2838 dongle as my hardware receiver. While I specifically looked for nearby radio stations within the limited scope of the hardware, the code can be applied with little modification, depending on the hardware used, to processing and identifying a wide range of other sources.

Ultimately, this script produces two outputs:

  1. A plot showing signal sources as peaks, which are marked according to a given dictionary of the sources' names. The plot is a spectrum of signal power against frequency
  2. A dictionary matching the peaking frequencies with their sources Familiarizing with the properties of human-made sources can be convenient for manually dismissing uninteresting signals.

From the Command Line: rtl_power

Before diving into python code, we first need to generate a file to store in the data from the dongle. The command to do that is rtl_power:

rtl_power min:max:bin filename.ext

min: initial frequency max: terminal frequency bin: frequency interval

The file extension I used is csv. There are additional parameters for further options; example:

rtl_power 87M:108M:1k -g 20 -i 10 -e 15m logfile.csv

Here, I set the gain to 20 and the runtime to 15 minutes, taking data in 10-second intervals. All the data is stored in a csv file logfile.csv. With our data in hand, open a python file with your favorite editor and begin the processing.

Modules to Import and Global Variables

Several modules are important for the basic functioning of this script:

  1. Numpy: essential for computing and number crunching
  2. Matplotlib.pyplot: useful for plotting
  3. csv: used to generate the raw data, which is stored as a csv file, into a workable text file. The appropriate modules, depending on the format of the raw data file and the format of the desired data conversion, should be used here
  4. OrderedDict from collections: outputs any dictionary with the entries in the order that you entered them. This will be useful later for ordering peaks, but is useful in general when dealing with dictionaries

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import csv
from collections import OrderedDict
from FMstations import *

raw_data = np.genfromtxt("stationsdata.csv", delimiter = ",", dtype = None)

MINFREQ = 87900000
MAXFREQ = 107900000 
FREQ_BIN = raw_data[0][4]
INTERVAL = 10 
TOTAL_TIME = 900 
SCAN = TOTAL_TIME / INTERVAL 
DATAPOINTS = MAXFREQ - MINFREQ / FREQ_BIN

FMstations is a separate .py file storing a dictionary to be used in a later function. While dictionaries can be included in the executing script without functional issues, their potential large size warrants storage in a separate file.

stationsdata.csv is the file containing the raw data, an array of lines. Each line is an array itself, consisting of a number of power values corresponding to the sampling rate. Specifically for my experiment, the sampling rate is 2.5 MHz and each eight lines constitute a scan of the bandwidth of interest, the 20 MHz band between 87.9 and 107.9 MHz. U.S. radio stations are allocated broadcast frequencies in that range, and these values should be changed according to the experiment at hand. Data is collected every 10 seconds for every 610.35 (raw_data[0][4]) frequency difference in that range over a total of 900 seconds.

The raw data file consists of 720 lines because the scan is performed 90 times, the total number of seconds divided by the scan interval. Each line unfortunately comes with an initial six outputs, such the date of when the data was taken, that we have no interest of for purposes of data processing and plotting. Those outputs would have to be truncated.

Processing Raw Data

First, we have to idenfity the frequency range we are interested in nd have it be consistent with the type and number of data in the file for plotting. There are DATAPOINTS number of data (32776 for my experiment) for each scan, with each line having SAMP_RATE / FREQ_BIN number of data (4107).


In [3]:
def total_freq(samp_rate, line):
    """total frequency band of the data

    samp_rate: sample rate used to collect the data
    freq_bin: difference in frequency between each data point
    line: number of iterations to create the band;
    each sampling may sample over only a part of the desired bandwidth
    """
    if samp_rate < (MAXFREQ - MINFREQ):
        allfreq = []  
        i = 0
        while i < line:
            freqband = np.arange(MINFREQ + (samp_rate * i), MINFREQ + (samp_rate * (i+1)), FREQ_BIN)
            allfreq = np.append(allfreq, freqband)
            i += 1
        return allfreq
    else:
        return np.arange(MINFREQ, MAXFREQ, FREQ_BIN)

TotalBand = total_freq(raw_data[0][3] - raw_data[0][2], 8)

There are many ways to go about processing the raw data values to correspond with our frequency range. Since my data came in the form of an array of arrays, that's how I initially kept my data as, though with the fir several unneeded elements truncated from each line. The follow code does just that.


In [4]:
def power_total(data, line, trunc):
    """returns an array of arrays; all power values from the scans

    data: input raw data; usually an array of arrays
    line: number of lines of data corresponding to a full scan over the bandwidth
    trunc: truncate number of, if any, of unneeded data, such as time, at the start of each line
    """
    P_tot = []
    j = 0
    while j < SCAN:
        P_band = []
        i = 0
        while i < line:
            P_band = np.append(P_band, data[j*line+i].tolist()[trunc:])
            i += 1
        P_tot.append(P_band)
        j += 1
    return P_tot

total_data = power_total(raw_data, 8, 6)

Unlike the raw data, however, each inner array now consists of a full scan of the relevant band (as opposed to each line in the raw data corresponding to only an eighth of the band). For my data, there's now only 90 inner arrays, each a scan at a specific time, down from 720 lines in the raw data.

Having this result is useful for, as an example, plotting multiple graphs to see the time-evolution of the signal. For our immediate purposes, we look at the time-averaged values to produce a single plot.


In [5]:
def power_avg(data, xax=0, yax=1):
    """returns the time-average power values of each index-sharing set of numbers

    data: input data; usually an array of arrays
    """
    indices_arr = np.swapaxes(data, xax, yax)
    y = []
    for index in indices_arr:
        y.append(np.average(index))
    return y

avg_data = power_avg(total_data)

By swapping the axes of the array, I can put all the index-sharing values in the same inner array. For example, the first value of each of the scans are now in the first inner array and the second values of all the scans are in the second inner array. Averaging the values in each of those arrays would now be a simple task.

We now have a single array of values ripe for plotting.

Spectrum Improvement and Plotting

Let's see a visual of our plot of the average power values against our frequency band.


In [8]:
plt.figure(figsize=(15,5))
plt.title('Time-Averaged Radio Spectrum over 90 Scans')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.plot(TotalBand, avg_data)
plt.show()


We have a functional plot, but it's riddled with issues associated with signal process. We need to flatten the noise floor, remove the noisy peaks, and smooth out the graph so that we can systematically identify the peaks, which correspond to frequencies which signals are received at. Power is in units of dBm, decibels referencing milliwatts. Online resources have more in-depth explanation of the bel unit and power representation in signal processing.

To flatten the noise floor, we have the libery of choosing the agent for our task. We have to divide our plot by a fitted curve or any set of values that can accomplish the task.


In [9]:
def Flatten(spec, flatter, n):
    """flattens the noise level of our original plot

    dataSpec: input spectrum to be flatted
    flatter: an array derived from the input spectrum used to divided by the spectrum
    n: half the number of a set of points that the median is taken over
    """
    MED = []
    i = 0
    for x in flatter:
        if i < n:
            MED.append(np.median(flatter[0 : i + n + 1])) 
            i += 1 
        elif i >= (len(spec) - (n + 1)):
            MED.append(np.median(flatter[i - n :]))
            i += 1
        else:
            MED.append(np.median(flatter[i - n : i + n + 1]))
            i += 1
    Spectrum = np.array(spec) / np.array(MED) * np.average(spec)
    return Spectrum

Fortunately for my data, the seventh arc as seen above is devoid of peaks so it's convenient to apply that arc to my entire plot to be divided by. The Flatten function takes the median of a certain number of points to remove, for instance, the few noisy peaks seen on the arc. The fitted arc is then swiftly applied to flatten the noise floor.


In [11]:
FlatSpec = Flatten(avg_data, avg_data[24582:28679] * 8, 10)

plt.figure(figsize=(15,5))
plt.title('Time-Averaged Radio Flattened Spectrum over 90 Scans')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.plot(TotalBand, FlatSpec)
plt.show()


The code, though looks nicer, is still pretty rough due to all the noise. This can be simply solved by applying a median averaging across the data; the noisy peaks that correspond to a single frequency would thus be filtered out.


In [13]:
def Reduce(spec, n):
    """takes the median of a set of points, removing noise-produced peaks and much noise
    
    n: half the number of a set of points that the median is taken over
    """
    R = []
    i = 0
    for x in spec:
        if i < n:
            R.append(np.median(spec[0 : i + n + 1]))
            i += 1
        elif i >= 32769:
            R.append(np.median(spec[i - n :]))
            i += 1
        else:
            R.append(np.median(spec[i - n : i + n + 1]))
            i += 1
    return R

ReduSpec = Reduce(FlatSpec, 10)

plt.figure(figsize=(15,5))
plt.title('Noise-Reduced, Time-Averaged Radio Spectrum')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.plot(TotalBand, ReduSpec)
plt.show()


This is seemingly our desired plot, but upon closer inspection...


In [22]:
plt.figure(figsize=(15,5))
plt.title('Second Highest Peak Zoomed-In')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.ylim(-43,-42)
plt.plot(TotalBand[14000:14200], ReduSpec[14000:14200])
plt.show()


Despite the plot looking great from afar, it's still rather rugged even at the peaks upon closer inspection. The roughness can make it difficult to mathematically identify peaks. What we need is a smoothing. The process of smoothing a function over another (convolution) is rooted in mathematical rigor, so we do not go in depth there. Essentially, a smoothing function, traditionally called the kernel, is 'applied' and moved along our spectrum. There are various ways to implement convolution and thankfully numpy has built-in function to make the job simpler.


In [16]:
def smooth(spec, win_len, window, beta = 20):
    """smooths a signal with kernel window of win_len number of inputs
    
    signal: Input data spectrum to be smoothed
    win_len: the size of the kernel window used to smooth the input
    window: type of kernel; e.g. 'blackman'
    """
    if window == 'kaiser':
        w = eval('np.'+window+'(win_len, beta)')
    elif window == 'flat':
        w = np.ones(len(win_len, 'd'))
    else:
        w = eval('np.'+window+'(win_len)')
    s = np.r_[spec[win_len-1 : 0 : -1], spec, spec[-1 : -win_len : -1]]
    y = np.convolve(w/w.sum(), s, mode='valid')
    return y[(int(win_len / 2) - 1) : (int(-win_len / 2))]

The code evaluates based on what type of smoothing we're using. Numpy has a number of built-in smoothing functions. For my plots, I used the kaiser kernel, which is derived from Bessel functions. If flat smoothing is chosen, then our plot is smoothed by a line. The result is a much smoother graph, with the side effect of reduced power at every point.


In [23]:
SmooSpec = smooth(ReduSpec, 150, 'kaiser')

plt.figure(figsize=(15,5))
plt.title('Smoothed, Time-Averaged Radio Spectrum')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.plot(TotalBand, SmooSpec)
plt.show()



In [27]:
plt.figure(figsize=(15,5))
plt.title('Second Highest Peak Zoomed-In')
plt.xlabel('MHz')
plt.ylabel('Power (dBm)')
plt.ylim(-43.5,-42.5)
plt.plot(total_freq(8)[14000:14200]/1000000, SmooSpec[14000:14200],8)
plt.show()


  File "<ipython-input-27-caf301213110>", line 6
    plt.plot(total_freq(8)([14000:14200]/1000000, SmooSpec[14000:14200]))
                                 ^
SyntaxError: invalid syntax

We finally have our smoothed plot and can systematically identify characteristics with numpy without dealing with much of the interference due to rough data patterns.

Identifying and Marking Peaks and their Frequencies

We can isolate the peak frequencies with simple methods.


In [13]:
def get_peaks(spec, threshold, xreducer, rounder=2):
    """identifies the peaks of a plot. Returns an array of 2 lists:

    1. the indices of the frequencies corresponding to the peaks;
    2. said frequencies, divided by xreducer for simpler units and rounded to rounder decimals
   
    spec: input data spectrum
    threshold: only data above which are taken into account to ignore the noise level
    """
    Peaks = []
    spec = spec.tolist()
    for i in np.arange(len(spec)):
        if spec[i] > threshold and spec[i] > spec[i-1] and spec[i] > spec[i+1]:
            Peaks.append(spec[i])
        else:
            continue
    Ordered_Indices = []
    while True:
        if np.array(Peaks).tolist() == []:
            Ordered_Freq = [(x * FREQ_BIN + MINFREQ) for x in Ordered_Indices]
            Reduced_Freq = np.around((np.array(Ordered_Freq) / xreducer), rounder)
            return [Ordered_Indices, Reduced_Freq.tolist()]
        elif len(Peaks) == 1:
            Ordered_Indices.append(spec.index(Peaks[0]))
            Peaks = np.delete(Peaks, 0)   
        else:
            Ordered_Indices.append(spec.index(np.amax(Peaks)))
            Peaks = np.delete(Peaks, np.array(Peaks).tolist().index(np.amax(Peaks)))

THe function iterates over all the values in the spectrum and stores any frequency (and its associated index) of the peaks (values above a certain threshold that's a local maxima) in arrays. The indices will be useful for corresponding the frequencies with their peak location. The majority of the function is ordering the frequencies in descending order of the peak intensity by iterating over the array of peaks and determining each's corresponding peak value. The ordered list is then rounded to one decimal and is in units of MHz or in a desired unit as determined by xreducer.

With a list of peaks, we can use them to physically mark their sources on our plot, provided a dictionary of the sources.


In [14]:
def mark_peaks(src_dict, spec, threshold, line, title, xreducer, error=.01, bound1='left', bound2='bottom', rot=90):
    """returns both a plot and a dictionary
    
    plot: shows the stations next to the marked peaks
    dictionary: matches the relevant peak frequencies with the corresponding station(s)
    
    src_dict: input dictionary of frequencies and stations from which the results are selected from
    spec: input spectrum data
    threshold: only data above which are taken in account to ignore the noise level
    title: title for the plot
    xreducer: the values of the x-axis divided by which to simpler units
    error: within which the obtained frequencies are acceptable as equivalent to that of a station
    remaining parameters: used the adjust the markings and labels of the plots
    """
    stations = []
    peakfreqs = []
    stations_i = []
    peaker = get_peaks(spec, threshold, xreducer)
    p0 = peaker[0]
    p1 = peaker[1]
    for i in np.arange(len(p1)):
        if p1[i] in src_dict.keys():
            stations.append(src_dict[p1[i]])
            peakfreqs.append(p1[i])
            stations_i.append(p0[i])
        else:
            for x in np.arange(p1[i]-error, p1[i]+error, error):
                if x in src_dict.keys():
                    stations.append(src_dict[x])
                    peakfreqs.append(p1[i])
                    stations_i.append(p0[i])
                else:
                    continue
    peaks = [spec[y] for y in stations_i]
    plt.figure(figsize=(15,5))
    plt.title(title)
    plt.xlabel('Frequency (MHz)')
    plt.ylabel('Reduced Power (dBm)')
    yoffset = (np.amax(spec) - np.amin(spec)) / 4
    plt.ylim(np.amin(spec) - yoffset, np.amax(spec) + yoffset)
    plt.plot(np.array(total_freq(line)) / 1000000, spec)
    plt.scatter(peakfreqs, peaks, marker = 'o', color = 'r', s = 40)
    text_bounds = {'ha':bound1, 'va':bound2}
    for i in np.arange(len(peakfreqs)):
        plt.text(peakfreqs[i], peaks[i] + (yoffset / 10), stations[i], text_bounds, rotation=rot)
    plt.savefig('stations_peaks.pdf')
    plt.show()
    stations_dict = OrderedDict()
    for i in np.arange(len(stations)):
        stations_dict[peakfreqs[i]] = stations[i]
    return stations_dict

It's a rather long function, but a lot of it is making adjustments to the final plot. The function requires an input [ython dictionary, so that dictionary either has to be from an established source online or be manually made. For my code, I made a dictionary of all Bay Area FM Radio Stations (BAFMRS) with their corresponding frequencies as the dictionary keys. For example, an entry in the dictionary is '88.5: KQED' with 88.5 MHz being the transmitting frequency and KQED being the name of the station.

A radio station signal comes in the form of a primary analog signal (almost all the peaks you see on the plot) and two digital signals on the side; only the stronger signals, such as the one corresponding to the leftmost peak, visibly show the digital parts. This function naturally filters out almost all digital peaks because their frequencies do not correspond to any of that in a dictionary of radio station signals. Unfortunately, it's not easy to feasibly filter multiple stations broadcasting the same frequency with this level of coding. Naturally, however, there won't be many of these cases at any given location.

The final plot, all processed and labeled:


In [15]:
print(mark_peaks(BAFMRS, SmooSpec, -54, 8, 'Bay Area FM Radio Stations', 1000000))


OrderedDict([(88.5, 'KQED'), (96.5, 'KOIT'), (98.1, 'KISQ'), (94.1, 'KPFA'), (97.3, 'KLLC'), (94.9, 'KYLD'), (102.1, 'KRBQ'), (101.3, 'KIOI'), (89.7, 'KFJC'), (89.3, ['KPFB', 'KOHL', 'KMTG']), (91.7, 'KALW'), (90.7, 'KALX'), (106.9, 'KFRC'), (93.31, 'KRZZ'), (99.7, 'KMVQ'), (92.7, 'KREV'), (106.1, 'KMEL')])

We got our final result. This particular approach is useful for identifying nearby signal sources for any purpose. Another approach uses not the average power values, but all the power values from all the scans. Visualizing the time-evolution can be useful for other purposes, such as noting the fluctuations of the peaks' strengths or finding sudden peaks arising at certain moments. The "waterfall" plot displays, in this case, intensity (power) with color codes and has time instead on the y-axis.


In [28]:
def waterfall(data, line, flat1, flat2, n, win_len, window, title, axesrange, gridshape='auto'):
    """returns a waterfall grid consisting off all the spectra from the input 

    data: an array of arrays
    line: number of arrays that make up one full scan of the band
    flat1, flat2: boundaries of the data points from each array used for flattening the spectrum
    n: half the number used to take medians of the spectra for noise-reducing
    win_len: size of kernel window for smoothing
    window: type of kernel
    title: title of grid
    axesrange: boundaries of the values of the grid
    """
    wf = []
    i = 1
    while i <= len(data):
        flatter = data[-i][flat1:flat2]
        flatspec = Flatten(data[-i], np.array(flatter.tolist() * line), n)
        reduspec = Reduce(flatspec, n)
        smoospec = smooth(reduspec, win_len, window)
        wf.append(smoospec)
        i += 1
    fig = plt.figure(figsize=(15, 5))
    ax = fig.add_subplot(111)
    ax.set_title(title)
    ax.set_xlabel('Frequency (MHz)')
    ax.set_ylabel('Time (s)')
    plt.imshow(wf, extent=axesrange)
    ax.set_aspect(gridshape)
    plt.colorbar(orientation='vertical')
    plt.show()

waterfall(total_data, 8, 24582, 28679, 10, 150, 'kaiser', 'Radio Spectra Waterfall', [87.9, 107.9, 0, 900])


Within the function, it's important to note that all the spectra processing significantly reduces the power at every frequency -- the raw_data has the highest peak at at around -25 dBm whereas the final smoothed spectrum has the peak at around -35 dBm. This may cause discrepancies if the goal is to visualize unadulterated intensity, but for our purposes the relative power among the peaks and between them and the noise are more important. This function satisfies our goals. It's important to also note that plotting many spectra takes quite a bit of time for your computer to run. This is unfortunately unavoidable and large amounts of data, such has thousands of scans (our has 90 scans), can take up to hours to run.

Nonetheless, we have our waterfall plot and our averaged spectrum displaying the signal sources. Both such plots are potent ways to visualize data transmission in the world of signal processing and radio frequency interference.