Module 7- Data Sorting and Searching

Computer scripts excel at performing repetetive tasks that would normally be tedious or uninteresting to do by hand. Therer are many useful jobs that programs can perform, but in this module I will demonstrate three common data-processing techniques: sorting, searching, and manipulating. These jobs are fundamental functions of computer scripts and are encountered in nearly any field of computational data analysis.

Also, this module requires a strong familiarity and comprehension of indexing in Python. If you are not fully comfortable with list indices, you can find introductory material on Python lists online.

For this module I will be using AirMonitor's archived weather data from July 23, 2015 to July 23, 2016: https://www.wunderground.com/history/airport/KOAK/2015/7/23/CustomHistory.html?dayend=23&monthend=7&yearend=2016&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=&format=1


In [1]:
import csv
import io
import urllib.request            
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

url = 'https://www.wunderground.com/history/airport/KOAK/2015/7/23/CustomHistory.html?dayend=23&monthend=7&yearend=2016&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=&format=1'
response= urllib.request.urlopen(url)
reader = csv.reader(io.TextIOWrapper(response)) 
datalist = []
timedata = []
meantemp = []    
meanwind = []
rain = []
line = 0

for row in reader:
    if line != 0:
        datalist.append(row)  # intermediate step of piling data into one list because url is in a comma-delimited format.
    line += 1
     
for i in range(len(datalist)):
    if i !=0:
        timedata.append(datetime.strptime(datalist[i][0], '%Y-%m-%d'))
        meantemp.append(float(datalist[i][2]))
        meanwind.append(float(datalist[i][17]))
        rain.append(datalist[i][19])
        
data = np.array((timedata,meantemp,meanwind,rain))
    # now all the data is gathered in a multidimensional array in which the 1st column has dates, 2nd column 
    # has mean temperature, 3rd column has mean wind velocity, and 4th column has precipitation data.

In [2]:
def sort_func(type):
    # INPUT: type is a string, either 'temp,'wind', or 'rain' to determine how how the list array is sorted
    if type == 'temp':
        sorted_index = np.argsort(data[1]) # argsort outputs a sorted list of indices from lowest to highest for the (1+1)nd row
        sorted_data = data[:,sorted_index] # which is used to sort the columns in the multi-dimensional array
    elif type == 'wind':
        sorted_index = np.argsort(data[2]) # outputs a sorted list of indices from lowest to highest for the (1+2)rd row
        sorted_data = data[:,sorted_index]
    elif type == 'rain':
        sorted_index = np.argsort(data[3]) # outputs a sorted list of indices from lowest to highest for the (1+3)th row
        sorted_data = data[:,sorted_index]
    else:
        print('invalid input string')
    return sorted_data                     # module outputs the sorted_data

The command below prints the first 9 columns of sort_func('temp'), excluding the first row. I ignore the first row of dates because it clutters the array with datetime information, but you can change to the index to [:,:9] to include it. Finally, I utilize the repr function to print the data in a cleaner array format.

In order to make this information more concrete: notice how the data with the coldest days (1st row) generally correlate with days of rain.


In [3]:
print(repr(sort_func('temp')[1:,:9]))


array([[32.0, 40.0, 42.0, 42.0, 44.0, 44.0, 44.0, 44.0, 44.0],
       [19.0, 4.0, 5.0, 4.0, 6.0, 4.0, 5.0, 7.0, 7.0],
       ['0.30', 'T', '0.04', '0.00', '0.00', 'T', '0.00', 'T', '0.00']], dtype=object)

In [4]:
print(repr(sort_func('wind')[1:,:9]))


array([[53.0, 54.0, 48.0, 54.0, 58.0, 47.0, 56.0, 50.0, 61.0],
       [2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
       ['0.00', '0.00', 'T', '0.00', '0.25', '0.00', '0.00', '0.00', 'T']], dtype=object)

In [5]:
print(repr(sort_func('rain')[1:,:8]))


array([[65.0, 59.0, 56.0, 54.0, 59.0, 58.0, 59.0, 58.0],
       [13.0, 11.0, 11.0, 8.0, 14.0, 9.0, 6.0, 7.0],
       ['0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']], dtype=object)

Now that we've performed a basic sorting operation, let's make a function that searches in which rainfall is detected (including trace amounts of rainfall, 'T'). To do this, we can use a technique called list comprehension to compile indices that fulfill our requirements. The list comprehension command I use is:

indices = [i for i, target in enumerate(rainfall) if target > 0]

which breaks down to:

indices = []                               # initialize with an empty list
    for i, target in enumerate(my_list):
        if target > 0
            indices.append(i)

In [6]:
def search_func(data):
    rainfall = list(data[3:,].flatten())
    indices_trace = [i for i, target in enumerate(rainfall) if target == 'T']
    
    # Next, we replace indices where 'T' appears with 0 so it doesn't interfere with next search
    for index in indices_trace:
        rainfall[index] = 0
    
    # Lastly, we convert list to floating values so we can compare values them numerically
    rainfall = [float(j) for j in rainfall]           
    indices_rain = [i for i, target in enumerate(rainfall) if target > 0]
    
    # When we combine the two indices, we place T before the numerical values
    search_index = indices_trace + indices_rain
    return search_index

print(data[1:,search_func(data)])


[[65.0 71.0 68.0 67.0 66.0 61.0 66.0 67.0 64.0 63.0 55.0 46.0 44.0 54.0
  60.0 51.0 56.0 52.0 40.0 44.0 52.0 50.0 48.0 53.0 53.0 52.0 55.0 50.0
  49.0 56.0 59.0 60.0 62.0 51.0 58.0 56.0 55.0 57.0 63.0 62.0 61.0 62.0
  64.0 64.0 65.0 63.0 66.0 61.0 62.0 64.0 63.0 58.0 54.0 32.0 53.0 46.0
  50.0 56.0 51.0 52.0 52.0 46.0 54.0 48.0 42.0 55.0 52.0 51.0 53.0 50.0
  53.0 56.0 58.0 58.0 57.0 57.0 55.0 58.0 61.0 57.0 62.0 61.0 57.0 54.0]
 [8.0 11.0 8.0 14.0 12.0 3.0 7.0 9.0 4.0 6.0 8.0 7.0 4.0 5.0 5.0 5.0 20.0
  12.0 4.0 7.0 7.0 5.0 3.0 8.0 7.0 4.0 10.0 10.0 4.0 9.0 7.0 7.0 6.0 7.0
  6.0 17.0 10.0 13.0 8.0 13.0 14.0 13.0 9.0 8.0 14.0 13.0 9.0 13.0 10.0
  14.0 6.0 8.0 13.0 19.0 14.0 5.0 10.0 10.0 8.0 18.0 9.0 12.0 14.0 12.0 5.0
  8.0 11.0 13.0 12.0 9.0 5.0 6.0 10.0 3.0 12.0 13.0 14.0 9.0 18.0 12.0 8.0
  19.0 17.0 11.0]
 ['T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T'
  'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T'
  'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' 'T' '0.04' '0.31'
  '0.29' '0.30' '0.12' '0.05' '0.13' '0.30' '0.06' '0.59' '0.04' '0.13'
  '0.18' '0.14' '0.04' '0.08' '0.36' '0.51' '0.06' '0.02' '0.04' '0.02'
  '0.67' '0.25' '0.43' '0.15' '0.01' '0.01' '0.07' '0.07' '0.06' '0.86'
  '0.36' '0.18']]