Generating Features from GeoTiff Files

From GeoTiff Files available for India over a period of more than 20 years, we want to generate features from those files for the problem of prediction of district wise crop yield in India.

Due to gdal package, had to make a separate environment using conda. So install packages for this notebook in that environment itself. Check from the anaconda prompt, the names of all the envs are available:

$ conda info --envs 
$ activate env_name

In [1]:
from osgeo import ogr, osr, gdal

import fiona
from shapely.geometry import Point, shape

import numpy as np
import pandas as pd

import os
import sys
import tarfile
import timeit
  • For Windows
    base_ = "C:\Users\deepak\Desktop\Repo\Maps\Districts\Census_2011"
    
  • For macOS
    base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
    

In [2]:
# Change this for Win7,macOS
bases = "C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp"
# base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
fc = fiona.open(bases)

In [3]:
def reverse_geocode(pt):
    for feature in fc:
        if shape(feature['geometry']).contains(pt):
            return feature['properties']['DISTRICT']
    return "NRI"

In [4]:
# base = "/Users/macbook/Documents/BTP/Satellite/Data/Sat" # macOS
base = "G:\BTP\Satellite\Data\Test"  # Win7

In [6]:
def extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    if os.path.isdir(os.path.join(base,root)) and not force:
        # You may override by setting force=True.
        print('%s already present - Skipping extraction of %s' % (root, filename))
    else:
        print('Extracting data for %s' % root)
        tar = tarfile.open(os.path.join(base,filename))
        sys.stdout.flush()
        tar.extractall(os.path.join(base,root))
        tar.close()

In [7]:
# extracting all the tar files ... (if not extracted)
for directory, subdirList, fileList in os.walk(base):
    for filename in fileList:
        if filename.endswith(".tar.gz"): 
            d = extract(filename)


LE07_L1TP_146039_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146039_20101223_20161211_01_T1.tar.gz
LE07_L1TP_146041_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146041_20101223_20161211_01_T1.tar.gz

In [8]:
directories = [os.path.join(base, d) for d in sorted(os.listdir(base)) if os.path.isdir(os.path.join(base, d))]
# print directories

In [9]:
ds = gdal.Open(base + "\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF")

Prepare one ds variable here itself, for the transformation of the coordinate system below.


In [25]:
# get the existing coordinate system
old_cs= osr.SpatialReference()
old_cs.ImportFromWkt(ds.GetProjectionRef())

# create the new coordinate system
wgs84_wkt = """
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.01745329251994328,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]"""
new_cs = osr.SpatialReference()
new_cs.ImportFromWkt(wgs84_wkt)

# create a transform object to convert between coordinate systems
transform = osr.CoordinateTransformation(old_cs,new_cs)

In [26]:
type(ds)


Out[26]:
osgeo.gdal.Dataset

In [11]:
def pixel2coord(x, y, xoff, a, b, yoff, d, e):
    """Returns global coordinates from coordinates x,y of the pixel"""
    xp = a * x + b * y + xoff
    yp = d * x + e * y + yoff
    return(xp, yp)

In [12]:
ricep = pd.read_csv("C:\Users\deepak\Desktop\Repo\BTP\Ricep.csv")
ricep = ricep.drop(["Unnamed: 0"],axis=1)
ricep["value"] = ricep["Production"]/ricep["Area"]
ricep.head()


Out[12]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 X3 X4 value
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 643.720 881.473 2.765971
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 767.351 643.720 2.949004
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 579.338 767.351 2.907993
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 540.070 579.338 2.282013
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 819.700 564.500 2.797985

New features


12 months (Numbered 1 to 12)

10 TIF files (12 for SAT_8)

Mean & Variance


In [13]:
a = np.empty((ricep.shape[0],1))*np.NAN

The value of n is going to b same for all the 10 Band files of a month hence ame for all the 20 features at a time. Can reduce the number 480 here using this.


In [14]:
""" 'features' contain collumn indexes for the new features """
""" 'dictn' is the dictionary mapping name of collumn index to the index number """
features = []
dictn = {}
k = 13
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"M")
        features.append(s+"V")
        dictn[s+"M"] = k
        dictn[s+"V"] = k+1
        k = k+2

In [15]:
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"Mn")
        features.append(s+"Vn")

In [16]:
len(features)


Out[16]:
480

In [17]:
tmp = pd.DataFrame(index=range(ricep.shape[0]),columns=features)
ricex = pd.concat([ricep,tmp], axis=1)

In [18]:
ricex.head()


Out[18]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 ... 12_B6_Mn 12_B6_Vn 12_B7_Mn 12_B7_Vn 12_B8_Mn 12_B8_Vn 12_B9_Mn 12_B9_Vn 12_B10_Mn 12_B10_Vn
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 493 columns


In [19]:
k = 10
hits = 0
times = [0,0,0,0,0,0,0,0]
nums = [0,0,0,0,0,0,0,0]

bx = False

In [20]:
stx = timeit.default_timer()

for directory in directories:
    
    if bx: continue
    else: bx = True
    
    dictx = {}
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            
            # make sure not going into the extra folders
            
            if filename.endswith(".TIF"):
                
                if filename[-5] == '8': continue
                #--------------------------------------------------------------------------------------
                # Check for a single iteration. Which step takes the longest time.
                # improve that step
                
                # Do it all for B1.tif only.
                # for others save the row indexes found in B1's iteration
                # so dont have to search the dataframe again and again.
                # but keep track of the pixels which were not found, so have to skip them too
                # so that correct pixel value goes to the correct row in dataframe
                # have to traverse the tiff file to get the pixel values
                
                # what all steps are common for all the 10 tif files
                # the district search from the pixel's lat,lon coordinates
                # the row index search using district and the year
                # the same n is read and wrote for all the ...
                
                # If nothing works, maybe we can reduce the number of features, by just visiting 
                # First 5 TIF files for each scene.
                #--------------------------------------------------------------------------------------
                
                
                print os.path.join(directory,filename)
                
                ds = gdal.Open(os.path.join(directory,filename))
                if ds == None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """
                
                for i in range(0,col,col/k):
                    for j in range(0,row,row/k):
                        
                        st = timeit.default_timer()
                        ########### fetching the lat and lon coordinates 
                        x,y = pixel2coord(i, j, xoff, a, b, yoff, d, e)
                        lonx, latx, z = transform.TransformPoint(x,y)
                        times[0] += timeit.default_timer() - st
                        nums[0] += 1
                        
                        
                        st = timeit.default_timer()
                        ########### fetching the name of district
                        
                        district = ""
                        #----------------------------------------------------------
                        if filename[-5] == '1':
                            point = Point(lonx,latx)
                            district = reverse_geocode(point)
                            dictx[str(lonx)+str(latx)] = district
                        else:
                            district = dictx[str(lonx)+str(latx)]
                        #----------------------------------------------------------
                        times[1] += timeit.default_timer() - st
                        nums[1] += 1
                        
                        if district == "NRI": continue
                        
                        
                        st = timeit.default_timer()
                        ########### Locating the row in DataFrame which we want to update
                        district = district.lower()
                        district = district.strip()
                        r = ricex.index[(ricex['ind_district'] == district) & (ricex['Crop_Year'] == int(year))].tolist()
                        times[3] += timeit.default_timer() - st
                        nums[3] += 1
                        
                        
                        if len(r) == 1:
                            
                            st = timeit.default_timer()
                            ########### The pixel value for that location
                            px,py = i,j
                            pix = ds.ReadAsArray(px,py,1,1)
                            pix = pix[0][0]
                            times[2] += timeit.default_timer() - st
                            nums[2] += 1
                            
                            
                            st = timeit.default_timer()
                            """ Found the row, so now .."""
                            """ Find Collumn index corresponding to Month, Band """
                            hits = hits + 1
                            #print ("Hits: ", hits)
                            ####### Band Number ########
                            band = filename.split("\\")[-1].split("_")[7:][0].split(".")[0][1]
                            bnd = band
                            if band == '6':
                                if filename.split("\\")[-1].split("_")[7:][2][0] == '1':
                                    bnd = band
                                else:
                                    bnd = '9'
                            elif band == 'Q':
                                bnd = '10'
                            
                            sm = month + "_B" + bnd +"_M"
                            
                            cm = dictn[sm]
                            
                            r = r[0]
                            # cm is the collumn indexe for mean
                            # r[0] is the row index
                            times[4] += timeit.default_timer() - st
                            nums[4] += 1
                            
                            
                            ##### Checking if values are null ...
                            valm = ricex.iloc[r,cm]
                            if pd.isnull(valm): 
                                
                                st = timeit.default_timer()
                                ricex.iloc[r,cm] = pix
                                ricex.iloc[r,cm+1] = pix*pix
                                ricex.iloc[r,cm+240] = 1
                                times[5] += timeit.default_timer() - st
                                nums[5] += 1
                        
                                continue
                                
                                
                            st = timeit.default_timer()
                            ##### if the values are not null ...
                            valv = ricex.iloc[r,cm+1]
                            n = ricex.iloc[r,cm+240]
                            n = n+1
                            times[6] += timeit.default_timer() - st
                            nums[6] += 1
                        
                            
                            
                            st = timeit.default_timer()
                            # Mean & Variance update
                            ricex.iloc[r,cm] = valm + (pix-valm)/n
                            ricex.iloc[r,cm+1] = ((n-2)/(n-1))*valv + (pix-valm)*(pix-valm)/n
                            ricex.iloc[r,cm+240] = n
                            
                            times[7] += timeit.default_timer() - st
                            nums[7] += 1
                        
                            
                        
                            #print ("No match for the district " + district + " for the year " + year)
                            
                            
elapsed = timeit.default_timer() - stx
print (elapsed)
print "Seconds"


G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:161: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:140: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:160: RuntimeWarning: overflow encountered in ubyte_scalars
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B3.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B4.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B5.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_1.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B7.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_BQA.TIF
280.835855605
Seconds
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:161: RuntimeWarning: overflow encountered in ushort_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:140: RuntimeWarning: overflow encountered in ushort_scalars

Overflow is happening because the pixel value is returned in small int type. Need to convert it into int type.


In [21]:
print hits


360

Calculating the time per iteration of each code block in the huge loop above helped in recognizing the culprit


In [22]:
print times
print nums


[0.014337681040335185, 278.56343599866284, 0.018627583254261726, 1.5887563573121126, 0.002408005960389481, 0.17163055146581163, 0.04105598833821, 0.3787758911972565]
[1089, 1089, 360, 1062, 360, 108, 252, 252]

In [23]:
for i in range(8):
    x = times[i]/nums[i]
    print (str(i) + ": " + str(x))


0: 1.31659146376e-05
1: 0.25579746189
2: 5.17432868174e-05
3: 0.00149600410293
4: 6.68890544553e-06
5: 0.00158917177283
6: 0.000162920588644
7: 0.00150307893332

Observations:

  • [1] & [2] are the most time consuming
  • [1] is reverse geocoding
  • [2] is pixel value extraction
  • move [2] inside the if condition to check if the row exists
    • only then we need the pixel value
  • deal with [1] using dictionary

In [23]:
ricex.describe()


Out[23]:
Crop_Year Area Production phosphorus X1 X2 X3 X4 value
count 1931.000000 1931.000000 1.931000e+03 1931.000000 1.931000e+03 1.931000e+03 1931.000000 1931.000000 1931.000000
mean 2004.201968 69936.771103 1.556052e+05 0.557224 1.588625e+05 1.567246e+05 861.667570 871.507723 1.950687
std 3.631699 81619.498004 2.151250e+05 0.665363 2.176855e+05 2.170873e+05 476.198957 458.603987 1.102191
min 1999.000000 1.000000 0.000000e+00 0.000000 1.000000e+00 1.000000e+00 76.944000 108.800000 0.000000
25% 2001.000000 6392.500000 6.326000e+03 0.000000 7.545500e+03 6.987500e+03 592.700000 607.830000 1.021009
50% 2005.000000 41040.000000 7.190000e+04 0.000000 7.574800e+04 7.204800e+04 765.714000 773.600000 1.882997
75% 2007.000000 111052.000000 2.333050e+05 1.000000 2.372640e+05 2.285185e+05 1023.702000 1045.679500 2.645009
max 2010.000000 545965.000000 1.637000e+06 2.000000 1.710000e+06 1.710000e+06 4755.700000 4076.200000 9.886125

In [22]:
ricex.to_csv("ricex_test1.csv")

In [23]:
fc


Out[23]:
<open Collection 'C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp:Dist', mode 'r' at 0x41ebf28L>

In [6]:
fc.schema


Out[6]:
{'geometry': 'Polygon',
 'properties': OrderedDict([(u'DISTRICT', 'str:28'),
              (u'ST_NM', 'str:24'),
              (u'ST_CEN_CD', 'int:9'),
              (u'DT_CEN_CD', 'int:9'),
              (u'censuscode', 'float:14')])}

In [7]:
fc.crs


Out[7]:
{'init': u'epsg:4326'}

In [8]:
len(fc)


Out[8]:
641

641 Districts in India in total


In [24]:
import timeit

In [27]:
a = timeit.default_timer()
for i in range(1000000):
    j = 1
b = timeit.default_timer() - a

In [28]:
b


Out[28]:
0.119725337737691

DOUBT: Why is the size of LANDSAT 7 file smaller than LANDSAT 8 file? Inspite of the fact that, the number of pixels is equal in the band files for both cases. Investigate this ...

To calculate the relative dimensions of all 10 band files for a scene. Whatever the dimension, its same for all 10 files, except Band 8, who has 4 times the pixels as any other file


In [10]:
for directory in directories:
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    print "LANDSAT {},  MONTH: {}, YEAR: {}".format(satx,month,year)
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            if filename.endswith(".TIF"):
                print filename.split("\\")[-1].split("_")[7:]
                ds = gdal.Open(os.path.join(directory,filename))
                if ds == None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                print "Col: {0:6},  Row:{1:6}".format(col,row)
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """


LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   8161,  Row:  7221
['B2.TIF']
Col:   8161,  Row:  7221
['B3.TIF']
Col:   8161,  Row:  7221
['B4.TIF']
Col:   8161,  Row:  7221
['B5.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '1.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '2.TIF']
Col:   8161,  Row:  7221
['B7.TIF']
Col:   8161,  Row:  7221
['B8.TIF']
Col:  16321,  Row: 14441
['BQA.TIF']
Col:   8161,  Row:  7221
LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   7931,  Row:  6961
['B2.TIF']
Col:   7931,  Row:  6961
['B3.TIF']
Col:   7931,  Row:  6961
['B4.TIF']
Col:   7931,  Row:  6961
['B5.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '1.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '2.TIF']
Col:   7931,  Row:  6961
['B7.TIF']
Col:   7931,  Row:  6961
['B8.TIF']
Col:  15861,  Row: 13921
['BQA.TIF']
Col:   7931,  Row:  6961

So for LANDSAT 7: col,row ~ 8000,7000, with an exception of Band 8, with 16K,14K


Pseudo Code

Go to the base folder: extract every zip file, which is unextracted:

For each folder present here:

For each tiff file (for each band):

Identify the following:

  • Month, Year
  • District Name
  • Cloud Cover Percentage
  • Sat 7 or 8 (maybe from #files in the folder!

According to SAT, meaning of bands change ...(Put them in corresponding features ...)

Traverse every 100th pixel (for sat7 every Kth)


  • Month, Year, Spacecraft ID all from the File Name itself
  • Regarding the pixel location selection:
    • Either go for definite points at some gap and avg the non zero ones
    • OR Can select the points randomly and avg the non zero ones only.

Future work

  • Take into account the fact that each region in a district is not an agricultural land and that too for the same crop we are working on.
  • Consider the suggestion of changing the features or the way of learning the models because our case involves time based data, which is may be meant to dealt with differenty.
    • Given the data for the past say 5 years, predict the output for the 6th year.
  • Figure out how to handle all the states together, for all those years.
  • Try other tricks in ensemble learning.
  • Also try EDA for all the variables to draw some useful inference about the dataset.
  • Talk to Prof Suban. He predicted dwarfism using Satellite Images Dataset.
  • First of all define the the problem properly, essentailly carrying out a kind of literature survey.
    • Stanford, Microsoft India ...
    • What all we can do using the Satellite Data
  • State of the Art: Related Work
    • Look at the ways in which it can be improved ...
    • Important to include the factors for India
    • Also, see if new resources can be used or not (other than satellite data)
    • Locating the pixels pertaining to the crops ... IMP.

In [ ]: