Generating Features from GeoTiff Files

From GeoTiff Files available for India over a period of more than 20 years, we want to generate features from those files for the problem of prediction of district wise crop yield in India.

Due to gdal package, had to make a separate environment using conda. So install packages for this notebook in that environment itself. Check from the anaconda prompt, the names of all the envs are available:

$ conda info --envs 
$ activate env_name

In [1]:
from osgeo import ogr, osr, gdal

import fiona
from shapely.geometry import Point, shape

import numpy as np
import pandas as pd

import os
import sys
import tarfile
import timeit
  • For Windows
    base_ = "C:\Users\deepak\Desktop\Repo\Maps\Districts\Census_2011"
    
  • For macOS
    base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
    

In [2]:
# Change this for Win7,macOS
bases = "C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp"
# base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
fc = fiona.open(bases)

In [3]:
def reverse_geocode(pt):
    for feature in fc:
        if shape(feature['geometry']).contains(pt):
            return feature['properties']['DISTRICT']
    return "NRI"

In [4]:
# base = "/Users/macbook/Documents/BTP/Satellite/Data/Sat" # macOS
base = "G:\BTP\Satellite\Data\Test"  # Win7

In [5]:
# b = True
# for directory, subdirList, fileList in os.walk(base):
# #     if b:
# #         b = False
# #         continue
#     #print ("Directory: " + directory)
#     for filename in fileList:
#         if filename[0] != '.': print ("\t" + filename)

In [6]:
def extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    if os.path.isdir(os.path.join(base,root)) and not force:
        # You may override by setting force=True.
        print('%s already present - Skipping extraction of %s' % (root, filename))
    else:
        print('Extracting data for %s' % root)
        tar = tarfile.open(os.path.join(base,filename))
        sys.stdout.flush()
        tar.extractall(os.path.join(base,root))
        tar.close()

In [7]:
# extracting all the tar files ... (if not extracted)
for directory, subdirList, fileList in os.walk(base):
    for filename in fileList:
        if filename.endswith(".tar.gz"): 
            d = extract(filename)


LE07_L1TP_146039_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146039_20101223_20161211_01_T1.tar.gz
LE07_L1TP_146041_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146041_20101223_20161211_01_T1.tar.gz

In [8]:
directories = [os.path.join(base, d) for d in sorted(os.listdir(base)) if os.path.isdir(os.path.join(base, d))]
# print directories

In [9]:
ds = gdal.Open(base + "\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF")

Prepare one ds variable here itself, for the transformation of the coordinate system below.


In [10]:
# get the existing coordinate system
old_cs= osr.SpatialReference()
old_cs.ImportFromWkt(ds.GetProjectionRef())

# create the new coordinate system
wgs84_wkt = """
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.01745329251994328,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]"""
new_cs = osr.SpatialReference()
new_cs.ImportFromWkt(wgs84_wkt)

# create a transform object to convert between coordinate systems
transform = osr.CoordinateTransformation(old_cs,new_cs)

In [11]:
def pixel2coord(x, y, xoff, a, b, yoff, d, e):
    """Returns global coordinates from coordinates x,y of the pixel"""
    xp = a * x + b * y + xoff
    yp = d * x + e * y + yoff
    return(xp, yp)

In [12]:
ricep = pd.read_csv("C:\Users\deepak\Desktop\Repo\BTP\Ricep.csv")
ricep = ricep.drop(["Unnamed: 0"],axis=1)
ricep["value"] = ricep["Production"]/ricep["Area"]
ricep.head()


Out[12]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 X3 X4 value
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 643.720 881.473 2.765971
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 767.351 643.720 2.949004
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 579.338 767.351 2.907993
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 540.070 579.338 2.282013
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 819.700 564.500 2.797985

New features


12 months (Numbered 1 to 12)

10 TIF files (12 for SAT_8)

Mean & Variance


In [13]:
a = np.empty((ricep.shape[0],1))*np.NAN

In [14]:
""" 'features' contain collumn indexes for the new features """
""" 'dictn' is the dictionary mapping name of collumn index to the index number """
features = []
dictn = {}
k = 13
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"M")
        features.append(s+"V")
        dictn[s+"M"] = k
        dictn[s+"V"] = k+1
        k = k+2

In [15]:
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"Mn")
        features.append(s+"Vn")

In [16]:
len(features)


Out[16]:
480

In [17]:
tmp = pd.DataFrame(index=range(ricep.shape[0]),columns=features)
ricex = pd.concat([ricep,tmp], axis=1)

In [18]:
ricex.head()


Out[18]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 ... 12_B6_Mn 12_B6_Vn 12_B7_Mn 12_B7_Vn 12_B8_Mn 12_B8_Vn 12_B9_Mn 12_B9_Vn 12_B10_Mn 12_B10_Vn
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 493 columns


In [21]:
k = 10
hits = 0

In [22]:
start_time = timeit.default_timer()

for directory in directories:
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            
            # make sure not going into the extra folders
            
            if filename.endswith(".TIF"):
                
                # Check for a single iteration. Which step takes the longest time.
                # improve that step
                
                # Do it all for B1.tif only.
                # for others save the row indexes found in B1's iteration
                # so dont have to search the dataframe again and again.
                # but keep track of the pixels which were not found, so have to skip them too
                # so that correct pixel value goes to the correct row in dataframe
                # have to traverse the tiff file to get the pixel values
                
                # what all steps are common for all the 10 tif files
                # the district search from the pixel's lat,lon coordinates
                # the row index search using district and the year
                # the same n is read and wrote for all the ...
                
                # If nothing works, maybe we can reduce the number of features, by just visiting 
                # First 5 TIF files for each scene.
                
                print os.path.join(directory,filename)
                
                ds = gdal.Open(os.path.join(directory,filename))
                if ds == None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """
                
                for i in range(0,col,col/k):
                    for j in range(0,row,row/k):
                        
                        ########### fetching the lat and lon coordinates 
                        x,y = pixel2coord(i, j, xoff, a, b, yoff, d, e)
                        lonx, latx, z = transform.TransformPoint(x,y)
                        
                        ########### fetching the name of district
                        point = Point(lonx,latx)
                        district = reverse_geocode(point)
                        if district == "NRI": continue
                        
                        ########### The pixel value for that location
                        px,py = i,j
                        pix = ds.ReadAsArray(px,py,1,1)
                        pix = pix[0][0]
                        
                        ########### Locating the row in DataFrame which we want to update
                        district = district.lower()
                        district = district.strip()
                        r = ricex.index[(ricex['ind_district'] == district) & (ricex['Crop_Year'] == int(year))].tolist()
                        
                        if len(r) == 1:
                            """ Found the row, so now .."""
                            """ Find Collumn index corresponding to Month, Band """
                            hits = hits + 1
                            #print ("Hits: ", hits)
                            ####### Band Number ########
                            band = filename.split("\\")[-1].split("_")[7:][0].split(".")[0][1]
                            bnd = band
                            if band == '6':
                                if filename.split("\\")[-1].split("_")[7:][2][0] == '1':
                                    bnd = band
                                else:
                                    bnd = '9'
                            elif band == 'Q':
                                bnd = '10'
                            
                            sm = month + "_B" + bnd +"_M"
                            
                            cm = dictn[sm]
                            
                            r = r[0]
                            # cm is the collumn indexe for mean
                            # r[0] is the row index
                            
                            ##### Checking if values are null ...
                            valm = ricex.iloc[r,cm]
                            if pd.isnull(valm): 
                                ricex.iloc[r,cm] = pix
                                ricex.iloc[r,cm+1] = pix*pix
                                ricex.iloc[r,cm+240] = 1
                                continue
                                
                            ##### if the values are not null ...
                            valv = ricex.iloc[r,cm+1]
                            n = ricex.iloc[r,cm+240]
                            n = n+1
                            
                            # Mean & Variance update
                            ricex.iloc[r,cm] = valm + (pix-valm)/n
                            ricex.iloc[r,cm+1] = ((n-2)/(n-1))*valv + (pix-valm)*(pix-valm)/n
                            ricex.iloc[r,cm+240] = n
                            
                        
                            #print ("No match for the district " + district + " for the year " + year)
                            
                            
elapsed = timeit.default_timer() - start_time
print (elapsed)
print "Seconds"


G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:89: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:78: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:88: RuntimeWarning: overflow encountered in ubyte_scalars
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B3.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B4.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B5.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_1.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B7.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B8.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_BQA.TIF
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:89: RuntimeWarning: overflow encountered in ushort_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:78: RuntimeWarning: overflow encountered in ushort_scalars
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B1.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B3.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B4.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B5.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B6_VCID_1.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B6_VCID_2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B7.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_B8.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146041_20101223_20161211_01_T1\LE07_L1TP_146041_20101223_20161211_01_T1_BQA.TIF
3223.90743439
Seconds

In [23]:
ricex.describe()


Out[23]:
Crop_Year Area Production phosphorus X1 X2 X3 X4 value
count 1931.000000 1931.000000 1.931000e+03 1931.000000 1.931000e+03 1.931000e+03 1931.000000 1931.000000 1931.000000
mean 2004.201968 69936.771103 1.556052e+05 0.557224 1.588625e+05 1.567246e+05 861.667570 871.507723 1.950687
std 3.631699 81619.498004 2.151250e+05 0.665363 2.176855e+05 2.170873e+05 476.198957 458.603987 1.102191
min 1999.000000 1.000000 0.000000e+00 0.000000 1.000000e+00 1.000000e+00 76.944000 108.800000 0.000000
25% 2001.000000 6392.500000 6.326000e+03 0.000000 7.545500e+03 6.987500e+03 592.700000 607.830000 1.021009
50% 2005.000000 41040.000000 7.190000e+04 0.000000 7.574800e+04 7.204800e+04 765.714000 773.600000 1.882997
75% 2007.000000 111052.000000 2.333050e+05 1.000000 2.372640e+05 2.285185e+05 1023.702000 1045.679500 2.645009
max 2010.000000 545965.000000 1.637000e+06 2.000000 1.710000e+06 1.710000e+06 4755.700000 4076.200000 9.886125

In [22]:
ricex.to_csv("ricex_test1.csv")

In [23]:
fc


Out[23]:
<open Collection 'C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp:Dist', mode 'r' at 0x41ebf28L>

In [6]:
fc.schema


Out[6]:
{'geometry': 'Polygon',
 'properties': OrderedDict([(u'DISTRICT', 'str:28'),
              (u'ST_NM', 'str:24'),
              (u'ST_CEN_CD', 'int:9'),
              (u'DT_CEN_CD', 'int:9'),
              (u'censuscode', 'float:14')])}

In [7]:
fc.crs


Out[7]:
{'init': u'epsg:4326'}

In [8]:
len(fc)


Out[8]:
641

In [ ]:


In [24]:
import timeit

In [27]:
a = timeit.default_timer()
for i in range(1000000):
    j = 1
b = timeit.default_timer() - a

In [28]:
b


Out[28]:
0.119725337737691

In [ ]:


In [ ]:


In [10]:
for directory in directories:
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    print "LANDSAT {},  MONTH: {}, YEAR: {}".format(satx,month,year)
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            if filename.endswith(".TIF"):
                print filename.split("\\")[-1].split("_")[7:]
                ds = gdal.Open(os.path.join(directory,filename))
                if ds == None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                print "Col: {0:6},  Row:{1:6}".format(col,row)
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """


LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   8161,  Row:  7221
['B2.TIF']
Col:   8161,  Row:  7221
['B3.TIF']
Col:   8161,  Row:  7221
['B4.TIF']
Col:   8161,  Row:  7221
['B5.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '1.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '2.TIF']
Col:   8161,  Row:  7221
['B7.TIF']
Col:   8161,  Row:  7221
['B8.TIF']
Col:  16321,  Row: 14441
['BQA.TIF']
Col:   8161,  Row:  7221
LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   7931,  Row:  6961
['B2.TIF']
Col:   7931,  Row:  6961
['B3.TIF']
Col:   7931,  Row:  6961
['B4.TIF']
Col:   7931,  Row:  6961
['B5.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '1.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '2.TIF']
Col:   7931,  Row:  6961
['B7.TIF']
Col:   7931,  Row:  6961
['B8.TIF']
Col:  15861,  Row: 13921
['BQA.TIF']
Col:   7931,  Row:  6961

So for LANDSAT 7: col,row ~ 8000,7000, with an exception of Band 8, with 16K,14K


Pseudo Code

Go to the base folder: extract every zip file, which is unextracted:

For each folder present here:

For each tiff file (for each band):

Identify the following:

  • Month, Year
  • District Name
  • Cloud Cover Percentage
  • Sat 7 or 8 (maybe from #files in the folder!

According to SAT, meaning of bands change ...(Put them in corresponding features ...)

Traverse every 100th pixel (for sat7 every Kth)


  • Month, Year, Spacecraft ID all from the File Name itself
  • Regarding the pixel location selection:
    • Either go for definite points at some gap and avg the non zero ones
    • OR Can select the points randomly and avg the non zero ones only.

In [ ]: