GenFeatures 2


Generating Features from GeoTiff Files

From GeoTiff files covering India over a period of more than 20 years, we want to generate features for the problem of predicting district-wise crop yield in India.

Because of the gdal package, a separate conda environment had to be created, so install the packages for this notebook in that environment itself. From the Anaconda prompt, list the available environments and activate the right one:

$ conda info --envs 
$ activate env_name

In [1]:
from osgeo import ogr, osr, gdal

import fiona
from shapely.geometry import Point, shape

import numpy as np
import pandas as pd

import os
import sys
import tarfile
import timeit
  • For Windows
    base_ = r"C:\Users\deepak\Desktop\Repo\Maps\Districts\Census_2011"
    
  • For macOS
    base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
    

In [2]:
# Change this for Win7, macOS
bases = r"C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp"
# base_ = "/Users/macbook/Documents/BTP/Satellite/Data/Maps/Districts/Census_2011"
fc = fiona.open(bases)

In [3]:
def reverse_geocode(pt):
    for feature in fc:
        if shape(feature['geometry']).contains(pt):
            return feature['properties']['DISTRICT']
    return "NRI"
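The `shape(feature['geometry']).contains(pt)` call above is a point-in-polygon test. For intuition, here is a minimal pure-Python ray-casting sketch of the same idea; the square polygon is hypothetical, not district data:

```python
def point_in_polygon(x, y, poly):
    """Ray casting: count edge crossings of a horizontal ray from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, square))   # True (inside)
print(point_in_polygon(5.0, 2.0, square))   # False (outside)
```

Shapely does this (plus bounding-box pruning and edge cases) in C, which is why it is preferred over hand-rolled geometry here.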

In [4]:
# base = "/Users/macbook/Documents/BTP/Satellite/Data/Sat" # macOS
base = r"G:\BTP\Satellite\Data\Test"  # Win7

In [6]:
def extract(filename, force=False):
    root = os.path.splitext(os.path.splitext(filename)[0])[0]  # remove .tar.gz
    if os.path.isdir(os.path.join(base,root)) and not force:
        # You may override by setting force=True.
        print('%s already present - Skipping extraction of %s' % (root, filename))
    else:
        print('Extracting data for %s' % root)
        tar = tarfile.open(os.path.join(base,filename))
        sys.stdout.flush()
        tar.extractall(os.path.join(base,root))
        tar.close()
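A quick self-contained check of the extract-if-absent pattern used above, built on a throwaway archive in a temp directory (the file names here are made up for the demo):

```python
import os
import tarfile
import tempfile

base = tempfile.mkdtemp()

# build a small .tar.gz to stand in for a downloaded scene archive
src = os.path.join(base, "scene.txt")
with open(src, "w") as f:
    f.write("band data")
archive = os.path.join(base, "scene.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="scene.txt")

# extract into a folder named after the archive, skipping if it already exists
root = os.path.splitext(os.path.splitext("scene.tar.gz")[0])[0]  # strips .tar.gz
target = os.path.join(base, root)
if not os.path.isdir(target):
    with tarfile.open(archive) as tar:
        tar.extractall(target)

print(os.path.exists(os.path.join(target, "scene.txt")))  # True
```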

In [7]:
# extracting all the tar files ... (if not extracted)
for directory, subdirList, fileList in os.walk(base):
    for filename in fileList:
        if filename.endswith(".tar.gz"): 
            d = extract(filename)


LE07_L1TP_146039_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146039_20101223_20161211_01_T1.tar.gz
LE07_L1TP_146041_20101223_20161211_01_T1 already present - Skipping extraction of LE07_L1TP_146041_20101223_20161211_01_T1.tar.gz

In [8]:
directories = [os.path.join(base, d) for d in sorted(os.listdir(base)) if os.path.isdir(os.path.join(base, d))]
# print directories

In [9]:
ds = gdal.Open(os.path.join(base, "LE07_L1TP_146039_20101223_20161211_01_T1", "LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF"))

Prepare one ds variable here, to set up the coordinate-system transformation below.


In [25]:
# get the existing coordinate system
old_cs= osr.SpatialReference()
old_cs.ImportFromWkt(ds.GetProjectionRef())

# create the new coordinate system
wgs84_wkt = """
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0,
        AUTHORITY["EPSG","8901"]],
    UNIT["degree",0.01745329251994328,
        AUTHORITY["EPSG","9122"]],
    AUTHORITY["EPSG","4326"]]"""
new_cs = osr.SpatialReference()
new_cs.ImportFromWkt(wgs84_wkt)

# create a transform object to convert between coordinate systems
transform = osr.CoordinateTransformation(old_cs,new_cs)

In [26]:
type(ds)


Out[26]:
osgeo.gdal.Dataset

In [11]:
def pixel2coord(x, y, xoff, a, b, yoff, d, e):
    """Returns global coordinates from coordinates x,y of the pixel"""
    xp = a * x + b * y + xoff
    yp = d * x + e * y + yoff
    return(xp, yp)
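A quick sanity check of this affine mapping, with a hypothetical north-up geotransform (30 m pixels, origin at an arbitrary UTM corner; the numbers are made up, not from the scenes above). The function is repeated so the snippet is self-contained:

```python
def pixel2coord(x, y, xoff, a, b, yoff, d, e):
    """Returns global coordinates from coordinates x,y of the pixel."""
    xp = a * x + b * y + xoff
    yp = d * x + e * y + yoff
    return (xp, yp)

# hypothetical GDAL geotransform: (xoff, a, b, yoff, d, e)
# a/e are pixel width/height; b/d are rotation terms (0 for north-up scenes)
xoff, a, b, yoff, d, e = 300000.0, 30.0, 0.0, 2000000.0, 0.0, -30.0

print(pixel2coord(0, 0, xoff, a, b, yoff, d, e))      # (300000.0, 2000000.0)
print(pixel2coord(100, 50, xoff, a, b, yoff, d, e))   # (303000.0, 1998500.0)
```

Note the output is in the raster's projected coordinate system; the `transform.TransformPoint` step above is what converts it to WGS 84 lat/lon.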

In [12]:
ricep = pd.read_csv(r"C:\Users\deepak\Desktop\Repo\BTP\Ricep.csv")
ricep = ricep.drop(["Unnamed: 0"],axis=1)
ricep["value"] = ricep["Production"]/ricep["Area"]
ricep.head()


Out[12]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 X3 X4 value
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 643.720 881.473 2.765971
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 767.351 643.720 2.949004
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 579.338 767.351 2.907993
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 540.070 579.338 2.282013
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 819.700 564.500 2.797985

New features


12 months (Numbered 1 to 12)

10 TIF files (12 for SAT_8)

Mean & Variance
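The per-(month, band) mean and variance are maintained incrementally, one pixel at a time, using the standard one-pass update rule that appears inside the big loop further below. A minimal standalone sketch (names are illustrative):

```python
def update(mean, var, n, x):
    """Fold a new sample x into the running mean and sample variance.
    n is the sample count AFTER including x."""
    if n == 1:
        return float(x), 0.0
    new_mean = mean + (x - mean) / n
    new_var = ((n - 2.0) / (n - 1.0)) * var + (x - mean) ** 2 / n
    return new_mean, new_var

xs = [2, 4, 4, 4, 5, 5, 7, 9]
mean, var = 0.0, 0.0
for n, x in enumerate(xs, start=1):
    mean, var = update(mean, var, n, x)

print(mean)  # 5.0
print(var)   # 32/7 ~ 4.571..., the n-1 (sample) variance
```

This is why an extra counter column per feature (the `n` columns, `_Mn`/`_Vn`) has to be stored alongside each mean/variance pair.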


In [13]:
a = np.empty((ricep.shape[0],1))*np.NAN

The value of n is going to be the same for all the 10 band files of a month, and hence the same for all the 20 features at a time. The number 480 here can be reduced using this.


In [14]:
""" 'features' contains column indexes for the new features """
""" 'dictn' is the dictionary mapping the name of a column to its index number """
features = []
dictn = {}
k = 13
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"M")
        features.append(s+"V")
        dictn[s+"M"] = k
        dictn[s+"V"] = k+1
        k = k+2

In [15]:
for i in range(1,13):
    for j in range(1,11):
        s = str(i) + "_B" + str(j) + "_"
        features.append(s+"Mn")
        features.append(s+"Vn")

In [16]:
len(features)


Out[16]:
480

In [17]:
tmp = pd.DataFrame(index=range(ricep.shape[0]),columns=features)
ricex = pd.concat([ricep,tmp], axis=1)

In [18]:
ricex.head()


Out[18]:
State_Name ind_district Crop_Year Season Crop Area Production phosphorus X1 X2 ... 12_B6_Mn 12_B6_Vn 12_B7_Mn 12_B7_Vn 12_B8_Mn 12_B8_Vn 12_B9_Mn 12_B9_Vn 12_B10_Mn 12_B10_Vn
0 Andhra Pradesh anantapur 1999 kharif Rice 37991.0 105082.0 0.0 96800.0 75400.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Andhra Pradesh anantapur 2000 kharif Rice 39905.0 117680.0 0.0 105082.0 96800.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Andhra Pradesh anantapur 2001 kharif Rice 32878.0 95609.0 0.0 117680.0 105082.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Andhra Pradesh anantapur 2002 kharif Rice 29066.0 66329.0 0.0 95609.0 117680.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Andhra Pradesh anantapur 2005 kharif Rice 25008.0 69972.0 0.0 85051.0 44891.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 493 columns


In [19]:
k = 10
hits = 0
times = [0,0,0,0,0,0,0,0]
nums = [0,0,0,0,0,0,0,0]

bx = False

In [20]:
stx = timeit.default_timer()

for directory in directories:
    
    if bx: continue
    else: bx = True
    
    dictx = {}
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            
            # make sure not going into the extra folders
            
            if filename.endswith(".TIF"):
                
                if filename[-5] == '8': continue
                #--------------------------------------------------------------------------------------
                # Check for a single iteration. Which step takes the longest time.
                # improve that step
                
                # Do it all for B1.tif only.
                # for others save the row indexes found in B1's iteration
                # so dont have to search the dataframe again and again.
                # but keep track of the pixels which were not found, so have to skip them too
                # so that correct pixel value goes to the correct row in dataframe
                # have to traverse the tiff file to get the pixel values
                
                # what all steps are common for all the 10 tif files
                # the district search from the pixel's lat,lon coordinates
                # the row index search using district and the year
                # the same n is read and wrote for all the ...
                
                # If nothing works, maybe we can reduce the number of features, by just visiting 
                # First 5 TIF files for each scene.
                #--------------------------------------------------------------------------------------
                
                
                print os.path.join(directory,filename)
                
                ds = gdal.Open(os.path.join(directory,filename))
                if ds is None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """
                
                for i in range(0,col,col/k):
                    for j in range(0,row,row/k):
                        
                        st = timeit.default_timer()
                        ########### fetching the lat and lon coordinates 
                        x,y = pixel2coord(i, j, xoff, a, b, yoff, d, e)
                        lonx, latx, z = transform.TransformPoint(x,y)
                        times[0] += timeit.default_timer() - st
                        nums[0] += 1
                        
                        
                        st = timeit.default_timer()
                        ########### fetching the name of district
                        
                        district = ""
                        #----------------------------------------------------------
                        if filename[-5] == '1':
                            point = Point(lonx,latx)
                            district = reverse_geocode(point)
                            dictx[str(lonx)+str(latx)] = district
                        else:
                            district = dictx[str(lonx)+str(latx)]
                        #----------------------------------------------------------
                        times[1] += timeit.default_timer() - st
                        nums[1] += 1
                        
                        if district == "NRI": continue
                        
                        
                        st = timeit.default_timer()
                        ########### Locating the row in DataFrame which we want to update
                        district = district.lower()
                        district = district.strip()
                        r = ricex.index[(ricex['ind_district'] == district) & (ricex['Crop_Year'] == int(year))].tolist()
                        times[3] += timeit.default_timer() - st
                        nums[3] += 1
                        
                        
                        if len(r) == 1:
                            
                            st = timeit.default_timer()
                            ########### The pixel value for that location
                            px,py = i,j
                            pix = ds.ReadAsArray(px,py,1,1)
                            pix = pix[0][0]
                            times[2] += timeit.default_timer() - st
                            nums[2] += 1
                            
                            
                            st = timeit.default_timer()
                            """ Found the row, so now .."""
                            """ Find Collumn index corresponding to Month, Band """
                            hits = hits + 1
                            #print ("Hits: ", hits)
                            ####### Band Number ########
                            band = filename.split("\\")[-1].split("_")[7:][0].split(".")[0][1]
                            bnd = band
                            if band == '6':
                                if filename.split("\\")[-1].split("_")[7:][2][0] == '1':
                                    bnd = band
                                else:
                                    bnd = '9'
                            elif band == 'Q':
                                bnd = '10'
                            
                            sm = month + "_B" + bnd +"_M"
                            
                            cm = dictn[sm]
                            
                            r = r[0]
                            # cm is the column index for the mean
                            # r is the row index
                            times[4] += timeit.default_timer() - st
                            nums[4] += 1
                            
                            
                            ##### Checking if values are null ...
                            valm = ricex.iloc[r,cm]
                            if pd.isnull(valm): 
                                
                                st = timeit.default_timer()
                                ricex.iloc[r,cm] = pix
                                ricex.iloc[r,cm+1] = pix*pix
                                ricex.iloc[r,cm+240] = 1
                                times[5] += timeit.default_timer() - st
                                nums[5] += 1
                        
                                continue
                                
                                
                            st = timeit.default_timer()
                            ##### if the values are not null ...
                            valv = ricex.iloc[r,cm+1]
                            n = ricex.iloc[r,cm+240]
                            n = n+1
                            times[6] += timeit.default_timer() - st
                            nums[6] += 1
                        
                            
                            
                            st = timeit.default_timer()
                            # Mean & Variance update
                            ricex.iloc[r,cm] = valm + (pix-valm)/n
                            ricex.iloc[r,cm+1] = ((n-2)/(n-1))*valv + (pix-valm)*(pix-valm)/n
                            ricex.iloc[r,cm+240] = n
                            
                            times[7] += timeit.default_timer() - st
                            nums[7] += 1
                        
                            
                        
                            #print ("No match for the district " + district + " for the year " + year)
                            
                            
elapsed = timeit.default_timer() - stx
print (elapsed)
print "Seconds"


G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B1.TIF
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:161: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:140: RuntimeWarning: overflow encountered in ubyte_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:160: RuntimeWarning: overflow encountered in ubyte_scalars
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B3.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B4.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B5.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_1.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B6_VCID_2.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_B7.TIF
G:\BTP\Satellite\Data\Test\LE07_L1TP_146039_20101223_20161211_01_T1\LE07_L1TP_146039_20101223_20161211_01_T1_BQA.TIF
280.835855605
Seconds
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:161: RuntimeWarning: overflow encountered in ushort_scalars
C:\Users\deepak\Anaconda3\envs\ipykernel_py2\lib\site-packages\ipykernel_launcher.py:140: RuntimeWarning: overflow encountered in ushort_scalars

Overflow is happening because the pixel value is returned as a small unsigned integer type (uint8/uint16). It needs to be cast to a Python int before arithmetic.
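A small demonstration of the issue, assuming the pixel comes back as a numpy uint8 scalar (as it does for the 8-bit Landsat 7 bands): squaring wraps around in the narrow type, while casting to a Python int first gives the correct value.

```python
import numpy as np

pix = np.uint8(200)

# arithmetic stays in uint8 and wraps around modulo 256
wrapped = pix * pix          # 200*200 = 40000, but 40000 % 256 = 64
print(wrapped)               # 64

# cast to a Python int before arithmetic to avoid the overflow
pix = int(pix)
print(pix * pix)             # 40000
```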


In [21]:
print hits


360

Calculating the time per iteration of each code block in the huge loop above helped identify the culprit.


In [22]:
print times
print nums


[0.014337681040335185, 278.56343599866284, 0.018627583254261726, 1.5887563573121126, 0.002408005960389481, 0.17163055146581163, 0.04105598833821, 0.3787758911972565]
[1089, 1089, 360, 1062, 360, 108, 252, 252]

In [23]:
for i in range(8):
    x = times[i]/nums[i]
    print (str(i) + ": " + str(x))


0: 1.31659146376e-05
1: 0.25579746189
2: 5.17432868174e-05
3: 0.00149600410293
4: 6.68890544553e-06
5: 0.00158917177283
6: 0.000162920588644
7: 0.00150307893332

Observations:

  • [1] & [2] are the most time consuming
  • [1] is reverse geocoding
  • [2] is pixel value extraction
  • move [2] inside the if condition to check if the row exists
    • only then we need the pixel value
  • deal with [1] using dictionary

In [23]:
ricex.describe()


Out[23]:
Crop_Year Area Production phosphorus X1 X2 X3 X4 value
count 1931.000000 1931.000000 1.931000e+03 1931.000000 1.931000e+03 1.931000e+03 1931.000000 1931.000000 1931.000000
mean 2004.201968 69936.771103 1.556052e+05 0.557224 1.588625e+05 1.567246e+05 861.667570 871.507723 1.950687
std 3.631699 81619.498004 2.151250e+05 0.665363 2.176855e+05 2.170873e+05 476.198957 458.603987 1.102191
min 1999.000000 1.000000 0.000000e+00 0.000000 1.000000e+00 1.000000e+00 76.944000 108.800000 0.000000
25% 2001.000000 6392.500000 6.326000e+03 0.000000 7.545500e+03 6.987500e+03 592.700000 607.830000 1.021009
50% 2005.000000 41040.000000 7.190000e+04 0.000000 7.574800e+04 7.204800e+04 765.714000 773.600000 1.882997
75% 2007.000000 111052.000000 2.333050e+05 1.000000 2.372640e+05 2.285185e+05 1023.702000 1045.679500 2.645009
max 2010.000000 545965.000000 1.637000e+06 2.000000 1.710000e+06 1.710000e+06 4755.700000 4076.200000 9.886125

In [22]:
ricex.to_csv("ricex_test1.csv")

In [23]:
fc


Out[23]:
<open Collection 'C:\Users\deepak\Desktop\Repo\Maps\Districts\Census\Dist.shp:Dist', mode 'r' at 0x41ebf28L>

In [6]:
fc.schema


Out[6]:
{'geometry': 'Polygon',
 'properties': OrderedDict([(u'DISTRICT', 'str:28'),
              (u'ST_NM', 'str:24'),
              (u'ST_CEN_CD', 'int:9'),
              (u'DT_CEN_CD', 'int:9'),
              (u'censuscode', 'float:14')])}

In [7]:
fc.crs


Out[7]:
{'init': u'epsg:4326'}

In [8]:
len(fc)


Out[8]:
641

641 Districts in India in total


In [24]:
import timeit

In [27]:
a = timeit.default_timer()
for i in range(1000000):
    j = 1
b = timeit.default_timer() - a

In [28]:
b


Out[28]:
0.119725337737691

DOUBT: Why is a LANDSAT 7 file smaller than a LANDSAT 8 file, despite the fact that the number of pixels is equal in the band files for both cases? Investigate this ...

Calculate the relative dimensions of all 10 band files for a scene. Whatever the dimension, it is the same for all 10 files, except Band 8, which has 4 times as many pixels as any other file.


In [10]:
for directory in directories:
    
    """ Identifying Month, Year, Spacecraft ID """
    date = directory.split('\\')[-1].split('_')[3] # Change for Win7
    satx = directory.split('\\')[-1][3]
    month = date[4:6]
    year = date[0:4]
    
    print "LANDSAT {},  MONTH: {}, YEAR: {}".format(satx,month,year)
    
    """ Visiting every GeoTIFF file """ 
    for _,_,files in os.walk(directory):
        for filename in files:
            if filename.endswith(".TIF"):
                print filename.split("\\")[-1].split("_")[7:]
                ds = gdal.Open(os.path.join(directory,filename))
                if ds is None: continue
                col, row, _ = ds.RasterXSize, ds.RasterYSize, ds.RasterCount
                xoff, a, b, yoff, d, e = ds.GetGeoTransform()
                print "Col: {0:6},  Row:{1:6}".format(col,row)
                
                """ Now go to each pixel, find its lat,lon. Hence its district, and the pixel value """
                """ Find the row with same (Year,District), in Crop Dataset. """
                """ Find the feature using Month, Band, SATx """
                """ For this have to find Mean & Variance """


LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   8161,  Row:  7221
['B2.TIF']
Col:   8161,  Row:  7221
['B3.TIF']
Col:   8161,  Row:  7221
['B4.TIF']
Col:   8161,  Row:  7221
['B5.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '1.TIF']
Col:   8161,  Row:  7221
['B6', 'VCID', '2.TIF']
Col:   8161,  Row:  7221
['B7.TIF']
Col:   8161,  Row:  7221
['B8.TIF']
Col:  16321,  Row: 14441
['BQA.TIF']
Col:   8161,  Row:  7221
LANDSAT 7,  MONTH: 12, YEAR: 2010
['B1.TIF']
Col:   7931,  Row:  6961
['B2.TIF']
Col:   7931,  Row:  6961
['B3.TIF']
Col:   7931,  Row:  6961
['B4.TIF']
Col:   7931,  Row:  6961
['B5.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '1.TIF']
Col:   7931,  Row:  6961
['B6', 'VCID', '2.TIF']
Col:   7931,  Row:  6961
['B7.TIF']
Col:   7931,  Row:  6961
['B8.TIF']
Col:  15861,  Row: 13921
['BQA.TIF']
Col:   7931,  Row:  6961

So for LANDSAT 7: col,row ~ 8000,7000, with the exception of Band 8, at ~16K,14K


Pseudo Code

Go to the base folder: extract every archive that is not yet extracted:

For each folder present here:

For each tiff file (for each band):

Identify the following:

  • Month, Year
  • District Name
  • Cloud Cover Percentage
  • Sat 7 or 8 (maybe from the number of files in the folder!)

The meaning of the bands changes with the satellite ... (put them in the corresponding features ...)

Traverse every 100th pixel (for sat7 every Kth)


  • Month, Year, Spacecraft ID all from the File Name itself
  • Regarding the pixel location selection:
    • Either go for definite points at some gap and avg the non zero ones
    • OR Can select the points randomly and avg the non zero ones only.
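Both pixel-selection strategies from the bullets above can be sketched with numpy; the array below is a synthetic stand-in for a band, not real scene data:

```python
import numpy as np

rng = np.random.default_rng(0)
band = rng.integers(0, 255, size=(700, 800))
band[:100, :] = 0  # pretend a fill-value border, as in real scenes

# Strategy 1: definite points at a fixed stride, then average the nonzero ones
k = 10
strided = band[::band.shape[0] // k, ::band.shape[1] // k]  # k x k grid of samples
nonzero = strided[strided != 0]
print(nonzero.mean())

# Strategy 2: random pixel locations, then average the nonzero ones only
rows = rng.integers(0, band.shape[0], size=100)
cols = rng.integers(0, band.shape[1], size=100)
sampled = band[rows, cols]
print(sampled[sampled != 0].mean())
```

The strided version is deterministic and matches the `range(0, col, col/k)` loop in the feature-extraction cell; the random version trades reproducibility for coverage that is not biased toward a grid.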

Future work

  • Take into account the fact that not every region in a district is agricultural land, let alone land used for the specific crop we are working on.
  • Consider the suggestion of changing the features or the way of learning the models, because our case involves time-based data, which may need to be dealt with differently.
    • Given the data for, say, the past 5 years, predict the output for the 6th year.
  • Figure out how to handle all the states together, for all those years.
  • Try other tricks in ensemble learning.
  • Also try EDA for all the variables to draw some useful inference about the dataset.
  • Talk to Prof Suban. He predicted dwarfism using a satellite-image dataset.
  • First of all, define the problem properly, essentially carrying out a kind of literature survey.
    • Stanford, Microsoft India ...
    • What all we can do using the Satellite Data
  • State of the Art: Related Work
    • Look at the ways in which it can be improved ...
    • Important to include the factors for India
    • Also, see if new resources can be used or not (other than satellite data)
    • Locating the pixels pertaining to the crops ... IMP.

In [ ]: