9-Day Subsampling of the OceanColor Dataset


In [2]:
import xarray as xr
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
from dask.diagnostics import ProgressBar
import seaborn as sns
from matplotlib.colors import LogNorm


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Load data from disk

We already downloaded a subsetted MODIS-Aqua chlorophyll-a dataset for the Arabian Sea.

We can read all the netCDF files into one xarray Dataset using the open_mfdataset function. Note that this does not load the data into memory yet; that only happens when we access the values.
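A minimal sketch of what the multi-file open amounts to, using two toy in-memory Datasets in place of the real netCDF files (the names and values here are illustrative, not the actual data):

```python
import numpy as np
import xarray as xr

# Two toy "files" worth of chlorophyll data; open_mfdataset opens each file
# lazily and concatenates them along their shared dimension.
ds_a = xr.Dataset({"chlor_a": ("time", np.array([0.1, 0.2]))},
                  coords={"time": [0, 1]})
ds_b = xr.Dataset({"chlor_a": ("time", np.array([0.3]))},
                  coords={"time": [2]})

# xr.concat performs the same combine step explicitly.
combined = xr.concat([ds_a, ds_b], dim="time")
```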


In [3]:
ds_8day = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_8D.nc')
ds_daily = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_D.nc')
both_datasets = [ds_8day, ds_daily]

How much data is contained here? Let's get the answer in MB.


In [4]:
print([(ds.nbytes / 1e6) for ds in both_datasets])


[534.295504, 4241.4716]

The 8-day dataset is ~534 MB, while the daily dataset is ~4.2 GB. Both easily fit in RAM, so let's load them into memory.


In [5]:
[ds.load() for ds in both_datasets]


Out[5]:
[<xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 667)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...,
 <xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 5295)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...]

Fix bad data

In preparing this demo, I noticed that a small number of maps had bad data--specifically, they contained large negative chlorophyll concentrations. Looking closer, I realized that the land/cloud mask had been inverted, so I wrote a function to invert it back and correct the data.


In [6]:
def fix_bad_data(ds):
    # for some reason, the cloud/land mask is inverted on some dates;
    # this is obvious because there are chlorophyll values less than zero
    bad_data = ds.chlor_a.groupby('time').min() < 0
    # loop through the bad snapshots and mask out the negative values
    for n in np.nonzero(bad_data.values)[0]:
        data = ds.chlor_a[n].values
        ds.chlor_a.values[n] = np.ma.masked_less(data, 0).filled(np.nan)
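The core of `fix_bad_data` is the masked-array round trip; a minimal sketch on synthetic values:

```python
import numpy as np

# A toy snapshot where the inverted mask produced negative "chlorophyll".
data = np.array([0.5, -1.0, 0.3, -2.5])

# Mask everything below zero, then fill the masked entries with NaN,
# the same np.ma round trip used in fix_bad_data.
fixed = np.ma.masked_less(data, 0).filled(np.nan)
```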

In [7]:
[fix_bad_data(ds) for ds in both_datasets]


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in less
  if not reflexive
Out[7]:
[None, None]

In [8]:
ds_8day.chlor_a>0


Out[8]:
<xarray.DataArray 'chlor_a' (time: 667, lat: 276, lon: 360)>
array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ...,  True, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False,  True],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True,  True]],

       ..., 
       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]]], dtype=bool)
Coordinates:
  * lat      (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 27.37 ...
  * lon      (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 45.63 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...

Count the number of ocean data points

First we have to figure out the land mask. Unfortunately, it doesn't come with the dataset, but we can infer it by counting the points that have at least one positive chlorophyll value.
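The mask inference can be sketched with plain NumPy on a toy (time, lat, lon) stack:

```python
import numpy as np

# Two toy snapshots on a 2x2 grid: NaN = no retrieval, positive = valid.
chl = np.array([
    [[np.nan, 0.2], [np.nan, np.nan]],
    [[0.4, np.nan], [np.nan, 0.1]],
])

# A grid cell counts as ocean if it has at least one positive value over time.
ocean_mask = (chl > 0).sum(axis=0) > 0
num_ocean_points = ocean_mask.sum()
```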


In [9]:
(ds_8day.chlor_a>0).sum(dim='time').plot()


Out[9]:
<matplotlib.collections.QuadMesh at 0x1195a44a8>

In [10]:
# infer the ocean mask (complement of the land mask)
ocean_mask = (ds_8day.chlor_a>0).sum(dim='time')>0
#ocean_mask = (ds_daily.chlor_a>0).sum(dim='time')>0
num_ocean_points = ocean_mask.sum().values  # total number of ocean grid points
ocean_mask.plot()
plt.title('%g total ocean points' % num_ocean_points)


Out[10]:
<matplotlib.text.Text at 0x13d6044e0>

In [11]:
#ds_8day

In [12]:
#ds_daily

In [13]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time='2002-11-18',method='nearest').plot(norm=LogNorm())
#ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[13]:
<matplotlib.collections.QuadMesh at 0x11a099d30>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [14]:
#list(ds_daily.groupby('time')) # take a look at what's inside

Now we count up the number of valid points in each snapshot and divide by the total number of ocean points.
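The same count-and-normalise step, sketched with pandas on a toy long-format table (the values and the toy ocean total are illustrative):

```python
import numpy as np
import pandas as pd

# Toy per-pixel values: three pixels per snapshot, NaN where there was cloud.
df = pd.DataFrame({
    "time": ["2002-07-04"] * 3 + ["2002-07-05"] * 3,
    "chlor_a": [0.3, np.nan, 0.5, np.nan, np.nan, 0.2],
})
num_ocean_points = 3  # hypothetical ocean total for this toy grid

# count() skips NaN, so this is the valid-data fraction per snapshot.
coverage = df.groupby("time")["chlor_a"].count() / float(num_ocean_points)
```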


In [15]:
ds_daily.groupby('time').count() # number of valid points per snapshot


Out[15]:
<xarray.Dataset>
Dimensions:  (time: 5295)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
Data variables:
    palette  (time) int64 768 768 768 768 768 768 768 768 768 768 768 768 ...
    chlor_a  (time) int64 658 1170 1532 2798 2632 1100 1321 636 2711 1163 ...

In [ ]:


In [16]:
ds_daily.chlor_a.groupby('time').count()/float(num_ocean_points)


Out[16]:
<xarray.DataArray 'chlor_a' (time: 5295)>
array([ 0.01053255,  0.01872809,  0.02452259, ...,  0.        ,
        0.        ,  0.        ])
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...

In [17]:
count_8day,count_daily = [ds.chlor_a.groupby('time').count()/float(num_ocean_points)
                            for ds in (ds_8day,ds_daily)]

In [18]:
#count_8day = ds_8day.chl_ocx.groupby('time').count()/float(num_ocean_points)
#count_daily = ds_daily.chl_ocx.groupby('time').count()/float(num_ocean_points)

#count_8day, count_daily = [ds.chl_ocx.groupby('time').count()/float(num_ocean_points)
#                           for ds in (ds_8day, ds_daily)] # an unparenthesized tuple in a comprehension is a syntax error in Python 3

In [19]:
plt.figure(figsize=(12,4))
count_8day.plot(color='k')
count_daily.plot(color='r')

plt.legend(['8 day','daily'])


Out[19]:
<matplotlib.legend.Legend at 0x11a3bc080>

Seasonal Climatology


In [20]:
count_8day_clim, count_daily_clim = [count.groupby('time.month').mean()  # monthly climatology
                                     for count in (count_8day, count_daily)]

In [21]:
# mean of the monthly climatology of the valid-data fraction
plt.figure(figsize=(12,4))
count_8day_clim.plot(color='k')
count_daily_clim.plot(color='r')
plt.legend(['8 day', 'daily'])


Out[21]:
<matplotlib.legend.Legend at 0x129c12ba8>

From the above figure, we see that data coverage is highest in winter (especially February) and lowest in summer.
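The `groupby('time.month')` climatology has a direct pandas analogue; a sketch with made-up coverage values:

```python
import pandas as pd

# Coverage fractions for two Januaries and two Februaries (made-up numbers).
idx = pd.to_datetime(["2003-01-10", "2004-01-20", "2003-02-10", "2004-02-20"])
coverage = pd.Series([0.2, 0.4, 0.6, 0.8], index=idx)

# Average over all years for each calendar month.
clim = coverage.groupby(coverage.index.month).mean()
```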

Maps of individual days

Let's grab some data from February and plot it.


In [22]:
target_date = '2003-02-15'
plt.figure(figsize=(8,6))
ds_8day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[22]:
<matplotlib.collections.QuadMesh at 0x12977a278>

In [23]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[23]:
<matplotlib.collections.QuadMesh at 0x12af85518>

In [24]:
ds_daily.chlor_a[0].sel_points(lon=[65, 70], lat=[16, 18], method='nearest')   # note: [0] selects the first time
#ds_daily.chl_ocx[0].sel_points(time=times, lon=lons, lat=lats, method='nearest')


Out[24]:
<xarray.DataArray 'chlor_a' (points: 2)>
array([ nan,  nan])
Coordinates:
    time     datetime64[ns] 2002-07-04
    lat      (points) float64 16.04 18.04
    lon      (points) float64 65.04 70.04
  * points   (points) int64 0 1

In [25]:
#ds_daily.chlor_a.sel_points?

In [26]:
ds_9day = ds_daily.resample('9D', dim='time')
ds_9day


Out[26]:
<xarray.Dataset>
Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 589)
Coordinates:
  * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
  * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
  * rgb            (rgb) int64 0 1 2
  * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
  * time           (time) datetime64[ns] 2002-07-04 2002-07-13 2002-07-22 ...
Data variables:
    palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
    chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
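`resample('9D', dim='time')` averages each nine-day window (mean is the default reduction in this old xarray API); the pandas equivalent on a toy daily series:

```python
import numpy as np
import pandas as pd

# 18 daily values starting at the first MODIS date used above.
s = pd.Series(np.arange(18.0),
              index=pd.date_range("2002-07-04", periods=18, freq="D"))

# Two 9-day bins, each reduced by the mean.
s9 = s.resample("9D").mean()
```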

In [27]:
plt.figure(figsize=(8,6))
ds_9day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[27]:
<matplotlib.collections.QuadMesh at 0x129d992b0>

In [28]:
# check the minimum longitude and latitude
print(ds_9day.lon.min(), '\n', ds_9day.lat.min())


<xarray.DataArray 'lon' ()>
array(45.04166793823242) 
 <xarray.DataArray 'lat' ()>
array(5.041661739349365)

++++++++++++++++++++++++++++++++++++++++++++++

All GDP Floats

Load the float data

Map each (time, lon, lat) float observation to a chlorophyll value
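The mapping idea is a nearest-neighbour lookup on the regular satellite grid; a sketch with hypothetical grid values:

```python
import numpy as np

# Hypothetical regular lon/lat grids and one drifter position.
lons = np.array([65.04, 65.13, 65.21])
lats = np.array([16.04, 16.12, 16.21])
lon0, lat0 = 65.10, 16.20

# Nearest grid index along each axis, which is what method='nearest' does.
i = np.abs(lons - lon0).argmin()
j = np.abs(lats - lat0).argmin()
```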


In [29]:
# in the following we deal with the GDP drifter data
from buyodata import buoydata
import os

In [30]:
# a list of files
fnamesAll = ['./gdp_float/buoydata_1_5000.dat','./gdp_float/buoydata_5001_10000.dat','./gdp_float/buoydata_10001_15000.dat','./gdp_float/buoydata_15001_jun16.dat']

In [31]:
# read the files and concatenate them into one DataFrame
dfAll = pd.concat([buoydata.read_buoy_data(f) for f in fnamesAll])  # takes around 4-5 minutes

# we only have chlor_a data from 2002-07-04 onward
dfvvAll = dfAll[dfAll.time>='2002-07-04']

sum(dfvvAll.time<'2002-07-04') # recheck that no earlier times remain


Out[31]:
0

In [32]:
# wrap the longitudes so they are all >= 0
print('before processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(), dfvvAll.lon.max()))
mask = dfvvAll.lon < 0
dfvvAll.loc[mask, 'lon'] = dfvvAll.loc[mask, 'lon'] + 360
print('after processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(), dfvvAll.lon.max()))

dfvvAll.describe()
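The wrap can also be written without chained indexing (which is what triggers pandas' SettingWithCopyWarning); a toy sketch using modulo arithmetic:

```python
import pandas as pd

# Drifter longitudes in [-180, 180); wrap the negatives into [0, 360).
df = pd.DataFrame({"lon": [-170.0, 10.0, -45.0]})
df["lon"] = df["lon"] % 360  # modulo handles positives and negatives alike
```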


before processing, the minimum longitude is 0.000 and maximum is 360.000
after processing, the minimum longitude is 0.000 and maximum is 360.000
Out[32]:
id lat lon temp ve vn spd var_lat var_lon var_tmp
count 2.147732e+07 2.131997e+07 2.131997e+07 1.986179e+07 2.129142e+07 2.129142e+07 2.129142e+07 2.147732e+07 2.147732e+07 2.147732e+07
mean 1.765662e+06 -2.263128e+00 2.124412e+02 1.986121e+01 2.454172e-01 4.708192e-01 2.613427e+01 7.326258e+00 7.326555e+00 7.522298e+01
std 9.452835e+06 3.401115e+01 9.746941e+01 8.339498e+00 2.525050e+01 2.052160e+01 1.939087e+01 8.527853e+01 8.527851e+01 2.637454e+02
min 2.578000e+03 -7.764700e+01 0.000000e+00 -1.685000e+01 -2.916220e+02 -2.601400e+02 0.000000e+00 5.268300e-07 -3.941600e-02 1.001300e-03
25% 4.897500e+04 -3.186000e+01 1.490720e+02 1.437300e+01 -1.411400e+01 -1.044700e+01 1.290300e+01 4.366500e-06 7.512600e-06 1.435700e-03
50% 7.141300e+04 -4.920000e+00 2.153940e+02 2.214400e+01 -5.560000e-01 1.970000e-01 2.176700e+01 8.833600e-06 1.495800e-05 1.691700e-03
75% 1.094330e+05 2.756000e+01 3.064370e+02 2.688900e+01 1.356100e+01 1.109300e+01 3.405900e+01 1.833300e-05 3.627900e-05 2.294200e-03
max 6.399288e+07 8.989900e+01 3.600000e+02 4.595000e+01 4.417070e+02 2.783220e+02 4.421750e+02 1.000000e+03 1.000000e+03 1.000000e+03

In [33]:
# select only the Arabian Sea region
arabian_sea = (dfvvAll.lon > 45) & (dfvvAll.lon < 75) & (dfvvAll.lat > 5) & (dfvvAll.lat < 28)
# arabian_sea = {'lon': slice(45,75), 'lat': slice(5,28)} # could use this dict for xarray-style selection later
floatsAll = dfvvAll.loc[arabian_sea]   # select rows directly with the boolean mask
print('dfvvAll.shape is %s, floatsAll.shape is %s' % (dfvvAll.shape, floatsAll.shape) )


dfvvAll.shape is (21477317, 11), floatsAll.shape is (111894, 11)

In [34]:
# visualize the floats over the globe
fig, ax = plt.subplots(figsize=(12,10))
dfvvAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)

# visualize the floats in the Arabian Sea region
fig, ax = plt.subplots(figsize=(12,10))
floatsAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x24f194198>

In [35]:
# pandas cannot do this resampling directly: we are really indexing on
# ['time', 'id'], and DataFrame.resample raises "TypeError: Only valid with
# DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'"
print()
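One way around the MultiIndex limitation in pandas itself is `pd.Grouper`, which lets a time binning coexist with the `id` key; a toy sketch (the drifter ids and temperatures are made up):

```python
import numpy as np
import pandas as pd

# Two drifters reporting daily for 12 days.
times = pd.date_range("2002-07-04", periods=12, freq="D")
df = pd.DataFrame({
    "time": list(times) * 2,
    "id": [1] * 12 + [2] * 12,
    "temp": np.arange(24.0),
})

# 9-day bins per drifter id: resample-like behaviour without a DatetimeIndex.
out = df.groupby(["id", pd.Grouper(key="time", freq="9D")])["temp"].mean()
```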




In [36]:
# move the surface drifter data from a pandas DataFrame to an xarray Dataset
floatsDSAll = xr.Dataset.from_dataframe(floatsAll.set_index(['time','id']))  # index on (time, id); use reset_index to revert
floatsDSAll


Out[36]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 17499)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-04T06:00:00 ...
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
Data variables:
    lat      (time, id) float64 nan 16.3 14.03 16.4 14.04 nan 20.11 nan ...
    lon      (time, id) float64 nan 66.23 69.48 64.58 69.51 nan 68.55 nan ...
    temp     (time, id) float64 nan nan nan 28.0 28.53 nan 28.93 nan 27.81 ...
    ve       (time, id) float64 nan 8.68 5.978 6.286 4.844 nan 32.9 nan ...
    vn       (time, id) float64 nan -13.18 -18.05 -7.791 -17.47 nan 15.81 ...
    spd      (time, id) float64 nan 15.78 19.02 10.01 18.13 nan 36.51 nan ...
    var_lat  (time, id) float64 nan 0.0002661 5.01e-05 5.018e-05 5.024e-05 ...
    var_lon  (time, id) float64 nan 0.0006854 8.851e-05 9.018e-05 8.968e-05 ...
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.003733 0.0667 nan 0.001683 ...
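`Dataset.from_dataframe` on a (time, id) MultiIndex produces the dense 2-D layout seen above, with NaN for every (time, id) pair that has no observation; the pandas `unstack` analogue on a toy table:

```python
import numpy as np
import pandas as pd

# Long-format drifter records: one row per (time, id) observation.
df = pd.DataFrame({
    "time": pd.to_datetime(["2002-07-04", "2002-07-04", "2002-07-05"]),
    "id": [7574, 10206, 7574],
    "temp": [28.0, 28.5, 27.9],
})

# Index on (time, id), then pivot id into columns; absent pairs become NaN.
wide = df.set_index(["time", "id"])["temp"].unstack("id")
```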

In [37]:
# resample the xarray Dataset onto a nine-day frequency
floatsDSAll_9D = floatsDSAll.resample('9D', dim='time')
floatsDSAll_9D


Out[37]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 568)
Coordinates:
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-13 2002-07-22 ...
Data variables:
    var_lon  (time, id) float64 nan 0.005002 0.0001159 0.000123 9.882e-05 ...
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.00362 0.08884 nan 0.001708 ...
    spd      (time, id) float64 nan 8.463 17.92 20.22 16.84 nan 25.5 nan ...
    lon      (time, id) float64 nan 66.53 69.91 65.04 69.92 nan 69.45 nan ...
    var_lat  (time, id) float64 nan 0.001326 6.127e-05 6.456e-05 5.404e-05 ...
    ve       (time, id) float64 nan 7.056 12.94 11.14 11.71 nan 24.13 nan ...
    vn       (time, id) float64 nan 0.02706 -8.271 -14.36 -7.037 nan -1.798 ...
    lat      (time, id) float64 nan 16.22 13.6 16.07 13.64 nan 20.08 nan ...
    temp     (time, id) float64 nan nan nan 27.8 28.57 nan 28.98 nan 27.62 ...

In [38]:
# transfer it back to a pandas DataFrame for plotting
floatsDFAll_9D = floatsDSAll_9D.to_dataframe()
floatsDFAll_9D = floatsDFAll_9D.reset_index()

# visualize the subsampled floats in the Arabian Sea region
fig, ax = plt.subplots(figsize=(12,10))
floatsDFAll_9D.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1235ff0b8>

In [39]:
# get the chlorophyll value for each data entry
floatsDFAll_9Dtimeorder = floatsDFAll_9D.sort_values(['time','id'], ascending=True)
floatsDFAll_9Dtimeorder # check that it is time-ordered
# should we drop NaNs to speed things up?


Out[39]:
id time var_lon var_tmp spd lon var_lat ve vn lat temp
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
568 10206 2002-07-04 0.005002 1000.000000 8.462583 66.533639 0.001326 7.056056 0.027056 16.219000 NaN
1136 10208 2002-07-04 0.000116 1000.000000 17.918639 69.914139 0.000061 12.942722 -8.270944 13.599500 NaN
1704 11089 2002-07-04 0.000123 0.003620 20.217250 65.036972 0.000065 11.143806 -14.363611 16.068778 27.796889
2272 15703 2002-07-04 0.000099 0.088844 16.841889 69.915889 0.000054 11.706389 -7.037222 13.637028 28.572750
2840 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3408 27069 2002-07-04 0.000102 0.001708 25.498500 69.445889 0.000056 24.130583 -1.797889 20.077750 28.981389
3976 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4544 28842 2002-07-04 0.000208 0.003344 18.067944 60.830556 0.000099 5.026861 -8.122111 18.624861 27.620472
5112 34159 2002-07-04 0.000112 1000.000000 37.039111 59.736889 0.000059 31.753083 16.704556 12.894139 NaN
5680 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
6248 34210 2002-07-04 0.000124 0.003696 26.752611 56.763194 0.000063 -3.750167 -17.727861 6.022361 26.452806
6816 34211 2002-07-04 0.000102 0.003512 28.413083 68.565389 0.000055 23.260944 -14.241500 8.210750 28.380222
7384 34212 2002-07-04 0.000100 0.003549 48.849444 65.946111 0.000055 42.864500 7.009000 6.679556 28.577889
7952 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
8520 34310 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9088 34311 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9656 34312 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10224 34314 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10792 34315 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11360 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11928 34708 2002-07-04 0.000109 0.001784 30.958611 60.658278 0.000057 30.589861 2.088833 10.225111 27.291500
12496 34709 2002-07-04 0.000189 0.002233 96.889000 52.974000 0.000094 -84.267000 47.817000 5.027000 26.934000
13064 34710 2002-07-04 0.000097 0.001854 47.400444 50.333750 0.000051 0.763722 -8.140111 13.057111 31.149000
13632 34714 2002-07-04 0.000111 0.001799 37.025528 64.694000 0.000059 35.870556 4.682833 13.741167 27.765694
14200 34716 2002-07-04 0.000106 0.001785 37.381000 66.285250 0.000057 34.058694 2.213806 7.716750 28.780944
14768 34718 2002-07-04 0.000103 0.001695 39.619167 72.973500 0.000055 20.389944 -33.327528 15.458417 29.063444
15336 34719 2002-07-04 0.000107 0.001652 27.380083 71.378167 0.000057 13.919778 -20.858056 17.217028 28.959417
15904 34720 2002-07-04 0.000111 0.001771 24.609778 69.482472 0.000060 11.141528 -19.437778 14.142056 28.664167
16472 34721 2002-07-04 0.000113 0.001733 13.113972 65.552333 0.000060 5.753667 -10.328972 16.928722 27.908083
... ... ... ... ... ... ... ... ... ... ... ...
130639 3098682 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
131207 60073460 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
131775 60074440 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
132343 60077450 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
132911 60150420 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
133479 60454500 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
134047 60656200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
134615 60657200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
135183 60658190 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
135751 60659110 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
136319 60659120 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
136887 60659190 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
137455 60659200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
138023 60940960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
138591 60940970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
139159 60941960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
139727 60941970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
140295 60942960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
140863 60942970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
141431 60943960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
141999 60943970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
142567 60944960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
143135 60944970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
143703 60945970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
144271 60946960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
144839 60947960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
145407 60947970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
145975 60948960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
146543 60950430 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN
147111 62321420 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN

147112 rows × 11 columns


In [40]:
floatsDFAll_9Dtimeorder.lon.dropna().shape  # the longitude column has plenty of valid values: (3466,)


Out[40]:
(3466,)

In [42]:
# a little test of the loop API for the DataFrame;
# see df.itertuples? -- it is faster and preserves the dtypes
'''
chl_ocx=[]
for row in floats_timeorder.itertuples():
    #print(row)
    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )
    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation
    chl_ocx.append(tmp)
floats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)
chl_ocx[0].to_series
'''


Out[42]:
"\nchl_ocx=[]\nfor row in floats_timeorder.itertuples():\n    #print(row)\n    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )\n    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation\n    chl_ocx.append(tmp)\nfloats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)\nchl_ocx[0].to_series\n"

In [43]:
# this one-liner avoids the loop above, but the 2-D nearest-neighbor
# interpolation over all points takes a long time (about an hour)
tmpAll = ds_9day.chlor_a.sel_points(time=list(floatsDFAll_9Dtimeorder.time), lon=list(floatsDFAll_9Dtimeorder.lon), lat=list(floatsDFAll_9Dtimeorder.lat), method='nearest')
print('the count of nan values in tmpAll is', tmpAll.to_series().isnull().sum())


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less_equal
  indexer = np.where(op(left_distances, right_distances) |
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less
  indexer = np.where(op(left_distances, right_distances) |
the count of nan values in tmpAll is 145481
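For many points at once, the nearest lookup can be vectorised with broadcasting rather than a Python loop; a sketch on a hypothetical 1-D grid:

```python
import numpy as np

# Hypothetical 1-D grid and several drifter longitudes to locate at once.
grid_lon = np.array([45.0, 45.5, 46.0, 46.5])
pts_lon = np.array([45.1, 46.4, 44.9])

# Broadcast |grid - point| to a (points, grid) matrix and argmin per row.
idx = np.abs(grid_lon[None, :] - pts_lon[:, None]).argmin(axis=1)
```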

In [44]:
#print(tmpAll.dropna().shape)
tmpAll.to_series().dropna().shape  # (1631,) good values


Out[44]:
(1631,)

In [45]:
# use tmpAll.to_series() to transfer from an xarray DataArray to a pandas Series
floatsDFAll_9Dtimeorder['chlor_a'] = pd.Series(np.array(tmpAll.to_series()), index=floatsDFAll_9Dtimeorder.index)
print("after editing the dataframe the nan values in 'chlor_a' is", floatsDFAll_9Dtimeorder.chlor_a.isnull().sum() )  # should match the count above

# take a look at the data
floatsDFAll_9Dtimeorder

# visualize the floats in the Arabian Sea region, colored by chlor_a
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_9Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a', cmap='RdBu_r', edgecolor='none', ax=ax)

def scale(x):
    # log10 transform for plotting chlorophyll on a log color scale
    return np.log10(x)

#print(floatsDFAll_9Dtimeorder['chlor_a'].apply(scale))
floatsDFAll_9Dtimeorder['chlor_a_log10'] = floatsDFAll_9Dtimeorder['chlor_a'].apply(scale)
floatsDFAll_9Dtimeorder
#print("after the transformation the nan values in 'chlor_a_log10' is", floatsDFAll_9Dtimeorder.chlor_a_log10.isnull().sum() )

# visualize the floats in the Arabian Sea region, colored by log10(chlor_a)
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_9Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
floatsDFAll_9Dtimeorder.chlor_a.dropna().shape  # (1631,)
#floatsDFAll_9Dtimeorder.chlor_a_log10.dropna().shape  # (1631,)


after editing the dataframe the nan values in 'chlor_a' is 145481
Out[45]:
(1631,)

In [46]:
# take the time difference of chlor_a; this has to be done in xarray,
# so transfer the DataFrame back into an xarray Dataset
floatsDFAll_9Dtimeorder


Out[46]:
id time var_lon var_tmp spd lon var_lat ve vn lat temp chlor_a chlor_a_log10
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
568 10206 2002-07-04 0.005002 1000.000000 8.462583 66.533639 0.001326 7.056056 0.027056 16.219000 NaN NaN NaN
1136 10208 2002-07-04 0.000116 1000.000000 17.918639 69.914139 0.000061 12.942722 -8.270944 13.599500 NaN NaN NaN
1704 11089 2002-07-04 0.000123 0.003620 20.217250 65.036972 0.000065 11.143806 -14.363611 16.068778 27.796889 NaN NaN
2272 15703 2002-07-04 0.000099 0.088844 16.841889 69.915889 0.000054 11.706389 -7.037222 13.637028 28.572750 NaN NaN
2840 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3408 27069 2002-07-04 0.000102 0.001708 25.498500 69.445889 0.000056 24.130583 -1.797889 20.077750 28.981389 NaN NaN
3976 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4544 28842 2002-07-04 0.000208 0.003344 18.067944 60.830556 0.000099 5.026861 -8.122111 18.624861 27.620472 NaN NaN
5112 34159 2002-07-04 0.000112 1000.000000 37.039111 59.736889 0.000059 31.753083 16.704556 12.894139 NaN NaN NaN
5680 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6248 34210 2002-07-04 0.000124 0.003696 26.752611 56.763194 0.000063 -3.750167 -17.727861 6.022361 26.452806 NaN NaN
6816 34211 2002-07-04 0.000102 0.003512 28.413083 68.565389 0.000055 23.260944 -14.241500 8.210750 28.380222 0.118290 -0.927052
7384 34212 2002-07-04 0.000100 0.003549 48.849444 65.946111 0.000055 42.864500 7.009000 6.679556 28.577889 NaN NaN
7952 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8520 34310 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9088 34311 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9656 34312 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10224 34314 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10792 34315 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11360 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11928 34708 2002-07-04 0.000109 0.001784 30.958611 60.658278 0.000057 30.589861 2.088833 10.225111 27.291500 NaN NaN
12496 34709 2002-07-04 0.000189 0.002233 96.889000 52.974000 0.000094 -84.267000 47.817000 5.027000 26.934000 0.288937 -0.539196
13064 34710 2002-07-04 0.000097 0.001854 47.400444 50.333750 0.000051 0.763722 -8.140111 13.057111 31.149000 NaN NaN
13632 34714 2002-07-04 0.000111 0.001799 37.025528 64.694000 0.000059 35.870556 4.682833 13.741167 27.765694 NaN NaN
14200 34716 2002-07-04 0.000106 0.001785 37.381000 66.285250 0.000057 34.058694 2.213806 7.716750 28.780944 NaN NaN
14768 34718 2002-07-04 0.000103 0.001695 39.619167 72.973500 0.000055 20.389944 -33.327528 15.458417 29.063444 NaN NaN
15336 34719 2002-07-04 0.000107 0.001652 27.380083 71.378167 0.000057 13.919778 -20.858056 17.217028 28.959417 NaN NaN
15904 34720 2002-07-04 0.000111 0.001771 24.609778 69.482472 0.000060 11.141528 -19.437778 14.142056 28.664167 NaN NaN
16472 34721 2002-07-04 0.000113 0.001733 13.113972 65.552333 0.000060 5.753667 -10.328972 16.928722 27.908083 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
130639 3098682 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
131207 60073460 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
131775 60074440 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
132343 60077450 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
132911 60150420 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
133479 60454500 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
134047 60656200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
134615 60657200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135183 60658190 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
135751 60659110 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
136319 60659120 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
136887 60659190 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
137455 60659200 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
138023 60940960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
138591 60940970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
139159 60941960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
139727 60941970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
140295 60942960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
140863 60942970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
141431 60943960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
141999 60943970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
142567 60944960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
143135 60944970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
143703 60945970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
144271 60946960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
144839 60947960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145407 60947970 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145975 60948960 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
146543 60950430 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
147111 62321420 2016-06-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

147112 rows × 13 columns


In [47]:
# unstack() pivots the inner index level out to columns, giving a 2-D dataframe
# reset_index() turns all index levels back into ordinary columns
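The two comments above can be illustrated on a toy (time, id)-indexed series (hypothetical labels and values):

```python
import pandas as pd

# Toy (time, id)-indexed series, like the per-float records used below.
s = pd.Series([0.1, 0.2, 0.3, 0.4],
              index=pd.MultiIndex.from_product([['t1', 't2'], [1, 2]],
                                               names=['time', 'id']))

wide = s.unstack()          # 2-D frame: time rows x id columns
flat = wide.reset_index()   # index levels back to ordinary columns
print(wide.shape, flat.columns.tolist())
```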

In [48]:
# prepare the data in dataset and about to take the diff
tmp = xr.Dataset.from_dataframe(floatsDFAll_9Dtimeorder.set_index(['time','id']))  # set time & id as the index; use reset_index to revert this operation
# take the diff on the chlor_a
chlor_a_rate = tmp.diff(dim='time',n=1).chlor_a.to_series().reset_index()
# rename the column to a proper name
chlor_a_rate.rename(columns={'chlor_a':'chl_rate'}, inplace=True)
chlor_a_rate


# merge the two dataframes {floatsDFAll_9Dtimeorder; chlor_a_rate} into one dataframe on the index {id, time}, using a left join
floatsDFAllRate_9Dtimeorder = pd.merge(floatsDFAll_9Dtimeorder, chlor_a_rate, on=['time','id'], how='left')
floatsDFAllRate_9Dtimeorder

# check that the merge preserved the rate values
print('check the sum of chl_rate before the merge', chlor_a_rate.chl_rate.sum())
print('check the sum of chl_rate after the merge', floatsDFAllRate_9Dtimeorder.chl_rate.sum())


# visualize the chlorophyll rate; the linear scale with symmetric color limits works well here
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_9Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.8, vmax=0.8, edgecolor='none', ax=ax)

# visualize the chlorophyll rate on the log scale
floatsDFAllRate_9Dtimeorder['chl_rate_log10'] = floatsDFAllRate_9Dtimeorder['chl_rate'].apply(scale)
floatsDFAllRate_9Dtimeorder
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_9Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
#floatsDFAllRate_9Dtimeorder.chl_rate.dropna().shape   # (1008,) data points
floatsDFAllRate_9Dtimeorder.chl_rate_log10.dropna().shape   # (417,) data points; chl_rate can be negative, so log10 is undefined for those rows


check the sum of chl_rate before the merge -64.3533472159612
check the sum of chl_rate after the merge -64.3533472159612
Out[48]:
(417,)
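The diff-then-merge recipe in the cell above can be sketched end to end on a toy float track (hypothetical ids, times, and chlorophyll values):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy track: two floats ("id") observed at three times.
df = pd.DataFrame({
    'time': pd.to_datetime(['2002-07-04'] * 2 + ['2002-07-13'] * 2 + ['2002-07-22'] * 2),
    'id':   [1, 2, 1, 2, 1, 2],
    'chlor_a': [0.10, 0.20, 0.15, 0.18, 0.25, np.nan],
})

# Same recipe as above: index by (time, id), diff along time, flatten back.
ds = xr.Dataset.from_dataframe(df.set_index(['time', 'id']))
rate = ds.diff(dim='time', n=1).chlor_a.to_series().reset_index()
rate = rate.rename(columns={'chlor_a': 'chl_rate'})

# A left merge keeps every original row; rows with no diff get NaN.
merged = pd.merge(df, rate, on=['time', 'id'], how='left')
print(merged)
```

Because the diff consumes the first time step (and propagates NaN observations), only three of the six rows carry a finite `chl_rate` here.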

In [49]:
pd.to_datetime(floatsDFAllRate_9Dtimeorder.time)
type(pd.to_datetime(floatsDFAllRate_9Dtimeorder.time))
ts = pd.Series(0, index=pd.to_datetime(floatsDFAllRate_9Dtimeorder.time))  # create a target time series for masking purposes

# take the month out
month = ts.index.month 
# month.shape # a check on the shape of the month.
selector = (month == 11) | (month == 12) | (month == 1) | (month == 2) | (month == 3)
selector
print('shape of the selector', selector.shape)

print('all the data count in [11-01, 03-31]  is', floatsDFAllRate_9Dtimeorder[selector].chl_rate.dropna().shape) # total (672,)
print('all the data count is', floatsDFAllRate_9Dtimeorder.chl_rate.dropna().shape )   # total (1008,)


shape of the selector (147112,)
all the data count in [11-01, 03-31]  is (672,)
all the data count is (1008,)
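The month-mask idea above can be written more compactly with `isin`; a minimal sketch on hypothetical dates:

```python
import pandas as pd

# Hypothetical times spanning the year; same masking idea as the cell above.
times = pd.to_datetime(['2003-01-15', '2003-05-20', '2003-11-07', '2003-12-25'])
ts = pd.Series(0, index=times)

# Equivalent, more compact form of the chained | comparisons:
selector = ts.index.month.isin([11, 12, 1, 2, 3])
print(selector)
```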

In [50]:
# histogram of the non-standardized data
axfloat = floatsDFAllRate_9Dtimeorder[selector].chl_rate.dropna().hist(bins=100,range=[-0.3,0.3])
axfloat.set_title('9-Day chl_rate')


Out[50]:
<matplotlib.text.Text at 0x447f5fe48>

In [51]:
# standardized series
ts = floatsDFAllRate_9Dtimeorder[selector].chl_rate.dropna()
ts_standardized = (ts - ts.mean())/ts.std()
axts = ts_standardized.hist(bins=100,range=[-0.3,0.3])
axts.set_title('9-Day standardized chl_rate')


Out[51]:
<matplotlib.text.Text at 0x11a8bf048>
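The z-score recipe used above is just centering and scaling; on a toy series (hypothetical values), the result has mean 0 and sample standard deviation 1 by construction:

```python
import pandas as pd

# Toy series; same standardization recipe applied to chl_rate above.
s = pd.Series([0.1, -0.2, 0.05, 0.3, -0.15])
s_std = (s - s.mean()) / s.std()   # pandas .std() defaults to ddof=1 (sample std)

print(s_std.mean(), s_std.std())
```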

In [52]:
# all the data
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2017), axes.flat) :
    tmpyear = floatsDFAllRate_9Dtimeorder[(floatsDFAllRate_9Dtimeorder.time > str(i)) & (floatsDFAllRate_9Dtimeorder.time < str(i+1))]  # rows falling in calendar year i
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)   # total is 1001
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r',vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)     
    
# remove the extra figure
ax = plt.subplot(8,2,16)
fig.delaxes(ax)


(47,)
(45,)
(7,)
(38,)
(105,)
(92,)
(140,)
(31,)
(62,)
(16,)
(32,)
(28,)
(198,)
(105,)
(55,)

In [53]:
fig, axes = plt.subplots(nrows=7, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2016), axes.flat) :
    tmpyear = floatsDFAllRate_9Dtimeorder[(floatsDFAllRate_9Dtimeorder.time >= (str(i) + '-11-01')) & (floatsDFAllRate_9Dtimeorder.time <= (str(i+1) + '-03-31'))]
    # select only the winter window: Nov 1 of year i through Mar 31 of year i+1
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)  # the total is 672
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)


(58,)
(1,)
(9,)
(66,)
(44,)
(129,)
(28,)
(50,)
(5,)
(32,)
(0,)
(106,)
(91,)
(53,)


In [104]:
# write the data to a csv or hdf file on disk so later experiments can skip this preprocessing

df_list = []
for i in range(2002,2017) :
    tmpyear = floatsDFAllRate_9Dtimeorder[(floatsDFAllRate_9Dtimeorder.time >= (str(i) + '-11-01')) & (floatsDFAllRate_9Dtimeorder.time <= (str(i+1) + '-03-31'))]
    # select only the winter window: Nov 1 of year i through Mar 31 of year i+1
    df_list.append(tmpyear)

df_tmp = pd.concat(df_list)
print('all the data count in [11-01, 03-31]  is ', df_tmp.chl_rate.dropna().shape) # again, the total is (672,)
df_chl_out_9D_modisa = df_tmp[~df_tmp.chl_rate.isnull()] # keep only the non-NaN rows
#list(df_chl_out_9D_modisa.groupby(['id']))   # shows the continuity pattern of the Lagrangian difference for each float id

# output to a csv or hdf file
df_chl_out_9D_modisa.head()


all the data count in [11-01, 03-31]  is  (672,)
Out[104]:
id time temp var_lat var_tmp ve var_lon vn lon spd lat chlor_a chlor_a_log10 chl_rate chl_rate_log10
3627 10206 2002-11-07 NaN 0.000494 1000.000000 -2.217083 0.001535 2.990778 67.132000 5.446583 11.126222 0.130267 -0.885166 -0.004264 NaN
3629 11089 2002-11-07 28.829472 0.000064 0.003812 -16.412472 0.000123 -3.991722 64.391056 17.995028 14.279667 0.197237 -0.705012 0.074821 -1.125976
3631 15707 2002-11-07 NaN 0.000074 1000.000000 -12.316611 0.000147 -18.253056 67.155306 24.656417 13.142667 0.152200 -0.817584 -0.004472 NaN
3649 34710 2002-11-07 28.448167 0.000069 0.001857 -2.827667 0.000135 19.539861 63.041861 20.774778 17.717111 0.372568 -0.428795 0.018603 -1.730417
3886 10206 2002-11-16 NaN 0.001033 1000.000000 -1.089083 0.003872 0.501111 67.029167 4.028889 11.179833 0.145233 -0.837935 0.014966 -1.824894
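The winter-window selection above relies on ISO-format date strings comparing correctly in lexicographic order, so plain string bounds work. A minimal sketch with hypothetical times:

```python
import pandas as pd

# Hypothetical frame with ISO-format time strings, standing in for the float records.
df = pd.DataFrame({
    'time': ['2002-10-15', '2002-11-07', '2003-02-01', '2003-04-10', '2003-12-01'],
    'chl_rate': [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Collect each Nov 1 (year i) .. Mar 31 (year i+1) window, then concatenate.
pieces = []
for i in range(2002, 2004):
    window = df[(df.time >= str(i) + '-11-01') & (df.time <= str(i + 1) + '-03-31')]
    pieces.append(window)

winter = pd.concat(pieces)
print(winter.time.tolist())
```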

In [105]:
df_chl_out_9D_modisa.index.name = 'index'  # give the index an explicit name

# write a CSV with that specific index label
df_chl_out_9D_modisa.to_csv('df_chl_out_9D_modisa.csv', sep=',', index_label = 'index')

# read the CSV back in to verify the round trip
test = pd.read_csv('df_chl_out_9D_modisa.csv', index_col='index')
test.head()


Out[105]:
id time temp var_lat var_tmp ve var_lon vn lon spd lat chlor_a chlor_a_log10 chl_rate chl_rate_log10
index
3627 10206 2002-11-07 NaN 0.000494 1000.000000 -2.217083 0.001535 2.990778 67.132000 5.446583 11.126222 0.130267 -0.885166 -0.004264 NaN
3629 11089 2002-11-07 28.829472 0.000064 0.003812 -16.412472 0.000123 -3.991722 64.391056 17.995028 14.279667 0.197237 -0.705012 0.074821 -1.125976
3631 15707 2002-11-07 NaN 0.000074 1000.000000 -12.316611 0.000147 -18.253056 67.155306 24.656417 13.142667 0.152200 -0.817584 -0.004472 NaN
3649 34710 2002-11-07 28.448167 0.000069 0.001857 -2.827667 0.000135 19.539861 63.041861 20.774778 17.717111 0.372568 -0.428795 0.018603 -1.730417
3886 10206 2002-11-16 NaN 0.001033 1000.000000 -1.089083 0.003872 0.501111 67.029167 4.028889 11.179833 0.145233 -0.837935 0.014966 -1.824894
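The `index_label`/`index_col` pairing above makes the CSV round trip lossless; a self-contained sketch with an in-memory buffer and a hypothetical mini-frame:

```python
import io
import pandas as pd

# Hypothetical mini-frame standing in for df_chl_out_9D_modisa.
df = pd.DataFrame({'id': [10206, 11089], 'chl_rate': [-0.004264, 0.074821]},
                  index=pd.Index([3627, 3629], name='index'))

# Round-trip through CSV with an explicit index label, as in the cell above.
buf = io.StringIO()
df.to_csv(buf, sep=',', index_label='index')
buf.seek(0)
back = pd.read_csv(buf, index_col='index')

print(back.equals(df))
```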
