3-Day Subsampling on the OceanColor Dataset


In [8]:
import xarray as xr
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
from dask.diagnostics import ProgressBar
import seaborn as sns
from matplotlib.colors import LogNorm

Load data from disk

We already downloaded a subsetted MODIS-Aqua chlorophyll-a dataset for the Arabian Sea.

We can read all the NetCDF files into one xarray Dataset using the open_mfdataset function. Note that this does not load the data into memory yet. That only happens when we try to access the values.


In [9]:
ds_8day = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_8D.nc')
ds_daily = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_D.nc')
both_datasets = [ds_8day, ds_daily]

How much data is contained here? Let's get the answer in MB.


In [10]:
print([(ds.nbytes / 1e6) for ds in both_datasets])


[534.295504, 4241.4716]

The 8-day dataset is ~534 MB while the daily dataset is ~4.2 GB. Both fit easily in RAM, so let's load them into memory.
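As a quick sanity check, these sizes can be reproduced from the array shapes alone. The sketch below assumes every variable and coordinate is stored as an 8-byte value (float64, int64 or datetime64[ns], as the dataset summaries in Out[11] indicate). The helper name `dataset_nbytes` is made up for this illustration and is not part of xarray.

```python
# Back-of-the-envelope size check (hypothetical helper, not an xarray function).
# Assumption: every element takes 8 bytes (float64, int64, datetime64[ns]).

def dataset_nbytes(n_time, n_lat=276, n_lon=360, n_rgb=3, n_color=256, itemsize=8):
    chlor_a = n_time * n_lat * n_lon        # variable with dims (time, lat, lon)
    palette = n_time * n_rgb * n_color      # variable with dims (time, rgb, eightbitcolor)
    coords = n_lat + n_lon + n_rgb + n_color + n_time  # one value per coordinate entry
    return (chlor_a + palette + coords) * itemsize

size_8day_mb = dataset_nbytes(667) / 1e6    # 667 eight-day composites
size_daily_mb = dataset_nbytes(5295) / 1e6  # 5295 daily maps
print(size_8day_mb, size_daily_mb)
```

Nearly all of the memory goes to the chlor_a array; the palette and the coordinates are negligible by comparison.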


In [11]:
[ds.load() for ds in both_datasets]


Out[11]:
[<xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 667)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...,
 <xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 5295)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...]

Fix bad data

In preparing this demo, I noticed that a small number of maps had bad data. Specifically, they contained large negative values of chlorophyll concentration. Looking closer, I realized that the land/cloud mask had been inverted on those maps. So I wrote a function that replaces the negative values with NaN, which restores the mask.


In [12]:
def fix_bad_data(ds):
    # for some reason, the cloud / land mask is backwards on some maps
    # this is obvious because there are chlorophyll values less than zero
    bad_data = ds.chlor_a.groupby('time').min() < 0
    # loop through the affected time steps and set negative values to NaN (in place)
    for n in np.nonzero(bad_data.values)[0]:
        data = ds.chlor_a[n].values
        ds.chlor_a.values[n] = np.ma.masked_less(data, 0).filled(np.nan)

In [13]:
[fix_bad_data(ds) for ds in both_datasets]


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in less
  if not reflexive
Out[13]:
[None, None]
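
The effect of fix_bad_data on a single map can be illustrated on a small made-up array. This sketch uses np.where rather than the masked-array call in the function; for this purpose the two should give the same result. The values are invented for illustration.

```python
import numpy as np

# Toy 2x3 "map": NaN marks land/cloud, negative values are the bad data.
chl = np.array([[0.2, -32767.0, np.nan],
                [1.5, 0.0, -1.0]])

# Keep non-negative values and existing NaNs; turn negative values into NaN.
fixed = np.where(chl < 0, np.nan, chl)
print(fixed)
```

Existing NaNs are untouched because a comparison with NaN is always False.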

In [14]:
ds_8day.chlor_a>0


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[14]:
<xarray.DataArray 'chlor_a' (time: 667, lat: 276, lon: 360)>
array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ...,  True, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False,  True],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True,  True]],

       ..., 
       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]]], dtype=bool)
Coordinates:
  * lat      (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 27.37 ...
  * lon      (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 45.63 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...

Count the number of ocean data points

First we have to figure out the land mask. Unfortunately it doesn't come with the dataset. But we can infer it by counting all the points that have at least one non-nan chlorophyll value.
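
The idea can be sketched on a tiny made-up stack of snapshots before applying it to the real data. The array and its values are invented; only the logic (count valid values along time, then test for at least one) mirrors what is done in this notebook.

```python
import numpy as np

# Made-up stack of 3 snapshots on a 2x2 grid; NaN means "no retrieval".
chl = np.array([
    [[np.nan, 0.3], [np.nan, np.nan]],
    [[np.nan, np.nan], [np.nan, 1.2]],
    [[np.nan, 0.5], [np.nan, np.nan]],
])

valid_counts = (chl > 0).sum(axis=0)   # number of valid snapshots per grid point
ocean_mask = valid_counts > 0          # True where at least one valid value exists
num_ocean_points = int(ocean_mask.sum())
print(valid_counts, num_ocean_points)
```

One caveat of this approach: an ocean point that never has a valid retrieval in the whole record would be classified as land.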


In [15]:
(ds_8day.chlor_a>0).sum(dim='time').plot()


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[15]:
<matplotlib.collections.QuadMesh at 0x11949c358>

In [16]:
# infer the ocean mask: True wherever chlor_a is positive at least once (land stays False)
ocean_mask = (ds_8day.chlor_a>0).sum(dim='time')>0
#ocean_mask = (ds_daily.chlor_a>0).sum(dim='time')>0
num_ocean_points = ocean_mask.sum().values  # total number of ocean grid points
ocean_mask.plot()
plt.title('%g total ocean points' % num_ocean_points)


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[16]:
<matplotlib.text.Text at 0x13d340c88>

In [17]:
#ds_8day

In [18]:
#ds_daily

In [19]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time='2002-11-18',method='nearest').plot(norm=LogNorm())
#ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[19]:
<matplotlib.collections.QuadMesh at 0x1292f55c0>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [20]:
#list(ds_daily.groupby('time')) # take a look at what's inside

Now we count up the number of valid points in each snapshot and divide by the total number of ocean points.
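
As a minimal sketch of this coverage calculation, assume a made-up stack of three snapshots on a 2x3 grid with four ocean points. The numbers are invented; the real calculation below uses groupby('time').count() on the xarray objects.

```python
import numpy as np

# Three made-up snapshots on a 2x3 grid; NaN means land, cloud, or no retrieval.
nan = np.nan
chl = np.array([
    [[nan, 0.3, 0.4], [nan, nan, 1.0]],   # 3 valid points
    [[nan, nan, nan], [nan, 0.2, nan]],   # 1 valid point
    [[nan, nan, nan], [nan, nan, nan]],   # fully cloudy
])

num_ocean_points = int(np.isfinite(chl).any(axis=0).sum())  # points valid at least once
valid_per_snapshot = np.isfinite(chl).sum(axis=(1, 2))      # non-NaN count per time step
coverage = valid_per_snapshot / float(num_ocean_points)     # fraction of ocean observed
print(coverage)
```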


In [21]:
'''
<xarray.Dataset>
Dimensions:        (eightbitcolor: 256, lat: 144, lon: 276, rgb: 3, time: 4748)
'''
ds_daily.groupby('time').count() # information from original data


Out[21]:
<xarray.Dataset>
Dimensions:  (time: 5295)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
Data variables:
    palette  (time) int64 768 768 768 768 768 768 768 768 768 768 768 768 ...
    chlor_a  (time) int64 658 1170 1532 2798 2632 1100 1321 636 2711 1163 ...

In [22]:
ds_daily.chlor_a.groupby('time').count()/float(num_ocean_points)


Out[22]:
<xarray.DataArray 'chlor_a' (time: 5295)>
array([ 0.01053255,  0.01872809,  0.02452259, ...,  0.        ,
        0.        ,  0.        ])
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...

In [23]:
count_8day,count_daily = [ds.chlor_a.groupby('time').count()/float(num_ocean_points)
                            for ds in (ds_8day, ds_daily)]

In [24]:
#count_8day = ds_8day.chl_ocx.groupby('time').count()/float(num_ocean_points)
#coundt_daily = ds_daily.chl_ocx.groupby('time').count()/float(num_ocean_points)

#count_8day, coundt_daily = [ds.chl_ocx.groupby('time').count()/float(num_ocean_points)
#                            for ds in ds_8day, ds_daily] # not work in python 3

In [25]:
plt.figure(figsize=(12,4))
count_8day.plot(color='k')
count_daily.plot(color='r')

plt.legend(['8 day','daily'])


Out[25]:
<matplotlib.legend.Legend at 0x129b690f0>

Seasonal Climatology


In [26]:
count_8day_clim, count_daily_clim = [count.groupby('time.month').mean()  # monthly climatology
                                     for count in (count_8day, count_daily)]

In [27]:
# monthly-mean coverage fraction (climatology) for each dataset
plt.figure(figsize=(12,4))
count_8day_clim.plot(color='k')
count_daily_clim.plot(color='r')
plt.legend(['8 day', 'daily'])


Out[27]:
<matplotlib.legend.Legend at 0x128d4ada0>

From the above figure, we see that data coverage is highest in the winter (especially February) and lowest in summer.
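
The monthly climatology used here is just a mean over all time steps that fall in the same calendar month. A minimal pandas sketch on a synthetic coverage series (the cosine shape and its parameters are invented so that coverage peaks in mid-February) shows the mechanics:

```python
import numpy as np
import pandas as pd

# Synthetic daily coverage over two years: high in winter, low in summer (made up).
time = pd.date_range('2003-01-01', '2004-12-31', freq='D')
doy = time.dayofyear.to_numpy()
coverage = pd.Series(0.3 + 0.2 * np.cos(2 * np.pi * (doy - 45) / 365.25), index=time)

# Group by calendar month and average: one value per month, 1 through 12.
clim = coverage.groupby(coverage.index.month).mean()
print(clim)
```

In xarray, `groupby('time.month').mean()` plays the same role as the pandas `groupby(index.month).mean()` call here.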

Maps of individual days

Let's grab some data from February and plot it.


In [28]:
target_date = '2003-02-15'
plt.figure(figsize=(8,6))
ds_8day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[28]:
<matplotlib.collections.QuadMesh at 0x129e85c88>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [29]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[29]:
<matplotlib.collections.QuadMesh at 0x12b43b9e8>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [30]:
ds_daily.chlor_a[0].sel_points(lon=[65, 70], lat=[16, 18], method='nearest')   # the time is selected!
#ds_daily.chl_ocx[0].sel_points(time= times, lon=lons, lat=times, method='nearest')


Out[30]:
<xarray.DataArray 'chlor_a' (points: 2)>
array([ nan,  nan])
Coordinates:
    time     datetime64[ns] 2002-07-04
    lon      (points) float64 65.04 70.04
    lat      (points) float64 16.04 18.04
  * points   (points) int64 0 1

In [31]:
#ds_daily.chlor_a.sel_points?

In [32]:
ds_3day = ds_daily.resample('3D', dim='time')
ds_3day


Out[32]:
<xarray.Dataset>
Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 1765)
Coordinates:
  * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
  * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
  * rgb            (rgb) int64 0 1 2
  * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
  * time           (time) datetime64[ns] 2002-07-04 2002-07-07 2002-07-10 ...
Data variables:
    palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
    chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
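
The resampling above groups the 5295 daily maps into consecutive 3-day bins and averages each bin, giving 1765 time steps. The pandas sketch below reproduces the binning on a made-up daily series with the same start date and length. In more recent xarray releases the same operation is written `ds_daily.resample(time='3D').mean()`; the `resample('3D', dim='time')` form used in this notebook belongs to the older API.

```python
import numpy as np
import pandas as pd

# Made-up daily series with the same start date and length as ds_daily.
time = pd.date_range('2002-07-04', periods=5295, freq='D')
values = np.arange(5295, dtype=float)
values[1] = np.nan                      # pretend one day has no data
daily = pd.Series(values, index=time)

# Average over consecutive 3-day bins; NaNs are skipped within each bin.
three_day = daily.resample('3D').mean()
print(len(three_day), three_day.index[:3])
```

Because NaNs are skipped, a 3-day bin is valid as soon as one of its days has data, which is why the 3-day maps have better coverage than the daily ones.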

In [33]:
plt.figure(figsize=(8,6))
ds_3day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[33]:
<matplotlib.collections.QuadMesh at 0x13d2d3240>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [34]:
# check the minimum longitude and latitude of the grid
print(ds_3day.lon.min(),'\n' ,ds_3day.lat.min())


<xarray.DataArray 'lon' ()>
array(45.04166793823242) 
 <xarray.DataArray 'lat' ()>
array(5.041661739349365)

++++++++++++++++++++++++++++++++++++++++++++++

All GDP Floats

Load the float data

Map each (time, lon, lat) of a float record to a chlorophyll value


In [35]:
# in the following we deal with the data from the gdp float
from buyodata import buoydata
import os

In [36]:
# a list of files
fnamesAll = ['./gdp_float/buoydata_1_5000.dat','./gdp_float/buoydata_5001_10000.dat','./gdp_float/buoydata_10001_15000.dat','./gdp_float/buoydata_15001_jun16.dat']

In [37]:
# read them and concatenate them into one DataFrame
dfAll = pd.concat([buoydata.read_buoy_data(f) for f in fnamesAll])  # around 4~5 minutes

#mask = df.time>='2002-07-04' # we only have chlor_a data after this date
dfvvAll = dfAll[dfAll.time>='2002-07-04']

sum(dfvvAll.time<'2002-07-04') # recheck that no records before the start date remain (expect 0)


Out[37]:
0

In [38]:
# process the data so that the longitudes are all >= 0
print('before processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(), dfvvAll.lon.max()))
mask = dfvvAll.lon<0
dfvvAll.lon[mask] = dfvvAll.loc[mask].lon + 360
print('after processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(),dfvvAll.lon.max()) )

dfvvAll.describe()


before processing, the minimum longitude is 0.000 and maximum is 360.000
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/core/generic.py:4695: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  exec(code_obj, self.user_global_ns, self.user_ns)
after processing, the minimum longitude is 0.000 and maximum is 360.000
Out[38]:
id lat lon temp ve vn spd var_lat var_lon var_tmp
count 2.147732e+07 2.131997e+07 2.131997e+07 1.986179e+07 2.129142e+07 2.129142e+07 2.129142e+07 2.147732e+07 2.147732e+07 2.147732e+07
mean 1.765662e+06 -2.263128e+00 2.124412e+02 1.986121e+01 2.454172e-01 4.708192e-01 2.613427e+01 7.326258e+00 7.326555e+00 7.522298e+01
std 9.452835e+06 3.401115e+01 9.746941e+01 8.339498e+00 2.525050e+01 2.052160e+01 1.939087e+01 8.527853e+01 8.527851e+01 2.637454e+02
min 2.578000e+03 -7.764700e+01 0.000000e+00 -1.685000e+01 -2.916220e+02 -2.601400e+02 0.000000e+00 5.268300e-07 -3.941600e-02 1.001300e-03
25% 4.897500e+04 -3.186000e+01 1.490720e+02 1.437300e+01 -1.411400e+01 -1.044700e+01 1.290300e+01 4.366500e-06 7.512600e-06 1.435700e-03
50% 7.141300e+04 -4.920000e+00 2.153940e+02 2.214400e+01 -5.560000e-01 1.970000e-01 2.176700e+01 8.833600e-06 1.495800e-05 1.691700e-03
75% 1.094330e+05 2.756000e+01 3.064370e+02 2.688900e+01 1.356100e+01 1.109300e+01 3.405900e+01 1.833300e-05 3.627900e-05 2.294200e-03
max 6.399288e+07 8.989900e+01 3.600000e+02 4.595000e+01 4.417070e+02 2.783220e+02 4.421750e+02 1.000000e+03 1.000000e+03 1.000000e+03
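
The SettingWithCopyWarning in this cell most likely comes from assigning into dfvvAll, which is itself a slice of dfAll, through chained indexing. One way to avoid it, sketched here on a made-up three-row frame, is to take an explicit copy and assign with a single `.loc` call. The frame and its values are invented for illustration. Note also that in this run the minimum longitude was already 0, so the wrap did not change any record.

```python
import pandas as pd

# Made-up float positions; two of them use negative (western) longitudes.
floats = pd.DataFrame({'id': [1, 2, 3], 'lon': [-170.0, 65.5, -0.5]})

wrapped = floats.copy()                 # explicit copy, so the assignment is unambiguous
west = wrapped['lon'] < 0               # rows that need wrapping
wrapped.loc[west, 'lon'] = wrapped.loc[west, 'lon'] + 360   # map [-180, 0) to [180, 360)

print('min %.3f, max %.3f' % (wrapped.lon.min(), wrapped.lon.max()))
```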

In [39]:
# Select only the arabian sea region
arabian_sea = (dfvvAll.lon > 45) & (dfvvAll.lon< 75) & (dfvvAll.lat> 5) & (dfvvAll.lat <28)
# arabian_sea = {'lon': slice(45,75), 'lat': slice(5,28)} # later use this longitude and latitude
floatsAll = dfvvAll.loc[arabian_sea]   # directly use mask
print('dfvvAll.shape is %s, floatsAll.shape is %s' % (dfvvAll.shape, floatsAll.shape) )


dfvvAll.shape is (21477317, 11), floatsAll.shape is (111894, 11)

In [40]:
# visualize the floats over the whole globe
fig, ax  = plt.subplots(figsize=(12,10))
dfvvAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)

# visualize the floats in the Arabian Sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x4dfa202b0>

In [41]:
# a pandas DataFrame cannot be resampled directly here,
# because the data are effectively indexed by ['time','id'] and DataFrame.resample does not accept a MultiIndex:
# TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
print()
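
This notebook works around the limitation by going through xarray. For comparison, a possible pandas-only alternative is to keep time as an ordinary column and group by the float id together with a `pd.Grouper` on time. This is only a sketch on invented records (the ids 'A' and 'B' and the lat values are made up), not the approach used below.

```python
import pandas as pd

# Made-up records for two floats, one row per float and day.
records = pd.DataFrame({
    'id':   ['A'] * 6 + ['B'] * 3,
    'time': list(pd.date_range('2002-07-04', periods=6, freq='D'))
            + list(pd.date_range('2002-07-05', periods=3, freq='D')),
    'lat':  [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0],
}).sort_values('time')

# Group by float id and by 3-day time bin, then average within each group.
binned = records.groupby(['id', pd.Grouper(key='time', freq='3D')])['lat'].mean()
print(binned)
```

The time bins are shared by all floats, so the result lines up with a gridded product resampled to the same 3-day frequency from the same start date.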




In [42]:
# convert the surface float data from a pandas DataFrame to an xarray Dataset
floatsDSAll = xr.Dataset.from_dataframe(floatsAll.set_index(['time','id']) ) # set time & id as the index; use reset_index to revert this operation
floatsDSAll


Out[42]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 17499)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-04T06:00:00 ...
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
Data variables:
    lat      (time, id) float64 nan 16.3 14.03 16.4 14.04 nan 20.11 nan ...
    lon      (time, id) float64 nan 66.23 69.48 64.58 69.51 nan 68.55 nan ...
    temp     (time, id) float64 nan nan nan 28.0 28.53 nan 28.93 nan 27.81 ...
    ve       (time, id) float64 nan 8.68 5.978 6.286 4.844 nan 32.9 nan ...
    vn       (time, id) float64 nan -13.18 -18.05 -7.791 -17.47 nan 15.81 ...
    spd      (time, id) float64 nan 15.78 19.02 10.01 18.13 nan 36.51 nan ...
    var_lat  (time, id) float64 nan 0.0002661 5.01e-05 5.018e-05 5.024e-05 ...
    var_lon  (time, id) float64 nan 0.0006854 8.851e-05 9.018e-05 8.968e-05 ...
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.003733 0.0667 nan 0.001683 ...

In [51]:
# resample the xarray Dataset to a three-day frequency
floatsDSAll_3D = floatsDSAll.resample('3D', dim='time')
floatsDSAll_3D


Out[51]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 1704)
Coordinates:
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-07 2002-07-10 ...
Data variables:
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.003607 0.07764 nan ...
    lon      (time, id) float64 nan 66.38 69.59 64.73 69.62 nan 68.84 nan ...
    vn       (time, id) float64 nan -6.362 -19.1 -7.017 -18.71 nan 4.231 nan ...
    lat      (time, id) float64 nan 16.21 13.82 16.33 13.83 nan 20.18 nan ...
    var_lon  (time, id) float64 nan 0.006395 9.575e-05 0.0001482 9.875e-05 ...
    temp     (time, id) float64 nan nan nan 27.92 28.56 nan 28.97 nan 27.66 ...
    spd      (time, id) float64 nan 12.92 21.86 14.67 21.2 nan 27.52 nan ...
    var_lat  (time, id) float64 nan 0.001675 5.309e-05 7.571e-05 5.407e-05 ...
    ve       (time, id) float64 nan 10.94 9.9 12.24 9.378 nan 26.28 nan ...

In [44]:
# convert it back to a pandas DataFrame for plotting
floatsDFAll_3D = floatsDSAll_3D.to_dataframe()
floatsDFAll_3D
floatsDFAll_3D = floatsDFAll_3D.reset_index()
floatsDFAll_3D
# visualize the subsampled floats around the Arabian Sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_3D.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a4a18d0>

In [45]:
# look up the chlorophyll value for each data entry
floatsDFAll_3Dtimeorder = floatsDFAll_3D.sort_values(['time','id'],ascending=True)
floatsDFAll_3Dtimeorder # check whether it is time ordered!!
# should we drop NaN rows to speed up the lookup?


Out[45]:
id time var_tmp lon vn lat var_lon temp spd var_lat ve
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1704 10206 2002-07-04 1000.000000 66.375833 -6.362417 16.208333 0.006395 NaN 12.924000 0.001675 10.941333
3408 10208 2002-07-04 1000.000000 69.589833 -19.104583 13.816917 0.000096 NaN 21.864250 0.000053 9.899750
5112 11089 2002-07-04 0.003607 64.731500 -7.016583 16.331167 0.000148 27.917667 14.670833 0.000076 12.239583
6816 15703 2002-07-04 0.077642 69.617667 -18.706167 13.829750 0.000099 28.555167 21.204917 0.000054 9.378000
8520 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10224 27069 2002-07-04 0.001681 68.844083 4.231083 20.177000 0.000107 28.973000 27.516167 0.000058 26.284000
11928 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13632 28842 2002-07-04 0.003285 60.748083 -5.116833 18.852000 0.000225 27.663833 24.501167 0.000106 11.585000
15336 34159 2002-07-04 1000.000000 59.009333 5.826667 12.568250 0.000109 NaN 26.245667 0.000058 25.174250
17040 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
18744 34210 2002-07-04 0.003603 56.899000 -16.234583 6.409500 0.000146 26.715250 19.873667 0.000072 -9.563583
20448 34211 2002-07-04 0.003496 68.015250 -15.920167 8.539083 0.000098 28.316167 26.681000 0.000054 20.941750
22152 34212 2002-07-04 0.003571 64.844583 18.941000 6.327167 0.000096 28.476000 32.034083 0.000053 23.924750
23856 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
25560 34310 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
27264 34311 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
28968 34312 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
30672 34314 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
32376 34315 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
34080 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
35784 34708 2002-07-04 0.001796 59.870667 2.843583 10.175333 0.000093 27.167000 43.217000 0.000050 42.975000
37488 34709 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
39192 34710 2002-07-04 0.001769 49.858667 28.993750 13.062167 0.000066 30.956917 47.145833 0.000037 -17.011917
40896 34714 2002-07-04 0.001840 63.802250 11.571500 13.643167 0.000110 27.707000 39.495917 0.000058 37.529500
42600 34716 2002-07-04 0.001765 65.514500 3.266917 7.507417 0.000105 28.814583 36.961917 0.000057 36.070250
44304 34718 2002-07-04 0.001739 72.491917 -29.327667 16.206417 0.000082 29.149750 37.194417 0.000046 22.008917
46008 34719 2002-07-04 0.001578 71.027333 -10.221667 17.720833 0.000088 28.921667 22.969250 0.000049 19.046167
47712 34720 2002-07-04 0.001779 69.224833 -36.747667 14.669917 0.000118 28.653417 38.318250 0.000063 9.947083
49416 34721 2002-07-04 0.001746 65.406167 -9.409917 17.159250 0.000120 27.919250 14.014583 0.000063 9.293667
... ... ... ... ... ... ... ... ... ... ... ...
391919 3098682 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
393623 60073460 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
395327 60074440 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
397031 60077450 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
398735 60150420 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
400439 60454500 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
402143 60656200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
403847 60657200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
405551 60658190 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
407255 60659110 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
408959 60659120 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
410663 60659190 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
412367 60659200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
414071 60940960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
415775 60940970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
417479 60941960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
419183 60941970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
420887 60942960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
422591 60942970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
424295 60943960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
425999 60943970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
427703 60944960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
429407 60944970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
431111 60945970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
432815 60946960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
434519 60947960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
436223 60947970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
437927 60948960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
439631 60950430 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN
441335 62321420 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN

441336 rows × 11 columns


In [46]:
floatsDFAll_3Dtimeorder.lon.dropna().shape  # only 9689 of the 441336 rows have a non-NaN longitude


Out[46]:
(9689,)
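
The table is this sparse because converting to an xarray Dataset builds the full (time, id) grid: 259 floats times 1704 three-day steps gives 441336 rows, and every combination without an observation is filled with NaN. The sketch below reproduces the effect on three invented records and shows that dropping the NaN rows recovers the sparse form, which is one possible answer to the "should we drop NaN" question above. The values are made up; only the two float ids are borrowed from the output above.

```python
import pandas as pd

# Three made-up observations from two floats at two time steps.
obs = pd.DataFrame({
    'time': pd.to_datetime(['2002-07-04', '2002-07-04', '2002-07-07']),
    'id':   [10206, 10208, 10206],
    'lon':  [66.4, 69.6, 66.9],
}).set_index(['time', 'id'])

# Expand to the full time x id grid, as the Dataset round trip does implicitly.
full_index = pd.MultiIndex.from_product(
    [obs.index.levels[0], obs.index.levels[1]], names=['time', 'id'])
dense = obs.reindex(full_index)               # 2 times x 2 ids = 4 rows, one is NaN

# Dropping rows without a position recovers the original observations.
sparse_again = dense.dropna(subset=['lon'])
print(len(dense), len(sparse_again))
```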

In [47]:
# a little test for the api in loops for the dataframe   
# check df.itertuples? it is faster and preserves the data format
'''
chl_ocx=[]
for row in floats_timeorder.itertuples():
    #print(row)
    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )
    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation
    chl_ocx.append(tmp)
floats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)
chl_ocx[0].to_series
'''


Out[47]:
"\nchl_ocx=[]\nfor row in floats_timeorder.itertuples():\n    #print(row)\n    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )\n    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation\n    chl_ocx.append(tmp)\nfloats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)\nchl_ocx[0].to_series\n"

In [48]:
# this single vectorized call replaces the loop above
# the nearest-neighbor lookup over all rows is slow; it took about an hour
tmpAll = ds_3day.chlor_a.sel_points(time=list(floatsDFAll_3Dtimeorder.time),lon=list(floatsDFAll_3Dtimeorder.lon), lat=list(floatsDFAll_3Dtimeorder.lat), method='nearest')
print('the count of nan values in tmpAll is',tmpAll.to_series().isnull().sum())


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less
  indexer = np.where(op(left_distances, right_distances) |
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less_equal
  indexer = np.where(op(left_distances, right_distances) |
the count of nan values in tmpAll is 438948

In [49]:
#print(tmpAll.dropna().shape)
tmpAll.to_series().dropna().shape  # (2388,) good values


Out[49]:
(2388,)
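
What the nearest-neighbor lookup does can be sketched in plain NumPy on a single made-up map: for each float position, find the closest grid latitude and longitude and read the value at that cell. The helper `nearest_index`, the grid, and the values are all invented for illustration, and the time dimension is left out to keep the sketch short. Newer xarray releases no longer provide sel_points; there, pointwise selection is done by passing DataArray indexers that share a dimension to `sel`.

```python
import numpy as np

def nearest_index(grid, values):
    """Index of the grid coordinate closest to each value (hypothetical helper)."""
    return np.abs(grid[None, :] - values[:, None]).argmin(axis=1)

lat = np.array([18.0, 17.0, 16.0])               # descending, like the satellite grid
lon = np.array([65.0, 66.0, 67.0, 68.0])
chl = np.arange(12, dtype=float).reshape(3, 4)   # made-up map, indexed as chl[i_lat, i_lon]

# Float positions; the second record has no position, as in most rows of the table.
pt_lat = np.array([16.2, np.nan, 17.9])
pt_lon = np.array([65.4, np.nan, 67.6])

# Only look up rows with a valid position; the rest stay NaN.
valid = np.isfinite(pt_lat) & np.isfinite(pt_lon)
sampled = np.full(pt_lat.shape, np.nan)
i = nearest_index(lat, pt_lat[valid])
j = nearest_index(lon, pt_lon[valid])
sampled[valid] = chl[i, j]
print(sampled)
```

Restricting the lookup to rows with a valid position, as done here, is also a cheap way to cut the run time when most rows are NaN.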

In [50]:
# tmpAll.to_series() converts the xarray DataArray to a pandas Series
floatsDFAll_3Dtimeorder['chlor_a'] = pd.Series(np.array(tmpAll.to_series()), index=floatsDFAll_3Dtimeorder.index)
print("after editing the dataframe the nan values in 'chlor_a' is", floatsDFAll_3Dtimeorder.chlor_a.isnull().sum() )  # they should be the same values as above

# take a look at the data
floatsDFAll_3Dtimeorder

# visualize the float around the arabian sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_3Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a', cmap='RdBu_r', edgecolor='none', ax=ax)

def scale(x):
    logged = np.log10(x)
    return logged

#print(floatsAll_timeorder['chlor_a'].apply(scale))
floatsDFAll_3Dtimeorder['chlor_a_log10'] = floatsDFAll_3Dtimeorder['chlor_a'].apply(scale)
floatsDFAll_3Dtimeorder
#print("after the transformation the nan values in 'chlor_a_log10' is", floatsAll_timeorder.chlor_a_log10.isnull().sum() )

# visualize the float around the arabian sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_3Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
#floatsDFAll_3Dtimeorder.chlor_a.dropna().shape  # (2388,)
floatsDFAll_3Dtimeorder.chlor_a_log10.dropna().shape  # (2388,)


after editing the dataframe the nan values in 'chlor_a' is 438948
Out[50]:
(2388,)
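
The log transform can also be applied to the whole column at once, since np.log10 accepts a pandas Series directly and passes NaN through. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Made-up chlorophyll values, including a missing one.
chlor_a = pd.Series([0.1, 1.0, 10.0, np.nan])

# Vectorized log10; the result is a Series with the same index, NaN stays NaN.
chlor_a_log10 = np.log10(chlor_a)
print(chlor_a_log10)
```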

In [59]:
# take the diff of chlor_a; this has to be done in xarray
# convert the DataFrame into an xarray Dataset again
# then take the difference
floatsDFAll_3Dtimeorder


Out[59]:
id time var_tmp lon vn lat var_lon temp spd var_lat ve chlor_a chlor_a_log10
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1704 10206 2002-07-04 1000.000000 66.375833 -6.362417 16.208333 0.006395 NaN 12.924000 0.001675 10.941333 NaN NaN
3408 10208 2002-07-04 1000.000000 69.589833 -19.104583 13.816917 0.000096 NaN 21.864250 0.000053 9.899750 NaN NaN
5112 11089 2002-07-04 0.003607 64.731500 -7.016583 16.331167 0.000148 27.917667 14.670833 0.000076 12.239583 NaN NaN
6816 15703 2002-07-04 0.077642 69.617667 -18.706167 13.829750 0.000099 28.555167 21.204917 0.000054 9.378000 NaN NaN
8520 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10224 27069 2002-07-04 0.001681 68.844083 4.231083 20.177000 0.000107 28.973000 27.516167 0.000058 26.284000 NaN NaN
11928 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13632 28842 2002-07-04 0.003285 60.748083 -5.116833 18.852000 0.000225 27.663833 24.501167 0.000106 11.585000 NaN NaN
15336 34159 2002-07-04 1000.000000 59.009333 5.826667 12.568250 0.000109 NaN 26.245667 0.000058 25.174250 NaN NaN
17040 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18744 34210 2002-07-04 0.003603 56.899000 -16.234583 6.409500 0.000146 26.715250 19.873667 0.000072 -9.563583 NaN NaN
20448 34211 2002-07-04 0.003496 68.015250 -15.920167 8.539083 0.000098 28.316167 26.681000 0.000054 20.941750 NaN NaN
22152 34212 2002-07-04 0.003571 64.844583 18.941000 6.327167 0.000096 28.476000 32.034083 0.000053 23.924750 NaN NaN
23856 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25560 34310 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27264 34311 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28968 34312 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
30672 34314 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32376 34315 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
34080 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35784 34708 2002-07-04 0.001796 59.870667 2.843583 10.175333 0.000093 27.167000 43.217000 0.000050 42.975000 NaN NaN
37488 34709 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
39192 34710 2002-07-04 0.001769 49.858667 28.993750 13.062167 0.000066 30.956917 47.145833 0.000037 -17.011917 NaN NaN
40896 34714 2002-07-04 0.001840 63.802250 11.571500 13.643167 0.000110 27.707000 39.495917 0.000058 37.529500 NaN NaN
42600 34716 2002-07-04 0.001765 65.514500 3.266917 7.507417 0.000105 28.814583 36.961917 0.000057 36.070250 NaN NaN
44304 34718 2002-07-04 0.001739 72.491917 -29.327667 16.206417 0.000082 29.149750 37.194417 0.000046 22.008917 NaN NaN
46008 34719 2002-07-04 0.001578 71.027333 -10.221667 17.720833 0.000088 28.921667 22.969250 0.000049 19.046167 NaN NaN
47712 34720 2002-07-04 0.001779 69.224833 -36.747667 14.669917 0.000118 28.653417 38.318250 0.000063 9.947083 NaN NaN
49416 34721 2002-07-04 0.001746 65.406167 -9.409917 17.159250 0.000120 27.919250 14.014583 0.000063 9.293667 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
391919 3098682 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
393623 60073460 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
395327 60074440 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
397031 60077450 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
398735 60150420 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
400439 60454500 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
402143 60656200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
403847 60657200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
405551 60658190 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
407255 60659110 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
408959 60659120 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
410663 60659190 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
412367 60659200 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
414071 60940960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
415775 60940970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
417479 60941960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
419183 60941970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
420887 60942960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
422591 60942970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
424295 60943960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
425999 60943970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
427703 60944960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
429407 60944970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
431111 60945970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
432815 60946960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
434519 60947960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
436223 60947970 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
437927 60948960 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
439631 60950430 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
441335 62321420 2016-06-29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

441336 rows × 13 columns


In [72]:
# unstack() pivots an inner index level into columns, giving a 2-D DataFrame
# reset_index() moves all index levels back into ordinary columns
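
The two notes above can be illustrated with a minimal sketch. The `toy` frame and its values are made up for illustration; only the column names `time`, `id` and `chlor_a` come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the float table: two float ids at two time steps
toy = pd.DataFrame({
    'time': pd.to_datetime(['2002-07-04', '2002-07-04', '2002-07-07', '2002-07-07']),
    'id': [1, 2, 1, 2],
    'chlor_a': [0.10, 0.50, 0.13, 0.45],
})

# index by (time, id), as done before building the xarray Dataset
indexed = toy.set_index(['time', 'id'])

# unstack('id') moves the id level into columns: rows are times, columns are ids
wide = indexed['chlor_a'].unstack('id')

# reset_index() turns both index levels back into ordinary columns
restored = indexed.reset_index()
print(wide)
print(restored)
```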

In [74]:
# put the data into an xarray Dataset so we can take the difference along time
tmp = xr.Dataset.from_dataframe(floatsDFAll_3Dtimeorder.set_index(['time','id']))  # index by (time, id); use reset_index() to revert
# difference chlor_a between consecutive 3-day time steps for each float id
chlor_a_rate = tmp.diff(dim='time', n=1).chlor_a.to_series().reset_index()
# rename the differenced column
chlor_a_rate.rename(columns={'chlor_a':'chl_rate'}, inplace=True)
chlor_a_rate


# left-merge chlor_a_rate into floatsDFAll_3Dtimeorder on the columns ['time', 'id'], so every original row is kept
floatsDFAllRate_3Dtimeorder=pd.merge(floatsDFAll_3Dtimeorder,chlor_a_rate, on=['time','id'], how = 'left')
floatsDFAllRate_3Dtimeorder

# check 
print('check the sum of the chlor_a before the merge', chlor_a_rate.chl_rate.sum())
print('check the sum of the chlor_a after the merge',floatsDFAllRate_3Dtimeorder.chl_rate.sum())


# visualize the chlorophyll rate, it is *better* to visualize at this scale
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_3Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.8, vmax=0.8, edgecolor='none', ax=ax)

# visualize the chlorophyll rate on the log scale
floatsDFAllRate_3Dtimeorder['chl_rate_log10'] = floatsDFAllRate_3Dtimeorder['chl_rate'].apply(scale)
floatsDFAllRate_3Dtimeorder
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_3Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
floatsDFAllRate_3Dtimeorder.chl_rate.dropna().shape   # (1018,) data points
#floatsDFAllRate_3Dtimeorder.chl_rate_log10.dropna().shape   # (452,) data points; chl_rate can be negative, where log10 is undefined, so avoid log10 for the rate


check the sum of the chlor_a before the merge -25.318965535610925
check the sum of the chlor_a after the merge -25.318965535610925
Out[74]:
(1018,)
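
As a cross-check on the xarray-based differencing in the cell above, here is a minimal pandas-only sketch. It is an alternative, not the method used in this notebook, and it assumes the table holds one row per (time, id) pair on a regular 3-day grid. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for floatsDFAll_3Dtimeorder: two ids on a 3-day grid
toy = pd.DataFrame({
    'time': pd.to_datetime(['2002-07-04', '2002-07-04',
                            '2002-07-07', '2002-07-07',
                            '2002-07-10', '2002-07-10']),
    'id': [1, 2, 1, 2, 1, 2],
    'chlor_a': [0.10, 0.50, 0.13, np.nan, 0.12, 0.40],
})

# sort so that consecutive rows within an id are consecutive time steps
toy = toy.sort_values(['id', 'time'])

# difference chlor_a along time separately for each float id;
# the first step of each id, and any step touching a NaN, stays NaN
toy['chl_rate'] = toy.groupby('id')['chlor_a'].diff()
print(toy)
```

If the grid had gaps, `groupby(...).diff()` would difference across the gap, whereas the dense (time, id) Dataset would yield NaN there, so the two approaches agree only on a complete grid.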

In [75]:
pd.to_datetime(floatsDFAllRate_3Dtimeorder.time)
type(pd.to_datetime(floatsDFAllRate_3Dtimeorder.time))
ts = pd.Series(0, index=pd.to_datetime(floatsDFAllRate_3Dtimeorder.time) ) # create a target time series for masking purposes

# extract the month of each timestamp
month = ts.index.month 
# month.shape # a check on the shape of the month.
selector = ((11==month) | (12==month) | (1==month) | (2==month) | (3==month) )  
selector
print('shape of the selector', selector.shape)

print('all the data count in [11-01, 03-31]  is', floatsDFAllRate_3Dtimeorder[selector].chl_rate.dropna().shape) # total  (739,)
print('all the data count is', floatsDFAllRate_3Dtimeorder.chl_rate.dropna().shape )   # total (1018,)


shape of the selector (441336,)
all the data count in [11-01, 03-31]  is (739,)
all the data count is (1018,)
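
The November-to-March mask built above can also be written with `isin`. This is a minimal sketch on a hypothetical `toy` frame, assuming a `time` column that pandas can parse as dates.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    'time': ['2002-07-04', '2002-11-04', '2003-01-15', '2003-03-31', '2003-04-01'],
    'chl_rate': [0.1, 0.2, -0.1, 0.05, 0.3],
})

winter_months = [11, 12, 1, 2, 3]

# boolean mask that is True for rows falling in November through March
selector = pd.to_datetime(toy['time']).dt.month.isin(winter_months)

winter = toy[selector]
print(winter)
```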

In [76]:
# histogram of the non-standardized data
axfloat = floatsDFAllRate_3Dtimeorder[selector].chl_rate.dropna().hist(bins=100,range=[-0.3,0.3])
axfloat.set_title('3-Day chl_rate')


Out[76]:
<matplotlib.text.Text at 0x1335064a8>

In [77]:
# standardized series
ts = floatsDFAllRate_3Dtimeorder[selector].chl_rate.dropna()
ts_standardized = (ts - ts.mean())/ts.std()
axts = ts_standardized.hist(bins=100,range=[-0.3,0.3])
axts.set_title('3-Day standardized chl_rate')


Out[77]:
<matplotlib.text.Text at 0x123682be0>
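
A small sketch of what the standardization does, using made-up values rather than the notebook's data. After subtracting the mean and dividing by the sample standard deviation, the series has mean 0 and standard deviation 1, so a histogram restricted to `range=[-0.3, 0.3]` shows only the values within 0.3 standard deviations of the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical rates, including one NaN that dropna() removes
rates = pd.Series([0.017, -0.024, -0.017, 0.034, -0.008, np.nan]).dropna()

# z-score; pandas' std() uses the sample standard deviation (ddof=1)
z = (rates - rates.mean()) / rates.std()

# fraction of standardized values that fall inside [-0.3, 0.3]
frac_inside = ((z >= -0.3) & (z <= 0.3)).mean()
print(z.mean(), z.std(), frac_inside)
```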

In [78]:
# all the data
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2017), axes.flat) :
    tmpyear = floatsDFAllRate_3Dtimeorder[ (floatsDFAllRate_3Dtimeorder.time > str(i))  & (floatsDFAllRate_3Dtimeorder.time < str(i+1)) ] # rows after Jan 1 of year i and before Jan 1 of year i+1
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)   # total is 1016
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r',vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)     
    
# remove the unused 16th axes (only 15 years are plotted)
fig.delaxes(axes.flat[-1])


(47,)
(56,)
(3,)
(39,)
(92,)
(75,)
(123,)
(44,)
(50,)
(18,)
(38,)
(46,)
(227,)
(118,)
(40,)

In [79]:
fig, axes = plt.subplots(nrows=7, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2016), axes.flat) :
    tmpyear = floatsDFAllRate_3Dtimeorder[ (floatsDFAllRate_3Dtimeorder.time >= (str(i)+ '-11-01') )  & (floatsDFAllRate_3Dtimeorder.time <= (str(i+1)+'-03-31') ) ] # season starting in November of year i
    # select only Nov 1 of year i through March 31 of year i+1
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)  # the total is 739
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)


(76,)
(0,)
(5,)
(65,)
(38,)
(108,)
(35,)
(44,)
(3,)
(36,)
(0,)
(160,)
(119,)
(50,)
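
The per-season counts printed above can also be obtained without a loop by labelling each row with the year in which its November-to-March season starts. This is a sketch on hypothetical data; `season_year` is a name introduced here, not one used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    'time': pd.to_datetime(['2002-11-04', '2003-02-10', '2003-03-31',
                            '2003-07-01', '2003-12-01', '2004-01-05']),
    'chl_rate': [0.1, np.nan, 0.2, 0.3, -0.1, 0.05],
})

t = toy['time']
in_season = t.dt.month.isin([11, 12, 1, 2, 3])

# Nov and Dec belong to the season starting that year; Jan to Mar to the previous year
season_year = t.dt.year.where(t.dt.month >= 11, t.dt.year - 1)

# count() skips NaN, matching dropna().shape in the loop above
counts = toy[in_season].groupby(season_year[in_season])['chl_rate'].count()
print(counts)
```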

In [81]:
# write the data to disk as a CSV or HDF file so later experiments can skip the steps above

df_list = []
for i in range(2002,2017) :
    tmpyear = floatsDFAllRate_3Dtimeorder[ (floatsDFAllRate_3Dtimeorder.time >= (str(i)+ '-11-01') )  & (floatsDFAllRate_3Dtimeorder.time <= (str(i+1)+'-03-31') ) ] # season starting in November of year i
    # select only Nov 1 of year i through March 31 of year i+1
    df_list.append(tmpyear)
    
df_tmp = pd.concat(df_list)
print('all the data count in [11-01, 03-31]  is ', df_tmp.chl_rate.dropna().shape) # again, the total is (739,)
df_chl_out_3D_modisa = df_tmp[~df_tmp.chl_rate.isnull()] # only keep the non-nan values
#list(df_chl_out_XD.groupby(['id']))   # can see the continuity pattern of the Lagrangian difference for each float id

# preview the rows that will be written to a CSV or HDF file
df_chl_out_3D_modisa.head()


all the data count in [11-01, 03-31]  is  (739,)
Out[81]:
id time var_tmp lon vn lat var_lon temp spd var_lat ve chlor_a chlor_a_log10 chl_rate chl_rate_log10
10620 10206 2002-11-04 1000.000000 67.315250 6.904000 10.885583 0.001747 NaN 11.224333 0.000579 -6.069667 0.145567 -0.836937 0.017202 -1.764421
10648 34721 2002-11-04 0.001778 67.626250 -0.428083 12.628833 0.000122 29.590750 13.099250 0.000064 6.291000 0.129693 -0.887083 -0.024359 NaN
10879 10206 2002-11-07 1000.000000 67.174083 6.697417 11.064250 0.000558 NaN 10.497583 0.000221 -5.759333 0.129001 -0.889407 -0.016566 NaN
10881 11089 2002-11-07 0.003795 64.770000 1.865000 14.365167 0.000151 28.995083 16.718083 0.000075 -15.957833 0.192121 -0.716425 0.033696 -1.472422
10883 15707 2002-11-07 1000.000000 67.346250 -24.346083 13.640333 0.000132 NaN 29.831500 0.000068 -15.104667 0.158005 -0.801329 -0.008466 NaN

In [82]:
df_chl_out_3D_modisa.index.name = 'index'  # name the index explicitly

# write to CSV with an explicit index label
df_chl_out_3D_modisa.to_csv('df_chl_out_3D_modisa.csv', sep=',', index_label = 'index')

# load CSV output
test = pd.read_csv('df_chl_out_3D_modisa.csv', index_col='index')
test.head()


Out[82]:
id time var_tmp lon vn lat var_lon temp spd var_lat ve chlor_a chlor_a_log10 chl_rate chl_rate_log10
index
10620 10206 2002-11-04 1000.000000 67.315250 6.904000 10.885583 0.001747 NaN 11.224333 0.000579 -6.069667 0.145567 -0.836937 0.017202 -1.764421
10648 34721 2002-11-04 0.001778 67.626250 -0.428083 12.628833 0.000122 29.590750 13.099250 0.000064 6.291000 0.129693 -0.887083 -0.024359 NaN
10879 10206 2002-11-07 1000.000000 67.174083 6.697417 11.064250 0.000558 NaN 10.497583 0.000221 -5.759333 0.129001 -0.889407 -0.016566 NaN
10881 11089 2002-11-07 0.003795 64.770000 1.865000 14.365167 0.000151 28.995083 16.718083 0.000075 -15.957833 0.192121 -0.716425 0.033696 -1.472422
10883 15707 2002-11-07 1000.000000 67.346250 -24.346083 13.640333 0.000132 NaN 29.831500 0.000068 -15.104667 0.158005 -0.801329 -0.008466 NaN
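
One thing to be aware of when reloading: `pd.read_csv` reads the `time` column back as plain strings unless asked to parse it. A minimal sketch of the round trip, using an in-memory buffer and a hypothetical two-row frame instead of the real file:

```python
import io
import pandas as pd

# Hypothetical two-row frame with a named index, mimicking the output table
toy = pd.DataFrame(
    {'id': [10206, 34721],
     'time': pd.to_datetime(['2002-11-04', '2002-11-04']),
     'chl_rate': [0.017202, -0.024359]},
    index=pd.Index([10620, 10648], name='index'),
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, sep=',', index_label='index')
buffer.seek(0)

# parse_dates restores the time column as datetime64 instead of strings
restored = pd.read_csv(buffer, index_col='index', parse_dates=['time'])
print(restored.dtypes)
```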

In [ ]: