15-Day Subsampling on the OceanColor Dataset


In [1]:
import xarray as xr
import numpy as np
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt
from dask.diagnostics import ProgressBar
import seaborn as sns
from matplotlib.colors import LogNorm


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Load data from disk

We already downloaded a subsetted MODIS-Aqua chlorophyll-a dataset for the Arabian Sea.

We can read all the netCDF files into one xarray Dataset using the open_mfdataset function. Note that this does not load the data into memory yet; that only happens when we try to access the values.


In [2]:
ds_8day = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_8D.nc')
ds_daily = xr.open_mfdataset('./data_collector_modisa_chla9km/ModisA_Arabian_Sea_chlor_a_9km_*_D.nc')
both_datasets = [ds_8day, ds_daily]
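
As a quick sanity check (a sketch, not run above), we can confirm that at this point the arrays are still lazy, dask-backed arrays and nothing has been read from disk (assuming dask is installed, which is the default backing for open_mfdataset):

# the underlying data should be a dask array, not a numpy array, until we load it
print(type(ds_daily.chlor_a.data))
print(ds_daily.chlor_a.chunks)  # the chunking dask will use when reading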

How much data is contained here? Let's get the answer in MB.


In [3]:
print([(ds.nbytes / 1e6) for ds in both_datasets])


[534.295504, 4241.4716]

The 8-day dataset is ~534 MB, while the daily dataset is ~4.2 GB. Both easily fit in RAM, so let's load them into memory.


In [4]:
[ds.load() for ds in both_datasets]


Out[4]:
[<xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 667)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...,
 <xarray.Dataset>
 Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 5295)
 Coordinates:
   * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
   * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
   * rgb            (rgb) int64 0 1 2
   * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
   * time           (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
 Data variables:
     palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
     chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...]
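
Since dask's ProgressBar was imported above, the load could also be wrapped to show progress while the ~4.7 GB is read (a sketch, equivalent to the cell above):

# optional: watch dask's progress while loading (ProgressBar imported earlier)
with ProgressBar():
    for ds in both_datasets:
        ds.load()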

Fix bad data

In preparing this demo, I noticed that a small number of maps had bad data--specifically, they contained large negative values of chlorophyll concentration. Looking closer, I realized that the land/cloud mask had been inverted, so I wrote a function to invert it back and correct the data.


In [5]:
def fix_bad_data(ds):
    # for some reason, the cloud / land mask is inverted on some snapshots;
    # this is obvious because they contain chlorophyll values less than zero
    bad_data = ds.chlor_a.groupby('time').min() < 0
    # loop through the bad snapshots and mask out the negative values
    for n in np.nonzero(bad_data.values)[0]:
        data = ds.chlor_a[n].values
        ds.chlor_a.values[n] = np.ma.masked_less(data, 0).filled(np.nan)
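
The same correction can also be written without the explicit loop. Since chlorophyll concentration can never be negative, masking every negative value is equivalent (a sketch, assuming the datasets are already loaded into memory):

def fix_bad_data_vectorized(ds):
    # negative chlorophyll is physically impossible, so masking all
    # negative values reproduces the per-snapshot fix in one step
    ds['chlor_a'] = ds.chlor_a.where(ds.chlor_a >= 0)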

In [15]:
[fix_bad_data(ds) for ds in both_datasets]


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in less
  if not reflexive
Out[15]:
[None, None]

In [16]:
ds_8day.chlor_a>0


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[16]:
<xarray.DataArray 'chlor_a' (time: 667, lat: 276, lon: 360)>
array([[[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ...,  True, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False,  True],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False,  True,  True]],

       ..., 
       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]],

       [[False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        ..., 
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False],
        [False, False, False, ..., False, False, False]]], dtype=bool)
Coordinates:
  * lat      (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 27.37 ...
  * lon      (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 45.63 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-12 2002-07-20 ...

Count the number of ocean data points

First we have to figure out the land mask. Unfortunately it doesn't come with the dataset, but we can infer it by counting all the points that have at least one non-NaN chlorophyll value.


In [17]:
(ds_8day.chlor_a>0).sum(dim='time').plot()


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[17]:
<matplotlib.collections.QuadMesh at 0x118cca358>

In [18]:
# find a mask for the land
ocean_mask = (ds_8day.chlor_a>0).sum(dim='time')>0
#ocean_mask = (ds_daily.chlor_a>0).sum(dim='time')>0
num_ocean_points = ocean_mask.sum().values  # total number of ocean (non-land) points
ocean_mask.plot()
plt.title('%g total ocean points' % num_ocean_points)


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/xarray/core/variable.py:1046: RuntimeWarning: invalid value encountered in greater
  if not reflexive
Out[18]:
<matplotlib.text.Text at 0x13d359dd8>
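
The mask can also be applied directly to the data with where(), which sets land points to NaN (a sketch; not needed below, where we work with counts instead):

# keep only ocean points; land becomes NaN
chl_ocean = ds_8day.chlor_a.where(ocean_mask)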

In [19]:
#ds_8day

In [20]:
#ds_daily

In [21]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time='2002-11-18',method='nearest').plot(norm=LogNorm())
#ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[21]:
<matplotlib.collections.QuadMesh at 0x11b0d01d0>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [22]:
#list(ds_daily.groupby('time')) # take a look at what's inside

Now we count up the number of valid points in each snapshot and divide by the total number of ocean points.


In [23]:
# count the valid (non-NaN) observations per snapshot in the original daily data
ds_daily.groupby('time').count()


Out[23]:
<xarray.Dataset>
Dimensions:  (time: 5295)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...
Data variables:
    palette  (time) int64 768 768 768 768 768 768 768 768 768 768 768 768 ...
    chlor_a  (time) int64 658 1170 1532 2798 2632 1100 1321 636 2711 1163 ...

In [24]:
ds_daily.chlor_a.groupby('time').count()/float(num_ocean_points)


Out[24]:
<xarray.DataArray 'chlor_a' (time: 5295)>
array([ 0.01053255,  0.01872809,  0.02452259, ...,  0.        ,
        0.        ,  0.        ])
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-05 2002-07-06 ...

In [25]:
count_8day, count_daily = [ds.chlor_a.groupby('time').count()/float(num_ocean_points)
                           for ds in (ds_8day, ds_daily)]

In [26]:
#count_8day = ds_8day.chl_ocx.groupby('time').count()/float(num_ocean_points)
#count_daily = ds_daily.chl_ocx.groupby('time').count()/float(num_ocean_points)

#count_8day, count_daily = [ds.chl_ocx.groupby('time').count()/float(num_ocean_points)
#                           for ds in ds_8day, ds_daily] # tuple iteration without parentheses does not work in Python 3

In [27]:
plt.figure(figsize=(12,4))
count_8day.plot(color='k')
count_daily.plot(color='r')

plt.legend(['8 day','daily'])


Out[27]:
<matplotlib.legend.Legend at 0x1298b70b8>

Seasonal Climatology


In [28]:
count_8day_clim, count_daily_clim = [count.groupby('time.month').mean()  # monthly climatology
                                     for count in (count_8day, count_daily)]


In [30]:
# monthly climatology of the valid-data counts
plt.figure(figsize=(12,4))
count_8day_clim.plot(color='k')
count_daily_clim.plot(color='r')
plt.legend(['8 day', 'daily'])


Out[30]:
<matplotlib.legend.Legend at 0x129a89f28>

From the above figure, we see that data coverage is highest in the winter (especially February) and lowest in summer.

Maps of individual days

Let's grab some data from February and plot it.


In [31]:
target_date = '2003-02-15'
plt.figure(figsize=(8,6))
ds_8day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[31]:
<matplotlib.collections.QuadMesh at 0x129a8b4e0>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [32]:
plt.figure(figsize=(8,6))
ds_daily.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[32]:
<matplotlib.collections.QuadMesh at 0x12b12fc50>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [33]:
ds_daily.chlor_a[0].sel_points(lon=[65, 70], lat=[16, 18], method='nearest')   # the time is fixed by the [0] index
#ds_daily.chl_ocx[0].sel_points(time=times, lon=lons, lat=lats, method='nearest')


Out[33]:
<xarray.DataArray 'chlor_a' (points: 2)>
array([ nan,  nan])
Coordinates:
    time     datetime64[ns] 2002-07-04
    lon      (points) float64 65.04 70.04
    lat      (points) float64 16.04 18.04
  * points   (points) int64 0 1

In [34]:
#ds_daily.chlor_a.sel_points?

In [35]:
ds_15day = ds_daily.resample('15D', dim='time')
ds_15day


Out[35]:
<xarray.Dataset>
Dimensions:        (eightbitcolor: 256, lat: 276, lon: 360, rgb: 3, time: 353)
Coordinates:
  * lat            (lat) float64 27.96 27.87 27.79 27.71 27.62 27.54 27.46 ...
  * lon            (lon) float64 45.04 45.13 45.21 45.29 45.38 45.46 45.54 ...
  * rgb            (rgb) int64 0 1 2
  * eightbitcolor  (eightbitcolor) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ...
  * time           (time) datetime64[ns] 2002-07-04 2002-07-19 2002-08-03 ...
Data variables:
    palette        (time, rgb, eightbitcolor) float64 -109.0 0.0 108.0 ...
    chlor_a        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
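
Note that resample('15D', dim='time') is the older xarray resampling API, which defaults to taking the mean within each 15-day bin. In recent xarray versions the same operation would be written as follows (a sketch, assuming a modern xarray):

# equivalent 15-day resampling under the newer xarray API
ds_15day = ds_daily.resample(time='15D').mean()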

In [48]:
plt.figure(figsize=(8,6))
ds_15day.chlor_a.sel(time=target_date, method='nearest').plot(norm=LogNorm())


Out[48]:
<matplotlib.collections.QuadMesh at 0x11b6ae160>
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/matplotlib/colors.py:1022: RuntimeWarning: invalid value encountered in less_equal
  mask |= resdat <= 0

In [47]:
# check the minimum longitude and latitude of the grid
print(ds_15day.lon.min(), '\n', ds_15day.lat.min())


<xarray.DataArray 'lon' ()>
array(45.04166793823242) 
 <xarray.DataArray 'lat' ()>
array(5.041661739349365)

++++++++++++++++++++++++++++++++++++++++++++++

All GDP Floats

Load the float data

Map each (time, lon, lat) float observation to a chlorophyll value


In [38]:
# in the following we deal with the data from the GDP floats
from buyodata import buoydata
import os

In [39]:
# a list of files
fnamesAll = ['./gdp_float/buoydata_1_5000.dat','./gdp_float/buoydata_5001_10000.dat','./gdp_float/buoydata_10001_15000.dat','./gdp_float/buoydata_15001_jun16.dat']

In [40]:
# read them and concatenate them into one DataFrame
dfAll = pd.concat([buoydata.read_buoy_data(f) for f in fnamesAll])  # takes around 4-5 minutes

#mask = df.time>='2002-07-04' # we only have chlor_a data after this date
dfvvAll = dfAll[dfAll.time>='2002-07-04'].copy()  # copy to avoid SettingWithCopyWarning when we edit below

sum(dfvvAll.time<'2002-07-04') # recheck that no earlier times remain


Out[40]:
0

In [41]:
# process the data so that the longitudes are all >= 0
print('before processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(), dfvvAll.lon.max()))
mask = dfvvAll.lon<0
dfvvAll.loc[mask, 'lon'] = dfvvAll.loc[mask].lon + 360
print('after processing, the minimum longitude is %4.3f and maximum is %4.3f' % (dfvvAll.lon.min(), dfvvAll.lon.max()))

dfvvAll.describe()


before processing, the minimum longitude is 0.000 and maximum is 360.000
after processing, the minimum longitude is 0.000 and maximum is 360.000
Out[41]:
id lat lon temp ve vn spd var_lat var_lon var_tmp
count 2.147732e+07 2.131997e+07 2.131997e+07 1.986179e+07 2.129142e+07 2.129142e+07 2.129142e+07 2.147732e+07 2.147732e+07 2.147732e+07
mean 1.765662e+06 -2.263128e+00 2.124412e+02 1.986121e+01 2.454172e-01 4.708192e-01 2.613427e+01 7.326258e+00 7.326555e+00 7.522298e+01
std 9.452835e+06 3.401115e+01 9.746941e+01 8.339498e+00 2.525050e+01 2.052160e+01 1.939087e+01 8.527853e+01 8.527851e+01 2.637454e+02
min 2.578000e+03 -7.764700e+01 0.000000e+00 -1.685000e+01 -2.916220e+02 -2.601400e+02 0.000000e+00 5.268300e-07 -3.941600e-02 1.001300e-03
25% 4.897500e+04 -3.186000e+01 1.490720e+02 1.437300e+01 -1.411400e+01 -1.044700e+01 1.290300e+01 4.366500e-06 7.512600e-06 1.435700e-03
50% 7.141300e+04 -4.920000e+00 2.153940e+02 2.214400e+01 -5.560000e-01 1.970000e-01 2.176700e+01 8.833600e-06 1.495800e-05 1.691700e-03
75% 1.094330e+05 2.756000e+01 3.064370e+02 2.688900e+01 1.356100e+01 1.109300e+01 3.405900e+01 1.833300e-05 3.627900e-05 2.294200e-03
max 6.399288e+07 8.989900e+01 3.600000e+02 4.595000e+01 4.417070e+02 2.783220e+02 4.421750e+02 1.000000e+03 1.000000e+03 1.000000e+03

In [42]:
# select only the Arabian Sea region
arabian_sea = (dfvvAll.lon > 45) & (dfvvAll.lon < 75) & (dfvvAll.lat > 5) & (dfvvAll.lat < 28)
# arabian_sea = {'lon': slice(45,75), 'lat': slice(5,28)} # later use this longitude/latitude box
floatsAll = dfvvAll.loc[arabian_sea]   # directly use the boolean mask
print('dfvvAll.shape is %s, floatsAll.shape is %s' % (dfvvAll.shape, floatsAll.shape))


dfvvAll.shape is (21477317, 11), floatsAll.shape is (111894, 11)

In [43]:
# avoid running this cell repeatedly (plotting all ~21M points is slow)
# visualize the floats over the global region
fig, ax  = plt.subplots(figsize=(12,10))
dfvvAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)

# visualize the floats in the Arabian Sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsAll.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x2450c8390>

In [44]:
# a pandas DataFrame cannot do the resampling properly here:
# we are really indexing on ['time','id'], and pandas.DataFrame.resample cannot handle that
# TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
print()
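
A pure-pandas workaround (a sketch, not used here; assumes a pandas version with GroupBy.resample) would be to group by float id first and resample each trajectory separately:

# hypothetical pandas-only alternative: resample each float's track on its own,
# assuming 'time' is a datetime column in floatsAll
resampled = (floatsAll.set_index('time')
                      .groupby('id')
                      .resample('15D')
                      .mean())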




In [45]:
# dump the surface drifter data from the pandas DataFrame into an xarray Dataset
floatsDSAll = xr.Dataset.from_dataframe(floatsAll.set_index(['time','id']))  # set (time, id) as the index; use reset_index to revert this operation
floatsDSAll


Out[45]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 17499)
Coordinates:
  * time     (time) datetime64[ns] 2002-07-04 2002-07-04T06:00:00 ...
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
Data variables:
    lat      (time, id) float64 nan 16.3 14.03 16.4 14.04 nan 20.11 nan ...
    lon      (time, id) float64 nan 66.23 69.48 64.58 69.51 nan 68.55 nan ...
    temp     (time, id) float64 nan nan nan 28.0 28.53 nan 28.93 nan 27.81 ...
    ve       (time, id) float64 nan 8.68 5.978 6.286 4.844 nan 32.9 nan ...
    vn       (time, id) float64 nan -13.18 -18.05 -7.791 -17.47 nan 15.81 ...
    spd      (time, id) float64 nan 15.78 19.02 10.01 18.13 nan 36.51 nan ...
    var_lat  (time, id) float64 nan 0.0002661 5.01e-05 5.018e-05 5.024e-05 ...
    var_lon  (time, id) float64 nan 0.0006854 8.851e-05 9.018e-05 8.968e-05 ...
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.003733 0.0667 nan 0.001683 ...

In [46]:
# resample the xarray Dataset onto a 15-day frequency
floatsDSAll_15D = floatsDSAll.resample('15D', dim='time')
floatsDSAll_15D


Out[46]:
<xarray.Dataset>
Dimensions:  (id: 259, time: 341)
Coordinates:
  * id       (id) int64 7574 10206 10208 11089 15703 15707 27069 27139 28842 ...
  * time     (time) datetime64[ns] 2002-07-04 2002-07-19 2002-08-03 ...
Data variables:
    lon      (time, id) float64 nan 66.66 70.2 65.25 70.17 nan 70.05 nan ...
    vn       (time, id) float64 nan 0.3577 -5.286 -14.1 -4.513 nan -5.424 ...
    spd      (time, id) float64 nan 7.395 15.01 18.7 13.97 nan 27.87 nan ...
    var_lon  (time, id) float64 nan 0.00561 0.0001185 0.0001289 0.000102 nan ...
    var_tmp  (time, id) float64 nan 1e+03 1e+03 0.003614 0.08862 nan ...
    ve       (time, id) float64 nan 6.14 11.37 9.377 10.19 nan 25.86 nan ...
    temp     (time, id) float64 nan nan nan 27.77 28.59 nan 28.92 nan 27.23 ...
    var_lat  (time, id) float64 nan 0.001451 6.257e-05 6.695e-05 5.535e-05 ...
    lat      (time, id) float64 nan 16.27 13.55 15.66 13.61 nan 19.97 nan ...

In [49]:
# transfer it back to a pandas DataFrame for plotting
floatsDFAll_15D = floatsDSAll_15D.to_dataframe()
floatsDFAll_15D
floatsDFAll_15D = floatsDFAll_15D.reset_index()
floatsDFAll_15D
# visualize the subsampled floats in the Arabian Sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_15D.plot(kind='scatter', x='lon', y='lat', c='temp', cmap='RdBu_r', edgecolor='none', ax=ax)


Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b68eb38>

In [50]:
# get the chlorophyll value for each data entry
floatsDFAll_15Dtimeorder = floatsDFAll_15D.sort_values(['time','id'], ascending=True)
floatsDFAll_15Dtimeorder # check whether it is time ordered
# should we drop NaNs to speed up?


Out[50]:
id time lon vn spd var_lon var_tmp ve temp var_lat lat
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
341 10206 2002-07-04 66.663800 0.357733 7.394967 0.005610 1000.000000 6.140233 NaN 0.001451 16.265717
682 10208 2002-07-04 70.195217 -5.285617 15.006967 0.000118 1000.000000 11.373300 NaN 0.000063 13.549633
1023 11089 2002-07-04 65.248067 -14.097033 18.695917 0.000129 0.003614 9.376883 27.773283 0.000067 15.657150
1364 15703 2002-07-04 70.165200 -4.513033 13.965250 0.000102 0.088623 10.194983 28.590333 0.000055 13.611350
1705 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2046 27069 2002-07-04 70.048350 -5.424417 27.865400 0.000106 0.001731 25.855350 28.916267 0.000057 19.969700
2387 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2728 28842 2002-07-04 60.961600 -9.921900 16.832533 0.000362 0.003382 5.825783 27.226833 0.000149 18.350883
3069 34159 2002-07-04 60.516650 16.559017 36.755683 0.000116 1000.000000 31.603317 NaN 0.000061 13.394633
3410 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3751 34210 2002-07-04 56.749953 -18.675465 26.752744 0.000129 0.003705 -5.144814 26.354721 0.000066 5.882953
4092 34211 2002-07-04 69.070367 -14.960467 27.234933 0.000098 0.003538 19.858683 28.430017 0.000053 7.797533
4433 34212 2002-07-04 66.877317 1.993683 42.610483 0.000102 0.003553 34.703000 28.568833 0.000055 6.519433
4774 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5115 34310 2002-07-04 70.029000 -0.021000 10.808286 0.000103 0.003749 8.074714 28.954857 0.000056 5.023286
5456 34311 2002-07-04 69.980455 -11.504476 13.484762 0.000114 0.003594 -2.896714 28.593818 0.000061 9.730864
5797 34312 2002-07-04 65.167048 -15.195200 17.403650 0.000154 0.003670 7.005500 28.129857 0.000075 9.638095
6138 34314 2002-07-04 54.903200 -5.381333 16.793444 0.000087 0.003738 -6.718778 26.905500 0.000047 5.116600
6479 34315 2002-07-04 59.998824 5.554625 28.943938 0.000094 0.003575 -17.022625 28.303294 0.000052 5.162294
6820 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN
7161 34708 2002-07-04 61.099483 4.269317 24.302933 0.000110 0.001807 22.553517 27.384050 0.000058 10.333167
7502 34709 2002-07-04 52.252040 46.740240 81.157560 0.000162 0.001818 0.462760 26.204160 0.000079 6.396520
7843 34710 2002-07-04 49.966633 -6.248550 46.052550 0.000087 0.001837 -15.991800 31.137717 0.000046 12.618183
8184 34714 2002-07-04 65.507833 1.335217 35.355967 0.000110 0.001815 34.244133 27.870250 0.000059 13.767567
8525 34716 2002-07-04 66.931733 2.107750 34.797783 0.000106 0.001772 25.851967 28.781367 0.000057 7.695783
8866 34718 2002-07-04 73.395800 -39.532167 44.367833 0.000102 0.001691 16.290550 28.827417 0.000055 14.420783
9207 34719 2002-07-04 71.694917 -20.354250 26.786467 0.000112 0.001666 14.472783 28.910950 0.000060 16.656500
9548 34720 2002-07-04 69.771467 -12.078667 20.477317 0.000114 0.001750 11.449783 28.658633 0.000061 13.938883
9889 34721 2002-07-04 65.645233 -11.221400 13.610417 0.000109 0.001740 5.294917 27.876133 0.000058 16.653933
... ... ... ... ... ... ... ... ... ... ... ...
78429 3098682 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
78770 60073460 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
79111 60074440 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
79452 60077450 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
79793 60150420 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
80134 60454500 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
80475 60656200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
80816 60657200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
81157 60658190 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
81498 60659110 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
81839 60659120 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
82180 60659190 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
82521 60659200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
82862 60940960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
83203 60940970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
83544 60941960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
83885 60941970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
84226 60942960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
84567 60942970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
84908 60943960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
85249 60943970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
85590 60944960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
85931 60944970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
86272 60945970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
86613 60946960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
86954 60947960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
87295 60947970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
87636 60948960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
87977 60950430 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN
88318 62321420 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN

88319 rows × 11 columns


In [51]:
floatsDFAll_15Dtimeorder.lon.dropna().shape  # the longitude column has many valid values: (2195,)


Out[51]:
(2195,)

In [52]:
# a little test of the API for looping over the DataFrame
# check df.itertuples? -- it is faster and preserves the data types
'''
chl_ocx=[]
for row in floats_timeorder.itertuples():
    #print(row)
    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )
    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation
    chl_ocx.append(tmp)
floats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)
chl_ocx[0].to_series
'''


Out[52]:
"\nchl_ocx=[]\nfor row in floats_timeorder.itertuples():\n    #print(row)\n    #print('row.time = %s, row.id=%d, row.lon=%4.3f, row.lat=%4.3f' % (row.time,row.id,row.lon,row.lat)  )\n    tmp=ds_2day.chl_ocx.sel_points(time=[row.time],lon=[row.lon], lat=[row.lat], method='nearest') # interpolation\n    chl_ocx.append(tmp)\nfloats_timeorder['chl_ocx'] = pd.Series(chl_ocx, index=floats_timeorder.index)\nchl_ocx[0].to_series\n"

In [53]:
# this one line avoids the explicit loop above
# the 2-D nearest-neighbor selection is still very slow -- it takes about an hour
tmpAll = ds_15day.chlor_a.sel_points(time=list(floatsDFAll_15Dtimeorder.time), lon=list(floatsDFAll_15Dtimeorder.lon), lat=list(floatsDFAll_15Dtimeorder.lat), method='nearest')
print('the count of NaN values in tmpAll is', tmpAll.to_series().isnull().sum())


/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less
  indexer = np.where(op(left_distances, right_distances) |
/Users/vyan2000/local/miniconda3/envs/condapython3/lib/python3.5/site-packages/pandas/indexes/base.py:2352: RuntimeWarning: invalid value encountered in less_equal
  indexer = np.where(op(left_distances, right_distances) |
the count of NaN values in tmpAll is 87023

New Interpolation routine here

  • interpolation for one point
  • interpolation for a list of points
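
A sketch of what such a routine could look like, using plain nearest-neighbor selection against the gridded product (hypothetical helper names; assumes the dataset fits in memory):

def interp_one_point(ds, t, lon, lat):
    # nearest-neighbor lookup of a single (time, lon, lat) sample
    return ds.chlor_a.sel(time=t, lon=lon, lat=lat, method='nearest').item()

def interp_points(ds, times, lons, lats):
    # the same lookup for a list of points
    return np.array([interp_one_point(ds, t, x, y)
                     for t, x, y in zip(times, lons, lats)])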

In [54]:
#print(tmpAll.dropna().shape)
tmpAll.to_series().dropna().shape  # (1296,) valid values


Out[54]:
(1296,)

In [56]:
# tmpAll.to_series() converts the xarray result to a pandas Series
floatsDFAll_15Dtimeorder['chlor_a'] = pd.Series(np.array(tmpAll.to_series()), index=floatsDFAll_15Dtimeorder.index)
print("after editing the dataframe the nan values in 'chlor_a' is", floatsDFAll_15Dtimeorder.chlor_a.isnull().sum() )  # should match the count above

# take a look at the data
floatsDFAll_15Dtimeorder

# visualize the floats in the Arabian Sea region
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_15Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a', cmap='RdBu_r', edgecolor='none', ax=ax)

def scale(x):
    # log10 transform for visualizing chlorophyll
    return np.log10(x)

#print(floatsAll_timeorder['chlor_a'].apply(scale))
floatsDFAll_15Dtimeorder['chlor_a_log10'] = floatsDFAll_15Dtimeorder['chlor_a'].apply(scale)
floatsDFAll_15Dtimeorder
#print("after the transformation the nan values in 'chlor_a_log10' is", floatsAll_timeorder.chlor_a_log10.isnull().sum() )

# visualize the floats in the Arabian Sea region on a log10 scale
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAll_15Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chlor_a_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
#floatsDFAll_15Dtimeorder.chlor_a.dropna().shape  # (1296,)
floatsDFAll_15Dtimeorder.chlor_a_log10.dropna().shape  # (1296,)


after editing the dataframe the nan values in 'chlor_a' is 87023
Out[56]:
(1296,)

In [57]:
# take the time difference of chlor_a; this is easiest to do in xarray,
# so transfer the DataFrame back into an xarray Dataset and take the difference
floatsDFAll_15Dtimeorder


Out[57]:
id time lon vn spd var_lon var_tmp ve temp var_lat lat chlor_a chlor_a_log10
0 7574 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
341 10206 2002-07-04 66.663800 0.357733 7.394967 0.005610 1000.000000 6.140233 NaN 0.001451 16.265717 NaN NaN
682 10208 2002-07-04 70.195217 -5.285617 15.006967 0.000118 1000.000000 11.373300 NaN 0.000063 13.549633 NaN NaN
1023 11089 2002-07-04 65.248067 -14.097033 18.695917 0.000129 0.003614 9.376883 27.773283 0.000067 15.657150 NaN NaN
1364 15703 2002-07-04 70.165200 -4.513033 13.965250 0.000102 0.088623 10.194983 28.590333 0.000055 13.611350 NaN NaN
1705 15707 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2046 27069 2002-07-04 70.048350 -5.424417 27.865400 0.000106 0.001731 25.855350 28.916267 0.000057 19.969700 NaN NaN
2387 27139 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2728 28842 2002-07-04 60.961600 -9.921900 16.832533 0.000362 0.003382 5.825783 27.226833 0.000149 18.350883 NaN NaN
3069 34159 2002-07-04 60.516650 16.559017 36.755683 0.000116 1000.000000 31.603317 NaN 0.000061 13.394633 NaN NaN
3410 34173 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3751 34210 2002-07-04 56.749953 -18.675465 26.752744 0.000129 0.003705 -5.144814 26.354721 0.000066 5.882953 0.367018 -0.435313
4092 34211 2002-07-04 69.070367 -14.960467 27.234933 0.000098 0.003538 19.858683 28.430017 0.000053 7.797533 NaN NaN
4433 34212 2002-07-04 66.877317 1.993683 42.610483 0.000102 0.003553 34.703000 28.568833 0.000055 6.519433 NaN NaN
4774 34223 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5115 34310 2002-07-04 70.029000 -0.021000 10.808286 0.000103 0.003749 8.074714 28.954857 0.000056 5.023286 NaN NaN
5456 34311 2002-07-04 69.980455 -11.504476 13.484762 0.000114 0.003594 -2.896714 28.593818 0.000061 9.730864 NaN NaN
5797 34312 2002-07-04 65.167048 -15.195200 17.403650 0.000154 0.003670 7.005500 28.129857 0.000075 9.638095 NaN NaN
6138 34314 2002-07-04 54.903200 -5.381333 16.793444 0.000087 0.003738 -6.718778 26.905500 0.000047 5.116600 0.305509 -0.514976
6479 34315 2002-07-04 59.998824 5.554625 28.943938 0.000094 0.003575 -17.022625 28.303294 0.000052 5.162294 0.170065 -0.769384
6820 34374 2002-07-04 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7161 34708 2002-07-04 61.099483 4.269317 24.302933 0.000110 0.001807 22.553517 27.384050 0.000058 10.333167 NaN NaN
7502 34709 2002-07-04 52.252040 46.740240 81.157560 0.000162 0.001818 0.462760 26.204160 0.000079 6.396520 0.951922 -0.021399
7843 34710 2002-07-04 49.966633 -6.248550 46.052550 0.000087 0.001837 -15.991800 31.137717 0.000046 12.618183 NaN NaN
8184 34714 2002-07-04 65.507833 1.335217 35.355967 0.000110 0.001815 34.244133 27.870250 0.000059 13.767567 NaN NaN
8525 34716 2002-07-04 66.931733 2.107750 34.797783 0.000106 0.001772 25.851967 28.781367 0.000057 7.695783 0.156574 -0.805280
8866 34718 2002-07-04 73.395800 -39.532167 44.367833 0.000102 0.001691 16.290550 28.827417 0.000055 14.420783 NaN NaN
9207 34719 2002-07-04 71.694917 -20.354250 26.786467 0.000112 0.001666 14.472783 28.910950 0.000060 16.656500 NaN NaN
9548 34720 2002-07-04 69.771467 -12.078667 20.477317 0.000114 0.001750 11.449783 28.658633 0.000061 13.938883 NaN NaN
9889 34721 2002-07-04 65.645233 -11.221400 13.610417 0.000109 0.001740 5.294917 27.876133 0.000058 16.653933 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
78429 3098682 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
78770 60073460 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
79111 60074440 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
79452 60077450 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
79793 60150420 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
80134 60454500 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
80475 60656200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
80816 60657200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
81157 60658190 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
81498 60659110 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
81839 60659120 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
82180 60659190 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
82521 60659200 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
82862 60940960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83203 60940970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83544 60941960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
83885 60941970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
84226 60942960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
84567 60942970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
84908 60943960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85249 60943970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85590 60944960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
85931 60944970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
86272 60945970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
86613 60946960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
86954 60947960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
87295 60947970 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
87636 60948960 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
87977 60950430 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88318 62321420 2016-06-20 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

88319 rows × 13 columns


In [58]:
# unstack() would give a 2-D DataFrame
# reset_index() resets all the index levels as columns

In [61]:
# prepare the data as a Dataset, about to take the diff
tmp = xr.Dataset.from_dataframe(floatsDFAll_15Dtimeorder.set_index(['time','id']))  # set (time, id) as the index; use reset_index to revert this operation
# take the diff of chlor_a along time
chlor_a_rate = tmp.diff(dim='time', n=1).chlor_a.to_series().reset_index()
# give the column a proper name
chlor_a_rate.rename(columns={'chlor_a':'chl_rate'}, inplace=True)
chlor_a_rate


# merge the two DataFrames {floatsDFAll_15Dtimeorder, chlor_a_rate} into one on the keys {id, time}, using a left join
floatsDFAllRate_15Dtimeorder = pd.merge(floatsDFAll_15Dtimeorder, chlor_a_rate, on=['time','id'], how='left')
floatsDFAllRate_15Dtimeorder

# check
print('check the sum of chl_rate before the merge', chlor_a_rate.chl_rate.sum())
print('check the sum of chl_rate after the merge', floatsDFAllRate_15Dtimeorder.chl_rate.sum())


# visualize the chlorophyll rate; it is *better* to visualize at this scale
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_15Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.8, vmax=0.8, edgecolor='none', ax=ax)

# visualize the chlorophyll rate on the log scale
floatsDFAllRate_15Dtimeorder['chl_rate_log10'] = floatsDFAllRate_15Dtimeorder['chl_rate'].apply(scale)
floatsDFAllRate_15Dtimeorder
fig, ax  = plt.subplots(figsize=(12,10))
floatsDFAllRate_15Dtimeorder.plot(kind='scatter', x='lon', y='lat', c='chl_rate_log10', cmap='RdBu_r', edgecolor='none', ax=ax)
floatsDFAllRate_15Dtimeorder.chl_rate.dropna().shape   # (853,) data points
#floatsDFAllRate_15Dtimeorder.chl_rate_log10.dropna().shape   # (345,) data points ... note: chl_rate can be negative, so do not take log10


check the sum of chl_rate before the merge -79.98311130130112
check the sum of chl_rate after the merge -79.98311130130112
Out[61]:
(853,)

In [62]:
pd.to_datetime(floatsDFAllRate_15Dtimeorder.time)
type(pd.to_datetime(floatsDFAllRate_15Dtimeorder.time))
ts = pd.Series(0, index=pd.to_datetime(floatsDFAllRate_15Dtimeorder.time))  # create a target time series for masking purposes

# take the month out
month = ts.index.month
# month.shape # a check on the shape of the month index
selector = ((11==month) | (12==month) | (1==month) | (2==month) | (3==month))
selector
print('shape of the selector', selector.shape)

print('all the data count in [11-01, 03-31]  is', floatsDFAllRate_15Dtimeorder[selector].chl_rate.dropna().shape) # total (500,)
print('all the data count is', floatsDFAllRate_15Dtimeorder.chl_rate.dropna().shape )   # total (853,)


shape of the selector (88319,)
all the data count in [11-01, 03-31]  is (500,)
all the data count is (853,)
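
The same winter selector can be written more compactly with isin (a sketch; produces the identical mask):

# equivalent month mask using Series.dt.month.isin
selector = pd.to_datetime(floatsDFAllRate_15Dtimeorder.time).dt.month.isin([11, 12, 1, 2, 3])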

In [63]:
# histogram of the non-standardized data
axfloat = floatsDFAllRate_15Dtimeorder[selector].chl_rate.dropna().hist(bins=100, range=[-0.3,0.3])
axfloat.set_title('15-Day chl_rate')


Out[63]:
<matplotlib.text.Text at 0x12a00e6d8>

In [64]:
# standardized series
ts = floatsDFAllRate_15Dtimeorder[selector].chl_rate.dropna()
ts_standardized = (ts - ts.mean())/ts.std()
axts = ts_standardized.hist(bins=100, range=[-0.3,0.3])
axts.set_title('15-Day standardized chl_rate')


Out[64]:
<matplotlib.text.Text at 0x124909208>

In [65]:
# all the data, year by year
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2017), axes.flat):
    tmpyear = floatsDFAllRate_15Dtimeorder[ (floatsDFAllRate_15Dtimeorder.time > str(i))  & (floatsDFAllRate_15Dtimeorder.time < str(i+1)) ] # select year i
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)   # the total is 853
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)

# remove the extra (16th) subplot
ax = plt.subplot(8,2,16)
fig.delaxes(ax)


(37,)
(32,)
(10,)
(36,)
(79,)
(76,)
(126,)
(42,)
(57,)
(30,)
(25,)
(20,)
(168,)
(80,)
(35,)

In [66]:
fig, axes = plt.subplots(nrows=7, ncols=2, figsize=(12, 10))
fig.subplots_adjust(hspace=0.05, wspace=0.05)

for i, ax in zip(range(2002,2016), axes.flat):
    tmpyear = floatsDFAllRate_15Dtimeorder[ (floatsDFAllRate_15Dtimeorder.time >= (str(i)+ '-11-01') )  & (floatsDFAllRate_15Dtimeorder.time <= (str(i+1)+'-03-31') ) ] # winter season starting in year i
    # select only particular months, Nov 1 to March 31
    #fig, ax  = plt.subplots(figsize=(12,10))
    print(tmpyear.chl_rate.dropna().shape)  # the total is 500
    tmpyear.plot(kind='scatter', x='lon', y='lat', c='chl_rate', cmap='RdBu_r', vmin=-0.6, vmax=0.6, edgecolor='none', ax=ax)
    ax.set_title('year %g' % i)


(43,)
(1,)
(13,)
(46,)
(34,)
(94,)
(19,)
(43,)
(3,)
(30,)
(0,)
(73,)
(67,)
(34,)

In [67]:
# let's write the data out as a CSV or HDF file to save experiment time later

df_list = []
for i in range(2002,2017):
    tmpyear = floatsDFAllRate_15Dtimeorder[ (floatsDFAllRate_15Dtimeorder.time >= (str(i)+ '-11-01') )  & (floatsDFAllRate_15Dtimeorder.time <= (str(i+1)+'-03-31') ) ] # winter season starting in year i
    # select only particular months, Nov 1 to March 31
    df_list.append(tmpyear)

df_tmp = pd.concat(df_list)
print('all the data count in [11-01, 03-31]  is ', df_tmp.chl_rate.dropna().shape) # again, the total is (500,)
df_chl_out_15D_modisa = df_tmp[~df_tmp.chl_rate.isnull()] # only keep the non-NaN values
#list(df_chl_out_15D_modisa.groupby(['id']))   # shows the continuity pattern of the Lagrangian difference for each float id

# output to a csv or hdf file
df_chl_out_15D_modisa.head()


all the data count in [11-01, 03-31]  is  (500,)
Out[67]:
id time lon vn spd var_lon var_tmp ve temp var_lat lat chlor_a chlor_a_log10 chl_rate chl_rate_log10
2073 10206 2002-11-01 67.239550 4.393483 7.707300 0.001396 1000.000000 -4.056967 NaN 0.000461 11.003467 0.131253 -0.881891 -0.000230 NaN
2077 15707 2002-11-01 67.516567 -13.316117 24.136500 0.000146 1000.000000 -15.863733 NaN 0.000074 13.536150 0.158023 -0.801279 0.002436 -2.613278
2095 34710 2002-11-01 63.079750 16.525517 17.614183 0.000127 0.001804 -1.195583 28.666317 0.000066 17.411183 0.392572 -0.406080 -0.167135 NaN
2101 34721 2002-11-01 67.881250 6.486583 14.737233 0.000121 0.001813 6.646367 29.393383 0.000063 12.702833 0.152538 -0.816623 0.010258 -1.988946
2332 10206 2002-11-16 66.993317 1.400700 4.379483 0.003644 1000.000000 -2.148200 NaN 0.000991 11.224017 0.140250 -0.853096 0.008997 -2.045886

In [68]:
df_chl_out_15D_modisa.index.name = 'index'  # give the index an explicit name

# write CSV with the specific index label
df_chl_out_15D_modisa.to_csv('df_chl_out_15D_modisa.csv', sep=',', index_label = 'index')

# load CSV output
test = pd.read_csv('df_chl_out_15D_modisa.csv', index_col='index')
test.head()


Out[68]:
id time lon vn spd var_lon var_tmp ve temp var_lat lat chlor_a chlor_a_log10 chl_rate chl_rate_log10
index
2073 10206 2002-11-01 67.239550 4.393483 7.707300 0.001396 1000.000000 -4.056967 NaN 0.000461 11.003467 0.131253 -0.881891 -0.000230 NaN
2077 15707 2002-11-01 67.516567 -13.316117 24.136500 0.000146 1000.000000 -15.863733 NaN 0.000074 13.536150 0.158023 -0.801279 0.002436 -2.613278
2095 34710 2002-11-01 63.079750 16.525517 17.614183 0.000127 0.001804 -1.195583 28.666317 0.000066 17.411183 0.392572 -0.406080 -0.167135 NaN
2101 34721 2002-11-01 67.881250 6.486583 14.737233 0.000121 0.001813 6.646367 29.393383 0.000063 12.702833 0.152538 -0.816623 0.010258 -1.988946
2332 10206 2002-11-16 66.993317 1.400700 4.379483 0.003644 1000.000000 -2.148200 NaN 0.000991 11.224017 0.140250 -0.853096 0.008997 -2.045886
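
Since the comment above also mentions HDF, the same frame could be written to HDF5 instead, which round-trips dtypes more faithfully than CSV (a sketch, assuming PyTables is installed):

# HDF5 alternative to the CSV output above
df_chl_out_15D_modisa.to_hdf('df_chl_out_15D_modisa.h5', key='chl_rate', mode='w')
test_h5 = pd.read_hdf('df_chl_out_15D_modisa.h5', key='chl_rate')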
