Data Analysis Using FFT (Brainstorm version)

A data analysis scheme no one is interested in.

In [2]:
#For author
%qtconsole

In [3]:
#Reader starts from here
#pylab means lots of functions can be used without loading modules explicitly.
%pylab inline
import pandas as pd


Populating the interactive namespace from numpy and matplotlib

Section 0: Introduction (Interactive version)

  • When we do an experiment and generate data, our raw data is mostly a time-dependent signal: $\vec{I}(t)$
  • What we would like to know from the data is mainly:
    • What happened? What is it? (How do we describe a certain event; I = ?)
    • When did it happen? (And t = ?)
  • Ideally, with perfect measurement methods, we could extract the above information even with the naked eye. However, we all know that most of the time this is not possible.
  • First, let's do a basic review to see how difficult it is.

0.1 Time vs. Frequency Representation

  • For any time domain signal $\vec{I}(t)$, to transform it into the frequency domain one can simply apply the function fft to it, without any idea of what that means:
$\vec{F}(\omega) = \mathrm{fft}(\vec{I}(t))$
  • Of course, fft is reversible (ifft stands for inverse fft):
$\vec{I}(t) = \mathrm{ifft}(\vec{F}(\omega))$
  • These two representations describe the same data, just as *Prof. Dr. Jörg Langowski* and *Chief of B040 at DKFZ* refer to the same person.
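  • A quick numerical check of this reversibility (a minimal sketch, assuming the %pylab inline namespace loaded above):

In [ ]:
#round trip: ifft(fft(x)) recovers x up to floating point error
x = randn(64)
x_back = fft.ifft(fft.fft(x))
print "Round trip error: ", abs(x - x_back).max()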

0.2 Gaussian wave packet: the simplest representation of a single event

  • A normalized Gaussian distribution should be the simplest way to describe an individual event.
  • The general form of a Gaussian is:
    $I(t) = Ae^{-{ (t-\tau)^2 \over 2\sigma^2}} + D$
  • A Gaussian-shaped event can then be described by:
    • What does it look like?
      • Height: A
      • Width: $\sigma$
    • When did it happen?
      • Time: $\tau$
    • What else?
      • Offset: D
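  • Note: the code cells below write the exponent as t**2/W; comparing with the general form gives $W = 2\sigma^2$ (a minimal sketch, with $\tau = 0$ and D = 0 as in those cells):

In [ ]:
#the cells below use exp(-t**2/W); with W = 2*sigma**2 this matches the general form (D = 0)
sigma = 6.0
W = 2*sigma**2
t = arange(-64, 64)
print abs(exp(-t**2/(2*sigma**2)) - exp(-t**2/W)).max()   #exactly 0.0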

0.2.1 Immobilized Gaussian Packet

  • For the simplest case, let's look at the frequency spectrum of a fixed Gaussian peak.
  • We plot the time domain representation first, followed by the absolute amplitude of the frequency representation.
  • Change W to see what happens. (For the non-interactive version, see below.)
  • Question: what do we see?
    • (The wider in the time domain, the sharper in the frequency domain.)

In [270]:
# plot wave pack and fft wave pack
# change W and rerun
N = 128       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required
t = arange(-N/2,N/2)
data = A*exp(-t**2/W)

figure(1)
subplot(2,1,1)
plot(t, data)
legend(["Time Domain"])
subplot(2,1,2)
plot(t, fftshift(abs(fft.fft(data))),'r')
legend(["Fourier Domain"])


Out[270]:
<matplotlib.legend.Legend at 0x7f071dade7d0>

In [269]:
# For non-interactive version view:
# Amplitude is normalized
N = 128
A = 1.0
t = arange(-N/2, N/2)
W1 = 3.0
W2 = 90.0
W3 = 1000.0
data1 = A*exp(-t**2/W1)
data2 = A*exp(-t**2/W2)
data3 = A*exp(-t**2/W3)

fft_data1 = fftshift(abs(fft.fft(data1)))
fft_data2 = fftshift(abs(fft.fft(data2)))
fft_data3 = fftshift(abs(fft.fft(data3)))

figure(1)
plot(t,data1,t,data2, t,data3)
legend(["W1", "W2", "W3"])
title("Time Domain")

figure(2)
plot(t, fft_data1/max(fft_data1), t, fft_data2/max(fft_data2), t, fft_data3/max(fft_data3))
legend(["W1", "W2", "W3"])
title("Frequency Domain")


Out[269]:
<matplotlib.text.Text at 0x7f071d7a9390>

0.2.2 Non-immobilized Gaussian peak:

  • Let's look at a non-immobilized Gaussian peak:
  • What do we see?
    • **Although the position shifts, the shape of the curve in the frequency domain does not change. This is the shift theorem: a time shift only multiplies $F(\omega)$ by the phase factor $e^{-i\omega\tau}$, which the absolute value discards.**

In [268]:
# plot wave pack and fft wave pack
# change tau and rerun
N = 128       #data length
A = 1          #amplitude
tau = -40
W = 72.0       #Width, W > 0, and .0 is required
t = arange(-N/2,N/2)
data = A*exp(-(t-tau)**2/W)

figure(1)
subplot(2,1,1)
plot(t, data)
legend(["Time Domain"])
subplot(2,1,2)
plot(t, fftshift(abs(fft.fft(data))),'r')
legend(["Frequency Domain"])


Out[268]:
<matplotlib.legend.Legend at 0x7f071d6a5410>

In [267]:
# For non-interactive version view:
# Amplitude is normalized
N = 128
A = 1.0
t = arange(-N/2, N/2)
W = 36.0
tau1 = 40
tau2 = -40
tau3 = 0
data1 = A*exp(-(t-tau1)**2/W)
data2 = A*exp(-(t-tau2)**2/W)
data3 = A*exp(-(t-tau3)**2/W)

fft_data1 = fftshift(abs(fft.fft(data1)))
fft_data2 = fftshift(abs(fft.fft(data2)))
fft_data3 = fftshift(abs(fft.fft(data3)))

figure(1)
subplot(2,1,1)
plot(t,data1,t,data2, t,data3)
legend(["W1", "W2", "W3"])
title("Time Domain")

subplot(2,1,2)
plot(t, fft_data1/max(fft_data1), t, fft_data2/max(fft_data2), t, fft_data3/max(fft_data3))
legend(["W1", "W2", "W3"])
title("Frequency Domain")


Out[267]:
<matplotlib.text.Text at 0x7f071d00e2d0>
  • Notice that in the frequency domain we plot the absolute value. In this way we lose the information known as "phase".
  • Phase is important, especially in control system design, yet it may not be important for data analysis.
  • And two extremely different signals can share the same amplitude spectrum, differing only in their phase.
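  • To make the loss concrete, a minimal sketch (same %pylab namespace): two shifted Gaussians share the same amplitude spectrum, while their phase spectra differ.

In [ ]:
#two shifted Gaussians: identical amplitude spectra, different phase spectra
t = arange(-64, 64)
g1 = exp(-t**2/36.0)
g2 = exp(-(t - 20)**2/36.0)
F1 = fft.fft(g1)
F2 = fft.fft(g2)
print "Max amplitude difference: ", abs(abs(F1) - abs(F2)).max()      #numerically ~0
print "Max phase difference:     ", abs(angle(F1) - angle(F2)).max()  #of order pi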

0.2.3 Noise:

  • In the real world, we never get data as clean as a Gaussian peak.
  • Thanks to noise, the data we capture normally just looks like garbage.
0.2.3.1 Basic properties of noise
  • Noise normally means there is no interesting information inside, or the information is "evenly distributed"
  • Scientifically, we can model it as having a constant power spectral density.
  • The typical example is white noise: white is the color representing the whole spectrum of sunlight, (somehow) evenly distributed over all photon frequencies.

In [244]:
#plot a white noise, with mean around zero, std around 1
noise = randn(1000)
figure(1)
subplot(2,1,1)
plot(arange(-500,500),noise)
title("time domain")

subplot(2,1,2)
plot(arange(-500,500),fftshift(abs(fft.fft(noise))))
title("frequency domain")
print "Mean of this noise is: ", noise.mean(), " and std is: ", noise.std()


Mean of this noise is:  0.0295518766968  and std is:  1.00858519695
0.2.3.2 Cumulative sum plot
  • The absolute value of noise (basically any real data series) in the frequency domain is symmetric around the point 0.
  • So we can make the cumulative sum plot in the frequency domain using only one half (say, the right part)
  • Let us do it for white noise
  • For any time series $I(t)$ on the range $[0,T]$, the cumulative sum function is defined as:
    $S(t) = \sum\limits_{t'=0}^{t} I(t'), \quad t \le T$
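  • In code, this definition is simply numpy's cumsum (a trivial sketch; the next cell applies it to a noise spectrum):

In [ ]:
#S(t) as defined above is just the running sum
I = array([1.0, 2.0, 3.0, 4.0])
print I.cumsum()   #[  1.   3.   6.  10.]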

In [250]:
# Normalized cumulative sum plot of white noise
# a nearly straight line means the spectral power (information) is evenly distributed
noise = randn(128)
fft_noise = abs(fft.fft(noise))
fft_noise = fft_noise[arange(0,len(fft_noise)/2)]

figure(1)

#plot the spectrum of the noise (positive-frequency half only, so no fftshift is needed)
subplot(2,1,1)
plot(arange(0,len(fft_noise)), fft_noise)
title("Spectrum of noise")

#plot the cumulative sum
subplot(2,1,2)
plot(fft_noise.cumsum()/max(fft_noise.cumsum()))
title("normalized cumulative plot")

#print basic information of this noise, in time domain of course.
print "The mean value for this noise is: ", noise.mean(), " and the std is: ", noise.std()


The mean value for this noise is:  -0.0258925691107  and the std is:  0.955812097861
  • For noise with a certain offset, the mean value does not equal zero.
  • What will we see?
    • **There is a peak at the 0 point in the first plot, and the cumulative plot does not start from zero**

In [251]:
# Normalized cumulative sum plot of white noise with an offset mean_value
# a nearly straight line means the spectral power (information) is evenly distributed
mean_value = 1
std_value = 1
noise = mean_value + std_value*randn(128)
fft_noise = abs(fft.fft(noise))
fft_noise = fft_noise[arange(0,len(fft_noise)/2)]

figure(1)

#plot the spectrum of the noise (positive-frequency half only, so no fftshift is needed)
subplot(2,1,1)
plot(arange(0,len(fft_noise)), fft_noise)
title("Spectrum of noise")

#plot the cumulative sum
subplot(2,1,2)
plot(fft_noise.cumsum()/max(fft_noise.cumsum()))
title("normalized cumulative plot")

#print basic information of this noise, in time domain of course.
print "The mean value for this noise is: ", noise.mean(), " and the std is: ", noise.std()


The mean value for this noise is:  1.02970800637  and the std is:  1.18670676915
  • Before we move on, let us make the same plots for the Gaussian packet:
    • the cumulative plot increases to 1 within just the first several points, which means the information is highly localized

In [266]:
# plot wave pack and fft wave pack
# and cumulative plot
N = 128       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required
t = arange(-N/2,N/2)
data = A*exp(-t**2/W)
fft_data = abs(fft.fft(data))
fft_data = fft_data[arange(0,len(fft_data)/2)]

figure(1)
subplot(3,1,1)
plot(t, data)
legend(["Time Domain"])
subplot(3,1,2)
plot(t, fftshift(abs(fft.fft(data))),'r')
legend(["Fourier Domain"])
subplot(3,1,3)
plot(fft_data.cumsum()/max(fft_data.cumsum()))
legend(["cum sum"])


Out[266]:
<matplotlib.legend.Legend at 0x7f071cf70b10>

0.2.4 Gaussian packet with noise

  • Using the above concepts, we can add noise to a Gaussian packet to (somehow) simulate real, dirty experimental data
0.2.4.1 Immobilized Gaussian packet plus noise
  • When the environment is dirty enough, we can hardly even distinguish a Gaussian packet.
  • Let us see what happens with one example.

In [287]:
#A Gaussian plus noise 
N = 160       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required

noise_rate = 1 #noise rate, the magic part

#noise
noise_data = noise_rate * randn(N)

#time range
t = arange(-N/2,N/2)

#generate clean data
clean_data = A*exp(-t**2/W)

#add noise to generate dirty data
dirty_data = clean_data + noise_data

#plot the clean and dirty data
figure(1)
subplot(2,2,1)
plot(clean_data)
legend(["clean data"])

subplot(2,2,2)
plot(dirty_data)
legend(["dirty data"])

subplot(2,2,3)
plot(fftshift(abs(fft.fft(clean_data))))
legend(["fft of clean data"])

subplot(2,2,4)
plot(fftshift(abs(fft.fft(dirty_data))))
legend(["fft of dirty data"])


Out[287]:
<matplotlib.legend.Legend at 0x7f07142c3290>
  • What can we do when the data gets dirty?
  • Of course, we can sample it several times and add the samples up. Since noise accumulates more slowly than the real signal, we expect to reach an acceptable signal-to-noise ratio.

In [7]:
#A Gaussian plus noise 
N = 160       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required

noise_rate = 1 #noise rate, the magic part

#Let's sample it for several times:
sample_time = 30

accumulated_dirty_data = zeros(N)
for i in arange(sample_time):
    noise_data = noise_rate * randn(N)
    t = arange(-N/2,N/2)
    
    #clean data
    clean_data = A*exp(-t**2/W)
    
    #add noise to generate dirty data
    dirty_data = clean_data + noise_data
    
    #accumulate data in time domain
    accumulated_dirty_data = accumulated_dirty_data + dirty_data

average_dirty_data = accumulated_dirty_data/sample_time

#plot the clean and dirty data
figure(1)
subplot(2,2,1)
plot(clean_data)
legend(["clean data"])

subplot(2,2,2)
plot(average_dirty_data)
legend([str(sample_time) + " averaged dirty"])

subplot(2,2,3)
plot(clean_data)
legend(["clean data"])

subplot(2,2,4)
plot(dirty_data)
legend(["dirty data"])


Out[7]:
<matplotlib.legend.Legend at 0x5784490>
0.2.4.2 Non-immobilized Gaussian packet plus noise
  • Sampling the data multiple times and averaging in the time domain seems to work fine.
  • But what if the object is NOT immobilized?
  • Let's try it with a Gaussian peak that moves along a sinusoid

In [308]:
#A Gaussian plus noise 
N = 160       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required

noise_rate = 1 #noise rate, the magic part

#Let's sample it for several times, this time do a time shift:
sample_time = 30

accumulated_dirty_data = zeros(N)

for i in arange(sample_time):
    #time shift:
    tau = 50*sin(i)
    noise_data = noise_rate * randn(N)
    t = arange(-N/2,N/2)
    
    #clean data
    clean_data = A*exp(-(t-tau)**2/W)
    
    #add noise to generate dirty data
    dirty_data = clean_data + noise_data
    
    #and accumulate them
    accumulated_dirty_data = accumulated_dirty_data + dirty_data
    
    figure(1)
    #plot clean data
    subplot(3,1,1)
    plot(clean_data)
    legend(["clean data"])
    
    subplot(3,1,2)
    plot(dirty_data)
    legend(["dirty data"])
    
    subplot(3,1,3)
    plot(accumulated_dirty_data)
    legend(["Dirty accumulate data"])
    #pause(0.3)
    
average_dirty_data = accumulated_dirty_data/sample_time


  • In this typical case, we cannot easily get a sufficient signal-to-noise ratio in the time domain.
  • Let us try it in the frequency domain:

In [330]:
#A Gaussian plus noise 
N = 160       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required

noise_rate = 1 #noise rate, the magic part

#Let's sample it for several times, this time do a time shift:
sample_time = 30

accumulated_dirty_data = zeros(N)
accumulated_dirty_data_fft = zeros(N, dtype=complex)   #fft results are complex
for i in arange(sample_time):
    #time shift:
    tau = 50*sin(i)
    
    #noise
    noise_data = noise_rate * randn(N)
    t = arange(-N/2,N/2)
    
    #clean data
    clean_data = A*exp(-(t-tau)**2/W)
    
    #add noise to generate dirty data
    dirty_data = clean_data + noise_data
    
    #accumulate dirty data
    accumulated_dirty_data = accumulated_dirty_data + dirty_data
    
    #fft the dirty data and add it to accumulated_dirty_data_fft
    accumulated_dirty_data_fft = accumulated_dirty_data_fft + fft.fft(dirty_data)
    
#after the loop, take the average
average_dirty_data = accumulated_dirty_data/sample_time
average_dirty_data_fft = accumulated_dirty_data_fft/sample_time

figure(1)
#plot the averaged dirty data
subplot(2,1,1)
plot(average_dirty_data)
legend(["averaged dirty data"])

#plot the averaged fft
subplot(2,1,2)
plot(fftshift(abs(average_dirty_data_fft)))
legend(["averaged dirty data fft"])


Out[330]:
<matplotlib.legend.Legend at 0x7f071037ee10>
  • In the averaged fft plot, we seem to see something.
  • How about trying to invert that?

In [318]:
#inverse average fft data and plot it
plot(abs(ifft(average_dirty_data_fft)))


Out[318]:
[<matplotlib.lines.Line2D at 0x7f070fbd7990>]
  • Nothing significant!
  • Why? Did we lose?
  • Remember, ideally we can extract from the data the information of:
    • What happened?
    • When did it happen?
  • In the frequency domain we clearly know "what happened", yet we somehow lose the information about "when it happened". (Which is not really true, but never mind.)
  • So, in the frequency domain, once we have the information about "what happened", we can set filters on that range to extract the information we need.

In [338]:
#Let us plot the averaged fft data again:
plot(t, fftshift(abs(average_dirty_data_fft)),'bx-')


Out[338]:
[<matplotlib.lines.Line2D at 0x7f0711151550>]
  • From the above plot, we can see that our frequency information is strongly localized in the range $0 \pm 3$ (the center means low frequency)
  • Then we can set a filter on that range ON EVERY DIRTY DATA SET (known as a low-pass filter)
  • Let us see what we get.

In [365]:
#A Gaussian plus noise 
N = 160       #data length
A = 1          #amplitude
W = 72.0       #Width, W > 0, and .0 is required

noise_rate = 1 #noise rate, the magic part

#Let's sample it for several times, this time do a time shift:
sample_time = 30

accumulated_dirty_data = zeros(N)

for i in arange(sample_time):
    #time shift:
    tau = 50*sin(i)
    noise_data = noise_rate * randn(N)
    t = arange(-N/2,N/2)
    
    #clean data
    clean_data = A*exp(-(t-tau)**2/W)
    
    #add noise to generate dirty data
    dirty_data = clean_data + noise_data
    
    #and accumulate them
    accumulated_dirty_data = accumulated_dirty_data + dirty_data
    
    #filter:
    points = 3  #set points around 0
    low_filter = concatenate((ones(points+1),zeros(N-2*points-1),ones(points)), axis=0)
    dirty_data_fft = fft.fft(dirty_data)
    filtered_fft = dirty_data_fft * low_filter
    filtered_data = ifft(filtered_fft)
    
figure(1)
#plot clean data
subplot(4,1,1)
plot(clean_data)
legend(["clean data"])

subplot(4,1,2)
plot(dirty_data)
legend(["dirty data"])
    
subplot(4,1,3)
plot(abs(filtered_data))
legend(["filtered data (last window)"])
    
subplot(4,1,4)
plot(accumulated_dirty_data)
legend(["Dirty accumulated data"])
    


Out[365]:
<matplotlib.legend.Legend at 0x7f0705836a50>
  • From this we see that we can somehow filter out the signal, even when it is non-immobilized and dirty.

0.3 Conclusion and Discussion:

  • The time domain representation is comfortable for humans, while the frequency domain representation is good for designing control systems.
  • By studying the properties of the data in the frequency domain, we can design specific filters and extract the information we need.

0.4 What next?

  • So far we have only seen examples with data that we made up.
  • How about the real data that we work with daily?
  • The data that we mainly work with in B040 can be categorized according to the following two questions:
    • Is $I(t + \tau)$ distinguishable from $I(t)$?
    • Is $I(-t)$ distinguishable from $I(t)$?

Section 1: FCS

  • FCS works mainly with global properties of the data.
  • The raw data used to calculate FCS normally looks like the following:
    • $I(t + \tau)$ is NOT distinguishable from $I(t)$
    • $I(-t)$ is NOT distinguishable from $I(t)$

1.1 Introduction:

  • Inside B040, everyone knows how to calculate the auto-correlation function
  • The basic definition of the auto-correlation function is:
    $G(\tau) = \frac{\langle I(t)\cdot I(t+\tau)\rangle}{\langle I(t)\rangle ^2}$
  • For the real calculation, we use:
    $g(\tau) = \frac{\langle \delta I(t)\cdot \delta I(t+\tau)\rangle}{\langle I(t)\rangle ^2}$
  • The relation between $G(\tau)$ and $g(\tau)$ is:
    $G(\tau) = g(\tau) + 1$
  • For real measured raw data directly from the instrument, $I(t) \ge 0$
  • Which means:
    $g(\tau) \in [-1,1]$
  • A dirty proof:
    A: $g(\infty) = 0$
B: $\max(|g(\tau)|) = g(0) = \frac{\overline{\delta I(t)^2}}{\overline {I(t)}^2} \le \frac{\max(\delta I(t)^2)}{\overline{I(t)}^2}$
  • Since $I(t) > 0$ for ANY t:
$\overline {I(t)} - \max(\delta I(t)) \ge \min(I(t)) \ge 0$
  • So that:
$\max(\delta I(t)) \le \overline {I(t)}$
  • Of course, $\overline {I(t)} > 0$, so:
$\frac{\max(\delta I(t))}{\overline{I(t)}} \le 1$
  • Then:
$g(0) \le \frac{\max(\delta I(t)^2)}{\overline{I(t)}^2} \le 1$
  • QED:
$g(\tau) \in [-1,1]$
  • Although we are not interested in $g(\tau) < 0$, $g(\tau) \le 1$ should be a crucial boundary.

  • To be confirmed by an expert.

  • We will use this conclusion in the following example.
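  • Before that, a quick numerical spot check of the lower bound (a minimal sketch using the Wiener-Khinchin identity introduced in section 1.2 below; the upper bound is the part still awaiting the expert):

In [ ]:
#numerical spot check: for I(t) >= 0, G(tau) >= 0 and hence g(tau) >= -1
I = abs(randn(1024))
dI = I - I.mean()
F = fft.fft(dI)
g = fft.ifft(F*F.conj()).real / len(I) / I.mean()**2   #g(tau) on the circular lag grid
print "min g(tau) = ", g.min(), " (should be >= -1)"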

1.2 Algorithm:

  • In real world I(t), we sample it as a discrete signal sample.
  • For I(t) with n sample points, the brute force calculation has the efficiency of $O(n^2)$
  • In B040, we generally use algorithm known as multiple $\tau$:
    • By using brute force calculation for low $\tau$ values.
    • Then progressively binning the I(t) data with logarithmic density, to compute higher $\tau$ vaules(or time lag).
    • result with $O(n \log(n))$ efficiency.
  • Alternatively, according to Wiener-Khinchin theorem, we can compute the altocorrelation curve form I(t) with two FFT:
    • Step1: $F_R (f) = fft(I(t))$ (calculate fft for I(t), inversible)
    • Step2: $S(f) = F_R (f) \cdot F_R^*(f)$ (times fft with its complex conjugate, to calculate power spectrom, Non-inversible)
    • Step3: $G(\tau) = ifft(S(f))$ (Calculate auto-correlation function by ifft S(f), inversible)
  • Compare to multiple $\tau$ algorithm, fft based algorithm:
    • Has the same time efficiency. ($O(n\log(n))$)
    • Use more memeroy.
    • Time lag is everagely distributed, no information binned
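  • A toy sketch of the multiple $\tau$ idea (illustrative only, not the production correlator; it assumes that averaging neighbouring points preserves the mean, and real implementations handle per-level normalization more carefully):

In [ ]:
#Toy multiple-tau correlator: brute force g(tau) for small lags,
#then bin the trace by 2 and repeat with doubled lag spacing
def acf_multi_tau(I, m=8, levels=4):
    I = asarray(I, dtype=float)
    mean_sq = I.mean()**2              #binning by averaging preserves the mean
    taus, gs, dt = [], [], 1
    for level in range(levels):
        for tau in range(1 if level == 0 else m//2, m):
            prod = (I[:len(I)-tau] * I[tau:]).mean()
            taus.append(tau*dt)
            gs.append(prod/mean_sq - 1)
        #bin neighbouring points: halve the sampling rate, double the lag unit
        n = len(I)//2*2
        I = 0.5*(I[0:n:2] + I[1:n:2])
        dt = dt*2
    return array(taus), array(gs)

#usage: taus, gs = acf_multi_tau(abs(randn(4096)))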
  • An example of the fft based algorithm is shown below:

In [27]:
#Calculate the auto-correlation function based on fft
#only supports 1D
#a cross-correlation version is under development
#input data should be a numpy.ndarray (1d) or pandas.core.series.Series
def acf_fft(data):
    """ Series or 1darray ---> Series 
        Calculate the auto correlation function of input data
    """
    data_mean = data.mean()
    
    #Step 1: calculate fft for data
    fft_data = fft.fft(data)
    
    #Step 2: Calculate the power spectrum
    S_f = fft_data * fft_data.conj()
                           
    #Step 3: calculate the auto correlation function, result is g(\tau)
    #ifft(S_f) gives the circular sum of I(t)I(t+tau); fft_data[0] = N*mean normalizes it to G(tau)
    acf = ifft(S_f)/fft_data[0]/data_mean - 1
    
    #Clean up, return half of data
    acf = acf.real[0:int(len(acf)/2)]
    return acf
  • Then we show a test case:
  • Use two Gaussian packets as input data.
  • Then plot the raw data, the absolute fft, and the acf.
  • The extra peak in the acf should be at the distance between the two peaks

In [28]:
#Test case:
log_tag = 10
N = 2 ** log_tag   #data length, 2^log_tag
A = 1              #Amplitude
t = arange(-N/2,N/2)

#Change here to see what will change 
left_shift = 200.0
right_shift = 200.0
left_var = 1000.0
right_var = 1000.0

#Generate 2 gaussian pack
left_pack = A * exp(-(t+left_shift)**2/left_var)
right_pack = A * exp(-(t-right_shift)**2/right_var)
off_set = A * ones(N) * 1

#Make up data:
test_data = left_pack + right_pack+off_set

#plot data:
figure(1)

#plot test_data
subplot(3,1,1)
plot(t,test_data)

#plot abs fft
subplot(3,1,2)
plot(t,fftshift(abs(fft.fft(test_data-mean(test_data)))),'o')

#plot the acf; the extra peak should be at a position equal to the peak distance
subplot(3,1,3)
plot(acf_fft(test_data),'o')


Out[28]:
[<matplotlib.lines.Line2D at 0x8c38690>]

1.3 What can we do?

  • Thanks to the auto correlator card, we normally do not need to perform the auto correlation calculation from raw data. What we do is generate the data and do the fitting in Quick Fit.
  • However, according to the fft algorithm:
    • Step 1: $F_R (f) = \mathrm{fft}(I(t))$ (calculate the fft of I(t); reversible)
    • Step 2: $S(f) = F_R (f) \cdot F_R^*(f)$ (multiply the fft with its complex conjugate to calculate the power spectrum; non-reversible)
    • Step 3: $G(\tau) = \mathrm{ifft}(S(f))$ (calculate the auto-correlation function by the ifft of S(f); reversible)
  • Since the first and third steps are reversible, we can invert the third step to calculate S(f) from a measured ACF, and see what we can get.
  • Notice that the fft algorithm calculates the auto correlation function on evenly spaced time lags, so to invert Step 3 our data must also be on evenly spaced time lags.
  • The first thought is to interpolate our auto correlation curve.
  • For the boundary conditions $g(0)$ and $g(\infty)$, we take the worst-case values discussed above
    • $g(0) = 1$
    • $g(\infty) = 0$
  • We play with real data. (Thanks to Dr. Alexander Gansen)

In [4]:
#initialize
%pylab inline
import pandas as pd
import os


Populating the interactive namespace from numpy and matplotlib

In [5]:
work_directory = '/home/chen/Documents/FFT_data_view/From_alex/'
os.chdir(work_directory)
!ls -l


total 212
-rw-r--r-- 1 chen users  13843 Aug 19 14:37 10sec_DNA5nM.ASC
-rw-r--r-- 1 chen users  13843 Aug 19 14:39 10sec_DNA5nM_fast.ASC
-rw-r--r-- 1 chen users 144996 Aug 19 14:39 10sec_DNA5nM_fast.CN0
-rw-r--r-- 1 chen users   6589 Aug 19 14:38 1sec_DNA5nM.ASC
-rw-r--r-- 1 chen users  14215 Aug 19 14:38 20sec_DNA5nM.ASC
-rw-r--r-- 1 chen users  11487 Aug 19 14:37 5sec_DNA5nM.ASC

In [6]:
#use pandas to read in the ASC file
file_to_play = "10sec_DNA5nM_fast.ASC"

#set the column names 
data_set = ["time_lag", "Correlation"]

#set number of header lines to ignore
num_header = 25

#set number of rows to read:
num_rows = 175

#read in file to df
df = pd.read_csv(file_to_play, header=num_header, sep="\t", names=data_set, skipinitialspace=True, nrows=num_rows)

#print header lines and data type
print df.head() 
print df.dtypes
print df.describe()


   time_lag  Correlation
0    0.0002     0.777936
1    0.0004     0.562679
2    0.0006     0.487972
3    0.0008     0.483364
4    0.0010     0.463721
time_lag       float64
Correlation    float64
dtype: object
          time_lag  Correlation
count   175.000000   175.000000
mean    220.477123     0.130215
std     567.278276     0.152531
min       0.000200    -0.004103
25%       0.040000     0.000808
50%       1.638400     0.066180
75%      75.315200     0.221770
max    3145.680000     0.777936
  • After reading in the file, we can make the normal plot, just as we do in Quick Fit:

In [7]:
#plot it
semilogx(df.time_lag, df.Correlation)
grid(True, which="both", ls="-")
title("Auto-correlation function from data " + file_to_play)
xlabel("Time lag [s]")
ylabel("$g(\tau)$")


Out[7]:
<matplotlib.text.Text at 0x3776b50>
  • Now, we will perform interpolation to fill the gaps in the time lag
  • First, we should set the boundary condition: $g(0) = 1$

In [8]:
#Set boundary condition to g(0) = 1
time_new = concatenate(([0],df.time_lag), axis=0)
data_new = concatenate(([1], df.Correlation), axis=0)
  • We cannot make a logarithmic plot of the new data, since t = 0 cannot be shown on a log axis
  • Next, we can do the interpolation

In [9]:
from scipy.interpolate import interp1d

#Generate interpolation function use time_new and data_new
polate_function = interp1d(time_new, data_new, kind="cubic")

#plot it against the raw data.
#time range:
time_target = linspace(0.0002,100,500000)

#plot it
semilogx(df.time_lag, df.Correlation,"o", time_target, polate_function(time_target),'-')
grid(True)
legend(["Initial ACF","Interpolated ACF"])


Out[9]:
<matplotlib.legend.Legend at 0x3ceec10>
  • Notice that we only plot the interpolated data up to 100 s, because the evenly spaced time lag vector time_target has 500,000 elements.
  • Since the ACF around 100 s is nearly zero, we do not really need data with larger time lags
  • Now that we have the interpolated data, we can apply our third step inversely:
  • We calculate S(f) first

In [10]:
#First, we evaluate the interpolated ACF on [0,100] s, with an even time lag spacing of 0.0002 s
time_lag_size = 0.0002
time_upto = 100
time_lag_vector = linspace(0,time_upto,int(time_upto/time_lag_size))
Interpolated_ACF = polate_function(time_lag_vector)

#Reconstruct the data for inverting the 3rd step: basically make the new data symmetric
ACF_to_3rd = concatenate((Interpolated_ACF, Interpolated_ACF[len(Interpolated_ACF)-2:0:-1]),axis=0)

#Calculate S(f)
S_f = fft.fft(ACF_to_3rd)
  • Remember, S(f) is the power spectrum, which indicates how the information is distributed
  • Let's plot S(f) to take a look:

In [11]:
#Plot S(f)
plot(abs(fftshift(S_f)),'x')


Out[11]:
[<matplotlib.lines.Line2D at 0x369e410>]
  • Although we have points up to 1,000,000, our information is highly localized.
  • To quantitatively see how the information is distributed, we of course make the cumulative sum plot (see 0.2.3.2)

In [13]:
#cum sum plot
sf_target = S_f[0:int(len(S_f)/2)]
plot(abs(sf_target.cumsum())/abs(max(sf_target.cumsum())),'r-')


Out[13]:
[<matplotlib.lines.Line2D at 0x3ae4f50>]
  • Note: we have 1,000,000 total points; as the spectrum is symmetric, we only need to plot half of them.
  • The cumulative sum plot shows how the information is distributed from low frequency to high frequency.
  • From the above, we see that the information increases dramatically fast around zero, e.g.:
    • the first 3,000 points (3000/500000 = 0.6% of the points) contribute 33.8% of the information
    • the first 300,000 points (300000/500000 = 60% of the points) contribute 96.1% of the information.
  • We can set a low-pass filter in some range to see how that works

In [108]:
#construct filter:

#total points
N = len(S_f)

#First filter: the number of points kept corresponds to ~50% of the information
points1 = 18000
filter1 = concatenate((ones(points1 + 1),zeros(N - 2*points1 - 1),ones(points1)), axis=0)


#Second filter: the number of points kept corresponds to ~20% of the information
points2 = 200
filter2 = concatenate((ones(points2 + 1),zeros(N - 2*points2 - 1),ones(points2)), axis=0)

#Now apply the filters
S_filtered1 = S_f * filter1
S_filtered2 = S_f * filter2

#re-calculate ACF after filtered
acf_filtered1 = abs(ifft(S_filtered1))
acf_filtered2 = abs(ifft(S_filtered2))

#keep only the first half, to plot against time_target:
acf_filtered_toplot1 = acf_filtered1[0:len(acf_filtered1)/2 + 1]
acf_filtered_toplot2 = acf_filtered2[0:len(acf_filtered2)/2 + 1]

In [114]:
#Then plot with original data
semilogx(df.time_lag, df.Correlation,'o', time_target,acf_filtered_toplot1,'-', time_target,acf_filtered_toplot2,'--')
legend(["origina data", "50% information", "20% information"])
grid(True)


  • From the above plot, we can see how much of the curve each level of filtering preserves.

1.4 Conclusion and Discussion:

  • We can use the FFT to design low-pass filters and clean up data, even when we only have the calculated ACF curve.
  • It would be clearer to show how this method works on raw data.

1.5 Possible Optimization:

  • Maybe not necessary, since no one is interested.

Section 2: Confocal based spFRET

  • In spFRET, the information is highly localized, so we will play with the data locally.
  • The raw data generated from spFRET looks like the following:
    • $I(t + \tau)$ IS distinguishable from $I(t)$
    • $I(-t)$ is NOT distinguishable from $I(t)$

2.1 Introduction:

  • Not necessary; we will see by example.
  • Raw data from Dr. Alexander Gansen

2.2 Data Structure:


In [118]:
#Initialize
%pylab inline
work_directory = "/home/chen/Documents/FFT_data_view/FFT_FRET/080513/"
import os
import pandas as pd
os.chdir(work_directory)
!ls -l *.txt


Populating the interactive namespace from numpy and matplotlib
-rw-r--r-- 1 chen users    160742 Sep 22 21:39 default.txt
-rw-r--r-- 1 chen users  82459683 Sep 23 21:11 nucAAs1000.txt
-rw-r--r-- 1 chen users  75367095 Sep 23 21:09 nucAAs100.txt
-rw-r--r-- 1 chen users  95584748 Sep 23 21:11 nucAAs1100.txt
-rw-r--r-- 1 chen users  74876768 Sep 23 21:09 nucAAs200.txt
-rw-r--r-- 1 chen users  75933897 Sep 23 21:10 nucAAs300.txt
-rw-r--r-- 1 chen users  72816055 Sep 23 21:10 nucAAs400.txt
-rw-r--r-- 1 chen users  79791613 Sep 23 21:10 nucAAs500.txt
-rw-r--r-- 1 chen users  79427396 Sep 23 21:10 nucAAs600.txt
-rw-r--r-- 1 chen users  85556308 Sep 23 21:11 nucAAS700.txt
-rw-r--r-- 1 chen users  77399630 Sep 23 21:11 nucAAs800.txt
-rw-r--r-- 1 chen users  80224702 Sep 23 21:11 nucAAs900.txt
-rw-r--r-- 1 chen users  76753703 Sep 23 21:08 nucH3acs1000.txt
-rw-r--r-- 1 chen users  54951382 Sep 23 21:06 nucH3acs100.txt
-rw-r--r-- 1 chen users  88644994 Sep 23 21:09 nucH3acs1100.txt
-rw-r--r-- 1 chen users  63136562 Sep 23 21:06 nucH3acs200.txt
-rw-r--r-- 1 chen users  70235072 Sep 23 21:07 nucH3acs300.txt
-rw-r--r-- 1 chen users  69755857 Sep 23 21:07 nucH3acs400neu.txt
-rw-r--r-- 1 chen users 110010819 Sep 23 21:07 nucH3acs400.txt
-rw-r--r-- 1 chen users  71361569 Sep 23 21:07 nucH3acs500.txt
-rw-r--r-- 1 chen users  72255125 Sep 23 21:08 nucH3acs600.txt
-rw-r--r-- 1 chen users  72246169 Sep 23 21:08 nucH3acs700.txt
-rw-r--r-- 1 chen users  71064630 Sep 23 21:08 nucH3acs800.txt
-rw-r--r-- 1 chen users  69111444 Sep 23 21:08 nucH3acs900.txt
  • Although the raw data is stored in *.t3r files, here we use a script to transform the original t3r files into human-readable txt files.
  • ToDo: design a read-in scheme if necessary.
  • Each file takes about 10 s to read in. Please be patient.
  • We use the file nucH3acs500.txt as the example

In [267]:
#use pandas to read in the converted txt file
file_to_play = "nucH3acs500.txt"

#set the column names 
data_set = ["data_id","timetag", "time_s", "channel", "route"]

#Column specifications, special for spFret file
col_specs = [(0,7),(11,16),(18,30),(33,37),(40,42)]

#set number of header lines to ignore
num_header = 39

#read in file to df
df = pd.read_fwf(file_to_play, header=num_header, colspecs=col_specs, names=data_set,skipinitialspace=False)

#print header lines and data type
print df.head() 
print df.dtypes
print df.describe()


   data_id  timetag    time_s  channel  route
0        0    34395  0.003440      452      0
1        1    35856  0.003586      534      1
2        2    63204  0.006320      310      0
3        3     3072  0.006861      481      0
4        4     4424  0.006996      417      1
data_id      int64
timetag      int64
time_s     float64
channel      int64
route        int64
dtype: object
              data_id         timetag          time_s         channel           route
count  1653413.000000  1653413.000000  1653413.000000  1653413.000000  1653413.000000
mean    826706.000000    32793.810750      593.017763      466.388757        0.441750
std     477299.364655    18900.959124      343.789399      181.175464        0.496596
min          0.000000        0.000000        0.003440      152.000000        0.000000
25%     413353.000000    16418.000000      297.243929      309.000000        0.000000
50%     826706.000000    32841.000000      587.064533      467.000000        0.000000
75%    1240059.000000    49164.000000      885.391276      623.000000        1.000000
max    1653412.000000    65535.000000     1199.993791      783.000000        1.000000
  • time_s is the time at which the sensor captured a certain photon
  • route indicates the donor channel (0) and the acceptor channel (1)
  • data_id is the photon sequence number, and timetag is a periodic signal of size $2^{16}$
  • We mainly use the columns time_s and route for data analysis.
  • In FretChen, a certain burst is defined by the total photon counts in both channels (donor and acceptor); here we will separate the data into two signal vectors.

2.2.1 Binning raw data to a certain time scale

  • The native data is not suitable for the fft, because the fft needs data on an evenly spaced time grid.
  • We should first bin the raw data to a certain fixed time lag, and at the same time separate the donor and acceptor channels
  • We define a binning function to do the job:

In [283]:
# Binning raw data to a certain time scale
# NOTE: this function is only used to show how the method works; it is not written for efficiency.
# Rewrite as a more efficient version if necessary (a vectorized sketch follows below)
def binning(data_frame, time_lag):
    '''
    input: 
        data_frame: raw spFRET data frame, loaded in the section above
        time_lag:   bin width, unit: s
    return: 
        time_vector: evenly spaced time vector in units of time_lag, from time_lag to the final experiment time
        donor      : donor channel photon counts corresponding to time_vector
        acceptor   : acceptor channel photon counts corresponding to time_vector
        total_d_a  : total counts corresponding to time_vector
    '''
    #count of total photons
    total_data_points = len(data_frame)
    
    #get the final experiment time, unit: s
    time_final = ceil(data_frame.time_s[total_data_points-1])
    
    #count how many points we need to store the data
    return_data_points = int(ceil(time_final/time_lag))
    
    #initialize return data:
    time_vector = linspace(time_lag, time_final, return_data_points)
    donor = zeros(return_data_points)
    acceptor = zeros(return_data_points)
    total_d_a = zeros(return_data_points)
    
    #loop over the route column in data_frame and categorize photons as donor or acceptor
    #first, create a vector mapping photon time to bin index
    time_to_index = data_frame.time_s / time_lag
    for i in arange(total_data_points):
        #get the bin index for this photon (int() truncates, i.e. floors, for positive values)
        DA_index = int(time_to_index[i])
        
        #separate data into donor or acceptor channel
        if data_frame.route[i] == 0.0:           #donor channel
            donor[DA_index] = donor[DA_index] + 1
        else:                                    #acceptor channel
            acceptor[DA_index] = acceptor[DA_index] + 1
        total_d_a[DA_index] = total_d_a[DA_index] + 1
    
    
    return time_vector, donor, acceptor, total_d_a
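  • As the NOTE above says, the loop is written for clarity, not speed. A vectorized sketch of the same binning using numpy.histogram (same inputs and outputs, assuming the same data frame columns; the exact handling of the last bin edge differs slightly from the loop version):

In [ ]:
#Vectorized binning sketch, same interface as binning() above
def binning_fast(data_frame, time_lag):
    time_final = ceil(data_frame.time_s.values[-1])
    n_bins = int(ceil(time_final/time_lag))
    edges = linspace(0, time_final, n_bins + 1)   #bin edges
    time_vector = edges[1:]                       #bin right edges, as in binning()
    is_donor = (data_frame.route.values == 0)
    donor, _ = histogram(data_frame.time_s.values[is_donor], bins=edges)
    acceptor, _ = histogram(data_frame.time_s.values[~is_donor], bins=edges)
    total_d_a = donor + acceptor
    return time_vector, donor.astype(float), acceptor.astype(float), total_d_a.astype(float)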

In [275]:
#Test case:
#set binning time lag:
time_lag = 0.4/1000 #unit: s

#use the binning function to separate donor and acceptor. 
#brute force, slow, but acceptable
time_vector, donor, acceptor, total_d_a = binning(df, time_lag)

In [282]:
# Check
# Check that the results are consistent with each other
print "Length of time vector is: ", len(time_vector)
print "Length of donor vector is: ", len(donor)
print "Total photo count should be:", len(df)
print "Sum of donor and acceptor channel photos are:", donor.sum() + acceptor.sum()


Length of time vector is:  3000000
Length of donor vector is:  3000000
Total photon count should be: 1653413
Sum of donor and acceptor channel photons is: 1653413.0

2.2.2 Sliding Window fft

  • Now that we have the data prepared, we can think about what to do.
  • spFRET is about local properties of the signal, so we cannot use exactly the same scheme as we played with in FCS.
  • This time, we do a sliding window fft
  • Brief method:

    • split the global data vector into many small local vectors.
    • do the fft for each vector
    • sum all the fft results and make the cumulative sum plot (see 0.2.3.2)
  • Why?

    • In Section 0, we discussed that the fft spectrum is a powerful way to describe an object.
    • For a target that is not immobilized, summing all the data in the time domain will not help.
    • However, for a certain target, the shape of the spectrum will not change, no matter how it moves.
  • First, as we do not know which window size is suitable (it really depends), we design a function that works for the general case:

In [357]:
#Do a sliding window accumulated fft for 1d data, window width = 2 ** fft_factor
#Again, this is not written for efficiency
def window_fft(data, fft_factor):
    '''
    Do a sliding window accumulated fft for 1d data
    input:
        data: 1d data vector that is long enough
        fft_factor: determines the window width; window width = 2 ** fft_factor
    return:
        accumulate_fft: accumulated window spectrum, used for the cumulative plot
    '''
    #set window width
    window_width = 2 ** fft_factor
    
    #get data length
    data_length = len(data)
    
    #take out the offset (mean value) from data, because that will affect the cumulative plot
    target = data - data.mean()
    
    #usable data length: must be an integer multiple of window_width
    data_use_length = data_length - data_length % window_width
    
    #How many windows will we have?
    window_num = data_use_length / window_width
    
    #initialize output (complex, since fft results are complex)
    accumulate_fft = zeros(window_width, dtype=complex)
        
    #do the main loop over the windows
    for i in arange(window_num):
        accumulate_fft = accumulate_fft + fft.fft(target[i*window_width: (i+1)*window_width])
        
    #return data
    return accumulate_fft
  • We can change the window width and make the cumulative plot for the windowed fft
  • For example, with fft_factor = 5 the window size is 32 points (time_lag = 0.4 ms, so one window is 12.8 ms)
  • (For this particular case, fft_factor = 5 or 6 works best)

In [375]:
#test case:
#set fft_factor
fft_factor = 5
accumulate_fft = window_fft(total_d_a, fft_factor)
figure(1)
plot(abs(fftshift(accumulate_fft)))


Out[375]:
[<matplotlib.lines.Line2D at 0x129621750>]
  • Looks like garbage?
  • Now let's try the cumulative sum plot:

In [376]:
#cumsum plot:
data_length = len(accumulate_fft)
accumulate_power = (accumulate_fft * accumulate_fft.conjugate()).real
accum_toplot = accumulate_power[0: data_length/2]
plot(accum_toplot.cumsum()/max(accum_toplot.cumsum()),'x')


Out[376]:
[<matplotlib.lines.Line2D at 0x1293577d0>]
  • The cumulative plot describes how the energy increases around 0 frequency; this can be viewed as your information.
  • For this particular case, we use 16 points in total, but the information goes above 70% before the 7th point.
  • In this case, fewer than 50% of the points contribute 70% of the information
  • If we set a low-pass filter that uses just half of the spectrum points, we can still describe our data sufficiently.
  • We can use this scheme to select bursts for spFRET

2.2.3 Calculate the proximity FRET ratio of bursts selected by fft

  • The work is done by the following function:

In [394]:
#Main function
def fft_filter_burst(ref, total_d_a, acceptor, fft_scale, filter_ratio, threshold, max_half_width):
    '''
    Calculate the proximity ratio, using fft based burst selection
    input:
        ref:               which channel is used for burst detection: total, donor, or acceptor
        total_d_a:         total photon counts, evenly binned time lag vector
        acceptor:          acceptor channel photon counts, evenly binned time lag vector
        fft_scale:         2 ** fft_scale is the local window size
        filter_ratio:      ratio of points in the frequency domain to use for the low pass filter (in [0, 1])
        threshold:         threshold in the time domain (play with it for now)
        max_half_width:    how many (half) points to take into account as a burst in the time domain
    output:
        proximity:         proximity FRET ratios
    '''
    #get the data length:
    data_length = len(total_d_a)
    
    #set window width
    window_width = 2 ** fft_scale
    
    #How many point is going to use?
    data_use_length = data_length - data_length % window_width
    
    #How many windows in total?
    window_num = data_use_length / window_width
    
    #set parameters for low pass filter
    fft_filter_point = int(window_width * filter_ratio / 2)
    
    #Initial filter
    fft_filter = concatenate((ones(fft_filter_point + 1),zeros(window_width- 2*fft_filter_point - 1),ones(fft_filter_point)), axis=0)
    
    #Initial output
    proximity = []
    
    #The main loop
    for i in arange(window_num):
        
        fft_local = fft.fft(ref[i*window_width:(i+1)*window_width])
        
        fft_filtered = fft_local * fft_filter
        
        #ifft to get the filtered data; the filter is conjugate-symmetric, so the result is real up to rounding
        data_filtered_local = ifft(fft_filtered).real
        local_max = data_filtered_local.max()
        local_index = data_filtered_local.argmax()
        
        #Check if it is a valid burst
        #should be larger than threshold and not on the boundary.
        #Boundary case can be optimized
        if local_max > threshold and local_index - fft_scale/2 > 0 and local_index + fft_scale/2 < window_width:
            burst_begin = local_index - max_half_width + i * window_width
            burst_end = local_index + max_half_width + i * window_width
            
            #P = A/(D+A)
            proximity.append(sum(acceptor[burst_begin:burst_end])/sum(total_d_a[burst_begin:burst_end]))
    
    return proximity
  • Now let us try to use it to calculate the proximity FRET ratio, to see if we can repeat the job of FretChen

In [407]:
#Test case:
#Initial case, can be changed
filter_ratio = 10/16.0
fft_scale = 6
threshold = 12
max_half_width = 3

#calculate proximity ratio for the certain case
P1 = fft_filter_burst(total_d_a, total_d_a, acceptor, fft_scale, filter_ratio,threshold, max_half_width)
figure(1)
hist(P1,40)
title("Proximity FRET Ratio calculate by using FFT")


Out[407]:
<matplotlib.text.Text at 0xbbfdf1d0>
  • Notice that above we still use the total photon count as our burst reference.
  • How about thresholding on the acceptor channel?

In [408]:
#threshold using acceptor channel
P2 = fft_filter_burst(acceptor, total_d_a, acceptor, fft_scale, filter_ratio,threshold, max_half_width)
figure(1)
hist(P2,50)
title("Proximity FRET ratio use acceptor as reference")


Out[408]:
<matplotlib.text.Text at 0xbbd55550>
  • As we threshold on the acceptor channel, we lose the information about low FRET around P = 0
  • Might we see better separated FRET ratio groups, even by naked eye?

2.3 Conclusion and Discussion:

  • It works

2.4 Possible optimization:

  • Above we still used the raw data to calculate the proximity ratio
  • How about:
    • Using the filtered data instead? (That should also clean up noise such as triplet-state blinking)
    • For a certain sample, finding the best spectrum shape first; then we could greatly increase the target precision.

Section 3: TIRF FRET

  • For TIRF, it is somewhat different.
  • The raw data generated from TIRF looks like the following:
    • $I(t + \tau)$ IS distinguishable from $I(t)$
    • $I(-t)$ is globally NOT, but locally IS, distinguishable from $I(t)$
  • The best filtering tool seems to also be a sliding window filter.
  • First, we do it with the Gabor transform, a windowed Fourier (wavelet-like) algorithm. (A minimal sketch follows below.)
  • This should also work for the analysis of simulated force-loaded data. But the simulated data is too short to play with.
  • Todo: a better TIRF setup is required
    • Set up the piezo stage first.
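  • A minimal sketch of the Gabor transform (a Gaussian window sliding over the trace, with an fft per window position; the window width, hop size and Gaussian width fraction are illustrative assumptions to be tuned on real TIRF traces):

In [ ]:
#Gabor transform sketch: Gaussian-windowed sliding fft
def gabor_transform(data, window_width=64, hop=16, sigma_frac=0.15):
    window = exp(-(arange(window_width) - window_width/2.0)**2
                 / (2*(sigma_frac*window_width)**2))   #Gaussian window
    starts = arange(0, len(data) - window_width + 1, hop)
    spectrogram = zeros((len(starts), window_width))
    for k, s in enumerate(starts):
        spectrogram[k] = abs(fft.fft(data[s:s+window_width] * window))
    return starts, spectrogram   #rows: window positions, columns: frequency bins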

To be continued

Section N: Test Cases


In [4]:
#How to set filter?
#A symmetric spectrom with N points.
#0 point is the center
#filter:

#N = 1000 #len(data)
#points = 3  #set points around 0
#low_filter = concatenate((ones(points+1),zeros(N-2*points-1),ones(points)), axis=0)

In [257]:
#test whether g(tau) can be < 0
figure(1)
data = A* ones(N) + A* sin(t)
plot(data)
plot(acf_fft(data))


Out[257]:
[<matplotlib.lines.Line2D at 0xb5778d0>]

In [258]:
#plot new data
plot(time_new, data_new,'x')
grid(True, which="both", ls="-")
title("Auto-correlation function of new data")
xlabel("Time lag [s]")
ylabel("$g(\tau)$")


Out[258]:
<matplotlib.text.Text at 0xbfaaed0>

In [276]:
#Interpolation:
test_t = linspace(0.0002, 10, 100000)
from scipy.interpolate import interp1d
f = interp1d(df.time_lag, df.Correlation)
f2 = interp1d(df.time_lag, df.Correlation, kind="cubic")
semilogx(df.time_lag, df.Correlation,'o', test_t, f(test_t),'x-', test_t, f2(test_t),'-')


Out[276]:
[<matplotlib.lines.Line2D at 0xf313fd0>,
 <matplotlib.lines.Line2D at 0xf591350>,
 <matplotlib.lines.Line2D at 0xf591850>]

In [5]:
#Youtube
from IPython.display import YouTubeVideo
YouTubeVideo('7tHc9xWhFH4') #Haydn - Emperor Quartet Mvt.2


Out[5]:

In [1]:
whos


Interactive namespace is empty.

In [1]:
pwd


Out[1]:
u'/Users/chenchen/Documents/git/ipynotebook/pyconvert_test'

In [ ]: