Integration

How much solar power was available to be collected in Southampton in 2005?

To answer this, we need to integrate the solar irradiance data, to get the insolation,

\begin{equation} H = \int \text{d}t \, I(t). \end{equation}

There's a data file containing the irradiance data (simplified and tidied up) from HelioClim in the repository. Let's load it in.


In [ ]:
from __future__ import division
import numpy
data_southampton_2005 = numpy.loadtxt('../data/irradiance/southampton_2005.txt')

If we ask Python what the data is, it will show us the first and last few entries:


In [ ]:
data_southampton_2005

Try looking at the data file in a text editor. The header comment says that the first column contains the time in hours since midnight, January 1st. We see that the file gives data every quarter of an hour for the year.

Let's plot the data to see the trend.


In [ ]:
%matplotlib notebook
from matplotlib import pyplot

pyplot.figure(figsize=(10,6))
pyplot.scatter(data_southampton_2005[:,0], data_southampton_2005[:,1], label="Southampton, 2005")
pyplot.legend()
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()

You'll need to zoom right in to see individual days. It's pretty noisy, and there are points where the data is corrupted.

Still, integration should smooth that out.

Let's plot a small segment of the data - the first 200 data points, which covers about 50 hours, or roughly two days. We'll use that to think about integration.


In [ ]:
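# One possible sketch: scatter plot the first 200 points (about 50 hours of data).
segment = data_southampton_2005[:200]
pyplot.figure(figsize=(10,6))
pyplot.scatter(segment[:,0], segment[:,1], label="Southampton, 2005 (first 200 points)")
pyplot.legend()
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()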

This is how the data looks as individual points. Let's instead plot it as a bar chart.


In [ ]:
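# The same segment as a bar chart; each bar is one quarter-hour sample, so width 0.25 hours.
pyplot.figure(figsize=(10,6))
pyplot.bar(segment[:,0], segment[:,1], width=0.25, label="Southampton, 2005 (first 200 points)")
pyplot.legend()
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()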

The integral is the area under the curve. To approximate it, we split the domain into subintervals and approximate the area of each subinterval as the width of the subinterval times the (constant) value of the integrand on that subinterval.

In other words: find the area of each bar, and add them up.

The area of each bar is the data value multiplied by the time step (a quarter of an hour, here).

So the total integral over these two days is given by:


In [ ]:
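# Riemann sum: the area of each bar is its value times the quarter-hour width, summed up.
dt = 0.25
H_two_days = dt * numpy.sum(segment[:,1])
print("Insolation over the first two days:", H_two_days)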

Is this a sensible number? There are roughly 8 hours of sun each day. On day 1 the maximum irradiance is around 10; on day 2 it's around 35. So an upper bound on the integral would be around $8 \times 10 + 8 \times 35 = 360$: we're in the right ballpark.

So the total insolation for 2005 is:


In [ ]:
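# The same sum over the whole year. (A more careful treatment might first filter out
# the corrupted entries noted above.)
H_2005 = dt * numpy.sum(data_southampton_2005[:,1])
print("Total insolation, Southampton 2005:", H_2005)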

Exercise

In the data directory you'll find data files for all the CDT sites for both 2004 and 2005. Compute the insolation for at least one more. If you're feeling inspired, try the glob library to compute them all automatically.

Solution

In [ ]:
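# A sketch of one possible solution. The file pattern below is an assumption, based on
# the southampton_2005.txt name; adjust it to match the files in the data directory.
import glob

dt = 0.25
for filename in sorted(glob.glob('../data/irradiance/*.txt')):
    data = numpy.loadtxt(filename)
    print(filename, dt * numpy.sum(data[:,1]))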

Improving the integral

There's a problem with the simple rule that we've used, which is clear when we take a closer look at the data. Let's go back to our plots:


In [ ]:
pyplot.figure(figsize=(10,6))
pyplot.bar(data_southampton_2005[120:170,0], data_southampton_2005[120:170,1], label="Southampton, 2005", width=0.25)
pyplot.legend()
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()

There's a clear trend in the data, but we're sampling it very coarsely. We could imagine smoothing this data considerably. Even the simplest thing - joining the points with straight lines - would be an improvement:


In [ ]:
pyplot.figure(figsize=(10,6))
pyplot.plot(data_southampton_2005[120:170,0], data_southampton_2005[120:170,1], label="Southampton, 2005")
pyplot.legend()
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()

The integral is still the area under this curve. And the full domain is still split up into quarter hour subintervals. The difference is that the irradiance is now a straight line on each subinterval, not a constant value. So each subinterval is represented by a trapezoid, not by a bar:


In [ ]:
pyplot.figure(figsize=(10,6))
pyplot.fill_between(data_southampton_2005[150:152,0], data_southampton_2005[150:152,1])
pyplot.fill_between(data_southampton_2005[151:153,0], data_southampton_2005[151:153,1])
pyplot.fill_between(data_southampton_2005[152:154,0], data_southampton_2005[152:154,1])
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"Horizontal diffuse irradiance ($Wh \, m^{-2}$)")
pyplot.show()

The area of a trapezoid is given by its width times the average of the value at the left and right edge.

We now need some notation. We'll denote the times at which we have data as $t_j$, where $j$ is an integer, $j = 0, \dots, N$. So we have $N+1$ data points in all, leading to $N$ subintervals. The irradiance $I$ at time $t_j$ will be $I_j = I(t_j)$.

The $j^{\text{th}}$ subinterval has $t_j$ at its left edge and $t_{j+1}$ on its right. So the area of this subinterval is

\begin{equation} H_j = \frac{1}{2} \Delta t \, \left( I_j + I_{j+1} \right). \end{equation}

The total insolation, which is the total integral, is the total area of all the subintervals: the sum of all the $H_j$:

\begin{equation} H = \sum_{j=0}^{N-1} \frac{1}{2} \Delta t \, \left( I_j + I_{j+1} \right). \end{equation}

As the right edge of the $j^{\text{th}}$ subinterval is the left edge of the $(j+1)^{\text{th}}$ subinterval (except for the endpoints), every point except the first and last is counted twice. So we can rewrite this as

\begin{equation} H = \frac{\Delta t}{2} \, \left( I_{0} + I_{N} + 2 \sum_{j=1}^{N-1} I_j \right). \end{equation}

This is the trapezoidal rule.

Let's test it:


In [ ]:
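# Trapezoidal rule over the same two days, using the end-point form derived above.
dt = 0.25
I_vals = segment[:,1]
H_trapezoidal = dt/2 * (I_vals[0] + I_vals[-1] + 2*numpy.sum(I_vals[1:-1]))
print("Trapezoidal rule:", H_trapezoidal)
print("Riemann sum:     ", dt * numpy.sum(I_vals))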

In this case the result isn't just close to the previous version - it's identical. We need some evidence that it's actually going to be better in general.

Convergence

To do this we'll use an integrand given by a specific function rather than a data set. The function

\begin{equation} I(t) = \sin^4 \left( \frac{2 \pi t}{47} \right) \times \left( 1 + 0.1 \times \sin \left(\frac{2 \pi t}{365 \times 24} \right) \right) \end{equation}

is similar in form to the data. The integral has solution

\begin{equation} \int_0^{24 \times 365} \text{d} t \, I(t) \approx 3286.814506640615. \end{equation}

In [ ]:
def I_analytic(t):
    return numpy.sin(2*numpy.pi*t/47)**4 * (1.0 + 0.1 * numpy.sin(2*numpy.pi*t/(24*365)))

In [ ]:
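# Store the quoted value of the integral for the error comparisons below, and take a
# quick look at the test integrand over the first few days.
H_exact = 3286.814506640615

t_plot = numpy.linspace(0, 72, 500)
pyplot.figure(figsize=(10,6))
pyplot.plot(t_plot, I_analytic(t_plot))
pyplot.xlabel("Time (hours)")
pyplot.ylabel(r"$I(t)$")
pyplot.show()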

We can compute the numerical solution using $N = 2^k$ subintervals, where $k = 10, \dots, 18$, using both the Riemann integral and the trapezoidal rule:


In [ ]:
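# One possible sketch: compute both approximations for N = 2**k subintervals,
# k = 10, ..., 18, and record the error against the exact value.
Ns = 2**numpy.arange(10, 19)
dts = 24*365 / Ns
errors_riemann = numpy.zeros(len(Ns))
errors_trapezoidal = numpy.zeros(len(Ns))
for i, N in enumerate(Ns):
    t, dt = numpy.linspace(0, 24*365, N+1, retstep=True)
    I = I_analytic(t)
    H_riemann = dt * numpy.sum(I[:-1])
    H_trapezoidal = dt/2 * (I[0] + I[-1] + 2*numpy.sum(I[1:-1]))
    errors_riemann[i] = abs(H_riemann - H_exact)
    errors_trapezoidal[i] = abs(H_trapezoidal - H_exact)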


In [ ]:
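# Plot the errors on logarithmic axes; the slopes of the lines give the convergence orders.
pyplot.figure(figsize=(10,6))
pyplot.loglog(dts, errors_riemann, 'x-', label="Riemann")
pyplot.loglog(dts, errors_trapezoidal, 'o-', label="Trapezoidal")
pyplot.loglog(dts, errors_riemann[-1] * (dts/dts[-1]), ':', label=r"$\propto \Delta t$")
pyplot.loglog(dts, errors_trapezoidal[-1] * (dts/dts[-1])**2, ':', label=r"$\propto (\Delta t)^2$")
pyplot.legend()
pyplot.xlabel(r"$\Delta t$ (hours)")
pyplot.ylabel("Error")
pyplot.show()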

As the number of points increases, the timestep decreases, and the error decreases in both cases. But the error decreases much faster for the trapezoidal rule. The logarithmic scale shows that the error $E$ is going down as a power law,

\begin{equation} \log(E) \sim s \log(\Delta t) \quad \implies \quad E \sim (\Delta t)^s, \end{equation}

where $s$ is the slope of the straight line in the log-log plot. This shows that the Riemann integral error goes down by the amount that the timestep is reduced each time; the trapezoidal rule error goes down by this amount squared. When the timestep is reduced by a factor of ten, the Riemann integral error goes down by ten, the trapezoidal rule by a hundred.

Algorithms that converge like the Riemann integral are called first order ($s=1$); those that converge like the trapezoidal rule are called second order ($s=2$).

Exercise

Show that Simpson's rule

\begin{equation} H = \frac{\Delta t}{3} \, \left( I_{0} + I_{N} + 2 \sum_{j=1}^{N/2-1} I_{2 j} + 4 \sum_{j=1}^{N/2} I_{2 j - 1} \right) \end{equation}

is fourth order ($s=4$). Simpson's rule requires an even number of subintervals $N$, and is the result of fitting a quadratic through each three neighbouring points.

Solution

In [ ]:
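# A sketch of one possible solution: composite Simpson's rule, using the same grids as before.
def simpson(I, dt):
    # I holds the integrand at N+1 equally spaced points, with N even.
    return dt/3 * (I[0] + I[-1] + 2*numpy.sum(I[2:-1:2]) + 4*numpy.sum(I[1:-1:2]))

errors_simpson = numpy.zeros(len(Ns))
for i, N in enumerate(Ns):
    t, dt = numpy.linspace(0, 24*365, N+1, retstep=True)
    errors_simpson[i] = abs(simpson(I_analytic(t), dt) - H_exact)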


In [ ]:
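# The slope on the log-log plot should be close to four.
pyplot.figure(figsize=(10,6))
pyplot.loglog(dts, errors_simpson, 's-', label="Simpson")
pyplot.loglog(dts, errors_simpson[-1] * (dts/dts[-1])**4, ':', label=r"$\propto (\Delta t)^4$")
pyplot.legend()
pyplot.xlabel(r"$\Delta t$ (hours)")
pyplot.ylabel("Error")
pyplot.show()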

Multi-dimensional integrals

Let's suppose that the irradiance also varies with location, in some artificial way,

\begin{equation} I(x, y, t) = \left( 1 + \cos^2 \left( 2 \pi x \right) \sin^4 \left( 4 \pi y \right) \right) \sin^4 \left( \frac{2 \pi t}{47} \right). \end{equation}

We want to know the total insolation over a 24 hour period, over the area $x \in [0, 1], y \in [0, 1]$. The solution is $H \approx 10.464842515116615$.


In [ ]:
def I_spatial(x, y, t):
    return (1.0 + numpy.cos(2*numpy.pi*x)**2 * numpy.sin(4*numpy.pi*y)**4) * numpy.sin(2*numpy.pi*t/47)**4

In [ ]:
x = numpy.linspace(0, 1)
y = numpy.linspace(0, 1)
X, Y = numpy.meshgrid(x, y)
I10 = I_spatial(X, Y, 10)

In [ ]:
pyplot.figure(figsize=(10,6))
pyplot.contour(X, Y, I10)
pyplot.xlabel(r"$x$")
pyplot.xlabel(r"$y$")
pyplot.title(r"Irradiance at $t=10$ hours")
pyplot.show()

We can perform the integral along each axis:

\begin{equation} H = \int_0^{24} \text{d}t \, \int_0^1 \text{d}y \, \int_0^1 \text{d}x \, I(x, y, t). \end{equation}

Let's do that for the Riemann integral:


In [ ]:
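# One possible sketch: a Riemann sum with 32 subintervals along each axis. (The exact
# error you get will depend on the details of the discretization.)
N = 32
H_exact_3d = 10.464842515116615
x3, dx = numpy.linspace(0, 1, N+1, retstep=True)
y3, dy = numpy.linspace(0, 1, N+1, retstep=True)
t3, dt3 = numpy.linspace(0, 24, N+1, retstep=True)
X3, Y3, T3 = numpy.meshgrid(x3[:-1], y3[:-1], t3[:-1], indexing='ij')
H_riemann_3d = dx * dy * dt3 * numpy.sum(I_spatial(X3, Y3, T3))
print("Riemann:", H_riemann_3d, "error:", abs(H_riemann_3d - H_exact_3d))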

We've only used 32 subintervals in each dimension here, which won't be very accurate. But increasing the accuracy is hard: doubling the number of intervals, in order to double the accuracy, means we have to do eight times as much work. Even using Simpson's rule we'd be in trouble.

This integral was over a standard, "real" space. If we wanted to work out the probability of something happening we'd want to integrate over parameter space. How many parameters would you typically have? As soon as the dimensions of the integral get above $\sim$5 the computational cost of standard integration rapidly spirals out of control.

Monte Carlo integration

Monte Carlo methods use random sampling. When integrating in one dimension, they choose $N$ points $t_j$ from the range of integration at random, and then approximate

\begin{equation} H = \int_a^b \text{d}t \, I(t) \approx \frac{b - a}{N} \sum_{j=1}^N I(t_j). \end{equation}

This may seem like an odd thing to do, but you'd expect it to work given enough points: it's saying that the average value of the integrand multiplied by the width of the interval is the integral.

A quick implementation note. Generating (pseudo)-random numbers is hard. You should never do this yourself, but should always use a library. Thankfully, there are lots of good libraries out there.

Let's repeat our convergence test for the one-dimensional case first.


In [ ]:
from numpy.random import rand

In [ ]:
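# Monte Carlo estimates of the 1d integral for a range of sample sizes.
a, b = 0, 24*365
Ns_mc = 2**numpy.arange(10, 19)
errors_mc = numpy.zeros(len(Ns_mc))
for i, N in enumerate(Ns_mc):
    t_samples = a + (b - a) * rand(N)
    H_mc = (b - a) / N * numpy.sum(I_analytic(t_samples))
    errors_mc[i] = abs(H_mc - H_exact)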


In [ ]:
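# Plot the error against the number of samples; the reference line shows N**(-1/2).
pyplot.figure(figsize=(10,6))
pyplot.loglog(Ns_mc, errors_mc, 'x-', label="Monte Carlo")
pyplot.loglog(Ns_mc, errors_mc[0] * (Ns_mc/Ns_mc[0])**(-0.5), ':', label=r"$\propto N^{-1/2}$")
pyplot.legend()
pyplot.xlabel(r"$N$")
pyplot.ylabel("Error")
pyplot.show()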

This shows a number of interesting, but not particularly nice, features. First, you can see the randomness: every time you run this you get a different answer. Second, the convergence is slow: the Riemann integral, our worst method so far, converged as $E \propto \Delta t \propto N^{-1}$, so to make the error go down by a factor of two you increase the number of points by a factor of two. For the Monte Carlo integral, to make the error go down by a factor of two you have to increase the number of points by a factor of four.

So why should we care? The answer is multiple dimensions: the convergence rate of Monte Carlo is independent of the number of dimensions. For our 3d case above, to make the error of the Riemann integral go down by two you increase the number of points by a factor of eight: twice as many as in the Monte Carlo method. As the number of dimensions increases, all the methods get steadily less practical, except Monte Carlo.

We can check this with our 3d case above.


In [ ]:
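# Monte Carlo in three dimensions, using roughly the same number of points as the
# 32**3 Riemann sum above.
N = 32**3
x_mc, y_mc = rand(N), rand(N)
t_mc = 24 * rand(N)
volume = 1.0 * 1.0 * 24.0
H_mc_3d = volume / N * numpy.sum(I_spatial(x_mc, y_mc, t_mc))
print("Monte Carlo:", H_mc_3d, "error:", abs(H_mc_3d - H_exact_3d))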


In [ ]:
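# Compare the two errors directly.
print("Riemann error:     ", abs(H_riemann_3d - H_exact_3d))
print("Monte Carlo error: ", abs(H_mc_3d - H_exact_3d))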

When we did this with the Riemann integral we used $32^3 \approx 3 \times 10^4$ points and got an error of $\approx 0.7$: for that number of points the Monte Carlo method has gained nearly an order of magnitude of accuracy. As the number of dimensions increases, the advantage of Monte Carlo grows further.

Exercise

The power absorbed by a PV cell depends on the available irradiation, its angle of incidence, and the efficiency of its components. Suppose that this is given by

\begin{equation} P(x, y, t, \theta, \eta_1, \eta_2) = I(x, y, t) \left( 1 + \sin^2 \left( 2 \pi \theta \right) \right) \left( 1 + \frac{\eta_1}{10} \right) \left( 1 + \frac{\eta_2}{5} \right), \end{equation}

where the irradiation is again given by

\begin{equation} I(x, y, t) = \left( 1 + \cos^2 \left( 2 \pi x \right) \sin^4 \left( 4 \pi y \right) \right) \sin^4 \left( \frac{2 \pi t}{47} \right). \end{equation}

The range of values that each parameter can take is $x, y, \eta_1, \eta_2 \in [0, 1], t \in [0, 24], \theta \in [0, \pi]$.

Find the probability that the power absorbed is greater than $2$, where

\begin{equation} \mathbb{P} \left( P > 2 \right) = \frac{ \int_0^{24} \text{d}t \int_0^{\pi} \text{d}\theta \int_0^1 \text{d}\eta_1 \int_0^1 \text{d}\eta_2 \int_0^1 \text{d}x \int_0^1 \text{d}y \,\, \Theta \left[ P \left( x, y, t, \theta, \eta_1, \eta_2 \right) - 2 \right]}{\int_0^{24} \text{d}t \int_0^{\pi} \text{d}\theta \int_0^1 \text{d}\eta_1 \int_0^1 \text{d}\eta_2 \int_0^1 \text{d}x \int_0^1 \text{d}y} \end{equation}

with $\Theta(s)$ being the Heaviside function

\begin{equation} \Theta(s) = \begin{cases} 1 & s > 0 \\ 0 & s < 0. \end{cases} \end{equation}
Solution

In [ ]:
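# A sketch of one possible solution: sample all six parameters uniformly over their
# ranges; the probability is then the fraction of samples where P > 2.
def P_absorbed(x, y, t, theta, eta1, eta2):
    return (I_spatial(x, y, t) * (1.0 + numpy.sin(2*numpy.pi*theta)**2)
            * (1.0 + eta1/10) * (1.0 + eta2/5))

N = 10**6
x_s, y_s, eta1_s, eta2_s = rand(N), rand(N), rand(N), rand(N)
t_s = 24 * rand(N)
theta_s = numpy.pi * rand(N)
P_s = P_absorbed(x_s, y_s, t_s, theta_s, eta1_s, eta2_s)
print("P(P > 2) is approximately", numpy.sum(P_s > 2) / N)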

Extension: Richardson extrapolation

There is a wonderfully effective method for increasing accuracy that relies on estimating the error in our calculation. Called Richardson extrapolation, it works as follows.

Assume we know that our integration method is $s^{\text{th}}$-order accurate, so that

\begin{equation} H^{(\text{true})} = H^{(\Delta t)} + C \left( \Delta t \right)^s + \dots \end{equation}

We assume that the unknown $C$ is constant, and that the higher order terms can be ignored. We can then evaluate our numerical approximation to the integral twice, as

\begin{align} H^{(\text{true})} &= H^{(\Delta t)} + C \left( \Delta t \right)^s \\ H^{(\text{true})} &= H^{(2 \Delta t)} + 2 ^s C \left( \Delta t \right)^s. \end{align}

This is two equations in two unknowns ($C$ and the true solution $H^{(\text{true})}$), which gives us

\begin{equation} H^{(\text{true})} = \frac{2^s H^{(\Delta t)} - H^{(2 \Delta t)}}{2^s - 1}. \end{equation}

Of course, we can't magically use two approximate solutions to find the true, exact solution. What we've actually done is remove the leading order error term. We should properly write this as

\begin{align} H^{(\text{Richardson})} &= \frac{2^s H^{(\Delta t)} - H^{(2 \Delta t)}}{2^s - 1}, \\ H^{(\text{true})} &= H^{(\text{Richardson})} + D \left( \Delta t \right)^{s+1} + \dots \end{align}

We can, of course, then use Richardson extrapolation again to remove the $(s+1)$-order term.

Caveat

Whilst Richardson extrapolation often works very well, there are cases (where the $H$ term being extrapolated behaves awkwardly) where it can give odd or incorrect results. Instead, it's more usual to solve the system to find the leading order error term, by finding $C$. This gives an estimate of the error: if the error is larger than a pre-specified tolerance, the step size can be made smaller. This adaptive quadrature is standard in black-box solvers, and is the reason they require an error tolerance, not a grid spacing.

Exercise

Apply Richardson extrapolation to the one-dimensional problem above, starting from the Riemann integral approximation. How many extrapolations are needed before it's competitive with Simpson's rule?

Solution

In [ ]:
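# A sketch of one possible solution: Riemann approximations on a sequence of grids,
# each halving dt, then repeated Richardson extrapolation of the resulting table.
# Following the notes, each extrapolation is assumed to remove one further power of dt.
def riemann_1d(N):
    t, dt = numpy.linspace(0, 24*365, N+1, retstep=True)
    return dt * numpy.sum(I_analytic(t[:-1]))

H_levels = numpy.array([riemann_1d(N) for N in 2**numpy.arange(10, 19)])
print("No extrapolation, error:", abs(H_levels[-1] - H_exact))
for level in range(1, 4):
    s = level  # order of the leading error term removed at this level
    # combine each approximation (dt) with the one on the grid twice as coarse (2*dt)
    H_levels = (2**s * H_levels[1:] - H_levels[:-1]) / (2**s - 1)
    print("After", level, "extrapolation(s), error:", abs(H_levels[-1] - H_exact))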

Black box methods

In Python, the standard solvers for quadrature are in the scipy.integrate library. The easiest to use is scipy.integrate.quad, although other solvers are needed for multi-dimensional integrals.
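As a quick sketch, applying quad to the test integrand from earlier returns both the value of the integral and an estimate of its error (the limit keyword raises the cap on subdivisions, which this oscillatory integrand needs):


In [ ]:
from scipy import integrate

# quad returns the value of the integral and an estimate of the absolute error.
H_quad, H_quad_error = integrate.quad(I_analytic, 0, 24*365, limit=1000)
print(H_quad, H_quad_error, abs(H_quad - H_exact))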