This is a Jupyter notebook using Python. You can install Jupyter locally to edit and interact with this notebook.
Rootfinding is the process of solving $f(x) = 0$ for $x$. The standard assumption is that $f : \mathbb{R} \to \mathbb{R}$ is continuous. We are interested in developing general-purpose algorithms---those that can use $f(x)$ as a black box without needing to look inside. When we implement our rootfinding algorithm in software, the user will pass a function or program to compute $f(x)$. Rootfinding methods for differentiable functions may also use the derivative $f'(x)$.
Some questions immediately arise:
Let's consider some test functions, defined here along with their derivatives which we'll use later.
In [114]:
%matplotlib notebook
from matplotlib import pyplot
import numpy
tests = []
@tests.append
def f0(x):
    return x*x - 2, 2*x
@tests.append
def f1(x):
    return numpy.cos(x) - x, -numpy.sin(x) - 1
@tests.append
def f2(x):
    return numpy.exp(-numpy.abs(x)) + numpy.sin(x), numpy.exp(-numpy.abs(x))*(-numpy.sign(x)) + numpy.cos(x)
@tests.append
def f3(x):
    return x*x - x + 0.25, 2*x - 1
@tests.append
def f4(x):
    return numpy.tan(x+2), numpy.cos(x+2)**(-2)
    # return numpy.exp(-x*x), numpy.exp(-x*x)*(-2*x)  # unreachable second return, left commented out
@tests.append
def f5(x):
    return (x <= 1)*1.0, 0*x
@tests.append
def f6(x):
    return x*numpy.sin(5/x), numpy.sin(5/x) - numpy.cos(5/x)*5/x

x = numpy.linspace(-2,2,100)
pyplot.plot(x, 0*x, color='k')
for f in tests:
    pyplot.plot(x, f(x)[0], label=f.__name__)
pyplot.legend(loc='upper right')
pyplot.style.use('ggplot')
pyplot.show()
Bisection is a rootfinding technique that starts with an interval $[a,b]$ containing a root and does not require derivatives.
In [65]:
def hasroot(f, a, b):
    return f(a)[0]*f(b)[0] < 0

def bisect(f, a, b, verbose=False):
    mid = (a + b)/2.
    if b-a < 1e-5:
        return mid
    if verbose:
        print('bisect', mid)
    if hasroot(f, a, mid):
        return bisect(f, a, mid, verbose)
    else:
        return bisect(f, mid, b, verbose)
The bisect function relies on the sign-change test hasroot defined above. Let's try running it:
In [66]:
bisect(tests[0], 0, 2)
Out[66]:
In [67]:
numpy.sqrt(2) - bisect(tests[0], 0, 2)
Out[67]:
We get about 5 digits of accuracy. Why? How fast did we get there?
In [68]:
bisect(tests[0], 0, 2, verbose=True)
Out[68]:
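As a back-of-the-envelope check (added here as a sketch, not an original cell), the hard-coded stopping tolerance explains both the accuracy and the iteration count.

import math
# bisect stops once the bracket is narrower than 1e-5, so the returned midpoint
# is within 5e-6 of the root -- roughly 5 correct digits.
width0, tol = 2.0, 1e-5                    # initial bracket [0, 2] and the tolerance in bisect
print(math.ceil(math.log2(width0 / tol)))  # about 18 halvings are needed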
Can you find any problems with this implementation? List them below:
Let's try running it on the rest of the test problem set:
In [69]:
for f in tests:
    print(f.__name__, bisect(f, -2, 2.1))
What's going wrong here? How can we improve the implementation and what are fundamental limitations of the algorithm?
Let's quantitatively revisit the convergence rate. A convergent rootfinding algorithm produces a sequence of approximations $x_i$ such that $$\lim_{i \to \infty} x_i = x_*$$ where $f(x_*) = 0$. For analysis, it is convenient to define the errors $e_i = x_i - x_*$. We say that an algorithm is linearly convergent if $$\lim_{i \to \infty} |e_{i+1}| / |e_i| = \rho < 1.$$ A smaller convergence factor $\rho$ represents faster convergence.
What is $\rho$ for bisection?
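One way to see the answer empirically is to track the bracket width; the sketch below (bisect_history is an added helper, not part of the original notebook) shows that the width is cut in half at every step.

def bisect_history(f, a, b, widths=None):
    # Same logic as bisect above, but records the bracket width at each level of recursion.
    if widths is None:
        widths = []
    widths.append(b - a)
    mid = (a + b)/2.
    if b - a < 1e-5:
        return mid, widths
    if hasroot(f, a, mid):
        return bisect_history(f, a, mid, widths)
    return bisect_history(f, mid, b, widths)

_, widths = bisect_history(tests[0], 0, 2)
print([widths[i+1]/widths[i] for i in range(len(widths)-1)])  # each ratio is ~0.5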
Much of numerical analysis reduces to Taylor series, the approximation $$ f(x) = f(x_0) + f'(x_0) (x-x_0) + f''(x_0) (x - x_0)^2 / 2 + \dotsb $$ centered on some reference point $x_0$. In numerical computation, it is exceedingly rare to look beyond the first-order approximation $$ \tilde f_{x_0}(x) = f(x_0) + f'(x_0)(x - x_0) . $$ Since $\tilde f_{x_0}(x)$ is a linear function, we can explicitly compute the unique solution of $\tilde f_{x_0}(x) = 0$ as $$ x = x_0 - f(x_0) / f'(x_0) . $$ This is Newton's Method (aka Newton-Raphson or Newton-Raphson-Simpson) for finding the roots of differentiable functions.
In [116]:
def newton(f, x, verbose=False):
    for i in range(100):
        fx, dfx = f(x)
        if verbose:
            print(f.__name__, i, x, fx)
        if numpy.abs(fx) < 1e-12:
            return x, fx, i
        try:
            x -= fx / dfx
        except ZeroDivisionError:
            return x, numpy.NaN, i

for f in tests:
    print(f.__name__, newton(f, 1))
In [71]:
for f in tests:
    print(f.__name__, '{0:15.12f} {1:8.2e} {2:2d}'.format(*newton(f, -0.1)))
We would like to know sharp conditions on when Newton-type algorithms converge and, if so, how fast. This theory will build on that for a general Fixed Point Iteration $x_{i+1} = g(x_i)$ where $g$ is a continuously differentiable function. Suppose that there exists a fixed point $r = g(r)$. By the mean value theorem, we have that $$ x_{i+1} - r = g(x_i) - g(r) = g'(c_i) (x_i - r) $$ for some $c_i$ with $|c_i - r| < |x_i - r|$. In other words, $|e_{i+1}| = |g'(c_i)| |e_i|$, which converges to zero if $|g'(c_i)| < 1$. If $|g'(r)| < 1$ then for any $\epsilon > 0$ there is a neighborhood of $r$ such that $|g'(c)| < |g'(r)| + \epsilon$ for all $c$ in that neighborhood. Consequently, we have:
If $g$ is continuously differentiable, $r = g(r)$, and $|g'(r)| < 1$ then the fixed point iteration $x_{i+1} = g(x_i)$ is locally linearly convergent with rate $|g'(r)|$.
In other words, $$ x_{i+1} = x_i - ??? . $$
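As a concrete check of the fixed point theorem above (an added sketch; $g(x) = \cos x$ is a standard example with fixed point $r \approx 0.739$ and $|g'(r)| = |\sin r| \approx 0.67$):

g = numpy.cos
r = 1.0
for i in range(200):
    r = g(r)              # converge to the fixed point r = cos(r) to machine precision
xi = 1.0
errors = []
for i in range(20):
    xi = g(xi)
    errors.append(abs(xi - r))
print([errors[i+1]/errors[i] for i in range(5, 10)])  # ratios approach |g'(r)| = |sin(r)|
print(abs(numpy.sin(r)))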
The following code appeared literally (including comments) in the Quake III Arena source code (late 1990s).
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
We now have vector instructions for approximate inverse square root. More at https://en.wikipedia.org/wiki/Fast_inverse_square_root
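The refinement line y = y*(threehalfs - x2*y*y) is exactly one Newton step for $f(y) = 1/y^2 - a$, whose root is $1/\sqrt{a}$; the bit-level trick only supplies the initial guess. A hedged Python sketch (added here, not from the original notebook):

def rsqrt_newton_step(y, a):
    # Newton for f(y) = 1/y**2 - a: f'(y) = -2/y**3, so
    # y - f(y)/f'(y) = y + (y - a*y**3)/2 = y*(1.5 - 0.5*a*y*y)
    return y*(1.5 - 0.5*a*y*y)

a = 2.0
y = 0.7               # crude initial guess for 1/sqrt(2) ~ 0.7071
for i in range(4):
    y = rsqrt_newton_step(y, a)
    print(y - 1/numpy.sqrt(a))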
In [72]:
def fquartic(x):
    return (x - 0.9)**4, 4*(x - 0.9)**3

newton(fquartic, 0)
Out[72]:
We only get three digits correct despite a very small residual (and it takes many iterations to get there).
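A short calculation (added here) explains both observations: the residual tolerance only controls the error through the fourth power, $$ |f(x)| = |x - 0.9|^4 < 10^{-12} \quad\Longrightarrow\quad |x - 0.9| < 10^{-3}, $$ i.e., about three digits. Moreover, at a root of multiplicity $m$, Newton's method is only linearly convergent with factor $(m-1)/m$ (here $3/4$), which is why so many iterations are needed.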
Difficulty computing zeros of polynomials can also arise when all the roots are simple. For example, the Wilkinson polynomial
$$ \prod_{i=1}^{20} (x - i) = \sum_{i=0}^{20} a_i x^i $$
has roots at each of the positive integers up to 20, but the roots are extremely sensitive to perturbations of the coefficients $a_i$, as shown in this figure from Trefethen and Bau (1999).
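A small sketch (added here; not part of the original notebook) reproduces the sensitivity with numpy.poly and numpy.roots:

numpy.random.seed(2)
wilkinson = numpy.poly(numpy.arange(1, 21))     # coefficients a_i of prod_{i=1}^{20} (x - i)
perturbed = wilkinson * (1 + 1e-10*numpy.random.randn(wilkinson.size))
print(numpy.roots(perturbed))  # the larger roots typically move far from the integers, some becoming complex pairs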
Numerical difficulties in which "correct" algorithms produce unreliable solutions almost always stem from lack of stability and/or ill conditioning.
Consider a function $f: X \to Y$ and define the absolute condition number $$ \hat\kappa = \lim_{\delta \to 0} \max_{|\delta x| < \delta} \frac{|f(x + \delta x) - f(x)|}{|\delta x|} = \max_{\delta x} \frac{|\delta f|}{|\delta x|}. $$ If $f$ is differentiable, then $\hat\kappa = |f'(x)|$.
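For example (an added one-line illustration), $f(x) = \sqrt{x}$ has $$ \hat\kappa = |f'(x)| = \frac{1}{2\sqrt{x}} , $$ which grows without bound as $x \to 0^+$ even though the function values themselves stay small.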
Floating point arithmetic $x \circledast y := \text{float}(x * y)$ is exact within a relative accuracy $\epsilon_{\text{machine}}$. Formally, $$ x \circledast y = (x * y) (1 + \epsilon) $$ for some $|\epsilon| \le \epsilon_{\text{machine}}$.
In [73]:
format((.2 - 1/3) + 2/15, '.20f')
# format((.2 - 1/3) + (1/3 - 0.2), '.20f')
# format((1 + 1e-12) - 1, '.20f')
Out[73]:
In [74]:
eps = 1
while 1 + eps > 1:
    eps /= 2
eps_machine = eps  # We call this "machine epsilon"
In [75]:
numpy.log(1 + 1e-12) - numpy.log1p(1e-12)
Out[75]:
In [76]:
(numpy.log(1 + 1e-12) - numpy.log1p(1e-12)) / numpy.log1p(1e-12)
Out[76]:
In [77]:
x = numpy.array([1,1e5,1e10,1e15])
numpy.sin(numpy.pi*x)
Out[77]:
In [78]:
numpy.sin(x)**2 + numpy.cos(x)**2 - 1
Out[78]:
In [79]:
[numpy.tan((3.14159+eps)/2) for eps in [1e-6,1e-8]], 1/numpy.cos(3.14159)**2
Out[79]:
Given the relative nature of floating point arithmetic, it is more useful to discuss the relative condition number, $$ \kappa = \max_{\delta x} \frac{|\delta f|/|f|}{|\delta x|/|x|} = \max_{\delta x} \Big[ \frac{|\delta f|/|\delta x|}{|f| / |x|} \Big] $$ or, if $f$ is differentiable, $$ \kappa = |f'(x)| \frac{|x|}{|f(x)|} . $$
How does a condition number get big?
The relative accuracy of the best-case algorithm will not be reliably better than $\epsilon_{\text{machine}}$ times the condition number. $$ \max_{\delta x} \frac{|\delta f|}{|f|} \ge \kappa \cdot \epsilon_{\text{machine}} $$
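As an illustration (an added sketch): $f(x) = x + 1$ has $\kappa = |x|/|x+1|$, which blows up near $x = -1$, and the computed result degrades accordingly.

# Forming x_demo rounds to the nearest double, a relative error of about eps_machine;
# since kappa(x -> x + 1) ~ 1/|x + 1| ~ 1e13 here, that small input error is amplified.
x_demo = -1 + 1e-13
print(x_demo + 1, abs((x_demo + 1) - 1e-13)/1e-13)  # relative error ~ kappa*eps_machine, far above eps_machine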
Suppose we want to apply Newton's method to a function that we know how to evaluate, but don't have code to differentiate. This is often because it's difficult/error-prone to write or because the interface by which we call it does not support derivatives. (Commercial packages often fall in this category.)
In [80]:
def diff(f, x, epsilon=1e-5):
    return (f(x + epsilon) - f(x)) / epsilon

diff(numpy.sin, 0.7, 1e-8) - numpy.cos(0.7)
Out[80]:
In [81]:
x = .5
diff(numpy.tan, x) - 1/numpy.cos(x)**2
Out[81]:
In [104]:
x = 3.14/2
[(eps, diff(numpy.tan, x, eps) - 1/numpy.cos(x)**2) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4]]
Out[104]:
In [108]:
x = 1e4
[(eps, diff(numpy.log, x, eps) - 1/x) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]
Out[108]:
In [109]:
def diff_wp(f, x, eps=1e-8):
    """Numerical derivative with Walker and Pernice (1998) choice of step"""
    h = eps * (1 + abs(x))
    return (f(x+h) - f(x)) / h

x = 1
[(eps, diff_wp(numpy.log, x, eps) - 1/x) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]
Out[109]:
In [110]:
x = 1e-4
[(eps, diff_wp(numpy.log, x, eps) - 1/x) for eps in [1e-14, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-2]]
Out[110]:
This algorithm is imperfect, leaving some scaling responsibility to the user. It is the default in PETSc's "matrix-free" Newton-type solvers.
In [86]:
numpy.log(-1)
Out[86]:
In [95]:
x = numpy.sqrt(-1)
1/numpy.inf
Out[95]:
In [99]:
x = numpy.linspace(0,3,100)
pyplot.plot(x, numpy.sqrt(x + 1e-1))
pyplot.show()
In [100]:
numpy.tan(1e100)
Out[100]:
In [102]:
numpy.tan(1e100*(1 + 2*eps_machine))
Out[102]:
In [103]:
1e100 - 1e100*(1 + 2*eps_machine)
Out[103]:
In [123]:
x = numpy.linspace(-5,5)
pyplot.figure()
f = numpy.sin(x) + 1.1*x
pyplot.plot(x, f)
Out[123]:
The diff and diff_wp functions use a "forward difference" formula: $\tilde f'(x) := (f(x+h) - f(x))/h$.
Using the Taylor expansion of $f(x+h)$, we compute the discretization error
$$ \begin{split} \frac{f(x+h) - f(x)}{h} - f'(x) &= \frac{f(x) + f'(x) h + f''(x) h^2/2 + O(h^3) - f(x)}{h} - f'(x) \\
&= \frac{f'(x) h + f''(x) h^2/2 + O(h^3)}{h} - f'(x) \\
&= f''(x) h/2 + O(h^2)
\end{split} . $$
This is the discretization error caused by choosing a finite (not infinitesimal) differencing parameter $h$, and the leading order term depends linearly on $h$.
We have an additional source of error, rounding error, which comes from not being able to compute $f(x)$ or $f(x+h)$ exactly, nor subtract them exactly. Suppose that we can, however, compute these functions with a relative error on the order of $\epsilon_{\text{machine}}$. This leads to $$ \begin{split} \tilde f(x) &= f(x)(1 + \epsilon_1) \\ \tilde f(x \oplus h) &= \tilde f((x+h)(1 + \epsilon_2)) \\ &= f((x + h)(1 + \epsilon_2))(1 + \epsilon_3) \\ &= [f(x+h) + f'(x+h)(x+h)\epsilon_2 + O(\epsilon_2^2)](1 + \epsilon_3) \\ &= f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) \end{split} $$
where each $\epsilon_i$ is an independent relative error on the order of $\epsilon_{\text{machine}}$ and we have used a Taylor expansion at $x+h$ to approximate $f(x \oplus h)$. We thus write the rounding error in the forward difference approximation as $$ \begin{split} \left\lvert \frac{\tilde f(x+h) \ominus \tilde f(x)}{h} - \frac{f(x+h) - f(x)}{h} \right\rvert &= \left\lvert \frac{f(x+h)(1 + \epsilon_3) + f'(x+h)x\epsilon_2 + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h) - f(x)(1 + \epsilon_1) - f(x+h) + f(x)}{h} \right\rvert \\ &\le \frac{|f(x+h)\epsilon_3| + |f'(x+h)x\epsilon_2| + |f(x)\epsilon_1| + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}}h)}{h} \\ &\le \frac{\big(2 \max_{[x,x+h]} |f| + \max_{[x,x+h]} |f'x|\big) \epsilon_{\text{machine}} + O(\epsilon_{\text{machine}}^2 + \epsilon_{\text{machine}} h)}{h} \\ &= (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h} + O(\epsilon_{\text{machine}}) \end{split} $$ where we have assumed that $h \ge \epsilon_{\text{machine}}$. Because this bound scales like $\epsilon_{\text{machine}}/h$, the rounding error becomes large (relative to $f'$ -- we are concerned with relative error after all) as $h \to 0$.
Suppose we would like to choose $h$ to minimize the combined discretization and rounding error, $$ h^* = \arg\min_h | f''(x) h/2 | + (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h} $$ (dropping the higher order terms), which we can compute by differentiating with respect to $h$ and setting the result equal to zero, $$ |f''|/2 - (2\max|f| + \max|f'x|) \frac{\epsilon_{\text{machine}}}{h^2} = 0 , $$ which can be rearranged as $$ h^* = \sqrt{\frac{4\max|f| + 2\max|f'x|}{|f''|}} \sqrt{\epsilon_{\text{machine}}} .$$ Of course this formula is of little use for choosing $h$ in practice: the whole point is to compute $f'$, which we don't yet know, much less $f''$. However, it does have value:
1e-8 (i.e., $\sqrt{\epsilon_{\text{machine}}}$) was empirically found to be about optimal for well-behaved functions.

Instead of the forward difference approximation $$ \frac{f(x+h) - f(x)}{h} $$ we could use the centered difference formula, $$ \frac{f(x+h) - f(x-h)}{2h} . $$ (One way to derive this formula is to average a forward and a backward difference. We will learn a more general method later in the course when we do interpolation.) We can compute the discretization error by Taylor expansion, $$ \frac{f(x) + f'(x)h + f''(x)h^2/2 + f'''(x)h^3/6 - f(x) + f'(x)h - f''(x)h^2/2 + f'''(x) h^3/6 + O(h^4)}{2h} = f'(x) + f'''(x)h^2/6 + O(h^3) $$ showing that the leading error term is of order $h^2$, versus order $h$ for forward differences. A similar computation including rounding error finds that the optimal $h$ is now of order $\sqrt[3]{\epsilon_{\text{machine}}}$, so the best attainable accuracy is $\epsilon_{\text{machine}}^{2/3}$. This accuracy improvement (versus $\sqrt{\epsilon_{\text{machine}}}$) is significant, but we'll also see that it is twice as expensive when computing derivatives of multivariate functions.
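A hedged comparison sketch (added here; diff_centered is a new helper, not part of the notebook) sweeps $h$ for both formulas on $\sin$ at $x=1$; the forward difference is best near $h \approx \sqrt{\epsilon_{\text{machine}}} \approx 10^{-8}$, while the centered difference is best near $h \approx \epsilon_{\text{machine}}^{1/3} \approx 10^{-5}$ with error around $\epsilon_{\text{machine}}^{2/3}$.

def diff_centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2*h)

xc = 1.0
for h in [1e-12, 1e-10, 1e-8, 1e-6, 1e-5, 1e-4, 1e-2]:
    fwd = (numpy.sin(xc + h) - numpy.sin(xc)) / h - numpy.cos(xc)   # forward difference error
    ctr = diff_centered(numpy.sin, xc, h) - numpy.cos(xc)           # centered difference error
    print('{0:8.0e} {1:10.2e} {2:10.2e}'.format(h, fwd, ctr))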
We use the notation $\tilde f(x)$ to mean a numerical algorithm for approximating $f(x)$. Additionally, $\tilde x = x (1 + \epsilon)$ is some "good" approximation of the exact input $x$.
"nearly the right answer to nearly the right question" $$ \frac{\lvert \tilde f(x) - f(\tilde x) \rvert}{| f(\tilde x) |} \in O(\epsilon_{\text{machine}}) $$ for some $\tilde x$ that is close to $x$
"exactly the right answer to nearly the right question" $$ \tilde f(x) = f(\tilde x) $$ for some $\tilde x$ that is close to $x$
The algorithm computes $$\tilde f(x) = \text{float}(x) \oplus 1 = [x(1+\epsilon_1) + 1](1 + \epsilon_2) = (x + 1 + x\epsilon_1)(1 + \epsilon_2) $$ and we can express any $\tilde x = x(1 + \epsilon_3)$. To see if the algorithm is stable, we compute $$ \frac{\tilde f(x) - f(\tilde x)}{|f(\tilde x)|} = \frac{(x + 1 + x\epsilon_1)(1 + \epsilon_2) - [x(1+ \epsilon_3) + 1]}{\tilde x + 1} = \frac{(x + 1)\epsilon_2 + x(\epsilon_1 - \epsilon_3) + O(\epsilon^2)}{x + 1 + x\epsilon_3} . $$ If we can choose $\epsilon_3$ to make this small, then the method will be (forward) stable, and if we can make this expression exactly zero, then we'll have backward stability. Trying for the latter, we solve for $\epsilon_3$ by setting the numerator equal to zero, $$ \epsilon_3 = \frac{x + 1}{x}\epsilon_2 + \epsilon_1 + O(\epsilon^2)/x $$ which is small so long as $|x| \gg 0$, but the first term blows up as $x \to 0$. In other words, the fact that $\epsilon_2$ can produce a large error relative to the input causes this algorithm to not be backward stable. In contrast, this $x\to 0$ case is not a problem for forward stability because $\epsilon_3 = \epsilon_1$ yields error on the order of $\epsilon_2$.
Now we are interested in $$ \frac{\tilde f(x,y) - f(\tilde x,\tilde y)}{f(\tilde x,\tilde y)} $$ and we can vary both $\tilde x$ and $\tilde y$. If we choose $y=1$, then the ability to vary $\tilde y$ is powerful enough to ensure backward stability.
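A one-line calculation (added here, following the standard argument) shows why: $$ \text{fl}(x) \oplus \text{fl}(y) = \big(x(1+\epsilon_1) + y(1+\epsilon_2)\big)(1+\epsilon_3) = \underbrace{x(1+\epsilon_1)(1+\epsilon_3)}_{\tilde x} + \underbrace{y(1+\epsilon_2)(1+\epsilon_3)}_{\tilde y} , $$ so the computed sum is the exact sum of inputs $\tilde x, \tilde y$ that are each within $O(\epsilon_{\text{machine}})$ of the true inputs; with $y = 1$ (stored exactly) this gives $\tilde y = 1 + \epsilon_3$.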
A backward stable algorithm for computing $f(x)$ has relative accuracy $$ \left\lvert \frac{\tilde f(x) - f(x)}{f(x)} \right\rvert \in O(\kappa(f) \epsilon_{\text{machine}}) . $$ This is a rewording of a statement made earlier -- backward stability is the best case.