Quick recap: the key feature is the Ito stochastic integral
\begin{equation} \int_{t_0}^t G(t') \, dW(t') = \text{mean-square-}\lim_{n\to +\infty} \left\{ \sum_{i=1}^n G(t_{i-1}) \left( W_{t_i} - W_{t_{i-1}} \right) \right\} \end{equation}where the key point for the Ito integral is that the integrand $G$ in each term of the sum is evaluated at the left end of the interval ($t_{i-1}$).
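To see why the evaluation point matters, here is a minimal single-path sketch (taking $G = W$ and $t_0 = 0$ purely for the demonstration) comparing the left-endpoint (Ito) sum with a right-endpoint sum:

```python
import numpy as np

# Single-path illustration (G = W, t0 = 0 assumed for the demo):
# the value of the discrete sum depends on where G is evaluated.
rng = np.random.default_rng(0)
t, n = 1.0, 10**5
dW = rng.normal(0.0, np.sqrt(t / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))  # W at the grid points

left = np.sum(W[:-1] * dW)   # Ito: G evaluated at the left endpoint
right = np.sum(W[1:] * dW)   # right endpoint, for comparison
print(right - left)          # = sum(dW^2) ~ t: the choice matters
```

The gap between the two conventions is $\sum_i (\Delta W_i)^2 \approx t$, which does not vanish as $n \to \infty$, so the endpoint choice genuinely changes the limit.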
Now we use this to write down the SDE
\begin{equation} dX_t = f(X_t) \, dt + g(X_t) \, dW_t \end{equation}with formal solution
\begin{equation} X_t = X_0 + \int_0^t f(X_s) \, ds + \int_0^t g(X_s) \, dW_s. \end{equation}Using the Ito stochastic integral formula we get the Euler-Maruyama method
\begin{equation} X_{n+1} = X_n + h f(X_n) + \sqrt{h} \xi_n \, g(X_n) \end{equation}by applying the integral form over a single step $[t_n, t_{n+1}]$ with $t_{n+1} = t_n + h$, approximating $f$ and $g$ by their values at the left endpoint. Here $h$ is the width of the interval and $\xi_n$ is a standard normal random variable, $\xi_n \sim N(0, 1)$.
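A direct translation of this update into code might look like the following sketch (the Ornstein-Uhlenbeck drift and diffusion in the usage line are placeholder choices, not part of the notes above):

```python
import numpy as np

def euler_maruyama(f, g, x0, t0, t1, h, rng):
    """Integrate dX = f(X) dt + g(X) dW from t0 to t1 with step h."""
    n = int(round((t1 - t0) / h))
    x = np.empty(n + 1)
    x[0] = x0
    xi = rng.normal(size=n)  # xi_n ~ N(0, 1), one draw per step
    for k in range(n):
        x[k + 1] = x[k] + h * f(x[k]) + np.sqrt(h) * xi[k] * g(x[k])
    return x

# Example: an Ornstein-Uhlenbeck process, dX = -X dt + dW.
path = euler_maruyama(lambda x: -x, lambda x: 1.0, x0=1.0,
                      t0=0.0, t1=1.0, h=1e-3,
                      rng=np.random.default_rng(0))
```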
There are two ways to talk about errors here, and the difference lies in how we treat the realization: each realization corresponds to a different Brownian path $W$.
If we fix one realization - a single Brownian path $W$ - then we can vary the step size $h$. This gives the strong order of convergence: how the error varies with $h$ along a fixed Brownian path.
The other question is how the error behaves when we average over all possible realizations - that is, the error in expectation values. This is the weak order of convergence.
Formally, denote the true solution at time $T$ as $X(T)$ and the numerical solution for a given step length $h$ as $X^h(T)$. The method has strong order of convergence $\gamma$ if
\begin{equation} \mathbb{E} \left| X^h(T) - X(T) \right| \leq C h^{\gamma}, \end{equation}and weak order of convergence $\gamma$ if, for suitable smooth test functions $\phi$,
\begin{equation} \left| \mathbb{E} \, \phi \left( X^h(T) \right) - \mathbb{E} \, \phi \left( X(T) \right) \right| \leq C h^{\gamma}. \end{equation}For Euler-Maruyama the strong order is $1/2$ and the weak order is $1$.
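As a sketch of how the strong order might be measured empirically (the choice of geometric Brownian motion, $dX = \lambda X \, dt + \mu X \, dW$, and all parameter values below are illustrative assumptions; it is convenient because the exact solution $X_T = X_0 \exp \left( (\lambda - \mu^2/2) T + \mu W_T \right)$ is known in closed form):

```python
import numpy as np

# Strong-order check for Euler-Maruyama on geometric Brownian motion:
# dX = lam * X dt + mu * X dW has the exact solution
# X_T = X_0 exp((lam - mu^2/2) T + mu W_T).
rng = np.random.default_rng(0)
lam, mu, x0, T = 2.0, 1.0, 1.0, 1.0
n_fine, n_paths = 2**10, 1000
dt = T / n_fine

errors = {R: [] for R in (2, 4, 8, 16, 32)}  # coarse steps h = R * dt
for _ in range(n_paths):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_fine)
    x_true = x0 * np.exp((lam - 0.5 * mu**2) * T + mu * dW.sum())
    for R in errors:
        h = R * dt
        x = x0
        # Reuse the same Brownian path: sum the fine increments into
        # coarse ones so the comparison is pathwise (strong error).
        for inc in dW.reshape(-1, R).sum(axis=1):
            x = x + h * lam * x + mu * x * inc
        errors[R].append(abs(x - x_true))

for R, errs in sorted(errors.items()):
    print(f"h = {R * dt:.5f}   E|X^h(T) - X(T)| = {np.mean(errs):.4f}")
# The mean error should scale roughly like h^(1/2): strong order 1/2.
```

Replacing the pathwise error $|X^h(T) - X(T)|$ with the difference of averages $|\mathbb{E} \, \phi(X^h(T)) - \mathbb{E} \, \phi(X(T))|$ for a test function $\phi$ would measure the weak order instead.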
For our purposes we just need that $dW^2 = dt$ (this follows from our definition of the Brownian path, changing notation for the increment from $h$ to $dt$), which means that, keeping only leading-order terms, $dW^{2+N} = 0$ for all $N > 0$: after averaging, the higher moments are too small to contribute anything on the timescales we are interested in.
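A quick numerical sanity check of these multiplication rules (the sample size and the value of $dt$ are arbitrary choices for the demo):

```python
import numpy as np

# dW ~ N(0, dt), so E[dW^2] = dt while higher moments are o(dt).
rng = np.random.default_rng(1)
dt = 1e-4
dW = rng.normal(0.0, np.sqrt(dt), size=10**6)
print(np.mean(dW**2), dt)          # matches dt
print(np.mean(dW**3))              # ~ 0
print(np.mean(dW**4), 3 * dt**2)   # O(dt^2): negligible against dt
```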
If
\begin{equation} \frac{dX}{dt} = f(X_t) \end{equation}and we want to find the differential equation satisfied by $h(X(t))$ (or $h(X_t)$; note that this $h$ is an arbitrary smooth function, not the step size from the Euler-Maruyama section), then we write
\begin{align} d h(X_t) &= h \left( X(t) + dX(t) \right) - h(X(t)) \\ &\simeq h(X(t)) + dX \, h'(X(t)) + \frac{1}{2} (dX)^2 \, h''(X(t)) + \dots - h(X(t)) \\ &\simeq f(X) h'(X) \, dt + \frac{1}{2} (f(X))^2 h''(X) \, (dt)^2 + \dots \\ \implies \frac{d h(X)}{dt} &= f(X) h'(X), \end{align}since the $(dt)^2$ term drops at leading order. Now run through the same steps using the equation
\begin{equation} dX = f(X)\, dt + g(X) \, dW. \end{equation}We find
\begin{align} d h &\simeq h'(X(t)) \, dX + \frac{1}{2} h''(X(t)) (dX)^2 + \dots \\ &\simeq h'(X) f(X) \, dt + h'(X) g(X) \, dW + \frac{1}{2} h''(X) \left( f^2(X) \, dt^2 + 2 f(X) g(X) \, dt \, dW + g^2(X) \, dW^2 \right) \\ \implies d h &= \left( f(X) h'(X) + \frac{1}{2} h''(X) g^2(X) \right) \, dt + h'(X) g(X) \, dW, \end{align}using $dW^2 = dt$ and dropping the higher-order $dt^2$ and $dt \, dW$ terms. This additional $\frac{1}{2} h'' g^2$ term (the Ito correction) makes all the difference when deriving numerical methods, where the chain rule is used repeatedly.
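As a standard sanity check of this formula (this example is not in the notes above, but it is the same geometric Brownian motion used in the convergence sketch earlier): take
\begin{equation} dX = \lambda X \, dt + \mu X \, dW, \quad h(X) = \ln X \implies d h = \left( \lambda - \frac{1}{2} \mu^2 \right) dt + \mu \, dW, \end{equation}since $h' = 1/X$ and $h'' = -1/X^2$ give $f h' + \frac{1}{2} h'' g^2 = \lambda - \frac{1}{2} \mu^2$. Integrating directly yields the exact solution $X_t = X_0 \exp \left( (\lambda - \mu^2/2) t + \mu W_t \right)$ quoted earlier; the naive chain rule would miss the $-\mu^2/2$ shift.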
Remember that
\begin{equation} \int_{t_0}^t W_s \, dW_s = \frac{1}{2} W^2_t - \frac{1}{2} W^2_{t_0} - \frac{1}{2} (t - t_0). \end{equation}From this we need to identify the stochastic differential equation, and also the function $h$, that will give us this result just from the chain rule.
The SDE is
\begin{equation} dX_t = dW_t, \quad f(X) = 0, \quad g(X) = 1. \end{equation}Writing the chain rule down in the form
\begin{equation} h(X_t) = h(X_{t_0}) + \int_{t_0}^t \left( f(X_s) h'(X_s) + \frac{1}{2} h''(X_s) g^2(X_s) \right) \, ds + \int_{t_0}^t h'(X_s) g(X_s) \, dW_s. \end{equation}Matching the final term (the integral over $dW_s$) we see that we need $h'(X)$ to be proportional to $X$, or
\begin{equation} h = X^2, \quad dX_t = dW_t, \quad f(X) = 0, \quad g(X) = 1. \end{equation}With $X_t = W_t$ we therefore have
\begin{align} W_t^2 &= W_{t_0}^2 + \int_{t_0}^t \frac{1}{2} \cdot 2 \, ds + \int_{t_0}^t 2 W_s \, dW_s \\ &= W_{t_0}^2 + (t - t_0) + \int_{t_0}^t 2 W_s \, dW_s, \end{align}and rearranging gives the integral above, as required.
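A pathwise numerical check of this identity, reusing the left-endpoint sum from the start of the section (with $t_0 = 0$ and $W_0 = 0$ as assumptions for the demo):

```python
import numpy as np

# Verify int_0^t W_s dW_s = W_t^2/2 - t/2 along a single sampled path.
rng = np.random.default_rng(2)
t, n = 1.0, 10**6
dW = rng.normal(0.0, np.sqrt(t / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))  # W on the grid, W_0 = 0

lhs = np.sum(W[:-1] * dW)        # Ito sum, G = W at left endpoints
rhs = 0.5 * W[-1]**2 - 0.5 * t
print(lhs, rhs)                  # agree up to the discretization error
```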