```
In [0]:
```!python -V

```
```

```
In [0]:
```#!pip3 install torch torchvision
import torch

```
In [0]:
```print("PyTorch version: ")
torch.__version__

```
Out[0]:
```

```
In [0]:
```print("Device Name: ")
torch.cuda.get_device_name(0)

```
Out[0]:
```

```
In [0]:
```print("CUDA Version: ")
print(torch.version.cuda)
print("cuDNN version is: ")
print(torch.backends.cudnn.version())

```
```

```
In [0]:
```# NVIDIA System Management Interface: reports the available GPU, driver version, and memory usage
!nvidia-smi

```
In [0]:
```!cat /proc/cpuinfo

```
In [0]:
```!cat /proc/meminfo

The inputs, outputs, and transformations within neural networks are all represented using tensors, and as a result, neural network programming utilizes tensors heavily.

The concept of a tensor is a mathematical generalization of other more specific concepts. Let’s look at some specific instances of tensors.

**Specific instances of tensors**

Each of these examples is a specific instance of the more general concept of a tensor:

- number
- scalar
- array
- vector
- 2d-array
- matrix

Let’s organize the above list of example tensors into two groups:

- number, array, 2d-array
- scalar, vector, matrix

The terms in the first group (number, array, 2d-array) are typically used in computer science, while the terms in the second group (scalar, vector, matrix) are typically used in mathematics.

Tensors can be created using the following **syntax**.

```
In [0]:
```c = torch.tensor([[1,2],[1,2]])
d = torch.tensor([[[1,2],[3,4]],[[5,6],[7,8]]])

```
In [0]:
```a = torch.tensor([0,1,2,3,4])

```
In [0]:
```print("Type of entire array:")
type(a)

```
Out[0]:
```

```
In [0]:
```print("Datatype of tensor:")
a.type()

```
Out[0]:
```

Every torch.Tensor has these attributes:

- torch.dtype
- torch.device
- torch.layout

Looking at our Tensor a, we can see the following default attribute values:

```
In [0]:
```print(a.dtype)
print(a.device)
print(a.layout)

```
```

Tensor datatypes are given in the table below. Note that torch.Tensor is an alias for the default tensor type (torch.FloatTensor).

Data type | dtype | CPU tensor | GPU tensor |
---|---|---|---|
32-bit floating point | torch.float32 or torch.float | torch.FloatTensor | torch.cuda.FloatTensor |
64-bit floating point | torch.float64 or torch.double | torch.DoubleTensor | torch.cuda.DoubleTensor |
16-bit floating point | torch.float16 or torch.half | torch.HalfTensor | torch.cuda.HalfTensor |
8-bit integer (unsigned) | torch.uint8 | torch.ByteTensor | torch.cuda.ByteTensor |
8-bit integer (signed) | torch.int8 | torch.CharTensor | torch.cuda.CharTensor |
16-bit integer (signed) | torch.int16 or torch.short | torch.ShortTensor | torch.cuda.ShortTensor |
32-bit integer (signed) | torch.int32 or torch.int | torch.IntTensor | torch.cuda.IntTensor |
64-bit integer (signed) | torch.int64 or torch.long | torch.LongTensor | torch.cuda.LongTensor |

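**Type casting** is also supported. In the example below, the torch.FloatTensor constructor stores the given integers as 32-bit floats, and the .type() method converts an existing tensor to another type. Given a torch.device object, we can also create a tensor directly on that device by passing the device to the tensor’s constructor, e.g. torch.tensor(data, device=device).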

```
In [0]:
```b = torch.FloatTensor([0,1,2,3,4])
print("Type of b: ")
b.type()

```
Out[0]:
```

```
In [0]:
```a = a.type(torch.FloatTensor)
print("Type of a: ")
a.type()

```
Out[0]:
```

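Notice how each type in the table has a CPU and GPU version. One thing to keep in mind about tensor data types is that older PyTorch versions required the operands of a tensor operation to have the same dtype. Recent versions promote mixed dtypes automatically for most element-wise arithmetic, but some operations (for example torch.dot and torch.mm) still expect matching dtypes.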
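As a minimal sketch of how dtypes interact, the snippet below mixes an integer tensor and a floating point tensor. It assumes a PyTorch version with automatic type promotion for element-wise arithmetic; the variable names are illustrative only. Because torch.dot expects matching dtypes, an explicit cast with .to() is shown as well.

```python
import torch

ints = torch.tensor([1, 2, 3])           # dtype inferred as torch.int64
floats = torch.tensor([0.5, 0.5, 0.5])   # dtype inferred as torch.float32

# Element-wise arithmetic promotes the integer operand to float32
# (assumption: a PyTorch version that supports type promotion).
mixed = ints + floats
print(mixed.dtype)

# torch.dot expects both operands to share a dtype, so we cast explicitly.
ints_as_float = ints.to(torch.float32)
dot = torch.dot(ints_as_float, floats)
print(dot)
```

Casting explicitly keeps the result dtype predictable regardless of the promotion rules in the installed version.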

The device, **cpu** in our case, specifies the device (CPU or GPU) where the tensor's **data** is **allocated**. This determines where **tensor computations** for the given tensor will be performed.

PyTorch supports the use of multiple devices, and they are specified using an index like so:

```
In [0]:
```device = torch.device('cuda:0')
device

```
Out[0]:
```

One thing to keep in mind about using **multiple devices** is that tensor operations between **tensors** must happen between tensors that **exist** on the **same device**.
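The same-device requirement can be sketched as follows. The snippet falls back to the CPU when no CUDA device is available, so it should run on either kind of machine; the tensor names are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is present.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

on_cpu = torch.tensor([1.0, 2.0, 3.0])                    # allocated on the CPU
on_device = torch.tensor([4.0, 5.0, 6.0], device=device)  # allocated on `device`

# Move the CPU tensor to the same device before combining the two tensors.
total = on_cpu.to(device) + on_device
print(total.device, total)
```

Passing device= to the constructor allocates the data on that device directly, while .to(device) returns a tensor on the target device (the same tensor if it is already there).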

Tensors have a **torch.layout**

The **layout, strided** in our case, specifies how the tensor is **stored** **in memory**.

As neural network programmers, we need to be aware of the following:

- Tensors contain data of a uniform type (dtype).
- Tensor computations between tensors depend on the dtype and the device.

**Creating tensors using data**

These are the primary ways of creating tensor objects (instances of the torch.Tensor class), with data (array-like) in PyTorch:

- torch.Tensor(data)
- torch.tensor(data)
- torch.as_tensor(data)
- torch.from_numpy(data)

```
In [0]:
```import numpy as np
datanp = np.array([1,2,3])
type(datanp)

```
Out[0]:
```

Note that PyTorch tensors can also be created by passing the tensor’s dimensions to the torch.Tensor constructor (for example torch.Tensor(2, 3)), which returns a tensor occupying an uninitialized region of memory.

Now, let’s create our tensors with each of these options 1-4, and have a look at what we get:

```
In [0]:
```o1 = torch.Tensor(datanp)
o2 = torch.tensor(datanp)
o3 = torch.as_tensor(datanp)
o4 = torch.from_numpy(datanp)
print(o1)
print(o2)
print(o3)
print(o4)
print(o1.dtype)
print(o2.dtype)
print(o3.dtype)
print(o4.dtype)

```
```

All of the options (o1, o2, o3, o4) appear to have produced the same tensors except for the first one.

The first option (o1) has dots after the number indicating that the numbers are floats, while the next three options have a type of int64.

The difference here arises from the fact that the **torch.Tensor() constructor uses the default dtype** when building the tensor.

The other calls **choose a dtype based on the incoming data**. This is called **type inference**. Note that the **dtype** can also be **explicitly set** for torch.tensor() and torch.as_tensor() by specifying the dtype as an argument (torch.from_numpy() takes no dtype argument):

```
In [0]:
```print(torch.tensor(datanp, dtype=torch.float32))
print(torch.as_tensor(datanp, dtype=torch.float32))

```
```

**Memory sharing in tensors**

Note that originally we had datanp[0]=1. In the cell below we change only the data in the original numpy.ndarray; we don't explicitly make any changes to our tensors (o1, o2, o3, o4).

However, after setting datanp[0]=0, we can see that some of our tensors have changed.

```
In [0]:
```print('old:', datanp)
datanp[0] = 0
print('new:', datanp)
print(o1)
print(o2)
print(o3)
print(o4)

```
```

The first two o1 and o2 still have the original value of 1 for index 0, while the second two o3 and o4 have the new value of 0 for index 0.

This happens because **torch.Tensor() and torch.tensor() copy** their input data while **torch.as_tensor() and torch.from_numpy() share** their input data in memory with the original input object.

This sharing just means that the actual data in memory exists in a single place. As a result, any changes that occur in the underlying data will be reflected in both objects.

Sharing data is more efficient and uses less memory than copying data because the data is not written to two locations in memory.

This establishes that torch.as_tensor() and torch.from_numpy() both share memory with their input data.

The torch.from_numpy() function only accepts numpy.ndarrays, while the torch.as_tensor() function accepts a wide variety of Python array-like objects including other PyTorch tensors. For this reason, torch.as_tensor() is the winning choice in the memory sharing game.

Given all of these details, these two are the best options:

- torch.tensor()
- torch.as_tensor()

The torch.tensor() call is the go-to call, while torch.as_tensor() should be employed when tuning our code for performance.
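One caveat worth sketching: torch.as_tensor() can only share memory when no conversion is needed. In the illustrative snippet below (names are ours, not from the original notebook), requesting a different dtype forces a copy, so that tensor no longer follows changes to the NumPy array.

```python
import numpy as np
import torch

arr = np.array([1, 2, 3])

copied = torch.tensor(arr)                             # always copies the data
shared = torch.as_tensor(arr)                          # same dtype: shares memory with arr
converted = torch.as_tensor(arr, dtype=torch.float32)  # dtype change forces a copy

arr[0] = 0  # modify only the NumPy array

print(copied)     # still starts with 1
print(shared)     # now starts with 0
print(converted)  # still starts with 1.
```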

The concepts of rank, axes, and shape are the tensor attributes that will concern us most in deep learning.

- Rank
- Axes
- Shape

These concepts build on one another, starting with rank, then axes, and building up to shape, so keep an eye out for the relationship between these three.

**Rank of a tensor**

The rank of a tensor refers to the number of dimensions present within the tensor. Suppose we are told that we have a rank-2 tensor. This means all of the following:

- We have a matrix
- We have a 2d-array
- We have a 2d-tensor

We are introducing the word **rank** here because it is commonly used in deep learning when referring to the number of **dimensions** present within a given tensor.

The rank of a tensor tells us **how many indexes are required to access (refer to) a specific data element** contained within the tensor data structure.
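A small sketch of this relationship between rank and indexing, using an illustrative rank-2 tensor: two indexes reach a single element, while one index leaves a rank-1 tensor.

```python
import torch

matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6]])

rank = matrix.dim()      # number of axes of the tensor
element = matrix[1, 2]   # two indexes reach a single element
row = matrix[1]          # one index leaves a rank-1 tensor

print(rank, element, row.dim(), element.dim())
```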

A tensor's rank tells us how many indexes are needed to refer to a specific element within the tensor.

The dimension and size of any tensor can be found using the methods shown below.

```
In [0]:
```s1d = torch.tensor([1,9,1,1])
s2d = torch.tensor([[1,9,1,1], [4,5,6,7]])
#s3d = torch.tensor([ [[1,9,1,1],[4,5,6,7]] , [[2,10,2,2],[8,9,10,11]] ])
s3d = torch.tensor([
[
[1,9,1,1],
[4,5,6,7]
],
[
[2,10,2,2],
[8,9,10,11]
]
])
s4d = torch.tensor([[[[4,1]]]])

*The number of dimensions corresponds to the depth of the nested lists*

```
In [0]:
```print("The dimension of tensors s1d, s2d, s3d,s4d are:")
print(s1d.ndimension(),s2d.ndimension(),s3d.ndimension(),s4d.ndimension())

```
```

**Axes of a tensor**

If we have a tensor, and we want to refer to a specific dimension, we use the word axis in deep learning.

An **axis** of a tensor is a **specific dimension of a tensor**.

If we say that a tensor is a **rank 2 tensor**, we mean that the tensor has 2 dimensions, or **equivalently**, the tensor has **two axes**.

Elements are said to exist or run along an axis. This running is constrained by the length of each axis.

The **length of each axis** tells us how many **indexes are available along each axis**.

```
In [0]:
```dd = torch.tensor([
[1,2,3],
[4,5,6],
[7,8,9]])

```
In [0]:
```#Each element along the first axis, is an array:
print(dd[0])
print(dd[1])
print(dd[2])

```
```

```
In [0]:
```#Each element along the second axis, is a number:
print(dd[0][0])
print(dd[1][0])
print(dd[2][0])
print(dd[0][1])
print(dd[1][1])
print(dd[2][1])
print(dd[0][2])
print(dd[1][2])
print(dd[2][2])

```
```

Note that, with tensors, the elements of the last axis are always numbers (scalar). Every other axis will contain n-dimensional arrays.

The rank of a tensor tells us how many axes a tensor has, and the length of these axes leads us to the very important concept known as the shape of a tensor.

**Shape of a tensor**

The **shape** of a tensor is determined by the **length of each axis**, so if we know the shape of a given tensor, then we know the length of each axis, and this tells us how many indexes are available along each axis.

The shape of a tensor gives us the length of each axis of the tensor. Note that, in PyTorch, size and shape of a tensor are the same thing.

```
In [0]:
```s1d = torch.tensor([1,9,1,1])
s2d = torch.tensor([[1,9,1,1], [4,5,6,7]])
#s3d = torch.tensor([ [[1,9,1,1],[4,5,6,7]] , [[2,10,2,2],[8,9,10,11]] ])
s3d = torch.tensor([
[
[1,9,1,1],
[4,5,6,7]
],
[
[2,10,2,2],
[8,9,10,11]
]
])
s4d = torch.tensor([[[[4,1]]]])

```
In [0]:
```print("The size of tensor s3d, i.e. two sets of [2,4] tensors:")
s3d.size()

```
Out[0]:
```

```
In [0]:
```print("The shape of tensor s3d is:")
s3d.shape

```
Out[0]:
```

**.shape is an alias for .size()**

```
In [0]:
```import torch
import numpy as np
import pandas as pd
nparray = np.array([0.0,1.0,2.0,3.0,4.0])
print("\n Original Array in numpy: ",nparray)
torcharray = torch.from_numpy(nparray)
print("\n Torch Array from numpy: ",torcharray)
back_to_nparray = torcharray.numpy()
print("\n Numpy array converted back from torch: ",back_to_nparray)
pdarray = pd.Series([19.0,0.1,5.8,2.3])
print("\n Original Array in Pandas: \n",pdarray)
ptorcharray = torch.from_numpy(pdarray.values) #pdarray.values returns a numpy array
print("\n Torch Array from pandas: ",ptorcharray)

```
```

We use square brackets to access different elements of a tensor.

Depending on the number of dimensions of the tensor, we use the following notation for indexing:

$$tensor[dim_0, dim_1, dim_2, \cdots, dim_n]$$

Similarly, slices are taken depending on the rank of the tensor.

All index values of a slice are optional. The defaults are the start of the sequence, the end of the sequence, and a step size of one. The step cannot be zero and, unlike Python lists and NumPy arrays, PyTorch tensors do not accept a negative step; to reverse the order of the elements, use torch.flip() instead.

$$tensor[slice_0, slice_1, slice_2, \cdots, slice_n]$$

where

- [start:end:step] slices from start to end-1, stepping ahead by step,
- [::] selects all elements along that axis,
- n represents the number of dimensions of the original tensor.
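The slice notation can be sketched on a small 1-D tensor and a 2-D grid. The names are illustrative; torch.flip is used for reversal because PyTorch rejects negative slice steps.

```python
import torch

t = torch.arange(10)                  # tensor([0, 1, ..., 9])

every_other = t[2:8:2]                # start=2, end=8 (exclusive), step=2
first_three = t[:3]                   # start defaults to 0
whole = t[:]                          # all elements
reversed_t = torch.flip(t, dims=[0])  # reversal without a negative step

grid = torch.arange(12).reshape(3, 4)
block = grid[0:2, 1:3]                # rows 0-1, columns 1-2

print(every_other, first_three, reversed_t)
print(block)
```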

```
In [0]:
```#!pip3 install torch torchvision
import torch
a = torch.tensor([[9,8,7],[6,5,4],[3,2,1]])
#[ [a[0][0],a[0][1],a[0][2]], [a[1][0],a[1][1],a[1][2]],[a[2][0],a[2][1],a[2][2]] ]
print("\nGiven array: \n",a)
print("\nIndexing: ")
print(a[0,0]) # Single bracket notation for accessing individual elements
print(a[0][0]) # Double bracket notation for accessing individual elements
print(a[0])
print(a[1])
print("\nSlicing: ")
slice1 = a[0:3,0:2]
print(slice1)
print("\n")
slice1[0][0]=0
print(slice1)
print("\nMixing Indexing and Slicing:")
print(a[0,0:2])

```
```

**Indexing** with an integer selects an item and drops the indexed axis, so the returned item has a lower rank than the original tensor. Simply put, it **does not preserve** the dimensionality of the original tensor.

**Slicing preserves** the data structure, and the returned item has the same kind of structure that the original contained.

**Slices** reference a subset of the array **without copying** the underlying data.

Since slices are references to memory in original array, changing values in a slice also changes the original array.

Fancy indexing is used when ordinary slicing and indexing operations cannot retrieve the values we want. It creates new copies of the data, selected in one of two ways:

- By specifying non-sequential integer locations of data
- By applying masks

Fancy indexing is like the simple indexing, but we pass arrays of indices in place of single scalars.
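The difference between indexing, slicing, and fancy indexing can be sketched on a small matrix. This is an illustrative example with names of our own choosing: the slice writes through to the original tensor, while the fancy-indexed result is an independent copy.

```python
import torch

a = torch.tensor([[9, 8, 7],
                  [6, 5, 4],
                  [3, 2, 1]])

indexed = a[0]     # integer index: the first axis is dropped
sliced = a[0:1]    # slice: the first axis is kept with length 1
print(indexed.shape, sliced.shape)

view = a[:, 0:2]   # slicing returns a view on the same memory
view[0, 0] = 0     # this write is visible in `a`

picked = a[[0, 2]]  # fancy indexing returns a copy
picked[0, 0] = -1   # this write does not touch `a`

print(a)
```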

```
In [0]:
```#!pip3 install torch torchvision
import torch
fancy = torch.tensor([78, 55, 82, 93])
print(fancy[0],fancy[1],fancy[2]) # Suppose we want to access three different elements. We could do it like this.
print([fancy[0],fancy[1],fancy[2]]) # Alternatively, we can pass a single list or array of indices to obtain the same
fancy_index = [0,1,2]
print(fancy[fancy_index]) # When using fancy indexing, the shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed

```
```

```
In [0]:
```print(fancy>78)
print(fancy[fancy>78])
print(fancy.gt(78))

```
```

We can:

- Combine fancy indexing with simple indices
- Combine fancy indexing with slicing
- Combine fancy indexing with masking

```
In [0]:
```r = torch.tensor([[0.0,1.0,2.0,3.0],
[4.0,5.0,6.0,7.0],
[8.0,9.0,10.0,11.0]])
print(r)
print("\nCombine fancy indexing and simple indices :", r[2, [2, 0, 1]])
print("\nCombine fancy indexing with slicing: ", r[1:, [2, 0, 1]])
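row_mask = torch.tensor([True, False, True])  # boolean mask that keeps rows 0 and 2
print("\nCombine fancy indexing with masking: ", r[row_mask][:, [2, 0, 1]])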

```
```

In practice, one will most often want to use one of PyTorch’s functions that return tensors initialized in a certain manner, such as:

- torch.rand: values initialized from a random uniform distribution,
- torch.randn: values initialized from a random normal distribution,
- torch.eye(n): an $n×n$ identity matrix,
- torch.from_numpy(ndarray): a PyTorch tensor from a NumPy ndarray,
- torch.linspace(start, end, steps): a 1-D tensor with steps values spaced linearly between start and end,
- torch.ones : a tensor with ones everywhere,
- torch.zeros_like(other) : a tensor with the same shape as other and zeros everywhere,
- torch.arange(start, end, step): a 1-D tensor with values filled from a range.
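A brief sketch of the factory functions listed above. The shapes and the seed are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)  # make the random draws repeatable

uniform = torch.rand(2, 3)              # samples from U[0, 1)
normal = torch.randn(2, 3)              # samples from N(0, 1)
identity = torch.eye(3)                 # 3x3 identity matrix
steps = torch.linspace(0, 1, steps=5)   # 5 evenly spaced values from 0 to 1
ones = torch.ones(2, 2)                 # ones everywhere
zeros = torch.zeros_like(ones)          # same shape as `ones`, zeros everywhere
evens = torch.arange(0, 10, 2)          # 0, 2, 4, 6, 8 (end is exclusive)

for name, value in [("rand", uniform), ("randn", normal), ("eye", identity),
                    ("linspace", steps), ("zeros_like", zeros), ("arange", evens)]:
    print(name, value, sep="\n")
```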

PyTorch tensors provide a very rich API for combination with other tensors as well as in-place mutation.

Also like NumPy, unary and binary operations can usually be performed via functions in the torch module, like torch.add(x, y), or directly via methods on the tensor objects, like x.add(y).

For the usual suspects, operator overloads like x + y exist. Furthermore, many functions have in-place alternatives that will mutate the receiver instance rather than creating a new tensor. These functions have the same name as the out-of-place variants, but are suffixed with an underscore, e.g. x.add_(y).

A selection of operations includes:

- torch.add(x, y): elementwise addition,
- torch.mm(x, y): matrix multiplication (not matmul or dot),
- torch.mul(x, y): elementwise multiplication,
- torch.exp(x): elementwise exponential,
- torch.pow(x, power): elementwise exponentiation,
- torch.sqrt(x): elementwise square root,
- x.sqrt_(): in-place elementwise square root,
- torch.sigmoid(x): elementwise sigmoid,
- torch.prod(x): product of all values (torch.cumprod(x, dim) returns the cumulative product along dim),
- torch.sum(x): sum of all values,
- torch.std(x): standard deviation of all values,
- torch.mean(x): mean of all values.
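A short sketch exercising a few of these operations on two illustrative 2x2 tensors, including the function, method, operator, and in-place forms of addition.

```python
import torch

x = torch.tensor([[1.0, 4.0],
                  [9.0, 16.0]])
y = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

added_fn = torch.add(x, y)   # function form
added_method = x.add(y)      # method form
added_op = x + y             # operator overload

z = x.clone()
z.add_(y)                    # in-place variant mutates z

product = torch.mm(x, y)     # matrix multiplication
root = torch.sqrt(x)         # elementwise square root
total_product = torch.prod(y)       # product of all values
running = torch.cumprod(y, dim=1)   # cumulative product along each row

print(product, root, total_product, running, sep="\n")
```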

Tensors support many of the familiar semantics of NumPy ndarrays, such as broadcasting, advanced (fancy) indexing (x[x > 5]) and elementwise relational operators (x > y).

PyTorch tensors can also be converted to NumPy ndarrays directly via the torch.Tensor.numpy() function.

Finally, since the primary improvement of PyTorch tensors over NumPy ndarrays is supposed to be GPU acceleration, there is also a torch.Tensor.cuda() function, which will copy the tensor memory onto a CUDA-capable GPU device, if one is available.

Before we dive in with specific tensor operations, let’s get a quick overview of the landscape by looking at the main operation categories that encompass the operations we’ll cover. We have the following high-level categories of operations:

- Reshaping operations
- Element-wise operations
- Reduction operations
- Access operations

There are a lot of individual operations out there, so much so that it can sometimes be intimidating when you're just beginning, but grouping similar operations into categories based on their likeness can help make learning about tensor operations more manageable.

The shape also encodes all of the relevant information about axes, rank, and therefore indexes. Additionally, one of the types of operations we must perform frequently when we are programming our neural networks is called reshaping.

Reshaping changes the grouping of the terms but does not change the underlying terms themselves.

For reshaping, the product of the component values in the shape must equal the total number of elements in the tensor, so that there are enough positions inside the tensor data structure to contain all of the original data elements after the reshaping.

Reshaping changes the shape but not the underlying data elements.

```
In [0]:
```t = torch.tensor([
[1,1,1,1],
[2,2,2,2],
[3,3,3,3]
], dtype=torch.float32)
print(t.size())
print(t.shape) #shape is an alias for size()

```
```

Typically, after we know a tensor’s shape, we can deduce a couple of things. First, we can deduce the tensor's rank. The rank of a tensor is equal to the length of the tensor's shape.

We can also deduce the number of elements contained within the tensor. The number of elements inside a tensor (12 in our case) is equal to the product of the shape's component values.

```
In [0]:
```print(len(t.shape))
print(torch.tensor(t.shape).prod())
print(t.numel()) #In PyTorch, there is a dedicated function for determining the number of elements inside a tensor

```
```

Since the above tensor has 12 elements, any reshaping must account for exactly 12 elements.

```
In [0]:
```print(t.reshape([1,12]))
print(t.reshape([2,6]))
print(t.reshape([3,4]))
print(t.reshape([4,3]))
print(t.reshape(6,2))
print(t.reshape(12,1))

```
```

```
In [0]:
```reshapea = t.reshape([1,12])
print(reshapea)
print("\n")
reshapea[0][0]=0
print(reshapea)
print(t)

```
```

```
In [0]:
```#!pip3 install torch torchvision
import torch
v = torch.Tensor([9,8,7,6,5,4])
v_col= v.view(6,1)
print("\nView 6x1: \n",v_col)
v_col1= v.view(3,-1)
print("\nView 3x2: \n",v_col1)

```
```

```
In [0]:
```print(v)
v_col1[0][0]=0
print(v_col1)

```
```

We can also change the shape of our tensors by squeezing and unsqueezing them.

- Squeezing a tensor **removes** the **dimensions** or axes that have a **length of one**.
- Unsqueezing a tensor adds a dimension with a length of one.

These functions allow us to expand or shrink the rank (number of dimensions) of our tensor.

```
In [0]:
```print(t.reshape([1,12]))
print(t.reshape([1,12]).shape)
print(t.reshape([1,12]).squeeze())
print(t.reshape([1,12]).squeeze().shape)

```
```

```
In [0]:
```print(t.reshape([1,12]).squeeze().unsqueeze(dim=0))
print(t.reshape([1,12]).squeeze().unsqueeze(dim=0).shape)

```
```

**Flatten a tensor**

A flatten operation on a tensor reshapes the tensor to have a shape that is equal to the number of elements contained in the tensor. This is the same thing as a 1d-array of elements.

Flattening a tensor means to remove all of the dimensions except for one.

```
In [0]:
```f = torch.ones(4, 3)
print(f)
print(torch.flatten(f)) #https://pytorch.org/docs/master/torch.html
print("\n")
f1 = torch.tensor([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
print("Original Array: \n",f1)
print("\n")
print(torch.flatten(f1))
print(torch.flatten(f1, start_dim=1))

```
```

Most of the time we deal with batches of color images, and we don’t want to flatten the whole tensor. We only want to flatten the image tensors within the batch tensor, leaving the batch dimension intact.

$$(Batch Size,\; Channels,\; Height,\; Width)$$

We could thus call $flatten(start\_dim=1)$, which tells the flatten() method at which axis it should start the flatten operation.

The one here is an index, so it’s the second axis, which is the color channel axis. We **skip over the batch axis**, so to speak, leaving it intact.
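As a minimal sketch, assume a hypothetical batch of 8 RGB images of 28x28 pixels (the sizes are made up for illustration). Flattening from axis 1 keeps the batch axis and merges the channel, height, and width axes.

```python
import torch

# Hypothetical batch: 8 images, 3 color channels, 28x28 pixels each.
batch = torch.zeros(8, 3, 28, 28)

flat = batch.flatten(start_dim=1)  # keep axis 0, merge axes 1..3
print(flat.shape)                  # one row of 3*28*28 values per image
```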

**Concatenating tensors**

```
In [0]:
```t1 = torch.tensor([
[1,2],
[3,4]
])
t2 = torch.tensor([
[5,6],
[7,8]
])
# We can combine t1 and t2 row-wise (axis-0) in the following way:
print(torch.cat((t1, t2), dim=0))
# We can combine them column-wise (axis-1) like this:
print(torch.cat((t1, t2), dim=1))
print(torch.cat((t1, t2), dim=0).shape)
print(torch.cat((t1, t2), dim=1).shape)

```
```

torch.cat() concatenates tensors along an existing dimension and torch.stack() stacks tensors along a new dimension.

torch.cat requires the arguments to have the same shape except along the concatenating dimension, and torch.stack requires all arguments to have exactly the same shape. Neither implements broadcasting. Current PyTorch versions raise a RuntimeError when the shapes do not match; very old releases did not always enforce this sufficiently, which could lead to incorrect values.
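The shape requirement can be sketched with two illustrative tensors that differ only in their second axis. The behaviour shown assumes a reasonably recent PyTorch version, where mismatched shapes raise a RuntimeError.

```python
import torch

a = torch.zeros(2, 3)
b = torch.zeros(2, 4)

# Shapes differ only along dim=1, so concatenating along dim=1 is allowed.
wide = torch.cat((a, b), dim=1)
print(wide.shape)

errors = []
try:
    torch.cat((a, b), dim=0)   # dim=1 sizes differ (3 vs 4)
except RuntimeError:
    errors.append("cat")
try:
    torch.stack((a, b))        # stack needs identical shapes
except RuntimeError:
    errors.append("stack")
print(errors)
```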

```
In [0]:
```import torch
a = torch.randn(1, 4)
b = torch.randn(1, 4)
print(a)
print(b)
print("\n")
abcat = torch.cat((a, b), 0)
print(abcat)
print(abcat.shape)
print("\n")
abstack = torch.stack((a, b),0) # The last argument is the dimension to insert. It has to be between 0 and the number of dimensions of the stacked tensors (inclusive)
print(abstack)
print(abstack.shape)

```
```

```
In [0]:
```tensor_one = torch.tensor([[1,2,3],[4,5,6]])
print(tensor_one.shape)
tensor_two = torch.tensor([[7,8,9],[10,11,12]])
print(tensor_two.shape)
tensor_tre = torch.tensor([[13,14,15],[16,17,18]])
print(tensor_tre.shape)
tensor_list = [tensor_one, tensor_two, tensor_tre]
stacked_tensor = torch.stack(tensor_list)
print(stacked_tensor.shape)

```
```

Our initial three tensors were all of shape 2x3. So the default of torch.stack is that it’s going to insert a new dimension in front of the 2x3, so we’re going to end up with a 3x2x3 tensor.

The reason it’s 3 is because we have three tensors in this list we are converting to one tensor.

An element-wise operation is an operation between two tensors that operates on corresponding elements within the respective tensors.

Two elements are said to be corresponding if the two elements occupy the same position within the tensor. The position is determined by the indexes used to locate each element.
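A minimal sketch of corresponding elements, using two illustrative tensors of the same shape: each entry of the sum is computed from the two entries found at the same indexes.

```python
import torch

t1 = torch.tensor([[1.0, 2.0],
                   [3.0, 4.0]])
t2 = torch.tensor([[9.0, 8.0],
                   [7.0, 6.0]])

total = t1 + t2  # element-wise addition of two tensors

# The element at position [0][1] is the sum of the elements at [0][1].
print(total[0][1], t1[0][1] + t2[0][1])
```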

```
In [0]:
```ele = torch.tensor([
[1,2],
[3,4]],dtype=torch.float32)
# (1) Using these symbolic operations:
print(ele + 2)
print(ele - 2)
print(ele * 2)
print(ele / 2)
print("\n")
# (2) these built-in tensor object methods:
print(ele.add(2))
print(ele.sub(2))
print(ele.mul(2))
print(ele.div(2))

```
```

**Comparison operations** are also element-wise. For a given comparison operation between tensors, a new tensor of the same shape is returned with each element containing a boolean value (recent PyTorch versions return a torch.bool tensor, while older versions returned 0 and 1).

- False (0) if the comparison between corresponding elements is False.
- True (1) if the comparison between corresponding elements is True.

```
In [0]:
```t = torch.tensor([
[0,5,0],
[6,0,7],
[0,8,0]], dtype=torch.float32)
print(t.eq(0))
print(t.ge(0))
print(t.gt(0))
print(t.lt(0))
print(t.le(7))

```
```

There are some other ways to refer to element-wise operations, and all of these mean the same thing:

- Element-wise
- Component-wise
- Point-wise

```
In [0]:
```print(t.abs())
print(t.sqrt())
print(t.neg())
print(t.neg().abs())

```
```

**Broadcasting of Tensors**

Tensors of different dimensionality can be combined in the same expression. Tensors with fewer dimensions are broadcast to match the larger tensors, without copying data. Shapes are compared dimension by dimension, starting from the trailing (last) dimension. Broadcasting has two rules.

Rule 1: For broadcasting to be successful, the dimensions of the tensors need to be compatible. Two dimensions are **compatible** when they are **equal**.

Rule 2: Two dimensions of different tensors are also compatible when one of the dimensions is 1.

The resulting tensor size is the maximum size along each dimension of the input tensors. In other words, given x of shape (3,4) and y of shape (4,), the maximum size along each dimension of x and y is taken, so the resulting tensor has shape (3,4).
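Both rules can be sketched with a column of shape (3,1) and a row of shape (1,4), followed by a pair of shapes that is not compatible. The tensors are illustrative.

```python
import torch

col = torch.arange(3).reshape(3, 1)   # shape (3, 1)
row = torch.arange(4).reshape(1, 4)   # shape (1, 4)

# Each size-1 dimension is stretched, giving shape (3, 4); table[i, j] = i + j.
table = col + row
print(table)

incompatible = False
try:
    # Trailing dimensions are 4 and 3: neither equal nor 1.
    torch.ones(3, 4) + torch.ones(3)
except RuntimeError:
    incompatible = True
print(incompatible)
```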

```
In [0]:
```x = torch.ones((3,4))
y = torch.arange(4,dtype=torch.float32)
print(x)
print(y)
print("\n")
print(x-y)

```
```

```
In [0]:
```red = torch.tensor([
[0,1,0],
[2,0,2],
[0,3,0]], dtype=torch.float32)
print(red.sum())
print("\nThe number of elements in original tensor",red.numel())
print("\nConcating Operations: ")
print(red.sum().numel())
print(red.sum().numel() < red.numel())

```
```

```
In [0]:
```print(red.sum())
print(red.prod())
print(red.mean())
print(red.std())

```
```

**Reduction by Axis**

```
In [0]:
```ra = torch.tensor([
[1,1,1,1],
[2,2,2,2],
[3,3,3,3]], dtype=torch.float32)
print(ra.size()) # 3 x 4 rank-2 tensor
print(ra.sum(dim=0)) # Sum along the axis containing 3 elements i.e, column wise
print(ra.sum(dim=1)) # Sum along the axis containing 4 elements i.e, row wise

```
```

**Argmax tensor reduction operation**

Argmax returns the index location of the maximum value inside a tensor.

When we call the argmax() method on a tensor, the tensor is reduced to a new tensor that contains an index value indicating where the max value is inside the tensor. When no axis is specified, this index refers to the position in the flattened tensor.
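Because argmax() without an axis returns an index into the flattened tensor, that index can be converted back to row and column coordinates. The sketch below does this with Python's divmod for a rank-2 tensor; it is one possible approach rather than a dedicated PyTorch API.

```python
import torch

xa = torch.tensor([[1, 0, 0, 2],
                   [0, 3, 3, 0],
                   [4, 0, 0, 5]], dtype=torch.float32)

flat_index = xa.argmax().item()  # position in the flattened tensor

# Recover (row, column) from the flat index using the row length.
row, col = divmod(flat_index, xa.shape[1])
print(flat_index, (row, col), xa[row, col])
```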

```
In [0]:
```xa = torch.tensor([
[1,0,0,2],
[0,3,3,0],
[4,0,0,5]], dtype=torch.float32)
print(xa.max())
print(xa.argmax())
print(xa.flatten())

```
```

**Argmax along Specific axes**

Notice how the call to the max() method returns two tensors. The first tensor contains the max values and the second tensor contains the index locations for the max values along dim0.

The max values are determined by taking the element-wise maximum across each array running across the first axis.

```
In [0]:
```print(xa.max(dim=0))
print(xa.argmax(dim=0))
print(xa.max(dim=1))
print(xa.argmax(dim=1))

```
```

Operations can return scalar valued tensors. If we want to actually get the value as a number, we use the item() tensor method. This works for scalar valued tensors.

When multiple values are returned, we can access the numeric values by transforming the output tensor into a Python list or a NumPy array.

```
In [0]:
```ta = torch.tensor([
[1,2,3],
[4,5,6],
[7,8,9]], dtype=torch.float32)
print(ta.mean())
print(ta.mean().item())
print(ta.mean(dim=0).tolist())
print(ta.mean(dim=0).numpy())

```
```

```
In [0]:
```a = torch.randn(4)
print(a)
torch.clamp(a, min=0, max=1) # Values below 0 become 0, values above 1 become 1, and the rest are left as is

```
Out[0]:
```

```
In [0]:
```a = torch.tensor([[2,4,3],[5,6,1]])
b = torch.tensor([[10,11,10],[10,2,8]])
torch.mul(a, b) # Each element of the input tensor is multiplied by the corresponding element of the other tensor; the shapes must be broadcastable

```
Out[0]:
```

```
In [0]:
```torch.dot(torch.tensor([2, 3]), torch.tensor([2, 1])) # This function does not broadcast.

```
Out[0]:
```

```
In [0]:
```mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)
print(torch.mm(mat1, mat2)) # Matrix multiplication. This function does not broadcast. For broadcasting matrix products, use torch.matmul().

```
```

```
In [0]:
```print(mat1 @ mat2) # A @ B is the matrix product

```
```

```
In [0]:
```mat1 = torch.randn(3, 3)
mat2 = torch.randn(3, 3)
print(mat1 * mat2) # A * B the element-wise product

```
```

Visit https://pytorch.org/docs/master/torch.html and prepare a list of all available functions and operators in pytorch.

A key insight from calculus is that the gradient indicates the rate of change of the loss, or the slope of the loss function w.r.t. the weights and biases.

```
If a gradient element is positive,
increasing the element's value slightly will increase the loss.
decreasing the element's value slightly will decrease the loss.
If a gradient element is negative,
increasing the element's value slightly will decrease the loss.
decreasing the element's value slightly will increase the loss.
```

The increase or decrease is proportional to the value of the gradient.
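The effect of the gradient's sign can be sketched on a toy loss with a single weight. The loss function, starting value, and step size below are assumptions chosen for illustration.

```python
import torch

# Toy loss with its minimum at w = 1 (an illustrative choice).
w = torch.tensor(3.0, requires_grad=True)
loss = (w - 1.0) ** 2
loss.backward()

print(w.grad)  # d(loss)/dw = 2 * (w - 1), positive at w = 3

# A positive gradient means decreasing w slightly decreases the loss.
with torch.no_grad():
    w_new = w - 0.1 * w.grad
    new_loss = (w_new - 1.0) ** 2

print(loss.item(), new_loss.item())
```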

To optimize neural networks, we need to **calculate derivatives**, and to do this computationally, deep learning frameworks use what are called **computational graphs**.

Computational graphs are used to graph the function operations that occur on tensors inside neural networks.

These graphs are then used to compute the derivatives needed to optimize the neural network. PyTorch uses a computational graph that is called a **dynamic computational graph**. This means that the graph is **generated** **on the fly** as the operations are created.

This is in contrast to static graphs that are fully determined before the actual operations occur.

Package | Description |
---|---|
torch | The top-level PyTorch package and tensor library. |
torch.nn | A subpackage that contains modules and extensible classes for building neural networks. |
torch.autograd | A subpackage that supports all the differentiable Tensor operations in PyTorch. |
torch.nn.functional | A functional interface that contains typical operations used for building neural networks like loss functions, activation functions, and convolution operations. |
torch.optim | A subpackage that contains standard optimization operations like SGD and Adam. |
torch.utils | A subpackage that contains utility classes like data sets and data loaders that make data preprocessing easier. |
torchvision | A package that provides access to popular datasets, model architectures, and image transformations for computer vision. |

In the computational graph,

- a **node is an array** and
- an **edge is an operation** on the array.

To make a computational graph, we make a node by wrapping an array inside the torch.tensor() function. All operations that we do on this node from then on will be defined as edges in the computational graph. The edges of the graph also result in new nodes in the computational graph.

Each node in the graph has

- a **.data property**, which is a multi-dimensional array, and
- a **.grad property**, which is its gradient with respect to some scalar value (.grad is also a tensor itself).

After defining the computational graph, we can calculate gradients of the output with respect to all nodes in the graph with a single command i.e. tensor.backward().

Thus **torch.autograd** is the library that supports **automatic differentiation** (auto-compute gradients) in PyTorch. The central class of this package is torch.Tensor.
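A minimal sketch of this workflow on an illustrative three-element tensor: the operations build the graph, and a single backward() call fills in the gradient of the scalar output with respect to the input.

```python
import torch

# Leaf node of the graph; requires_grad=True asks autograd to track it.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = x * x     # an operation creates a new node
z = y.sum()   # scalar output of the graph

z.backward()  # compute dz/dx for every tracked leaf

print(x.grad)     # dz/dx = 2 * x
print(y.grad_fn)  # the operation that produced y
```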

```
Automatic differentiation (autodiff) refers to a general way of taking a program which computes a value, and automatically constructing a procedure for computing derivatives of that value. We focus on reverse mode autodiff. There is also a forward mode, which is for computing directional derivatives.
Backpropagation is the special case of autodiff applied to neural nets. But in machine learning, we often use backprop synonymously with autodiff.
Autograd is the name of a particular autodiff package. But lots of people, including the PyTorch developers, got confused and started using "autograd" to mean "autodiff".
The goal of autodiff is not a formula, but a procedure for computing derivatives. An autodiff system will convert the program into a sequence of primitive operations which have specified routines for computing derivatives.
In this representation, backprop can be done in a completely mechanical way. Most autodiff systems, including Autograd, explicitly construct the computation graph.
```

- A Tensor has a Boolean field **requires_grad**, set to **False by default**, which states whether PyTorch should build the graph of operations so that gradients with respect to it can be computed.
- Thus, to track all operations on a tensor, we set **.requires_grad** to **True**.
- Only **floating point** (or complex) type tensors can have their **gradient computed**.
- We define a function, which constructs a **dynamic graph**.
- To compute the derivative, we call **.backward()**.
- The function **Tensor.backward() accumulates** the **gradients** in the different Tensors, so one may have to set them to zero before calling it.
- The derivative for a tensor is accumulated in its **.grad** attribute.
- This **accumulating behavior** is **desirable**, in particular to compute the gradient of a loss summed over several “mini-batches,” or the gradient of a sum of losses.
- **torch.autograd.grad(outputs, inputs)** computes and returns the gradient of outputs with respect to inputs. Tensor.backward() is an **alternative** to torch.autograd.grad(...) and is the standard choice for training models.
- Although they are related, the **autograd graph is not the network’s structure**, but the graph of operations to compute the gradient. It can be data-dependent and miss or replicate sub-parts of the network.
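
The points about requires_grad can be checked with a minimal sketch. The tensor values and variable names below are illustrative assumptions, not part of the original notebook.

```python
import torch

# Tensors do not track gradients unless asked to.
a = torch.tensor([1.0, 2.0, 3.0])
default_flag = a.requires_grad          # False by default

# Turn tracking on in place (same effect as requires_grad=True at creation).
a.requires_grad_()
loss = (a ** 2).sum()
loss.backward()
grad_first = a.grad.clone()             # d(sum a^2)/da = 2a

# Integer tensors cannot require gradients; PyTorch raises a RuntimeError.
try:
    torch.tensor([1, 2, 3], requires_grad=True)
    int_error = None
except RuntimeError as err:
    int_error = str(err)

print(default_flag, grad_first, int_error)
```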

- Once a Tensor is converted to a node in the computational graph using torch.tensor(), we can
  - access its value using **x.data**
  - access its gradient using **x.grad**
  - access its gradient value using **x.grad.data**
- We can do operations on the tensor to make edges of the graph.
- Each tensor has a **.grad_fn** attribute that references a **"Function"** that has created the tensor (except for tensors created by the user, whose grad_fn is None).
- The inputs and the parameters in each layer are **leaf variables**; the **outputs** of a neural network (usually the loss, which we minimize to update the parameters of the network) are the **root variables** in the graph.

```
In [0]:
```import torch
t = torch.tensor([1., 2., 4.]).requires_grad_()
u = torch.tensor([10., 20.]).requires_grad_()
a = t.pow(2).sum() + u.log().sum()
torch.autograd.grad(a, (t, u))

```
Out[0]:
```

**Example 1:** $$y(x) = x^2$$

The derivative of the above function is given as $$ \frac {dy}{dx} = 2x$$

Then the derivative of above function at $x=2$ is $$ \frac {d\; y(2)}{dx} = 2 x = 2 (2) = 4$$

```
In [0]:
```#!pip3 install torch torchvision
import torch
x = torch.tensor(2.0,requires_grad=True) # Create a tensor with a value of 2 and set requires_grad to True
# requires_grad=True makes autograd record all operations applied to x
y = x**2 # y(x)=x^2; evaluating this expression constructs the graph
y.backward() # backward() calculates the derivative of y w.r.t. x at x=2 using the chain rule
x.grad # The .grad attribute holds the value of the derivative of y w.r.t. x at x=2

```
Out[0]:
```

**Note that** to access the value of the derivative of y with respect to x we use **x.grad**, not y.grad.

Setting **requires_grad** to True enables the calculation of gradients w.r.t. the tensor; these gradients are accumulated in its .grad attribute.

**Example 2:** $$z(x) = x^2 + 2x + 1$$

The derivative of the above function is given as $$ \frac{d\;z(x)}{dx} = 2x + 2$$

Then the derivative of above function at $x=2$ is $$ \frac {d\; z(2)}{dx} = 2 x + 2 = 2 (2) + 2= 6$$

```
In [0]:
```x = torch.tensor(2.0, requires_grad=True) # Assign x=2.0
z = x**2 + 2*x+1 # Construct the graph
z.backward() # Use chain rule to calculate the gradient
print(x.grad) # Print the gradient
print(x)

```
```

Gradients are **accumulated** with every call to backward(), which lets us obtain the correct total gradient when a variable takes part in several computations.

Thus we may need to manually reset the values to 0 so that the gradients computed previously do not interfere with the ones we are currently computing.

```
In [0]:
```#x = torch.tensor(2.0, requires_grad=True)
z = x**2 + 2*x+1 # Reconstruct the graph
z.backward() # Use chain rule to calculate the gradient
print(x.grad)
print(x)

```
```

After backward() is called, the derivative is calculated automatically.

Because the automatic differentiation is done twice and old gradients are accumulated, the first gradient 6 and the second gradient 6 are added to get 12.

In the example below we do **x.grad.zero_()** before the graph is reconstructed.

```
In [0]:
```#x = torch.tensor(2.0, requires_grad=True)
x.grad.zero_() # Zero the gradients before reconstructing the graph
z = x**2 + 2*x+1 # Reconstruct the graph
z.backward() # Use chain rule to calculate the gradient
print(x.grad)
print(x)

```
```

Since the backward() function accumulates gradients, and you don’t want to mix up gradients between minibatches, you have to zero them out at the start of a new minibatch. This is exactly like how a general (additive) accumulator variable is initialized to 0 in code.

By the way, **the best practice is to use the zero_grad() function on the optimizer.**
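
As a minimal sketch of that practice, the loop below reuses z(x) = x^2 + 2x + 1 from Example 2 and lets an optimizer clear the gradients. The choice of SGD and the learning rate of 0.1 are assumptions made only for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)   # illustrative optimizer and step size

grads = []
for step in range(2):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    z = x**2 + 2*x + 1           # rebuild the graph
    z.backward()                 # dz/dx = 2x + 2
    grads.append(x.grad.item())
    optimizer.step()             # x <- x - lr * x.grad

print(grads)
```

Without the zero_grad() call, the second recorded gradient would also contain the first one.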

**Example 3:**

The autograd graph is encoded through

- the **grad_fn** fields of Tensors, and
- the **next_functions** fields of Functions.

```
In [0]:
```#!pip3 install torch torchvision
import torch
x = torch.tensor([ 1.0, -2.0, 3.0, -4.0 ]).requires_grad_()
a = x.abs()
s = a.sum()
print(s)
print(s.grad_fn.next_functions)
print(s.grad_fn.next_functions[0][0].next_functions)

```
```

```
In [0]:
```#!pip3 install torch torchvision
import torch
# Requires_grad=True turns on differential mode
a=torch.tensor(([1.0]),requires_grad=True)
b=torch.tensor(([2.0]),requires_grad=True)
c=torch.tensor(([3.0]),requires_grad=True)
d=a+b
e=d+c
e.backward()
print(a.grad,b.grad,c.grad)
print(d.grad) # Intermediate gradient value is not saved, and is empty
print(a.grad_fn) # The first node's .grad_fn is empty
print(e.grad_fn)

```
```

If we want to compute the derivatives, we call .backward() on a tensor.

If the tensor is a scalar (i.e. it holds a single element), we don’t need to specify any arguments to backward(). **If the tensor contains more than one element, we need to specify a gradient argument (sometimes called grad_output) that is a tensor of matching shape.**

Basically, to start the chain rule we need a gradient at the output to get it going. When the output is a scalar loss (which it usually is, since we normally begin the backward pass at the loss variable), this gradient is an implied value of 1.0.

For a scalar y, the call y.backward() is equivalent to **y.backward(torch.ones_like(y))**.
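
To make the role of the gradient argument concrete, here is a short worked formula. The symbols $v$ (the tensor passed to backward) and $J$ (the Jacobian of $z$ with respect to $x$) are notation introduced here for illustration. For a non-scalar output $z$, calling z.backward(v) adds a vector-Jacobian product to x.grad:

$$\text{x.grad}_j \mathrel{+}= \sum_i v_i \frac{\partial z_i}{\partial x_j} = (J^{\top} v)_j$$

For an element-wise function such as $z_i = x_i^2$ the Jacobian is diagonal, so

$$\text{x.grad}_j \mathrel{+}= 2\, v_j\, x_j$$

Choosing $v = (1, 0, 0, 0)$ therefore keeps only the derivative of $z_1$, while $v = (1, 1, 1, 1)$ gives the same result as calling backward() on $\sum_i z_i$.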

**Example 4:**

Computing partial derivatives of $f(x, y) = x^2 y$ looks something like this:

$$\frac {\partial f}{\partial x} = \frac {\partial}{\partial x}x^2 y = 2 x y$$

$$\frac {\partial f}{\partial y} = \frac {\partial}{\partial y}x^2 y = x^2\cdot 1$$

```
In [0]:
```#!pip3 install torch torchvision
import torch
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.tensor([5.0, 6.0, 7.0, 8.0], requires_grad=True)
z = (x**2) * y
print(z)
z.backward(torch.FloatTensor([1, 0, 0, 0])) # do backward for first element of z
#z.backward()
print("The derivative of z_1 w.r.t to x: ",x.grad.data)
print("The derivative of z_1 w.r.t to y: ",y.grad.data)

```
```

```
In [0]:
```x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.tensor([5.0, 6.0, 7.0, 8.0], requires_grad=True)
z = (x**2) * y
z.backward(torch.FloatTensor([0, 1, 0, 0])) # do backward for second element of z
print("The derivative of z_2 w.r.t to x:, ",x.grad.data)
print("The derivative of z_2 w.r.t to y: ",y.grad.data)

```
```

```
In [0]:
```# do backward for all elements of z, equal to the collection of partial derivatives z_1, z_2, z_3 and z_4
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.tensor([5.0, 6.0, 7.0, 8.0], requires_grad=True)
z = (x**2) * y
z.backward(torch.FloatTensor([1, 1, 1, 1]))
print(x.grad.data)
print(y.grad.data)

```
```

```
In [0]:
```x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.tensor([5.0, 6.0, 7.0, 8.0], requires_grad=True)
z = (x**2) * y
z.backward(gradient=torch.ones(z.size()))
print(x.grad.data)
print(y.grad.data)

```
```

The above code can also be written as:

```
In [0]:
```import torch
x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.tensor([5.0, 6.0, 7.0, 8.0], requires_grad=True)
z = (x**2) * y
print(z)
print(z.shape)
z.backward(torch.FloatTensor([1, 0, 0, 0])) # do backward for first element of z
print("The derivative of z_1 w.r.t to x: ",x.grad.data)
print("The derivative of z_1 w.r.t to y: ",y.grad.data)
x.grad.zero_() # Zero the gradients of x
y.grad.zero_() # Zero the gradients of y
z = (x**2) * y # Reconstruct the graph
z.backward(torch.FloatTensor([0, 1, 0, 0])) # do backward for second element of z
print("The derivative of z_2 w.r.t to x: ",x.grad.data)
print("The derivative of z_2 w.r.t to y: ",y.grad.data)
x.grad.zero_()
y.grad.zero_()
z = (x**2) * y
z.backward(torch.FloatTensor([1, 1, 1, 1]))
print(x.grad.data)
print(y.grad.data)

```
```

```
In [0]:
```x = torch.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
# Each tensor has a .grad_fn attribute that references a "Function" that has created the tensor (except for tensors created by the user – their grad_fn is None).
print('x', x)
y = 2 * x # define an operation on x and construct the graph
z = y ** 3 # define one more operation to check the chain rule and continue constructing the graph
print('z shape:', z.size())
z.backward(torch.FloatTensor([[1, 1], [1, 1]]))
print('x gradient for its all elements:\n', x.grad)
x.grad.zero_()
x.grad.data.zero_() #the gradient for x will be accumulated, it needs to be cleared.
y = 2 * x
z = y ** 3
z.backward(torch.FloatTensor([[0, 1], [0, 1]]))
print('x gradient for the second column:\n', x.grad)
x.grad.zero_()
x.grad.data.zero_()
y = 2 * x
z = y ** 3
z.backward(torch.FloatTensor([[1, 1], [0, 0]]))
print('x gradient for the first row:\n', x.grad)

```
```

```
In [0]:
```x = torch.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
y = 2 * x #define an operation on x
print('y', y)
z = y ** 3 #define one more operation to check the chain rule
out = z.mean()
print('out', out)
out.backward()
print('x gradient:\n', x.grad)
x.grad.data.zero_()
y = 2 * x
z = y ** 3
out = z.mean()
# out is a scalar, so an explicit gradient argument must have the shape of out;
# passing torch.ones(z.size()) here would raise a shape-mismatch error
out.backward(torch.ones(out.size())) # Note the use of size() of the output tensor
print('x gradient second time', x.grad)

```
```

Consider the function of two variables $u$ and $v$

$$ f(u,v) = uv + u^2$$

Thus the partial derivatives w.r.t. the two variables $u$ and $v$ are given as:

$$\frac {\partial f(u,v)}{\partial u} = v+2u$$

$$\frac {\partial f(u,v)}{\partial v} = u$$

Now at u=1 and v=2 the function is: $$f(u=1,v=2) = uv + u^2 = 1(2)+ 1^2 = 3$$

The partial derivatives at u=1 and v=2 are: $$\frac {\partial f(1,2)}{\partial u} = v+2u = 2 + 2 (1) = 4$$

$$\frac {\partial f(1,2)}{\partial v} = u = 1$$

```
In [0]:
```#Define two tensors u and v
u=torch.tensor(1.0,requires_grad=True)
v=torch.tensor(2.0,requires_grad=True)
f=u*v+u**2
f.backward()
print(u.grad)
print(v.grad)

```
```

We now calculate the derivative of $y = x^2$ over a range of values. We generate values of x from -10 to 10. Note that we have to call .detach() on a tensor that requires gradients before we can convert it to the NumPy array required by the matplotlib functions. The **.detach()** function returns a tensor on which future computations are not tracked. Another way to prevent history tracking is by wrapping your code with **torch.no_grad()**.

The **detach() method creates a tensor which shares the data, but does not require gradient computation, and is not connected to the current graph**. This method should be used when the gradient should not be propagated beyond a variable, or to update leaf tensors.
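
A small sketch of these properties follows. The variable names and values are illustrative assumptions; the detached tensor is treated as a constant, so no gradient flows through it.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
y_detached = y.detach()                       # same data, cut off from the graph

shares_storage = y_detached.data_ptr() == y.data_ptr()

# y_detached acts as a constant c = x^2, so d/dx [x^2 * c] = 2x * c = 2x^3
out = (y * y_detached).sum()
out.backward()
print(shares_storage, y_detached.requires_grad, x.grad)
```

Had y been multiplied by itself instead, the gradient would have been 4x^3, since both factors would then propagate gradients.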

```
In [0]:
```x = torch.linspace(-10,10,10,requires_grad=True)
print(x)
Y=x**2
print(Y)
y=x**2
print(y)
y.backward(torch.Tensor([1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
print(x.grad)
import matplotlib.pyplot as plt
plt.plot(x.detach().numpy(),Y.detach().numpy(),label='function')
plt.plot(x.detach().numpy(),x.grad.detach().numpy(),label='derivative')
plt.legend()

```
Out[0]:
```

**Homework: A little hack**. Explain the function below as to why we are taking **sum** compared to the above function.

Hint: Even if we have y = x^2 and loss = y.sum(), when we call loss.backward() we can still get the gradient of loss w.r.t. y, as sum will be the last op of the last layer and the graph will calculate backwards from there.

```
In [0]:
```x = torch.linspace(-10,10,10,requires_grad=True)
print(x)
Y=x**2
print(Y)
y=torch.sum(x**2)
print(y)
y.backward()
print(x.grad)
import matplotlib.pyplot as plt
plt.plot(x.detach().numpy(),Y.detach().numpy(),label='function')
plt.plot(x.detach().numpy(),x.grad.detach().numpy(),label='derivative')
plt.legend()

```
Out[0]:
```

**ReLU Function**

```
In [0]:
```import torch.nn.functional as F
x=torch.linspace(-3,3,100,requires_grad=True)
Y=F.relu(x)
y=torch.sum(F.relu(x))
y.backward()
plt.plot(x.detach().numpy(),Y.detach().numpy(),label='function')
plt.plot(x.detach().numpy(),x.grad.detach().numpy(),label='derivative')
plt.legend()

```
Out[0]:
```

In PyTorch, the computation **graph** is **created** at the **start** of **each iteration** in an **epoch** and then subsequently **freed** to save memory at the **end** of that **iteration**.

In each iteration, we execute the forward pass, compute the derivatives of the output w.r.t. the parameters of the network, and update the parameters to fit the given examples. After doing the backward pass, the graph will be freed to save memory. In the **next iteration**, a **fresh** new **graph** is **created** and ready for back-propagation.

Because the computation graph will be freed by default after the first backward pass, you will encounter errors if you are trying to do backward on the same graph the second time.
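
The error can be reproduced with a minimal sketch. The tensor value is an arbitrary choice; pow is used because it saves its input for the backward pass, so its buffers are released after the first call.

```python
import torch

x = torch.tensor([3.0], requires_grad=True)
y = x ** 2            # pow saves x for the backward pass
y.backward()          # first pass works and frees the saved buffers

try:
    y.backward()      # second pass on the same, already freed graph
    second_pass_error = None
except RuntimeError as err:
    second_pass_error = str(err)

print(x.grad, second_pass_error)
```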

We can specify **retain_graph=True** when calling backward the first time to make sure the graph is retained and the buffers are not freed until we finish the backward propagation through the graph a second time.

During the optimization step, we combine the chain rule and the graph to compute the derivative (partial) of the output w.r.t the learnable variable in the graph and update these variables to make the output close to what we want.

```
In [0]:
```#!pip3 install torch torchvision
import torch
f = torch.tensor([2.0,3.0],requires_grad=True)
print("Original Tensor: ",f)
g = f[0] * f[1]
g.backward(retain_graph=True)
#g.backward()
print("1st Backward Pass: ",f.grad)
g.backward()
print("2nd Backward Pass: ",f.grad)

```
```

```
In [0]:
```import torch
a = torch.randn((1,4),requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()

```
In [0]:
``````
#d.backward(retain_graph=True)
```

As long as you use retain_graph=True in your backward method, you can do backward any time you want.

```
In [0]:
```d.backward(retain_graph=True) # fine....graph is retained so no need to reconstruct again
e.backward(retain_graph=True) # fine....graph is retained so no need to reconstruct again
d.backward() # also fine
e.backward() # error will occur!

```
In [0]:
```x = torch.tensor([3.0],requires_grad=True)
y = x * 2 + x ** 2 + 3
y.backward(retain_graph= True )
print(x.grad) # tensor([8.])
y.backward(retain_graph= True )
print(x.grad) # tensor([16.]): backward has run twice, so the first gradient 8 and the second gradient 8 are added to get 16
y.backward() # Run backward once more, this time without retaining the computation graph
print(x.grad) # tensor([24.])
#y.backward() # would raise an error, since the computation graph has been freed

```
```

The addition operation doesn’t need buffers.

If $f(x) = x + w$ then the gradient of $f$ with respect to $w$ is 1. In this case the gradient doesn’t depend on the inputs, so nothing has to be saved for the backward pass.

If $f(x) = x * w$ then the gradient of $f$ with respect to $w$ is $x$. In this case, we need to save the input value.

```
In [0]:
```f = torch.tensor([2.0,3.0], requires_grad=True)
g = f[0] + f[1]
g.backward()
print(f.grad)
g.backward()
print(f.grad)
print(f.grad.data[0])
print(f.is_leaf)

```
```

Usually after a backpropagation you process the next batch so you don’t need the gradients of the previous batch anymore.

**retain_variables** argument has been deprecated in favor of **retain_graph**

A **real use case** in which you want to backward through the graph more than once is multi-task learning, where you have multiple losses at different layers. Suppose that you have 2 losses, loss1 and loss2, and they reside in different layers. In order to back-propagate the gradients of loss1 and loss2 w.r.t. the learnable weights of your network independently, you have to use retain_graph=True in the backward() call of the loss that is back-propagated first.
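
A runnable sketch of this pattern, under stated assumptions: the tiny shared layer, the two heads and all tensor names are hypothetical stand-ins for a real multi-task network.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 4)
w_shared = torch.randn(4, 3, requires_grad=True)   # hypothetical shared layer
w_head1 = torch.randn(3, 1, requires_grad=True)    # hypothetical task-1 head
w_head2 = torch.randn(3, 1, requires_grad=True)    # hypothetical task-2 head

h = torch.tanh(x.mm(w_shared))          # shared part of the graph
loss1 = h.mm(w_head1).pow(2).mean()
loss2 = h.mm(w_head2).pow(2).mean()

loss1.backward(retain_graph=True)       # keep the shared buffers for the second pass
loss2.backward()                        # the graph is freed after this call
accumulated = w_shared.grad.clone()

# Reference: gradient of the summed loss on a freshly built graph.
h_ref = torch.tanh(x.mm(w_shared))
total = h_ref.mm(w_head1).pow(2).mean() + h_ref.mm(w_head2).pow(2).mean()
(reference,) = torch.autograd.grad(total, w_shared)
print(torch.allclose(accumulated, reference, atol=1e-5))
```

Because gradients accumulate, the two backward calls leave the gradient of loss1 + loss2 in the shared weight.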

```
In [0]:
```#loss1.backward(retain_graph=True) # suppose you first back-propagate loss1, then loss2 (you can also do it in reverse order)
#loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
#optimizer.step() # update the network parameters

- The backward() function makes differentiation very simple.
- For a non-scalar tensor, we need to specify the **gradient** argument of backward() (called **grad_tensors** in torch.autograd.backward()).
- If you need to call backward() twice on a graph or subgraph, you will need to set retain_graph to True.
- Note that grad will accumulate from executing the graph multiple times.

There are several attributes related to gradients that every tensor has:

**grad**: A property which holds a tensor of the same shape containing computed gradients.

**is_leaf**: True, if this tensor was constructed by the user and False, if the object is a result of function transformation.

**requires_grad**: True if this tensor requires gradients to be calculated. This property is inherited from leaf tensors, which get this value from the tensor construction step (zeros() or torch.tensor() and so on). By default, the constructor has requires_grad=False, so if you want gradients to be calculated for your tensor, then you need to explicitly say so.
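
These three attributes can be inspected directly. The sketch below uses made-up tensors named data, w and y purely for illustration.

```python
import torch

data = torch.zeros(3)                    # created by the user, no gradients requested
w = torch.ones(3, requires_grad=True)    # leaf tensor that requires gradients
y = (w * 2 + data).sum()                 # result of operations, so not a leaf

flags = {
    "data": (data.is_leaf, data.requires_grad),
    "w": (w.is_leaf, w.requires_grad),
    "y": (y.is_leaf, y.requires_grad),
}
grad_before = w.grad                     # None until backward() has been called
y.backward()
print(flags, grad_before, w.grad)
```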

The network will have a single hidden layer (i.e., it is a two-layer network) and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output. This is adapted from module1 but implemented using PyTorch without helper functions.

```
In [0]:
```# Original Problem
# -*- coding: utf-8 -*-
import numpy as np
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(5):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    # Zeroing the grads is not required, as we are not using the torch backward() function, which accumulates gradients
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

```
```

```
In [0]:
```import torch
# device = torch.device("cpu") # Uncomment this to run on CPU
dtype = torch.float
device = torch.device("cuda:0")
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype) #64x1000
y = torch.randn(N, D_out, device=device, dtype=dtype) #64x10
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype) #1000 x 100
w2 = torch.randn(H, D_out, device=device, dtype=dtype) #100 x 10
learning_rate = 1e-6
for t in range(5):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

```
```

```
In [0]:
```import torch
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in,requires_grad=False)
y = torch.randn(N, D_out,requires_grad=False)
w1 = torch.randn(D_in, H,requires_grad=True)
w2 = torch.randn(H, D_out,requires_grad=True)
learning_rate = 1e-6
for t in range(5):
    # Do forward
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    # Compute Loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.detach().item())
    # Backprop to compute gradients (partial derivatives) of the loss with respect to w1 and w2
    loss.backward()
    # Update weights using gradient descent
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
    # Zero them for the next iteration, as we have already updated the weights with the gradient data
    w1.grad.data.zero_()
    w2.grad.data.zero_()

```
```

We can also use **loss.data** (**loss.data.item()**) instead of **loss.detach().item()**. However, .data bypasses autograd's in-place correctness checks and might silently lead to wrong results.

Changes made through loss.data are invisible to autograd, whereas a tensor returned by **loss.detach()** has its in-place changes reported by autograd: if the modified tensor is needed in backward, an error is raised.

**Both share the underlying data of the tensor and have requires_grad=False**.

Thus tensor.data gives a tensor that shares the storage with tensor but doesn't track history, hence the no_grad() context is not required when updating weights through .data.

Please note that in autograd, if any input Tensor of an operation has requires_grad=True, **the computation will be tracked**. For updating weights we don't need to track the operation.

So to summarize, both are used to detach a tensor from the computation graph and both return a tensor that shares the same data; the difference is that **loss.detach() adds another constraint: when the data is changed in-place and is needed for backward, the backward pass raises an error instead of silently using the modified values**.

Refer to https://pytorch.org/blog/pytorch-0_4_0-migration-guide/

```
In [0]:
```# ==================USING detach()=======================================
#!pip3 install torch torchvision
import torch
x = torch.tensor(([5.0]))
w = torch.tensor(([10.0]),requires_grad=True)
y = w*x
print(w)
c = w.detach()
c.zero_()
print(w) # Modified by c.zero_()!!
y.backward() # Error One of the variables needed for gradient computation has been modified by an inplace operation

```
In [0]:
# ==================USING .data=========================================
#!pip3 install torch torchvision
import torch
x = torch.tensor(([5.0]))
w = torch.tensor(([10.0]),requires_grad=True)
y = w*x
print(w)
c = w.data
c.zero_()
print(w) # Modified by c.zero_()!!
y.backward() # Error is not reported as in-place changes are not tracked by autograd

```
```

**.data** can be unsafe in some cases. Any changes on w.data wouldn't be tracked by autograd, and the computed gradients would be incorrect if w is needed in a backward pass. A safer alternative is to use w.detach(), which also returns a Tensor that shares data with requires_grad=False, but will have its **in-place changes reported by autograd if w is needed in backward**
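
The difference is easiest to see with an operation that is known to save a tensor for its backward pass. The sketch below follows the pattern of the migration guide referenced above and uses sigmoid, which saves its output; the variable names and values are illustrative assumptions.

```python
import torch

# Case 1: modify through .detach() -> autograd notices and raises an error.
a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = a.sigmoid()              # sigmoid saves its output for the backward pass
out.detach().zero_()           # in-place change, visible to autograd
try:
    out.sum().backward()
    detach_error = None
except RuntimeError as err:
    detach_error = str(err)

# Case 2: modify through .data -> no error, but the gradient is silently wrong.
b = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out2 = b.sigmoid()
out2.data.zero_()              # in-place change, invisible to autograd
out2.sum().backward()          # uses the zeroed output: out * (1 - out) = 0
print(detach_error is not None, b.grad)
```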

```
In [0]:
```import torch
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
for t in range(5):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    # Compute and print loss using operations on Tensors.
    # Now loss is a Tensor holding a single value (shape ())
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

```
```

Using the **context manager no_grad()** we can avoid storing the computations done producing the output of our network in the computation graph.

The torch.no_grad() context **switches off the autograd machinery**, and can be used for operations such as parameter updates.
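
A minimal sketch of what the context changes; the tensor w and the update step of 0.1 are made-up values for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2                    # recorded in the graph
with torch.no_grad():
    untracked = w * 2              # not recorded: no grad_fn, requires_grad=False
    w -= 0.1 * torch.ones(3)       # in-place update of a leaf is allowed here

print(tracked.requires_grad, untracked.requires_grad, w)
```

The weight itself still has requires_grad=True after the block; only the operations executed inside it are left out of the graph.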

$$\fbox{torch.nn.functional}
\xrightarrow{}
\fbox{torch.nn}$$

**LAYERS**

**torch.nn** exposes specific functionality like layers, activation functions, loss functions etc required for constructing networks and architectures.

torch.nn is **built on top** of **torch.nn.functional**.

torch.nn has three main parts:

- **Parameters** - learnable/trainable parameters, configurable parameters
- **Containers** - Containers include modules, sequences, ModuleList, ParameterList, ModuleDict, ParameterDict
- **Layers** - Layers include linear, normalization, convolutional, pooling, padding, dropout, recurrent

For layers with trainable parameters, we use torch.nn to create the layers or the model. We can do it two ways:

- **Using Sequence** - Simple models which don't require much customization
- **Using Class** - Used to define more complicated and custom models. Our layers are stored in the instance of a class, so we can easily access the layers and the trainable parameters later.
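
The class-based approach can be sketched as follows. The class name TwoLayerNet, the attribute names and the layer sizes are hypothetical choices made for this illustration.

```python
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):                 # hypothetical example class
    def __init__(self, n_in, n_h, n_out):
        super().__init__()
        # Layers stored as attributes are registered with the module automatically.
        self.hidden = nn.Linear(n_in, n_h)
        self.output = nn.Linear(n_h, n_out)

    def forward(self, x):
        return torch.sigmoid(self.output(torch.relu(self.hidden(x))))

model = TwoLayerNet(n_in=10, n_h=5, n_out=1)
y = model(torch.randn(4, 10))
param_names = [name for name, _ in model.named_parameters()]
print(y.shape, param_names)
```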

In **sequence containers**, we use torch.nn.Sequential to compose layers from torch.nn. Thus, to create a model made of a linear layer with a ReLU activation followed by a linear layer with a sigmoid activation:

- The class torch.nn.Linear does the job for us. For a linear layer we need to multiply each input node with a weight and add a bias. It applies a linear transformation to the incoming data, y=wx+b. Thus
  `model = nn.Sequential( nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out), nn.Sigmoid() )`
- The module torch.nn.Linear(input_size, output_size, bias=True) implements a fully-connected layer. It takes as input a tensor of size N×C and produces a tensor of size N×D.
- We didn’t specify the weight tensors, as the **weights and biases are automatically randomized at creation**.
- Trainable parameters of a model are returned by **model.parameters()**.
- Calling **torch.manual_seed(1)** before creating the model will give us the same result every time we run the code.
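
The following sketch puts these points together. The sizes n_in, n_h and n_out and the helper build_model are assumptions chosen only to make the example runnable.

```python
import torch
import torch.nn as nn

n_in, n_h, n_out = 10, 5, 1          # illustrative layer sizes

def build_model():
    torch.manual_seed(1)             # fix the seed so initialization is repeatable
    return nn.Sequential(
        nn.Linear(n_in, n_h), nn.ReLU(),
        nn.Linear(n_h, n_out), nn.Sigmoid(),
    )

model_a = build_model()
model_b = build_model()
x = torch.randn(4, n_in)

same_init = all(torch.equal(p, q)
                for p, q in zip(model_a.parameters(), model_b.parameters()))
n_params = sum(p.numel() for p in model_a.parameters())   # 10*5 + 5 + 5*1 + 1
print(model_a(x).shape, same_init, n_params)
```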

torch.nn.Linear applies a linear transformation to the incoming data, i.e. y = Ax + b. The input tensor given in forward(input) is typically a vector (1D tensor) or a matrix (2D tensor). If the input is a matrix, then each row is assumed to be an input sample of the given batch. The layer can be used without a bias by setting bias=False.

So given input matrix as follows.

$$Sample_1 = [1, 3, 2, 6, 9]$$

$$Sample_2 = [2, 7, 1, 4, 8]$$

$$Sample_3 = [0, 2, 3, 6, 5]$$

$$Sample_4 = [8, 4, 1, 7, 9]$$

$${\bf X}_{4 \times 5} = \underbrace{ \left.\left( \begin{array}{ccccc} 1&3&2&6 &9\\ 2&7&1&4 &8\\ 0&2&3&6 &5\\ 8&4&1&7&9 \end{array} \right)\right\} }_{5\text{ features}} \,4\text{ samples}$$

$$\text{torch.nn.Linear(features, number of neurons)}$$

$$\text{torch.nn.Linear(5, 200)}$$

Thus one hidden layer with 200 hidden neurons!
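
As a quick check of the shapes involved, the sketch below feeds the 4×5 sample matrix above through torch.nn.Linear(5, 200) and compares the result with the explicit formula. The variable names are illustrative.

```python
import torch
import torch.nn as nn

# The four samples with five features each, one sample per row.
X = torch.tensor([[1., 3., 2., 6., 9.],
                  [2., 7., 1., 4., 8.],
                  [0., 2., 3., 6., 5.],
                  [8., 4., 1., 7., 9.]])

layer = nn.Linear(5, 200)                       # 5 input features, 200 neurons
H = layer(X)                                    # one row of activations per sample

# The layer stores its weight as (out_features, in_features).
manual = X @ layer.weight.t() + layer.bias
print(H.shape, layer.weight.shape)
```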

```
In [0]:
```#!pip3 install torch torchvision
import torch
f = torch.nn.Linear(in_features = 10, out_features = 4)
for n, p in f.named_parameters():
    print(n, p.size())
#for param in f.parameters():
#    print(param.size())
x = torch.empty(523, 10).normal_()
y = f(x)
y.size()

```
Out[0]:
```

There are many methods available for each module to access its children and parameters:

- model.modules(), model.named_modules(), model.parameters(), model.named_parameters(), model.children() and model.named_children().

But the most used is model.parameters(), as it accesses all the parameters recursively and hence can be passed to an optimizer for updating weights/biases.

Parameters are of the type torch.nn.Parameter, which is a Tensor with requires_grad set to True by default and known to be a model parameter by various utility functions, in particular torch.nn.Module.parameters().

**LOSS/COST**

torch.nn has loss/cost functions called criteria.

- L1Loss, MSELoss, CrossEntropyLoss, CTCLoss, NLLLoss, PoissonNLLLoss, KLDivLoss, BCELoss, BCEWithLogitsLoss, MarginRankingLoss, HingeEmbeddingLoss, MultiLabelMarginLoss, SmoothL1Loss, SoftMarginLoss, MultiLabelSoftMarginLoss, CosineEmbeddingLoss, MultiMarginLoss, TripletMarginLoss

The general syntax is

loss = torch.nn.MSELoss()

output = loss(input, target)

output.backward()

**Note:** Some criteria/losses, particularly in older PyTorch releases, do not accept a target tensor with requires_grad set to True.

The first parameter of a loss is traditionally called the input and the second the target. These two quantities may be of different dimensions or even types for some losses (e.g.for classification).
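
A small worked example of this syntax, with made-up input and target values; MSELoss with its default settings averages the squared differences.

```python
import torch
import torch.nn as nn

pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # the "input" of the loss
target = torch.tensor([1.0, 0.0, 5.0])

criterion = nn.MSELoss()
loss = criterion(pred, target)    # mean((pred - target)^2) = (0 + 4 + 4) / 3
loss.backward()                   # d loss / d pred = 2 * (pred - target) / 3
print(loss.item(), pred.grad)
```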

**OPTIMIZATION/WEIGHT UPDATES**

torch.optim is a package implementing various optimization algorithms

- ASGD, Adadelta, Adagrad, Adam, Adamax, LBFGS, RMSprop, Rprop, SGD, SparseAdam

To use torch.optim we construct an optimizer object, that will hold the current state and will update the parameters based on the computed gradients.

We give it an iterable containing the parameters to optimize, followed by optimizer-specific options such as the learning rate and weight decay.

optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9)

optimizer = optim.Adam([var1, var2], lr = 0.0001)

**model.parameters()** returns an iterator over our model’s parameters (**weights** and **biases**).

If we need to move a model to GPU via .cuda(), we need to do so before constructing optimizers for it.

All optimizers implement a step() method that updates the parameters. It is **called after** the gradients are computed using **backward()**.

We can also specify per-layer learning rates by passing an iterable of dictionaries instead of an iterable of parameters, as below.

```
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
```

Here:

- model.base’s parameters will use the default learning rate of 1e-2,
- model.classifier’s parameters will use a learning rate of 1e-3,
- a momentum of 0.9 will be used for all parameters.
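The per-group settings can be checked through the optimizer's param_groups attribute. The sketch below assumes a hypothetical TinyModel with base and classifier submodules; the class and its layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class TinyModel(nn.Module):
    """Hypothetical model with the two submodules named in the text."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(8, 4)
        self.classifier = nn.Linear(4, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.base(x)))

model = TinyModel()
optimizer = optim.SGD([
    {'params': model.base.parameters()},                    # uses the default lr
    {'params': model.classifier.parameters(), 'lr': 1e-3},  # overrides the default lr
], lr=1e-2, momentum=0.9)

# Each dictionary becomes one parameter group with its own options.
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])
```

Options that a group does not set explicitly are filled in from the keyword arguments passed to the optimizer constructor.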

**SCHEDULER - ADJUST LEARNING RATE**

torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs. ReduceLROnPlateau instead adjusts it based on a monitored metric.

- LambdaLR, StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau

optimizer = torch.optim.Adam(dual_encoder.parameters(), lr = 0.001)

scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma = 0.95)

```
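for epoch in range(200):
    for i in range(len(label_array)):
        optimizer.zero_grad()
        # (forward pass and loss computation for sample i go here)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch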
```
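A self-contained sketch of the same pattern follows. It uses a single made-up weight tensor and a toy quadratic loss (both assumptions for illustration) so that the learning rate schedule can be observed directly.

```python
import torch

# Toy setup: one weight and a quadratic loss, chosen only for illustration.
w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

lrs = []
for epoch in range(3):
    lrs.append(optimizer.param_groups[0]['lr'])  # learning rate used in this epoch
    optimizer.zero_grad()
    loss = ((w - 1.0) ** 2).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiply the learning rate by gamma after each epoch

print(lrs)
```

With gamma set to 0.5, the recorded learning rate is halved from one epoch to the next.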

```
In [0]:
```#!pip3 install torch torchvision
import torch
import torch.optim
x = torch.randn((5, 5))
w = torch.randn((5, 5),requires_grad=True)
z = w.mm(x).mean() # Perform an operation
opt = torch.optim.Adam([w], lr=0.1, betas=(0.5, 0.999)) # Define the optimizer
z.backward() # Calculate gradients
print(w.data) # Print the weights
opt.step() # Update w according to Adam's gradient update rules
print(w.data) # Print updated weights after training step

```
```

```
In [0]:
```opt.step()
print(w.data)

```
```

Thus torch.nn.Sequential expects the layers that we want in the neural network, in order. In our case we want a linear layer (y = wx + b) whose input is a vector of some length, followed by the non-linear activation function ReLU, followed by another linear layer with a sigmoid activation function. This is a two-layer network.

There is also the implicit input layer which is understood.

```
In [0]:
```import torch
#device = torch.device("cpu") # Uncomment this to run on CPU
dtype = torch.float
device = torch.device("cuda:0")
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype) #64x1000
y = torch.randn(N, D_out, device=device, dtype=dtype) #64x10
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype) #1000 x 100
w2 = torch.randn(H, D_out, device=device, dtype=dtype) #100 x 10
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
torch.nn.Sigmoid())
model = model.cuda()
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_function = torch.nn.MSELoss(reduction='sum').cuda()
# Set the learning rate value
learning_rate = 1e-4
for t in range(5):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data. Basically construct the graph
    y_pred = model(x)
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_function(y_pred, y)
    print(t, "{:.20f}".format(loss.item()))
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

```
```

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the **optim package**:

```
In [0]:
```# -*- coding: utf-8 -*-
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out))
loss_fn = torch.nn.MSELoss(reduction='sum')
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(5):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x) # Compute the graph
    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, "{:.20f}".format(loss.item()))
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()
    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

```
```

In PyTorch, when using classes, we define models as subclasses of torch.nn.Module.

The two functions required when defining (inheriting) any model using a class are:

- **__init__** - We define all the parametric (and sometimes non-parametric) layers. The parent constructor always has to be called first; then we define the layers as instance attributes, i.e. self.x.
- **forward** - We connect all the layers and other functions to the input to form a graph.

In the __init__ function, we initialize the layers we want to use. PyTorch is fairly low level here: we have to specify the sizes of our network ourselves so that everything matches.

In the forward method, we specify the connections between our layers. This means that we use the layers we already initialized, so that the same layers are re-used for each forward pass of data we make.

For layers that do not have trainable weights, we can use either the **layer form** (from torch.nn) or the **connection form** (from torch.nn.functional).

```
In [0]:
```#!pip3 install torch torchvision
import torch
import torch.nn as nn
import torch.nn.functional as F
class net(torch.nn.Module):
    def __init__(self):
        super(net,self).__init__() # Can also use torch.nn.Module.__init__(self)
        self.linear1=torch.nn.Linear(D_in,H)
        self.linear2=torch.nn.Linear(H,D_out)
    def forward(self,x):
        h_relu = torch.nn.functional.relu(self.linear1(x))
        #h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

During the **forward pass, the inputs must be passed** to the graph being constructed and any output must be returned for the loss to be calculated.

In the class definition, you can see the inheritance from the base class nn.Module. When subclassing, the inheriting class gets all of the functionality and capabilities of the class named in the parentheses. We can then add additional functionality to it.

Then, in the first line of the class initialization (**def __init__(self):**) we have the required super() call, which runs the initializer of the base nn.Module class. The superclass nn.Module has to be set up before we can add our own pieces to it, i.e. we have to construct the superclass first.

The **self keyword** makes sure that each instance of the class is able to hold its own data.

The next lines are where we **define** our fully connected **layers** as per the architecture diagram. A fully connected neural network layer is represented by the nn.Linear object, with the first argument in the definition being the number of nodes in layer l and the next argument being the number of nodes in layer l+1.

Now that we’ve set up the “skeleton” of our network architecture, we have to **define how data flows** through our network, i.e. how the **computational graph** is connected. We do this by defining a **forward()** method in our class. This method overrides a placeholder method in the base class, and needs to be defined for each network.

Note:

- Each class method should have **self** as its **first argument**. **self** is a parameter common to all instance methods. In general, a function is floating free and unencumbered, whereas a class (instance) method has to be aware of the object it belongs to (and that object's properties), so self is a way of passing the method a reference to the instance.
- Variables created with the keyword **self** are unique to each instance of the class.
- Class variables are shared by all instances and don’t have the keyword self.

__init__ is the method that is invoked automatically, immediately after the object (instance) is created.
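The difference between instance variables and class variables can be seen in a few lines of plain Python. The Counter class below is a made-up example, not part of PyTorch.

```python
class Counter:
    created = 0  # class variable: shared by all instances

    def __init__(self, name):
        # __init__ runs automatically when Counter(...) is called
        self.name = name      # instance variable: unique to this object
        Counter.created += 1  # update the shared class variable

    def describe(self):
        # self gives the method access to the instance it was called on
        return f"{self.name} (one of {Counter.created})"

a = Counter("first")
b = Counter("second")
print(a.describe())  # first (one of 2)
print(b.describe())  # second (one of 2)
```

Each object keeps its own name, while the created count is shared and reflects both constructor calls.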

```
In [0]:
```N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in).cuda()
y = torch.randn(N, D_out).cuda()
#x = x.type(torch.FloatTensor)
#y = y.type(torch.FloatTensor)
mymodel = net()
print(mymodel)
# Move to gpu if available:
if torch.cuda.is_available():
    mymodel.cuda()

```
```

The call mymodel = net() **creates** an **instance of the network architecture**. We then set up an optimizer and a loss criterion. In PyTorch, the optimizer knows how to optimize any attribute of type Parameter.

```
In [0]:
```# create a stochastic gradient descent optimizer
optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.0009, momentum=0.9)
# create a loss function
#criterion = nn.NLLLoss()
criterion = torch.nn.MSELoss(reduction='sum')

```
In [0]:
```epochs = 5
for epoch in range(epochs):
    optimizer.zero_grad()
    model_output = mymodel(x)
    loss = criterion(model_output, y)
    print(epoch, "{:.20f}".format(loss.item()))
    loss.backward()
    optimizer.step()

```
```

**forward()** is called **automatically** when the module instance is called, and it is passed the data coming from the previous layer.

```
In [0]:
```# -*- coding: utf-8 -*-
import torch
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        #h_relu = self.linear1(x).clamp(min=0)
        h_relu = torch.nn.functional.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(5):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x) # Construct the graph and return the output of forward
    # Compute and print loss; we could also call nn.functional.mse_loss(y_pred, y)
    # directly, since it is a non-parametric function
    loss = criterion(y_pred, y)
    print(t, "{:.20f}".format(loss.item()))
    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
sum([p.numel() for p in model.parameters()])

```
Out[0]:
```

Notice that we **never explicitly call the forward method** defined within the class. We always call the Module instance instead, since it takes care of running the registered hooks, while calling forward() directly silently ignores them.

We can also obtain the number of parameters of a model by summing numel() over model.parameters(), as in the last line of the code above.
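The difference between calling the instance and calling forward() directly can be observed with a forward hook. This is a minimal sketch on a single made-up Linear layer; the hook simply records each time it runs.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)
calls = []

# The hook appends the output shape every time it is executed.
handle = layer.register_forward_hook(lambda module, inp, out: calls.append(out.shape))

x = torch.randn(5, 3)
out_call = layer(x)             # goes through __call__, so the hook runs
out_forward = layer.forward(x)  # bypasses __call__, so the hook does not run
handle.remove()

print(len(calls))                          # 1
print(torch.equal(out_call, out_forward))  # True
```

Both calls compute the same output, but only the call on the instance triggers the hook.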

**torch.nn only supports mini-batches**. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample. For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width. If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
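A short sketch of adding the fake batch dimension is given below. The image size and channel counts are arbitrary values picked for the example.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(3, 32, 32)  # a single sample: nChannels x Height x Width
batch = image.unsqueeze(0)      # add a batch dimension: 1 x 3 x 32 x 32
out = conv(batch)

print(batch.shape)
print(out.shape)
```

The output keeps the leading batch dimension of size 1, which can be dropped again with squeeze(0) if needed.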

This is motivated by the computational speed-up it induces. To evaluate a module on a sample, both the module’s parameters and the sample have to be first copied into cache memory, which is fast but small. For any model of reasonable size, only a fraction of its parameters can be kept in cache, so a module’s parameters have to be copied there every time it is used. These memory transfers are slower than the computation itself. This is the main reason for batch processing: it cuts the number of copies of parameters to the cache down to one per module per batch. It also cuts down the use of Python loops, which are awfully slow.

An nn.Module is actually an object-oriented wrapper around the functional interface. It contains a number of utility methods, like eval() and parameters(), and it automatically creates the parameters of the modules for us.

We can use the functional interface whenever we want, but that requires us to define the weights by hand.
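The following sketch shows what defining the weights by hand looks like with the functional interface. The layer sizes and the zero-initialized biases are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D_in, H, D_out = 4, 3, 2

# With the functional interface we create and track the parameters ourselves.
# F.linear expects a weight of shape (out_features, in_features).
w1 = torch.randn(H, D_in, requires_grad=True)
b1 = torch.zeros(H, requires_grad=True)
w2 = torch.randn(D_out, H, requires_grad=True)
b2 = torch.zeros(D_out, requires_grad=True)

def forward(x):
    h = F.relu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)

x = torch.randn(5, D_in)
y_pred = forward(x)
y_pred.sum().backward()  # gradients land on the hand-made tensors

print(y_pred.shape, w1.grad.shape)
```

An optimizer would then have to be given the list [w1, b1, w2, b2] explicitly, since there is no module to collect the parameters for us.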

The difference between torch.nn and torch.nn.functional is very subtle. In fact, many torch.nn.functional have a corresponding equivalent in torch.nn.

In many code samples, we **use torch.nn.functional for simpler operations that have no trainable parameters or configurable parameters**. For example, ReLU has no learnable parameters, so it can be called directly in the forward() method through the torch.nn.functional interface.

Alternatively, in some sections, we use torch.nn.Sequential to compose layers from torch.nn only.

Both approaches are simple, and choosing between them is more a matter of coding style than of any major implementation difference. There isn't any performance difference.

If all layers are defined with nn.functional, then all variables, such as weights, bias, etc., need to be manually defined by the user, which is very inconvenient.

The functions in torch.nn.functional are just some arithmetical operations, not the layers which have trainable parameters such as weights and bias terms.

As a result, layers with parameters are usually initialized in __init__ to be shared by the whole module, while some connections or simple operations without parameters can be defined in forward to be used in forward propagation.

```
In [0]:
import torch
import torch.nn as nn
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Layers with parameters are created once, in __init__
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 12, kernel_size=3, padding=1, stride=1),
            nn.Conv2d(12, 24, kernel_size=3, padding=1, stride=1),
        )
    def forward(self, x):
        x = self.feature_extractor(x)
        return x

We can loop over the parameters of the module using model.parameters() and initialize each one with tensor functions such as exponential_(), uniform_(), fill_() etc.

Every module has an **.apply** method. We can call it on the module and pass it a function that handles the initialization. Whenever .apply is called on a module, the function is applied to each of its submodules recursively.

We could use the **torch.nn.init** module to initialize our parameters in a more practical way. Suppose we have a parameter named m and we need to initialize it using Xavier (Glorot) initialization; then we can call torch.nn.init.xavier_uniform_(m). Combining this with .apply, we can use torch.nn.init to initialize parameters of all sorts.

**apply(fn)** applies fn recursively to every submodule (as returned by .children()) as well as self. A typical use case is initializing the parameters of a model (see also torch.nn.init).

```
In [0]:
```import torch.nn as nn
def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        #m.weight.data.fill_(1.0)
        nn.init.xavier_uniform_(m.weight.data)
        print(m.weight.data)
net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

```
Out[0]:
```

```
In [0]:
```seq_net = nn.Sequential(
nn.Linear(2,4),
nn.Tanh(),
nn.Linear(4,1)
)
print(seq_net[0])
print(seq_net[1])
print(seq_net[0].weight)

```
```

**parameters(recurse=True)**

Returns an iterator over module parameters. This is typically passed to an optimizer.

```
In [0]:
```for param in net.parameters():
    print(param,":",param.size())

```
```

**named_parameters(prefix='', recurse=True)**

Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.

```
In [0]:
```for name, param in net.named_parameters():
    print(name,": ",param.size())

```
```

**buffers(recurse=True)**

Returns an iterator over module buffers. If recurse=True, then yields buffers of this module and all submodules. Otherwise, yields only buffers that are direct members of this module.
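Buffers only show up for modules that register them. As a sketch, the made-up network below adds a BatchNorm1d layer, which keeps its running statistics as buffers rather than parameters.

```python
import torch.nn as nn

# Hypothetical network: BatchNorm1d registers running statistics as buffers.
bn_net = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

param_names = [name for name, _ in bn_net.named_parameters()]
buffer_names = [name for name, _ in bn_net.named_buffers()]

print(param_names)   # learnable tensors, updated by the optimizer
print(buffer_names)  # state tensors, saved with the model but not optimized
```

Buffers are part of the module's state_dict, but they are not returned by parameters() and so are never passed to the optimizer.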

```
In [0]:
```for buf in net.buffers():
    print(type(buf.data), buf.size())

**named_buffers(prefix='', recurse=True)**

Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.

**children()**

Returns an iterator over immediate children modules.

```
In [0]:
```for child in net.children():
    print(child)

```
```

**named_children()**

Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.

```
In [0]:
```for name, module in net.named_children():
    print(name," : ", module)

```
```

**modules()**

Returns an iterator over all modules in the network.

```
In [0]:
```for idx, m in enumerate(net.modules()):
    print(idx, ' ', m)

```
```

**named_modules(memo=None, prefix='')**

Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself. Duplicate modules are returned only once.

```
In [0]:
```l = nn.Linear(2, 2)
net = nn.Sequential(l, l)
for idx, m in enumerate(net.named_modules()):
    print(idx,":",m)

```
```

```
In [0]:
```net.cpu()

```
Out[0]:
```

The main idea behind PyTorch’s GPU handling is that if an object (tensor, Variable, Parameter) is available on a device with id 0 , then after performing some operation (transformation) on it, the resultant object will also be stored on the same device 0. Further, **operations cannot be performed on two objects residing on different devices**, i.e., no a+b is allowed if a is on CPU and b is on GPU.

So, in order to run our training on GPU, it is enough to copy our model parameters and our input data to the GPU. This is done using the **.cuda()** method available for all tensors and modules. For a tensor, it is straightforward that this will copy the tensor to the GPU, but for a module (model), calling .cuda() will recursively copy all child modules and parameters to the GPU.

To check whether a GPU is available on a machine, we can verify that torch.cuda.is_available() returns True.
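A common way to combine these points is to pick the device once and move both the model and the data to it. The sketch below uses .to(device) rather than .cuda() so that it also runs on a machine without a GPU; the tiny Linear model is made up for illustration.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(2, 1).to(device)    # moves all parameters to the device
x = torch.randn(4, 2, device=device)  # create the input on the same device

out = model(x)  # the result lives on the same device as its operands
print(out.device == x.device)
```

Because the model and the input share a device, the forward pass works unchanged on either CPU or GPU.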

**cuda(device=None)**

```
In [0]:
```import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
x = torch.randn(1, 2).cuda()
y = torch.randn(1, 2).cuda() # target shape matches the model output (1 x 2)
class net(torch.nn.Module):
    def __init__(self):
        super(net,self).__init__()
        self.linear1=torch.nn.Linear(2,1)
        self.linear2=torch.nn.Linear(1,2)
    def forward(self,x):
        h_relu = torch.nn.functional.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred
mymodel = net()
if torch.cuda.is_available():
    mymodel.cuda()
optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.0009, momentum=0.9)
# Before training
print("predict (before training)", mymodel(x).data[0])
loss_col = []
epochs = 5
for epoch in range(epochs):
    optimizer.zero_grad()
    model_output = mymodel(x)
    loss = torch.nn.functional.mse_loss(model_output, y) # direct use of mse_loss from nn.functional
    loss_col.append(loss.item()) # store a plain Python float so that it can be plotted
    #print(epoch, "{:.20f}".format(loss.item()))
    loss.backward()
    optimizer.step()
    print("progress:", epoch, loss.data.item())
plt.plot(loss_col)
plt.show()
#print(list(mymodel.parameters()))
# After training
hour_var = torch.Tensor([1.0,4.0]).cuda()
y_pred = mymodel(hour_var)
print("predict (after training)", y_pred.data)

```
```

Sometimes it gets difficult to plot our loss with respect to every parameter. **As a result, we will store the loss in a list for each iteration.**

We simply create an empty list, calculate the loss, and then store the loss value in the list. We plot out the loss for each iteration.

```
In [0]:
```print(mymodel.linear1.weight.grad.size())
print(mymodel.linear1.weight.data[0])
print(mymodel.linear1.weight.data.norm()) # norm of the weight
print(mymodel.linear1.weight.grad.data.norm()) # norm of the gradients

```
```

You can save the models and load them back for inference. There are two ways of doing it:

- Save the **entire model** and load it back - The torch.save function saves a serialized object to disk. This function uses Python’s pickle utility for serialization. Models, tensors, and dictionaries of all kinds of objects can be saved using this function. Similarly, the torch.load function uses pickle’s unpickling facilities to deserialize pickled object files to memory. This function also lets us choose the device to load the data onto.
- Save **only model** parameters - Neural network modules as well as optimizers have the ability to save and load their internal state using .state_dict(). With this we can continue training from previously saved state dicts; we just need to call .load_state_dict(state_dict).

torch.save(model.state_dict(), '/results/model.pth')

torch.save(optimizer.state_dict(), '/results/optimizer.pth')

A **state_dict** is simply a Python dictionary object that maps each layer with learnable parameters (convolutional layers, linear layers, etc.) to its parameter tensor (the weights and biases accessed with model.parameters()).

**Optimizer objects** (torch.optim) also have a **state_dict**, which contains information about the optimizer’s state, as well as the hyperparameters used.

Because state_dict objects are Python dictionaries, they can be easily saved, updated, altered, and restored, adding a great deal of modularity to PyTorch models and optimizers.

```
In [0]:
```# Print model's state_dict
print("Model's state_dict:")
for param_tensor in mymodel.state_dict():
    print("\n",param_tensor, "\t", mymodel.state_dict()[param_tensor].size())
# Print optimizer's state_dict
print("\nOptimizer's state_dict:")
for var_name in optimizer.state_dict():
    print("\n",var_name, "\t", optimizer.state_dict()[var_name])

```
```

```
In [0]:
```# Save only model parameters
torch.save(mymodel.state_dict(), "/content/rahul.pth")
# Load only model parameters
# To re-read the parameters of the model, first we need to redefine the model once, then re-read the parameters
mymodel_n = net().cuda()
mymodel_n.load_state_dict(torch.load("/content/rahul.pth"))
print(mymodel_n)
# print(list(mymodel_n.parameters()))
# model.eval()
hour_var = torch.Tensor([1.0,4.0]).cuda()
y_pred_n = mymodel_n(hour_var)
print("predict (after training)", y_pred_n.data)

```
```

Note that you must **deserialize the saved state_dict** (with torch.load()) before you pass it to the load_state_dict() function. For example, you CANNOT load using model.load_state_dict(PATH).

**Save entire model**

The **disadvantage** of this approach is that the **serialized data is bound to the specific classes** and the exact directory structure used when the model is saved. The reason for this is that pickle does not save the model class itself. Rather, it saves a path to the file containing the class, which is used during load time. Because of this, your code can break in various ways when used in other projects or after refactors.

```
In [0]:
```# Save the entire model
torch.save(mymodel,"/content/rahule.pt")
# Load the entire model
mymodel_nf = torch.load("/content/rahule.pt")
# model.eval()
hour_var = torch.Tensor([1.0,4.0]).cuda()
y_pred_nf = mymodel_nf(hour_var)
print("predict (after training)", y_pred_nf.data)

**Checkpoint saving**

When saving using a general checkpoint (to be used for either inference or resuming training) we must save more than just the model’s state_dict. It is important to also save the optimizer’s state_dict, as this contains buffers and parameters that are updated as the model trains.

Other items that we may want to save are the epoch we left off on, the latest recorded training loss, external torch.nn.Embedding layers, etc.

To save multiple components, organize them in a dictionary and use torch.save() to serialize the dictionary. A common PyTorch convention is to save these checkpoints using the .tar file extension.

To **load the items, first initialize the model and optimizer**, then load the dictionary locally using torch.load(). From here, you can easily access the saved items by simply querying the dictionary as you would expect.

Remember that you must **call model.eval()** to set dropout and batch normalization layers to evaluation mode **before running inference**. Failing to do this will yield inconsistent inference results. If you wish to resume training, call model.train() to ensure these layers are in training mode.
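The whole checkpoint round trip can be sketched end to end as follows. The tiny Linear model, the single training step, the temporary file location and the dictionary keys are all assumptions made so that the example is self-contained.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Hypothetical model and optimizer, trained for one step so that
# the optimizer has some internal state (momentum buffers) to save.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Save everything needed to resume in a single dictionary.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.tar")
torch.save({
    'epoch': 1,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss.item(),
}, path)

# To load: first re-create the model and optimizer, then restore their state.
restored = nn.Linear(4, 2)
restored_opt = torch.optim.SGD(restored.parameters(), lr=0.01, momentum=0.9)
checkpoint = torch.load(path)
restored.load_state_dict(checkpoint['model_state_dict'])
restored_opt.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1
restored.train()  # or restored.eval() before running inference

print(start_epoch, torch.equal(restored.weight, model.weight))
```

Storing the loss as a plain float (loss.item()) keeps the checkpoint free of tensors that are still attached to the autograd graph.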

```
In [0]:
```# Checkpoint save
torch.save({
'epoch': epoch,
'model_state_dict': mymodel.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
...
}, PATH)

```
In [0]:
```# Checkpoint load
model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.eval()
# - or -
model.train()

**Multiple Models in One File**

When saving a model composed of multiple torch.nn.Modules, such as a GAN, a sequence-to-sequence model, or an ensemble of models, we follow the same approach as when saving a general checkpoint.

In other words, save a dictionary of each model’s state_dict and corresponding optimizer. As mentioned before, we can save any other items that may aid us in resuming training by simply appending them to the dictionary.

A common PyTorch convention is to save these checkpoints using the .tar file extension.

To load the models, first initialize the models and optimizers, then load the dictionary locally using torch.load(). From here, you can easily access the saved items by simply querying the dictionary as you would expect.

Remember that you must call model.eval() to set dropout and batch normalization layers to evaluation mode before running inference. Failing to do this will yield inconsistent inference results. If you wish to resume training, call model.train() to set these layers to training mode.

```
In [0]:
```torch.save({
'modelA_state_dict': modelA.state_dict(),
'modelB_state_dict': modelB.state_dict(),
'optimizerA_state_dict': optimizerA.state_dict(),
'optimizerB_state_dict': optimizerB.state_dict(),
...
}, PATH)

```
In [0]:
```modelA = TheModelAClass(*args, **kwargs)
modelB = TheModelBClass(*args, **kwargs)
optimizerA = TheOptimizerAClass(*args, **kwargs)
optimizerB = TheOptimizerBClass(*args, **kwargs)
checkpoint = torch.load(PATH)
modelA.load_state_dict(checkpoint['modelA_state_dict'])
modelB.load_state_dict(checkpoint['modelB_state_dict'])
optimizerA.load_state_dict(checkpoint['optimizerA_state_dict'])
optimizerB.load_state_dict(checkpoint['optimizerB_state_dict'])
modelA.eval()
modelB.eval()
# - or -
modelA.train()
modelB.train()

**Using Parameters from a Different Model**

Partially loading a model or loading a partial model are common scenarios when transfer learning or training a new complex model. Leveraging trained parameters, even if only a few are usable, will help to warmstart the training process and hopefully help your model converge much faster than training from scratch.

Whether you are loading from a partial state_dict, which is missing some keys, or loading a state_dict with more keys than the model that you are loading into, you can set the strict argument to False in the load_state_dict() function to ignore non-matching keys.

If you want to load parameters from one layer to another, but some keys do not match, simply change the name of the parameter keys in the state_dict that you are loading to match the keys in the model that you are loading into.
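Renaming keys and loading with strict=False can be sketched as below. The two models and the layer names encoder, backbone and head are hypothetical and chosen only to produce a key mismatch.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

# Two hypothetical models whose first layers match in shape but not in name.
model_a = nn.Sequential(OrderedDict([('encoder', nn.Linear(4, 3))]))
model_b = nn.Sequential(OrderedDict([
    ('backbone', nn.Linear(4, 3)),
    ('head', nn.Linear(3, 2)),
]))

state_a = model_a.state_dict()  # keys: encoder.weight, encoder.bias

# Rename the keys so that they match the names used in model_b.
renamed = OrderedDict(
    (key.replace('encoder.', 'backbone.'), value) for key, value in state_a.items()
)

# strict=False ignores the keys of model_b that the state_dict does not provide.
result = model_b.load_state_dict(renamed, strict=False)
print(result.missing_keys)     # keys model_b expected but did not receive
print(result.unexpected_keys)  # keys in the state_dict that model_b does not have
print(torch.equal(model_b.backbone.weight, model_a.encoder.weight))
```

The returned object reports which keys were skipped, which is a quick way to confirm that only the intended layers were left at their initial values.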

```
In [0]:
```torch.save(modelA.state_dict(), PATH)

```
In [0]:
```modelB = TheModelBClass(*args, **kwargs)
modelB.load_state_dict(torch.load(PATH), strict=False)

**Saving & Loading Model Across Devices**

- Save on GPU, Load on CPU
- Save on GPU, Load on GPU
- Save on CPU, Load on GPU

**Save on GPU, Load on CPU**

When loading a model on a CPU that was trained with a GPU, **pass torch.device('cpu') to the map_location argument** in the torch.load() function. In this case, the storages underlying the tensors are dynamically remapped to the CPU device using the map_location argument.

```
In [0]:
```# Save
torch.save(model.state_dict(), PATH)
# Load
device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))

**Save on CPU, Load on GPU**

When loading a model on a GPU that was trained and saved on CPU, set the map_location argument in the torch.load() function to cuda:device_id. This loads the model to a given GPU device. Next, be sure to call model.to(torch.device('cuda')) to convert the model’s parameter tensors to CUDA tensors. Finally, be sure to use the .to(torch.device('cuda')) function on all model inputs to prepare the data for the CUDA optimized model. Note that calling my_tensor.to(device) returns a new copy of my_tensor on GPU. It does NOT overwrite my_tensor. Therefore, remember to manually overwrite tensors: my_tensor = my_tensor.to(torch.device('cuda')).

```
In [0]:
```torch.save(model.state_dict(), PATH)
device = torch.device("cuda")
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location="cuda:0")) # Choose whatever GPU device number you want
model.to(device) # Model to GPU
# Make sure to call the code; input = input.to(device) on any input tensors that you feed to the model

**Save on GPU, Load on GPU**

When loading a model on a GPU that was trained and saved on GPU, simply convert the initialized model to a CUDA optimized model using model.to(torch.device('cuda')). Also, be sure to use the .to(torch.device('cuda')) function on all model inputs to prepare the data for the model. Note that calling my_tensor.to(device) returns a new copy of my_tensor on GPU. It does NOT overwrite my_tensor. Therefore, remember to manually overwrite tensors: my_tensor = my_tensor.to(torch.device('cuda')).

```
In [0]:
```# Save
torch.save(model.state_dict(), PATH)
# Load
device = torch.device("cuda")
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.to(device) # Model to GPU
# Make sure to call input = input.to(device) on any input tensors that you feed to the model

To inspect or modify the output and grad_output of a layer during forward or backward propagation, we introduce hooks.

A **hook is a function that can be registered on a module or a tensor**.

The hook can be a forward hook or a backward hook.

The **register_forward_hook** and **register_backward_hook** module methods are similar to the register_hook method on tensors: the registered function is called when the module propagates forward or backpropagates.

A **forward hook function is executed each time the forward propagation execution ends**.

The forward-propagating hook function has the following form:

**hook(module, input, output)** -> None,

and backpropagation has the following form:

**hook(module, grad_input, grad_output)** -> Tensor or None .

The hook function should not modify the input and output, and **should be removed promptly after use**; otherwise it runs on every pass and adds overhead.

Hook functions are mainly used to obtain intermediate results, such as the output of a middle layer or the gradient of a layer. These results could be captured inside the forward function, but adding that processing to forward would complicate its logic. In such cases a hook is the more appropriate technique.

Let's consider a scenario. You have a pre-trained model and need the output of one of its layers (not the last layer) as a feature, but you do not want to modify the original model definition file. You can use a hook function for this. The code of the implementation is given below.
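The feature-extraction scenario can be sketched in a few lines. The three-layer nn.Sequential here is a hypothetical stand-in for a pre-trained model, and save_output is an illustrative hook name:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained model.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

features = {}

def save_output(module, inputs, output):
    # Store a detached copy so the hook does not keep the graph alive.
    features["relu"] = output.detach()

# Register the hook on the intermediate layer only (index 1 is the ReLU).
handle = model[1].register_forward_hook(save_output)

x = torch.randn(3, 4)
prediction = model(x)

# Remove the hook once the feature has been collected.
handle.remove()

print(features["relu"].shape)
print(prediction.shape)
```

The model definition is never touched; the hook is attached from outside and removed through the handle that register_forward_hook returns.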

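Note that register_forward_hook() attaches the hook only to the module it is called on. To hook every submodule, register the hook on each one, for example by iterating over model.modules().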

```
In [0]:
``````
#!pip3 install torch torchvision
```

```
In [0]:
```import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1, 2).to(device)
y = torch.randn(1, 1).to(device)

class net(torch.nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(1, 2)

    def forward(self, x):
        h_relu = torch.nn.functional.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred

def forwardcall(module, input, output):
    print("\nInside forward of ", module, " : ", input, " : ", output)

def backwardcall(module, grad_input, grad_output):
    print("Inside backward of ", module, " : ", grad_input, " : ", grad_output)

mymodel = net()
print(mymodel)
hook1 = mymodel.linear1.register_forward_hook(forwardcall)
hook2 = mymodel.linear2.register_backward_hook(backwardcall)
mymodel.to(device)
optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.0009, momentum=0.9)
loss_col = []
epochs = 5
for epoch in range(epochs):
    optimizer.zero_grad()
    model_output = mymodel(x)
    loss = torch.nn.functional.mse_loss(model_output, y)
    loss_col.append(loss.item())  # store the scalar value, not the graph
    #print(epoch, "{:.20f}".format(loss.item()))
    loss.backward()
    optimizer.step()
hook1.remove() # removes the hook
hook2.remove() # removes the hook

```
```

The **current implementation of register_backward_hook does not have the above behavior for complex Modules** that perform many operations.

In some failure cases, grad_input and grad_output will only contain the gradients for a subset of the inputs and outputs.

For such Modules, you should use torch.Tensor.register_hook() directly on a specific input or output to get the required gradients. Recent PyTorch releases also provide register_full_backward_hook(), which is intended to replace register_backward_hook().
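A small sketch of torch.Tensor.register_hook() on an intermediate tensor. The tensors and the captured dictionary are made up for illustration:

```python
import torch

captured = {}

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
h = x * 2            # intermediate (non-leaf) tensor

# The hook receives the gradient of the final result with respect to h.
# dict.update returns None, so the gradient itself is left unmodified.
h.register_hook(lambda grad: captured.update(h_grad=grad.clone()))

y = (h ** 2).sum()
y.backward()

print(captured["h_grad"])   # dy/dh = 2 * h
print(x.grad)               # dy/dx = 2 * h * 2
```

Because the hook is attached to one specific tensor, it is clear which gradient it receives, regardless of how many operations the surrounding module performs.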

The train() and eval() methods change the mode of the model: put it in training mode while training and in evaluation mode while testing. This affects only those layers that behave differently during training and testing, such as Dropout and BatchNorm.

eval(): Sets the module in evaluation mode. This has an effect only on certain modules. See particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.

train(mode=True): Sets the module in training mode. This has an effect only on certain modules.

For example, the **loss is not required during evaluation** because the model's weights are already trained. We only need to run inference on new data.
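To illustrate the mode switch, here is a sketch using a standalone nn.Dropout layer. The input of ones is made up so the effect is easy to read:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                 # training mode: elements are randomly zeroed
train_out = drop(x)

drop.eval()                  # evaluation mode: dropout is a no-op
eval_out = drop(x)

print(drop.training)
print(train_out.unique())    # surviving elements are scaled by 1 / (1 - p)
print(torch.equal(eval_out, x))
```

Calling train() or eval() on a whole model sets the same flag on every submodule, so layers like this one switch behavior together.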

```
In [0]:
```# You should be able to check the training state of the model:
if mymodel.training:
    print("Model is in training mode")
else:
    print("Model is in Evaluation mode")

```
```

```
In [0]:
```linear = nn.Linear(2, 2)
print(linear.weight)
linear.to(torch.double)
print(linear.weight)
gpu1 = torch.device("cuda:0")
print(linear.to(gpu1, dtype=torch.half, non_blocking=True))
print(linear.weight)
cpu = torch.device("cpu")
linear.to(cpu)
print(linear.weight)

```
```

The StepLR scheduler decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.

**scheduler.step() only changes the learning rate; it does not perform an optimizer step. You also have to call optimizer.step(), whether or not you use a scheduler.** Since PyTorch 1.1.0, optimizer.step() should be called before scheduler.step().

Note that optim.param_groups (where optim is the optimizer instance) is a list of the parameter groups, which can have different learning rates. It is accessible as:

```
for g in optim.param_groups:
    g['lr'] = 0.001
```
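The step-wise decay can be observed directly by recording the learning rate at the start of each epoch. This sketch assumes a single dummy parameter and made-up values step_size=2 and gamma=0.5 so the pattern is easy to see:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

learning_rates = []
for epoch in range(6):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    optimizer.zero_grad()
    loss = (param - 1.0).pow(2).sum()
    loss.backward()
    optimizer.step()     # update the weights first
    scheduler.step()     # then advance the learning-rate schedule

print(learning_rates)
```

The learning rate stays constant for step_size epochs and is then multiplied by gamma.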

```
In [0]:
```import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1, 2).to(device)
y = torch.randn(1, 1).to(device)

class net(torch.nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.linear1 = torch.nn.Linear(2, 1)
        self.linear2 = torch.nn.Linear(1, 2)

    def forward(self, x):
        h_relu = torch.nn.functional.relu(self.linear1(x))
        y_pred = self.linear2(h_relu)
        return y_pred

mymodel = net()
print(mymodel)
mymodel.to(device)
#optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.0009, momentum=0.9)
optimizer = torch.optim.SGD(mymodel.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.99)
# Train flag can be updated with boolean
# to disable dropout and batch norm learning
mymodel.train(True)
# execute train step
loss_col = []
epochs = 50
for epoch in range(epochs):
    optimizer.zero_grad()
    model_output = mymodel(x)
    loss = torch.nn.functional.mse_loss(model_output, y)
    loss_col.append(loss.item())  # store the scalar value, not the graph
    print("\nIteration:", epoch, "Loss:", "{:.20f}".format(loss.item()))
    print("lr: ", "{:.20f}".format(optimizer.param_groups[0]['lr']))
    loss.backward()
    optimizer.step()   # update the weights first
    scheduler.step()   # then advance the learning-rate schedule
mymodel.train(False)
# run inference step
# CPU seed
torch.manual_seed(42)
# GPU seed
torch.cuda.manual_seed_all(42)

torch.nn.functional.relu(input, inplace=False) takes a tensor of any size as input, applies ReLU on each value to produce a result tensor of same size.

inplace indicates if the operation should modify the argument itself. This may be desirable to reduce the memory footprint of the processing.
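A short sketch of the difference between the two settings of inplace, using a made-up three-element tensor:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 2.0])

out = F.relu(x)                  # x is left unchanged
print(x, out)

same = F.relu(x, inplace=True)   # x itself is overwritten
print(x, same)
```

With inplace=True no second tensor is allocated, which is where the memory saving comes from. The original values of x are lost, however.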

- Inherit a class from torch.nn.Module - initialize: define modules, forward: build network
- Create an object (instance) - the graph is constructed when the class is initialized and forward is called
- Define optimizer and loss function
- Run loop - reset gradient, forward, backward, update step
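The steps listed above can be sketched end to end on a toy problem. The LineFit class and the y = 2x + 1 data are invented for this illustration and are not part of the original material:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. Inherit from nn.Module: define layers in __init__, wire them in forward.
class LineFit(nn.Module):          # hypothetical example model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# 2. Create an instance.
model = LineFit()

# 3. Define the loss function and the optimizer.
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Toy data for y = 2x + 1 (made up for this sketch).
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1

# 4. Run the loop: reset gradients, forward, backward, update.
losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```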

**Multi-processing:**

torch.nn.DataParallel

torch.multiprocessing

hogwild(async)

Solve the previous Celsius to Fahrenheit problem using PyTorch classes and functions. Hook into the backpropagation for each iteration and plot the change of values.

Finally, save the model, load the model, run it in evaluation mode, and evaluate the performance of the model for a given sample.

Check if a GPU is present and use the GPU to accelerate the entire model and training.

Pixels are the raw building blocks of an image. Every image consists of a set of pixels. There is no finer granularity than the pixel. Normally, a pixel is considered the “color” or the “intensity” of light that appears in a given place in our image.

Most pixels are represented in two ways:

- Grayscale/single channel
- Color - 3 Channel (RGB)

In a **grayscale image**, each pixel is a scalar value between 0 and 255, where **zero** corresponds to **“black”** and **255** to **“white”**. Values between 0 and 255 are varying shades of gray, where values closer to 0 are darker and values closer to 255 are lighter.

- Grayscale = Image Batch x Channel x Height x Width = Image Batch x 1 x Height x Width = Image Batch x H x W

Pixels in the RGB color space are represented by a list of three values: one value for the Red component, one for Green, and another for Blue.

- Color = Image Batch x C x H x W

**RGB and BGR Ordering** - It is important to note that some software, for example OpenCV, stores the color channels in reverse order. While we normally think in terms of Red, Green, and Blue, OpenCV actually stores the pixel values in Blue, Green, Red order (BGR).
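Converting between the two orderings amounts to reversing the channel axis. A minimal sketch with NumPy only, using a made-up one-pixel image rather than one loaded through OpenCV:

```python
import numpy as np

# A single-pixel image in BGR order: pure red is (B=0, G=0, R=255).
bgr = np.array([[[0, 0, 255]]], dtype=np.uint8)   # shape (H, W, C) = (1, 1, 3)

# Reversing the last axis swaps BGR to RGB.
rgb = bgr[..., ::-1]

print(rgb[0, 0])
```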

Data Loading

For convenience, PyTorch provides a number of utilities to load, preprocess and interact with datasets. These helper classes and functions are found in the torch.utils.data module.

PyTorch offers the torch.utils.data.DataLoader object which combines a data-set and a sampling policy to create an iterator over mini-batches.

Standard data-sets are available in torchvision.datasets, and they allow transformations to be applied to the images or the labels transparently.

The two major concepts here are:

```
A Dataset, which encapsulates a source of data,
A DataLoader, which is responsible for loading a dataset, possibly in parallel.
```

torch.utils.data.Dataset is an abstract class representing a dataset. The custom dataset should inherit Dataset and override the following methods:

```
__getitem__ to support the indexing such that dataset[i] can be used to get ith sample
__len__ so that len(dataset) returns the size of the dataset
```

New **datasets** are created by

- **subclassing** the torch.utils.data.Dataset class,
- implementing the __init__ function to actually load the data from media, a database, a URL, or generators,
- implementing the __getitem__ method to access a **single value** at a **certain index**, and
- overriding the __len__ method to return the number of samples in the dataset.

```
In [0]:
```import torch
from torch.utils.data import Dataset

class c2fdata(Dataset):
    def __init__(self):
        # Celsius values from -273 to 999 and their Fahrenheit equivalents
        self.celsius = torch.tensor([float(c) for c in range(-273, 1000)])
        self.fahrenheit = self.celsius * 1.8 + 32.0

    def __getitem__(self, index):
        return self.celsius[index], self.fahrenheit[index]

    def __len__(self):
        return self.celsius.shape[0]

mydataset = c2fdata()
print(mydataset[0])
print(len(mydataset))

```
```

Inside __init__ we would usually configure some paths or change the set of samples ultimately returned. In __len__, we specify the upper bound for the index with which __getitem__ may be called, and in __getitem__ we return the actual sample, which could be an image or an audio snippet.

To iterate over the dataset we could, in theory, simply have a for i in range loop and access samples via __getitem__.

```
In [0]:
```for i in range(len(mydataset)):
    print(mydataset[i])
    if i == 3:
        break

```
```

However, it would be much more convenient if the dataset implemented the iterator protocol itself, so we could simply loop over samples with for sample in dataset.

We are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing out on:

```
Batching the data
Shuffling the data
Load the data in parallel using multiprocessing workers.
```

Fortunately, this functionality is provided by the **DataLoader class**, available as **torch.utils.data.DataLoader**, an iterable that provides all these features.

A DataLoader object takes a **dataset object as argument** and a number of options that configure the way samples are retrieved. For example, it is possible to load samples in **parallel**, using multiple processes. For this, the DataLoader constructor takes a **num_workers** argument.

Note that DataLoaders always return batches, whose size is set with the **batch_size** parameter.

List of important arguments for **dataloader**:

```
In [0]:
```mydataloader = torch.utils.data.DataLoader(mydataset,batch_size=6,num_workers=2)
for i, batch in enumerate(mydataloader):
    print(i, batch)

```
In [0]:
```mydataloader = torch.utils.data.DataLoader(mydataset,batch_size=6, shuffle=True, num_workers=2, drop_last=True)
for i, batch in enumerate(mydataloader):
    print(i, batch)

```
In [0]:
```import math

class RangeDataset(torch.utils.data.Dataset):
    def __init__(self, start, end, step=1):
        self.start = start
        self.end = end
        self.step = step

    def __getitem__(self, index):
        value = self.start + index * self.step
        assert value < self.end
        return value

    def __len__(self):
        return math.ceil((self.end - self.start) / self.step)

```
In [0]:
```dataset = RangeDataset(0, 10)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2, drop_last=True)
for i, batch in enumerate(data_loader):
    print(i, batch)

```
```

Here, we set batch_size to 4, so returned tensors will contain exactly four values.

By passing shuffle=True, the index sequence with which data is accessed is permuted, such that individual samples will be returned in random order.

We also passed drop_last=True, so that if the number of samples left for the final batch of the dataset is less than the specified batch_size, that batch is not returned. This ensures that all batches have the same number of elements, which may be an invariant that we need.

Finally, we specified num_workers to be two, meaning data will be fetched in parallel by two processes. Once the DataLoader has been created, iterating over the dataset and thereby retrieving batches is simple and natural.
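The effect of drop_last can be checked by comparing batch sizes. This sketch assumes a ten-element TensorDataset and loads in the main process (num_workers=0) to keep it small:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# num_workers=0 keeps loading in the main process for this small sketch.
keep_last = DataLoader(dataset, batch_size=4, drop_last=False, num_workers=0)
drop_last = DataLoader(dataset, batch_size=4, drop_last=True, num_workers=0)

sizes_keep = [len(batch[0]) for batch in keep_last]
sizes_drop = [len(batch[0]) for batch in drop_last]

print(sizes_keep)
print(sizes_drop)
```

With ten samples and a batch size of four, the final incomplete batch of two samples is either returned or discarded depending on the flag.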

```
In [0]:
```import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class DiabetesDataset(Dataset):
    """ Diabetes dataset."""
    # Initialize your data, download, etc.
    def __init__(self):
        xy = np.loadtxt('./data/diabetes.csv.gz',
                        delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, 0:-1])
        self.y_data = torch.from_numpy(xy[:, [-1]])

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

dataset = DiabetesDataset()
train_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True, num_workers=2)
for epoch in range(2):
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data
        # Variable wrappers are no longer needed (since PyTorch 0.4); tensors are used directly
        # Run your training process
        print(epoch, i, "inputs", inputs, "labels", labels)

import torchvision.transforms as transforms

The dataset can take an optional argument **transform** so that any required processing can be applied on the samples.

Most neural networks expect images of a fixed size. Therefore, we will need to write some preprocessing code. For example:

```
Rescale: to scale the image
RandomCrop: to crop from image randomly. This is data augmentation.
ToTensor: to convert the numpy images to torch images (we need to swap axes).
```

Let’s say we want to rescale the shorter side of the image to 256 and then randomly crop a square of size 224 from it. i.e, we want to compose Rescale and RandomCrop transforms. **torchvision.transforms.Compose** is a simple callable class which allows us to do this.

We then pass this composition as an argument to dataset class.

```
In [0]:
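```# Rescale and RandomCrop are the custom transform classes described above;
# they must be defined before this cell is run.
scale = Rescale(256)
crop = RandomCrop(128)
composed = transforms.Compose([Rescale(256),RandomCrop(224)])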

import torchvision.datasets as datasets

We have seen how to use datasets, transforms and dataloader.

torchvision package provides some common datasets and transforms. You might not even have to write custom classes. One of the more generic datasets available in torchvision is **ImageFolder**. It assumes that images are organized in the following way:

root/ants/xxx.png

root/ants/xxy.jpeg

root/ants/xxz.png

.

.

.

root/bees/123.jpg

root/bees/nsdf3.png

root/bees/asd932_.png

where ‘ants’, ‘bees’ etc. are class labels.

```
In [0]:
```import torch
from torchvision import transforms, datasets

data_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # RandomSizedCrop was renamed to RandomResizedCrop
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
mydataset = datasets.ImageFolder(root='root', transform=data_transform)
dataset_loader = torch.utils.data.DataLoader(mydataset, batch_size=4, shuffle=True, num_workers=4)

The DataLoader actually has some reasonably sophisticated logic to determine how to collate individual samples returned from our dataset’s __getitem__ method into a batch, as returned by the DataLoader during iteration.

For example, if __getitem__ returns a dictionary, the DataLoader will aggregate the values of that dictionary into a single mapping for the entire batch, using the same keys.

This means that if the Dataset’s __getitem__ returns a dict(example=example, label=label), then the batch returned by the DataLoader will look something like dict(example=[example1, example2, ...], label=[label1, label2, ...]), i.e. unpacking the values of individual samples and re-packing them under a single key in the batch’s dictionary.

To override this behavior, you can pass a function argument for the collate_fn parameter to the DataLoader object.
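A sketch contrasting the default collation with a custom collate_fn. PairDataset and keep_as_list are hypothetical names introduced only for this illustration:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):        # hypothetical dataset for illustration
    def __len__(self):
        return 6

    def __getitem__(self, index):
        return {"example": torch.full((2,), float(index)), "label": index % 2}

def keep_as_list(samples):
    # Custom collate function: return the samples untouched as a list.
    return samples

default_batch = next(iter(DataLoader(PairDataset(), batch_size=3)))
custom_batch = next(iter(DataLoader(PairDataset(), batch_size=3,
                                    collate_fn=keep_as_list)))

print(default_batch["example"].shape, default_batch["label"])
print(type(custom_batch), len(custom_batch))
```

With the default behavior the per-sample dictionaries are merged into one dictionary of stacked tensors. With the custom function the batch is simply the list of individual sample dictionaries.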

**Note**: To create a dataset class for most image datasets, we read the CSV in __init__ but leave the reading of images to __getitem__. This is **memory efficient** because the **images are not all stored** in memory at once but read as required.

MNIST is a collection of 28 x 28 pixel images of handwritten digits.

Every image can be thought of as an array of numbers between 0 and 1 describing how dark each pixel is (intensity of pixel).

Every MNIST data point has two parts:

- Image of a handwritten digit
- Corresponding label (number between 0 and 9) representing the digit drawn in the image.

The label will be used to compare the predicted digit (by the model) with the true digit (given by the data).

Each of the 28 x 28 pixel images of handwritten digits can be represented as an array of 784 numbers each corresponding to respective pixels.

Every image in MNIST is of a handwritten digit between 0 and 9

So there are only 10 possible things that an image can be

For each image, we want to give the probabilities for it being each digit.

Hence, we want to normalize the evidences, so that all the evidences for one image sum up to 1 and each element is limited between 0 and 1.

Converting evidences into a probability distribution over 10 cases representing probabilities of input being in each class can mean:

- an 80% chance of being a 9
- a 5% chance of being an 8
- a bit of probability to all others because it is not 100% sure

**Choice of Activation**

Common choices for activation functions are sigmoid, tanh, and rectified linear unit (ReLU).

Converting the evidences into predicted probabilities is done using softmax function (a generalization of the sigmoid function). From this, we can get the predicted class of our handwritten digit by taking the label of the largest element.

Hence, we take the highest probability for being in a specific class, which will represent the prediction of the handwritten digit.
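A small sketch of this conversion. The ten evidence values are made up for illustration:

```python
import torch

# Made-up evidence (logits) for the 10 digit classes.
evidence = torch.tensor([0.5, -1.0, 0.0, 1.2, 0.3, -0.7, 0.1, 0.9, 2.0, 4.0])

probabilities = torch.softmax(evidence, dim=0)
predicted_digit = torch.argmax(probabilities).item()

print(probabilities.sum().item())
print(predicted_digit)
```

Softmax keeps the ordering of the evidence, so the class with the largest evidence also has the largest probability.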

ReLU has been proven to work quite well with deep architectures.

**Choice of Loss/Cost Function**

Any distance would work to determine the cost; even the ordinary Euclidean distance is fine. For classification problems, however, one distance, called the "cross-entropy", is more efficient. It is used as a loss function in neural networks which have softmax activations in the output layer.

The softmax loss layer also computes the multinomial logistic loss of the softmax of its inputs. It’s conceptually identical to a softmax layer followed by a multinomial logistic loss layer, but provides a more numerically stable gradient.

*Classification*

- Softmax or SoftmaxWithLoss
- HingeLoss

*Linear Regression*

- EuclideanLoss

*Attributes / Multi Classification*

- SigmoidCrossEntropyLoss

Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a 3-layer neural network.

The artificial neural network used in this post is composed of four layers. The input layer consists of 28 x 28 (=784) greyscale pixels which constitute the input data of the MNIST data set. This input is then passed through two fully connected hidden layers, each with 200 nodes, with the nodes utilizing a ReLU activation function. Finally, we have an output layer with ten nodes corresponding to the 10 possible classes of hand-written digits (i.e. 0 to 9). We will use a softmax output layer to perform this classification.

**Design thinking**

Input Layer: Each image = 28 x 28 pixels = 784 input units. The tensor shape is [1,1,28,28]. If, say, a batch of 4 images is fed, the shape is [4,1,28,28].

Hidden Layer 1: 200 hidden units (user-defined parameter), each with 784 weights

Relu1 Layer 1

Hidden Layer 2: 200 hidden units (user-defined parameter), each with 200 weights

Relu2 Layer 2

Output Layer: 10 output units (one for each digit), each with 200 weights

Softmax Layer

**Network Architecture**

A fully connected neural network layer is represented by the nn.Linear object, with the first argument in the definition being the number of nodes in layer l and the next argument being the number of nodes in layer l+1. As you can observe, the first layer takes the 28 x 28 input pixels and connects to the first 200 node hidden layer. Then we have another 200 to 200 hidden layer, and finally a connection between the last hidden layer and the output layer (with 10 nodes).

```
In [0]:
```#!pip3 install torch torchvision
import torch
import torch.nn as nn
import torch.nn.functional as F

class mymnist(nn.Module):
    def __init__(self):
        super(mymnist, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)

```
In [0]:
```net1 = mymnist().cuda()
print(net1)

```
```
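To check the layer sizes of the network above, the parameters can be counted. This sketch uses an nn.Sequential stand-in with the same layer sizes rather than the mymnist class itself, so it runs on its own:

```python
import torch.nn as nn

# Same layer sizes as the network above, written as a Sequential stand-in.
layers = nn.Sequential(
    nn.Linear(28 * 28, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

for name, parameter in layers.named_parameters():
    print(name, tuple(parameter.shape))

total = sum(p.numel() for p in layers.parameters())
print(total)
```

Each nn.Linear stores a weight matrix of shape (out_features, in_features) plus one bias per output unit, so the first layer alone holds 784 x 200 weights and 200 biases.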

**Gradient Descent and Optimizer**

We create a stochastic gradient descent optimizer and specify the learning rate (0.01) and a momentum of 0.9. The other ingredient we need to supply to our optimizer is all the parameters of our network. PyTorch makes supplying these parameters easy through the .parameters() method of the base nn.Module class that we inherit from in the mymnist class.

Next, we set our loss criterion to be the negative log likelihood loss – this combined with our log softmax output from the neural network gives us an equivalent cross entropy loss for our 10 classification classes.
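The equivalence between log softmax followed by NLLLoss and a single cross-entropy loss can be verified numerically. The logits and targets here are random, made-up values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)              # raw scores for 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])     # made-up class labels

nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
cross_entropy = nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), cross_entropy.item())
```

This is why a network that already ends in log_softmax is paired with NLLLoss, while a network that returns raw scores would be paired with CrossEntropyLoss.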

```
In [0]:
```# create a stochastic gradient descent optimizer
optimizer = torch.optim.SGD(net1.parameters(), lr=0.01, momentum=0.9)
# create a loss function
criterion = nn.NLLLoss().cuda()
#optimizer = torch.optim.Adam(net1.parameters(), lr=0.01)
#criterion = nn.CrossEntropyLoss()

**Data Loading**

We will use the torchvision utility to download the MNIST database, and for the transform parameter we pass a transforms.ToTensor() object, which converts our images to tensors automatically.

torchvision.datasets.MNIST(root, train=True, transform=None, target_transform=None, download=False)

Parameters:

- root (string) – Root directory of dataset where processed/training.pt and processed/test.pt exist.
- train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
- download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
- transform (callable, optional) – A function/transform that takes in an PIL image and returns a transformed version. E.g, transforms.RandomCrop
- target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

```
In [0]:
```#!pip3 install torch torchvision
import torch
import torchvision
import torchvision.datasets as dataset
import torchvision.transforms as transforms
BATCH_SIZE = 64
# torchvision.datasets.MNIST outputs a set of PIL images
# We transform them to tensors
transform = transforms.ToTensor()
# Load the training dataset by setting the parameters train to True and convert it to a tensor by placing a transform object in the argument transform.
trainset = torchvision.datasets.MNIST('/tmp', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
# Load the testing dataset by setting the parameters train False and convert it to a tensor by placing a transform object in the argument transform
testset = torchvision.datasets.MNIST('/tmp', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)

**Data Access**

- trainset[0] where first element is image and second element is label (tuple)
- trainset[0][0] is image
- trainset[0][1] is class of image (a torch.LongTensor in older torchvision versions, a plain Python int in recent ones)

```
In [0]:
```# Print out the tuple (image,label)
#print(trainset[0])
# Print out the first image
#print("\n",trainset[0][0])
# Print out first label or class
print(trainset[0][1])
print(trainset[0][0].type())
# The label may be a tensor or a plain int depending on the torchvision version
print(type(trainset[0][1]))

```
```

**Data Visualization**

```
In [0]:
```import matplotlib.pylab as plt
import numpy as np
print(trainset[3][0].shape)
plt.imshow(trainset[3][0].numpy().reshape(28,28),cmap='gray')
# int() works whether the label is a tensor or a plain Python int
label = int(trainset[3][1])
print(label)
plt.title('y= '+ str(label))
#print(trainset[30])

```
Out[0]:
```

Using previously defined trainloader iterator to iterate through the dataset.

```
In [0]:
```dataiter = iter(trainloader)
images, labels = next(dataiter) # Returns a batch of images and the corresponding labels
print('Labels: ', labels)
print('Batch shape: ', images.size())

```
```

We often want to display a grid of images to show samples of the training or testing images. torchvision.utils.make_grid arranges them into a single image, known as an **image grid**. It takes as input a 4D mini-batch Tensor of shape (B x C x H x W) or a list of images all of the same size.

matplotlib.pyplot.imshow() needs a 2D array, or a 3D array whose third dimension has size 3 or 4.

Thus if X is an array, it can have the following shapes and types:

```
MxN – values to be mapped (float or int)
MxNx3 – RGB (float or uint8)
MxNx4 – RGBA (float or uint8)
The value for each component of MxNx3 and MxNx4 float arrays should be in the range 0.0 to 1.0. MxN arrays are mapped to colors based on the norm (mapping scalar to scalar) and the cmap (mapping the normed scalar to a color).
```
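Since PyTorch image tensors are laid out as C x H x W, the axes have to be reordered to H x W x C before plotting. A minimal sketch with a random tensor of made-up size standing in for a make_grid result:

```python
import numpy as np
import torch

grid = torch.rand(3, 32, 242)                  # (C, H, W), made-up grid size
image = np.transpose(grid.numpy(), (1, 2, 0))  # (H, W, C) as expected by imshow

print(image.shape)
```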

```
In [0]:
```im = torchvision.utils.make_grid(images)
print(im.size())
plt.imshow(np.transpose(im.numpy(), (1, 2, 0)))
#plt.imshow(np.transpose(im.numpy(), (2, 1, 0)))

```
Out[0]:
```

**Training**

The outer training loop is the number of epochs, whereas the inner training loop runs through the entire training set in batch sizes which are specified in the code as batch_size.

The MNIST input data-set which is supplied in the torchvision package has the size (batch_size, 1, 28, 28) when extracted from the data loader – this 4D tensor is more suited to convolutional neural network architecture, and not so much our fully connected network. Therefore we need to flatten out the (1, 28, 28) data to a single dimension of 28 x 28 = 784 input nodes.

The **.view() function operates on PyTorch tensor to reshape them**. If we want to be agnostic about the size of a given dimension, we can use the “-1” notation in the size definition. So by using data.view(-1, 28*28) we say that the second dimension must be equal to 28 x 28, but the first dimension should be calculated from the size of the original data variable. In practice, this means that data will now be of size (batch_size, 784). We can pass a batch of input data like this into our network and the magic of PyTorch will do all the hard work by efficiently performing the required operations on the tensors.
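The reshaping described here can be checked on a dummy batch. The batch of zeros is a stand-in for real MNIST images:

```python
import torch

batch = torch.zeros(64, 1, 28, 28)   # (batch_size, channels, height, width)

flat = batch.view(-1, 28 * 28)       # -1 lets PyTorch infer the batch dimension
print(flat.shape)
```

Because the first dimension is inferred, the same line also handles a smaller final batch without changes.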

On the next line, we run optimizer.zero_grad() – this zeroes / resets all the gradients in the model, so that it is ready to go for the next back propagation pass. In other libraries this is performed implicitly, but in PyTorch we have to remember to do it explicitly.
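Why the reset matters can be seen on a single made-up parameter: without zeroing, each backward pass adds to the stored gradient.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).sum().backward()
first = w.grad.clone()

(3 * w).sum().backward()        # no reset: the new gradient is added
accumulated = w.grad.clone()

optimizer.zero_grad()           # reset before the next backward pass
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```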

The next line is where we pass the input data batch into the model. This will actually call the forward() method in our mymnist class.

After this line is run, the tensor outputs will now hold the log softmax output of our neural network for the given data batch.

That’s one of the great things about PyTorch: we can activate whatever Python debugger we normally use and instantly get a gauge of what is happening in our network. This is opposed to other deep learning libraries such as TensorFlow and Keras, which require elaborate debugging sessions to be set up before you can check out what your network is actually producing.

The subsequent line is where we get the negative log likelihood loss between the output of our network and our target batch data.

```
In [0]:
```# Hyper-parameters
num_epochs = 1
learning_rate = 0.001  # not used here: the optimizer above was created with lr=0.01
# Train the model
total_step = len(trainloader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(trainloader):
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).cuda() # Flatten the data (n, 1, 28, 28)-> (n, 784)
        labels = labels.cuda()
        # Forward pass
        outputs = net1(images)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (i+1) % 100 == 0:
            print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, total_step, loss.item()))

```
```

**Testing**

```
In [0]:
```# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in testloader:
        images = images.reshape(-1, 28*28).cuda() # Flatten the data (n, 1, 28, 28)-> (n, 784)
        labels = labels.cuda()
        outputs = net1(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))
# Save the model checkpoint
torch.save(net1.state_dict(), 'model.ckpt')

```
```

- View the results of the parameters for each class after the training for the above example
- Plot the loss and accuracy on the test and validation data for the above example
- Train the network using the sigmoid, tanh and ReLU activation functions separately and plot the results for comparison.
- Extend the above example to input one test sample at a time and output the prediction
- Extend the above to train and predict for databases Fashion-MNIST, EMNIST
- Use a generic data loader where the samples are arranged in directory structure