Abstract- This project addresses the problem of fine-grained recognition: recognizing subordinate categories by classifying images across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. It builds on the insight that images with similar poses can be discovered automatically for fine-grained classes in the same domain. The appearance descriptors are learned with a deep convolutional neural network. Our approach requires only image-level class labels, without any part annotations or segmentation masks, which may be costly to obtain. TensorFlow is used as the tool for an efficient implementation. Detailed explanations of the implementation, including TensorBoard aspects, are provided.
A neural network is well-suited to problems in which the training data corresponds to noisy, complex sensor data, such as input from cameras and microphones. In addition, it is applicable to problems for which more symbolic representations are often used. In general, it is appropriate for problems with the following characteristics:
Deep learning (deep structured learning, hierarchical learning, or deep machine learning) is based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Deep neural networks, however, are often much harder to train. That's unfortunate, since we have good reason to believe that if we could train deep nets they'd be much more powerful. Furthermore, upon reflection, it's strange to use networks with fully-connected layers to classify images. The reason is that such a network architecture does not take into account the spatial structure of the images. For instance, it treats input pixels which are far apart and close together on exactly the same footing.
The convolutional neural network (CNN) presents a special architecture which is particularly well-adapted to classifying images. It was inspired by biological processes and is a variation of the multilayer perceptron designed to require minimal preprocessing. This architecture makes convolutional networks fast to train, which in turn helps us train deep, many-layer networks for classifying images.
TensorFlow is used to accomplish the mathematical computation with a directed graph of nodes and edges:
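As a minimal illustration (assuming the TensorFlow 1.x graph API; the op names here are arbitrary), each call below adds a node to the graph, and the returned tensors are the edges connecting the nodes:

In [ ]:
import tensorflow as tf

# Two constant nodes and a multiply node with two incoming edges.
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
product = tf.multiply(a, b, name="product")

# Nothing is computed until the graph is launched in a session.
with tf.Session() as sess:
    print(sess.run(product))  # 12.0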
Two datasets are investigated in this work:
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. http://www.cs.toronto.edu/%7Ekriz/cifar.html.
Yahoo! Shopping Shoes Image Content, version 1.0. This dataset contains 107 folders, each corresponding to a type and brand of shoe, for instance classes/aerosoles_sandals, classes/aetrex_sandals, ... . Each folder contains some number of images of shoes of the respective type and brand, 5357 images in total. This dataset must be cleaned to extract training and test images, as sketched below. http://webscope.sandbox.yahoo.com.
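One simple way to clean this dataset is a per-class random split; the sketch below is a hypothetical helper (the classes/ layout follows the dataset description, while the 80/20 split ratio is an assumption):

In [ ]:
import os
import random

def split_dataset(root="classes", test_fraction=0.2, seed=0):
    """Randomly split each class folder into training and test file lists."""
    random.seed(seed)
    train, test = [], []
    for class_name in sorted(os.listdir(root)):
        folder = os.path.join(root, class_name)
        if not os.path.isdir(folder):
            continue
        images = sorted(os.path.join(folder, f) for f in os.listdir(folder))
        random.shuffle(images)
        n_test = int(len(images) * test_fraction)
        test.extend((path, class_name) for path in images[:n_test])
        train.extend((path, class_name) for path in images[n_test:])
    return train, test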
The report is currently in beta. I welcome notification of typos, bugs, minor errors, and major misconceptions. Please drop me a line at Ali.Miraftab@utsa.edu if you spot such an error.
In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer. For instance, for the 640×480 pixel images we've been using, this means our network has 307200 input neurons. We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image.
The window on the input pixels is called the local receptive field for the hidden neuron.
So far, the local receptive field has been shown moving by one pixel at a time. In fact, sometimes a different step length is used. This movement step is called the stride, and it is usually set to one. However, different stride lengths can be investigated by using validation data to find the best performance.
And so on, building up the first hidden layer.
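The sliding of the local receptive field is exactly what TensorFlow's tf.nn.conv2d implements; a minimal sketch follows (the 28×28 input size and single feature map are illustrative assumptions):

In [ ]:
import tensorflow as tf

# NHWC input: a batch of one 28x28 single-channel image (sizes illustrative).
image = tf.placeholder(tf.float32, [1, 28, 28, 1])

# One 5x5 filter: the shared weights of a single feature map.
weights = tf.Variable(tf.truncated_normal([5, 5, 1, 1], stddev=0.1))

# strides=[1, 1, 1, 1] moves the receptive field one pixel at a time;
# strides=[1, 2, 2, 1] would use a stride length of two instead.
hidden = tf.nn.conv2d(image, weights, strides=[1, 1, 1, 1], padding="VALID")
# With VALID padding the first hidden layer is 24x24: (28 - 5 + 1) per side.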
Each hidden neuron has a bias and weights connected to its local receptive field. A CNN uses the same weights and bias for each of the local receptive fields and the corresponding hidden neurons. In other words, for the j,kth hidden neuron, the output is:
$$\begin{eqnarray} \sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m} a_{j+l, k+m} \right). \tag{1}\end{eqnarray}$$Here, σ is the neural activation function - perhaps the sigmoid function. b is the shared value for the bias. $w_{l, m}$ is a $5×5$ array of shared weights. And, finally, we use $a_{x, y}$ to denote the input activation at position x,y.
This implies that all the neurons in the first hidden layer detect exactly the same feature. As a matter of fact, CNNs are well adapted to the translation invariance of images: move a picture of a cat (say) a little ways, and it's still an image of a cat.
For this reason, we sometimes call the map from the input layer to the hidden layer a feature map. We call the weights defining the feature map the shared weights. And we call the bias defining the feature map in this way the shared bias. The shared weights and bias are often said to define a kernel or filter. In the literature, people sometimes use these terms in slightly different ways.
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network.
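To make the saving concrete (the layer sizes here are illustrative, not taken from the model above): with a $5×5$ local receptive field, each feature map needs only $25$ shared weights plus one shared bias, so $20$ feature maps require $20 \times 26 = 520$ parameters in total. By contrast, a fully connected layer from a $32×32$ input to just $30$ hidden neurons would already need $1024 \times 30 + 30 = 30750$ parameters.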
The network structure I've described so far can detect just a single kind of localized feature. To do image recognition we'll need more than one feature map. And so a complete convolutional layer consists of several different feature maps:
In addition to the convolutional layers just described, convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of (say) 2×2 neurons in the previous layer.
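For max-pooling, the most common choice, each unit simply outputs the maximum activation in its 2×2 region; a minimal sketch with tf.nn.max_pool follows (the 24×24 feature-map size is an illustrative assumption):

In [ ]:
import tensorflow as tf

# A single 24x24 feature map from a convolutional layer (size illustrative).
feature_map = tf.placeholder(tf.float32, [1, 24, 24, 1])

# A 2x2 window (ksize) moved in steps of two (strides) gives
# non-overlapping regions, halving each dimension: 24x24 -> 12x12.
pooled = tf.nn.max_pool(feature_map,
                        ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1],
                        padding="VALID")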
The model proposed for CIFAR-10 by Alex Krizhevsky consists of alternating convolutions and nonlinearities, followed by fully connected layers and, at the end, a softmax classifier. The model implemented on the TensorFlow webpage consists of two convolutional layers and two fully connected layers, with a peak performance of about 86% accuracy within a few hours of training time on a GPU. Here we try to implement three convolutional layers and improve the performance. In parallel, the sophisticated framework for implementing CNNs is studied in detail. The majority of the material can be found on the TensorFlow webpage; however, this report tries to put the explanations in one place and modify the framework for its own purposes.
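As a rough sketch of the intended three-convolutional-layer variant (the filter counts, kernel sizes, and the conv_layer helper are assumptions for illustration, not the tutorial's exact code; the two fully connected layers are collapsed into one here for brevity):

In [ ]:
import tensorflow as tf

def conv_layer(x, channels_in, channels_out, name):
    """5x5 convolution + ReLU + 2x2 max-pooling (sizes are assumptions)."""
    with tf.variable_scope(name):
        w = tf.Variable(tf.truncated_normal([5, 5, channels_in, channels_out], stddev=0.05))
        b = tf.Variable(tf.zeros([channels_out]))
        h = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME") + b)
        return tf.nn.max_pool(h, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")

images = tf.placeholder(tf.float32, [None, 32, 32, 3])  # CIFAR-10 input
h1 = conv_layer(images, 3, 64, "conv1")   # 32x32 -> 16x16
h2 = conv_layer(h1, 64, 64, "conv2")      # 16x16 -> 8x8
h3 = conv_layer(h2, 64, 64, "conv3")      # 8x8 -> 4x4, the added third layer

flat = tf.reshape(h3, [-1, 4 * 4 * 64])
w_fc = tf.Variable(tf.truncated_normal([4 * 4 * 64, 10], stddev=0.05))
b_fc = tf.Variable(tf.zeros([10]))
logits = tf.matmul(flat, w_fc) + b_fc     # fed to the softmax classifier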
Note: the lack of a GPU on the current machines makes training take considerably longer.
In [ ]:
# Randomly distort brightness and contrast to augment the training data.
distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
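For context, these distortion steps sit inside a larger pipeline in the tutorial's input code; a self-contained sketch follows (the 24×24 crop size follows the TensorFlow CIFAR-10 tutorial, the placeholder stands in for the decoded input image, and per_image_standardization was named per_image_whitening in older releases):

In [ ]:
import tensorflow as tf

# Stand-in for the decoded 32x32x3 CIFAR-10 image from the input reader.
reshaped_image = tf.placeholder(tf.float32, [32, 32, 3])

distorted_image = tf.random_crop(reshaped_image, [24, 24, 3])       # random 24x24 patch
distorted_image = tf.image.random_flip_left_right(distorted_image)  # random horizontal flip
distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
# Normalize to zero mean and unit variance.
float_image = tf.image.per_image_standardization(distorted_image)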
Note: running TensorBoard remotely requires an SSH tunnel; one of the simplest ways is:
$ ssh -N -f -L localhost:local_port:localhost:remote_port root@10.241.8.101
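With the tunnel in place, TensorBoard running on the remote machine (it serves on port 6006 by default) becomes reachable locally at http://localhost:local_port.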
The detailed graph of the model is provided by TensorBoard as follows (the default directory for storing the graph information is /tmp/cifar10_train):
tensorboard --logdir=/tmp/cifar10_train
TensorBoard has been used to monitor the ongoing training. The following chart provides a guideline for implementing TensorBoard:
Visualizing scalars in TensorBoard with scalar_summary:
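A minimal sketch of logging a scalar (written with the TensorFlow 1.x names tf.summary.scalar / tf.summary.FileWriter; older releases spelled these tf.scalar_summary and tf.train.SummaryWriter; total_loss is a stand-in for the model's loss tensor):

In [ ]:
import tensorflow as tf

total_loss = tf.Variable(0.0)  # stand-in for the model's loss tensor
tf.summary.scalar("total_loss", total_loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Writing the graph here is what makes it browsable in TensorBoard.
    writer = tf.summary.FileWriter("/tmp/cifar10_train", sess.graph)
    summary = sess.run(merged)
    writer.add_summary(summary, global_step=0)
    writer.close()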
Variation of the weights and biases of the model:
In [ ]:
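A minimal sketch of how weight and bias variation is typically recorded (the shapes and summary tags are assumptions, written with the TensorFlow 1.x name tf.summary.histogram, formerly tf.histogram_summary):

In [ ]:
import tensorflow as tf

# Hypothetical conv1 parameters; shapes are illustrative.
weights = tf.Variable(tf.truncated_normal([5, 5, 3, 64], stddev=0.05), name="conv1_weights")
biases = tf.Variable(tf.zeros([64]), name="conv1_biases")

# Histogram summaries let TensorBoard plot how these parameter
# distributions drift over training steps.
tf.summary.histogram("conv1/weights", weights)
tf.summary.histogram("conv1/biases", biases)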
The variation of the total loss:
The cross-entropy component of the total loss:
Bias gradient of the first convolutional layer (conv1):
In [ ]: