Published by: Amit Nikhade
June 12, 2021
Advancement in computer vision
Computer vision is a field of artificial intelligence that helps machines see and interpret this beautiful world. It has driven remarkable progress in AI: from pattern recognition to human pose estimation, and from robot navigation to solid-state physics, it has a wide range of useful applications. Using computer vision and deep learning, we can give machines the ability to see and understand images, videos, and more. But the pace of change keeps accelerating.
Convolutions have made outstanding contributions to computer vision and deep learning in medical research, business, technology, and many other areas. Still, technology never stands still; it has to keep being renewed.
Convolutional neural networks (CNNs) were popularized in the late 1980s by Yann LeCun. The mechanism behind a convolution is a sum of products of pixel values and their weights. A CNN mainly focuses on extracting features from the image such as corners, edges, and color gradients. It typically comprises three kinds of layers: convolution, pooling, and fully connected layers. AlexNet (2012) is considered one of the most influential papers ever published in computer vision. In 2015, Microsoft Research built a very deep CNN, ResNet, that outperformed AlexNet; its ImageNet models were up to 152 layers deep.
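To make the "sum of products" idea concrete, here is a minimal pure-Python sketch of a single-channel convolution with no padding and stride 1. The function name `conv2d_valid`, the toy image, and the kernel values are illustrative assumptions, not part of any library. As in deep learning frameworks, the kernel is not flipped (strictly speaking this is cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Each output value is the sum of products of the pixels under the
    kernel and the corresponding kernel weights.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 3x3 "image" whose intensity grows by 1 from left to right
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# Hypothetical 2x2 kernel that responds to horizontal intensity changes
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[2, 2], [2, 2]]
```

Every output is 2 because each kernel position sees the same left-to-right increase of 1 in both of its rows.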
Natural language processing (NLP) is the sub-branch of artificial intelligence that helps humans and computers interact with each other; technically, it is the set of techniques that give computers the ability to understand human language and derive its meaning. NLP is also a toolkit for handling text data with ease. Popular applications include sentiment analysis, text classification, speech-to-text, and neural machine translation.
But where exactly does the CNN fall short?
2017 was the year researchers from Google Brain, Google Research, and the University of Toronto introduced the transformer. It abruptly took NLP to the next level and transformed sequence-to-sequence (seq2seq) modeling. Earlier seq2seq models (RNNs such as LSTM and GRU) were used to transform a sequence from one form to another, but they suffered from problems such as vanishing gradients, and they processed a sequence word by word, which was slow and an obstacle to parallelization. Attention plays a crucial role in the transformer architecture by extracting salient features from the input data. I won't go much deeper into transformers here; hopefully you are already familiar with the architecture, since that is probably what brought you to the vision transformer.
Various models have been built upon the transformer architecture, such as BERT, GPT, Transformer-XL, and XLNet, along with many others that deliver state-of-the-art performance.
Recently, a slight change to the encoder architecture of Google's BERT made it faster and more accurate than before.
You might be wondering when the vision transformer comes into play. To understand it, you need an in-depth understanding of how the transformer works, because the vision transformer is only a slight modification of the transformer.
22 Oct 2020
The paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale" proposed simply splitting the image into fixed-size patches, linearly embedding each patch, adding position embeddings, and feeding the resulting sequence of vectors in parallel to a standard transformer encoder.
As shown in the diagram above, each image in the dataset is split into n patches of the same size s.
We'll implement the vision transformer model using PyTorch in Python.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Conv2d
Define the device object to use:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Define the parameters: image size, patch size, embedding dimension, multilayer perceptron dimension, number of attention heads, number of layers, attention dropout rate, and number of classes.
img_s = 224
patch_s = 16
emb_dim = 128
mlp_dim = 128
num_heads = 16
num_layers = 5
atten_dropout = 0.0
num_classes = 2
Build the vision transformer class
The VIT class accepts these parameters and starts by splitting the resized images into patches of the defined patch size. The number of patches is calculated by
n_patches = (image_height *image_width)/(patch_height*patch_width)
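As a quick sanity check of the formula, here is a small sketch that plugs in the values used in this post (a 224 x 224 image with 16 x 16 patches). The helper name `num_patches` is only illustrative, and the divisibility check is an added assumption that the image splits evenly into patches.

```python
def num_patches(image_height, image_width, patch_height, patch_width):
    """Number of non-overlapping patches that tile the image exactly."""
    if image_height % patch_height or image_width % patch_width:
        raise ValueError("image size must be divisible by patch size")
    return (image_height // patch_height) * (image_width // patch_width)


n = num_patches(224, 224, 16, 16)
print(n)      # 196 patches, i.e. a 14 x 14 grid
print(n + 1)  # 197 tokens once the class token is prepended
```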
The next step is embedding the patches. You might be wondering why I used a convolution for embedding patches when my title refutes convolution. A convolution extracts features with suitable inductive biases, which can improve performance. In such a hybrid setup, the CNN extracts the low-level features of an image and the ViT relates high-level concepts. A ResNet or EfficientNet trimmed to certain layers can also serve as the feature extractor. Patch embedding can also be done through a linear layer, which is the more common choice. Next, we prepend the class token to the sequence of patches, add the positional embeddings, and pass the result through dropout for regularization. Adding positional embeddings helps the model understand the structure of the image as well as the location of each patch.
The rest of the flow is the same as in the original transformer encoder.
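One point worth seeing in code: a convolution whose kernel size and stride both equal the patch size applies one shared linear projection to every non-overlapping patch, so it produces the same tokens as a linear layer applied to the flattened patches. The sketch below checks this numerically; the variable names and the weight-copying step are illustrative assumptions, not code from the original repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch, emb = 16, 128

conv = nn.Conv2d(3, emb, kernel_size=patch, stride=patch)
linear = nn.Linear(3 * patch * patch, emb)

# Copy the convolution weights into the linear layer so that both
# layers compute exactly the same projection.
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(emb, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 3, 224, 224)

# Path 1: convolution, then flatten the 14 x 14 grid into a sequence
conv_tokens = conv(x).flatten(2).transpose(1, 2)  # (2, 196, 128)

# Path 2: cut out the patches explicitly, flatten each one, apply the linear layer
b, c, h, w = x.shape
patches = x.reshape(b, c, h // patch, patch, w // patch, patch)
patches = patches.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * patch * patch)
linear_tokens = linear(patches)                   # (2, 196, 128)

print(conv_tokens.shape)
print(torch.allclose(conv_tokens, linear_tokens, atol=1e-4))
```

Both paths yield one 128-dimensional token per patch, so choosing between them is mostly a matter of convenience.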
class VIT(nn.Module):
    def __init__(self, img_size=(img_s, img_s), patch_size=(patch_s, patch_s),
                 emb_dim=emb_dim, mlp_dim=mlp_dim, num_heads=num_heads,
                 num_layers=num_layers, n_classes=num_classes,
                 dropout_rate=0., at_d_r=atten_dropout):
        super(VIT, self).__init__()
        ih, iw = img_size
        ph, pw = patch_size
        num_patches = (ih // ph) * (iw // pw)
        # Learnable class token prepended to every patch sequence
        self.cls_tokens = nn.Parameter(torch.rand(1, 1, emb_dim))
        # Kernel size == stride == patch size, so each patch gets one embedding
        self.patch_embed = nn.Conv2d(in_channels=3, out_channels=emb_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable position embeddings for the class token plus every patch
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, emb_dim))
        self.dropout = nn.Dropout(dropout_rate)
        # Stack of num_layers encoder blocks
        self.enco = nn.ModuleList(
            [transencoder(emb_dim, mlp_dim, num_heads, at_d_r) for _ in range(num_layers)]
        )
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(emb_dim),
            nn.Linear(emb_dim, n_classes)
        )

    def forward(self, x):
        x = self.patch_embed(x)        # (b, emb_dim, h, w)
        x = x.permute(0, 2, 3, 1)      # (b, h, w, emb_dim)
        b, h, w, c = x.shape
        x = x.reshape(b, h * w, c)     # (b, num_patches, emb_dim)
        cls_token = self.cls_tokens.repeat(b, 1, 1)
        x = torch.cat([cls_token, x], dim=1)
        embeddings = x + self.pos_embed
        enc = self.dropout(embeddings)
        # Pass the sequence through each encoder block in turn
        for layer in self.enco:
            enc = layer(enc)
        # Classify from the output at the class-token position
        mlp_head = self.mlp_head(enc[:, 0])
        return mlp_head
Next is the transformer encoder. This encoder layer can be stacked n times so that each layer extracts new information, which gives the transformer its predictive power.
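class transencoder(nn.Module):
    def __init__(self, emb_dim, mlp_dim, num_heads, at_d_r):
        super(transencoder, self).__init__()
        # Separate layer norms for the attention and MLP sub-blocks
        self.norm1 = nn.LayerNorm(emb_dim, eps=1e-6)
        self.norm2 = nn.LayerNorm(emb_dim, eps=1e-6)
        self.mha = mha(emb_dim, num_heads, at_d_r)
        self.mlp = Mlp(emb_dim, mlp_dim)

    def forward(self, x):
        # Pre-norm self-attention with a residual connection
        n = self.norm1(x)
        attn = self.mha(n, n, n)
        output = attn + x
        # Pre-norm MLP with a residual connection
        n2 = self.norm2(output)
        ff = self.mlp(n2)
        out = ff + output
        return out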
The transformer encoder consists of multi-head self-attention and a multilayer perceptron. Multi-head attention plays the crucial role of attending to the input sequence. Rather than computing attention only once, the multi-head mechanism runs scaled dot-product attention multiple times in parallel.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key
— According to the paper
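The quoted definition can be sketched in a few lines for a single head. This is a minimal illustration, assuming the scaled dot product as the compatibility function; the function name and the toy tensor sizes are arbitrary choices for the example.

```python
import math

import torch


def scaled_dot_product_attention(query, key, value):
    """Return the attention output and the attention weights."""
    d_k = query.size(-1)
    # Compatibility of every query with every key, scaled by sqrt(d_k)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Softmax turns the scores of each query into weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted sum of the value vectors
    return weights @ value, weights


torch.manual_seed(0)
tokens = torch.randn(1, 5, 8)  # (batch, sequence length, embedding dim)

# Self-attention: queries, keys, and values all come from the same tokens
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape)          # torch.Size([1, 5, 8])
print(weights.sum(dim=-1))   # each row sums to 1 (up to rounding)
```

Multi-head attention repeats this computation on several lower-dimensional projections of the same tokens and concatenates the results.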
class mha(nn.Module):
    def __init__(self, h_dim, n_heads, at_d_r):
        super().__init__()
        assert h_dim % n_heads == 0, "h_dim must be divisible by n_heads"
        self.num_heads = n_heads
        # Separate projections for queries, keys, and values, plus an output projection
        self.q_linear = nn.Linear(h_dim, h_dim, bias=False)
        self.k_linear = nn.Linear(h_dim, h_dim, bias=False)
        self.v_linear = nn.Linear(h_dim, h_dim, bias=False)
        self.out_linear = nn.Linear(h_dim, h_dim)
        self.dropout = nn.Dropout(at_d_r)

    def forward(self, q, k, v):
        batches = q.size(0)
        q1 = self.q_linear(q)
        k1 = self.k_linear(k)
        v1 = self.v_linear(v)
        # Split the embedding dimension into heads and stack the heads along the batch dimension
        q2 = torch.cat(torch.chunk(q1, self.num_heads, dim=2), dim=0)
        k2 = torch.cat(torch.chunk(k1, self.num_heads, dim=2), dim=0)
        v2 = torch.cat(torch.chunk(v1, self.num_heads, dim=2), dim=0)
        # Scaled dot-product attention, computed for all heads at once
        outputs = torch.bmm(q2, k2.transpose(2, 1))
        outputs = outputs / (k2.size(-1) ** 0.5)
        outputs = F.softmax(outputs, dim=-1)
        outputs = self.dropout(outputs)
        outputs = torch.bmm(outputs, v2)
        # Undo the stacking: split back into per-head chunks and concatenate their features
        outputs = torch.cat(outputs.split(batches, dim=0), dim=2)
        # The residual connection and layer norm are applied by the encoder block
        return self.out_linear(outputs)
A simple neural network head produces the final binary classification result. A perceptron is a linear classifier that separates its input into two categories. The multilayer perceptron used inside the encoder consists of linear layers, a GELU (Gaussian Error Linear Unit) activation, and dropout.
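For readers who have not met GELU before, here is a small pure-Python sketch of its exact form, GELU(x) = x * Phi(x), where Phi is the standard normal cumulative distribution function. ReLU is shown next to it only for comparison; the helper names are illustrative.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={value:+.1f}  gelu={gelu(value):+.4f}  relu={relu(value):+.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs pass through with a small negative output instead of cutting them to zero.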
class Mlp(nn.Module):
    def __init__(self, emb_dim, mlp_dim, dropout_rate=0.):
        super(Mlp, self).__init__()
        self.fc1 = nn.Linear(emb_dim, mlp_dim)
        self.fc2 = nn.Linear(mlp_dim, emb_dim)
        self.act = nn.GELU()
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        # Expand to mlp_dim, apply GELU, then project back to emb_dim
        out = self.fc1(x)
        out = self.act(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.dropout(out)
        return out
Let’s print the flow of our model
model = VIT()
print(model)
I have performed a classification task using ViT. Please star the repository if you find it helpful, or open an issue if you find any problems.
The paper released three variants of the vision transformer, with configurations based on those used for BERT: Base, which has 12 layers; Large, which has 24 layers; and Huge, which has 32 layers and 632M parameters. A smaller input patch size yields a computationally more expensive model because, as the formula for the number of patches shows, smaller patches mean a longer input sequence.
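The effect of the patch size on cost can be sketched with a little arithmetic. The snippet assumes a square 224 x 224 input and counts one extra class token; the helper name `sequence_length` is illustrative. Since self-attention compares every token with every other token, the number of attention scores per head grows with the square of the sequence length.

```python
def sequence_length(image_size, patch_size):
    """Tokens fed to the encoder: one per patch plus the class token."""
    return (image_size // patch_size) ** 2 + 1


for patch_size in (32, 16, 14):
    n = sequence_length(224, patch_size)
    print(f"patch {patch_size:>2}: {n:>3} tokens, {n * n:>6} attention scores per head")
```

For a 224 x 224 image this gives 50, 197, and 257 tokens respectively, so moving from 32 x 32 to 16 x 16 patches roughly quadruples the sequence length.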
The vision transformer takes longer to train from scratch because it lacks the built-in understanding of image data (the inductive biases) that CNNs have, and hence it requires much more data. The model is pre-trained on a large dataset and then fine-tuned on smaller data. With a sufficiently large dataset, the vision transformer reaches higher accuracy with reduced training time.
During fine-tuning, the patch size should stay the same as the patch size used in pre-training. Fine-tuning the model on higher-resolution images can give better performance.
As the Google blog shows, vision transformers performed poorly when pre-trained on smaller amounts of data, and outperformed the state of the art (SOTA) when given sufficiently large training data.
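Keeping the patch size fixed while raising the resolution produces more patches, so the pre-trained position embeddings no longer match the sequence length. The ViT paper handles this with 2D interpolation of the pre-trained position embeddings. Below is a rough sketch of that idea; the function name `resize_pos_embed`, the bilinear mode, and the 224 to 384 resolution change are assumptions made for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate patch position embeddings from old_grid^2 to new_grid^2 patches.

    pos_embed has shape (1, 1 + old_grid * old_grid, dim); index 0 is the class token.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.size(-1)
    # Lay the patch embeddings out on their 2D grid: (1, dim, old_grid, old_grid)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    # Back to a sequence: (1, new_grid * new_grid, dim)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    # The class-token embedding is kept unchanged
    return torch.cat([cls_pos, patch_pos], dim=1)


# Pre-trained at 224 x 224 with 16 x 16 patches: a 14 x 14 grid plus the class token
pos_embed = torch.randn(1, 14 * 14 + 1, 128)
# Fine-tuning at 384 x 384 with the same patch size: a 24 x 24 grid
resized = resize_pos_embed(pos_embed, old_grid=14, new_grid=24)
print(resized.shape)  # torch.Size([1, 577, 128])
```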
Vit demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources. — Google blog
The vision transformer genuinely delivers better performance, beating state-of-the-art convolutional neural networks. Self-attention is again the core component of ViT, attending to the most important features of the image. ViTs can be used as a replacement for convolutional pipelines. ViT is a step toward scalable architectures in computer vision. It was a much-needed invention in the field, arriving alongside the growth in data and computing power.
Please clap on Medium, where this was originally published, if you found my writing helpful and insightful.