A Layman’s Guide to Understanding Convolutional Neural Networks

Tahira
4 min read · Jan 29, 2021

Artificial Intelligence has witnessed monumental growth in the last decade. Instead of being labelled as ‘the future’, it is now very much a part of our daily lives in small yet distinct ways. A few notable examples include Face ID unlock, filters on Snapchat and other social apps, and even the Autopilot mode on Tesla cars.

Thanks to A.I. advancements, we have an ever-innovating field of deep learning. As we peel back the layers of the deep learning onion, we find ourselves in a realm of numerous sub-fields derived from the fundamental idea of artificial intelligence.

One such interdisciplinary field is Computer Vision — which is what I’ll be talking about in this article.

The primary goal of computer vision is to give machines a high-level understanding of digital images and videos. It takes loose inspiration from the way the human brain makes sense of what the eyes see.

This is where Convolutional Neural Networks (also known as CNN or ConvNet) come in — the cutting edge brainchild of computer vision.

A CNN aims to recognize objects in a way similar to how our brain does. A computer perceives an object very differently from how we do, but the end result is largely similar (minus the emotional value in a machine’s case, of course).

What makes CNNs so good is how effectively they borrow from the human neural process and the underlying biology of vision.

Humans are exceptionally visually perceptive, and much of the learning we do relies heavily on visual cues, aka imagery. A CNN takes the same biological concept and applies it, except that it does so with numbers and data.

Every image we see on our screens is stored as a 3-dimensional array of pixel values. These values span the image’s height, its width and its number of channels (three for RGB: red, green and blue).
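To make this concrete, here is a minimal sketch in Python using NumPy. The tiny 4 x 6 image and its values are made up purely for illustration; a real photo would have far more pixels.

```python
import numpy as np

# A hypothetical image: 4 pixels tall, 6 pixels wide, 3 colour channels (RGB).
height, width, channels = 4, 6, 3
image = np.zeros((height, width, channels), dtype=np.uint8)  # all black

# Paint the top-left pixel pure red: R=255, G=0, B=0.
image[0, 0] = [255, 0, 0]

print(image.shape)            # (4, 6, 3)
print(image[0, 0].tolist())   # [255, 0, 0]
print(image.size)             # 72 numbers in total (4 * 6 * 3)
```

Every number in that array is something the network can read, which is why even a small colour image already gives a CNN plenty to work with.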

A CNN reads these pixels and extracts information from each image. When given a large dataset, the network is able to learn useful patterns on its own.

It works through datasets made up of large volumes of digital imagery. As it sees more samples, its ability to recognize what is in an image generally improves. To understand how the CNN model works, we need to understand its multi-faceted architecture.

The CNN Architecture:

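A basic CNN can be described in terms of three main layers. These layers are the building blocks of a convolutional neural network. The first is the input-side layer, where the convolution happens. The second is a hidden layer that applies the activation function. The third is the output layer, which produces the final classification. Real networks usually stack many convolution and activation layers before the output.

Let’s start with the first layer.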

The Input Layer

This is the top layer of a CNN structure. It is the core building block upon which the rest of the deep learning model rests. This layer performs the ‘convolution’ part of the network, which is why it is usually called the convolutional layer. It is also described as the mathematical layer, because its job is to pick out patterns in the numbers that make up the image.

Let’s take two-dimensional image data. The layer begins by placing a filter, also known as a kernel (and sometimes loosely called a neuron), over the top left corner of the image. The filter is itself a small array of numbers, called weights. The patch of the image that the filter currently covers is called the receptive field. The layer multiplies each weight in the filter by the pixel value beneath it and adds the results together to get a single number. That number is a summary of that one patch of the image.

In the same fashion, the filter slides to the right across the image, one unit at a time, and then moves down to the next row. In this way, the convolutional layer reads the entire image. It records one number for each position the filter visits and stores these numbers in an array, known as a feature map or activation map. The inspiration for this whole operation is the biology of the human brain: the filter acts as the computer’s very own visual cortex.
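The sliding operation can be sketched in a few lines of Python with NumPy. This is a simplified illustration, not how production libraries implement it: it assumes a single-channel image, moves the filter one pixel at a time, and adds no padding around the edges. The `convolve2d` function, the toy image and the edge-detecting filter are all made up for this example. (Strictly speaking this is cross-correlation, which is what deep learning libraries compute under the name convolution.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            # The patch of pixels currently under the filter.
            patch = image[row:row + kh, col:col + kw]
            # Multiply weight by pixel, then add everything up: one number.
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

# Toy 4x4 image: dark (0) on the left half, bright (1) on the right half.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Hypothetical 2x2 filter that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every row is [0, 2, 0]
```

The filter produces a large value only in the middle column, exactly where the image switches from dark to bright. In a trained CNN the filter weights are not written by hand like this; the network learns them from data.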

The Second Layer

The second layer of the CNN structure is the Rectified Linear Unit layer, or ReLU. This is where the activation of the CNN neurons takes place. ReLU is a piecewise linear function: it passes its input through unchanged when the input is positive, and outputs zero otherwise. Because it is so simple, it is cheap to compute and straightforward to train, and more often than not it yields very good results. The downside is that a unit can only learn while its input is positive. If a large weight update pushes a unit to the point where its input is negative for every example, it will output zero regardless of the input provided and stop learning, which also slows down the learning of the layers connected to it. This is often called the dying ReLU problem.
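As a rough sketch, ReLU fits in one line of NumPy. The `relu` and `relu_gradient` helper names and the sample values below are only for illustration. The gradient shows why a unit that only ever receives negative inputs gets stuck: nothing flows back through it, so its weights stop changing.

```python
import numpy as np

def relu(x):
    """Keep positive values as they are, replace everything else with zero."""
    return np.maximum(0, x)

def relu_gradient(x):
    """Slope of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(float)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

print(relu(values).tolist())           # [0.0, 0.0, 0.0, 1.5, 3.0]
print(relu_gradient(values).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```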

The Third Layer

The third layer is more widely known as the Fully Connected Layer. This layer comes at the end of the CNN structure. It takes the features produced by the earlier layers as its input and outputs an N-dimensional vector, where N is the number of classes the program has to choose from. Each entry in the vector is a score for one class.
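Here is a minimal sketch of that final step in Python with NumPy, under a few assumptions that are mine rather than part of any particular network: there are 8 input features and N = 3 classes, the weights are random instead of learned, and a softmax function (a common choice for this job) turns the raw scores into probabilities.

```python
import numpy as np

def fully_connected(features, weights, bias):
    """Every output score is a weighted sum of every input feature."""
    return weights @ features + bias

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
num_features, num_classes = 8, 3  # hypothetical sizes, N = 3

features = rng.random(num_features)                     # stand-in for earlier layers' output
weights = rng.normal(size=(num_classes, num_features))  # random here, learned in practice
bias = np.zeros(num_classes)

scores = fully_connected(features, weights, bias)
probabilities = softmax(scores)

print(probabilities.shape)                                # (3,)
print("predicted class:", int(np.argmax(probabilities)))
```

The class with the highest probability is the network’s answer. With random weights the prediction is meaningless, of course; training is what makes those weights useful.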

Take, for instance, an image of a human face. If the program is looking at photos of human faces, it will look for prominent features such as two eyes, a nose, lips or ears. The fully connected layer weighs up those high-level features and works out which class they point to most strongly. It then generates output classifying the image as that of a human face.

On the face of it, CNNs might seem like a novel concept, but they are already a large part of our technology-rich lives. These neural networks keep opening up new opportunities. Tech giants such as Facebook, Google, Snapchat and Amazon use neural networks in their businesses, be it for visual search, image tagging or user behaviour analytics. CNNs have numerous applications, but they are especially well known for facial recognition. Their accuracy is already high, and the window for error keeps getting smaller. All of this suggests that CNNs are poised to play a central part in how we live.
