An Intuitive Explanation of Convolutional Neural Networks

What are Convolutional Neural Networks and why are they important?

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, in addition to powering vision in robots and self-driving cars.

Figure 1: Source [1]

In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.

Figure 2: Source [2]

ConvNets, therefore, are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience. The primary purpose of this blog post is to develop an understanding of how Convolutional Neural Networks work on images.

If you are new to neural networks in general, I would recommend reading this short tutorial on Multi Layer Perceptrons to get an idea about how they work, before proceeding. Multi Layer Perceptrons are referred to as “Fully Connected Layers” in this post.

The LeNet Architecture (1990s)

LeNet was one of the very first convolutional neural networks which helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988 [3]. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.

Below, we will develop an intuition of how the LeNet architecture learns to recognize images. Several new architectures have been proposed in recent years that improve on LeNet, but they all use its main concepts and are relatively easy to understand once you have a clear understanding of LeNet.

Figure 3: A simple ConvNet. Source [5]

The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).

There are four main operations in the ConvNet shown in Figure 3 above:

  1. Convolution
  2. Non Linearity (ReLU)
  3. Pooling or Sub Sampling
  4. Classification (Fully Connected Layer)

These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets. We will try to understand the intuition behind each of these operations below.

An Image is a matrix of pixel values

Essentially, every image can be represented as a matrix of pixel values.


Figure 4: Every image is a matrix of pixel values. Source [6]

Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels – red, green and blue – you can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.

A grayscale image, on the other hand, has just one channel. For the purpose of this post, we will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 – zero indicating black and 255 indicating white.
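To make this concrete, here is a minimal sketch of both representations in plain Python. The pixel values and the image size are made up for illustration; they do not come from any figure in this post.

```python
# A tiny 3 x 4 grayscale "image": a single 2d matrix with values in 0..255
# (0 = black, 255 = white). The numbers are invented for illustration.
gray = [
    [0,   64, 128, 255],
    [32,  96, 160, 224],
    [255, 128, 64,   0],
]

height = len(gray)
width = len(gray[0])

# A color image of the same size: three 2d matrices (channels) stacked,
# one each for red, green and blue. Here every pixel is pure red.
red = [[255] * width for _ in range(height)]
green = [[0] * width for _ in range(height)]
blue = [[0] * width for _ in range(height)]
rgb = [red, green, blue]

print(height, width)   # rows and columns of the grayscale matrix
print(len(rgb))        # number of channels in the color image
print(gray[0][3])      # the top-right pixel of the grayscale image
```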

The Convolution Step

ConvNets derive their name from the “convolution” operator. The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where pixel values are only 0 and 1):

Also, consider another 3 x 3 matrix as shown below:

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:

Figure 5: The Convolution operation. The output matrix is called Convolved Feature or Feature Map. Source [7]

Take a moment to understand how the computation above is being done. We slide the orange matrix over our original image (green) by 1 pixel at a time (this step size is called the ‘stride’) and, for every position, we compute the element-wise multiplication between the two matrices and add the products to get a single integer, which forms one element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image at each position.
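The sliding computation can be sketched in a few lines of plain Python. This is an illustration rather than how a deep learning library implements it. The 0/1 values below are assumed stand-ins for the green image and the orange matrix, since those figures are not reproduced in the text, and `convolve2d` is a name chosen here for the example.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map.

    At each position the overlapping values are multiplied element-wise
    and summed. As is usual for CNNs, the kernel is not flipped.
    """
    k = len(kernel)                      # assumes a square k x k kernel
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Assumed 5 x 5 binary image and 3 x 3 filter (illustrative values).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With a 5 x 5 input, a 3 x 3 filter and a stride of 1 there are only three valid positions in each direction, which is why the feature map is 3 x 3.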

In CNN terminology, the 3×3 matrix is called a ‘filter’ or ‘kernel’ or ‘feature detector’ and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map’. It is important to note that filters act as feature detectors on the original input image.

It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image. As an example, consider the following input image:


In the table below, we can see the effects of convolution of the above image with different filters. As shown, we can perform operations such as Edge Detection, Sharpen and Blur just by changing the numeric values of our filter matrix before the convolution operation [8] – this means that different filters can detect different features from an image, for example edges, curves etc. More such examples are available in Section 8.2.4 here.
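As a rough sketch of the same idea in code, the snippet below applies three 3 x 3 kernels from the Wikipedia article on image kernels [8] to a small made-up image that is dark on the left and bright on the right. The image values and the helper name `convolve2d` are assumptions for illustration only.

```python
def convolve2d(image, kernel):
    """Stride-1 convolution without padding (the kernel is not flipped)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5 x 5 grayscale image with a vertical edge between columns 2 and 3.
image = [[10, 10, 10, 50, 50] for _ in range(5)]

# 3 x 3 kernels as listed in the Wikipedia article on image kernels [8].
edge_detect = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
sharpen = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
box_blur = [[1 / 9] * 3 for _ in range(3)]

edges = convolve2d(image, edge_detect)
sharpened = convolve2d(image, sharpen)
blurred = convolve2d(image, box_blur)

print(edges[0])                            # [0, -120, 120]
print(sharpened[0])                        # [10, -30, 90]
print([round(v, 2) for v in blurred[0]])   # [10.0, 23.33, 36.67]
```

The edge detector outputs zero over the flat region and a large magnitude near the edge, the sharpen kernel exaggerates the jump, and the box blur smooths it out. An image editor would additionally clip the results back into the 0 to 255 range, which this sketch does not do.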

Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:


Figure 6: The Convolution Operation. Source [9]

A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. The convolution of another filter (with the green outline), over the same image gives a different feature map as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices as we have discussed above.

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as number of filters, filter size, architecture of the network etc. before the training process). The more filters we have, the more image features get extracted and, in general, the better our network becomes at recognizing patterns in unseen images.

The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed:

  • Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown in Figure 7, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. You can think of these three feature maps as stacked 2d matrices, so, the ‘depth’ of the feature map would be three.

Figure 7
  • Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.
  • Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to bordering elements of our input image matrix. A nice feature of zero padding is that it allows us to control the size of the feature maps. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. This has been explained clearly in [14].
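The effect of stride and zero-padding on the feature map size can be written as a small formula. For an input of width W, a filter of width F, stride S and P pixels of zero-padding on each side, the output width is (W − F + 2P) / S + 1, as given in the CS231n notes [4]. The helper below is a sketch of that formula; the function name and the choice to raise an error when the filter does not fit evenly are decisions made for this example (some libraries round down instead).

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Width (or height) of the feature map: (W - F + 2P) / S + 1."""
    span = input_size + 2 * padding - filter_size
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not fit the padded input evenly")
    return span // stride + 1


print(conv_output_size(5, 3))              # 3: the 5 x 5 image and 3 x 3 filter above
print(conv_output_size(5, 3, padding=1))   # 5: zero-padding keeps the size unchanged
print(conv_output_size(7, 3, stride=2))    # 3: a larger stride gives a smaller map
```

The depth of the output does not appear in this formula because it is simply the number of filters used.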

Introducing Non Linearity (ReLU)

An additional operation called ReLU has been used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:

Output = max(0, Input)

Figure 8: the ReLU operation

ReLU is an element wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. The purpose of ReLU is to introduce non-linearity in our ConvNet, since most of the real-world data we would want our ConvNet to learn would be non-linear (Convolution is a linear operation – element wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU).
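In code, ReLU is a one-liner applied to every element of the feature map. The feature map values below are invented for illustration.

```python
def relu(feature_map):
    """Replace every negative value in a 2d feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map containing negative values after convolution.
feature_map = [
    [3, -5, 0],
    [-1, 2, -8],
    [7, -2, 4],
]
rectified = relu(feature_map)
for row in rectified:
    print(row)
# [3, 0, 0]
# [0, 2, 0]
# [7, 0, 4]

# ReLU is non-linear: applying it to a sum is not the same as summing
# the individual results.
a, b = -3, 2
print(max(0, a + b), max(0, a) + max(0, b))   # 0 2
```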

The ReLU operation can be understood clearly from Figure 9 below. It shows the ReLU operation applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the ‘Rectified’ feature map.

Figure 9: ReLU operation. Source [10]

Other non linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.

The Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.

Figure 10 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.

Figure 10: Max Pooling. Source [4]

We slide our 2 x 2 window by 2 cells (also called ‘stride’) and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
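A minimal sketch of Max Pooling with a 2 x 2 window and a stride of 2 follows. The 4 x 4 input values are assumed for illustration and `max_pool` is a name chosen for this example.

```python
def max_pool(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, moving `stride` cells at a time."""
    out_rows = (len(feature_map) - size) // stride + 1
    out_cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Assumed 4 x 4 rectified feature map (illustrative values).
rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(rectified)
for row in pooled:
    print(row)
# [6, 8]
# [3, 4]
```

Swapping `max(window)` for `sum(window) / len(window)` would give Average Pooling, and `sum(window)` would give Sum Pooling.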

In the network shown in Figure 11, the pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).

Figure 11: Pooling applied to Rectified Feature Maps

Figure 12 shows the effect of Pooling on the Rectified Feature Map we received after the ReLU operation in Figure 9 above.

Figure 12: Pooling. Source [10]

The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling

  • makes the input representations (feature dimension) smaller and more manageable
  • reduces the number of parameters and computations in the network, thereby helping to control overfitting [4]
  • makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in input will not change the output of Pooling – since we take the maximum / average value in a local neighborhood).
  • helps us arrive at an almost scale invariant representation of our image (the exact term is “equivariant”). This is very powerful since we can detect objects in an image no matter where they are located (read [18] and [19] for details).

Story so far

Figure 13

So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in Figure 13, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.

Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].

The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.
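To see how these pieces chain together, here is a small sketch of a single Convolution, ReLU and Pooling stage with two filters, followed by flattening the pooled maps into one vector of the kind a Fully Connected layer takes as input. The 6 x 6 image, the two hand-written filters and all function names are assumptions for illustration. A real network would learn its filters and, as in Figure 13, would stack two such stages.

```python
def convolve2d(image, kernel):
    """Stride-1 convolution without padding (the kernel is not flipped)."""
    k = len(kernel)
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(len(image[0]) - k + 1)
        ]
        for r in range(len(image) - k + 1)
    ]


def relu(feature_map):
    """Replace negative values with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Maximum of each size x size window, moving `stride` cells at a time."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 6 x 6 image: dark left half, bright right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Two hand-written filters, so the feature map has depth two.
filters = [
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],    # responds to vertical edges
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],    # responds to horizontal edges
]

# Each filter gets its own feature map; ReLU and pooling are applied per map.
pooled_maps = [max_pool(relu(convolve2d(image, f))) for f in filters]

# Flatten all pooled maps into a single vector.
flattened = [v for fm in pooled_maps for row in fm for v in row]
print(pooled_maps)   # [[[3, 3], [3, 3]], [[0, 0], [0, 0]]]
print(flattened)     # [3, 3, 3, 3, 0, 0, 0, 0]
```

Only the vertical-edge filter responds, because the image contains a vertical edge and no horizontal one.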

Fully Connected Layer

The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.

The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer).

Figure 14: Fully Connected Layer. Each node is connected to every other node in the adjacent layer

Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].

The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
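A short sketch of the Softmax function is shown below. The raw scores are hypothetical values for the four classes (dog, cat, boat, bird), and subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Squash arbitrary real-valued scores into probabilities that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for dog, cat, boat, bird.
scores = [1.0, 2.0, 5.0, 0.5]
probs = softmax(scores)

print([round(p, 3) for p in probs])
print(sum(probs))   # 1.0, up to floating-point rounding
```

The largest score (the third one, for boat) receives the largest probability, and every output lies between zero and one.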

Putting it all together – Training using Backpropagation

As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while the Fully Connected layer acts as a classifier.

Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for Boat class and 0 for other three classes, i.e.

  • Input Image = Boat
  • Target Vector = [0, 0, 1, 0]

Figure 15: Training the ConvNet

The overall training process of the Convolution Network may be summarized as below:

  • Step 1: We initialize all filters and parameters / weights with random values
  • Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
    • Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
    • Since weights are randomly assigned for the first training example, output probabilities are also random.
  • Step 3: Calculate the total error at the output layer (summation over all 4 classes)
    • Total Error = ∑ ½ (target probability – output probability)²
  • Step 4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
    • The weights are adjusted in proportion to their contribution to the total error.
    • When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
    • This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
    • Parameters like number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during the training process – only the values of the filter matrix and connection weights get updated.
  • Step 5: Repeat steps 2-4 with all images in the training set.
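The error calculation in Step 3 and the update in Step 4 can be sketched numerically. The code below first evaluates the Total Error for the two output vectors mentioned above, then runs plain gradient descent on a toy model. As a simplifying assumption, the toy model treats the four raw scores fed into softmax as the only "weights"; a real ConvNet backpropagates the same kind of gradient through every layer. The learning rate and the number of steps are arbitrary choices for this example.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def total_error(target, output):
    """Total Error = sum over classes of 1/2 * (target - output)^2."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))


target = [0, 0, 1, 0]                               # the input image is a boat
print(total_error(target, [0.2, 0.4, 0.1, 0.3]))    # approximately 0.55
print(total_error(target, [0.1, 0.1, 0.7, 0.1]))    # approximately 0.06


def error_gradient(scores, target):
    """Gradient of the total error with respect to the raw scores.

    Uses d(error)/d(p_i) = p_i - t_i and the softmax derivative
    d(p_i)/d(z_j) = p_i * (1 - p_j) if i == j, else -p_i * p_j.
    """
    p = softmax(scores)
    diff = [pi - ti for pi, ti in zip(p, target)]
    weighted = sum(d * pi for d, pi in zip(diff, p))
    return [pi * (d - weighted) for pi, d in zip(p, diff)]


# Toy model: start from scores whose softmax is [0.2, 0.4, 0.1, 0.3].
scores = [math.log(p) for p in [0.2, 0.4, 0.1, 0.3]]
learning_rate = 0.5                                 # arbitrary choice
errors = []
for step in range(200):
    errors.append(total_error(target, softmax(scores)))
    gradient = error_gradient(scores, target)
    scores = [s - learning_rate * g for s, g in zip(scores, gradient)]

final_probs = softmax(scores)
print(errors[0], errors[-1])                        # the error shrinks over the steps
print([round(p, 3) for p in final_probs])           # the boat probability has grown
```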

The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.

When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.

Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [4] and [12] for a mathematical formulation and thorough understanding.

Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. Please note, however, that these operations can be repeated any number of times in a single ConvNet. In fact, some of the best performing ConvNets today have tens of Convolution and Pooling layers! Also, it is not necessary to have a Pooling layer after every Convolutional Layer. As can be seen in Figure 16 below, we can have multiple Convolution + ReLU operations in succession before having a Pooling operation. Also notice how each layer of the ConvNet is visualized in Figure 16.


Figure 16: Source [4]

Visualizing Convolutional Neural Networks

In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14]. This is demonstrated in Figure 17 below – these features were learnt using a Convolutional Deep Belief Network and the figure is included here just for demonstrating the idea (this is only an example: real life convolution filters may detect objects that have no meaning to humans).

Figure 17: Learned features from a Convolutional Deep Belief Network. Source [21]

Adam Harley created amazing visualizations of a Convolutional Neural Network trained on the MNIST Database of handwritten digits [13]. I highly recommend playing around with it to understand details of how a CNN works.

We will see below how the network works for an input ‘8’. Note that the visualization in Figure 18 does not show the ReLU operation separately.


Figure 18: Visualizing a ConvNet trained on handwritten digits. Source [13]

The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image. As seen, using six different filters produces a feature map of depth six.

Convolutional Layer 1 is followed by Pooling Layer 1 that does 2 × 2 max pooling (with stride 2) separately over the six feature maps in Convolution Layer 1. You can move your mouse pointer over any pixel in the Pooling Layer and observe the 2 x 2 grid it forms in the previous Convolution Layer (demonstrated in Figure 19). You’ll notice that the pixel having the maximum value (the brightest one) in the 2 x 2 grid makes it to the Pooling layer.

Figure 19: Visualizing the Pooling Operation. Source [13]

Pooling Layer 1 is followed by sixteen 5 × 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2 that does 2 × 2 max pooling (with stride 2). These two layers use the same concepts as described above.

We then have three fully-connected (FC) layers. There are:

  • 120 neurons in the first FC layer
  • 100 neurons in the second FC layer
  • 10 neurons in the third FC layer corresponding to the 10 digits – also called the Output layer
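The layer sizes in this network can be traced with a short calculation. The sketch below assumes that no zero-padding is used in any layer, which the description above does not state explicitly, and the helper name is chosen for this example.

```python
def output_size(input_size, window, stride=1, padding=0):
    """Spatial size after a convolution or pooling window: (W - F + 2P) / S + 1."""
    return (input_size + 2 * padding - window) // stride + 1


size, depth = 32, 1                 # 32 x 32 grayscale input
shapes = [("input", size, depth)]

# (name, window size, stride, number of filters or None for pooling)
layers = [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, None),
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, None),
]
for name, window, stride, filters in layers:
    size = output_size(size, window, stride)
    if filters is not None:         # pooling keeps the depth unchanged
        depth = filters
    shapes.append((name, size, depth))

for name, s, d in shapes:
    print(f"{name}: {s} x {s} x {d}")
# input: 32 x 32 x 1
# conv1: 28 x 28 x 6
# pool1: 14 x 14 x 6
# conv2: 10 x 10 x 16
# pool2: 5 x 5 x 16

flattened = size * size * depth
print(flattened)                    # 400 values feed the first FC layer
```

Under that assumption, the 400 pooled values are connected to the 120 neurons of the first FC layer, which feed 100 and then 10 neurons.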

Notice how in Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the 2nd Fully Connected layer (hence the name Fully Connected).

Also, note how the only bright node in the Output Layer corresponds to ‘8’ – this means that the network correctly classifies our handwritten digit (brighter node denotes that the output from it is higher, i.e. 8 has the highest probability among all other digits).


Figure 20: Visualizing the Fully Connected Layers. Source [13]

The 3d version of the same visualization is available here.

Other ConvNet Architectures

Convolutional Neural Networks have been around since the early 1990s. We discussed the LeNet above, which was one of the very first convolutional neural networks. Some other influential architectures are listed below [3] [4].

  • LeNet (1990s): Already covered in this article.
  • 1990s to 2012: In the years from the late 1990s to the early 2010s, convolutional neural networks were in incubation. As more and more data and computing power became available, tasks that convolutional neural networks could tackle became more and more interesting.
  • AlexNet (2012) – In 2012, Alex Krizhevsky (and others) released AlexNet which was a deeper and much wider version of the LeNet and won by a large margin the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. It was a significant breakthrough with respect to the previous approaches and the current widespread application of CNNs can be attributed to this work.
  • ZF Net (2013) – The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters.
  • GoogLeNet (2014) – The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
  • VGGNet (2014) – The runner-up in ILSVRC 2014 was the network that became known as the VGGNet. Its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance.
  • ResNets (2015) – The Residual Network developed by Kaiming He (and others) was the winner of ILSVRC 2015. As of May 2016, ResNets were by far the state-of-the-art Convolutional Neural Network models and the default choice for using ConvNets in practice.
  • DenseNet (August 2016) – Recently published by Gao Huang (and others), the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Check out the Torch implementation here.


In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms. There are several details I have oversimplified / skipped, but hopefully this post gave you some intuition around how they work.

This post was originally inspired by Understanding Convolutional Neural Networks for NLP by Denny Britz (which I would recommend reading) and a number of explanations here are based on that post. For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.

All images and animations used in this post belong to their respective authors as listed in References section below.


  1. karpathy/neuraltalk2: Efficient Image Captioning code in Torch, Examples
  2. Shaoqing Ren, et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015, arXiv:1506.01497 
  3. Neural Network Architectures, Eugenio Culurciello’s blog
  4. CS231n Convolutional Neural Networks for Visual Recognition, Stanford
  5. Clarifai / Technology
  6. Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
  7. Feature extraction using convolution, Stanford
  8. Wikipedia article on Kernel (image processing) 
  9. Deep Learning Methods for Vision, CVPR 2012 Tutorial 
  10. Neural Networks by Rob Fergus, Machine Learning Summer School 2015
  11. What do the fully connected layers do in CNNs? 
  12. Convolutional Neural Networks, Andrew Gibiansky 
  13. A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015
  14. Understanding Convolutional Neural Networks for NLP
  15. Backpropagation in Convolutional Neural Networks
  16. A Beginner’s Guide To Understanding Convolutional Neural Networks
  17. Vincent Dumoulin, et al, “A guide to convolution arithmetic for deep learning”, 2016, arXiv:1603.07285
  18. What is the difference between deep learning and usual machine learning?
  19. How is a convolutional neural network able to learn invariant features?
  20. A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
  21. Honglak Lee, et al, “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations”



