Lecture 13

Convolutional Neural Networks

Today’s Topics:

  1. What CNNs Can Do
  2. Image Classification
  3. CNN Basics
  4. Cross-Correlation vs. Convolution
  5. CNN Backprop
  6. CNNs in PyTorch

1. What CNNs Can Do

Image Classification

Definition: Assigning a single label to an entire image from a set of predefined categories.

Key Characteristics:

Examples:

Figure 1. Example of Image Classification


Object Detection

Definition: Identifying and localizing multiple objects within an image, including their positions and classifications.

Key Characteristics:

Examples:

Figure 2. Example of Object Detection


Object Segmentation

Definition: Pixel-level classification where each pixel in the image is assigned to a specific object category.

Key Characteristics:

Types of Segmentation:

  1. Semantic Segmentation:
    • Labels each pixel with a class
    • Doesn’t distinguish between object instances
    • Example: All “car” pixels get same label
  2. Instance Segmentation:
    • Identifies and separates individual object instances
    • Example: Differentiates between car1, car2, car3

Examples:

Figure 3. Example of Object Segmentation

2. Image Classification

Why Images Are Hard for Neural Networks

1. Visual Variations

Figure 4. Example of Image Classification Challenge


2. The Limitations of Fully-Connected Networks

Figure 5. Example of Limitation of Fully-Connected Networks

3. CNN Basics

Convolutional Neural Networks (LeCun 1989)

Core Idea: Parameter Sharing

Key Mechanism:

Weight Sharing in Kernels

How it works: A kernel (a small matrix of weights, e.g., $w, x, y, z$) acts as a sliding filter.

Figure 6. Example of Weight Sharing
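The sliding-filter idea can be sketched in pure Python (a minimal, hypothetical example; real layers use optimized tensor ops). The same four kernel weights are reused at every spatial position, which is exactly the parameter sharing described above:

```python
def slide_kernel(A, K):
    """'Valid' sliding of kernel K over 2D input A, as CNNs compute it
    (cross-correlation; the same weights are reused at every position)."""
    kh, kw = len(K), len(K[0])
    oh, ow = len(A) - kh + 1, len(A[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(K[u][v] * A[i + u][j + v]
                            for u in range(kh) for v in range(kw))
    return out

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[1, 0],
     [0, -1]]  # the four shared weights (w=1, x=0, y=0, z=-1)
print(slide_kernel(A, K))  # [[-4, -4], [-4, -4]]
```

Only the four kernel entries are trainable, no matter how large the input is.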


Alternative Visualization of Kernels

Concept: A “feature detector” (the filter or kernel) slides over the input image (like the “5”) to generate a “feature map”.

Figure 7. Example of Visualization of Kernels

Kernels for each channel

Multiple feature detectors (kernels) can be applied in parallel to the same input image.
Each kernel learns different weights, producing different feature maps that highlight distinct aspects of the image.

Figure 8. Kernels applied to each channel create multiple feature maps.

Convolutional Neural Networks [LeCun 1989]

LeCun and colleagues pioneered the use of Convolutional Neural Networks (CNNs) for digit recognition.
Their architecture, known as LeNet-5, combined convolutional feature extraction with traditional fully connected classifiers.

Key insights:

Network structure (LeNet-5 example):

Figure 9. LeNet-5 architecture (LeCun et al., 1989/1998). Convolutions and pooling serve as automatic feature extractors, followed by fully connected layers for classification.
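A PyTorch sketch of a LeNet-5-style network is below. This is an approximation using modern choices (ReLU, max pooling) rather than an exact reproduction of the 1998 layers (which used tanh and average pooling); the layer sizes follow the classic 6/16-channel, 120-84-10 structure:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet-5-style sketch: conv/pool feature extractor + FC classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
out = model(torch.zeros(1, 1, 28, 28))  # one 28x28 grayscale digit
print(out.shape)  # torch.Size([1, 10])
```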

“Pooling”: lossy compression

Pooling layers reduce the dimensionality of feature maps while retaining the most important information.
This introduces translation invariance, helps control overfitting, and lowers computational cost.

Types of pooling:

Key points:

Figure 10. Pooling operations: max pooling preserves the strongest activation, while mean pooling averages across the region.
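Both pooling types can be sketched in pure Python on non-overlapping 2x2 regions (stride 2); this is a minimal illustration, not how frameworks implement it:

```python
def pool2x2(A, mode="max"):
    """Non-overlapping 2x2 pooling (stride 2) over a 2D list."""
    out = []
    for i in range(0, len(A) - 1, 2):
        row = []
        for j in range(0, len(A[0]) - 1, 2):
            patch = [A[i][j], A[i][j + 1], A[i + 1][j], A[i + 1][j + 1]]
            # max pooling keeps the strongest activation;
            # mean pooling averages the region
            row.append(max(patch) if mode == "max" else sum(patch) / 4)
        out.append(row)
    return out

A = [[1, 3, 2, 0],
     [4, 2, 1, 1],
     [0, 1, 5, 6],
     [2, 2, 7, 8]]
print(pool2x2(A, "max"))   # [[4, 2], [2, 8]]
print(pool2x2(A, "mean"))  # [[2.5, 1.0], [1.25, 6.5]]
```

Either way the 4x4 map shrinks to 2x2: a lossy, 4-to-1 compression of each region.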

Main ideas of CNNs

Convolutional Neural Networks are built around three core principles:


4. Cross-Correlation vs. Convolution

Convolution: Adding two random variables

Convolution also appears in probability theory, when adding two independent random variables.

Continuous case: the density of the sum is \(P_{X+Y}(z) = \int P_X(x) P_Y(z-x) \, dx\), where $P_X$ and $P_Y$ denote the densities of $X$ and $Y$.

This integral is called the convolution of $P_X$ and $P_Y$:
\((P_X * P_Y)(z) = \int P_X(x) P_Y(z-x)\, dx\)

For discrete random variables, convolution is defined using a summation:

\[(P_X * P_Y)(z) = \sum_x P_X(x) P_Y(z-x)\]
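As a worked example (two fair dice, an assumption for illustration), the summation above can be computed directly over distributions stored as value-to-probability dicts:

```python
def convolve(px, py):
    """Discrete convolution: distribution of X + Y from those of X and Y."""
    pz = {}
    for x, p in px.items():
        for y, q in py.items():
            # every (x, y) pair with x + y = z contributes P_X(x) * P_Y(y)
            pz[x + y] = pz.get(x + y, 0.0) + p * q
    return pz

die = {k: 1 / 6 for k in range(1, 7)}  # one fair six-sided die
pz = convolve(die, die)                # distribution of the sum of two dice
print(pz[7])  # 6/36, the most likely sum
```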

More generally:

In CNNs, convolution is not about probability, but the same mathematical operation is reused.

Formally: \(Z[i,j] = \sum_{u=-k}^k \sum_{v=-k}^k K[u,v] \, A[i-u, j-v]\)

Which is written compactly as: \(Z[i,j] = K * A\)

Cross-Correlation vs. Convolution

In practice, most deep learning libraries (e.g., PyTorch, TensorFlow) actually implement cross-correlation rather than strict convolution.

Key difference:

Figure 11. Cross-correlation vs. convolution. Convolution involves flipping the kernel, while CNNs usually use cross-correlation.
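The kernel flip is the only difference between the two operations, which a pure-Python sketch makes concrete (true convolution indexes the input with $A[i-u, j-v]$, which is equivalent to flipping the kernel in both dimensions and then cross-correlating):

```python
def correlate(A, K):
    """'Valid' cross-correlation: Z[i,j] = sum_uv K[u,v] * A[i+u, j+v]."""
    kh, kw = len(K), len(K[0])
    return [[sum(K[u][v] * A[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(A[0]) - kw + 1)]
            for i in range(len(A) - kh + 1)]

def convolve(A, K):
    """True convolution = cross-correlation with K flipped in both axes."""
    flipped = [row[::-1] for row in K[::-1]]
    return correlate(A, flipped)

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
K = [[0, 1],
     [2, 3]]
print(correlate(A, K))  # [[25, 31], [43, 49]]
print(convolve(A, K))   # [[11, 17], [29, 35]]
```

Since the kernel weights are learned, the flip is irrelevant in practice, which is why libraries implement the cheaper cross-correlation and still call the layer "convolution".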

Sparse Connectivity in CNNs

Unlike fully-connected layers, CNNs use local connectivity:

Figure 12. CNNs use sparse connectivity (top) compared to dense fully-connected layers (bottom).

Receptive Fields

Although each neuron starts with a local receptive field, stacking multiple convolution layers expands the effective receptive field:

Figure 13. Receptive fields grow with depth: higher-layer neurons aggregate information from larger input regions.

Impact of convolutions on size

The side length \(O\) of the output feature map is \(O = \frac{W-K+2P}{S}+1\), where \(W\) is the input width, \(K\) the kernel width, \(P\) the padding, and \(S\) the stride (the division is floored when \(S\) does not divide it evenly).
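This formula can be checked with a small helper (a sketch; `out_size` is a hypothetical name, not a library function):

```python
def out_size(w, k, p=0, s=1):
    """Output side length: O = floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(out_size(28, 5, p=2, s=1))  # 28: padding 2 keeps 'same' size for a 5x5 kernel
print(out_size(32, 3, p=0, s=2))  # 15: stride 2 roughly halves the map
```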

Figure 14. Padding and stride.

Kernel dimensions and trainable parameters
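Under the standard setup (an assumption here, not stated in the lecture): a conv layer with $C_{in}$ input channels and $C_{out}$ kernels of size $K \times K$ has $C_{out} \cdot (C_{in} \cdot K \cdot K + 1)$ trainable parameters when biases are used, independent of the input's spatial size. A sketch:

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Trainable parameters of a KxK conv layer: each of the c_out kernels
    has c_in * k * k weights, plus one bias per kernel."""
    return c_out * (c_in * k * k + (1 if bias else 0))

print(conv2d_params(3, 16, 5))  # 16 * (3*5*5 + 1) = 1216
```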

CNNs and Translation/Rotation/Scale Invariance

5. CNN Backprop

Figure 15. Computational Graph.

Mean of gradient: because each kernel weight is shared across all spatial positions, backpropagation accumulates its gradient over every position where the kernel was applied; the total is the sum (equivalently, a scaled mean) of the per-position gradients.
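A tiny autograd demo of this accumulation (a minimal sketch with one shared scalar weight, not a full conv layer):

```python
import torch

# One weight w reused at three "positions"; autograd sums the
# per-position gradients dz/dw = a[0] + a[1] + a[2].
w = torch.tensor(2.0, requires_grad=True)
a = torch.tensor([1.0, 2.0, 3.0])
z = (w * a).sum()  # w participates in three multiplications
z.backward()
print(w.grad)  # tensor(6.) = 1 + 2 + 3
```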

6. CNNs in PyTorch
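A minimal end-to-end sketch in PyTorch (the architecture and the random data are assumptions standing in for a real dataset, not the lecture's example): a small CNN and one SGD training step.

```python
import torch
import torch.nn as nn

# Small CNN: one conv block, then a linear classifier over 10 classes.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # 3x32x32 -> 8x32x32
    nn.ReLU(),
    nn.MaxPool2d(2),                            # -> 8x16x16
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 3, 32, 32)   # batch of 4 random RGB "images"
y = torch.randint(0, 10, (4,))  # random class labels

logits = model(x)
loss = loss_fn(logits, y)
opt.zero_grad()
loss.backward()  # backprop through pooling, conv, and linear layers
opt.step()
print(logits.shape)  # torch.Size([4, 10])
```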