Lecture 04

Single-layer networks

Announcements


Outline

  1. Perceptrons
  2. Geometric intuition
  3. Notational conventions for single-layer nets
  4. A fully-connected (linear) layer in PyTorch

1. Perceptrons

Rosenblatt’s Perceptron

First, we’ll talk about Rosenblatt’s perceptron, which is seen as the foundation of today’s artificial neural networks.

What Rosenblatt proposed is, in essence, "a learning rule for a computational/mathematical neuron model". This contains two parts:

  1. A computational/mathematical neuron model,
  2. A learning rule.

Back in 1957, Rosenblatt not only defined the mathematical model of an artificial neuron (weighted sum + activation function), but also formulated a learning rule that enables the model to learn autonomously from data.

Our brains contain a vast number of interconnected neurons, and artificial neural networks were invented to mimic this biological structure.

Figure 1. Biological Neuron.
Figure 2. Rosenblatt’s Perceptron.

The perceptron is the most basic unit of neural networks. Even so, there are many different variants of it. In today's lecture, "perceptron" will specifically mean the classic Rosenblatt perceptron, which means we'll use the threshold function as our activation function.

Figure 3. Overview of Perceptrons.

While we’ll be somewhat loose about the terminology here, we are building foundations that will lead us to multi-layer perceptrons (MLPs).

Terminology

Before we start, let's define the terminology used here.

In general, e.g., when we talk about logistic regression, multilayer nets, and so on, our terminology follows the convention below:

For some special cases, we may use more specific terminology:

Mathematical Formulation

A perceptron takes multiple inputs $x_1, x_2, \ldots, x_n$ and produces a single binary output. The mathematical formulation is:

y = \sigma \left( \sum_{i=1}^{n} w_i x_i + b \right)

where $y$ is the binary output, $x_i$ are the inputs, $w_i$ are the corresponding weights, $b$ is the bias term, and $\sigma$ is the activation function (here, the threshold function).

Activation Functions

The activation function $\sigma$ determines the output of the perceptron. For the classical perceptron, we use a step function:

\sigma(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}
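
To make this concrete, here is a minimal sketch of a single perceptron forward pass in NumPy; the input, weight, and bias values below are made up purely for illustration:

import numpy as np

def step(z):
    # threshold activation: 1 if z > 0, else 0
    return np.where(z > 0, 1, 0)

def perceptron_forward(x, w, b):
    # weighted sum plus bias, then threshold
    return step(np.dot(w, x) + b)

# example values (arbitrary, for illustration only)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.1])
b = -0.1
print(perceptron_forward(x, w, b))  # prints 0, since the weighted sum (-0.2) is not positive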

Handling Bias

In the formulation above, the bias and the weights are kept separate, which is not very convenient when we actually do the calculation. A common approach to handling the bias in neural networks is to fold it into the input vector $X$ by adding an extra dimension with a constant value of 1:

\begin{aligned} \tilde{X} &= [x_1, x_2, \ldots, x_n, 1]^T \\ \tilde{W} &= [w_1, w_2, \ldots, w_n, b]^T \end{aligned}

This allows us to write the perceptron output simply as:

y = \sigma(\tilde{W}^T \tilde{X})
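
As a small sketch (reusing the arbitrary values from above), the bias trick can be checked numerically by appending a constant 1 to the input and the bias to the weight vector:

import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, 0.1])
b = -0.1

# augmented vectors: append 1 to the input and b to the weights
x_tilde = np.append(x, 1.0)
w_tilde = np.append(w, b)

# both formulations give the same pre-activation value
print(np.dot(w, x) + b)          # -0.2
print(np.dot(w_tilde, x_tilde))  # -0.2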

More about the Activation Functions

While the classical Rosenblatt perceptron uses the threshold function, modern neural networks employ various activation functions depending on the application and historical period:

Threshold Function (Perceptron, 1950+)

\sigma(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}

This is the original activation function used in Rosenblatt’s perceptron. It produces binary outputs but is not differentiable, making it unsuitable for gradient-based learning.

Sigmoid Function (before 2000)

\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid function was widely used before the 2000s. It is smooth and differentiable, which enables gradient-based learning via backpropagation. However, it suffers from the vanishing gradient problem in deep networks.

ReLU Function (popular since CNNs)

\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}

ReLU (Rectified Linear Unit) became popular with the rise of CNNs and deep learning. It’s computationally efficient and helps mitigate vanishing gradient problems.
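
As a quick sketch, the three activations discussed so far can be evaluated side by side in PyTorch (the sample inputs below are arbitrary):

import torch

z = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

step = (z > 0).float()        # threshold: 1 if z > 0, else 0
sigmoid = torch.sigmoid(z)    # smooth, differentiable, output in (0, 1)
relu = torch.relu(z)          # max(0, z)

print(step)     # tensor([0., 0., 0., 1., 1.])
print(sigmoid)  # values strictly between 0 and 1
print(relu)     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])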

Many Variants of ReLU

Modern deep learning employs numerous ReLU variants, including:

  1. Leaky ReLU, which uses a small nonzero slope for negative inputs
  2. GeLU, a smooth, Gaussian-based approximation of ReLU
  3. Swish (SiLU), defined as $z \cdot \mathrm{sigmoid}(z)$

Each variant addresses specific issues, such as dead neurons (Leaky ReLU), or provides a smoother approximation to ReLU (GeLU, Swish).
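
The variants named above are all available in torch.nn.functional; a brief sketch (sample inputs arbitrary):

import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.5, 2.0])

print(F.relu(z))                             # max(0, z)
print(F.leaky_relu(z, negative_slope=0.01))  # small slope for z < 0 helps avoid dead neurons
print(F.gelu(z))                             # smooth approximation of ReLU
print(F.silu(z))                             # Swish with beta = 1 (SiLU): z * sigmoid(z)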

Perceptron Learning Algorithm

The perceptron learning algorithm is a simple and elegant method for training a perceptron to classify linearly separable data. Here’s the formal algorithm:

Algorithm (Pseudocode)

Let the training dataset be:

D = \left\{(\mathbf{x}^{[1]}, y^{[1]}), (\mathbf{x}^{[2]}, y^{[2]}), \ldots, (\mathbf{x}^{[n]}, y^{[n]})\right\} \in (\mathbb{R}^m \times \{0,1\})^n

Algorithm Steps:

  1. Initialize $\mathbf{w} := \mathbf{0}^m$ (assume weight includes bias)

  2. For every training epoch:

    1. For every $(\mathbf{x}^{[i]}, y^{[i]}) \in D$:
      1. $\hat{y}^{[i]} := \sigma(\mathbf{x}^{[i]T} \mathbf{w})$ ← Only 0 or 1
      2. $err := (y^{[i]} - \hat{y}^{[i]})$ ← Only -1, 0, or 1
      3. $\mathbf{w} := \mathbf{w} + err \times \mathbf{x}^{[i]}$
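
A minimal NumPy sketch of this algorithm follows; the function name and the toy dataset are my own, chosen only for illustration:

import numpy as np

def train_perceptron(X, y, num_epochs=10):
    # X: (n, m) inputs, y: (n,) labels in {0, 1}
    n, m = X.shape
    X_aug = np.hstack([X, np.ones((n, 1))])  # bias trick: append a constant 1
    w = np.zeros(m + 1)                      # weights (bias included)

    for epoch in range(num_epochs):
        for x_i, y_i in zip(X_aug, y):
            y_hat = 1 if np.dot(x_i, w) > 0 else 0  # threshold activation
            err = y_i - y_hat                       # -1, 0, or +1
            w += err * x_i                          # perceptron update rule
    return w

# toy linearly separable data (for illustration only)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
print(train_perceptron(X, y))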

Algorithm (Detailed Logic)

The learning rule can be broken down into three cases:

  1. Correct prediction ($\hat{y}^{[i]} = y^{[i]}$): the error is 0 and the weights stay unchanged.
  2. False negative ($y^{[i]} = 1$, $\hat{y}^{[i]} = 0$): the error is +1, so we add $\mathbf{x}^{[i]}$ to $\mathbf{w}$, which increases $\mathbf{x}^{[i]T}\mathbf{w}$.
  3. False positive ($y^{[i]} = 0$, $\hat{y}^{[i]} = 1$): the error is -1, so we subtract $\mathbf{x}^{[i]}$ from $\mathbf{w}$, which decreases $\mathbf{x}^{[i]T}\mathbf{w}$.


2. Geometric Intuition

It might seem strange at first that we can simply add inputs to or subtract them from the weights during learning. The key insight is that for linear classifiers like perceptrons, the weight vector lives in the same space as the inputs, so adding or subtracting an example directly tilts the decision boundary toward or away from that example.

Decision Boundaries

A perceptron creates a linear decision boundary in the input space. For a 2D input space with features $x_1$ and $x_2$, the decision boundary is defined by:

w_1 x_1 + w_2 x_2 + b = 0

This is a straight line that separates the input space into two regions:

  1. The region where $w_1 x_1 + w_2 x_2 + b > 0$, which is assigned to class 1
  2. The region where $w_1 x_1 + w_2 x_2 + b \leq 0$, which is assigned to class 0

Weight Vector Interpretation

The weight vector $\mathbf{w} = [w_1, w_2]^T$ is perpendicular to the decision boundary. The direction of $\mathbf{w}$ points toward the positive class, and its magnitude affects the “confidence” of the classification.
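
A tiny sketch (arbitrary weights and points) showing how the sign of $w_1 x_1 + w_2 x_2 + b$ determines the predicted class:

import numpy as np

w = np.array([1.0, 2.0])   # weight vector, perpendicular to the decision boundary
b = -1.0                   # bias shifts the boundary away from the origin

points = np.array([[2.0, 1.0], [0.0, 0.0], [-1.0, 0.5]])
scores = points @ w + b            # signed score for each point
labels = (scores > 0).astype(int)  # which side of the boundary each point lies on

print(scores)  # [ 3. -1. -1.]
print(labels)  # [1 0 0]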

Updating Weight Vector

When a positive example x (green) is misclassified by the current weight vector w (red)—i.e.,

\mathbf{w}^{\top}\mathbf{x} + b < 0

—we update the model by moving the weights toward the example:

\mathbf{w}_{\text{new}}=\mathbf{w}_{\text{old}}+\eta\,\mathbf{x},\qquad b_{\text{new}}=b_{\text{old}}+\eta .

Geometrically, this adds a step in the direction of x, rotating and shifting the decision boundary so that x moves to the correct side. The blue arrow shows the new weight vector after the update; the dashed line indicates the margin change.

Figure 4. Updating Weight Vector.
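
A small numeric check (the values are my own, made up for illustration) that a single update moves a misclassified positive example toward the correct side:

import numpy as np

eta = 1.0
w = np.array([-1.0, 0.5])   # current weights
b = 0.0
x = np.array([2.0, 1.0])    # positive example (true label 1)

print(w @ x + b)            # -1.5 < 0, so x is currently misclassified

# perceptron update for a misclassified positive example
w_new = w + eta * x
b_new = b + eta

print(w_new @ x + b_new)    # 4.5 > 0: the score increased by eta * (||x||^2 + 1)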

Perceptron Limitations

A single-layer perceptron can only represent a linear decision boundary, so it can only solve problems that are linearly separable; more complex problems require the multi-layer networks we build toward later in the course.

Parallel Histories

Perceptrons and Deep Learning


3. Notational Conventions for Neural Networks

Standard Notation

When working with neural networks, we typically use the following conventions:

  1. Scalars are written as plain letters (e.g., $n$, $m$, $b$)
  2. Vectors are written as bold lowercase letters (e.g., $\mathbf{x}$, $\mathbf{w}$)
  3. Matrices are written as bold uppercase letters (e.g., $\mathbf{X}$, $\mathbf{W}$)
  4. A bracketed superscript indexes a training example, e.g., $\mathbf{x}^{[i]}$ is the $i$-th example in the dataset

A Fully-Connected Layer

With this notation in place, we can describe a fully-connected layer:

Figure 5. A Fully-Connected Layer with Matrix.

Matrix Form

For a batch of inputs $\mathbf{X} \in \mathbb{R}^{n \times d}$ (where $n$ is the batch size):

\begin{aligned} \mathbf{Z} &= \mathbf{X}\mathbf{W}^T + \mathbf{b} \\ \mathbf{A} &= \sigma(\mathbf{Z}) \end{aligned}
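
A short PyTorch sketch of the batched computation; the sizes n = 4, d = 3, and the output dimension h = 2 are arbitrary:

import torch

n, d, h = 4, 3, 2        # batch size, input features, output units
X = torch.randn(n, d)    # batch of inputs
W = torch.randn(h, d)    # one weight row per output unit
b = torch.randn(h)       # one bias per output unit

Z = X @ W.T + b          # pre-activations, shape (n, h)
A = torch.sigmoid(Z)     # elementwise activation

print(Z.shape, A.shape)  # torch.Size([4, 2]) torch.Size([4, 2])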

Intuition for the $W\mathbf{x}$ Notation

W\begin{bmatrix}x\\y\end{bmatrix} = \begin{bmatrix}a & b\\ c & d\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix} = x\begin{bmatrix}a\\c\end{bmatrix} + y\begin{bmatrix}b\\d\end{bmatrix}

More generally, writing $W$ in terms of its columns:

W=[\,\mathbf{w}_1\ \cdots\ \mathbf{w}_n\,],\quad \mathbf{x}=(x_1,\ldots,x_n)^{\top} \ \Rightarrow\ W\mathbf{x}=\sum_{i=1}^{n} x_i\,\mathbf{w}_i.
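
This column view can be verified numerically; a brief sketch with an arbitrary 2x2 matrix:

import torch

W = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])   # columns are [1, 3] and [2, 4]
x = torch.tensor([5.0, 6.0])

direct = W @ x                             # ordinary matrix-vector product
columns = x[0] * W[:, 0] + x[1] * W[:, 1]  # same result as a combination of columns

print(direct)   # tensor([17., 39.])
print(columns)  # tensor([17., 39.])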

4. A Fully Connected (Linear) Layer in PyTorch

Figure 6. A Fully-Connected Layer.

Implementation

In PyTorch, a fully connected layer is implemented using nn.Linear:

import torch
import torch.nn as nn

# Create a linear layer: input_size=784, output_size=10
linear_layer = nn.Linear(784, 10)

# Forward pass
x = torch.randn(32, 784)  # batch_size=32, input_size=784
output = linear_layer(x)  # output shape: (32, 10)

Understanding the Parameters

# Access weights and bias
print(f"Weight shape: {linear_layer.weight.shape}")  # (10, 784)
print(f"Bias shape: {linear_layer.bias.shape}")      # (10,)

# The operation performed is: output = x @ weight.T + bias
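
Continuing the snippet above, this relationship can be checked directly (a small verification that is not part of the original code):

# Verify that nn.Linear computes x @ weight.T + bias
manual = x @ linear_layer.weight.T + linear_layer.bias
print(torch.allclose(output, manual))  # True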

Building a Simple Perceptron

class Perceptron(nn.Module):
    def __init__(self, input_size):
        super(Perceptron, self).__init__()
        self.linear = nn.Linear(input_size, 1)
        self.activation = nn.Sigmoid()  # differentiable stand-in; for a hard step, use torch.heaviside (there is no nn.Heaviside module)
    
    def forward(self, x):
        z = self.linear(x)
        return self.activation(z)

# Usage
perceptron = Perceptron(input_size=2)
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
output = perceptron(x)
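
Since the sigmoid outputs values in (0, 1), a hard 0/1 prediction in the spirit of the classic perceptron can be obtained by thresholding (a small addition, not part of the original snippet):

# Hard 0/1 predictions, as the classic perceptron would produce
predictions = (output > 0.5).float()
print(predictions.shape)  # torch.Size([2, 1])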

Key Takeaways

  1. Historical Significance: Rosenblatt’s perceptron laid the foundation for modern neural networks
  2. Linear Separation: Single-layer perceptrons can only solve linearly separable problems
  3. Geometric Interpretation: The weight vector defines the decision boundary orientation
  4. Mathematical Foundation: Understanding matrix operations is crucial for neural network implementation
  5. PyTorch Implementation: nn.Linear provides an efficient implementation of fully connected layers

Next Steps

In the next lecture, we will explore:


Note: This lecture provides the fundamental building blocks that we'll use throughout the course. Make sure you understand the geometric intuition and the mathematical formulation before moving on to more complex architectures.