12 Neural Networks
12.1 Introduction
Neural networks are functions loosely modeled on the brain. In the brain, we have billions of neurons that connect to one another. Each neuron can be thought of as a node in a graph, and the edges are the connections from one neuron to the next (Figure 12.1). The edges are directed; electrical signals propagate in just one direction along the wires in the brain.
Outgoing edges are called axons and incoming edges are called dendrites. A neuron fires, sending a pulse down its axon, when the incoming pulses, from the dendrites, exceed a threshold.
12.2 The Perceptron: A Simple Model of a Single Neuron
Let’s consider a neuron, shaded in gray, that has four inputs and one output (Figure 12.2).
A simple model for this neuron is the perceptron. A perceptron is a neuron with output

$$y = H\left(\sum_i w_i x_i + b\right)$$

In words, we take a weighted sum of the inputs and, if that sum exceeds a threshold (here 0), the neuron fires (outputs a 1). The function $H$ that performs this thresholding is the Heaviside step function. Mathematically,

$$H(z) = \begin{cases} 1, & z > 0 \\ 0, & \text{otherwise.} \end{cases}$$
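To make this concrete, here is a minimal sketch of a perceptron in Python/NumPy; the particular weights, bias, and input are illustrative values, not from the text:

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (output 1) if the weighted sum of the inputs exceeds 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# A neuron with four inputs, as in Figure 12.2 (values chosen arbitrarily).
w = np.array([0.5, -1.0, 2.0, 0.25])   # one weight per incoming edge
b = -0.5                               # bias shifts the firing threshold
x = np.array([1.0, 0.0, 1.0, 1.0])     # incoming activity
print(perceptron(x, w, b))             # -> 1, since 2.75 - 0.5 > 0
```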
12.2.1 The Perceptron as a Classifier
People got excited about perceptrons in the late 1950s because it was shown that they can learn to classify data [1]. Let’s see how that works. We will consider a perceptron with two inputs, $x_1$ and $x_2$, whose output is $y = H(w_1 x_1 + w_2 x_2 + b)$. Notice that the set of points where $w_1 x_1 + w_2 x_2 + b = 0$ forms a line in the input space. The perceptron outputs 1 on one side of this line and 0 on the other, so it acts as a linear binary classifier, with the line as its decision boundary.
12.2.2 Learning with a Perceptron
So a perceptron acts like a classifier, but how can we use it to learn? The idea is that, given a dataset of labeled points, we optimize the weights and bias so that the perceptron’s outputs match the labels as well as possible. In Figure 12.4, this optimization process corresponds to shifting and rotating the decision boundary until you find a line that separates the data labeled as 0 from the data labeled as 1.
You might be wondering, what’s the exact optimization algorithm that will find the best line that separates the classes? The original perceptron paper proposed one particular algorithm, the “perceptron learning algorithm.” This was an optimizer tailored to the specific structure of the perceptron. Older papers on neural nets are full of specific learning rules for specific architectures: the delta rule, the Rescorla-Wagner model, and so forth [2]. Nowadays we rarely use these special-purpose algorithms. Instead, we use general-purpose optimizers like gradient descent (for differentiable objectives) or zeroth-order methods (for nondifferentiable objectives). Chapter 14 will cover the backpropagation algorithm, a general-purpose gradient-based optimizer that applies to essentially all neural networks we will see in this book (though note that, because the perceptron objective contains a nondifferentiable threshold function, for it we would instead opt for a zeroth-order optimizer).
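For concreteness, here is a sketch of the classic perceptron learning rule just mentioned; the learning rate, epoch count, and toy dataset are illustrative assumptions:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron learning rule: for each misclassified point, nudge the
    weights and bias in the direction that corrects the error."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            w += lr * (yi - pred) * xi   # no update when pred == yi
            b += lr * (yi - pred)
    return w, b

# Toy linearly separable data: label 1 only when both inputs are on.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)   # finds a separating line
```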
12.3 Multilayer Perceptrons
Perceptrons can solve linearly separable binary classification problems, but they are otherwise rather limited. For one, they only produce a single output. What if we want multiple outputs? We can achieve this by adding edges that fan out after the perceptron (Figure 12.5).
This network maps an input layer of data, $\mathbf{x}$, to an output layer, $\mathbf{y}$, via a single intermediate neuron, called a hidden unit.
More commonly we might have many hidden units in a stack, which we call a hidden layer (Figure 12.6).
How many layers does this net have? Some texts will say two, counting the layers of weighted connections; others will say three, counting the layers of neurons (input, hidden, and output). Conventions vary across the literature.
Because this network has multiple layers of neurons, and because each neuron in this net acts as a perceptron, we call it a multilayer perceptron (MLP). The equation for this MLP is:

$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$
$$\mathbf{y} = \sigma(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are matrices of weights, $\mathbf{b}_1$ and $\mathbf{b}_2$ are vectors of biases, and $\sigma$ is an activation function applied elementwise.
In general, MLPs can be constructed with any number of layers following this pattern: linear layer, activation function, linear layer, activation function, and so on.
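A minimal sketch of this pattern in NumPy; the layer sizes and the hard-threshold activation are illustrative choices:

```python
import numpy as np

def step(z):
    """Heaviside activation, applied elementwise (each neuron thresholds)."""
    return (z > 0).astype(float)

def mlp(x, W1, b1, W2, b2):
    """Linear layer -> activation -> linear layer -> activation."""
    h = step(W1 @ x + b1)    # hidden layer activations
    y = step(W2 @ h + b2)    # output layer activations
    return y

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # 2 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)   # 4 hidden -> 3 outputs
print(mlp(np.array([0.5, -1.0]), W1, b1, W2, b2))
```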
The activation function $\sigma$ is a pointwise nonlinearity: it is applied independently to each element of its input. In the perceptron, the activation function was a hard threshold; other choices are possible, as we will see later in this chapter.
Beyond MLPs, this kind of sequence (linear layer, pointwise nonlinearity, linear layer, pointwise nonlinearity, and so on) is the prototypical motif in almost all neural networks, including most we will see later in this book.
12.4 Activations Versus Parameters
When working with deep nets it’s useful to distinguish activations and parameters. The activations are the values that the neurons take on when the network processes a given input.
Conversely, parameters are the weights and biases of the network. These are the variables being learned. Both activations and parameters are tensors of variables.
Often we think of a layer as a function of both activations and parameters, $\mathbf{x}_{\text{out}} = f(\mathbf{x}_{\text{in}}, \theta)$.
That is, each layer takes the activations from the previous layer, as well as parameters of the current layer as input, and produces activations of the next layer. Varying either the input activations or the input parameters will affect the output of the layer. From this perspective, anything we can do with parameters, we can do with activations instead, and vice versa, and that is the basis for a lot of applications and tricks. For example, while normally we learn the values of the parameters, we could instead hold the parameters fixed and learn the values of the activations that achieve some objective. In fact, this is what is done in many methods such as network prompting, adversarial attacks, and network visualization, which we will see in more detail in later chapters.
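The symmetry between the two arguments can be seen directly in code. The following sketch, with an illustrative identity-weight layer and squared-error objective, first varies the activations in the usual way, then holds the parameters fixed while optimizing the activations:

```python
import numpy as np

def layer(x, theta):
    """A layer is a function of both activations x and parameters theta."""
    W, b = theta
    return np.maximum(W @ x + b, 0)   # linear map followed by relu

theta = (np.eye(2), np.zeros(2))      # parameters held fixed

# Usual view: vary the data/activations, keep parameters fixed.
print(layer(np.array([1.0, -2.0]), theta))

# Flipped view: hold parameters fixed and *learn the activations* that
# produce a target output -- the idea behind prompting, adversarial
# attacks, and visualization. Here via hand-derived gradient descent.
target = np.array([3.0, 1.0])
x = np.array([0.5, 0.5])
W, b = theta
for _ in range(100):
    pre = W @ x + b
    grad = W.T @ (2 * (np.maximum(pre, 0) - target) * (pre > 0))
    x -= 0.1 * grad                   # update activations, not parameters
print(x)                              # approaches [3, 1]
```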
12.4.1 Fast Activations and Slow Parameters
So what’s different about activations versus parameters? One way to think about it is that activations are fast functions of a datapoint: they are the result of a few layers of processing this datapoint. Parameters are also functions of the data (they are learned from data) but they are slow functions of datasets: the parameters are arrived at via an optimization procedure over a whole dataset. So, both activations and parameters are statistics of the data, that is, information extracted about the data that organizes or summarizes it. The parameters are a kind of metasummary since they specify a functional transformation that produces activations from data, and activations themselves are a summary of the data. Figure 12.7 shows how this looks.
12.5 Deep Nets
Deep nets are neural nets that stack the linear-nonlinear motif many times (Figure 12.8). Each layer is a function. Therefore, a deep net is a composition of many functions:

$$\mathbf{y} = f_L(f_{L-1}(\cdots f_2(f_1(\mathbf{x}))))$$

The functions $f_1, \ldots, f_L$ are the layers of the net. These functions are parameterized by weights and biases, $\theta_1, \ldots, \theta_L$, which are the variables we learn from data.
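A sketch of this composition in NumPy; the depth, the layer widths, and the choice of relu activations are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def deep_net(x, params):
    """Apply a stack of linear-relu layers: f_L(...f_2(f_1(x)))."""
    for W, b in params:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
widths = [2, 8, 8, 8, 1]   # input dim, three hidden widths, output dim
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
y = deep_net(np.array([0.3, -0.7]), params)
```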
Deep nets are powerful because they can perform nonlinear mappings. In fact, a deep net with sufficiently many neurons can fit almost any desired function arbitrarily closely, a property we will investigate further in Section 12.5.2.
12.5.1 Deep Nets Can Perform Nonlinear Classification
Let’s return to our binary classification problem shown previously, but now let’s make the two classes not linearly separable. Our new dataset is shown in Figure 12.9.
Here there is no line that can separate the zeros from the ones. Nonetheless, we will demonstrate a multilayer network that can solve this problem. The trick is to just add more layers! We will use the two-layer MLP shown in Figure 12.10.
Consider using settings for the weights and biases chosen so that the network solves this problem. Here we have introduced a new pointwise nonlinearity, the rectified linear unit (relu),

$$\text{relu}(z) = \max(z, 0),$$

which is like a graded version of a threshold function, and has the advantage that it yields nonzero gradients over half its domain, thus being amenable to gradient-based learning.
We visualize the values that the neurons take on, as a function of the input, in Figure 12.11.
As can be seen in the rightmost plot, at the output the two classes have been pulled apart: the network assigns distinct output values to the points labeled 0 and the points labeled 1, solving a classification problem that no single line could solve in the input space.
12.5.2 Deep Nets Are Universal Approximators
Not only can deep nets perform nonlinear classification, they can in principle perform any continuous input-output mapping. The universal approximation theorem [3] states that this is true even for a network with just a single hidden layer. The caveat is that the number of neurons in the hidden layer may have to be very large in order to fit complicated functions.
Technically, this theorem only holds for continuous functions on compact subsets of $\mathbb{R}^n$.
To get an intuition for why this is true, we will consider the case of approximating an arbitrary function from $\mathbb{R}$ to $\mathbb{R}$. The idea is to approximate the function as a weighted sum of localized bumps:

$$f(x) \approx \sum_k a_k b_k(x) \quad\quad (12.3)$$

where $b_k$ is a bump of unit height centered at location $k$, and $a_k$ scales its height. As an example, in Figure 12.12 we show a curve (blue line) approximated in this way. As the width of the network increases, we can afford more, narrower bumps, and the approximation becomes increasingly accurate. While we only consider scalar functions here, the same argument can be extended to functions with multidimensional inputs and outputs.
Next we will show that a relu-MLP can represent Equation 12.3. The weighted sum over bumps is simply a linear layer; what remains is to construct the bumps themselves from relu neurons. It turns out the construction is rather simple: each bump can be built from just four relu neurons, that is, by a linear-relu-linear network. In Figure 12.13, we show an example of constructing a bump in this way.

Here we show how a neural net can represent a function as a sum of basis functions. This idea is also foundational in signal processing, where signals are often represented as a sum of sine waves (Chapter 16), boxes (Figure 21.18), or trapezoids (Figure 21.19).
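Here is a sketch in NumPy of a four-relu bump and a sum-of-bumps approximation in the spirit of Equation 12.3; the trapezoid construction and the sine target are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def bump(x, center, width):
    """A unit-height trapezoidal bump built from four relu neurons."""
    a, b = center - width, center - width / 2   # rising edge
    c, d = center + width / 2, center + width   # falling edge
    s = 2.0 / width                             # slope reaching height 1
    return s * (relu(x - a) - relu(x - b) - relu(x - c) + relu(x - d))

# Approximate a target function as a weighted sum of bumps.
xs = np.linspace(0, 1, 500)
centers = np.linspace(0, 1, 20)
width = centers[1] - centers[0]
target = np.sin(2 * np.pi * xs)
approx = sum(np.sin(2 * np.pi * c) * bump(xs, c, width) for c in centers)
```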
Putting everything together, we have a linear-relu-linear network for each bump, followed by a linear layer that sums up all the bumps. Two linear layers in sequence can be collapsed into a single linear layer, and hence the full function can be approximated, to arbitrary precision, by a single linear-relu-linear net.
Most literature refers to such a net as having a single hidden layer, using the convention that we don’t count pre- and postactivation neurons as separate layers.
Notice that in this approximation, we need four relu neurons for each bump we are modeling. Therefore, if we want to approximate a very bumpy function, say one with $N$ bumps, we will need a hidden layer on the order of $4N$ neurons wide: the width of the net must grow with the complexity of the function being fit.
12.5.3 Depth versus Width
Above we saw that if you have a hidden layer with enough neurons, you can approximate essentially any continuous function. The number of neurons in a layer is called the width of that layer.
If different layers have different numbers of neurons, then we may specify the width per layer. Here we will assume all layers have the same width and simply speak of the width of the network.
So, as we increase the width of a network, we can fit ever more complicated functions. What if we instead increase the depth of a network, that is, its number of layers? It turns out that this can also be an effective way to increase the capacity of the net, but its effect is a bit different than increasing width.
Interestingly, it is sometimes the case that deep nets require far fewer parameters to fit data than wide nets. Evidence for this statement comes mostly from empiricism, where researchers have found that deeper nets just work better in practice on many popular problems. However, there is also the beginning of a mathematical theory of when and why this can happen. The basic idea of this theory is to establish that there are certain classes of functions that can be represented with a polynomial number of neurons in a depth-$K$ network, but that require an exponential number of neurons in any shallower network.
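To make the comparison concrete, here is a small sketch that counts the parameters of an MLP from its layer widths (each linear layer contributes weights plus biases); the specific architectures compared are illustrative:

```python
def mlp_param_count(widths):
    """Count weights and biases of an MLP with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(widths[:-1], widths[1:]))

# Same input/output sizes, different shapes:
wide = [2, 4096, 1]            # shallow and wide
deep = [2] + [64] * 8 + [1]    # deep and narrow
print(mlp_param_count(wide))   # 16385
print(mlp_param_count(deep))   # 29377
```

Whether the deep or the wide net fits a given function with fewer parameters depends on the function; the theory above identifies function classes where depth wins by an exponential margin.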
12.6 Deep Learning: Learning with Neural Nets
Using the formalism we defined in Chapter 9, learning consists of using an optimizer to find a function in a hypothesis space, that maximizes an objective. From this perspective, neural nets are simply a special kind of hypothesis space (and a particular parameterization of that hypothesis space). Deep learning refers to learning algorithms that use this parameterized hypothesis space.
Deep learning also typically involves using gradient-based optimization to search the hypothesis space for the best fit to the data. We will investigate this approach in detail in Chapter 14, where we will learn about the backpropagation algorithm for gradient-based learning with neural nets. However, it is certainly possible to optimize neural nets with other methods, including zeroth-order optimizers like evolution strategies (Section 10.6.1) [5].
One intriguing alternative to backpropagation is called Hebbian learning [6]. Backpropagation is a top-down learning algorithm, where errors incurred at the output (top) of the net are propagated backward to inform earlier layers how to update their weights and biases to minimize the loss, which is a form of learning based on feedback. Hebbian learning, in contrast, is a bottom-up approach, where neurons wire up just based on the feedforward pattern of activity in the net. The canonical learning rule in Hebbian methods is Hebb’s rule: “fire together, wire together.” That is, we increase the weight of the connection between two neurons whenever the two neurons are active at the same time. Although this learning rule is not explicitly minimizing a loss function, it has been shown to lead to effective neural representations. For example, Hebb-like rules can learn infomax representations, which capture, in the neural activations, as much information as possible about the input signal [7]. Similar rules lead to networks that act like memory banks [8]. Hebbian learning is also of interest because it is considered to be more biologically plausible than backpropagation. This is because Hebb’s rule can be computed locally—each neuron strengthens and weakens its weights based just on the activity of adjacent neurons—whereas backpropagation requires global coordination throughout the neural network. It is currently unknown how this global coordination can be achieved in biological brains.
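A minimal sketch of Hebb’s rule in NumPy; the learning rate, the small weight decay added to keep weights bounded, and the random input stream are illustrative assumptions:

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.01, decay=0.01):
    """Hebb's rule: strengthen the connection between two neurons when
    both are active together ("fire together, wire together")."""
    W += lr * np.outer(post, pre)   # co-activity increases the weight
    W *= (1.0 - decay)              # mild decay keeps weights bounded
    return W

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 5))   # 5 input neurons -> 3 output neurons
for _ in range(1000):
    pre = rng.random(5)             # feedforward input activity
    post = W @ pre                  # resulting postsynaptic activity
    W = hebbian_update(W, pre, post)
```

Notice that each weight update uses only the activities of the two neurons it connects: the rule is local, with no global error signal propagated from an output loss.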
12.6.1 Data Structures for Deep Learning: Tensors and Batches
The main data structure that we will encounter in deep learning is the tensor, which is just a multidimensional array. This may seem simple, but it’s important to get comfortable with the conventions of tensor processing.
In general, everything in deep learning is represented as tensors—the input is one tensor, the activations are tensors, the weights are tensors, the outputs are tensors. If you have data that is not natively represented as a tensor, the first step, before feeding it to a deep net, is to convert it into a tensor format. Most often we use tensors of real numbers; that is, the elements of the tensor are in $\mathbb{R}$.
Suppose we have a dataset of $N$ images, each with $C$ color channels, height $H$, and width $W$. We can represent the whole dataset as a single tensor of shape $N \times C \times H \times W$.
The activations in the network are also tensors. For the MLP networks we have seen so far, the activation tensors have shape $B \times K$, where $B$ is the number of datapoints processed together and $K$ is the number of neurons in the layer.
One other important concept is batch processing. Normally, we don’t process one image at a time through a neural net. Instead we run a batch of images all at once, and they are processed in parallel. A batch of $B$ datapoints sampled from the training data can be denoted as $\{\mathbf{x}^{(i)}\}_{i=1}^{B}$, stacked into a single tensor whose first dimension is the batch dimension.
The weights and biases of the net are also usually represented as tensors. For a linear layer mapping $K_{\text{in}}$ neurons to $K_{\text{out}}$ neurons, the weights are a tensor of shape $K_{\text{out}} \times K_{\text{in}}$ and the biases are a tensor of shape $K_{\text{out}}$.
As an example, in Figure 12.14 below, we visualize all the tensors associated with a batch of three datapoints being processed by the MLP from Figure 12.10. For this network, the input is not a set of images but instead a set of vectors. We stack the batch into matrices with one row per datapoint, and denote these matrices with capital letters: the capital letters are the batches of datapoints and activations corresponding to the lowercase names of datapoints and hidden units in Figure 12.10.
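A sketch of batched processing in NumPy; the layer sizes and the batch of three two-dimensional datapoints are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))          # batch of 3 datapoints, 2 dims each
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

H = relu(X @ W1.T + b1)              # shape (3, 4): batch x hidden units
Y = relu(H @ W2.T + b2)              # shape (3, 1): batch x outputs
print(X.shape, H.shape, Y.shape)     # all rows processed in parallel
```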
This example shows the basic concept of working with tensors and batches for one-dimensional data, but, in vision, most of the time we will be working with higher-dimensional tensors. For image data we typically use four-dimensional tensors: batch size $\times$ channels $\times$ height $\times$ width.
This is closer to the actual ND tensors vision systems work with, and many concepts can be adequately captured just by thinking in 3D. We will see some examples in later chapters.
12.7 Catalog of Layers
Below, we use the color blue to denote parameters and the color red to denote data/activations (inputs and outputs to each layer).
We color equations in this way only in this chapter, to make clear the roles of different variables. However, be on the lookout for these colors in figures later in the book. We will often draw activations in red and parameters in blue.
12.7.1 Linear Layers
Linear layers are the workhorses of deep nets. Almost all parameters of the network are contained in these layers; we call these parameters the weights and biases. We have already introduced linear layers previously. They look like this:

$$\mathbf{x}_{\text{out}} = \mathbf{W}\mathbf{x}_{\text{in}} + \mathbf{b}$$

where $\mathbf{W}$ is the matrix of weights and $\mathbf{b}$ is the vector of biases.
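A one-line sketch in NumPy (the shapes and values are illustrative):

```python
import numpy as np

def linear(x_in, W, b):
    """Linear layer: weighted sum of the inputs plus a bias."""
    return W @ x_in + b

W = np.array([[1.0, -1.0], [0.5, 2.0]])   # parameters (weights)
b = np.array([0.0, 1.0])                  # parameters (biases)
print(linear(np.array([2.0, 3.0]), W, b))  # -> [-1.  8.]
```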
12.7.2 Activation Layers
If a net only contained linear layers then it could only compute linear functions. This is because the composition of linear functions is itself linear: stacking linear layers, no matter how many, yields just another linear layer. Activation layers fix this by applying a pointwise nonlinearity $\sigma$ to each neuron’s value.
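A sketch of a few common pointwise activations in NumPy (the particular selection shown is illustrative):

```python
import numpy as np

def step(z):     # hard threshold, as in the perceptron
    return (z > 0).astype(float)

def sigmoid(z):  # smooth squashing to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):     # zero for negative inputs, identity for positive
    return np.maximum(z, 0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, relu):
    print(f.__name__, f(z))   # each applies elementwise (pointwise)
```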
12.7.3 Normalization Layers
Normalization layers add another kind of nonlinearity. Instead of being a pointwise nonlinearity, like in activation layers, they are nonlinearities that perturb each neuron based on the collective behavior of a set of neurons. Let’s start with the example of batch normalization (batchnorm) [9].
Batchnorm standardizes each neural activation with respect to its mean and variance over a batch of datapoints. Mathematically,

$$z_{\text{out}} = \gamma \left( \frac{z_{\text{in}} - \mu}{\sigma} \right) + \beta$$

where $\mu$ and $\sigma$ are the mean and standard deviation of that neuron’s activation computed over the datapoints in the batch, and $\gamma$ and $\beta$ are learned parameters that rescale and shift the standardized value.
Recall from statistics that the standard score of a draw of a random variable is how many standard deviations it differs from the mean:

$$z = \frac{x - \mu}{\sigma}$$

Batchnorm applies exactly this standardization to each neuron, using the batch to estimate $\mu$ and $\sigma$.
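A sketch of batchnorm for a batch of activation vectors, in NumPy (the epsilon for numerical stability is an assumed practical detail):

```python
import numpy as np

def batchnorm(Z, gamma, beta, eps=1e-5):
    """Standardize each neuron (column) over the batch (rows),
    then rescale and shift with learned parameters gamma and beta."""
    mu = Z.mean(axis=0)                  # per-neuron mean over the batch
    sigma = Z.std(axis=0)                # per-neuron std over the batch
    return gamma * (Z - mu) / (sigma + eps) + beta

Z = np.random.default_rng(0).normal(size=(8, 4))   # batch of 8, 4 neurons
out = batchnorm(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6))   # each neuron now has ~zero mean
```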
There are numerous other normalization layers that have been defined over the years. Two more that we will highlight are layer normalization (layernorm) and group normalization (groupnorm).
Layernorm is similar except that it standardizes the vector of input activations using statistics computed over the elements of that vector:

$$\mathbf{z}_{\text{out}} = \gamma \left( \frac{\mathbf{z}_{\text{in}} - \mu}{\sigma} \right) + \beta$$

where $\mu$ and $\sigma$ are now the mean and standard deviation of the elements of $\mathbf{z}_{\text{in}}$ for a single datapoint.
Notice that layernorm, like batchnorm, is a nonlinearity whose effect on each neuron depends on the collective behavior of a set of neurons. Notice also that layernorm looks quite similar to batchnorm. Both standardize activations but do so with respect to different statistics: layernorm computes a mean and variance over the elements of a datapoint, whereas batchnorm computes a mean and variance over the datapoints in a batch, separately for each neuron.
One issue with batchnorm is that it requires processing a batch of datapoints all at once, and introduces a dependency between each datapoint in the batch. This violates the principle that datapoints should be processed independently and identically (iid), and this can lead to bugs if your method relies on the iid assumption. Layernorm does not have this problem and does indeed process each datapoint in an iid fashion.
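A sketch of layernorm in NumPy, highlighting that the only essential change from the batchnorm sketch above is the axis over which the statistics are computed (the stabilizing epsilon is again an assumed practical detail):

```python
import numpy as np

def layernorm(Z, gamma, beta, eps=1e-5):
    """Standardize each datapoint (row) over its own elements,
    independently of the other datapoints in the batch."""
    mu = Z.mean(axis=1, keepdims=True)     # per-datapoint mean
    sigma = Z.std(axis=1, keepdims=True)   # per-datapoint std
    return gamma * (Z - mu) / (sigma + eps) + beta

Z = np.random.default_rng(0).normal(size=(8, 4))
out = layernorm(Z, gamma=np.ones(4), beta=np.zeros(4))
# Each row is standardized on its own -- no cross-batch dependency.
print(out.mean(axis=1).round(6))
```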
12.7.4 Output Layers
The last piece we need is an output layer that maps a neural representation—a high-dimensional array of floating point numbers—to a desired output representation. In classification problems, the desired output is a class label, and the most common output operation is the softmax function, which we have already encountered in previous chapters. In image synthesis problems, the desired output is typically a 3D array with dimensions channels $\times$ height $\times$ width, that is, an image.
In the softmax definition, we have added a temperature parameter $T$:

$$\text{softmax}(\mathbf{z}; T)_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

The temperature controls how peaked the output distribution is: as $T \to 0$, the softmax approaches a one-hot argmax, and as $T \to \infty$, it approaches a uniform distribution.
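A sketch in NumPy, with the standard max-subtraction trick for numerical stability (an implementation detail assumed here):

```python
import numpy as np

def softmax(z, T=1.0):
    """Map scores to a probability distribution; T controls peakedness."""
    z = np.asarray(z) / T
    z = z - z.max()            # stability: softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores, T=1.0))    # moderately peaked
print(softmax(scores, T=0.1))    # nearly one-hot (argmax-like)
print(softmax(scores, T=10.0))   # nearly uniform
```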
The output layer is the input to the loss function, thus completing our specification of the deep learning problem. However, to use the outputs in practice requires translating them into actual pictures, or actions, or decisions. For a classification problem, this might mean taking the argmax of the softmax distribution, so that we can report a single class. For image prediction problems, it might mean rounding each output to an integral value since common image formats represent RGB values as integers.
There are of course many other output transformations you can try. Often, they will be very problem specific since they depend on the structure of the output space you are targeting.
12.8 Why Are Neural Networks a Good Architecture?
As you will soon learn, almost all modern computer vision algorithms involve deep nets in one way or another. So you may be wondering: why are deep nets such a good architecture? We will highlight here five reasons:
They are high capacity (big enough nets are universal approximators).
They are differentiable (the parameters can be optimized via gradient descent).
They have good inductive biases (neural architectures reflect real structure in the world).
They run efficiently on parallel hardware.
They build abstractions.
Let’s look at reasons 1-3 in light of the discussion of searching for truth from Chapter 11 (see Figure 11.4). Reason 1 relates to the size of the hypothesis space. The hypothesis space can be made very big if we use a large neural network with many parameters. So we can usually be sure that our true solution (or a close approximation to it) does indeed lie in the space spanned by the neural net architecture. Reason 2 says that searching for the solution within this space is relatively easy, since we can use gradients to direct us toward ever better fits to the data. Reason 3 is one we will only come to appreciate later in the book as we investigate more advanced neural net architectures. It turns out that these architectures impose constraints and regularizers that bias our search toward solutions that capture true structure about the visual world, and this leads to learned solutions that generalize.
Reason 4 is equally important to the first three: it says we can do all of this efficiently because most computations can be parallelized on modern hardware; in particular, both matrix multiplies (linear layers) and pointwise operations (e.g., relu layers) are easy to parallelize on graphics processing units (GPUs). Further, most operations are applied to image batches, where each item in the batch can be sent to a different parallel compute node.
Reason 5 is perhaps the most subtle. It is related to the layered structure of neural nets. Layer by layer, neural nets build up increasingly abstracted representations of the input data, and these abstractions tend to be increasingly useful. This argument is not easy to appreciate at first glance, but it will be a major theme of the subsequent chapters in this book, especially those on representation learning. For now, just keep in mind that the internal representations that are built up layer by layer in deep nets are useful and important beyond just the net’s overall input-output behavior.
12.9 Concluding Remarks
Neural nets are a very simple and useful parameterized hypothesis space. They are universal approximators that can be trained via gradient descent and run on parallel hardware. Deep nets are especially effective in computer vision; as we will soon see, deep architectures can be constructed that specifically reflect structure in the visual world, making visual processing highly efficient and performant. Artificial neural nets also have connections to the real neural nets in our brains. This connection runs deeper than merely sharing a name: the deep net architectures we will see later in this book (e.g., convolutional networks, transformers) are our best current models of computation in animal brains, in the sense that they explain brain data better than any competing models [12]. This is a class of models truly worth paying attention to.