9 Introduction to Learning
9.1 Introduction
The goal of learning is to extract lessons from past experience in order to solve future problems. Typically, this involves searching for an algorithm that solves past instances of the problem. This algorithm can then be applied to future instances of the problem.
Past and future do not necessarily refer to calendar dates; instead they refer to what the learner has previously seen and what it will see next.
Because learning is itself an algorithm, it can be understood as a meta-algorithm: an algorithm that outputs algorithms (Figure 9.1).
Learning usually consists of two phases: the training phase, where we search for an algorithm that performs well on past instances of the problem (training data), and the testing phase, where we deploy our learned algorithm to solve new instances of the problem.
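To make these two phases concrete, here is a minimal sketch in Python; the nearest-neighbor learner and all names here are illustrative choices, not a method prescribed by this chapter:

```python
# A learner is a meta-algorithm: it takes training data and returns an
# algorithm (here, an ordinary Python function). This toy learner memorizes
# the examples and answers a query with the output of the nearest stored input.

def train(examples):
    """Training phase: construct a mapping that fits past {input, output} pairs."""
    def f(x):
        nearest = min(examples, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return f

f = train([(1, 2), (2, 4), (3, 6)])  # training phase: learn from past instances
print(f(2.2))                        # testing phase: prints 4, the output of the nearest input
```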
9.2 Learning from Examples
Learning from examples is also called supervised learning.
Imagine you find an ancient mathematics text, with marvelous looking proofs, but there is a symbol you do not recognize, “$\oplus$.” Fortunately, the text contains several examples showing the result of applying “$\oplus$” to pairs of numbers.
What do you think “$\oplus$” would produce when applied to a new pair of numbers?
It may not seem like it, but you just performed learning! You learned what “$\oplus$” means from examples of its behavior.
Nice job!
It turns out, we can learn almost anything from examples.
Some things are not learnable from examples, such as noncomputable functions. A classic noncomputable function is the halting function: it takes as input a program and outputs a 1 if the program will eventually finish running, and a 0 if it will run forever. It is noncomputable because no algorithm can solve this task in finite time for all programs. However, it might still be possible to learn a good approximation to it.
Remember that we are learning an algorithm, i.e., a computable mapping from inputs to outputs. A formal definition of an example, in this context, is an $\{\text{input}, \text{output}\}$ pair. The examples you were given for “$\oplus$” were exactly of this form: each paired an input (two numbers) with its corresponding output.
This kind of learning, where you observe example input-output behavior and infer a functional mapping that explains this behavior, is called supervised learning.
Another name for this kind of learning is fitting a model to data.
We were able to model the behavior of “$\oplus$” by fitting a function to the example data.
You probably came up with something like “it fills in the missing pixels.” That’s exactly right, but it’s sweeping a lot of details under the rug. Remember, we want to learn an algorithm, a procedure that is completely unambiguous. How exactly does the learned algorithm decide which values to put in the missing pixels?
It’s hard to say in words. We may need a very complex algorithm to specify the answer, an algorithm so complex that we could never hope to write it out by hand. This is the point of machine learning. The machine writes the algorithm for us, but it can only do so if we give it many examples, not just these three.
9.3 Learning without Examples
Even without examples, we can still learn. Instead of matching input-output examples, we can try to come up with an algorithm that optimizes for desirable properties of the input-output mapping. This class of learners includes unsupervised learning and reinforcement learning.
In unsupervised learning, we are given examples of input data but no corresponding target outputs; the objective instead scores desirable properties of the learned mapping itself, such as how well it compresses or organizes the data.
In reinforcement learning, we suppose that we are given a reward function that explicitly measures the quality of the learned function’s output. To be precise, a reward function is a mapping from outputs to scalar scores, $R: \mathcal{Y} \rightarrow \mathbb{R}$.
At first glance unsupervised learning and reinforcement learning look similar: both maximize a function that scores desirable properties of the input-output mapping. The big difference is that unsupervised learning has access to training data whereas reinforcement learning usually does not; instead the reinforcement learner has to collect its own training data.
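As a deliberately tiny caricature of this difference, the sketch below (the numeric task and all names are hypothetical) shows a reward function mapping outputs to scores, and a learner that generates its own experience and keeps the best-rewarded action:

```python
import random

def reward(output):
    """A reward function: maps outputs to scalar scores (here, closeness to 10)."""
    return -abs(output - 10)

def reinforcement_learner(num_trials=1000):
    """The learner collects its own data: it tries actions and keeps the best one."""
    best_action, best_reward = None, float("-inf")
    for _ in range(num_trials):
        action = random.uniform(0, 20)  # generate experience rather than read a dataset
        r = reward(action)              # the reward function scores the output
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action

print(reinforcement_learner())  # approaches 10 as num_trials grows
```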
9.4 Key Ingredients
A learning algorithm consists of three key ingredients:
Objective: What does it mean for the learner to succeed, or, at least, to perform well?
Hypothesis space: What is the set of possible mappings from inputs to outputs that we will search over?
Optimizer: How, exactly, do we search the hypothesis space for a specific mapping that maximizes the objective?
These three ingredients, when applied to large amounts of data and run on sufficient hardware (referred to as compute), can do amazing things. We will focus on learning algorithms in this chapter, but often the data and compute turn out to be more important.
A learner outputs an algorithm, $f$, that maps inputs to outputs.
9.4.1 Importance of Parameterization
The hypothesis space can be described by a set of parameters: each setting of the parameters, denoted $\theta$, selects one function $f_\theta$ from the space, so learning becomes a search over parameter values.
Overparameterized models, where you use more parameters than the minimum necessary to fit the data, are especially important in modern computer vision; most neural networks (Chapter 12) are overparameterized.
9.5 Empirical Risk Minimization: A Formalization of Learning from Examples
The three ingredients from the last section can be formalized using the framework of empirical risk minimization (ERM). This framework applies specifically to the supervised setting where we are learning a function that predicts outputs $y$ from inputs $x$.
The learner solves
$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\!\left(f\!\left(x^{(i)}\right), y^{(i)}\right).$$
Here, $\mathcal{F}$ is the hypothesis space, $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$ is the training data (a set of $N$ $\{\text{input}, \text{output}\}$ pairs), and $\mathcal{L}$ is a loss function that measures how far each prediction is from its target.
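A minimal sketch of ERM in Python may help fix ideas; the finite hypothesis space of slopes and the squared-error loss below are illustrative choices, not part of the ERM framework itself:

```python
# Minimal ERM sketch: average a loss over the training pairs, and pick the
# hypothesis in a (here, finite) hypothesis space that minimizes that average.

def erm(hypothesis_space, training_data, loss):
    def empirical_risk(f):
        return sum(loss(f(x), y) for x, y in training_data) / len(training_data)
    return min(hypothesis_space, key=empirical_risk)

# Illustrative instantiation: lines through the origin, squared-error loss.
hypothesis_space = [lambda x, theta=t / 10: theta * x for t in range(0, 51)]
training_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
squared_error = lambda y_hat, y: (y_hat - y) ** 2

f_hat = erm(hypothesis_space, training_data, squared_error)
print(f_hat(4.0))  # roughly 8, since the best-fitting slope is near 2
```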
9.6 Learning as Probabilistic Inference
Depending on the loss function, there is often an interpretation of ERM as performing maximum likelihood probabilistic inference. In this interpretation, we are trying to infer the hypothesis that makes the observed training data most probable:
$$\hat{f} = \operatorname*{arg\,max}_{f \in \mathcal{F}} \; \prod_{i=1}^{N} p\!\left(y^{(i)} \mid x^{(i)}; f\right).$$
The term $p(y \mid x; f)$ is the probability the model assigns to observing output $y$ given input $x$, under hypothesis $f$.
To fully specify this model, we have to define the form of this conditional distribution. One common choice is that the prediction errors, $y - f(x)$, are Gaussian distributed; under this choice, maximizing likelihood is equivalent to minimizing squared error.
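To spell out that equivalence (a standard derivation, assuming errors are Gaussian with a fixed variance $\sigma^2$):

```latex
% Gaussian error model: y = f(x) + \epsilon, with \epsilon \sim \mathcal{N}(0, \sigma^2)
p\left(y \mid x; f\right)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left(-\frac{\left(y - f(x)\right)^2}{2\sigma^2}\right)

% Negative log-likelihood of the training set under this model:
-\log \prod_{i=1}^{N} p\left(y^{(i)} \mid x^{(i)}; f\right)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y^{(i)} - f\left(x^{(i)}\right)\right)^2
    + \frac{N}{2} \log\left(2\pi\sigma^2\right)
```

The second term does not depend on $f$, so the hypothesis that maximizes the likelihood is exactly the one that minimizes the sum of squared errors.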
In later chapters we will see that priors over hypotheses also play an important role in learning.
9.7 Case Studies
The next three sections cover several case studies of particular learning problems. Examples 1 and 3 showcase the two most common workhorses of machine learning: regression and classification. Example 2, Python program induction, demonstrates that the paradigms in this chapter are not limited to simple systems but can actually apply to very general and sophisticated models.
9.7.1 Example 1: Linear Least-Squares Regression
One of the simplest learning problems is known as linear least-squares regression. In this setting, we aim to model the relationship between two variables, $x$ and $y$, with a linear function.
As a concrete example, let’s imagine predicting the number of people at the beach, $y$, from the temperature outside, $x$. Our training data is a set of $N$ $\{\text{temperature}, \text{number of people}\}$ pairs, denoted as $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$.
Our hypothesis space is linear functions, that is, the relationship between $x$ and $y$ is modeled as $f_\theta(x) = \theta x$, a line whose slope $\theta$ is the parameter we will learn.
Our objective is that predictions should be near the ground truth targets in a least-squares sense, that is, we wish to minimize the sum of squared errors $\sum_{i=1}^{N} \big(y^{(i)} - f_\theta(x^{(i)})\big)^2$. We will use $J(\theta)$ to denote this objective as a function of the parameters.
The full learning problem is as follows:
$$\hat{\theta} = \operatorname*{arg\,min}_{\theta} J(\theta), \qquad J(\theta) = \sum_{i=1}^{N} \big(y^{(i)} - \theta x^{(i)}\big)^2.$$
We can choose any number of optimizers to solve this problem. A first idea might be “try a bunch of random values for $\theta$ and keep the one that achieves the lowest value of the objective.” This works, but for this problem calculus offers something better: an exact solution.
From calculus, we know that at any maximum or minimum of a differentiable function, its derivative is zero. The objective can be rewritten as
$$J(\theta) = \sum_{i=1}^{N} \big(y^{(i)}\big)^2 - 2\theta \sum_{i=1}^{N} x^{(i)} y^{(i)} + \theta^2 \sum_{i=1}^{N} \big(x^{(i)}\big)^2,$$
which is a quadratic in $\theta$. The derivative with respect to $\theta$ is
$$\frac{dJ}{d\theta} = -2 \sum_{i=1}^{N} x^{(i)} y^{(i)} + 2\theta \sum_{i=1}^{N} \big(x^{(i)}\big)^2.$$
We set this derivative to zero and solve for $\theta$, giving the closed-form solution
$$\hat{\theta} = \frac{\sum_{i=1}^{N} x^{(i)} y^{(i)}}{\sum_{i=1}^{N} \big(x^{(i)}\big)^2}.$$
Because $J$ is a convex quadratic, this critical point is its unique global minimum.
We can now summarize the entire linear least-squares learning problem as follows:
In these diagrams, we will sometimes describe the objective just in terms of the function $f$, leaving its dependence on the parameters $\theta$ implicit.
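As a sanity check on the derivation, here is a short script (the beach data is made up for illustration) that computes the closed-form slope and confirms that small perturbations of it only increase the objective:

```python
# Closed-form linear least-squares slope: theta = sum(x*y) / sum(x*x).

def fit_slope(data):
    """Return the least-squares slope of a line through the origin."""
    sum_xy = sum(x * y for x, y in data)
    sum_xx = sum(x * x for x, y in data)
    return sum_xy / sum_xx

def objective(theta, data):
    """J(theta): the sum of squared prediction errors."""
    return sum((y - theta * x) ** 2 for x, y in data)

# Hypothetical beach data: (temperature, number of people) pairs.
data = [(15.0, 20.0), (20.0, 35.0), (25.0, 40.0), (30.0, 55.0)]

theta_hat = fit_slope(data)
# The closed-form solution should beat any small perturbation of itself.
assert objective(theta_hat, data) <= objective(theta_hat + 0.01, data)
assert objective(theta_hat, data) <= objective(theta_hat - 0.01, data)
print(theta_hat)
```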
9.7.2 Example 2: Program Induction
At the other end of the spectrum we have what is known as program induction, which is one of the broadest classes of learning algorithm. In this setting, our hypothesis space may be all Python programs. Let’s contrast linear least-squares with Python program induction. Figure 9.7 shows what linear least-squares looks like.
The learned function is an algebraic expression that maps $x$ to $y$.
Figure 9.8 shows Python program induction solving the same problem. In this case, the learned function is a Python program that maps $x$ to $y$.
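Real program induction systems use far more sophisticated search than this, but a brute-force toy version conveys the idea; the candidate space of one-line arithmetic expressions below is an illustrative choice:

```python
# Toy program induction: enumerate a tiny space of one-line Python programs
# and return the first one consistent with all the training examples.

def induce_program(examples, max_const=10):
    candidates = (
        [f"x + {c}" for c in range(max_const)]
        + [f"x * {c}" for c in range(max_const)]
        + [f"x ** {c}" for c in range(4)]
    )
    for source in candidates:
        program = eval(f"lambda x: {source}")
        if all(program(x) == y for x, y in examples):
            return source, program
    return None, None

source, program = induce_program([(1, 2), (2, 4), (3, 6)])
print(source)       # "x * 2"
print(program(10))  # 20: the induced program generalizes to a new input
```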
9.7.3 Example 3: Classification and Softmax Regression
A common problem in computer vision is to recognize objects. This is a classification problem. Our input is an image and our output is a class label identifying the object in the image.
How should we formulate this task as a learning problem? The first question is how do we even represent the input and output? Representing images is pretty straightforward; as we have seen elsewhere in this book, they can be represented as arrays of numbers representing red-green-blue colors: $x \in \mathbb{R}^{H \times W \times 3}$ for an image with height $H$ and width $W$ in pixels.
How can we represent class labels? It turns out a convenient representation is to let $y$ be a one-hot vector: a length-$K$ vector (where $K$ is the number of classes) that contains a 1 in the position of the true class and 0s everywhere else.
Next, we need to pick a loss function. Our first idea might be that we should minimize misclassifications. That would correspond to the so-called 0-1 loss,
$$\mathcal{L}(\hat{y}, y) = \begin{cases} 0, & \text{if } \hat{y} = y, \\ 1, & \text{otherwise}, \end{cases}$$
which counts each misclassification as one unit of loss. This loss is hard to optimize directly (it is piecewise constant in the model’s parameters), so instead we will have the model output a graded score for each class, $\hat{y} \in \mathbb{R}^{K}$, and penalize the negative log score of the true class with the cross-entropy loss, $\mathcal{L}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$.
The way to think about this is that $\hat{y}$ represents a probability mass function (pmf) over the $K$ classes: $\hat{y}_k$ is the predicted probability that the input belongs to class $k$.
For that interpretation to be valid, we require that every element of $\hat{y}$ is nonnegative and that the elements sum to 1.
The raw outputs of our learned function will not satisfy these constraints on their own.
To ensure that the output of our learned function is a valid pmf, we squash the raw outputs so that they are nonnegative and sum to 1.
A popular way to squash is via the softmax function:
$$\operatorname{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
Using softmax is a modeling choice; we could have used any function that squashes into a valid pmf, that is, a nonnegative vector that sums to 1.
The values in $z$, the vector of raw scores fed into the softmax, are commonly called logits.
Now we have the full model: $\hat{y} = f_\theta(x) = \operatorname{softmax}\!\left(\theta^{\mathsf{T}} x\right)$, a linear function of the input followed by a softmax.
Figure 9.10 shows what the variables look like for processing one photo of a fish during training.
The prediction placed about 40 percent probability on the true class, “guitarfish,” so we are 60 percent off from an ideal prediction (indicated by the red bar; an ideal prediction would place 100 percent probability on the true class). Our loss is the negative log probability assigned to the true class, $-\log 0.4 \approx 0.92$.
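The sketch below reproduces these numbers; the logits are made up so that the softmax places about 40 percent probability on the true class, as in the figure:

```python
import math

def softmax(z):
    """Squash a vector of logits into a valid pmf: nonnegative, sums to 1."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Negative log probability of the true class (y is one-hot)."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat))

z = [1.6, 1.2, 0.9, 0.5]  # made-up logits for four classes
y = [1, 0, 0, 0]          # one-hot target: the true class is class 0
y_hat = softmax(z)
print(round(y_hat[0], 2))                 # 0.4: probability on the true class
print(round(cross_entropy(y_hat, y), 2))  # 0.92, i.e., -log(0.4)
```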
This learning problem, which is also called softmax regression, can be summarized as follows:
Softmax regression is just one way to model a classification problem. We could have made other choices for how to map input data to class labels.
Notice that we have left the hypothesis space only partially specified, and we left the optimizer unspecified. This is because softmax regression refers to the whole family of learning methods that have this general form. This is one of the reasons we conceptualized the learning problem in terms of the three key ingredients described previously: you can often develop them each in isolation, then mix and match.
9.8 Learning to Learn
Learning to learn, also called metalearning, is a special case of learning where the hypothesis space is learning algorithms.
Recall that learners train on past instances of a problem to produce an algorithm that can solve future instances of the problem. The goal of metalearning is to handle the case where the future problem we will encounter is itself a learning problem, such as “find the least-squares line fit to these data points.” One way to train for this would be by example.
Suppose that we are given the following $\{\text{input}, \text{output}\}$ examples: each input is a dataset of points, and each output is the line that best fits those points in the least-squares sense.
These are examples of performing least-squares regression; therefore the learner can fit these examples by learning to perform least-squares regression.
Note that least-squares regression is not the unique solution that fits these examples, and the metalearner might arrive at a different solution that fits equally well.
Since least-squares regression is itself a learning algorithm, we can say that the learner learned to learn.
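This claim can be checked mechanically. In the sketch below (the meta-examples and names are hypothetical), each example pairs an input dataset with an output slope, and the hypothesis “perform least-squares regression,” itself a learning algorithm, fits every one of them:

```python
# Metalearning sketch: each example pairs an input *dataset* with an output
# *function* (here represented by its slope). The hypothesis "perform
# least-squares regression" is itself a learning algorithm, and it fits
# these meta-examples.

def least_squares_learner(dataset):
    """A learner: maps a dataset of (x, y) points to a slope through the origin."""
    return sum(x * y for x, y in dataset) / sum(x * x for x, y in dataset)

# Hypothetical meta-training examples: (dataset, best-fit slope) pairs.
meta_examples = [
    ([(1.0, 2.0), (2.0, 4.0)], 2.0),
    ([(1.0, 3.0), (3.0, 9.0)], 3.0),
    ([(2.0, 1.0), (4.0, 2.0)], 0.5),
]

# The candidate hypothesis (a learning algorithm) explains every meta-example.
assert all(abs(least_squares_learner(d) - slope) < 1e-9
           for d, slope in meta_examples)
```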
We started this chapter by saying that learning is a meta-algorithm: an algorithm that outputs an algorithm. Metalearning is a meta-meta-algorithm, and we can visualize it by just adding another outer loop on top of a learner, as shown in Figure 9.12.
Notice that you can apply this idea recursively, constructing meta-meta-...-metalearners. Humans perform at least three levels of this process, if not more: we have evolved to be taught in school how to learn quickly on our own.
Evolution is a learning algorithm according to our present definition.
9.9 Concluding Remarks
Learning is an extremely general and powerful approach to problem solving. It turns data into algorithms. In this era of big data, learning is very often the preferred approach. It is a major component of almost all modern computer vision systems.