Study Guide: Neural Networks and Deep Learning by Michael Nielsen

December 10, 2021

After finishing Part 1 of the free online course Practical Deep Learning for Coders by fast.ai, I was hungry for a deeper understanding of the fundamentals of neural networks.

To tackle this, I worked through Michael Nielsen's openly licensed and freely available book entitled Neural Networks and Deep Learning, published in 2015.

Working through the book a couple of times was a challenging and effective exercise that filled in the knowledge gaps left by the broad but surface-level understanding of neural nets I gained from Part 1 of the fast.ai course. I've summarized my notes and thoughts from each chapter below, both to reinforce my own learning and in the hope that it's helpful to others like me who are just getting started on their machine learning journey.

I thought that the book did a fantastic job of explaining neural networks from the ground up. Nielsen takes you on a journey that starts with a brief history lesson and some intuition building using simple perceptron-based neural nets and then slowly builds up to optimizing a handwritten digit classifier that achieves >99% accuracy on the MNIST data set. Accompanying the book is a well-documented code repository containing three iterations of a network that Nielsen walks through and evolves over the six chapters.

Despite being just six chapters, the book is meaty. I don't think I would recommend it to someone just starting to study machine learning unless they already have a strong math background. However, I certainly would highly recommend it to anybody who wants to go deeper in exploring or reviewing the fundamentals of neural networks and isn't content with just fine-tuning off-the-shelf models for practical applications.

A note on the math

If you are like me and have forgotten most of your college-level linear algebra and multivariable calculus, or just seem to have small heart attacks when presented with tricky math notation, you might find yourself discouraged by about the middle of Chapter 2. Nielsen has a Ph.D. in Physics and is a quantum computing pioneer, so even though he tries to take it easy with the math, what is simple and straightforward for him may feel quite daunting for us mere mortals. That being said, please trust me that it can start to make sense and feel readable if you give it time and effort, so don't give up! Instead, take your time and brush up on the math using other resources while you read. I'd highly recommend Khan Academy's courses on multivariable calculus (especially the section on partial derivatives) and linear algebra, as well as 3Blue1Brown's YouTube playlists on the Essence of Calculus, the Essence of Linear Algebra, and Neural Networks. Finally, here's a link to a math notation cheat sheet that you may find useful.

My other tip here might seem obvious, but maybe it will help somebody: math notation is like a stack trace from a software program. Your brain wants to filter that stuff out because it's not plain English. Don't do that. Read the math notation just like you'd read a stack trace - carefully, line by line. Learn what every little symbol means and then try to visualize what's happening as you plug something real into the variables. Slowly you'll be able to read it with more fluency, and it will become much less scary.

Study Guide

Below, I've summarized my notes from each chapter, attempted to highlight the key concepts that the book introduces, and provided some of my own commentary at various points. My goal for this is to serve as a sort of in-depth table of contents that can be used as a jumping-off point to dive deeper into the book. Summarizing is hard and I'm sure I've made mistakes below or glossed over important bits, so please refer to the book as the actual source of truth here. Also, please let me know if you see something that needs to be corrected!

Chapter 1: Using neural nets to recognize handwritten digits

This chapter introduces neural networks via the challenge of classifying handwritten digits, which is an easy problem for our human brains but nearly impossible for computer programs unless they use machine learning. There are three major components in a neural network: 1) a model of an artificial neuron which receives input and (based on the strength of the input and its activation threshold) produces some output; 2) a model for the strength of connections between these neurons; 3) a learning algorithm that automatically adjusts the strength of these connections to improve the performance of the network as it trains for a specific task.

Using the simplest possible model of an artificial neuron called a perceptron (which receives binary inputs and produces a binary output), Nielsen shows that a perceptron can implement a NAND gate and thus:

because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
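To make this concrete, here's a tiny Python sketch of a perceptron in action, using the weights and bias from the book's NAND example:

```python
def perceptron(inputs, weights, bias):
    """Fire (output 1) if the weighted sum of the inputs plus the bias is positive."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0

# The book's NAND example: two weights of -2 and a bias of 3.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron((x1, x2), (-2, -2), 3))
# Outputs 1 for every input pair except (1, 1) -- a NAND gate.
```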

The chapter then introduces a type of artificial neuron called the sigmoid neuron, which is distinct from a perceptron in that instead of producing a binary output, it produces an output that is any real number between 0 and 1. Moving from a discrete output to a continuous output is important because it gives us more visibility into how small adjustments to the strengths of connections (the weights) or the activation threshold of the neurons (the biases) in the network move us closer to or farther away from our goal, which in this case is correctly classifying digits.
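In code, the only change from the perceptron is swapping the hard threshold for the smooth sigmoid function σ(z) = 1 / (1 + e^-z). A minimal sketch (the example inputs are mine):

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A small nudge to a weight now causes a small, smooth change in the output
# instead of potentially flipping it from 0 to 1.
print(sigmoid_neuron((0.5, 0.8), (0.10, -0.3), 0.2))  # ~0.5025
print(sigmoid_neuron((0.5, 0.8), (0.11, -0.3), 0.2))  # ~0.5037
```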

This measurement of how well or poorly the network is achieving its goal is called the cost function, and by minimizing this function, we can improve the performance of our network. To minimize the cost function, we can apply a technique called gradient descent, a learning algorithm that works by calculating the gradient of the cost function with respect to each parameter (each weight and bias) and then adjusting these parameters in small increments so that they reduce the cost. The details of this process, specifically backpropagation, which is the algorithm often used by gradient descent for efficiently calculating the gradient of the cost function with respect to each variable in the model, are probably the trickiest math bits to wrap your head around and are discussed in more detail in Chapter 2.
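Here's a toy illustration of gradient descent (my own example, not the book's): minimizing the one-variable cost C(w) = (w - 3)², whose gradient is 2(w - 3):

```python
def gradient(w):
    # Gradient of the toy cost C(w) = (w - 3)**2 with respect to w.
    return 2 * (w - 3)

w = 0.0     # arbitrary starting point
eta = 0.1   # learning rate
for _ in range(50):
    w -= eta * gradient(w)  # take a small step in the direction of steepest descent

print(w)  # ~3.0, the minimum of the cost
```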

The chapter concludes by discussing what makes a neural network "deep" (having two or more hidden layers) and building some intuition for why these additional layers allow networks to perform well on more complicated tasks. The idea is that the broader classification task can be decomposed into sub-questions or sub-problems, with early layers answering the "base cases" or the simple subtasks and subsequent layers using this output to answer increasingly more complex questions until we are able to make accurate predictions on the overall task. This is foreshadowing for Chapter 6!

Chapter 1 Key Terms

  • perceptron: the simplest model of an artificial neuron, which receives one or more binary inputs and produces a single binary output.
  • weights: a number representing the strength of a connection between two artificial neurons in the network.
  • bias: aka the activation threshold of the artificial neuron. If the weighted sum of the inputs plus the bias is greater than 0 (i.e., the weighted sum exceeds the threshold), the perceptron "fires" and outputs 1; otherwise it outputs 0.
  • activation function: the function that an artificial neuron uses to calculate its output.
  • sigmoid neuron: an artificial neuron that produces an output between 0 and 1 by using the sigmoid function as an activation function.
  • cost function: aka loss or objective function is a function that measures how well our network is doing relative to the goal - the lower the cost, the better the network is performing. Some properties of a good cost function are that it always outputs a non-negative cost and that the cost approaches 0 as the performance of the network improves.
  • gradient descent: an optimization algorithm used to minimize a function by moving in the direction of steepest descent which is found by calculating the gradient of the function.
  • learning rate: aka alpha or step size is the size of the step taken at each iteration of gradient descent. Set the learning rate too high and gradient descent might step over the minimum and never converge. Set it too small and training will be very slow.
  • stochastic gradient descent: a version of gradient descent that uses a random subset of the training data to calculate the gradient and adjust the network parameters. Using a subset of the data allows training networks faster than if we included all training examples in every step.
  • hyper-parameters: variables that affect the performance of the neural network but which aren't learned by the network itself. Some examples of hyper-parameters are the neural network architecture, the learning rate, the cost function, the activation function, weight initialization, etc.
  • mean squared error cost function: aka the quadratic cost function, calculated by taking the squared difference between the expected output and the actual prediction, then averaging this across training examples (see the worked example below).
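A quick numeric check of that last definition, using Nielsen's form of the quadratic cost C = 1/(2n) Σ ‖y - a‖² (the toy numbers are mine):

```python
# Mean squared error over two toy training examples.
# The extra factor of 1/2 matches the book and simplifies the derivative.
predictions = [0.8, 0.3]  # network outputs a
targets     = [1.0, 0.0]  # expected outputs y

n = len(predictions)
cost = sum((y - a) ** 2 for a, y in zip(predictions, targets)) / (2 * n)
print(cost)  # (0.04 + 0.09) / 4 = 0.0325
```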

Chapter 2: How the backpropagation algorithm works

Backpropagation is the algorithm most commonly used with gradient descent to efficiently make adjustments to the network parameters to improve network performance. This chapter was challenging because there is quite a bit of math notation, and I think backpropagation generally can be tricky to wrap your head around. In addition to reviewing the chapter a few times, I also found the 3Blue1Brown videos "What is backpropagation really doing?" and "Backpropagation Calculus" helpful for building intuition. I'd also highly recommend this CS231n lecture by Andrej Karpathy and the associated lecture notes because I think he explains it in a complementary way to the book and provides smaller, more concrete examples. This is also a decent summary of backprop.
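To make the chain rule concrete, here's a minimal sketch (my own toy example, not the book's code) of backpropagation through a chain of two sigmoid neurons, with the result checked against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

x, y = 0.5, 1.0     # input and target
w1, w2 = 0.6, -0.4  # weights of the two-neuron chain

# Forward pass: x -> sigmoid(w1 * x) -> sigmoid(w2 * a1) -> cost
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)
cost = 0.5 * (a2 - y) ** 2

# Backward pass: apply the chain rule, layer by layer.
dC_dz2 = (a2 - y) * sigmoid_prime(z2)
dC_dw2 = dC_dz2 * a1                      # gradient for w2
dC_dz1 = dC_dz2 * w2 * sigmoid_prime(z1)  # the error, propagated backward
dC_dw1 = dC_dz1 * x                       # gradient for w1

# Sanity check: nudge w1 and estimate the gradient numerically.
eps = 1e-6
a2_eps = sigmoid(w2 * sigmoid((w1 + eps) * x))
numeric = (0.5 * (a2_eps - y) ** 2 - cost) / eps
print(dC_dw1, numeric)  # the two values should agree closely
```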

Chapter 2 Key Terms

  • backpropagation: a fast algorithm for computing the gradient of the cost function with respect to each weight and bias in the network; gradient descent then uses these gradients to take a learning-rate-sized step in the downward direction.
  • chain rule: a mathematical property of composite functions - the derivative of a composite function is the derivative of the outer function (evaluated at the inner function) multiplied by the derivative of the inner function. Used by backpropagation to calculate the derivative of the cost function with respect to any parameter in the network.

Chapter 3: Improving the way neural networks learn

This chapter introduces techniques for optimizing the performance of neural networks. The beauty of this chapter is that even though many of these new terms sound fancy and complicated, they generally turn out to be minor adjustments to how one or more of our core components work. To clarify what I mean here: in Chapter 1, we established a mental model for a neural net containing three main components: a model of an artificial neuron, a model for the connections between neurons, and a learning algorithm that evaluates some cost function and then updates the network's parameters in a way that incrementally decreases that cost. Chapter 3 is just about exploring how we can tweak the mechanics of these components to continue to improve the performance of our network.

Activation Functions

The choice of activation function makes one type of artificial neuron distinct from another. For example, the perceptron used a step function as its activation function, outputting either 0 or 1. Then we moved to a sigmoid neuron, which uses the sigmoid function for its activation function and thus outputs any real number between 0 and 1. There are many other possible activation functions: tanh, which is really just a rescaled sigmoid function that outputs a value between -1 and 1 instead of 0 and 1; rectified linear units aka ReLU, which is a fancy term for taking the max of the weighted input and 0; softmax, which is similar to sigmoid in that each output is between 0 and 1, but each output is normalized by the sum of the exponentiated activations in the output layer, allowing us to more easily interpret the level of activation of an individual neuron relative to the others in that layer; and many others! The main idea is that different activation functions have different characteristics, so experimenting with this hyper-parameter can lead to improved training performance depending on your training data and use case.
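Here's what those four activation functions look like side by side (a minimal numpy sketch of my own, not from the book):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))   # output in (0, 1)

def tanh(z):
    return np.tanh(z)             # output in (-1, 1); a rescaled sigmoid

def relu(z):
    return np.maximum(0, z)       # max of the weighted input and 0

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()            # outputs sum to 1: a probability distribution

z = np.array([2.0, 1.0, -1.0])
for fn in (sigmoid, tanh, relu, softmax):
    print(fn.__name__, fn(z))
```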

Overfitting and regularization

Another concept that this chapter introduces is overfitting, which is when the network is significantly more accurate when presented with training data than with test data, indicating that it hasn't learned a general solution to the task. There are two main approaches to reduce overfitting: increasing the size of the training data and using regularization techniques.

To get more training data, you can collect it if that's feasible, and/or you can artificially expand your existing data set by modifying it in a way that makes it distinct from the original but still a valid example for training. In computer vision, this could mean transforming the image by rotating it, distorting it, occluding parts of it, etc. This is known as data augmentation.
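For MNIST-style digits, even a one-pixel shift yields a new, valid training example — the book discusses expanding MNIST with small displacements like this. A minimal numpy sketch of the idea (mine, not the book's code):

```python
import numpy as np

def shift_image(image, dx, dy):
    """Return a copy of a 2-D image shifted by (dx, dy) pixels,
    filling the vacated edge with zeros (the background)."""
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
    if dy > 0: shifted[:dy, :] = 0
    elif dy < 0: shifted[dy:, :] = 0
    if dx > 0: shifted[:, :dx] = 0
    elif dx < 0: shifted[:, dx:] = 0
    return shifted

digit = np.random.rand(28, 28)  # stand-in for a 28x28 MNIST digit
augmented = [shift_image(digit, dx, dy)
             for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
# One original image becomes five distinct training examples.
```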

Regularization techniques help prevent overfitting by discouraging "bad behavior" in the network. One example of bad behavior is when a network's weights are inflated to large values unnecessarily. The intuition behind this is that smaller weights have less complexity, so if you can minimize the cost with a smaller weight vs a larger weight, the smaller weight should be preferred. Nielsen points out that this isn't a perfect line of reasoning because there are plenty of examples in science where the more complex explanation for something has turned out to be true. The main reason we apply regularization is that we have empirical evidence that networks with regularization generalize better! Anyway, to mitigate inflated weights, you can use L1 or L2 regularization (the latter aka weight decay), which adds a penalty term based on the size of the weights to the cost function (the larger the weights, the higher the cost).
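In code, L2 regularization amounts to one extra term added to the cost, following the book's formulation C = C₀ + (λ/2n) Σ w². A sketch (the layer sizes are arbitrary stand-ins):

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lam, n):
    """Add the L2 penalty (lambda / 2n) * sum of squared weights to the cost.
    (L1 regularization would use the sum of absolute values instead.)"""
    penalty = (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return base_cost + penalty

weights = [np.random.randn(30, 784), np.random.randn(10, 30)]  # arbitrary layers
print(l2_regularized_cost(base_cost=0.42, weights=weights, lam=5.0, n=50000))
```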

Another example of bad network behavior is when individual neurons are heavily reliant on neighboring connections to provide useful predictions. To combat this and improve the robustness of the network, the dropout regularization technique hides a different random subset of neurons in the network with each mini-batch of training examples. The net effect is a network that acts like the average of many different smaller networks (which may individually have overfit their subset of the training examples), and this averaging usually reduces overfitting.
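The core mechanic of dropout is just a random mask over a layer's activations. Here's a sketch of the "inverted dropout" variant (the book instead scales the weights down after training; same idea, different bookkeeping):

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Zero out a random subset of activations and rescale the survivors
    so that the expected activation is unchanged."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1 - p_drop)

rng = np.random.default_rng(0)
hidden = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(hidden, p_drop=0.5, rng=rng))  # a different subset survives each call
```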

Learning algorithm and cost functions

The way we evaluate the loss for our model obviously affects the rate at which it learns. Nielsen introduces a new loss function called cross-entropy loss, which is often a better choice than mean squared error because it more heavily penalizes predictions that are confidently wrong, helping the model converge more rapidly. As with activation functions, there are many possible cost functions, each with interesting properties that could make it a good candidate depending on the model.
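For a single sigmoid output, the book's cross-entropy cost is C = -[y ln a + (1 - y) ln(1 - a)], averaged over training examples. A quick numeric sketch of why it punishes confidently wrong predictions harder than MSE does:

```python
import math

def cross_entropy(a, y):
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def mse(a, y):
    return 0.5 * (a - y) ** 2

# The target is 1; compare a mildly wrong vs. a confidently wrong prediction.
for a in (0.4, 0.01):
    print(f"a={a}: cross-entropy={cross_entropy(a, 1):.2f}, mse={mse(a, 1):.2f}")
# Cross-entropy jumps from ~0.92 to ~4.61 while MSE stays below 0.5, so the
# gradients stay large and the network recovers from bad starts faster.
```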

Nielsen also touches on variations to gradient descent, including momentum-based gradient descent, which, as the name implies, introduces the idea of velocity and friction into the descent steps so that prior steps can either speed up or slow down future steps.
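The momentum update replaces the plain gradient step with a velocity that accumulates past gradients, following the book's formulation v' = μv - η∇C, w' = w + v' (the toy cost below is mine):

```python
def momentum_step(w, v, grad, eta=0.1, mu=0.9):
    """One momentum-based gradient descent update.
    mu acts like (1 - friction); mu=0 recovers plain gradient descent."""
    v = mu * v - eta * grad  # velocity accumulates past gradients
    return w + v, v

w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * (w - 3))  # same toy cost as in Chapter 1
print(w)  # ~3.0; the payoff from momentum shows up on ill-conditioned costs
```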

Chapter 3 Final Thought

For me, a major takeaway of the chapter is that tuning hyper-parameters is an iterative and exploratory process. It is part of what makes machine learning an art as well as a science. Despite all the research in this area, we don't yet deeply understand why all of this stuff works; we can only measure that, at least for some standard data sets and tasks, these techniques are highly effective. In other words, we have a lot of empirical evidence without a strong theoretical foundation. Like Nielsen, I find this fascinating and inspiring -- there is still so much room for innovation and discovery in this space. I loved this bit from the final paragraph of the chapter:

It's like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. ... When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it's more important to explore boldly than it is to be rigorously correct in every step of your thinking.

Chapter 3 Key Terms

  • softmax: typically used as an activation function in the output layer because it has the property that each neuron's exponentiated weighted input is divided by the sum of the exponentiated weighted inputs for that layer, allowing activations in that layer to be compared relative to one another. The outputs form a probability distribution because they collectively sum to one.
  • rectified linear unit: aka ReLU: an activation function that is just max(0, z) where z is the weighted input, thus chopping off negative activations.
  • cross-entropy cost function: a cost function that calculates a "measure of surprise" and thus increases as the prediction diverges from the expected label -- it's more effective than mean squared error at recovering from poorly initialized weights and biases.
  • log likelihood cost function: often used with a softmax output layer; just the negative natural log of the activation of the neuron corresponding to the correct label.
  • neuron saturation: when the activation of a neuron is near an extreme value (e.g. 0 or 1 for a sigmoid neuron), where the activation function's curve is nearly flat - this can cause learning to slow down.
  • overfitting: when the network is significantly less accurate on the test data than the training data, indicating it has memorized the training data instead of generalizing to the prediction task.
  • regularization methods: techniques for reducing overfitting in a network.
  • L1 regularization: adding the sum of the absolute values of the weights to the cost function times some regularization parameter, thus driving the cost function higher when weights are large.
  • L2 regularization: aka weight decay means adding the sum of the squared weights times some regularization parameter to the cost function, thus driving the cost function higher when weights are large.
  • dropout: removing a random subset of neurons from the network during each mini-batch of training to reduce overfitting.
  • data augmentation: artificially expanding your available training data by modifying the data in a way that makes it distinct from the original but still a valid/useful training example.

Chapter 4: A visual proof that neural nets can compute any function

In this short chapter, Nielsen provides a bunch of awesome interactive visualizations as a form of visual proof that neural networks can approximate any function (aka any real-world process, like translating a language, classifying a digit, or predicting the temperature) as long as the network has at least one hidden layer. I don't have much to add here other than to encourage some playtime with the interactive visualizations. I found this chapter inspiring. The universality theorem provides grounding: neural networks are the right "substrate" to continue experimenting with as we collectively make progress toward more general learning architectures and algorithms.

Chapter 4 Key Terms

  • universality theorem: states that a neural network with at least one hidden layer is capable of approximating any continuous function to an arbitrary degree of accuracy, given enough hidden neurons and the appropriate weights.

Chapter 5: Why are deep neural networks hard to train?

Even though we know that we can approximate any function with a shallow network of a single hidden layer, there are a few good reasons to prefer deeper networks (more hidden layers) over shallow ones. Firstly, it has been proven that some functions require exponentially more neurons to approximate with a shallow network than with a deeper one. Additionally, intuition tells us that deeper networks are capable of learning more complex, hierarchical representations than shallower networks are.

However, the deeper we build our networks, the more likely we are to run into issues with gradient-based learning. The problem is that gradients are unstable: the gradient in an early hidden layer is the product of gradients from all the later layers in the network. As we add layers to the network, we increase the risk that these earlier gradients either diminish toward 0 (the vanishing gradient problem), which grinds learning to a halt for those layers, or "explode" toward large values (the exploding gradient problem), which makes learning unstable.
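A quick numeric illustration of why sigmoid gradients vanish (my own sketch): σ'(z) peaks at 0.25, so each additional sigmoid layer multiplies the gradient reaching the early layers by at most 0.25 times a weight:

```python
import math

def sigmoid_prime(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

# By the chain rule, the gradient reaching the first layer of a chain of
# sigmoid neurons picks up a factor of w * sigmoid'(z) for every later layer.
w, z = 1.0, 0.0  # sigmoid'(0) = 0.25 is the best case
for depth in (2, 5, 10):
    factor = (w * sigmoid_prime(z)) ** (depth - 1)
    print(depth, factor)  # 0.25, ~0.0039, ~3.8e-06: learning stalls in early layers
# With large weights, the same product can instead grow exponentially -- the
# exploding gradient problem.
```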

In addition to exploding and vanishing gradients, Nielsen also cites a few other papers that highlight challenges with training deep neural nets that have to do with weight initialization, momentum schedules in momentum-based gradient descent, and choice of the activation function.

In short, it seems like the point of this chapter is to provide a warning that pretty much any hyper-parameter can cause you headaches in training a network if it's not chosen thoughtfully!

Chapter 5 Key Terms

  • unstable gradient problem: the gradient in early layers is the product of gradients in subsequent layers (due to the chain rule), leading to the different layers learning at different rates and the gradients at earlier layers either diminishing to 0 or exploding to large values.
  • vanishing gradient problem: when earlier hidden layers learn significantly slower than later hidden layers because the gradients approach zero as they backpropagate to these earlier layers.
  • exploding gradient problem: when the gradients in hidden layers get extremely large, leading to correspondingly large updates to weights and biases and thus an unstable learning process.

Chapter 6: Deep learning

The final chapter of the book explores more advanced deep learning architectures (at least relative to what's been presented thus far) and speculates a bit on the future of deep learning. It's on the longer side, so I've just attempted to highlight and comment on the most interesting sections.

It starts with an introduction to convolutional neural networks (CNNs), which introduces an exciting concept: we can capture more information from our input data if we change the architecture of the network. The issue that motivates a CNN architecture is that in a deep feedforward network, when we flatten our input into one long vector, we discard spatial information about where the pixels reside in the image relative to one another. CNNs capture this spatial information by using hidden layers that represent specific feature activations within certain parts of the image. These are created by sliding a small matrix called a filter or kernel across the input image and using the activation at each stop as input to a single hidden neuron in the subsequent layer. The filter has just one set of weights, so the idea is that no matter where it's looking at the image, it's going to activate for the same type of features. The resulting hidden layer is called a feature map because it represents what features are present and where they are spatially in the input. Many kernels/filters can be used simultaneously to create many feature maps, and these features can be fed into additional convolutional layers, which can learn more abstract features from the simpler features in the earlier layers. Check out the diagrams of CNNs from the book, because they were helpful for building intuition for what is happening.
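Here's a bare-bones sketch (mine, not the book's Theano code) of sliding a 3x3 filter across an image to produce a feature map, followed by 2x2 max pooling:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel across the image; each stop produces one
    feature-map activation, using the same weights at every location."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    fmap = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + kh,
                          j * stride : j * stride + kw]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

def max_pool(fmap, size=2):
    """Condense each size x size region of the feature map to its max activation."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(28, 28)    # stand-in for a 28x28 MNIST digit
kernel = np.random.randn(3, 3)    # one (normally learned) filter
fmap = convolve2d(image, kernel)  # 26x26 feature map
pooled = max_pool(fmap)           # 13x13 after 2x2 max pooling
print(fmap.shape, pooled.shape)
```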

The next section is an annotated walkthrough of the third iteration of the neural network, this time using the Theano deep learning framework to implement a CNN. I read through the code here to make sure I was following along but didn't give this a ton of attention because the Theano project appears to be inactive now. The ML community has moved toward PyTorch and TensorFlow, so I'll be focusing on getting intimately familiar with those libraries instead.

The next section highlights a few papers from 2012-2015 that demonstrate the rapid progress in computer vision that CNNs made possible (smashing through previous records for classification accuracy on ImageNet).

Then there is a section that briefly surveys other exciting neural net architectures, specifically recurrent neural networks (RNNs), generative networks, and reinforcement learning. I found the section on RNNs particularly interesting because, in the same way a CNN architecture allows us to preserve spatial information in the input data, RNNs allow us to preserve temporal or sequential relationships in the input data, lending themselves to tasks like speech recognition or natural language processing.

The final section is a discussion about the future of deep learning and neural networks. Nielsen touches on the likely rise and improvement of intention-driven user interfaces, how machine learning algorithms are already being used to discover "known unknowns" in data, and attempts to answer the question "Will neural networks and deep learning soon lead to artificial intelligence?" His take is that it's too early to tell if we are on a path to true AGI, but that deep learning and artificial neural nets are powerful ideas either way and thus likely to be here to stay.

Chapter 6 Key Terms

  • convolutional neural networks: a neural network architecture widely used for computer vision that preserves information about the spatial structure of the input data through the use of local receptive fields, shared weights, and pooling.
  • kernel: aka filter aka feature detector - all the connections from each local receptive field in the input layer to neurons in the hidden layer share the same weights and bias, meaning that all neurons in that hidden layer detect the same feature but at different locations in the input image.
  • feature map: a hidden layer that is the result of applying a filter or kernel to the input image.
  • pooling: a set of techniques for condensing information from the convolutional layer, reducing the number of parameters needed in later layers by reducing the spatial precision of the feature map activations.
  • max pooling: taking the maximum activation of a region of a feature map.
  • L2 pooling: taking the square root of the sum of the squares of all the activations of a region of a feature map.
  • stride length: in a CNN, this is the pixel distance between each local receptive field. E.g., if we move the local receptive field 1 pixel to the right, the stride length is 1.
  • ensembles: training multiple networks and taking their average prediction - this tends to improve overall accuracy by a few percentage points.
  • recurrent neural networks: aka RNNs are a class of neural networks that capture temporal or sequence information of the input by allowing the hidden layers of the network to depend not only on the activations of the previous layer but also on activations at earlier times.

Appendix: Is there a simple algorithm for intelligence?

This appendix speculates on whether or not we will eventually be able to discover a simple algorithm or set of unifying principles (like evolution or gravity) for general intelligence. Nielsen chooses to take the optimistic stance, believing that such an algorithm exists.

When it comes to research, an unjustified optimism is often more productive than a seemingly better justified pessimism, for an optimist has the courage to set out and try new things. That's the path to discovery, even if what is discovered is perhaps not what was originally hoped.

In other words, we can't know for sure, but we'll never figure it out if we don't try!

Takeaways, Inspiration, and Future Work

When I started working through the book, my goal was to understand the fundamentals of neural networks more deeply. I feel like I took significant steps toward this goal and I'm looking forward to building on this stronger foundation with an enormous amount of practice building neural nets myself through many projects!

One thing that pleasantly surprised me about Nielsen's writing was all the anecdotes from other scientific disciplines and science history that he sprinkled throughout the book. It's awe-inspiring to think about how old some of these ideas are (e.g. backpropagation was introduced in 1970), and that they may lie dormant for decades until combined with the right complementary idea or hardware breakthrough, and all of a sudden surge in popularity and usefulness. It's so exciting to think about how much we have left to discover and that some of these old, currently unpopular ideas might be diamonds in the rough just waiting to be discovered and put into the right light.

Another thing I loved was the interactive visualization elements scattered throughout the book, especially in Chapter 3, to demonstrate the impact of different cost and activation functions. I also happen to be passionate about creating educational tools and found myself daydreaming about ways to enhance my understanding of what was happening via interactive visualization. For example, I was craving a Khan Academy-style article on backpropagation that would give me a small neural net that I could optimize by hand for a few epochs. Another idea: a neural net that I could step through and watch as the network trains epoch by epoch, watch the gradients flow backward (could show changes in the network via the physical size of the connection between neurons or changing color of neurons as they activate, etc.)

In doing some quick searching, it seems like there is some work in similar directions, for example ConvNetJS and especially the TensorFlow Playground. An extension of these projects could be to show all of the different representations of the net at once - as matrices, as code, and as a neural net graph - with linking between them (e.g. if I highlight a column in the weight matrix, the corresponding edge of the neural net graph is highlighted as well). For large nets, this is likely infeasible, but for the smaller example nets and as an educational resource, I think this could be useful and something I may circle back to over the next few months as a side-project. In addition to an educational use case, I wonder if this could be useful as a debugging or interpretation tool for larger nets. It would be interesting to watch how the network activations light up in real-time when doing inference on a training example (and especially interesting to compare this to a training example of the same class that the network is misclassifying!) We can call it "Neural Net-ercises", or "Tensor Town", because learning should be fun. Oh, or "Vectorville"! I'll stop now.

Anyway, if you made it this far, I hope you found this useful! If you are also learning about machine learning, I'd love to connect, so feel free to reach out!

Give Back

By the way, if you got to this point, then by proxy you have gotten value from the hard work Nielsen put into writing the book. Pay the man! He has a donation link in the sidebar. He suggests $5 because he's humble (I put the value of the book closer to $50+, i.e. the cost of a textbook), and takes payment via PayPal (because he's practical) or via BTC (because he's enlightened). Give what you can!