Neural networks currently perform as well — or nearly as well — as humans in a variety of tasks and domains.
On the surface, modern neural networks may seem sophisticated since they can have billions of parameters.
However, deep inside, they are well organized as a sequence of manageable functions, or 'layers'.
These layers transform the input data into internal representations that themselves are eventually transformed into a solution of the task.
The behavior of these layers is complex and multi-dimensional.
The internal representations are not directly comparable across layers.
How can we make them more readily visible?
Taking image classification as an example, consider the ImageNet dataset, whose natural images are annotated with 1,000 class labels.
Note that the overall color already hints at the class of an image. For example, a very blue background often indicates ocean animals, outdoor scenes, airplanes, or birds, although images of those classes do not always have a blue background.
You are seeing these images in a 2D layout: positions on the screen. These layouts often do not have enough capacity to show the full structure of a high-dimensional space. Even if we were to summarize the entire image as one color, there are three dimensions in color: red, green and blue; if we consider individual pixels, we can get near million-dimensional data. In ImageNet, we have 150,528 dimensions in the input, from images that are 224 x 224 x 3. Clearly, 2D will not be enough, but one million dimensions might be too much (there are "only" 1000 classes in the task, after all).
As it turns out, dimensionality reduction techniques are not restricted to computing only 2D or 3D layouts.
If we give UMAP more room than three dimensions, it has a chance to preserve more of the structure in the data.
In this example, we project the (224 x 224 x 3)-dimensional image data down to 15 dimensions with UMAP.
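For concreteness, here is a minimal sketch of this projection step, assuming the images are already loaded as a NumPy array (the array below is a random placeholder, not real ImageNet data):

```python
# Minimal sketch: project flattened images to 15 dimensions with UMAP.
# `images` stands in for preprocessed ImageNet images; here it is random data.
import numpy as np
import umap

images = np.random.rand(1000, 224, 224, 3).astype(np.float32)  # placeholder
flat = images.reshape(len(images), -1)   # shape (1000, 150528)

reducer = umap.UMAP(n_components=15)     # a 15D layout instead of the usual 2D
embedding = reducer.fit_transform(flat)  # shape (1000, 15)
```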
Admittedly, 15-dimensional space is not exactly intuitive.
To overcome that, we use the Grand Tour, a technique that animates a smooth sequence of 2D projections of high-dimensional data.
The rotation of the Grand Tour is random but thorough: the sequence of projections depends on the initial (random) state, but any animation will eventually come arbitrarily close to every possible projection. As the rotation happens, you can probably already spot interesting clusters, and it would be useful to nudge the projection this way or that. To make the projections more controllable, we provide "handles" that you can grab and move around.
By doing this, we can explore structures in our data in a more controllable way. In this example and the others that follow, we use UMAP to project our data to 15D and use the Grand Tour to display it.
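To make the mechanics concrete, here is a small sketch of how one frame of such an animation can be computed. It uses a simple smoothly varying rotation rather than the exact torus-based Grand Tour schedule, and the 15D embedding is a random placeholder:

```python
# Sketch of a Grand-Tour-style animation frame: rotate the 15D embedding by a
# smoothly varying orthogonal matrix and keep the first two coordinates.
# (Simplified; the actual Grand Tour uses a torus-based sequence of rotations.)
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 15
embedding = rng.standard_normal((1000, d))   # placeholder for the UMAP output

A = rng.standard_normal((d, d))
skew = (A - A.T) / 2                         # skew-symmetric generator of rotations

def frame(t):
    """2D screen coordinates at animation time t."""
    rotation = expm(t * skew)                # orthogonal, since `skew` is skew-symmetric
    return (embedding @ rotation)[:, :2]

# e.g. frames of the animation:
# for t in np.linspace(0.0, 2.0, 120): draw(frame(t))
```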
The internal representations of the images as they pass through a deep neural network's layers are known as neuron activations. Just like the input images, neuron activations are high-dimensional vectors. Projecting the neuron activations of any chosen layer to more than three dimensions preserves interesting structure of that layer's feature space, and reveals how the layer sorts images and patterns within it. Using image examples as probes and the Grand Tour as a lens, we can get a glimpse of the internal feature space of these neural networks.
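As a concrete illustration, the sketch below records the activations of one intermediate layer with a forward hook. It assumes a recent torchvision with a pretrained GoogLeNet (the model we examine below) and uses a placeholder batch in place of preprocessed images; the flattened activations can then be fed to UMAP just like the raw pixels were:

```python
# Sketch: record the activations of GoogLeNet's Inception-4d layer.
# `batch` stands in for a batch of preprocessed 224 x 224 images.
import torch
import torchvision

model = torchvision.models.googlenet(weights="DEFAULT").eval()
batch = torch.randn(8, 3, 224, 224)                  # placeholder input

activations = {}
def hook(module, inputs, output):
    # Flatten each image's activation map into one long vector.
    activations["inception4d"] = output.flatten(start_dim=1).detach()

handle = model.inception4d.register_forward_hook(hook)
with torch.no_grad():
    model(batch)
handle.remove()

print(activations["inception4d"].shape)              # (n_images, n_neurons)
```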
Sometimes, we may want to hide the fine details of individual images and focus on the overall distribution of points.
In such cases, we can represent images as dots and color them by a coarse labeling derived from their true classes.
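As a hypothetical illustration of such a coarse labeling (the actual grouping used here may differ), one can map fine ImageNet class names to a handful of broader groups:

```python
# Hypothetical coarse grouping of ImageNet class names; the article's actual
# grouping may differ. `fine_labels` stands in for the per-image class names.
coarse_groups = {
    "dog": ["beagle", "golden retriever", "pug"],
    "bird": ["robin", "jay", "goldfinch"],
    "vehicle": ["sports car", "school bus", "fire engine"],
}
fine_to_coarse = {fine: coarse
                  for coarse, fines in coarse_groups.items()
                  for fine in fines}

fine_labels = ["beagle", "jay", "school bus", "tabby"]          # placeholder
coarse_labels = [fine_to_coarse.get(name, "other") for name in fine_labels]
print(coarse_labels)  # ['dog', 'bird', 'vehicle', 'other']
```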
As an example of a real-world deep neural network, we use GoogLeNet, a convolutional network trained on ImageNet.
One of its internal layers, Inception-4d, shows some striking patterns: it encodes the orientation of animals,
clusters various animal faces,
and despite there being no 'human' class in the 1000 labels, the network recognizes human faces.
Again, let's take a Grand Tour and look at the intricate structure of the 15-dimensional space, one 2D projection at a time.
As with the input data, one way of changing projections in the Grand Tour is to drag the handles attached to a given set of data points. We can also create custom handles.
We have seen two layers of GoogLeNet: the input layer and the Inception-4d layer. Now, let's tour the other layers.
Starting off from the input layer,
diving down to a max-pooling layer,
then Inception-4d,
Inception-5b,
and finally, the softmax classification output.
Let's pay attention to a technical detail for a bit: notice that the layer-to-layer transitions are consistent with each other, with the overall orientation of the layers staying fixed. Later, we will use the same technique to compare different neural network architectures.
In the previous section, we saw how individual layers work by looking at their 15-dimensional embeddings. However, for two different layers, we should not expect the embedding algorithm to give us directly comparable coordinates: the two layers have completely different neuron activation patterns, and the embedding algorithm (for example, the stochastic gradient descent in UMAP) is not deterministic. Without handling this problem, we can easily lose track of individual data points when switching from the embedding of one layer to another.
We handle the misalignment of embeddings with a combination of two techniques.
The second technique deserves some more detail.
Given a pair of embeddings of different layers, the Orthogonal Procrustes problem finds the optimal orthogonal matrix that rotates one embedding to best align with the other.
Note that the two embeddings do not have to come from the same model.
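As a minimal sketch of this alignment step, SciPy's orthogonal_procrustes solver can be applied directly to two 15-dimensional embeddings of the same images (random placeholders below):

```python
# Sketch: align one 15D embedding to another with the Orthogonal Procrustes
# solution. X and Y stand in for embeddings of the same images from two
# layers (or two models).
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 15))   # embedding to be rotated
Y = rng.standard_normal((1000, 15))   # reference embedding

# R minimizes || X @ R - Y ||_F over all orthogonal matrices R.
R, _ = orthogonal_procrustes(X, Y)
X_aligned = X @ R                     # X expressed in coordinates comparable to Y

print(np.linalg.norm(X - Y), np.linalg.norm(X_aligned - Y))
```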
On the right, which we've seen before, we have our old friend GoogLeNet.
On the left we have ResNet50, another deep convolutional network trained on ImageNet.
See the UMAP Tour of other models: