Should We Still Use Softmax As The Final Layer?
In the TensorFlow beginner tutorial:
Status Quo Usage
Let’s say we want to create a neural network to classify the MNIST dataset. Using TensorFlow Keras, we would quickly sketch the following:
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Softmax()
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
This code will run without any problem 99.9% of the time. However, there is still a 0.1% chance of hitting a pitfall, and it is related to how we backpropagate the gradients.
Explanation
Let’s use DeepMind’s Simon Osindero’s slide to explain: the grey block on the left is just a cross-entropy operation. The input x (a vector) could be the softmax output from the previous layer (not the input of the neural network), and y (a scalar) is the cross entropy of x against the one-hot target t, i.e. y = −Σ_i t_i·log(x_i). To propagate the gradient back, we need to calculate ∂y/∂x_i, which is −t_i / x_i for each element of x. As we know, the softmax function squashes the logits into the range [0, 1], so if in one training step the neural network becomes super confident and predicts one of the probabilities to be exactly 0, we get a division by zero when calculating ∂y/∂x_i.
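To make the pitfall concrete, here is a minimal sketch (my own toy example, not from the tutorial or the slide) of what happens when the target-class probability collapses to exactly zero:

import tensorflow as tf

# Toy illustration: t is a one-hot target, x is a hypothetical softmax output
# in which the target class has been rounded down to exactly 0.
t = tf.constant([0.0, 1.0, 0.0])
x = tf.constant([0.3, 0.0, 0.7])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = -tf.reduce_sum(t * tf.math.log(x))   # cross entropy on probabilities

print(y)                    # inf -- log(0) already breaks the loss value
print(tape.gradient(y, x))  # non-finite entry, since dy/dx_i = -t_i / x_i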
In the other case, where we take the logits and calculate the softmax and the cross entropy in one shot (the XentLogits function), we don’t have this problem, because the derivative of XentLogits with respect to the logits is simply p − t, the predicted probabilities minus the one-hot targets (I think there is a typo in the slide: y is the cost, which is a scalar, so it cannot have the vector p subtracted from it). A more elaborate derivation can be found here.
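We can also check this numerically with a short sketch (again my own example, not part of the article’s code): tf.nn.softmax_cross_entropy_with_logits computes the loss directly from the logits, and its gradient matches p − t and stays finite even when the target class gets an extremely small probability:

import tensorflow as tf

# Arbitrary, very "confident" logits: the target class (index 1) ends up with
# a probability of roughly 4e-8 after softmax, yet nothing blows up.
logits = tf.constant([8.0, -9.0, 1.0])
t = tf.constant([0.0, 1.0, 0.0])

with tf.GradientTape() as tape:
    tape.watch(logits)
    y = tf.nn.softmax_cross_entropy_with_logits(labels=t, logits=logits)

print(tape.gradient(y, logits))    # equals softmax(logits) - t, all finite
print(tf.nn.softmax(logits) - t)   # same values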
In Practice
We still need the softmax function in the end in order to calculate the cross-entropy loss, but not as the final layer of the neural network; instead, we embed it into the loss function. Still using the previous example, in TensorFlow you can do:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
We remove the Softmax layer from the model, but pass from_logits=True to the SparseCategoricalCrossentropy loss function, and the softmax will be applied automagically before the cross entropy is computed.
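One practical consequence, mirroring what the TensorFlow beginner tutorial itself does: the model now outputs raw logits, so if we want actual class probabilities at inference time, we attach the softmax outside of training. A minimal sketch, where x_test is a hypothetical batch of test images:

# `model` is the logits-only model compiled above; wrapping it with a Softmax
# layer gives probabilities without affecting the training-time loss.
probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
print(probability_model(x_test[:5]))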