Context-sensitive visualization of deep learning NLP models
“The latest deep learning models are extremely complex, and can contain hundreds of millions of different trainable
parameters. This high level of complexity makes it very difficult for humans to understand their inner workings.
Visualization methods give us a window into these black box models, and can be beneficial for a number of reasons.
For example, visualizations may provide a mechanism for debugging complex models by showing which parts of the
input data the model is focusing on. This can be especially useful in classification models, because it may show that a
model is focusing on parts of the data that are irrelevant to the predicted class.
Saliency maps are a popular tool for gaining insight into deep learning. For image models, saliency maps are typically depicted
as heatmaps of neural layers, where “hotness” corresponds to regions that have a big impact on the model’s final
decision. Grad-CAM is a state-of-the-art technique for saliency map visualization that uses the gradient information
obtained by backpropagating the error signal from the loss function with respect to a specific feature map, at any
layer of the network.
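
To make the Grad-CAM description concrete, here is a minimal NumPy sketch of the heatmap computation as it is commonly formulated: average the gradients over spatial positions to get one weight per channel, take the weighted sum of the feature maps, and keep only the positive part. It assumes the feature maps and their gradients have already been extracted from a network by some framework's backward pass; the function name `grad_cam` and the toy random inputs are illustrative, not taken from any library.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM-style heatmap from one layer.

    feature_maps, gradients: arrays of shape (channels, height, width).
    `gradients` is assumed to hold the gradient of the model's output
    signal with respect to `feature_maps`, computed elsewhere.
    """
    # One weight per channel: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps across channels -> (height, width).
    cam = np.tensordot(weights, feature_maps, axes=1)
    # Keep only regions with a positive influence (ReLU).
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] for display; leave an all-zero map untouched.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy stand-ins for a real layer's activations and gradients.
rng = np.random.default_rng(0)
activations = rng.random((4, 3, 3))
grads = rng.standard_normal((4, 3, 3))
heatmap = grad_cam(activations, grads)
```

In practice the resulting map would be upsampled to the input resolution and overlaid on the image; that step is omitted here.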
While the idea of using visualization methods on image processing models makes intuitive sense, in recent years there
has been a growing interest in visualizing NLP models. However, visualization is less straightforward for NLP models
because the input is text. Current solutions to visualizing NLP models involve heatmaps similar to Grad-CAM,
and visualizing connections between tokens in models that use attention mechanisms. These solutions typically look
at the importance of individual tokens to the model output. Before being passed to an NLP model, the text must be
broken up into a series of tokens that are then transformed into numerical vectors. Each token typically represents
a single word or character in the input text. The goal of these NLP visualization methods is to highlight the
parts of the text that have the greatest impact on the model output. While effective, current methods look at
each token independently, and so lack context from the rest of the text. We believe the combination and order in
which the words appear are significant and influence an NLP model’s interpretation of the text…”
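
The tokenization and per-token importance ideas in the passage can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method the quoted authors use: `toy_sentiment` is a hypothetical stand-in for an NLP model, tokenization is a plain whitespace split, and importance is measured by occlusion (remove one token, see how much the output changes) rather than by gradients.

```python
def tokenize(text):
    # Simplest possible scheme: lowercase, then split on whitespace.
    return text.lower().split()

def toy_sentiment(tokens):
    """Hypothetical model: 'good' scores +1, 'bad' scores -1,
    and a directly preceding 'not' flips the sign."""
    score = 0.0
    for i, tok in enumerate(tokens):
        polarity = {"good": 1.0, "bad": -1.0}.get(tok, 0.0)
        if i > 0 and tokens[i - 1] == "not":
            polarity = -polarity
        score += polarity
    return score

def occlusion_importance(tokens, model):
    # Importance of token i = change in output when token i is removed.
    base = model(tokens)
    return [abs(base - model(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))]

tokens = tokenize("The movie was not good")
scores = occlusion_importance(tokens, toy_sentiment)
for tok, s in zip(tokens, scores):
    print(f"{tok:>6}  {s:.1f}")

# Same words, different order: the toy model's output changes.
reordered = toy_sentiment(tokenize("The movie was good not"))
```

On this toy input, "not" receives the highest score (2.0) and "good" the next (1.0), while the remaining tokens score 0.0. The reordered sentence contains the same tokens but scores +1.0 instead of -1.0, which is the kind of order dependence that a per-token view does not show.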