3 deep learning mysteries: Ensemble, knowledge- and self-distillation

“Under now-standard techniques, such as over-parameterization, batch-normalization, and adding residual links, “modern age” neural network training—at least for image classification tasks and many others—is usually quite stable. Using standard neural network architectures and training algorithms (typically SGD with momentum), the learned models perform consistently well, not only in terms of training accuracy but even in test accuracy, regardless of which random initialization or random data order is used during the training. For instance, if one trains the same WideResNet-28-10 architecture on the CIFAR-100 dataset 10 times with different random seeds, the mean test accuracy is 81.51% while the standard deviation is only 0.16%…”
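The stability claim above boils down to repeating the same training run under different random seeds and reporting the mean and standard deviation of test accuracy. A minimal sketch of that final aggregation step, using hypothetical accuracy values (not the actual numbers behind the blog's 81.51% ± 0.16% result):

```python
import statistics

# Hypothetical test accuracies (%) from 10 runs of the same architecture
# trained with different random seeds; illustrative values only.
accuracies = [81.3, 81.6, 81.5, 81.4, 81.7, 81.5, 81.6, 81.3, 81.6, 81.5]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.pstdev(accuracies)  # population std over these 10 runs

print(f"mean test accuracy: {mean_acc:.2f}%, std: {std_acc:.2f}%")
```

A small spread around the mean, as in the quoted WideResNet-28-10 experiment, is what the authors mean by training being "quite stable" across seeds and data orderings.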

Source: www.microsoft.com/en-us/research/blog/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation/
