Multiple Generation Based Knowledge Distillation: A Roadmap
Shangmin Guo, University of Edinburgh, s.guo@ed.ac.uk
Knowledge distillation (KD) was originally proposed to compress large models. Although modern deep neural networks achieve very good generalisation performance on various "artificial intelligence" tasks, they are so large that inference becomes quite slow. Researchers therefore proposed to distil knowledge from a large model (the "teacher") trained on a specific task into a smaller model (the "student"). Since the student is much smaller than the teacher, the inference cost is drastically reduced, while the student can still obtain performance very close to the teacher's. This sounds quite good, and it serves the original purpose well. So, why should we care about multi-generation based KD?
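As a concrete illustration, here is a minimal sketch of the classic distillation objective (a temperature-softened KL term plus the usual cross-entropy on hard labels), assuming PyTorch; the function name and hyper-parameter defaults are purely illustrative, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic KD objective: soft targets from the teacher + hard labels.

    T     -- temperature used to soften both distributions
    alpha -- weight on the distillation term vs. the usual cross-entropy
    """
    # KL divergence between temperature-softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude
    # Standard cross-entropy with the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```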
Well, it has been shown by several works that distilling knowledge can not only compress models, but also improve the generalisation performance of students via generation-based methods. In other words, we can train students to outperform their teachers. This makes quite a lot of sense, right? At least in human society, we accumulate knowledge over generations and deliver it (more efficiently) to the next generation, so every generation can know more than their parents.
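To make the generation-based setting concrete, below is a rough sketch of a "born-again"-style training loop, reusing the `kd_loss` sketch above; `make_model`, `train_loader` and the optimiser settings are placeholders for your own setup and not from any particular paper:

```python
import torch
import torch.nn.functional as F

def born_again_training(make_model, train_loader, num_generations=3, epochs=10):
    """Each generation's student is trained against the previous generation's
    model, then becomes the teacher for the next generation."""
    teacher = None
    for gen in range(num_generations):
        student = make_model()
        optimiser = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
        for _ in range(epochs):
            for inputs, labels in train_loader:
                logits = student(inputs)
                if teacher is None:
                    # Generation 0: plain supervised training
                    loss = F.cross_entropy(logits, labels)
                else:
                    with torch.no_grad():
                        teacher_logits = teacher(inputs)
                    loss = kd_loss(logits, teacher_logits, labels)  # sketch above
                optimiser.zero_grad()
                loss.backward()
                optimiser.step()
        teacher = student.eval()  # current student teaches the next generation
    return teacher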
However, we may ask the following questions:
There is no answer yet. But please do not be disappointed; we may be able to solve the above questions in the future. It is quite exciting to discover the answers to these problems.
So, let’s see what progress we have made so far.
Citation:
@article{shangmin2021,
title = "Multiple Generation Based Knowledge Distillation for Neural Networks: A Roadmap",
author = "Shangmin Guo",
year = "2021",
url = "https://github.com/Shawn-Guo-CN/Multiple-Generation-Based-Knowledge-Distillation"
}
To better understand the papers listed in the following sections, we may need some basic understanding of machine learning and probabilistic models:
In this section, we list the works that proposed KD or tried to understand it. These are the core works that study KD for neural networks.
It has been shown that part of the KD gradient has an effect equivalent to label smoothing. So, understanding label smoothing could help us understand more about multi-generation based KD.
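As a rough illustration of that connection (a sketch, not the derivation from any specific paper), compare how the two construct their training targets: label smoothing mixes the one-hot label with a uniform distribution, while KD mixes it with the teacher's predictive distribution, which additionally carries class-similarity information. The helper names below are hypothetical:

```python
import torch
import torch.nn.functional as F

def smoothed_targets(labels, num_classes, eps=0.1):
    # Label smoothing: interpolate the one-hot label with a uniform distribution
    one_hot = F.one_hot(labels, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    return (1.0 - eps) * one_hot + eps * uniform

def kd_soft_targets(labels, teacher_logits, eps=0.1, T=2.0):
    # KD: interpolate the one-hot label with the teacher's (softened) prediction
    one_hot = F.one_hot(labels, teacher_logits.size(-1)).float()
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    return (1.0 - eps) * one_hot + eps * teacher_probs
```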
Pseudo-labelling is a kind of semi-supervised method. It generates "pseudo" labels for massive amounts of unlabelled data, thus producing a dataset much larger than the original one. In this procedure, the pseudo labels are also generated by a teacher trained on the original supervised dataset.
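Below is a minimal, hypothetical sketch of that labelling step, assuming an already-trained `teacher` and an `unlabelled_loader` that yields batches of inputs; the confidence threshold is an arbitrary illustrative choice:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabelled_loader, threshold=0.9):
    """Use a trained teacher to label unlabelled inputs, keeping only
    predictions it is confident about."""
    teacher.eval()
    pseudo_dataset = []
    for inputs in unlabelled_loader:
        probs = F.softmax(teacher(inputs), dim=-1)
        conf, preds = probs.max(dim=-1)
        keep = conf > threshold  # keep only confident predictions
        for x, y in zip(inputs[keep], preds[keep]):
            pseudo_dataset.append((x, y))
    return pseudo_dataset  # merged with the labelled set to train the student
```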
Since calculating the probability of all possible sequences is intractable, we cannot directly apply the above methods to sequence-level KD. In this section, we first focus on applications in NLP. We could also treat the trajectories in reinforcement learning (RL) as sequences, but we will discuss them later since RL has a different factorisation assumption.
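One way sequence-level KD sidesteps the intractable sum is to approximate the teacher's output distribution by its (beam-search) mode and train the student on the decoded sequences instead of the gold references. The sketch below assumes a Hugging-Face-style `generate` API and is purely illustrative:

```python
import torch

@torch.no_grad()
def distil_sequences(teacher, source_batches, max_len=128, num_beams=5):
    """Decode the teacher's outputs once, then reuse them as training targets
    for the student (trained with ordinary token-level cross-entropy)."""
    teacher.eval()
    distilled = []
    for src in source_batches:
        # Beam search approximates the teacher's most likely output sequence,
        # since summing over all possible sequences is intractable.
        hyp = teacher.generate(src, max_length=max_len, num_beams=num_beams)
        distilled.append((src, hyp))
    return distilled  # (source, teacher-decoded target) pairs for the student
```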
Since the key elements of RL are the distribution of trajectories and their corresponding returns, standard KD methods cannot be directly applied to RL either. Besides, the fact that RL relies on the Markov property also introduces new challenges for KD in RL.
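One established workaround in this setting is policy distillation, where the student matches the teacher's action distribution on sampled states rather than distilling over whole trajectories. Here is a minimal sketch for discrete action spaces; the function names and the choice of which policy generates the states are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_policy, teacher_policy, states, T=1.0):
    """KL between the teacher's and student's action distributions on a batch
    of states (collected by the teacher or the student, depending on the scheme)."""
    with torch.no_grad():
        teacher_logits = teacher_policy(states)
    student_logits = student_policy(states)
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```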