Quick review — Model Compression
The need for small and efficient models has always existed, for reasons ranging from limited compute resources to the ability to adapt quickly to experimental changes. Inference time matters even more, since that is what the end user actually experiences. So the need for smaller and faster models is crucial.
A few methods were discussed: quantization, pruning, knowledge distillation, and low-rank factorization. As a brief overview of how they work:
1- pruning and quantization:
reduce redundant parameters that the model is not sensitive to and that do not affect performance
2- distillation is based on training a compact neural network with knowledge distilled from a larger model in a teacher-student scheme.
Quantization is based on reducing the number of bits used to represent each weight; simply put, you convert the weights from floating point to integers. This can be done after training (post-training quantization) or during training, where the training becomes quantization-aware.
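As a concrete illustration, here is a minimal sketch of post-training affine quantization in NumPy. The function names (`quantize_int8`, `dequantize`) and the choice of a single scale and zero-point per tensor are assumptions made for this example, not the API of any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Post-training affine quantization: map float weights onto int8 [-128, 127]."""
    w_min, w_max = weights.min(), weights.max()
    scale = max((w_max - w_min) / 255.0, 1e-12)  # guard against a constant tensor
    zero_point = np.round(-128 - w_min / scale)  # int8 value that represents 0.0
    q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale, zp)).max())
```

Each weight now takes 1 byte instead of 4, at the cost of a small rounding error bounded by about half the scale.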
Pruning means removing unnecessary parts of the model, which can also improve generalization and help with overfitting. You can prune weights by setting them to 0; the smaller a weight is, the less effect it has, so it makes sense to choose the smallest weights for pruning.
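The sketch below shows this idea as simple magnitude pruning in NumPy: weights whose absolute value falls below a quantile threshold are zeroed out. The function name `magnitude_prune` and the 50% sparsity target are illustrative assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

w = np.random.randn(4, 4).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)
print("zeroed weights:", np.sum(pruned == 0), "of", pruned.size)
```

In practice the pruned model is usually fine-tuned for a few more steps so the remaining weights can compensate for the removed ones.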
Knowledge distillation aims to train a student model to follow the probability distribution, and thus the behavior, of the teacher model. This can be applied during pre-training, fine-tuning, or both. DistilBERT and MobileBERT are two popular examples of model distillation.
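A common way to implement this is to mix a KL-divergence loss between the temperature-softened teacher and student distributions with the usual hard-label loss. The sketch below is a minimal PyTorch version; the function name `distillation_loss` and the values of the temperature `T` and mixing weight `alpha` are assumptions for illustration, not the exact recipe used by DistilBERT or MobileBERT.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL loss (teacher -> student) with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: random logits for a batch of 8 examples and 10 classes.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```

Only the student receives gradients; the teacher's logits act as fixed soft targets that carry more information than the one-hot labels alone.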