2022-03-28
- Lecture05 recap
- Adaptive learning rate
- Delta-bar-delta: if the gradient sign has not changed, increase the learning rate; decrease it when the sign flips
- AdaGrad, RMSprop
- Adam: AdaGrad/RMSprop-style adaptive rates + momentum (see the sketch after this list)
- What is the optimal optimizer: no unified answer
- Second order optimization: ......
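- A minimal NumPy sketch of the Adam update mentioned above (not from the lecture; the default hyperparameters and the toy quadratic are assumptions):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (first moment m) + RMSprop-style scaling (second moment v)."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# toy usage: minimize f(w) = ||w||^2 (gradient 2w); w moves toward the minimum at 0
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
```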
- Overfitting
- Regularisation
- The goal is to reduce the size of the hypothesis space
- L1 regularisation:
- $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|$
- $w_i = w_i - \lambda \eta \, \text{sgn}(w_i) - \eta \nabla_{w_i} L$
- feature selection: where the data-loss gradient is (near) zero, the penalty term keeps shrinking $w_i$ until it reaches exactly zero, so L1 produces sparse weights
- L2 regularisation
- $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|^2$
- $w_i = (1 - 2\lambda \eta) w_i - \eta \nabla_{w_i} L$ (weight decay: each step first shrinks $w_i$ multiplicatively)
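- A minimal sketch of the L1/L2 update rules above (NumPy; the toy weights and step sizes are assumptions):

```python
import numpy as np

def l1_update(w, grad_L, lr, lam):
    # subgradient step: the lambda * sgn(w) term pushes every weight toward exactly zero
    return w - lr * grad_L - lr * lam * np.sign(w)

def l2_update(w, grad_L, lr, lam):
    # weight decay: shrink w multiplicatively, then take the usual gradient step
    return (1 - 2 * lr * lam) * w - lr * grad_L

# toy usage: with zero data gradient, L1 shrinks every weight by a constant amount per
# step, so small weights hit zero first (sparsity / feature selection)
w = np.array([0.5, -0.01, 2.0])
for _ in range(100):
    w = l1_update(w, grad_L=np.zeros_like(w), lr=0.1, lam=0.01)
```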
- Mean teacher: .......
- Dataset augmentation: use human or model knowledge to increase the amount of data
- Mixup: use interpolation to augment data (interpolate inputs and labels; see the sketch below)
- AugMix: augment first, then mix
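- A minimal Mixup sketch as referenced above (the alpha value, array shapes, and one-hot labels are assumptions):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Interpolate both the inputs and the (one-hot) labels with lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# toy usage with two fake 32x32 RGB "images" and one-hot labels
x1, x2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(x1, y1, x2, y2)
```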
- Adversarial training: FGSM: $x^{adv} = x + \epsilon \cdot \text{sgn}(\nabla_x L(f(x), y))$
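- A minimal FGSM sketch of the formula above (the logistic-regression "model" and all values are assumptions, chosen so the gradient w.r.t. $x$ has a closed form):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps=0.1):
    """x_adv = x + eps * sgn(grad_x L); for logistic regression with cross-entropy loss,
    the gradient of the loss w.r.t. x is (p - y) * w."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)

# toy usage
w, b = np.array([1.0, -2.0, 0.5]), 0.1
x, y = np.array([0.2, 0.4, -0.1]), 1.0
x_adv = fgsm(x, y, w, b)
```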
- Early stopping: stop training when performance on a held-out validation set stops improving, even though the training loss keeps decreasing
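- A minimal early-stopping loop (train_step, validate, and the patience value are assumptions, not lecture code):

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=5):
    """Stop when validation loss has not improved for `patience` epochs,
    even if the training loss is still going down."""
    best_val, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_step()                                 # one epoch of training
        val_loss = validate()                        # loss on a held-out validation set
        if val_loss < best_val:
            best_val, best_epoch = val_loss, epoch   # also the point to checkpoint weights
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_val
```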
- Some common practices
- Dropout: randomly drop (skip updating) some units/parameters at each training step; experiments show learning is faster (sketches of these practices follow this list)
- Weight initialization: don't set the same value for all params; keep roughly the same variance for a layer's input and output; pretraining/fine-tuning uses another dataset to initialize params
- Data pre-processing: subtract the data mean and divide by the per-feature standard deviation (the square root of the diagonal of the covariance)
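- Minimal sketches of the three practices above (the function names and the Xavier-style scaling are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, training=True):
    """Inverted dropout: randomly zero activations during training and rescale,
    so nothing changes at test time."""
    if not training:
        return h
    mask = (rng.random(h.shape) > p).astype(h.dtype)
    return h * mask / (1 - p)

def init_weights(fan_in, fan_out):
    """Small random values (never all equal), scaled to keep the variance of a
    layer's output close to that of its input (Xavier/Glorot-style)."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

def standardize(X, eps=1e-8):
    """Subtract the per-feature mean and divide by the per-feature standard deviation
    (the square root of the diagonal of the covariance)."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

# toy usage
X = standardize(rng.normal(size=(16, 8)))
W = init_weights(8, 4)
h = dropout(X @ W, p=0.5, training=True)
```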