• Lecture05 recap
  • Adaptive learning rate
    • Delta-bar-delta: if the gradient sign has not changed, the learning rate should increase (and decrease when the sign flips)
    • AdaGrad, RMSprop
    • Adam: adaptive learning rates (AdaGrad/RMSprop-style) + momentum (see the Adam sketch after this list)
    • Which optimizer is optimal: there is no unified answer
    • Second order optimization: ......
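A minimal sketch of the Adam-style update (first moment = momentum, second moment = adaptive per-parameter rate) on a toy quadratic loss; the hyperparameter values and the toy objective are illustrative assumptions, not from the lecture:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus an adaptive per-parameter rate (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (adaptive scaling)
    m_hat = m / (1 - beta1**t)                # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimise f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w = np.zeros(2)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(np.round(w, 2))  # converges close to [3., 3.]
```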
  • Overfitting
  • Regularisation
    • The goal is to reduce the size of the hypothesis space
    • L1 regularisation:
      • $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|$
      • $w_i = w_i - \lambda \eta \frac{w_i}{|w_i|} - \eta \nabla_{w_i} L$
      • feature selection: if the loss gradient is zero, $w_i$ still shrinks by a constant step $\lambda\eta$ each update until it reaches exactly zero (sparsity)
    • L2 regularisation
      • $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|^2$
      • $w_i = (1 - \lambda \eta) w_i - \eta \nabla_{w_i} L$ (weight decay; see the update-rule sketch after this list)
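A minimal sketch (my own illustration, not lecture code) contrasting the two update rules above: the L2 step shrinks each weight by a constant fraction (weight decay), while the L1 step subtracts a constant amount, which can drive weights exactly to zero.

```python
import numpy as np

def sgd_l2_step(w, grad, lr, lam):
    """L2-regularized step: w <- (1 - lam*lr) * w - lr * grad (weight decay)."""
    return (1 - lam * lr) * w - lr * grad

def sgd_l1_step(w, grad, lr, lam):
    """L1-regularized (subgradient) step: w <- w - lam*lr*sign(w) - lr*grad."""
    return w - lam * lr * np.sign(w) - lr * grad

# With a zero loss gradient, L1 shrinks |w| by a constant lam*lr per step
# (ending in a small band around 0; a proximal soft-threshold step would clamp
# it exactly to 0), while L2 only shrinks w by a constant *fraction* per step.
w_l1 = np.array([0.5, -0.2])
w_l2 = np.array([0.5, -0.2])
for _ in range(100):
    w_l1 = sgd_l1_step(w_l1, grad=np.zeros(2), lr=0.1, lam=0.1)
    w_l2 = sgd_l2_step(w_l2, grad=np.zeros(2), lr=0.1, lam=0.1)
print(w_l1)  # ~0 in both coordinates (within +/- 0.01)
print(w_l2)  # ~[0.18, -0.07]: shrunk, but not zero
```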
  • Mean teacher: .......
  • Dataset augmentation: use human and model knowledge to increase the amount of data
    • Mixup: use convex interpolation of input pairs and their labels to augment data (see the Mixup sketch after this list)
    • AugMix: augment first, then mix the augmented versions
    • Adversarial training: FGSM: $x^{\text{adv}} = x + \epsilon \cdot \text{sgn}(\nabla_x L(f(x), y))$ (see the FGSM sketch after this list)
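A minimal Mixup sketch on one batch; the Beta(alpha, alpha) sampling with alpha = 0.2 and the toy batch shapes are illustrative assumptions:

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex-combine random pairs of inputs and their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))          # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy batch: 4 "inputs" with 8 features each, 3 classes.
rng = np.random.default_rng(0)
x = rng.random((4, 8))
y = np.eye(3)[[0, 1, 2, 1]]                 # one-hot labels
x_mix, y_mix = mixup(x, y, rng=rng)
```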
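And a self-contained FGSM sketch that applies the formula above to a tiny logistic-regression "model" so the input gradient can be written analytically; the model, data, and epsilon are illustrative assumptions. In adversarial training, the resulting x_adv would be added to the training batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps=0.1):
    """FGSM for f(x) = sigmoid(w.x + b) with cross-entropy loss:
    x_adv = x + eps * sign(dL/dx)."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w                 # analytic dL/dx for sigmoid + cross-entropy
    return x + eps * np.sign(grad_x)

# Toy example: the perturbation lowers the model's confidence on a correct point.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, -0.5]), 1.0
x_adv = fgsm(x, y, w, b, eps=0.3)
print(sigmoid(x @ w + b), sigmoid(x_adv @ w + b))  # ~0.82 drops to ~0.65
```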
  • Early stopping: stop training when performance on the held-out (validation/test) set stops improving, even though training-set performance is still improving (see the sketch below)
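A minimal early-stopping loop with a patience counter; train_one_epoch, evaluate, and the patience value are hypothetical placeholders, not from the lecture:

```python
import math

def early_stopping_train(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop when the held-out loss hasn't improved for `patience` epochs;
    keep the parameters from the best epoch."""
    best_loss, best_state, bad_epochs = math.inf, None, 0
    for epoch in range(max_epochs):
        state = train_one_epoch()       # returns the current model parameters
        val_loss = evaluate(state)      # loss on the held-out set
        if val_loss < best_loss:
            best_loss, best_state, bad_epochs = val_loss, state, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` epochs
                break
    return best_state, best_loss

# Toy usage with a fake held-out curve that bottoms out at the 11th epoch.
losses = iter([1 / (e + 1) + max(0, e - 10) * 0.01 for e in range(100)])
state, loss = early_stopping_train(lambda: "weights", lambda s: next(losses))
print(round(loss, 3))  # ~0.091, the best held-out loss before overfitting sets in
```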
  • Some common practices
    • Dropout: randomly select some units/parameters to skip updating at each step (experiments show learning is faster); see the dropout sketch after this list
    • Weight initialization: don't set the same value for all parameters; maintain the same variance for inputs and outputs of each layer; pretraining/fine-tuning uses another dataset to initialize parameters (see the initialization sketch after this list)
    • Data pre-processing: subtract the data mean and divide each feature by its standard deviation (the square root of the diagonal of the covariance); see the standardization sketch after this list
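A sketch of (inverted) dropout at the activation level, the usual implementation: dropping a unit's activation for one step means its weights receive no update that step, matching the note above. The keep probability 0.5 is an illustrative choice.

```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, rng=None):
    """Inverted dropout: zero each activation with probability 1 - p_keep and
    rescale by 1/p_keep so the expected activation is unchanged at test time."""
    if not train:
        return h                                  # no dropout at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) < p_keep           # which units survive this step
    return h * mask / p_keep

h = np.ones((2, 4))                               # toy activations
print(dropout(h, p_keep=0.5, rng=np.random.default_rng(0)))
```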
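A sketch of variance-preserving initialization ("maintain the same variance for input and output"); the Xavier/Glorot-style scaling used here is one standard way to achieve this and is my assumption, not stated in the lecture:

```python
import numpy as np

def variance_preserving_init(fan_in, fan_out, rng=None):
    """Zero-mean Gaussian with variance 2 / (fan_in + fan_out), so that
    Var(W x) stays on the same order as Var(x); never a constant value."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

x = np.random.default_rng(0).normal(size=(1000, 256))
W = variance_preserving_init(256, 256)
print(x.var(), (x @ W).var())   # the two variances are of the same order
```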
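A sketch of the standardization step (subtract the mean, divide each feature by its standard deviation, i.e. the square root of the diagonal of the covariance); computing the statistics on the training split only is standard practice, assumed here:

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance features; statistics come from the training set."""
    mean = X_train.mean(axis=0)
    std = np.sqrt(np.diag(np.cov(X_train, rowvar=False))) + eps  # per-feature std
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(20, 4))
Xtr, Xte = standardize(X_train, X_test)
print(Xtr.mean(axis=0).round(6), Xtr.std(axis=0).round(3))  # ~0 and ~1
```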