• Lecture05 recap
  • Adaptive learning rate
    • Delta-bar-delta: if the gradient sign has not changed, the learning rate should increase (and decrease when the sign flips)
    • AdaGrad, RMSprop
    • Adam: adaptive learning rates (AdaGrad/RMSprop-style) + momentum (see the Adam sketch after this list)
    • Which optimizer is optimal: there is no unified answer
    • Second order optimization: ......
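A minimal sketch of the Adam-style update (first moment = momentum, second moment = adaptive per-parameter rate) on a toy quadratic loss; the hyperparameter values and the toy objective are illustrative assumptions, not from the lecture:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus an adaptive per-parameter rate (v)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (adaptive scaling)
    m_hat = m / (1 - beta1**t)                # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: minimise f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w = np.zeros(2)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(np.round(w, 2))  # converges close to [3., 3.]
```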
  • Overfitting
  • Regularisation
    • The goal is to reduce the size of the hypothesis space
    • L1 regularisation:
      • $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|$
      • $w_i = w_i - \lambda \eta \frac{w_i}{|w_i|} - \eta \nabla_{w_i} L$
      • feature selection: if the loss gradient is zero, $w_i$ still shrinks by a constant step $\lambda\eta$ each update until it reaches exactly zero (sparsity)
    • L2 regularisation
      • $w^* = \text{argmin} \sum L(w_1, ..., w_N; x, y) + \lambda \sum |w_i|^2$
      • $w_i = (1 - \lambda \eta) w_i - \eta \nabla_{w_i} L$ (weight decay; see the update-rule sketch after this list)
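A minimal sketch (my own illustration, not lecture code) contrasting the two update rules above: the L2 step shrinks each weight by a constant fraction (weight decay), while the L1 step subtracts a constant amount, which can drive weights exactly to zero.

```python
import numpy as np

def sgd_l2_step(w, grad, lr, lam):
    """L2-regularized step: w <- (1 - lam*lr) * w - lr * grad (weight decay)."""
    return (1 - lam * lr) * w - lr * grad

def sgd_l1_step(w, grad, lr, lam):
    """L1-regularized (subgradient) step: w <- w - lam*lr*sign(w) - lr*grad."""
    return w - lam * lr * np.sign(w) - lr * grad

# With a zero loss gradient, L1 shrinks |w| by a constant lam*lr per step
# (ending in a small band around 0; a proximal soft-threshold step would clamp
# it exactly to 0), while L2 only shrinks w by a constant *fraction* per step.
w_l1 = np.array([0.5, -0.2])
w_l2 = np.array([0.5, -0.2])
for _ in range(100):
    w_l1 = sgd_l1_step(w_l1, grad=np.zeros(2), lr=0.1, lam=0.1)
    w_l2 = sgd_l2_step(w_l2, grad=np.zeros(2), lr=0.1, lam=0.1)
print(w_l1)  # ~0 in both coordinates (within +/- 0.01)
print(w_l2)  # ~[0.18, -0.07]: shrunk, but not zero
```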
  • Mean teacher: .......
  • Dataset augmentation: use human and model knowledge to increase the amount of data
    • Mixup: use convex interpolation of input pairs and their labels to augment data (see the Mixup sketch after this list)
    • AugMix: augment first, then mix the augmented versions
    • Adversarial training: FGSM: $x^{\text{adv}} = x + \epsilon \cdot \text{sgn}(\nabla_x L(f(x), y))$ (see the FGSM sketch after this list)
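A minimal Mixup sketch on one batch; the Beta(alpha, alpha) sampling with alpha = 0.2 and the toy batch shapes are illustrative assumptions:

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    """Mixup: convex-combine random pairs of inputs and their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))          # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Toy batch: 4 "inputs" with 8 features each, 3 classes.
rng = np.random.default_rng(0)
x = rng.random((4, 8))
y = np.eye(3)[[0, 1, 2, 1]]                 # one-hot labels
x_mix, y_mix = mixup(x, y, rng=rng)
```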
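And a self-contained FGSM sketch that applies the formula above to a tiny logistic-regression "model" so the input gradient can be written analytically; the model, data, and epsilon are illustrative assumptions. In adversarial training, the resulting x_adv would be added to the training batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps=0.1):
    """FGSM for f(x) = sigmoid(w.x + b) with cross-entropy loss:
    x_adv = x + eps * sign(dL/dx)."""
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w                 # analytic dL/dx for sigmoid + cross-entropy
    return x + eps * np.sign(grad_x)

# Toy example: the perturbation lowers the model's confidence on a correct point.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, -0.5]), 1.0
x_adv = fgsm(x, y, w, b, eps=0.3)
print(sigmoid(x @ w + b), sigmoid(x_adv @ w + b))  # ~0.82 drops to ~0.65
```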
  • Early stopping: stop training when performance on the held-out (validation/test) set stops improving, even though training-set performance is still improving (see the sketch below)
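A minimal early-stopping loop with a patience counter; train_one_epoch, evaluate, and the patience value are hypothetical placeholders, not from the lecture:

```python
import math

def early_stopping_train(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop when the held-out loss hasn't improved for `patience` epochs;
    keep the parameters from the best epoch."""
    best_loss, best_state, bad_epochs = math.inf, None, 0
    for epoch in range(max_epochs):
        state = train_one_epoch()       # returns the current model parameters
        val_loss = evaluate(state)      # loss on the held-out set
        if val_loss < best_loss:
            best_loss, best_state, bad_epochs = val_loss, state, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` epochs
                break
    return best_state, best_loss

# Toy usage with a fake held-out curve that bottoms out at the 11th epoch.
losses = iter([1 / (e + 1) + max(0, e - 10) * 0.01 for e in range(100)])
state, loss = early_stopping_train(lambda: "weights", lambda s: next(losses))
print(round(loss, 3))  # ~0.091, the best held-out loss before overfitting sets in
```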
  • Some common practices
    • Dropout: randomly select some units/parameters to skip updating at each step (experiments show learning is faster); see the dropout sketch after this list
    • Weight initialization: don't set the same value for all parameters; maintain the same variance for inputs and outputs of each layer; pretraining/fine-tuning uses another dataset to initialize parameters (see the initialization sketch after this list)
    • Data pre-processing: subtract the data mean and divide each feature by its standard deviation (the square root of the diagonal of the covariance); see the standardization sketch after this list
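A sketch of (inverted) dropout at the activation level, the usual implementation: dropping a unit's activation for one step means its weights receive no update that step, matching the note above. The keep probability 0.5 is an illustrative choice.

```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, rng=None):
    """Inverted dropout: zero each activation with probability 1 - p_keep and
    rescale by 1/p_keep so the expected activation is unchanged at test time."""
    if not train:
        return h                                  # no dropout at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) < p_keep           # which units survive this step
    return h * mask / p_keep

h = np.ones((2, 4))                               # toy activations
print(dropout(h, p_keep=0.5, rng=np.random.default_rng(0)))
```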
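A sketch of variance-preserving initialization ("maintain the same variance for input and output"); the Xavier/Glorot-style scaling used here is one standard way to achieve this and is my assumption, not stated in the lecture:

```python
import numpy as np

def variance_preserving_init(fan_in, fan_out, rng=None):
    """Zero-mean Gaussian with variance 2 / (fan_in + fan_out), so that
    Var(W x) stays on the same order as Var(x); never a constant value."""
    rng = np.random.default_rng() if rng is None else rng
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

x = np.random.default_rng(0).normal(size=(1000, 256))
W = variance_preserving_init(256, 256)
print(x.var(), (x @ W).var())   # the two variances are of the same order
```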
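A sketch of the standardization step (subtract the mean, divide each feature by its standard deviation, i.e. the square root of the diagonal of the covariance); computing the statistics on the training split only is standard practice, assumed here:

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance features; statistics come from the training set."""
    mean = X_train.mean(axis=0)
    std = np.sqrt(np.diag(np.cov(X_train, rowvar=False))) + eps  # per-feature std
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))
X_test = np.random.default_rng(1).normal(loc=5.0, scale=3.0, size=(20, 4))
Xtr, Xte = standardize(X_train, X_test)
print(Xtr.mean(axis=0).round(6), Xtr.std(axis=0).round(3))  # ~0 and ~1
```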