Lecture 05

  • Gradient descent ($w^{(t+1)} = w^{(t)} - \eta_t\nabla_{w^{(t)}}L$), key optimization concerns (a minimal sketch of the update rule follows this item):
    • It assumes all data (X, y) are available, but in practice the data are often incomplete
    • How should $w$ be initialized?
    • How should the learning rate be set?
      • Different learning rates can be used for different data (X, y)
      • Different learning rates can be used for different components of $w$
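A minimal sketch of the update rule above, assuming a generic gradient function `grad_fn` and a step-dependent learning rate `eta_fn`; these names, the toy loss, and the decay schedule are illustrative assumptions, not from the lecture:

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta_fn, num_steps=100):
    """Generic update loop: w^{(t+1)} = w^{(t)} - eta_t * grad L(w^{(t)})."""
    w = np.array(w0, dtype=float)          # the choice of initialization is one of the open questions above
    for t in range(num_steps):
        w = w - eta_fn(t) * grad_fn(w)     # eta_fn allows a step-dependent learning rate eta_t
    return w

# Toy usage: minimize L(w) = ||w||^2 / 2 (so grad L(w) = w) with a decaying learning rate.
w_star = gradient_descent(grad_fn=lambda w: w,
                          w0=[3.0, -2.0],
                          eta_fn=lambda t: 1.0 / (1.0 + t))
```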
  • Batch gradient descent: $\nabla_{w^{(t)}}L = \frac1m\sum\limits_{i=1}^{m}\nabla_{w^{(t)}}L(w; x_i, y_i)$ (sketch below)
    • Optimization: Hessian-based acceleration can be used (not practical; too expensive computationally)
    • Disadvantage: cannot handle huge training sets (too slow)
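A sketch of full-batch gradient descent under an assumed squared-error loss $L(w; x_i, y_i) = \frac12(x_i^\top w - y_i)^2$; the function names and hyperparameters are illustrative only:

```python
import numpy as np

def batch_gradient(w, X, y):
    """Full-batch gradient (1/m) * sum_i grad L(w; x_i, y_i) for the
    squared-error loss (the loss choice is an assumption for illustration)."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m

def batch_gradient_descent(X, y, eta=0.1, num_steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - eta * batch_gradient(w, X, y)   # every training example is used at every step
    return w
```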
  • Stochastic gradient descent (sketch below)
    • First, shuffle the training set; then, for every $i$, use $\nabla_{w^{(t)}}L = \nabla_{w^{(t)}}L(w; x_i, y_i)$ to update $w$.
    • Advantages: (1) fast (2) the randomness helps prevent overfitting (3) can adapt as the data change
    • Disadvantages: (1) the gradient is an approximation of an approximation (a noisy estimate) (2) difficult to parallelize
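A sketch of the SGD loop described above, again assuming a squared-error loss; re-shuffling every epoch (rather than only once) is a common variant, and all names and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def sgd(X, y, eta=0.01, num_epochs=10, seed=0):
    """Stochastic gradient descent: shuffle the training set each epoch,
    then update w from the gradient of one example at a time."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(m):           # shuffle the training set
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of L(w; x_i, y_i)
            w = w - eta * grad_i               # one update per example
    return w
```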
  • Mini-batch gradient descent: use a small sample of the training set at a time (sketch below)
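A sketch of one possible mini-batch variant under the same assumed squared-error loss; `batch_size`, `eta`, and the epoch structure are illustrative choices, not from the lecture:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, eta=0.01, num_epochs=10, seed=0):
    """Mini-batch gradient descent: average the gradient over a small
    random sample of examples for each update."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient averaged over the mini-batch
            w = w - eta * grad
    return w
```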
  • Optimization challenges: (1) ill-conditioning (2) local minima (3) saddle points
  • SGD with momentum, AdaGrad, etc. (per-step update sketches below)
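Sketches of one common formulation of the per-step updates for SGD with momentum and AdaGrad; the hyperparameter values `eta`, `beta`, and `eps` are assumed defaults, not taken from the lecture:

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update: v keeps an exponentially decaying
    accumulation of past gradients, and w moves along v."""
    v = beta * v - eta * grad
    return w + v, v

def adagrad_step(w, g2, grad, eta=0.01, eps=1e-8):
    """One AdaGrad update: g2 accumulates squared gradients, giving each
    coordinate of w its own effective learning rate."""
    g2 = g2 + grad ** 2
    return w - eta * grad / (np.sqrt(g2) + eps), g2
```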