Lecture 05

  • Gradient descent ($w^{(t+1)} = w^{(t)} - \eta_t\nabla_{w^{(t)}}L$), key optimization concerns (a minimal sketch of the update rule follows this item):
    • It assumes all data (X, y) are available, but in practice the data are often incomplete
    • How should $w$ be initialized?
    • How should the learning rate be set?
      • Different learning rates can be used for different data (X, y)
      • Different learning rates can be used for different components of $w$
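A minimal sketch of the update rule above, assuming a generic gradient function `grad_fn` and a step-dependent learning rate `eta_fn`; these names, the toy loss, and the decay schedule are illustrative assumptions, not from the lecture:

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta_fn, num_steps=100):
    """Generic update loop: w^{(t+1)} = w^{(t)} - eta_t * grad L(w^{(t)})."""
    w = np.array(w0, dtype=float)          # the choice of initialization is one of the open questions above
    for t in range(num_steps):
        w = w - eta_fn(t) * grad_fn(w)     # eta_fn allows a step-dependent learning rate eta_t
    return w

# Toy usage: minimize L(w) = ||w||^2 / 2 (so grad L(w) = w) with a decaying learning rate.
w_star = gradient_descent(grad_fn=lambda w: w,
                          w0=[3.0, -2.0],
                          eta_fn=lambda t: 1.0 / (1.0 + t))
```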
  • Batch gradient descent: $\nabla_{w^{(t)}}L = \frac1m\sum\limits_{i=1}^{m}\nabla_{w^{(t)}}L(w; x_i, y_i)$ (sketch below)
    • Optimization: Hessian-based acceleration can be used (not practical; too expensive computationally)
    • Disadvantage: cannot handle huge training sets (too slow)
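A sketch of full-batch gradient descent under an assumed squared-error loss $L(w; x_i, y_i) = \frac12(x_i^\top w - y_i)^2$; the function names and hyperparameters are illustrative only:

```python
import numpy as np

def batch_gradient(w, X, y):
    """Full-batch gradient (1/m) * sum_i grad L(w; x_i, y_i) for the
    squared-error loss (the loss choice is an assumption for illustration)."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m

def batch_gradient_descent(X, y, eta=0.1, num_steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        w = w - eta * batch_gradient(w, X, y)   # every training example is used at every step
    return w
```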
  • Stochastic gradient descent (sketch below)
    • First, shuffle the training set; then, for every $i$, use $\nabla_{w^{(t)}}L = \nabla_{w^{(t)}}L(w; x_i, y_i)$ to update $w$.
    • Advantages: (1) fast (2) the randomness helps prevent overfitting (3) can adapt as the data change
    • Disadvantages: (1) the gradient is an approximation of an approximation (a noisy estimate) (2) difficult to parallelize
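A sketch of the SGD loop described above, again assuming a squared-error loss; re-shuffling every epoch (rather than only once) is a common variant, and all names and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def sgd(X, y, eta=0.01, num_epochs=10, seed=0):
    """Stochastic gradient descent: shuffle the training set each epoch,
    then update w from the gradient of one example at a time."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        for i in rng.permutation(m):           # shuffle the training set
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient of L(w; x_i, y_i)
            w = w - eta * grad_i               # one update per example
    return w
```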
  • Mini-batch gradient descent: use a small sample of the training set at a time (sketch below)
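A sketch of one possible mini-batch variant under the same assumed squared-error loss; `batch_size`, `eta`, and the epoch structure are illustrative choices, not from the lecture:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, eta=0.01, num_epochs=10, seed=0):
    """Mini-batch gradient descent: average the gradient over a small
    random sample of examples for each update."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient averaged over the mini-batch
            w = w - eta * grad
    return w
```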
  • Optimization challenges: (1) ill-conditioning (2) local minima (3) saddle points
  • SGD with momentum, AdaGrad, etc. (per-step update sketches below)
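Sketches of one common formulation of the per-step updates for SGD with momentum and AdaGrad; the hyperparameter values `eta`, `beta`, and `eps` are assumed defaults, not taken from the lecture:

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.01, beta=0.9):
    """One SGD-with-momentum update: v keeps an exponentially decaying
    accumulation of past gradients, and w moves along v."""
    v = beta * v - eta * grad
    return w + v, v

def adagrad_step(w, g2, grad, eta=0.01, eps=1e-8):
    """One AdaGrad update: g2 accumulates squared gradients, giving each
    coordinate of w its own effective learning rate."""
    g2 = g2 + grad ** 2
    return w - eta * grad / (np.sqrt(g2) + eps), g2
```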