Lecture 05
2022-03-14
- Gradient descent ($w^{(t+1)} = w^{(t)} - \eta_t\nabla_{w^{(t)}}L$), key optimization concerns (a sketch of the update rule follows below):
  - Assumes all data $(X, y)$ is available, but in practice the data is often incomplete
  - How to initialize $w$?
  - How to set the learning rate?
    - Different learning rates can be used for different data points $(x_i, y_i)$
    - Different learning rates can be used for different components of $w$
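A minimal sketch of the update rule on a toy quadratic loss, illustrating the points above: $w$ gets an explicit initialization, and the learning rate can be a decaying scalar $\eta_t$ or a per-coordinate vector. The loss, the schedule, and all constants are illustrative choices, not from the lecture.

```python
import numpy as np

# Illustrative quadratic loss L(w) = 0.5 * w^T A w - b^T w (assumed example, not from the lecture).
A = np.diag([1.0, 10.0])           # ill-scaled curvature, to motivate per-coordinate rates
b = np.array([1.0, 1.0])

w = np.zeros(2)                    # initialization choice: zeros (one of the questions above)
eta = np.array([0.5, 0.05])        # hypothetical per-coordinate learning rates
for t in range(100):
    grad = A @ w - b               # gradient of the quadratic loss at w^{(t)}
    eta_t = eta / (1.0 + 0.01 * t) # decaying schedule eta_t (assumed form)
    w = w - eta_t * grad           # w^{(t+1)} = w^{(t)} - eta_t * grad
print(w)                           # approaches the minimizer A^{-1} b = [1.0, 0.1]
```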
- Batch gradient descent: $\nabla_{w^{(t)}}L = \frac1m\sum\limits_{i=1}^{m}\nabla_{w^{(t)}}L(w; x_i, y_i)$ (see the sketch below)
  - Optimization: Hessian-based acceleration can be used (not practical, too expensive computationally)
  - Disadvantage: cannot handle a huge training set (each step over all $m$ examples is too slow)
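A sketch of batch gradient descent under an assumed least-squares loss: each step averages the per-example gradients over all $m$ training points, which is exactly what makes a single step expensive when $m$ is huge. The data and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = rng.normal(size=(m, 3))                       # toy training set (assumed)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

def example_grad(w, x_i, y_i):
    """Gradient of the single-example loss 0.5 * (w . x_i - y_i)^2."""
    return (x_i @ w - y_i) * x_i

w = np.zeros(3)
eta = 0.1
for t in range(200):
    # Batch gradient: average of per-example gradients over the full training set.
    grad = np.mean([example_grad(w, X[i], y[i]) for i in range(m)], axis=0)
    w = w - eta * grad
print(w)
```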
- Stochastic gradient descent (SGD), sketched below
  - First, shuffle the training set; then, for every example $i$, update $w$ using $\nabla_{w^{(t)}}L = \nabla_{w^{(t)}}L(w; x_i, y_i)$
  - Advantages: (1) fast (2) the randomness helps prevent overfitting (3) can adapt as the data changes
  - Disadvantages: (1) it is an approximation of an approximation (the single-example gradient approximates the batch gradient, which itself only approximates the true gradient) (2) hard to parallelize
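A sketch of the SGD loop described above, reusing the same assumed least-squares setup: shuffle the training set once per epoch, then update $w$ from each single-example gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
X = rng.normal(size=(m, 3))                       # toy training set (assumed)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

w = np.zeros(3)
eta = 0.01
for epoch in range(5):
    perm = rng.permutation(m)                # step 1: shuffle the training set
    for i in perm:                           # step 2: one update per example
        grad_i = (X[i] @ w - y[i]) * X[i]    # gradient of the single-example loss
        w = w - eta * grad_i
print(w)
```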
- Mini-batch gradient descent: use a small random sample of the training set at a time (sketch below)
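A sketch of mini-batch gradient descent under the same assumed setup: each step averages the gradient over a small random batch; the batch size and step size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, batch_size = 1000, 32                     # batch size is an illustrative choice
X = rng.normal(size=(m, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

w = np.zeros(3)
eta = 0.05
for step in range(500):
    idx = rng.choice(m, size=batch_size, replace=False)   # sample a small batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size              # gradient averaged over the batch
    w = w - eta * grad
print(w)
```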
- Optimization challenges: (1) ill-conditioning (2) local minima (3) saddle points
- SGD with momentum, AdaGrad, ... (sketches of both update rules below)
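Hedged sketches of the two update rules named above, on the same assumed mini-batch least-squares setup: classical momentum keeps a decaying velocity term, and AdaGrad scales each coordinate by its accumulated squared gradients. The hyperparameter values are illustrative, and the momentum form shown is one of several common variants.

```python
import numpy as np

rng = np.random.default_rng(0)
m, batch_size = 1000, 32
X = rng.normal(size=(m, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=m)

def batch_grad(w):
    """Mini-batch gradient of the assumed least-squares loss."""
    idx = rng.choice(m, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

# SGD with momentum: v accumulates a decaying sum of past gradients.
w, v, eta, beta = np.zeros(3), np.zeros(3), 0.05, 0.9
for step in range(500):
    g = batch_grad(w)
    v = beta * v + g                          # velocity update
    w = w - eta * v
print("momentum:", w)

# AdaGrad: per-coordinate step sizes shrink as squared gradients accumulate.
w, G, eta, eps = np.zeros(3), np.zeros(3), 0.5, 1e-8
for step in range(500):
    g = batch_grad(w)
    G = G + g ** 2                            # running sum of squared gradients
    w = w - eta * g / (np.sqrt(G) + eps)      # coordinate-wise scaled update
print("adagrad:", w)
```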