[Advanced Learning Algorithms] 27. Advanced Optimization

Gradient descent is an optimization algorithm that is widely used in machine learning, and it was the foundation of many algorithms like linear regression, logistic regression, and early implementations of neural networks. But it turns out that there are now some other optimization algorithms for minimizing the cost function that are even better than gradient descent. In this video, we'll take a look at an algorithm that can help you train your neural network much faster than gradient descent.

Recall that this is the expression for one step of gradient descent: a parameter w_j is updated as w_j minus the learning rate Alpha times this partial derivative term. How can we make this work even better?

In this example, I've plotted the cost function J using a contour plot composed of these ellipses, and the minimum of this cost function is at the center of these ellipses, down here. Now, if you were to start gradient descent down here, one step of gradient descent, if Alpha is small, may take you a little bit in that direction. Then another step, then another step, then another step, then another step, and you notice that every single step of gradient descent is pretty much going in the same direction. If you see this to be the case, you might wonder, well, why don't we make Alpha bigger? Can we have an algorithm to automatically increase Alpha? That would make it take bigger steps and get to the minimum faster. There's an algorithm called the Adam algorithm that can do that. If it sees that the learning rate is too small, and we are just taking tiny little steps in a similar direction over and over, we should just make the learning rate Alpha bigger.

In contrast, here again is the same cost function. If we were starting here and have a relatively big learning rate Alpha, then maybe one step of gradient descent takes us here, the second step takes us here, then the third step, the fourth step, the fifth step, and the sixth step. If you see gradient descent oscillating back and forth like this, you'd be tempted to say, well, why don't we make the learning rate smaller? The Adam algorithm can also do that automatically, and with a smaller learning rate, you can then take a smoother path toward the minimum of the cost function.

Depending on how gradient descent is proceeding, sometimes you wish you had a bigger learning rate Alpha, and sometimes you wish you had a smaller learning rate Alpha. The Adam algorithm can adjust the learning rate automatically. Adam stands for Adaptive Moment Estimation, or A-D-A-M, and don't worry too much about what this name means; it's just what the authors called this algorithm...
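To make the update rule quoted above concrete, here is a minimal NumPy sketch of one gradient descent step, w_j := w_j - Alpha * dJ/dw_j, applied to a weight vector w and a bias b. The function name gradient_descent_step and the grad_J callable are hypothetical illustrations, not code from the course.

```python
import numpy as np

def gradient_descent_step(w, b, alpha, grad_J):
    """Apply one gradient descent update to parameters w (vector) and b (scalar).

    grad_J is assumed to return (dJ/dw, dJ/db), the partial derivatives
    of the cost J at the current parameters.
    """
    dj_dw, dj_db = grad_J(w, b)
    w = w - alpha * dj_dw   # w_j := w_j - alpha * dJ/dw_j, for every j
    b = b - alpha * dj_db   # same update rule applied to the bias term
    return w, b
```

With plain gradient descent, the single learning rate alpha is fixed by hand; the point of Adam is that the effective step size is adapted automatically, and separately for each parameter.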
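In practice you rarely implement Adam yourself; in TensorFlow/Keras (the framework used in this course) you select it when compiling the model. The sketch below is an assumption about typical usage rather than the exact code shown in the video, and the layer sizes and loss function are illustrative only. Note that Adam still takes an initial learning rate, but it then adapts the step size for each parameter during training.

```python
import tensorflow as tf

# Illustrative network; the layer sizes here are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),
])

# Choose the Adam optimizer instead of plain gradient descent.
# learning_rate is only the initial value; Adam adjusts it automatically.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```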