itoen

The idea of momentum is to maintain an update vector that determines how much the parameters change at each time step. At each step, a fraction of the previous update vector is added to the current gradient to form the new update vector. This makes it possible to apply gradients asynchronously: a gradient from a computation scheduled earlier still influences the current time step when a later-scheduled subgradient is folded in, because its contribution persists in the update vector.
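To make that concrete, here's a minimal sketch of the classical momentum update (the names `momentum_step`, `lr`, and `gamma` are just illustrative, not from the slide):

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, gamma=0.9):
    # Fold the new gradient into the running update vector.
    # The gamma * velocity term is what lets an earlier (possibly stale)
    # gradient keep influencing the current step.
    velocity = gamma * velocity + lr * grad
    params = params - velocity
    return params, velocity

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
params = np.array([5.0])
velocity = np.zeros_like(params)
for _ in range(100):
    grad = 2.0 * params
    params, velocity = momentum_step(params, grad, velocity)
print(params)  # converges toward 0
```

Since the velocity is a decaying sum of all past gradients, an asynchronous worker's stale gradient simply gets mixed into this running sum rather than being applied in isolation.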

Momentum was originally designed to improve the stability of SGD, making descent down "ravines" faster. It's very interesting to see these algorithmic techniques become relevant in the context of parallel computing.

ishangaur

As mentioned earlier, could these stale momenta act like the noise from SGD itself and provide a regularizing effect on training?
