## A series of handwriting algorithms for Python machine learning optimizers

There are several evidences 2020-11-13 12:40:55
series handwriting algorithms python machine

Let's first review the gradient descent , Take the first article in this series 《python Machine learning handwriting algorithm series —— Linear regression 》 For example .

Objective function ：

y = f ( θ , x ) y=f(\theta,x)

The objective function of linear regression is ：

y = a x + b + ε y=ax+b+ε
y ^ = a x + b \hat{y}=ax+b

The loss function is ：

J ( θ ) J(\theta)

The loss function of linear regression is ：

J ( a , b ) = 1 2 n ∑ i = 0 n ( y i − y ^ i ) 2 J(a,b)=\frac{1}{2n}\sum_{i=0}^{n}(y_i−\hat{y}_i )^2

Optimization function ：

θ = θ − α ∂ J ∂ θ \theta = \theta - \alpha \frac{\partial J}{\partial \theta}

For the optimization function of linear regression of one variable ：

a = a − α ∂ J ∂ a a = a - \alpha \frac{\partial J}{\partial a}
b = b − α ∂ J ∂ b b = b - \alpha \frac{\partial J}{\partial b}

here ∂ J ∂ a \frac{\partial J}{\partial a} and ∂ J ∂ b \frac{\partial J}{\partial b} Namely ：

∂ J ∂ a = 1 n ∑ i = 0 n x ( y ^ i − y i ) \frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=0}^{n}x(\hat{y}_i-y_i)

∂ J ∂ b = 1 n ∑ i = 0 n ( y ^ i − y i ) \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=0}^{n}(\hat{y}_i-y_i)

here y ^ \hat{y} yes y The estimate of , θ \theta Is the parameter , J J Is the loss function , α \alpha It's the learning rate , The same below .

Here we use python The code for the implementation is as follows

def model(a, b, x):
return a*x + b
def cost_function(a, b, x, y):
n = 5
return 0.5/n * (np.square(y-a*x-b)).sum()
def sgd(a,b,x,y):
n = 5
alpha = 1e-1
y_hat = model(a,b,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
a = a - alpha*da
b = b - alpha*db
return a, b


We optimize 100 The results are as follows ：

The loss is 0.00035532090622957674, Let's write it down here , after , We will use other optimization functions to achieve this value respectively . See how many calculations they need .

# Momentum

m = β m − α ∂ J ∂ θ m = \beta m - \alpha \frac{\partial J}{\partial \theta}

θ = θ + m \theta = \theta + m

For linear regression with one variable ：

m a = β m a − α ∂ J ∂ a m_a = \beta m_a - \alpha \frac{\partial J}{\partial a}

a = a + m a a = a + m_a

m b = β m b − α ∂ J ∂ b m_b = \beta m_b - \alpha \frac{\partial J}{\partial b}

b = b + m a b = b + m_a

among ：

∂ J ∂ a = 1 n ∑ i = 0 n x ( y ^ i − y i ) \frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=0}^{n}x(\hat{y}_i-y_i)

∂ J ∂ b = 1 n ∑ i = 0 n ( y ^ i − y i ) \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=0}^{n}(\hat{y}_i-y_i)

Python Realization

def momentum(a, b, ma, mb, x, y):
n = 5
alpha = 1e-1
beta = 0.9
y_hat = model(a,b,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
ma = beta*ma - alpha*da
mb = beta*mb - alpha*db
a = a + ma
b = b + mb
return a, b, ma, mb


In order to achieve the same loss ,Momentum It was used 46 Time . It's less than half of the gradient .

Nesterov Accelerated Gradient（NAG）, Also called Nesterov momentum optimization, yes Yurii Nesterov stay 1983 Put forward in . be relative to Momentum, It's a little bit “ prospect ”. When it's calculating the loss function , Have already put Momentum It's in the parameters . So it calculates the first derivative of the update loss .
m = β m − α ∂ J ( θ + β m ) ∂ θ m = \beta m - \alpha \frac{\partial J(\theta+\beta m)}{\partial \theta}

θ = θ + m \theta = \theta + m

For linear regression with one variable ：

m a = β m a − α ∂ J ( a + β m a ) ∂ a m_a = \beta m_a - \alpha \frac{\partial J(a + \beta m_a)}{\partial a}

a = a + m a a = a + m_a

m b = β m b − α ∂ J ( b + β m b ) ∂ b m_b = \beta m_b - \alpha \frac{\partial J(b+\beta m_b)}{\partial b}

b = b + m a b = b + m_a

among ,

∂ J ∂ a = 1 n ∑ i = 0 n x ( y ^ i − y i ) \frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=0}^{n}x(\hat{y}_i-y_i)

∂ J ∂ b = 1 n ∑ i = 0 n ( y ^ i − y i ) \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=0}^{n}(\hat{y}_i-y_i)

here , We just need to make

y ^ = ( a + m a ) x + ( b + m b ) \hat{y}=(a+m_a)x+(b+m_b)

that will do . Other places and Momentum equally .

Python Realization

def nesterov(a, b, ma, mb, x, y):
n = 5
alpha = 1e-1
beta = 0.9
y_hat = model(a+ma,b+mb,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
ma = beta*ma - alpha*da
mb = beta*mb - alpha*db
a = a + ma
b = b + mb
return a, b, ma, mb


here , We use the 21 Times achieved the same loss .

In the following illustration , The blue one is gradient descent , It's going fast in the direction of the maximum gradient , Instead of moving towards the end of the picture . Yellow is AdaGrad, It points to the global optimum . The way it works is to shrink （scaling down） The maximum gradient parameter .

The formula ：

ϵ = 1 e − 10 \epsilon=1e-10

s = s + ∂ J ∂ θ ⊙ ∂ J ∂ θ s = s + \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}

θ = θ − α ∂ J ∂ θ ⊘ s + ϵ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon}

Again , We use a,b Replace θ \theta that will do .

Python Code ：

def ada_grad(a,b,sa, sb, x,y):
epsilon=1e-10
n = 5
alpha = 1e-1
y_hat = model(a,b,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
sa=sa+da*da + epsilon
sb=sb+db*db + epsilon
a = a - alpha*da / np.sqrt(sa)
b = b - alpha*db / np.sqrt(sb)
return a, b, sa, sb


here , We use the 114 Time , It's more than a gradient drop 14 Time .

# RMSProp

AdaGrad What's the risk , It's slowing down too fast . So it took longer to reach the global optimum . It may even never be globally optimal .PMSProp This deceleration is reduced . The method is to multiply by a parameter β \beta , Usually its value is 0.9.

ϵ = 1 e − 10 \epsilon=1e-10

s = β s + ( 1 − β ) ∂ J ∂ θ ⊙ ∂ J ∂ θ s = \beta s + (1-\beta) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}

θ = θ − α ∂ J ∂ θ ⊘ s + ϵ \theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon}

Python Code

def rmsprop(a,b,sa, sb, x,y):
epsilon=1e-10
beta = 0.9
n = 5
alpha = 1e-1
y_hat = model(a,b,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
sa=beta*sa+(1-beta)*da*da + epsilon
sb=beta*sb+(1-beta)*db*db + epsilon
a = a - alpha*da / np.sqrt(sa)
b = b - alpha*db / np.sqrt(sb)
return a, b, sa, sb


here , Only by the 11 Time , It's a gradient drop, the same loss .

Let's finish with Adam,Adam Namely adaptive moment estimation. He is momentum and RMSProp The combination of .

m = β 1 m − ( 1 − β 1 ) ∂ J ∂ θ m = \beta_1 m - (1-\beta_1)\frac{\partial J}{\partial \theta}

s = β 2 s + ( 1 − β 2 ) ∂ J ∂ θ ⊙ ∂ J ∂ θ s = \beta_2 s + (1-\beta_2) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}

m ^ = m 1 − β 1 T \hat{m} = \frac{m}{1-\beta_1^T}

s ^ = s 1 − β 2 T \hat{s} = \frac{s}{1-\beta_2^T}

θ = θ + α m ^ ⊘ s ^ + ϵ \theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon}

there ⊙ \odot and ⊘ \oslash It's dot multiplication and dot division , Or call it element-wise product and element-wise division.

Python Realization

def adam(a, b, ma, mb, sa, sb, t, x, y):
epsilon=1e-10
beta1 = 0.9
beta2 = 0.9
n = 5
alpha = 1e-1
y_hat = model(a,b,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
ma = beta1 * ma - (1-beta1)*da
mb = beta1 * mb - (1-beta1)*db
sa = beta2 * sa + (1-beta2)*da*da
sb = beta2 * sb + (1-beta2)*db*db
ma_hat = ma/(1-beta1**t)
mb_hat = mb/(1-beta1**t)
sa_hat=sa/(1-beta2**t)
sb_hat=sb/(1-beta2**t)
a = a + alpha*ma_hat / np.sqrt(sa_hat)
b = b + alpha*mb_hat / np.sqrt(sb_hat)
return a, b, ma, mb, sa, sb


Adam It was used 25 Time .

Nadam Namely Nesterov + Adam.nadam and adam comparison , Only the difference is used in calculating the loss θ + m \theta+m To replace the θ \theta

m = β 1 m − ( 1 − β 1 ) ∂ J ( θ + β 1 m ) ∂ θ m = \beta_1 m - (1-\beta_1)\frac{\partial J(\theta+\beta_1 m)}{\partial \theta}

s = β 2 s + ( 1 − β 2 ) ∂ J ( θ + β 1 m ) ∂ θ ⊙ ∂ J ( θ + β 1 m ) ∂ θ s = \beta_2 s + (1-\beta_2) \frac{\partial J(\theta+\beta_1 m)}{\partial \theta} \odot \frac{\partial J(\theta+\beta_1 m)}{\partial \theta}

m ^ = m 1 − β 1 T \hat{m} = \frac{m}{1-\beta_1^T}

s ^ = s 1 − β 2 T \hat{s} = \frac{s}{1-\beta_2^T}

θ = θ + α m ^ ⊘ s ^ + ϵ \theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon}

def nadam(a, b, ma, mb, sa, sb, t, x, y):
epsilon=1e-10
beta1 = 0.9
beta2 = 0.9
n = 5
alpha = 1e-1
# we only modify here
# with a = a + ma
# and b = b + mb
y_hat = model(a+ma,b+mb,x)
da = (1.0/n) * ((y_hat-y)*x).sum()
db = (1.0/n) * ((y_hat-y).sum())
ma = beta1 * ma - (1-beta1)*da
mb = beta1 * mb - (1-beta1)*db
sa = beta2 * sa + (1-beta2)*da*da
sb = beta2 * sb + (1-beta2)*db*db
ma_hat = ma/(1-beta1**t)
mb_hat = mb/(1-beta1**t)
sa_hat=sa/(1-beta2**t)
sb_hat=sb/(1-beta2**t)
a = a + alpha*ma_hat / np.sqrt(sa_hat)
b = b + alpha*mb_hat / np.sqrt(sb_hat)
return a, b, ma, mb, sa, sb


# summary

For our univariate linear regression . To achieve the same effect （ Loss ）, All iterations of various optimizers ：

optimization algorithm The number of iterations impulse Nesterov adaptive Self regulation remarks
SGD 100
Momentum 46
Nesterov 21
RMSProp 11

# Reference material

When writing this article , It's time to see 《hands-on machine learning (2nd Edition)》, So I took this book as a reference , The screenshots in the article are all from this book .

I've never read the watermelon book you often say , And there's no plan to see it . Machine learning should start with English materials , Not in Chinese . It's hard to see English in the early stage , It's easy later . It's easy to watch early Chinese , It's hard at the back .

# Write optimization functions , You can also

These optimization functions , When it's done , Just apply the formula in the book . I spent a total of three hours , All the functions are implemented . among , Function is wrong. Waste an hour . Look for dot multiplication and dot division latex perhaps markdown It's a waste of half an hour .

Dot multiplication and dot division markdown / latex：

\odot
\oslash


I think the optimization function is the simplest among all the handwriting algorithms I have done . There's basically no need to derive formulas , Just use it directly .

You can try to write it yourself , I can't write it out , You can refer to my code .

I wish you all the best in machine learning .

May the machine learn with you.

https://github.com/EricWebsmith/machine_learning_from_scrach

Mainland users render github jupyter notebook Please use jupyter notebook viewer：
https://nbviewer.jupyter.org/github/EricWebsmith/machine_learning_from_scrach/blob/master/optimizers.ipynb

# python Machine learning handwriting algorithm series

python Machine learning handwriting algorithm series —— Linear regression

python Machine learning handwriting algorithm series —— Logical regression

python Machine learning handwriting algorithm series —— Optimization function

python Machine learning handwriting algorithm series —— Decision tree

python Machine learning handwriting algorithm series ——kmeans clustering

python Machine learning handwriting algorithm series ——GBDT Gradient lifting classification

python Machine learning handwriting algorithm series ——GBDT Gradient ascension regression

python Machine learning handwriting algorithm series —— Bayesian optimization Bayesian Optimization

python Machine learning handwriting algorithm series ——PageRank Algorithm