# Gradient Descent

We fit a model $y = f(\theta, x)$ with parameters $\theta$. For simple linear regression, the data are generated by

$$y = ax + b + \varepsilon$$

and the model's prediction is

$$\hat{y} = ax + b$$

The cost function $J(\theta)$ is the mean squared error (the factor $\frac{1}{2}$ cancels the 2 that appears when differentiating):

$$J(a,b)=\frac{1}{2n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Gradient descent moves each parameter a small step against its gradient, with learning rate $\alpha$:

$$\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$$

For our two parameters:

$$a = a - \alpha \frac{\partial J}{\partial a}$$

$$b = b - \alpha \frac{\partial J}{\partial b}$$

where

$$\frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)$$
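Where do these gradient formulas come from? Applying the chain rule to the cost gives, for $a$ (and analogously for $b$):

$$\frac{\partial J}{\partial a} = \frac{1}{2n}\sum_{i=1}^{n} 2\,(y_i-\hat{y}_i)\,\frac{\partial (y_i - a x_i - b)}{\partial a} = \frac{1}{2n}\sum_{i=1}^{n} 2\,(y_i-\hat{y}_i)(-x_i) = \frac{1}{n}\sum_{i=1}^{n} x_i(\hat{y}_i-y_i)$$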

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def cost_function(a, b, x, y):
    n = len(x)
    return 0.5 / n * np.square(y - a * x - b).sum()

def sgd(a, b, x, y):
    n = len(x)
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    a = a - alpha * da
    b = b - alpha * db
    return a, b
```
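To see the update rule in action, here is a minimal training loop on made-up toy data (the dataset, seed, and iteration count are illustrative, not from the original post):

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def sgd(a, b, x, y, alpha=1e-1):
    n = len(x)
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    return a - alpha * da, b - alpha * db

# Toy data: y = 2x + 3 plus small Gaussian noise (assumed example)
rng = np.random.default_rng(0)
x = np.arange(5, dtype=float)
y = 2.0 * x + 3.0 + rng.normal(0.0, 0.05, size=5)

a, b = 0.0, 0.0
for _ in range(1000):
    a, b = sgd(a, b, x, y)
print(a, b)  # a near 2, b near 3
```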


# Momentum

Compared with plain gradient descent, Momentum also takes past gradients into account: when updating a parameter, besides the current gradient, it adds an accumulated velocity term built from the previous gradients (the momentum).

$$m = \beta m - \alpha \frac{\partial J}{\partial \theta}$$

$$\theta = \theta + m$$

Applied to our two parameters:

$$m_a = \beta m_a - \alpha \frac{\partial J}{\partial a}$$

$$a = a + m_a$$

$$m_b = \beta m_b - \alpha \frac{\partial J}{\partial b}$$

$$b = b + m_b$$

The gradients themselves are unchanged:

$$\frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)$$

Python implementation:

```python
def momentum(a, b, ma, mb, x, y):
    n = len(x)
    alpha = 1e-1
    beta = 0.9
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta * ma - alpha * da
    mb = beta * mb - alpha * db
    a = a + ma
    b = b + mb
    return a, b, ma, mb
```
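A sketch of how the momentum state is threaded through a training loop (the toy data are made up for illustration; noise-free, so convergence is easy to check):

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def momentum(a, b, ma, mb, x, y, alpha=1e-1, beta=0.9):
    n = len(x)
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta * ma - alpha * da
    mb = beta * mb - alpha * db
    return a + ma, b + mb, ma, mb

x = np.arange(5, dtype=float)
y = 2.0 * x + 3.0            # noise-free toy data (assumed)
a = b = ma = mb = 0.0        # momentum state starts at zero
for _ in range(500):
    a, b, ma, mb = momentum(a, b, ma, mb, x, y)
print(a, b)  # converges to a = 2, b = 3
```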


# Nesterov Accelerated Gradient

Nesterov Accelerated Gradient (NAG), also called Nesterov momentum optimization, was proposed by Yurii Nesterov in 1983. Compared with Momentum, it has a bit of "look-ahead": it evaluates the cost function with the momentum already added to the parameters, so the gradient it computes is that of the position the update is about to move to.
$$m = \beta m - \alpha \frac{\partial J(\theta+\beta m)}{\partial \theta}$$

$$\theta = \theta + m$$

Applied to our two parameters:

$$m_a = \beta m_a - \alpha \frac{\partial J(a + \beta m_a)}{\partial a}$$

$$a = a + m_a$$

$$m_b = \beta m_b - \alpha \frac{\partial J(b+\beta m_b)}{\partial b}$$

$$b = b + m_b$$

The gradient formulas keep the same form:

$$\frac{\partial J}{\partial a} = \frac{1}{n}\sum_{i=1}^{n}x_i(\hat{y}_i-y_i)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)$$

but are evaluated at the look-ahead prediction:

$$\hat{y}=(a+m_a)x+(b+m_b)$$

Python implementation:

```python
def nesterov(a, b, ma, mb, x, y):
    n = len(x)
    alpha = 1e-1
    beta = 0.9
    # evaluate the gradient at the look-ahead position
    y_hat = model(a + ma, b + mb, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta * ma - alpha * da
    mb = beta * mb - alpha * db
    a = a + ma
    b = b + mb
    return a, b, ma, mb
```


# AdaGrad

AdaGrad shrinks the learning rate of each parameter according to its accumulated squared gradients ($\odot$ is element-wise multiplication, $\oslash$ element-wise division):

$$\epsilon = 10^{-10}$$

$$s = s + \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}$$

$$\theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon}$$

Python code:

```python
def ada_grad(a, b, sa, sb, x, y):
    epsilon = 1e-10
    n = len(x)
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    sa = sa + da * da
    sb = sb + db * db
    a = a - alpha * da / np.sqrt(sa + epsilon)
    b = b - alpha * db / np.sqrt(sb + epsilon)
    return a, b, sa, sb
```
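A quick way to see AdaGrad's defining behavior: the accumulator only grows, so the effective step size $\alpha/\sqrt{s}$ only shrinks over time. A small sketch on made-up toy data:

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def ada_grad(a, b, sa, sb, x, y, alpha=1e-1, epsilon=1e-10):
    n = len(x)
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    sa = sa + da * da                       # accumulator never decreases
    sb = sb + db * db
    a = a - alpha * da / np.sqrt(sa + epsilon)
    b = b - alpha * db / np.sqrt(sb + epsilon)
    return a, b, sa, sb

x = np.arange(5, dtype=float)
y = 2.0 * x + 3.0                          # made-up toy data
a = b = sa = sb = 0.0
steps = []
for _ in range(50):
    a_new, b_new, sa, sb = ada_grad(a, b, sa, sb, x, y)
    steps.append(abs(a_new - a))           # step actually taken in a
    a, b = a_new, b_new
# first step is exactly alpha (gradient cancels); later steps keep shrinking
```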


# RMSProp

RMSProp keeps an exponential moving average of the squared gradients instead of a plain sum, so the effective learning rate does not decay all the way to zero:

$$\epsilon = 10^{-10}$$

$$s = \beta s + (1-\beta) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}$$

$$\theta = \theta - \alpha \frac{\partial J}{\partial \theta} \oslash \sqrt{s+\epsilon}$$

Python code:

```python
def rmsprop(a, b, sa, sb, x, y):
    epsilon = 1e-10
    beta = 0.9
    n = len(x)
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    sa = beta * sa + (1 - beta) * da * da
    sb = beta * sb + (1 - beta) * db * db
    a = a - alpha * da / np.sqrt(sa + epsilon)
    b = b - alpha * db / np.sqrt(sb + epsilon)
    return a, b, sa, sb
```


# Adam

Adam combines Momentum and RMSProp: it keeps both a momentum term $m$ and a moving average of squared gradients $s$, and corrects each for its zero initialization ($t$ is the iteration number, starting at 1):

$$m = \beta_1 m - (1-\beta_1)\frac{\partial J}{\partial \theta}$$

$$s = \beta_2 s + (1-\beta_2) \frac{\partial J}{\partial \theta} \odot \frac{\partial J}{\partial \theta}$$

$$\hat{m} = \frac{m}{1-\beta_1^t}$$

$$\hat{s} = \frac{s}{1-\beta_2^t}$$

$$\theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon}$$

Note the $+$ in the last line: with this sign convention, $m$ already points downhill.

Python implementation:

```python
def adam(a, b, ma, mb, sa, sb, t, x, y):
    epsilon = 1e-10
    beta1 = 0.9
    beta2 = 0.9
    n = len(x)
    alpha = 1e-1
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta1 * ma - (1 - beta1) * da
    mb = beta1 * mb - (1 - beta1) * db
    sa = beta2 * sa + (1 - beta2) * da * da
    sb = beta2 * sb + (1 - beta2) * db * db
    ma_hat = ma / (1 - beta1 ** t)
    mb_hat = mb / (1 - beta1 ** t)
    sa_hat = sa / (1 - beta2 ** t)
    sb_hat = sb / (1 - beta2 ** t)
    a = a + alpha * ma_hat / np.sqrt(sa_hat + epsilon)
    b = b + alpha * mb_hat / np.sqrt(sb_hat + epsilon)
    return a, b, ma, mb, sa, sb
```
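One practical detail when driving `adam`: the bias correction divides by $1-\beta^t$, which is zero at $t=0$, so the step counter must start at 1. A sketch of a training loop on made-up toy data (assumed, not from the post):

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def adam(a, b, ma, mb, sa, sb, t, x, y,
         alpha=1e-1, beta1=0.9, beta2=0.9, epsilon=1e-10):
    n = len(x)
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta1 * ma - (1 - beta1) * da     # minus: m points downhill
    mb = beta1 * mb - (1 - beta1) * db
    sa = beta2 * sa + (1 - beta2) * da * da
    sb = beta2 * sb + (1 - beta2) * db * db
    ma_hat = ma / (1 - beta1 ** t)          # bias correction
    mb_hat = mb / (1 - beta1 ** t)
    sa_hat = sa / (1 - beta2 ** t)
    sb_hat = sb / (1 - beta2 ** t)
    a = a + alpha * ma_hat / np.sqrt(sa_hat + epsilon)
    b = b + alpha * mb_hat / np.sqrt(sb_hat + epsilon)
    return a, b, ma, mb, sa, sb

x = np.arange(5, dtype=float)
y = 2.0 * x + 3.0                           # made-up toy data
a = b = ma = mb = sa = sb = 0.0
for t in range(1, 1001):                    # t starts at 1, not 0
    a, b, ma, mb, sa, sb = adam(a, b, ma, mb, sa, sb, t, x, y)
print(a, b)  # moves toward the true values (2, 3)
```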


# Nadam

Nadam is Adam with a Nesterov-style look-ahead: the gradients are evaluated at $\theta + \beta_1 m$ instead of at $\theta$:

$$m = \beta_1 m - (1-\beta_1)\frac{\partial J(\theta+\beta_1 m)}{\partial \theta}$$

$$s = \beta_2 s + (1-\beta_2) \frac{\partial J(\theta+\beta_1 m)}{\partial \theta} \odot \frac{\partial J(\theta+\beta_1 m)}{\partial \theta}$$

$$\hat{m} = \frac{m}{1-\beta_1^t}$$

$$\hat{s} = \frac{s}{1-\beta_2^t}$$

$$\theta = \theta + \alpha \hat{m} \oslash \sqrt{\hat{s}+\epsilon}$$

```python
def nadam(a, b, ma, mb, sa, sb, t, x, y):
    epsilon = 1e-10
    beta1 = 0.9
    beta2 = 0.9
    n = len(x)
    alpha = 1e-1
    # the only change from adam: evaluate the gradient
    # at the look-ahead position (a + ma, b + mb)
    y_hat = model(a + ma, b + mb, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    ma = beta1 * ma - (1 - beta1) * da
    mb = beta1 * mb - (1 - beta1) * db
    sa = beta2 * sa + (1 - beta2) * da * da
    sb = beta2 * sb + (1 - beta2) * db * db
    ma_hat = ma / (1 - beta1 ** t)
    mb_hat = mb / (1 - beta1 ** t)
    sa_hat = sa / (1 - beta2 ** t)
    sb_hat = sb / (1 - beta2 ** t)
    a = a + alpha * ma_hat / np.sqrt(sa_hat + epsilon)
    b = b + alpha * mb_hat / np.sqrt(sb_hat + epsilon)
    return a, b, ma, mb, sa, sb
```


On the example problem, the number of iterations each optimizer needs to converge:

| Optimizer | Iterations |
|-----------|-----------|
| SGD       | 100 |
| Momentum  | 46 |
| Nesterov  | 21 |
| RMSProp   | 11 |
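Counts like these can be produced with a loop of the following shape. It is only a sketch: the exact numbers depend on the data, the starting point, and the convergence threshold, none of which are stated above, so the toy values here are assumed.

```python
import numpy as np

def model(a, b, x):
    return a * x + b

def cost(a, b, x, y):
    return 0.5 / len(x) * np.square(y - model(a, b, x)).sum()

def sgd_step(a, b, x, y, alpha=1e-1):
    n = len(x)
    y_hat = model(a, b, x)
    da = (1.0 / n) * ((y_hat - y) * x).sum()
    db = (1.0 / n) * (y_hat - y).sum()
    return a - alpha * da, b - alpha * db

def iterations_to_converge(step, x, y, tol=1e-3, max_iter=10_000):
    a, b = 0.0, 0.0
    for i in range(1, max_iter + 1):
        a, b = step(a, b, x, y)
        if cost(a, b, x, y) < tol:
            return i
    return max_iter

x = np.arange(5, dtype=float)
y = 2.0 * x + 3.0                  # made-up toy data
iters = iterations_to_converge(sgd_step, x, y)
print(iters)
```

The other optimizers can be compared by wrapping their extra state (momentum, accumulators, step counter) in the same kind of step function.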

# You Can Write Optimizers by Hand Too

Notation used above: $\odot$ (`\odot`) is element-wise multiplication and $\oslash$ (`\oslash`) is element-wise division.


May the machine learn with you.

# Code on GitHub

https://github.com/EricWebsmith/machine_learning_from_scrach

https://nbviewer.jupyter.org/github/EricWebsmith/machine_learning_from_scrach/blob/master/optimizers.ipynb

# Python Machine Learning from-Scratch Series

Python machine learning from-scratch series: Linear Regression

Python machine learning from-scratch series: Logistic Regression

Python machine learning from-scratch series: Optimizers

Python machine learning from-scratch series: Decision Tree

Python machine learning from-scratch series: k-means Clustering

Python machine learning from-scratch series: GBDT Gradient Boosting Classification

Python machine learning from-scratch series: GBDT Gradient Boosting Regression

Python machine learning from-scratch series: Bayesian Optimization

Python machine learning from-scratch series: PageRank

https://blog.csdn.net/juwikuang/article/details/108039680