Handwriting Machine Learning Algorithms in Python -- GBDT Gradient Boosting Classification


Gradient Boosting trains a series of weak learners; each learner fits the pseudo-residuals of the previous learner (instead of y), in order to improve the performance of the algorithm.

This is how Wikipedia describes gradient boosting:

Gradient boosting is a machine learning technique for classification (and regression) problems. The prediction model it produces is an ensemble of weak prediction models; typically decision trees are used as the weak models, in which case the result is called gradient boosted trees (GBT or GBDT). Like other boosting methods, it builds the model in stages, but it generalizes them by allowing the optimization of an arbitrary differentiable loss function.

Prerequisite knowledge

1. Logistic regression
2. Linear regression
3. Decision trees


This article covers:

1. How gradient boosting is applied to classification
2. Handwriting gradient boosting classification from scratch


Algorithm

The following figure illustrates the gradient boosting classification algorithm well.

(The picture is from the YouTube channel StatQuest.)

The first part of the picture above is a stump whose value is the log of odds of y; we denote it as $l$. It is followed by a series of trees. When training the later trees, their target is not y but the residual of y.

residual = true value - predicted value

This picture is a bit more complicated than the one for gradient boosting regression. Here, the green part is the residuals, the red part is called $\gamma$, and the black part is the learning rate. The residuals are not simply averaged to get $\gamma$; instead, $\gamma$ is computed per leaf and then used to update $l$.

Procedure

Step 1: Calculate the log of odds $l_0$, which is also the first prediction for y. Here $n_1$ is the number of samples with y=1 and $n_0$ is the number of samples with y=0.

$$l_0(x)=\log \frac{n_1}{n_0}$$

For each $x_i$, the probability is:
$$p_{0i}=\frac{e^{l_{0i}}}{1+e^{l_{0i}}}$$

The prediction is:

$$f_{0i}=\begin{cases} 0 & p_{0i}<0.5 \\ 1 & p_{0i}\ge 0.5 \end{cases}$$

Step 2: for m in 1 to M:

• Step 2.1: Calculate the so-called pseudo residuals:

$$r_{im}=y_i-p_{(m-1)i}$$

• Step 2.2: Fit a regression tree $t_m(x)$ to the pseudo residuals, and identify its terminal leaf nodes $R_{jm}$, for $j = 1...J_m$.

• Step 2.3: Calculate the $\gamma$ of each leaf node:

$$\gamma_{jm}=\frac{\sum_{x_i \in R_{jm}} r_{im}}{\sum_{x_i \in R_{jm}} p_{(m-1)i}\,(1-p_{(m-1)i})}$$

• Step 2.4: Update $l$, $p$, $f$. $\alpha$ is the learning rate:
$$l_m(x)=l_{m-1}(x)+\alpha \gamma_m$$

$$p_{mi}=\frac{e^{l_{mi}}}{1+e^{l_{mi}}}$$

$$f_{mi}=\begin{cases} 0 & p_{mi}<0.5 \\ 1 & p_{mi}\ge 0.5 \end{cases}$$
Step 3: Output $f_M(x)$.
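
As a preview, the whole simplified procedure can be condensed into a short numpy/sklearn sketch. This is my own compressed version rather than the step-by-step implementation developed later; the function name is hypothetical, and the default learning rate of 0.8 is the value used in the worked example below.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gbc_fit(X, y, M=5, alpha=0.8, max_leaf_nodes=3):
    # Step 1: start from the log of odds of y (y is a 0/1 numpy array, both classes present)
    l = np.full(len(y), np.log(y.sum() / (len(y) - y.sum())))
    for m in range(M):
        p = np.exp(l) / (1 + np.exp(l))       # current probabilities
        r = y - p                             # Step 2.1: pseudo residuals
        t = DecisionTreeRegressor(max_depth=1, max_leaf_nodes=max_leaf_nodes).fit(X, r)
        leaf = t.apply(X)                     # Step 2.2: leaf index of each sample
        gamma = np.zeros(len(y))
        for j in np.unique(leaf):             # Step 2.3: gamma per leaf
            idx = leaf == j
            gamma[idx] = r[idx].sum() / (p[idx] * (1 - p[idx])).sum()
        l = l + alpha * gamma                 # Step 2.4: update the log of odds
    p = np.exp(l) / (1 + np.exp(l))
    return (p >= 0.5).astype(int), p          # Step 3: final predictions and probabilities

Note that this sketch only reproduces the training loop on the training samples themselves; predicting on new data would additionally require storing the fitted trees and their per-leaf gammas.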

The simplified process above is enough for handwriting the gradient boosting classification algorithm. If you have energy to spare, you can follow along and derive gradient boosting classification (GBC) from the general gradient boosting (GB) procedure.

First, let's look at the steps of GB.

Input: training data $\{(x_i, y_i)\}_{i=1}^{n}$, a differentiable loss function $L(y, F(x))$, and the number of iterations $M$.

Algorithm:

Step 1: Start the algorithm with a constant $F_0(x)$, which satisfies:

$$F_0(x)=\underset{\gamma}{\operatorname{argmin}}\sum_{i=1}^{n}L(y_i, \gamma)$$

Step 2: for m in 1 to M:

• Step 2.1: Calculate the pseudo-residuals:

$$r_{im}=-\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$

• Step 2.2: Fit a weak learner $h_m(x)$ to the pseudo-residuals, and determine the terminal regions $R_{jm}$ $(j=1...J_m)$.

• Step 2.3: For each terminal region (that is, each leaf), calculate $\gamma$:

$$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}L(y_i, F_{m-1}(x_i)+\gamma)$$

• Step 2.4: Update the model (with learning rate $\alpha$):
$$F_m(x)=F_{m-1}(x)+\alpha\,\gamma_m$$

Step 3: Output the final model $F_M(x)$.
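
As an illustration (not part of the derivation that follows), these steps specialize nicely for the squared-error loss $L(y,F)=\frac{1}{2}(y-F)^2$, where they reduce to the familiar regression case:

$$F_0(x)=\underset{\gamma}{\operatorname{argmin}}\sum_{i=1}^{n}\tfrac{1}{2}(y_i-\gamma)^2=\bar{y},\qquad r_{im}=y_i-F_{m-1}(x_i),\qquad \gamma_{jm}=\frac{1}{|R_{jm}|}\sum_{x_i\in R_{jm}}\big(y_i-F_{m-1}(x_i)\big)$$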

Loss function

To go from the general gradient boosting derivation to gradient boosting classification, we need a loss function and to substitute it into Step 1, Step 2.1, and Step 2.3. Here we use the negative log likelihood as the loss function.

$$L(y, F(x))=-\sum_{i=1}^{N}\big(y_i\log(p) + (1-y_i)\log(1-p)\big)$$

This is a function of the probability $p$, not of the log of odds ($l$), so we need to transform it.

Let's take out the term inside the sum and rearrange it:
$$\begin{aligned}
-(y\log(p)+(1-y)\log(1-p)) &= -y\log(p)-(1-y)\log(1-p) \\
&= -y\log(p)-\log(1-p)+y\log(1-p) \\
&= -y\big(\log(p)-\log(1-p)\big)-\log(1-p) \\
&= -y\log\Big(\frac{p}{1-p}\Big)-\log(1-p) \\
&= -y\log(odds)-\log(1-p)
\end{aligned}$$
Because

$$\begin{aligned}
\log(1-p) &= \log\Big(1-\frac{e^{\log(odds)}}{1+e^{\log(odds)}}\Big) \\
&= \log\Big(\frac{1+e^l}{1+e^l}-\frac{e^l}{1+e^l}\Big) \\
&= \log\Big(\frac{1}{1+e^l}\Big) \\
&= \log(1)-\log(1+e^l) \\
&= -\log(1+e^{\log(odds)})
\end{aligned}$$

Substituting this back in, we obtain

$$-(y\log(p)+(1-y)\log(1-p)) = -y\log(odds)+\log(1+e^{\log(odds)})$$

Finally, we obtain the loss function expressed in terms of $l$:

$$L=-\sum_{i=1}^{N}\big(y_i\,l-\log(1+e^{l})\big)$$
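
As a quick sanity check, for a single sample with $y=1$ and $p=\frac{2}{3}$ (so $l=\log 2$), both forms of the loss give the same value:

$$-\Big(1\cdot\log\tfrac{2}{3}+0\cdot\log\tfrac{1}{3}\Big)=\log\tfrac{3}{2}\approx 0.405,\qquad -\big(1\cdot\log 2-\log(1+e^{\log 2})\big)=\log 3-\log 2=\log\tfrac{3}{2}\approx 0.405$$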

Step 1:

To find the minimum of the loss function, we just need to set its first derivative equal to 0.
$$\begin{aligned}
\frac{\partial L(y, F_0)}{\partial F_0} &= -\frac{\partial \sum_{i=1}^{N}\big(y_i\log(odds)-\log(1+e^{\log(odds)})\big)}{\partial \log(odds)} \\
&= -\sum_{i=1}^{N} y_i+\sum_{i=1}^{N} \frac{\partial \log(1+e^{\log(odds)})}{\partial \log(odds)} \\
&= -\sum_{i=1}^{N} y_i+\sum_{i=1}^{N} \frac{1}{1+e^{l}}\,\frac{\partial (1+e^{l})}{\partial l} \\
&= -\sum_{i=1}^{N} y_i+\sum_{i=1}^{N} \frac{e^{l}}{1+e^{l}} \\
&= -\sum_{i=1}^{N} y_i+N\,\frac{e^{l}}{1+e^{l}} = 0
\end{aligned}$$
We get ($p$ here is the empirical probability of $y=1$):
$$\begin{aligned}
\frac{e^{l}}{1+e^{l}} &= \frac{\sum_{i=1}^{N}y_i}{N}=p \\
e^{l} &= p+p\,e^{l} \\
(1-p)\,e^{l} &= p \\
e^{l} &= \frac{p}{1-p} \\
\log(odds) &= \log\Big(\frac{p}{1-p}\Big)
\end{aligned}$$

From here on, we simply write $\log(odds)$ as $l$.
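
For example, with the toy data set used later (4 samples with $y=1$ and 2 with $y=0$), this gives $l_0=\log\frac{4}{2}\approx 0.6931$ and $p_0=\frac{e^{l_0}}{1+e^{l_0}}=\frac{2}{3}$.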

Step 2.1

$$r_{im}=-\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$

$$=-\left[\frac{\partial \big(-(y_i\log(p)+(1-y_i)\log(1-p))\big)}{\partial F_{m-1}(x_i)}\right]_{F(x)=F_{m-1}(x)}$$

Carrying out the same transformation as before (writing the loss in terms of $l = F_{m-1}(x_i)$) and differentiating, we get

$$=y_i-\frac{e^{F_{m-1}(x_i)}}{1+e^{F_{m-1}(x_i)}}=y_i-p_{(m-1)i}$$
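
With the initial probability $p_0=\frac{2}{3}$ from Step 1, a positive sample therefore has residual $1-\frac{2}{3}=\frac{1}{3}$ and a negative sample has residual $0-\frac{2}{3}=-\frac{2}{3}$; these are exactly the residuals that appear in the tables below.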

Step 2.3:

$$\gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}L(y_i, F_{m-1}(x_i)+\gamma)$$

Substituting in the loss function:

$$\begin{aligned}
\gamma_{jm} &= \underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}L(y_i, F_{m-1}(x_i)+\gamma) \\
&= \underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}\Big(-y_i\,(F_{m-1}+\gamma)+\log(1+e^{F_{m-1}+\gamma})\Big)
\end{aligned}$$

Let's work on the term inside the sum:

$$-y_i\,(F_{m-1}+\gamma)+\log(1+e^{F_{m-1}+\gamma})$$

We approximate it with a second-order Taylor expansion:

$$L(y,F+\gamma) \approx L(y, F)+ \frac{d L(y, F)}{d F}\,\gamma+\frac{1}{2}\, \frac{d^2 L(y, F)}{d F^2}\,\gamma^2$$

Derivation
$$\begin{aligned}
&\because \frac{d L(y, F+\gamma)}{d\gamma} \approx \frac{d L(y, F)}{d F}+\frac{d^2 L(y, F)}{d F^2}\,\gamma = 0 \\
&\therefore \gamma=-\frac{\dfrac{d L(y, F)}{d F}}{\dfrac{d^2 L(y, F)}{d F^2}} \\
&\therefore \gamma = \frac{y-p}{\dfrac{d^2 \big(-y\,l + \log(1+e^l)\big)}{d l^2}} \\
&\therefore \gamma = \frac{y-p}{\dfrac{d \big(-y + \frac{e^l}{1+e^l}\big)}{d l}} = \frac{y-p}{\dfrac{d}{dl}\Big(\dfrac{e^l}{1+e^l}\Big)}
\end{aligned}$$
(Using the product rule $(ab)'=a'b+ab'$ with $a=e^l$ and $b=\frac{1}{1+e^l}$.)
$$\begin{aligned}
\therefore \gamma &= \frac{y-p}{\dfrac{d e^l}{dl}\cdot\dfrac{1}{1+e^l} + e^l\cdot\dfrac{d}{dl}\dfrac{1}{1+e^l}} \\
&= \frac{y-p}{\dfrac{e^l}{1+e^l}-\dfrac{(e^l)^2}{(1+e^l)^2}} \\
&= \frac{y-p}{\dfrac{e^l(1+e^l)-(e^l)^2}{(1+e^l)^2}} \\
&= \frac{y-p}{\dfrac{e^l}{(1+e^l)^2}} \\
&= \frac{y-p}{p\,(1-p)}
\end{aligned}$$

Summing over the samples in each leaf, we finally get $\gamma$ as follows:

$$\gamma = \frac{\sum (y-p)}{\sum p\,(1-p)}$$
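
In code, this per-leaf $\gamma$ is just a ratio of two sums. A small helper (hypothetical; the implementation below inlines the same computation) could look like this:

import numpy as np

def leaf_gamma(residuals, prev_probs):
    # gamma = sum(residuals) / sum(p * (1 - p)) over the samples of one leaf
    return np.sum(residuals) / np.sum(prev_probs * (1 - prev_probs))

# a leaf holding two samples, each with residual 1/3 and previous probability 2/3,
# gives (1/3 + 1/3) / (2 * 2/3 * 1/3) = 1.5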

Handwritten code

First let's create the data set as a table. Here we want to predict whether a person loves the movie Troll 2.

no name likes_popcorn age favorite_color loves_troll2
0 Alex 1 10 Blue 1
1 Brunei 1 90 Green 1
2 Candy 0 30 Blue 0
3 David 1 30 Red 0
4 Eric 0 30 Green 1
5 Felicity 0 10 Blue 1

Step 1: Calculate $l_0$, $p_0$, $f_0$.

import numpy as np

# labels from the loves_troll2 column of the table above
y = np.array([1, 1, 0, 0, 1, 1])
n_samples = 6

log_of_odds0 = np.log(4 / 2)   # 4 people love Troll 2, 2 do not
probability0 = np.exp(log_of_odds0) / (np.exp(log_of_odds0) + 1)
print(f'the log_of_odds is : {log_of_odds0}')
print(f'the probability is : {probability0}')
predict0 = 1
print(f'the prediction is : {predict0}')
loss0 = np.sum(-(y * np.log(probability0) + (1 - y) * np.log(1 - probability0)))


Output

the log_of_odds is : 0.6931471805599453
the probability is : 0.6666666666666666
the prediction is : 1

Step 2

Let's first define a function, which we call iteration. Running all the iterations is then just a for loop over it; splitting it out like this makes each step easier to follow.
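
The function assumes that the data frame and a few bookkeeping arrays already exist. Those setup cells are not shown in the original post, so here is a minimal sketch of what they might look like; it reuses log_of_odds0 and probability0 from Step 1, the learning rate of 0.8 is inferred from the printed outputs below ($0.693147 + 0.8 \times 1.5 = 1.893147$), and the one-hot encoding of favorite_color is my assumption.

import numpy as np
import pandas as pd
import graphviz
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from IPython.display import display

# the toy data set from the table above
df = pd.DataFrame({
    'name': ['Alex', 'Brunei', 'Candy', 'David', 'Eric', 'Felicity'],
    'likes_popcorn': [1, 1, 0, 1, 0, 0],
    'age': [10, 90, 30, 30, 30, 10],
    'favorite_color': ['Blue', 'Green', 'Blue', 'Red', 'Green', 'Blue'],
    'loves_troll2': [1, 1, 0, 0, 1, 1],
})
X = pd.get_dummies(df[['likes_popcorn', 'age', 'favorite_color']])  # assumption: one-hot encode color
y = df['loves_troll2'].values

n_iterations = 5
n_samples = len(y)
learning_rate = 0.8   # inferred from the printed log_of_odds updates

# bookkeeping arrays indexed by iteration, filled in by iteration(i)
log_of_odds = np.zeros((n_iterations + 1, n_samples))
probabilities = np.zeros((n_iterations + 1, n_samples))
predictions = np.zeros((n_iterations + 1, n_samples))
residuals = np.zeros((n_iterations, n_samples))
gamma = np.zeros((n_iterations, n_samples))
gamma_value = np.zeros((n_iterations, 2))   # a depth-1 stump has at most two leaves
score = np.zeros(n_iterations + 1)
loss = np.zeros(n_iterations + 1)
trees = []

# iteration 0 starts from the constant model of Step 1
log_of_odds[0] = log_of_odds0
probabilities[0] = probability0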

def iteration(i):
    # Step 2.1: calculate the pseudo residuals
    residuals[i] = y - probabilities[i]
    # Step 2.2: fit a regression tree (a stump with at most 3 leaves) to the residuals
    dt = DecisionTreeRegressor(max_depth=1, max_leaf_nodes=3)
    dt = dt.fit(X, residuals[i])
    trees.append(dt.tree_)
    # Step 2.3: calculate gamma for each leaf
    leaf_indeces = dt.apply(X)          # leaf node id of each sample
    print(leaf_indeces)
    unique_leaves = np.unique(leaf_indeces)
    n_leaf = len(unique_leaves)
    for ileaf in range(n_leaf):
        leaf_index = unique_leaves[ileaf]
        n_leaf_samples = len(leaf_indeces[leaf_indeces == leaf_index])
        previous_probability = probabilities[i][leaf_indeces == leaf_index]
        denominator = np.sum(previous_probability * (1 - previous_probability))
        # tree_.value holds the mean residual of each node; node 0 is the root, so leaves start at 1
        igamma = dt.tree_.value[ileaf + 1][0][0] * n_leaf_samples / denominator
        gamma_value[i][ileaf] = igamma
        print(f'for leaf {leaf_index}, we have {n_leaf_samples} related samples. and gamma is {igamma}')
    # map every sample to the gamma of its leaf
    gamma[i] = np.array([gamma_value[i][np.where(unique_leaves == index)[0][0]] for index in leaf_indeces])
    # Step 2.4: update F(x), i.e. the log of odds, then the probabilities and predictions
    log_of_odds[i + 1] = log_of_odds[i] + learning_rate * gamma[i]
    probabilities[i + 1] = np.array([np.exp(odds) / (np.exp(odds) + 1) for odds in log_of_odds[i + 1]])
    predictions[i + 1] = (probabilities[i + 1] > 0.5) * 1.0
    score[i + 1] = np.sum(predictions[i + 1] == y) / n_samples
    loss[i + 1] = np.sum(-y * log_of_odds[i + 1] + np.log(1 + np.exp(log_of_odds[i + 1])))
    # show the intermediate values of this iteration
    new_df = df.copy()
    new_df.columns = ['name', 'popcorn', 'age', 'color', 'y']
    new_df[f'$p_{i}$'] = probabilities[i]
    new_df[f'$l_{i}$'] = log_of_odds[i]
    new_df[f'$r_{i}$'] = residuals[i]
    new_df[rf'$\gamma_{i}$'] = gamma[i]
    new_df[f'$l_{i+1}$'] = log_of_odds[i + 1]
    new_df[f'$p_{i+1}$'] = probabilities[i + 1]
    display(new_df)
    # visualize the fitted tree
    dot_data = tree.export_graphviz(dt, out_file=None, filled=True, rounded=True, feature_names=X.columns)
    graph = graphviz.Source(dot_data)
    display(graph)


Iteration 0

iteration(0)


Output:

[1 2 2 2 2 1]
for leaf 1, we have 2 related samples. and gamma is 1.5
for leaf 2, we have 4 related samples. and gamma is -0.7499999999999998

no name popcorn age color y 𝑝0 𝑙0 𝑟0 𝛾0 𝑙1 𝑝1
0 Alex 1 10 Blue 1 0.666667 0.693147 0.333333 1.50 1.893147 0.869114
1 Brunei 1 90 Green 1 0.666667 0.693147 0.333333 -0.75 0.093147 0.523270
2 Candy 0 30 Blue 0 0.666667 0.693147 -0.666667 -0.75 0.093147 0.523270
3 David 1 30 Red 0 0.666667 0.693147 -0.666667 -0.75 0.093147 0.523270
4 Eric 0 30 Green 1 0.666667 0.693147 0.333333 -0.75 0.093147 0.523270
5 Felicity 0 10 Blue 1 0.666667 0.693147 0.333333 1.50 1.893147 0.869114

Let's look at each step separately

Step 2.1: calculate the residuals $y - p_0$.

Step 2.2: fit a regression tree.

Step 2.3: calculate $\gamma$.

• For leaf 1, we have two samples (Alex and Felicity). $\gamma$ is: $(1/3+1/3)/((1-2/3)\cdot 2/3+(1-2/3)\cdot 2/3)=1.5$

• For leaf 2, we have four samples. $\gamma$ is: $(1/3-2/3-2/3+1/3)/(4\cdot(1-2/3)\cdot 2/3)=-0.75$

Step 2.4: update $F(x)$.
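
We can check the update by hand: for Alex (leaf 1), $l_1 = 0.693147 + 0.8\times 1.5 = 1.893147$ and $p_1=\frac{e^{1.893147}}{1+e^{1.893147}}\approx 0.869$; for the leaf-2 samples, $l_1 = 0.693147 + 0.8\times(-0.75) = 0.093147$ and $p_1\approx 0.523$, which matches the table above (with learning rate 0.8).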

Iteration 1

iteration(1)


Output:

[1 2 1 1 1 1]
for leaf 1, we have 5 related samples. and gamma is -0.31564962030401844
for leaf 2, we have 1 related samples. and gamma is 1.9110594001952543

name popcorn age color y 𝑝1 𝑙1 𝑟1 𝛾1 𝑙2 𝑝2
0 Alex 1 10 Blue 1 0.869114 1.893147 0.130886 -0.315650 1.640627 0.837620
1 Brunei 1 90 Green 1 0.523270 0.093147 0.476730 1.911059 1.621995 0.835070
2 Candy 0 30 Blue 0 0.523270 0.093147 -0.523270 -0.315650 -0.159373 0.460241
3 David 1 30 Red 0 0.523270 0.093147 -0.523270 -0.315650 -0.159373 0.460241
4 Eric 0 30 Green 1 0.523270 0.093147 0.476730 -0.315650 -0.159373 0.460241
5 Felicity 0 10 Blue 1 0.869114 1.893147 0.130886 -0.315650 1.640627 0.837620

For leaf 1, there are 5 samples. $\gamma$ is:

$(0.130886-0.523270-0.523270+0.476730+0.130886)/(2\times 0.869114\times(1-0.869114)+3\times 0.523270\times(1-0.523270))=-0.3156498224562022$

For leaf 2, there is 1 sample. $\gamma$ is:

$0.476730/(0.523270\times(1-0.523270))=1.9110593001700842$

Iteration 2

iteration(2)


Output

no name popcorn age color y 𝑝2 𝑙2 𝑟2 𝛾2 𝑙3 𝑝3
0 Alex 1 10 Blue 1 0.837620 1.640627 0.162380 1.193858 2.595714 0.930585
1 Brunei 1 90 Green 1 0.835070 1.621995 0.164930 -0.244390 1.426483 0.806353
2 Candy 0 30 Blue 0 0.460241 -0.159373 -0.460241 -0.244390 -0.354885 0.412198
3 David 1 30 Red 0 0.460241 -0.159373 -0.460241 -0.244390 -0.354885 0.412198
4 Eric 0 30 Green 1 0.460241 -0.159373 0.539759 -0.244390 -0.354885 0.412198
5 Felicity 0 10 Blue 1 0.837620 1.640627 0.162380 1.193858 2.595714 0.930585

Iterations 3 and 4 are omitted here ...

iteration(3)
iteration(4)


Accuracy: (plot of accuracy over the iterations, not included here)

Loss: (plot of loss over the iterations, not included here)
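
A minimal matplotlib sketch that could draw these two curves from the score and loss arrays filled in by iteration() (assuming the setup sketched earlier) might look like this:

import matplotlib.pyplot as plt

score[0] = np.sum((probabilities[0] >= 0.5) == y) / n_samples   # accuracy of the initial constant model
loss[0] = loss0                                                  # loss of the initial constant model from Step 1

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(range(len(score)), score, marker='o')
ax1.set_xlabel('iteration'); ax1.set_ylabel('accuracy'); ax1.set_title('Accuracy')
ax2.plot(range(len(loss)), loss, marker='o')
ax2.set_xlabel('iteration'); ax2.set_ylabel('loss'); ax2.set_title('Loss')
plt.show()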