## A series of handwriting algorithms based on Python machine learning -- gbdt gradient lifting classification

There are several evidences 2020-11-13 12:41:19
series handwriting algorithms based python

Gradient rise （Gradient Boosting） Train a series of weak learners （learners）, Each learner has pseudo residuals for the previous learner （ instead of y）, In order to improve the performance of the algorithm （performance）.

This is how Wikipedia describes gradient ascension

Gradient rise （ Gradient enhancement ） A classification problem for machine learning , The prediction model produced by this method is the integration of weak prediction models , For example, a typical decision tree is used As a weak prediction model , At this point, it's a gradient tree （GBT or GBDT）. Like any other method of ascension , It builds the model in phases , But it is an extension of the general lifting method by allowing optimization of any differentiable loss function .

Necessary knowledge

1. Logical regression
2. Linear regression
4. Decision tree


1. How gradient promotion is applied to classification
2. Handwritten gradient classification from zero


# Algorithm

The following figure shows the gradient lifting classification algorithm well

( The picture is from Youtube channel StatQuest)

The first part of the picture above is a stump , Its value is log of odds of y, We remember it as l l . And then there are some trees . The trees in the back , When training them , Their goal is not y, It is y Residual of .

remnant Bad = really real value − pre measuring value residual = True value - Predictive value

This picture here , It's a little more complicated than gradient lift regression . here , The green part is the residual , The red one is called γ \gamma , The black one is the learning rate . The residuals here are not simply averaged γ \gamma . γ \gamma Used to update l l .

technological process

Step 1: Calculation log of odds l 0 l_0 . Or this is y The first prediction of . here n 1 n_1 yes y=1 The number of , n 0 n_0 yes y=0 The number of .

l 0 ( x ) = log ⁡ n 1 n 0 l_0(x)=\log \frac{n_1}{n_0}

For each x i x_i , The probability is :
p 0 i = e l 0 i 1 + e l 0 i p_{0i}=\frac{e^{l_{0i}}}{1+e^{l_{0i}}}

The forecast is :

f 0 i = { 0 p 0 i < 0.5 1 p 0 i > = 0.5 f_{0i}=\begin{cases} 0 & p_{0i}<0.5 \\ 1 & p_{0i}>=0.5 \end{cases}

Step 2 for m in 1 to M:

• Step 2.1: Calculate the so-called pseudo residuals ：

r i m = f i − p i r_{im}=f_i-p_i

• Step 2.2: Fitting a regression tree with pseudo residuals t m ( x ) t_m(x) , And identify the final leaf node R j m R_{jm} for j = 1... J m j=1...Jm

• Step 2.3: Calculate the... Of each leaf node γ \gamma

γ i m = ∑ r i m ∑ ( 1 − r i m − 1 ) ( r i m − 1 ) \gamma_{im}=\frac{\sum r_{im}}{\sum (1-r_{im-1})(r_{im-1})}

• Step 2.4: to update l l , p p , f f . α \alpha It's the learning rate :
l m ( x ) = l m − 1 + α γ m l_m(x)=l_{m-1}+\alpha \gamma_m

p m i = e l m i 1 + e l m i p_{mi}=\frac{e^{l_{mi}}}{1+e^{l_{mi}}}

f m i = { 0 p m i < 0.5 1 p m i > = 0.5 f_{mi}=\begin{cases} 0 & p_{mi}<0.5 \\ 1 & p_{mi}>=0.5 \end{cases}
Step 3. Output f M ( x ) f_M(x)

# (Optional) The classification of gradient regression is derived from gradient regression

The above knowledge of simplifying the process , For handwritten gradient lifting classification algorithm , That's enough . If there is spare force , Gradient and I can rise together （GB） The gradient lifting classification is deduced （GBC）

First let's look at GB Steps for

Input : Training data { ( x i , y i ) } i = 1 n \{(x_i, y_i)\}_{i=1}^{n} , A differentiable loss function L ( y , F ( x ) ) L(y, F(x)) , cycles M.

Algorithm :

Step 1: With a constant F 0 ( x ) F_0(x) Start the algorithm , This constant satisfies the following formula ：

F 0 ( x ) = argmin ⁡ γ ∑ i = 1 n L ( y i , γ ) F_0(x)=\underset{\gamma}{\operatorname{argmin}}\sum_{i=1}^{n}L(y_i, \gamma)

Step 2: for m in 1 to M:

• Step 2.1: Calculate the pseudo residuals （pseudo-residuals）:

r i m = − [ ∂ L ( y i , F ( x i ) ) ∂ F ( x i ) ] F ( x ) = F m − 1 ( x ) r_{im}=-[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}]_{F(x)=F_{m-1}(x)}

• Step 2.2: Fitting weak learners with pseudo residuals h m ( x ) h_m(x) , Set up a destination area R j m ( j = 1... J m ) R_{jm}(j=1...J_m)

• Step 2.3: For each end area （ That is, every leaf ）, Calculation γ \gamma

γ j m = argmin ⁡ γ ∑ x i ∈ R j m n L ( y i , F m − 1 ( x i ) + γ ) \gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}^{n}L(y_i, F_{m-1}(x_i)+\gamma)

• Step 2.4: Update algorithm （ Learning rate α \alpha ） :
F m ( x ) = F m − 1 + α γ m F_m(x)=F_{m-1}+\alpha\gamma_m

Step 3. Output algorithm F M ( x ) F_M(x)

## Loss function

From gradient lifting deduction to gradient lifting classification , We need a loss function , And bring in Step 1, Step 2.1 and Step 2.3. here , We use it Log of Likelihood As a loss function .

L ( y , F ( x ) ) = − ∑ i = 1 N ( y i ∗ l o g ( p ) + ( 1 − y i ) ∗ l o g ( 1 − p ) ) L(y, F(x))=-\sum_{i=1}^{N}(y_i* log(p) + (1-y_i)*log(1-p))

This is about probability p Function of , It's not about log of odds （l） Function of , So we need to deform it .

Let's take out the middle part
− ( y ∗ log ⁡ ( p ) + ( 1 − y ) ∗ log ⁡ ( 1 − p ) ) = − y ∗ log ⁡ ( p ) − ( 1 − y ) ∗ log ⁡ ( 1 − p ) = − y log ⁡ ( p ) − log ⁡ ( 1 − p ) + y log ⁡ ( 1 − p ) = − y ( log ⁡ ( p ) − log ⁡ ( 1 − p ) ) − log ⁡ ( 1 − p ) = − y ( log ⁡ ( p 1 − p ) ) − log ⁡ ( 1 − p ) = − y log ⁡ ( o d d s ) − log ⁡ ( 1 − p ) -(y*\log(p)+(1-y)*\log(1-p)) \\ =-y * \log(p) - (1-y) * \log(1-p) \\ =-y\log(p)-\log(1-p)+y\log(1-p) \\ =-y(\log(p)-\log(1-p))-\log(1-p) \\ =-y(\log(\frac{p}{1-p}))-\log(1-p) \\ =-y \log(odds)-\log(1-p)
because
log ⁡ ( 1 − p ) = l o g ( 1 − e l o g ( o d d s ) 1 + e l o g ( o d d s ) ) = log ⁡ ( 1 + e l 1 + e l − e l 1 + e l ) = log ⁡ ( 1 1 + e l ) = log ⁡ ( 1 ) + log ⁡ ( 1 + e l ) = − l o g ( 1 + e log ⁡ ( o d d s ) ) \log(1-p)=log(1-\frac{e^{log(odds)}}{1+e^{log(odds)}}) \\ =\log(\frac{1+e^l}{1+e^l}-\frac{e^l}{1+e^l})\\ =\log(\frac{1}{1+e^l}) \\ =\log(1)+\log(1+e^l) \\ =-log(1+e^{\log(odds)})

We brought in from above , obtain

− ( y ∗ log ⁡ ( p ) + ( 1 − y ) ∗ log ⁡ ( 1 − p ) ) = − y log ⁡ ( o d d s ) + log ⁡ ( 1 + e log ⁡ ( o d d s ) ) -(y*\log(p)+(1-y)*\log(1-p)) \\ =-y\log(odds)+\log(1+e^{\log(odds)}) \\

Last , We got the use of l l The loss function represented by

L = − ∑ i = 1 N ( y l − log ⁡ ( 1 + e l ) ) L=-\sum_{i=1}^{N}(yl-\log(1+e^l))

## Step 1:

In order to find the minimum of the loss function , We just need to find that its first derivative is equal to 0.
∂ L ( y , F 0 ) ∂ F 0 = − ∂ ∑ i = 1 N ( y log ⁡ ( o d d s ) − log ⁡ ( 1 + e log ⁡ ( o d d s ) ) ) ∂ l o g ( o d d s ) = − ∑ i = 1 n y i + ∑ i = 1 N ∂ l o g ( 1 + e l o g ( o d d s ) ) ∂ l o g ( o d d s ) = − ∑ i = 1 n y i + ∑ i = 1 N 1 1 + e log ⁡ ( o d d s ) ∂ ( 1 + e l ) ∂ l = − ∑ i = 1 n y i + ∑ i = 1 N 1 1 + e log ⁡ ( o d d s ) ∂ ( e l ) ∂ l = − ∑ i = 1 n y i + ∑ i = 1 N e l 1 + e l = − ∑ i = 1 n y i + N e l 1 + e l = 0 \frac{\partial L(y, F_0)}{\partial F_0} \\ =-\frac{\partial \sum_{i=1}^{N}(y\log(odds)-\log(1+e^{\log(odds)}))}{\partial log(odds)} \\ =-\sum_{i=1}^{n} y_i+\sum_{i=1}^{N} \frac{\partial log(1+e^{log(odds)})}{\partial log(odds)} \\ =-\sum_{i=1}^{n} y_i+\sum_{i=1}^{N} \frac{1}{1+e^{\log(odds)}} \frac{\partial (1+e^l)}{\partial l} \\ =-\sum_{i=1}^{n} y_i+\sum_{i=1}^{N} \frac{1}{1+e^{\log(odds)}} \frac{\partial (e^l)}{\partial l} \\ =-\sum_{i=1}^{n} y_i+\sum_{i=1}^{N} \frac{e^l}{1+e^l} \\ =-\sum_{i=1}^{n} y_i+N\frac{e^l}{1+e^l} =0
We get ( p It's the real probability )
e l 1 + e l = ∑ i = 1 N y i N = p e l = p + p ∗ e l ( 1 − p ) e l = p e l = p 1 − p log ⁡ ( o d d s ) = l o g ( p 1 − p ) \frac{e^l}{1+e^l}=\frac{\sum_{i=1}^{N}y_i}{N}=p \\ e^l=p+p*e^l \\ (1-p)e^l=p \\ e^l=\frac{p}{1-p} \\ \log(odds)=log(\frac{p}{1-p})

here , Let's just say l l .

## Step 2.1

r i m = − [ ∂ L ( y i , F ( x i ) ) ∂ F ( x i ) ] F ( x ) = F m − 1 ( x ) r_{im}=-[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}]_{F(x)=F_{m-1}(x)}

= − [ ∂ ( − ( y i ∗ l o g ( p ) + ( 1 − y i ) ∗ l o g ( 1 − p ) ) ) ∂ F m − 1 ( x i ) ] F ( x ) = F m − 1 ( x ) =-[\frac{\partial (-(y_i* log(p)+(1-y_i)*log(1-p)))}{\partial F_{m-1}(x_i)}]_{F(x)=F_{m-1}(x)}

Similarly , You can get

= y i − F m − 1 ( x i ) =y_i-F_{m-1}(x_i)

## Step 2.3:

γ j m = argmin ⁡ γ ∑ x i ∈ R j m n L ( y i , F m − 1 ( x i ) + γ ) \gamma_{jm}=\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}^{n}L(y_i, F_{m-1}(x_i)+\gamma)

Bring in the loss function ：

γ j m = argmin ⁡ γ ∑ x i ∈ R j m n L ( y i , F m − 1 ( x i ) + γ ) = argmin ⁡ γ ∑ x i ∈ R j m n ( − y i ∗ ( F m − 1 + γ ) + log ⁡ ( 1 + e F m − 1 + γ ) ) \gamma_{jm} \\ =\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}^{n}L(y_i, F_{m-1}(x_i)+\gamma) \\ =\underset{\gamma}{\operatorname{argmin}}\sum_{x_i \in R_{jm}}^{n} (-y_i * (F_{m-1}+\gamma)+\log(1+e^{F_{m-1}+\gamma})) \\

Let's solve the middle part .

− y i ∗ ( F m − 1 + γ ) + log ⁡ ( 1 + e F m − 1 + γ ) -y_i * (F_{m-1}+\gamma)+\log(1+e^{F_{m-1}+\gamma})

We use the second-order Taylor polynomials expansion ：

L ( y , F + γ ) ≈ L ( y , F ) + d L ( y , F + γ ) γ d F + 1 2 d 2 L ( y , F + γ ) γ 2 d 2 F L(y,F+\gamma) \approx L(y, F)+ \frac{d L(y, F+\gamma)\gamma}{d F}+\frac{1}{2} \frac{d^2 L(y, F+\gamma)\gamma^2}{d^2 F}

Derivation
∵ d L ( y , F + γ ) d γ ≈ d L ( y , F ) d F + d 2 L ( y , F ) γ d 2 F = 0 ∴ d L ( y , F ) d F + d 2 L ( y , F ) γ d 2 F = 0 ∴ γ = − d L ( y , F ) d F d 2 L ( y , F ) d 2 F ∴ γ = y − p d 2 ( − y ∗ l + log ⁡ ( 1 + e l ) ) d 2 l ∴ γ = y − p d ( − y + e l 1 + e l ) d l ∴ γ = y − p d e l 1 + e l d l \because \frac{d L(y, F+\gamma)}{d\gamma} \approx \frac{d L(y, F)}{d F}+\frac{d^2 L(y, F)\gamma}{d^2 F}=0 \\ \therefore \frac{d L(y, F)}{d F}+\frac{d^2 L(y, F)\gamma}{d^2 F}=0 \\ \therefore \gamma=-\frac{\frac{d L(y, F)}{d F}}{\frac{d^2 L(y, F)}{d^2 F}} \\ \therefore \gamma = \frac{y-p}{\frac{d^2 (-y * l + \log(1+e^l))}{d^2 l}} \\ \therefore \gamma = \frac{y-p}{\frac{d (-y + \frac{e^l}{1+e^l})}{d l}} \\ \therefore \gamma = \frac{y-p}{\frac{d \frac{e^l}{1+e^l}}{d l}} \\
( use product rule (ab)’=a’ b+a b’​)
∴ γ = y − p d e l d l ∗ 1 1 + e l − e l ∗ d d l 1 1 + e l = y − p e l 1 + e l − e l ∗ 1 ( 1 + e l ) 2 d d l ( 1 + e l ) = y − p e l 1 + e l − ( e l ) 2 ( 1 + e l ) 2 = y − p e l + ( e l ) 2 − + ( e l ) 2 = y − p e l ( 1 + e l ) 2 = y − p p ( 1 − p ) \therefore \gamma=\frac{y-p}{\frac{d e^l}{dl} * \frac{1}{1+e^l} - e^l * \frac{d }{d l} \frac{1}{1+e^l}} \\ =\frac{y-p}{\frac{e^l}{1+e^l}-e^l * \frac{1}{(1+e^l)^2} \frac{d}{dl} (1+e^l)} \\ =\frac{y-p}{\frac{e^l}{1+e^l}- \frac{(e^l)^2}{(1+e^l)^2}} \\ =\frac{y-p}{e^l+(e^l)^2-+(e^l)^2} \\ =\frac{y-p}{\frac{e^l}{(1+e^l)^2}} \\ =\frac{y-p}{p(1-p)}

Finally get γ \gamma as follows

γ = ∑ ( y − p ) ∑ p ( 1 − p ) \gamma = \frac{\sum (y-p)}{\sum p(1-p)}

# Handwritten code

First create a table , Here we have to predict whether a person likes movies 《Troll 2》.

no name likes_popcorn age favorite_color loves_troll2
0 Alex 1 10 Blue 1
1 Brunei 1 90 Green 1
2 Candy 0 30 Blue 0
3 David 1 30 Red 0
4 Eric 0 30 Green 1
5 Felicity 0 10 Blue 1

# Step 1 Calculation l 0 l_0 , p 0 p_0 , f 0 f_0

log_of_odds0=np.log(4 / 2)
probability0=np.exp(log_of_odds0)/(np.exp(log_of_odds0)+1)
print(f'the log_of_odds is : {log_of_odds0}')
print(f'the probability is : {probability0}')
predict0=1
print(f'the prediction is : 1')
n_samples=6
loss0=-(y*np.log(probability0)+(1-y)*np.log(1-probability0))


Output

the log_of_odds is : 0.6931471805599453
the probability is : 0.6666666666666666
the prediction is : 1

# Step 2

Let's define a function first , We call him iteration, To run a iteration It's just a run for loop . The purpose of taking it apart is to make the fight clear .

def iteration(i):
#step 2.1 calculate the residuals
residuals[i] = y - probabilities[i]
#step 2.2 Fit a regression tree
dt = DecisionTreeRegressor(max_depth=1, max_leaf_nodes=3)
dt=dt.fit(X, residuals[i])
trees.append(dt.tree_)
#Step 2.3 Calculate gamma
leaf_indeces=dt.apply(X)
print(leaf_indeces)
unique_leaves=np.unique(leaf_indeces)
n_leaf=len(unique_leaves)
#for leaf 1
for ileaf in range(n_leaf):
leaf_index=unique_leaves[ileaf]
n_leaf=len(leaf_indeces[leaf_indeces==leaf_index])
previous_probability = probabilities[i][leaf_indeces==leaf_index]
denominator = np.sum(previous_probability * (1-previous_probability))
igamma = dt.tree_.value[ileaf+1][0][0] * n_leaf / denominator
gamma_value[i][ileaf]=igamma
print(f'for leaf {leaf_index}, we have {n_leaf} related samples. and gamma is {igamma}')
gamma[i] = [gamma_value[i][np.where(unique_leaves==index)] for index in leaf_indeces]
#Step 2.4 Update F(x)
log_of_odds[i+1] = log_of_odds[i] + learning_rate * gamma[i]
probabilities[i+1] = np.array([np.exp(odds)/(np.exp(odds)+1) for odds in log_of_odds[i+1]])
predictions[i+1] = (probabilities[i+1]>0.5)*1.0
score[i+1]=np.sum(predictions[i+1]==y) / n_samples
#residuals[i+1] = y - probabilities[i+1]
loss[i+1]=np.sum(-y * log_of_odds[i+1] + np.log(1+np.exp(log_of_odds[i+1])))
new_df=df.copy()
new_df.columns=['name', 'popcorn','age','color','y']
new_df[f'$p_{i}$']=probabilities[i]
new_df[f'$l_{i}$']=log_of_odds[i]
new_df[f'$r_{i}$']=residuals[i]
new_df[f'$\gamma_{i}$']=gamma[i]
new_df[f'$l_{i+1}$']=log_of_odds[i+1]
new_df[f'$p_{i+1}$']=probabilities[i+1]
display(new_df)
dot_data = tree.export_graphviz(dt, out_file=None, filled=True, rounded=True,feature_names=X.columns)
graph = graphviz.Source(dot_data)
display(graph)


## Iteration 0

iteration(0)


Output ：

[1 2 2 2 2 1]
for leaf 1, we have 2 related samples. and gamma is 1.5
for leaf 2, we have 4 related samples. and gamma is -0.7499999999999998

no name popcorn age color y 𝑝0 𝑙0 𝑟0 𝛾0 𝑙1 𝑝1
0 Alex 1 10 Blue 1 0.666667 0.693147 0.333333 1.50 1.893147 0.869114
1 Brunei 1 90 Green 1 0.666667 0.693147 0.333333 -0.75 0.093147 0.523270
2 Candy 0 30 Blue 0 0.666667 0.693147 -0.666667 -0.75 0.093147 0.523270
3 David 1 30 Red 0 0.666667 0.693147 -0.666667 -0.75 0.093147 0.523270
4 Eric 0 30 Green 1 0.666667 0.693147 0.333333 -0.75 0.093147 0.523270
5 Felicity 0 10 Blue 1 0.666667 0.693147 0.333333 1.50 1.893147 0.869114

Let's look at each step separately

Step 2.1, Calculated residual error y − p 0 y-p_0 .

Step 2.2, Fit a regression tree .

Step 2.3, Calculation γ \gamma .

• For leaves 1, We have two samples (Alex and Felicity). γ \gamma yes : (1/3+1/3)/((1-2/3)*2/3+(1-2/3)*2/3)=1.5

• For leaves 2, We have four samples . γ \gamma yes :(1/3-2/3-2/3+1/3)/(4*(1-2/3)*2/3)=-0.75

Step 2.4, to update F(x).

## Iteration 1

iteration(1)


Output ：

[1 2 1 1 1 1]
for leaf 1, we have 5 related samples. and gamma is -0.31564962030401844
for leaf 2, we have 1 related samples. and gamma is 1.9110594001952543

name popcorn age color y 𝑝1 𝑙1 𝑟1 𝛾1 𝑙2 𝑝2
0 Alex 1 10 Blue 1 0.869114 1.893147 0.130886 -0.315650 1.640627 0.837620
1 Brunei 1 90 Green 1 0.523270 0.093147 0.476730 1.911059 1.621995 0.835070
2 Candy 0 30 Blue 0 0.523270 0.093147 -0.523270 -0.315650 -0.159373 0.460241
3 David 1 30 Red 0 0.523270 0.093147 -0.523270 -0.315650 -0.159373 0.460241
4 Eric 0 30 Green 1 0.523270 0.093147 0.476730 -0.315650 -0.159373 0.460241
5 Felicity 0 10 Blue 1 0.869114 1.893147 0.130886 -0.315650 1.640627 0.837620

For leaves 1, Yes 5 Samples . γ \gamma yes ：

(0.130886±0.523270±0.523270+0.476730+0.130886)/(20.869114(1-0.869114)+30.523270(1-0.523270))=-0.3156498224562022

For leaves 2, Yes 1 Samples . γ \gamma yes ：

0.476730/(0.523270*(1-0.523270))=1.9110593001700842

### Iteration 2

iteration(2)


Output

no name popcorn age color y 𝑝2 𝑙2 𝑟2 𝛾2 𝑙3 𝑝3
0 Alex 1 10 Blue 1 0.837620 1.640627 0.162380 1.193858 2.595714 0.930585
1 Brunei 1 90 Green 1 0.835070 1.621995 0.164930 -0.244390 1.426483 0.806353
2 Candy 0 30 Blue 0 0.460241 -0.159373 -0.460241 -0.244390 -0.354885 0.412198
3 David 1 30 Red 0 0.460241 -0.159373 -0.460241 -0.244390 -0.354885 0.412198
4 Eric 0 30 Green 1 0.460241 -0.159373 0.539759 -0.244390 -0.354885 0.412198
5 Felicity 0 10 Blue 1 0.837620 1.640627 0.162380 1.193858 2.595714 0.930585

Iteration 3 and 4 Omit ...

iteration(3)
iteration(4)


Accuracy :

Loss :