Pan Chuang AI
2020-11-13 12:52:33

python
machine
learning
algorithm
linear

author |Vagif Aliyev compile |VK source |Towards Data Science

Linear regression is probably one of the most common algorithms , Linear regression is what machine learning practitioners must know . This is usually the first time a beginner comes into contact with machine learning algorithms , Understanding how it works is essential to better understand it .

therefore , In short , Let's break down the real problem ： What is linear regression ？

Linear regression is a supervised learning algorithm , The purpose of this paper is to use a linear method to model the relationship between dependent variables and independent variables . let me put it another way , Its goal is to fit a linear trend line that best captures data relationships , also , From this line , It can predict what the target value might be .

Great , I know the definition of it , But how does it work ？ Good question ！ To answer this question , Let's take a closer look at how linear regression works ：

Fitting data （ As shown in the figure above ）.

Calculate the distance between points （ The red dots on the picture are dots , The green line is the distance ）, And then square it , Then sum it （ These values are squared , To ensure that negative values do not produce incorrect values and hinder the calculation of ）. This is the error of the algorithm , Or better known as residual

Store the residuals of iterations

Based on an optimization algorithm , Make the line a little bit “ Move ”, So that the line can better fit the data .

Repeat step 2-5, Until the desired result is achieved , Or the residual error is reduced to zero .

This method of fitting straight lines is called the least square method .

If you already understand, please feel free to skip this section

The linear regression algorithm is as follows ：

It can be simplified as ：

The following algorithm will basically complete the following operations ：

- Accept one Y vector （ Your data tags ,（ housing price , Stock price , wait …）

This is your target vector , It will be used later to evaluate your data （ I'll talk about it in detail later ）.

- matrix X（ The characteristics of the data ）：

This is the characteristic of the data , Age 、 Gender 、 Gender 、 Height, etc . This is the data that the algorithm actually uses to predict . Notice how there is a feature 0. This is called the intercept term , And always equal to 1.

- Take a weight vector , And transpose it ：

This is the magic of the algorithm . All the eigenvectors are multiplied by these weights . This is called dot product . actually , You will try to find the best combination of these values for a given dataset . This is called optimization .

- Get the output vector ：

This is the prediction vector from the data . then , You can use the cost function to evaluate the performance of the model .

This is basically the whole algorithm expressed mathematically . Now you should have a solid understanding of the function of linear regression . But the problem is , What is an optimization algorithm ？ How do we choose the best weight ？ How we evaluate performance ？

The cost function is essentially a formula , Used to measure the loss of a model or “ cost ”. If you've ever been to any Kaggle match , You may have come across some . Some common methods include ：

Mean square error

Root mean square error

Mean absolute error

These functions are essential for model training and development , Because they answer “ How well does my model predict new instances ” This basic question ？”. Remember that , Because it's about our next topic .

Optimization is usually defined as improving something , The process of bringing it to its full potential . This also applies to machine learning . stay ML In the world of , Optimization is essentially trying to find the best combination of parameters for a data set . It's basically machine learning “ Study ” part .

I'll talk about the two most common algorithms ： Gradient descent method and standard equation .

Gradient descent is an optimization algorithm , To find the minimum value of a function . It does this by iteratively taking steps in the negative direction of the gradient . In our case , Gradient descent constantly updates the weight by moving the slope of the function tangent .

To better illustrate the gradient descent , Let's take a simple example . Imagine a man on the top of a mountain , He / She wants to climb to the bottom of the mountain . What they might do is look around , See which way to take a step , To get down faster . then , They may take a step in this direction , Now they're closer to the goal . However , They have to be careful when they come down , Because they can get stuck at some point , So we have to make sure that we choose our step size accordingly .

Again , The goal of gradient descent is to minimize the function . In our case , This is to minimize the cost of our model . It does this by finding the tangent of the function and moving in that direction . Algorithm “ step ” The size of is defined by the known learning rate . This basically controls the distance we move down . Use this parameter , We have to pay attention to two situations ：

Too fast to learn , The algorithm may not converge （ Minimum value reached ） And bounce around the minimum , But it will never reach that value

The learning rate is too low , The algorithm will take too long to reach the minimum , It may be “ card ” On a secondary advantage .

We have another parameter , It controls how many times the algorithm iterates the dataset .

Visually , The algorithm will do the following ：

Because this algorithm is very important for machine learning , Let's take a look at what it does ：

Random initialization weights . This is called random initialization

then , The model uses these random weights to predict

The prediction of the model is evaluated by the cost function

Then the model runs gradient down , Find the tangent of the function , And then take a step on the slope of the tangent

The process will repeat N Sub iteration , Or if a condition is met .

It is possible to reduce the cost function to a global minimum （ Very close to or =0）

One of the most effective optimization algorithms

It can be slow on large datasets , Because it uses the entire dataset to calculate the gradient of the function tangent

It's easy to fall into a secondary advantage （ Or local minima ）

The user must manually select the learning rate and the number of iterations , It can be time-consuming

Now that we've introduced gradient descent , Now let's introduce the standard equation .

If we go back to our example , Instead of taking the next step , We will be able to get to the bottom immediately . That's what the standard equation looks like . It uses linear algebra to generate weights , It can produce as good a result as a gradient descent in a very short time .

There is no need to choose the learning rate or the number of iterations

Very fast

It doesn't scale well to large datasets

Tends to produce good weight , But not the best weight

This is an important preprocessing step for many machine learning algorithms , Especially those that use distance measurement and calculation （ Such as linear regression and gradient descent ） The algorithm of . It's essentially scaling our features , Make them in similar ranges . Think of it as a house , A scale model of a house . The shape of the two is the same （ They are all houses ）, But the size is different （5 rice ！=500 rice ）. We do this for the following reasons ：

It speeds up the algorithm

Some algorithms are scale sensitive . In other words , If features have different scales , It is possible to give a higher weight to a feature with a higher order of magnitude . This will affect the performance of machine learning algorithms , obviously , We don't want our algorithm to be biased towards a feature .

To demonstrate this , Suppose we have three characteristics , Named as A、B and C：

- Before zooming AB distance =>

- Before zooming BC distance =>

- After zooming AB distance =>

- After zooming BC Distance of =>

We can see clearly that , These features are more comparable and unbiased than before scaling .

ok , Now the moment you've been waiting for ; Realization ！

Be careful ： All the code can be taken from this Github repo download . however , I suggest you follow the tutorial before you do this , Because then you'll have a better understanding of what code you're actually writing ：

First , Let's do some basic import ：

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
```

Yes , That's all you need to import ！ We use numpy As a mathematical realization ,matplotlib Used to draw graphics , as well as scikitlearn Of boston Data sets .

```
# Load and split data
data = load_boston()
X,y = data['data'],data['target']
```

Next , Let's create a custom train_test_split function , Split our data into a training and test set ：

```
# Split training and test sets
def train_test_divide(X,y,test_size=0.3,random_state=42):
np.random.seed(random_state)
train_size = 1 - test_size
arr_rand = np.random.rand(X.shape[0])
split = arr_rand < np.percentile(arr_rand,(100*train_size))
X_train = X[split]
y_train = y[split]
X_test = X[~split]
y_test = y[~split]
return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test = train_test_divide(X,y,test_size=0.3,random_state=42)
```

Basically , We're doing

Get the test set size .

Set a random seed , To ensure our results and repeatability .

According to the test set size, get the training set size

Random sampling from our features

The randomly selected instances are divided into training set and test set

We will achieve MSE Or mean square error , A common cost function for regression tasks ：

```
def mse(preds,y):
m = len(y)
return 1/(m) * np.sum(np.square((y - preds)))
```

M It refers to the number of training instances

yi It refers to an instance of our tag vector

preds It's our prediction

In order to write clean 、 Repeatable and efficient code , And follow software development practices , We're going to create a class of linear regression ：

```
class LinReg:
def __init__(self,X,y):
self.X = X
self.y = y
self.m = len(y)
self.bgd = False
```

- bgd It's a parameter , It defines whether we should use batch gradient descent .

Now we're going to create a method to add a intercept entry ：

```
def add_intercept_term(self,X):
X = np.insert(X,1,np.ones(X.shape[0:1]),axis=1).copy()
return X
```

This is basically inserting a column at the beginning of our feature . It's just for matrix multiplication .

If we don't add that , So we're going to force the hyperplane through the origin , And it tilts so much , So the data can't be fitted correctly

Zoom our features ：

```
def feature_scale(self,X):
X = (X - X.mean()) / (X.std())
return X
```

Next , We're going to initialize the weights randomly ：

```
def initialise_thetas(self):
np.random.seed(42)
self.thetas = np.random.rand(self.X.shape[1])
```

Now? , We will use the following formula to write standard equations from scratch ：

```
def normal_equation(self):
A = np.linalg.inv(np.dot(self.X.T,self.X))
B = np.dot(self.X.T,self.y)
thetas = np.dot(A,B)
return thetas
```

Basically , We divide the algorithm into three parts ：

We got X After transposition and X The inverse of the dot product of

We get the dot product of the weight and the label

We get the dot product of two calculated values

This is the standard equation ！ Not bad ！ Now? , We will use the following formula to achieve batch gradient descent ：

```
def batch_gradient_descent(self,alpha,n_iterations):
self.cost_history = [0] * (n_iterations)
self.n_iterations = n_iterations
for i in range(n_iterations):
h = np.dot(self.X,self.thetas.T)
gradient = alpha * (1/self.m) * ((h - self.y)).dot(self.X)
self.thetas = self.thetas - gradient
self.cost_history[i] = mse(np.dot(self.X,self.thetas.T),self.y)
return self.thetas
```

ad locum , We do the following ：

We set up alpha, Or the learning rate , And the number of iterations

We create a list to store our cost function history , In order to draw in the line chart later

loop n_iterations Time ,

We get predictions , And calculate the gradient （ The slope of the tangent to the function ）.

We update the weights to move in the negative direction of the gradient

We use our custom MSE Function records values

repeat , After completion , Return results

Let's define a fitting function for our data ：

```
def fit(self,bgd=False,alpha=0.158,n_iterations=4000):
self.X = self.add_intercept_term(self.X)
self.X = self.feature_scale(self.X)
if bgd == False:
self.thetas = self.normal_equation()
else:
self.bgd = True
self.initialise_thetas()
self.thetas = self.batch_gradient_descent(alpha,n_iterations)
```

ad locum , We just need to check whether users need gradient descent , And follow our steps accordingly .

Let's build a function to plot the cost function ：

```
def plot_cost_function(self):
if self.bgd == True:
plt.plot(range((self.n_iterations)),self.cost_history)
plt.xlabel('No. of iterations')
plt.ylabel('Cost Function')
plt.title('Gradient Descent Cost Function Line Plot')
plt.show()
else:
print('Batch Gradient Descent was not used!')
```

The last way to predict unmarked instances ：

```
def predict(self,X_test):
self.X_test = X_test.copy()
self.X_test = self.add_intercept_term(self.X_test)
self.X_test = self.feature_scale(self.X_test)
predictions = np.dot(self.X_test,self.thetas.T)
return predictions
```

Now? , Let's see which optimization produces better results . First , Let's try gradient descent ：

```
lin_reg_bgd = LinReg(X_train,y_train)
lin_reg_bgd.fit(bgd=True)
mse(y_test,lin_reg_bgd.predict(X_test))
OUT:
28.824024414708344
```

Let's draw our function , See how the cost function is reduced ：

So we can see , At about 1000 Times of iteration , It's starting to converge .

Now the standard equation is ：

```
lin_reg_normal = LinReg(X_train,y_train)
lin_reg_normal.fit()
mse(y_test,lin_reg_normal.predict(X_test))
OUT:
22.151417764247284
```

So we can see , The performance of the standard equation is slightly better than that of the gradient descent method . This may be because the data set is very small , And we didn't choose the best parameter for the learning rate .

Greatly improve the learning rate . What's going to happen ？

Do not apply feature scaling . Is there a difference ？

Try to study , See if you can implement a better optimization algorithm . Evaluate your model in the test set

It's really interesting to write this article , Although it's a little long , But I hope you learned something today .

Link to the original text ：https://towardsdatascience.com/machine-learning-algorithms-from-start-to-finish-in-python-linear-regression-aa8c1d6b1169

Welcome to join us AI Blog station ： http://panchuang.net/

sklearn Machine learning Chinese official documents ： http://sklearn123.com/

Welcome to pay attention to pan Chuang blog resource summary station ： http://docs.panchuang.net/

- 利用Python爬虫获取招聘网站职位信息
- Using Python crawler to obtain job information of recruitment website
- Several highly rated Python libraries arrow, jsonpath, psutil and tenacity are recommended
- Python装饰器
- Python实现LDAP认证
- Python decorator
- Implementing LDAP authentication with Python
- Vscode configures Python development environment!
- In Python, how dare you say you can't log module? ️
- 我收藏的有关Python的电子书和资料
- python 中 lambda的一些tips
- python中字典的一些tips
- python 用生成器生成斐波那契数列
- python脚本转pyc踩了个坑。。。
- My collection of e-books and materials about Python
- Some tips of lambda in Python
- Some tips of dictionary in Python
- Using Python generator to generate Fibonacci sequence
- The conversion of Python script to PyC stepped on a pit...
- Python游戏开发，pygame模块，Python实现扫雷小游戏
- Python game development, pyGame module, python implementation of minesweeping games
- Python实用工具，email模块，Python实现邮件远程控制自己电脑
- Python utility, email module, python realizes mail remote control of its own computer
- 毫无头绪的自学Python，你可能连门槛都摸不到！【最佳学习路线】
- Python读取二进制文件代码方法解析
- Python字典的实现原理
- Without a clue, you may not even touch the threshold【 Best learning route]
- Parsing method of Python reading binary file code
- Implementation principle of Python dictionary
- You must know the function of pandas to parse JSON data - JSON_ normalize()
- Python实用案例，私人定制，Python自动化生成爱豆专属2021日历
- Python practical case, private customization, python automatic generation of Adu exclusive 2021 calendar
- 《Python实例》震惊了，用Python这么简单实现了聊天系统的脏话，广告检测
- "Python instance" was shocked and realized the dirty words and advertisement detection of the chat system in Python
- Convolutional neural network processing sequence for Python deep learning
- Python data structure and algorithm (1) -- enum type enum
- 超全大厂算法岗百问百答（推荐系统/机器学习/深度学习/C++/Spark/python）
- 【Python进阶】你真的明白NumPy中的ndarray吗？
- All questions and answers for algorithm posts of super large factories (recommended system / machine learning / deep learning / C + + / spark / Python)
- [advanced Python] do you really understand ndarray in numpy?
- 【Python进阶】Python进阶专栏栏主自述：不忘初心，砥砺前行
- [advanced Python] Python advanced column main readme: never forget the original intention and forge ahead
- python垃圾回收和缓存管理
- java调用Python程序
- java调用Python程序
- Python常用函数有哪些?Python基础入门课程
- Python garbage collection and cache management
- Java calling Python program
- Java calling Python program
- What functions are commonly used in Python? Introduction to Python Basics
- Python basic knowledge
- Anaconda5.2 安装 Python 库（MySQLdb）的方法
- Python实现对脑电数据情绪分析
- Anaconda 5.2 method of installing Python Library (mysqldb)
- Python implements emotion analysis of EEG data
- Master some advanced usage of Python in 30 seconds, which makes others envy it
- python爬取百度图片并对图片做一系列处理
- Python crawls Baidu pictures and does a series of processing on them
- python链接mysql数据库
- Python link MySQL database