## Step-by-step implementation of the k-nearest neighbor algorithm in Python

Coke 2020-11-13 03:53:05

# I. The k-nearest neighbor algorithm

The k-nearest neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To make a prediction for a new data point, the algorithm finds the closest points in the training set, its "nearest neighbors".

What we implement here is a supervised-learning classification (binary classification) problem: we need to predict the category of each test sample.

# II. Implementation steps

## 1. Generate the data set

We import NumPy for convenient array manipulation and pyplot for plotting.

Explanation: `np.random.uniform(1,5,(50,2))` generates a 50x2 matrix whose elements are random numbers between 1 and 5. `x_data = np.concatenate([x_data1,x_data2])` merges the two matrices into one.

The final source data set is:

- `x_data` (the coordinates): 100 samples in total.
- `y_data` (the categories): also 100 values; each coordinate pair corresponds to one category.
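As a minimal standalone sketch of this step (the full script appears in the implementation section below), with the same shapes as in the article:

```python
import numpy as np

# Two 50x2 clusters of uniform random points, merged into one 100x2 matrix
x_data1 = np.random.uniform(1, 5, (50, 2))   # class-0 cluster, values in [1, 5)
x_data2 = np.random.uniform(3, 9, (50, 2))   # class-1 cluster, values in [3, 9)
x_data = np.concatenate([x_data1, x_data2])

# One label per coordinate pair: 50 zeros followed by 50 ones
y_data = np.sort([0, 1] * 50)

print(x_data.shape)  # (100, 2)
print(y_data.shape)  # (100,)
```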

## 2. Split the data set into a training set and a test set

Training set: `np.split(x_data, [75])` splits `x_data` into two parts at index 75; we take the part at index 0, i.e. the first 75 samples.

Test set: correspondingly, we take the part at index 1, i.e. the last 25 samples, as the test set.
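A small sketch of how `np.split` with an explicit index list produces the 75/25 split described above (using a stand-in array of 100 elements):

```python
import numpy as np

data = np.arange(100)               # stand-in for the 100 samples
train, test = np.split(data, [75])  # split at index 75 into two parts

print(len(train))  # 75
print(len(test))   # 25
```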

## 3. Build the model

1. Draw a scatter plot of the training set.  2. Run the k-NN procedure. Explanation: the blue dot is the point whose category we want to predict. We find the training points closest to it and use majority voting to decide its category; according to the plot, the blue dot ends up in the same category as the red dots, category 1.

The concrete algorithm (distance here means the Euclidean distance between two coordinate points): the final prediction is 1, the same category as the red dots. This completes the model.
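The distance-then-vote procedure can be illustrated on a tiny hand-made training set (the coordinates below are made up purely for illustration):

```python
from collections import Counter
from math import sqrt
import numpy as np

# Toy training set: three points of class 0, three of class 1 (made-up values)
x_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 8.0], [8.0, 7.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x = np.array([6.5, 7.0])  # point to classify

# Euclidean distance from x to every training point
distances = [sqrt(np.sum((x - x0) ** 2)) for x0 in x_train]
near = np.argsort(distances)  # indices sorted by distance, ascending
k = 3
topK_y = [y_train[i] for i in near[:k]]
# Majority vote among the k nearest neighbors
result = Counter(topK_y).most_common(1)[0][0]
print(result)  # 1
```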

## 4. Test the model's accuracy on the test set

Explanation: this simply repeats the previous steps for each sample in the test set, compares the predicted category with the true one, records the number of correct predictions, and finally divides by the total number of test samples to obtain the model's accuracy.

# III. Algorithm implementation

Note: the code blocks below are sequential; each one depends on the previous blocks.

```python
# The module that generates the data
import numpy as np
import matplotlib.pyplot as plt
# Generate the source data
x_data1 = np.random.uniform(1, 5, (50, 2))
x_data2 = np.random.uniform(3, 9, (50, 2))
x_data = np.concatenate([x_data1, x_data2])
# Categories of the source data (k-NN solves a supervised-learning classification problem)
y_data = np.sort([0, 1] * 50)
# Training set (first 75% of the source data)
x_train = np.split(x_data, [75])[0]
y_train = np.split(y_data, [75])[0]
# Test set (last 25% of the source data)
x_test = np.split(x_data, [75])[1]
y_test = np.split(y_data, [75])[1]
```
```python
# Plot the training set
plt.scatter(x_train[y_train == 0, 0], x_train[y_train == 0, 1], color='g', label="symbol 0")
plt.scatter(x_train[y_train == 1, 0], x_train[y_train == 1, 1], color='r', label="symbol 1")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
```
```python
# Add a new data point and judge whether its category is 1 or 0
x = np.array([4.1227123, 7.26324127])
plt.scatter(x_train[y_train == 0, 0], x_train[y_train == 0, 1], color='g', label="symbol 0")
plt.scatter(x_train[y_train == 1, 0], x_train[y_train == 1, 1], color='r', label="symbol 1")
plt.scatter(x[0], x[1], color='b', label="symbol ?")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
```
```python
# The k-NN procedure: compute the distance from x to every training point
# and store the distances in a list
from math import sqrt
distances = []
for x0 in x_train:
    d = sqrt(np.sum((x - x0) ** 2))
    distances.append(d)
# Sort the distances (ascending) and return the indices of the sorted elements
near = np.argsort(distances)
k = 5
# Categories of the k nearest elements
topK_y = [y_train[i] for i in near[:k]]
```
```python
from collections import Counter
# Count the number of votes each category received among the k nearest neighbors
votes = Counter(topK_y)
# Find the category with the most votes. most_common(1) returns a list with one
# (category, count) tuple; we only need the first element of that tuple (the category)
result = votes.most_common(1)[0][0]
```
```python
# Note: this reuses the data above, but restarts the steps, because we must
# traverse the test set one sample at a time and compare each prediction with
# the true category from the source data to obtain the accuracy
count = 0
index = 0
for j in x_test:
    distance = []
    # Distance from the test point to every training point
    for x1 in x_train:
        t = sqrt(np.sum((j - x1) ** 2))
        distance.append(t)
    near = np.argsort(distance)
    topK_y = [y_train[i] for i in near[:k]]
    # Majority vote among the k nearest neighbors
    predict = Counter(topK_y).most_common(1)[0][0]
    if predict == y_test[index]:
        count += 1
    index += 1
# Accuracy = correct predictions / total number of test samples
print(count / len(x_test))
```