k-nearest neighbors (k-NN) is arguably the simplest machine learning algorithm: to build the model, we only need to store the training set. To predict a new data point, the algorithm finds the closest points in the training set, its "nearest neighbors".
What we implement here is a supervised-learning classification (binary classification) problem: we need to predict the category of the test data.
Import NumPy for data manipulation and pyplot for plotting.
Explanation: np.random.uniform(1,5,(50,2)) generates a 50x2 matrix whose elements are random numbers between 1 and 5.
x_data = np.concatenate([x_data1,x_data2]) merges the two matrices into one.
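A minimal sketch of this data-generation step (the seeded generator is an addition here, used only so the example is reproducible; the tutorial's own code calls np.random.uniform directly):

```python
import numpy as np

# Two 50x2 blocks of uniform random points; the ranges (1,5) and (3,9)
# match the tutorial's setup, so the two classes partially overlap.
rng = np.random.default_rng(0)  # seeded only for reproducibility
x_data1 = rng.uniform(1, 5, (50, 2))
x_data2 = rng.uniform(3, 9, (50, 2))

# Stack the two blocks row-wise into one 100x2 dataset.
x_data = np.concatenate([x_data1, x_data2])

# Labels: the first 50 rows are class 0, the last 50 are class 1.
y_data = np.array([0] * 50 + [1] * 50)

print(x_data.shape)  # (100, 2)
```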
The final source dataset is:
x_data (the coordinates): 100 rows in total.
y_data (the categories): also 100 entries; each coordinate corresponds to one category.
Obtaining the training set:
Explanation: np.split(x_data,[75]) splits x_data into two parts at row 75; we take index 0, i.e. the first 75 rows.
Obtaining the test set:
Explanation: correspondingly, we take the remaining 25 rows as the test set.
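The train/test split described above can be sketched like this (the arange data is a stand-in for the real 100x2 dataset, just to show what np.split returns):

```python
import numpy as np

x_data = np.arange(200).reshape(100, 2)      # stand-in for the 100x2 dataset
y_data = np.array([0] * 50 + [1] * 50)

# np.split(arr, [75]) cuts the array at row 75 and returns a list of two
# pieces: rows 0-74 (the training set) and rows 75-99 (the test set).
x_train, x_test = np.split(x_data, [75])
y_train, y_test = np.split(y_data, [75])

print(x_train.shape, x_test.shape)  # (75, 2) (25, 2)
```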
1. Draw a scatter plot:
2. The k-NN process:
Explanation: the blue point is the data to be predicted. We find the categories of the training points closest to it and use a majority vote to decide the blue point's category. From the figure, the blue point ends up with the same category as the red points, category 1.
The concrete algorithm (the distance is the Euclidean distance between two coordinate points):
The final prediction is 1, the same category as the red points, and this completes the model.
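The distance-and-vote step above can be condensed into one function. This is a sketch on a tiny hand-made dataset (the function name knn_predict and the toy points are my own, not from the tutorial), using the same np.argsort and Counter approach the tutorial's code uses:

```python
from collections import Counter
from math import sqrt
import numpy as np

def knn_predict(x, x_train, y_train, k=5):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = [sqrt(np.sum((x - x0) ** 2)) for x0 in x_train]
    # argsort is ascending, so the first k indices are the k closest points
    nearest = np.argsort(distances)[:k]
    top_k = [y_train[i] for i in nearest]
    # Counter.most_common(1) returns [(label, count)]; keep only the label
    return Counter(top_k).most_common(1)[0][0]

# Tiny example: two clusters, around (0,0) and (5,5)
x_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(np.array([5.5, 5.5]), x_train, y_train, k=3))  # 1
```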
Explanation: this repeats the previous steps for each point in the test set, compares the predicted result with the point's true category, counts the number of correct predictions, and finally divides by the total number of test points to get the model's accuracy.
The full code, in order:
# Modules for generating the data
import numpy as np
import matplotlib.pyplot as plt

# Generate the source data
x_data1 = np.random.uniform(1, 5, (50, 2))
x_data2 = np.random.uniform(3, 9, (50, 2))
x_data = np.concatenate([x_data1, x_data2])

# Categories of the source data (k-NN handles classification, a supervised-learning problem)
y_data = [0, 1] * 50
y_data = np.sort(y_data)

# Training set (75% of the source data)
x_train = np.split(x_data, [75])[0]
y_train = np.split(y_data, [75])[0]

# Test set (the remaining 25%)
x_test = np.split(x_data, [75])[1]
y_test = np.split(y_data, [75])[1]
# Plot the training set
plt.scatter(x_train[y_train==0, 0], x_train[y_train==0, 1], color='g', label="symbol 0")
plt.scatter(x_train[y_train==1, 0], x_train[y_train==1, 1], color='r', label="symbol 1")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
# Add a new point and judge whether its category is 0 or 1
x = np.array([4.1227123, 7.26324127])
plt.scatter(x_train[y_train==0, 0], x_train[y_train==0, 1], color='g', label="symbol 0")
plt.scatter(x_train[y_train==1, 0], x_train[y_train==1, 1], color='r', label="symbol 1")
plt.scatter(x[0], x[1], color='b', label="symbol ?")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
# The k-NN process: compute the distances and store them in a list
from math import sqrt

distances = []
for x0 in x_train:
    d = sqrt(np.sum((x - x0) ** 2))
    distances.append(d)

# Sort the distances (ascending) and return the element indices
near = np.argsort(distances)
k = 5

# Categories of the 5 nearest elements
topK_y = [y_train[i] for i in near[:k]]
from collections import Counter

# Count the occurrences of each category
votes = Counter(topK_y)
# Find the category with the most votes. most_common(1) returns a list
# containing one (category, count) tuple; we only need the category,
# the first element of that tuple.
result = votes.most_common(1)
result
# Measure the accuracy on the test data. The steps above are repeated for
# each test point, and the prediction is compared with the true category
# from the source data to obtain the rate of correct predictions.
count = 0
for index, x in enumerate(x_test):
    # Compute the distance to every training point
    distance = []
    for x1 in x_train:
        t = sqrt(np.sum((x - x1) ** 2))
        distance.append(t)
    near = np.argsort(distance)
    topK_y = [y_train[i] for i in near[:k]]
    votes = Counter(topK_y)
    result = votes.most_common(1)[0][0]   # predicted category
    if y_test[index] == result:
        count += 1
score = count / 25
score