K-nearest neighbors (K-NN) is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To predict a new data point, the algorithm finds the closest points in the training data set; those are its "nearest neighbors".
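The idea can be sketched in a few lines. The points and labels below are made up for illustration (not the iris data), and k = 1 for simplicity:

```python
import numpy as np

# Hypothetical 2-D training points and their labels (example values only)
train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])

# Predict a new point by copying the label of its single nearest neighbor
new_point = np.array([1.1, 0.9])
dists = np.sqrt(((train - new_point) ** 2).sum(axis=1))  # Euclidean distances
prediction = labels[np.argmin(dists)]
print(prediction)  # -> 0
```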
What we implement here is a classification problem in supervised learning: we need to predict the category of the test data.
Import NumPy to make the data easy to manipulate, and pyplot for drawing.
(1) `load_iris()`: load the iris data set.
(2) `x_data = datas['data'][0:150]`, `y_data = datas['target'][0:150]`:
Use slicing operations to get the data. `'data'` corresponds to the iris measurements; `'target'` corresponds to the categories.
The resulting source data set: `x_data` holds 150 rows of feature data, and `y_data` holds the corresponding 150 category labels.
Acquiring the training set:
Explanation: slicing takes half of the source data as the training set: from index 0 up to 150 (left-closed, right-open), with a step of 2, yielding 75 rows.
Acquiring the test set:
Explanation: slicing takes the other half of the source data as the test set: from index 1 up to 150 (left-closed, right-open), with a step of 2, also yielding 75 rows.
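The two slices can be checked on a toy array (a stand-in for the 150 rows, not the real iris values):

```python
import numpy as np

data = np.arange(150)          # stand-in for 150 rows of iris data
train = data[0:150:2]          # even indices: 0, 2, 4, ..., 148
test = data[1:150:2]           # odd indices: 1, 3, 5, ..., 149
print(len(train), len(test))   # -> 75 75
print(train[:3], test[:3])     # -> [0 2 4] [1 3 5]
```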
1. Drawing a scatter plot:
Note: the abscissa and ordinate here correspond to the sepal length and sepal width of the iris. A four-dimensional vector can't be drawn, so only the first two features are used for illustration.
2. The k-NN process:
Explanation: the black dot is the data point to be predicted. We find the categories of the points closest to it and use voting to determine the black dot's category. According to the figure, the black dot ends up in the same category as the green dots, category 0.
The concrete algorithm (distance is the Euclidean distance between two points):
The final prediction is 0, the same category as the green dots; this completes building the model.
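The voting step can be illustrated in isolation; the neighbor labels below are made-up example values:

```python
from collections import Counter

topK_y = [0, 0, 1]             # labels of the k = 3 nearest neighbors (example values)
votes = Counter(topK_y)        # counts each label: Counter({0: 2, 1: 1})
prediction = votes.most_common(1)[0][0]
print(prediction)              # -> 0
```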
Explanation: we repeat the previous procedure for every point in the test set, compare each predicted result with its true category, record the number of correct predictions, and finally divide by the total number of test points to calculate the accuracy of this model.
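The accuracy computation itself reduces to counting matches; a small sketch with made-up labels:

```python
y_true = [0, 1, 2, 1]   # true labels (example values)
y_pred = [0, 1, 1, 1]   # predicted labels (example values)

# Count the predictions that match, then divide by the number of test points
count = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
score = count / len(y_true)
print(score)  # -> 0.75
```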
The code, in order:
```python
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Load the iris data
datas = load_iris()
# datas['data'] holds the petal and sepal measurements of the iris;
# slice out all 150 rows (k-NN handles a supervised classification problem)
x_data = datas['data'][0:150]
# datas['target'] holds the category (label) of each iris:
# 0 = setosa, 1 = versicolor, 2 = virginica
y_data = datas['target'][0:150]
# Generate the training set (50 percent of the source data: even indices)
x_train = x_data[0:150:2]
y_train = y_data[0:150:2]
# Generate the test set (the other 50 percent: odd indices)
x_test = x_data[1:150:2]
y_test = y_data[1:150:2]
```
```python
# Plot the training set (only the first two iris features are used as
# coordinates to illustrate the algorithm, since four dimensions can't be drawn)
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g', label="symbol 0")
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r', label="symbol 1")
plt.scatter(x_train[y_train==2,0], x_train[y_train==2,1], color='b', label="symbol 2")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
```
```python
# Add a new data point (this data is an example) and judge whether its
# category is 0, 1, or 2, according to distance
x = np.array([4.5023242, 3.03123123, 1.3023123, 0.102123123])
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g', label="symbol 0")
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r', label="symbol 1")
plt.scatter(x_train[y_train==2,0], x_train[y_train==2,1], color='b', label="symbol 2")
# Plot only the first two features of the new point
plt.scatter(x[0], x[1], color='black', label="symbol ?")
plt.title("k-NN view")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.legend()
plt.show()
```
```python
# The k-NN process: compute the distance from x to every training point
# and store the distances in a list
from math import sqrt
distances = []
for x0 in x_train:
    d = sqrt(np.sum((x - x0) ** 2))
    distances.append(d)
# Sort the distances (smallest first) and return the element indices
near = np.argsort(distances)
k = 3
# Categories of the k nearest elements
topK_y = [y_train[i] for i in near[:k]]
topK_y
```
```python
from collections import Counter
# Count the occurrences of each element (i.e. the votes per iris category)
votes = Counter(topK_y)
votes
```
```python
# Find the element with the most votes; most_common(1) returns a list with
# one (label, count) tuple, and we only need the label (that is, the category)
# The predicted result: 0 = setosa, 1 = versicolor, 2 = virginica
result = votes.most_common(1)[0][0]
result
```
```python
# The data above is reused here, but the steps start over: traverse the test
# set one point at a time and compare each prediction with the true label in
# the source data to compute the accuracy on the test data
count = 0
index = 0
for j in x_test:
    distance = []
    x = j
    # Compute the distance to every training point
    for x1 in x_train:
        t = sqrt(np.sum((x - x1) ** 2))
        distance.append(t)
    near = np.argsort(distance)
    topK_y = [y_train[i] for i in near[:k]]
    votes = Counter(topK_y)
    result = votes.most_common(1)[0][0]
    if y_test[index] == result:
        count = count + 1
    index = index + 1
score = count / 75
score
```