## Implementation of the k-NN algorithm in Python (Iris data)

Coke 2020-11-13 03:49:59

# 1. The k-Nearest Neighbor Algorithm

The k-nearest neighbors (k-NN) algorithm is arguably the simplest machine learning algorithm. Building the model only requires storing the training set. To predict a new data point, the algorithm finds the closest points in the training set: the point's "nearest neighbors".

What we implement here is a supervised classification problem (with three classes): given a test sample's features, we predict which iris species it belongs to.

# 2. Implementation Steps

## 1. Get the dataset

Import NumPy for array operations and pyplot for plotting, then load the iris data with `load_iris`. Slicing extracts the dataset: `x_data = datas['data'][0:150]` gives the feature vectors (`'data'` holds the iris measurements), and `y_data = datas['target'][0:150]` gives the corresponding class labels (`'target'` holds the categories).

The resulting source dataset is:
- `x_data`: 150 samples of iris measurements
- `y_data`: the 150 corresponding class labels
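Before slicing, it can help to inspect what `load_iris` actually returns; a minimal check (shapes and label values are standard scikit-learn facts):

```python
from sklearn.datasets import load_iris
import numpy as np

datas = load_iris()
# 150 samples, each with 4 features (sepal/petal length and width)
print(datas['data'].shape)
# 150 class labels, one per sample
print(datas['target'].shape)
# The three species are encoded as 0, 1, 2
print(np.unique(datas['target']))
```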

## 2. Split the dataset into training and test sets

Training set: slicing takes half of the source data as the training set. Starting at index 0, stepping by 2 up to (but not including) 150, yields 75 samples.

Test set: slicing takes the other half as the test set. Starting at index 1, stepping by 2 up to 150, yields the remaining 75 samples.
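This interleaved split is plain NumPy slicing; a small stand-in example makes the index pattern visible:

```python
import numpy as np

data = np.arange(10)     # stand-in for the 150 iris samples
train = data[0:10:2]     # even indices: every other sample from index 0
test = data[1:10:2]      # odd indices: every other sample from index 1
print(train)             # [0 2 4 6 8]
print(test)              # [1 3 5 7 9]
```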

## 3. Build the model

1. Draw a scatter plot. Note: the x and y axes here are the sepal length and width of the iris. A four-dimensional vector cannot be plotted, so only the first two features are used for illustration.
2. The k-NN process: the black dot marks the point to be predicted. We find the k nearest training points and use a majority vote among their classes to decide the black dot's class. In the figure, the black dot ends up in class 0, the same class as the green dots.

The concrete algorithm (distance is the Euclidean distance between two points): the final prediction is 0, the same class as the green dots, which completes the model.
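The distance-and-vote procedure described above can be sketched as a small self-contained function (the function name and toy data here are illustrative, not from the original code):

```python
from collections import Counter
import numpy as np

def knn_predict(x, x_train, y_train, k=3):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    # Euclidean distance from x to every training point
    distances = np.sqrt(np.sum((x_train - x) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny example: two clusters; a point near the first cluster gets its label
x_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.1, 1.0]), x_train, y_train))  # 0
```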

## 4. Measure accuracy on the test set

This repeats the previous procedure for every sample in the test set, compares each prediction with the sample's true label, counts the correct predictions, and finally divides by the total number of test samples to get the model's accuracy.

# 3. Algorithm Implementation

Note: the code blocks below are sequential; each depends on the ones before it.

```python
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Load the iris dataset (a Bunch object with 'data' and 'target' keys)
datas = load_iris()
# datas['data'] holds the 150 feature vectors (petal and sepal measurements)
x_data = datas['data'][0:150]
# datas['target'] holds the class labels: 0 = setosa, 1 = versicolor,
# 2 = virginica (k-NN handles a supervised classification problem)
y_data = datas['target'][0:150]
# Training set: every other sample starting at index 0 (50% of the data)
x_train = x_data[0:150:2]
y_train = y_data[0:150:2]
# Test set: every other sample starting at index 1 (the other 50%)
x_test = x_data[1:150:2]
y_test = y_data[1:150:2]
```
```python
# Plot the training set (only the first two features are used as coordinates
# to illustrate the algorithm, since four dimensions cannot be drawn)
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g', label="class 0")
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r', label="class 1")
plt.scatter(x_train[y_train==2,0], x_train[y_train==2,1], color='b', label="class 2")
plt.title("k-NN view")
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.legend()
plt.show()
```
```python
# Add a new data point (an example) and judge whether its class is 0, 1, or 2
# (according to distance)
x = np.array([4.5023242, 3.03123123, 1.3023123, 0.102123123])
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g', label="class 0")
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r', label="class 1")
plt.scatter(x_train[y_train==2,0], x_train[y_train==2,1], color='b', label="class 2")
# Plot the new point using its first two features, matching the axes above
plt.scatter(x[0], x[1], color='black', label="class ?")
plt.title("k-NN view")
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.legend()
plt.show()
```
```python
# The k-NN process: compute the distance from x to every training point
from math import sqrt
distances = []
for x0 in x_train:
    d = sqrt(np.sum((x - x0)**2))
    distances.append(d)
# argsort returns the indices that would sort the distances (smallest first)
near = np.argsort(distances)
k = 3
# Labels of the k nearest training points
topK_y = [y_train[i] for i in near[:k]]
topK_y
```
```python
from collections import Counter
# Count the votes for each class among the k nearest neighbors
votes = Counter(topK_y)
# most_common(1) returns a list with one (label, count) tuple; keep the label
# The prediction: 0 = setosa, 1 = versicolor, 2 = virginica
result = votes.most_common(1)[0][0]
result
```
```python
# Repeat the same steps for every test sample and compare each prediction
# with the true label from the source data to measure accuracy
count = 0
for index, x in enumerate(x_test):
    distance = []
    # Distance from the test sample to every training sample
    for x1 in x_train:
        t = sqrt(np.sum((x - x1)**2))
        distance.append(t)
    near = np.argsort(distance)
    topK_y = [y_train[i] for i in near[:k]]
    # Majority vote among the k nearest neighbors
    result = Counter(topK_y).most_common(1)[0][0]
    if result == y_test[index]:
        count += 1
# Accuracy = correct predictions / total number of test samples
accuracy = count / len(x_test)
accuracy
```