A series of hand-written algorithms in Python machine learning -- CBLOF anomaly detection

2020-11-13 12:37:47

I'm currently doing some work related to anomaly detection, and I took part in developing the CBLOF algorithm inside pyod. So I'm very familiar with both CBLOF and its pyod implementation.

This is part of my hand-written algorithm series. If you're interested, study the code; if you're short on time, you can skip it. But even if you don't want to read the code, I still suggest reading the article: I implemented the code myself, so I understood these algorithms first and only then wrote about them. And during implementation, as in this article, the intermediate steps can be plotted and printed out.

Anomaly detection

Anomaly detection is generally an unsupervised learning task. There are no labels; a variety of algorithms are used to separate normal points from outliers.

Of course, if labels are available, we can treat anomaly detection as a classification problem. Although that is also anomaly detection, such an algorithm is still a classification algorithm; only unsupervised methods are generally considered anomaly detection algorithms.

Anomaly detection algorithms range from traditional statistical methods, to machine learning, to deep learning; there is something for every need.

Many anomaly detection algorithms are built on top of existing machine learning algorithms. For example, the KNN-based anomaly detection algorithm looks at the distance from each point to its K nearest neighbors, then flags the few points with the largest distances as anomalies.
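As a quick illustration of the KNN idea (a minimal sketch using scikit-learn's NearestNeighbors, not pyod's actual implementation), the anomaly score of each point can be taken as its distance to its k-th nearest neighbor:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy data: a tight cluster plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

k = 3
# +1 because each point is its own nearest neighbor at distance 0
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]  # distance to the k-th nearest neighbor is the anomaly score

print(scores.argmax())  # -> 4, the isolated point gets the largest score
```

The point at (5, 5) sits far from the cluster, so its k-th-neighbor distance, and hence its score, dwarfs the others.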

CBLOF: the cluster-based local outlier factor

The algorithm we're studying today, CBLOF, is also an anomaly detection algorithm built on another machine learning algorithm. The "based" part is the B in its name: it is built on a clustering algorithm, hence Cluster-Based. The letters LOF stand for Local Outlier Factor. Put together, CBLOF is the Cluster-Based Local Outlier Factor.

Its basic intuition is this: data can aggregate in many different places, forming clusters. The closer a point is to a large cluster, the more likely it is to be normal; the farther away, the less likely. So inside CBLOF we plug in a clustering algorithm to obtain these clusters. Commonly we use the k-means algorithm, with plain Euclidean distance. That gives us the four clusters in the figure above, where we can clearly see that C2 and C4 are two large clusters while C1 and C3 are two small ones. The closer a point is to the center of C2 or C4 (i.e. the k-means center, also called a centroid), the more normal it is; the farther, the more anomalous.

Data

Let's walk through the algorithm in detail with a practical example.

We use sales and profit data here. You can find it here:

We can see that most of the points are concentrated in a small area on the middle left; very few points are distributed elsewhere.

Clustering

As mentioned, the first step of this algorithm is clustering, so let's cluster the data.

Before clustering, we first normalize the data. This is a very necessary step, especially when the variables have different ranges.
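The full preprocessing code isn't pasted here, but the normalization step can be sketched with scikit-learn's MinMaxScaler (the sales/profit values below are illustrative, not from the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# illustrative sales/profit values with very different ranges
raw = np.array([[200.0, 10.0],
                [800.0, 35.0],
                [15000.0, -400.0]])

scaler = MinMaxScaler()        # rescales every column to [0, 1]
X = scaler.fit_transform(raw)

print(X.min(axis=0))  # [0. 0.]
print(X.max(axis=0))  # [1. 1.]
```

After scaling, sales (thousands) and profit (tens) live on the same scale, so neither dominates the Euclidean distances that k-means uses.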

(Due to length limits, most of the code is not pasted here; the code address is given at the end of the article.)

Then we simply run k-means clustering.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=8)
km.fit(X)


After clustering, we get the figure below, with each cluster distinguished by color.

Large clusters and small clusters

Let me ask you a question: which cluster is the largest?

I hope you got it right: the smallest-looking blue cluster is the largest. The smallest cluster is the purple one; although it covers a large area, it contains only one point. So "size" here is measured by the number of cluster members.
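The `df_cluster_sizes` table used in the partitioning code further down is not shown in this excerpt; one way to build it (a sketch with hard-coded labels standing in for a fitted model's `km.labels_`) is:

```python
import numpy as np
import pandas as pd

# cluster assignments as produced by a fitted KMeans model (km.labels_);
# hard-coded here for illustration
labels = np.array([0, 0, 0, 0, 0, 1, 1, 2])

# count members per cluster, sorted from largest to smallest
df_cluster_sizes = (
    pd.Series(labels)
    .value_counts()           # sorted descending by default
    .rename_axis('cluster')
    .reset_index(name='size')
)
print(df_cluster_sizes)  # cluster sizes sorted descending: 0 -> 5, 1 -> 2, 2 -> 1
```

With the clusters already ordered by size, the alpha/beta partitioning loop below can walk the table from largest to smallest.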

So among the remaining clusters, which are large and which are small?

Let's first define |C| to represent the size of a cluster C. Sorting these clusters by size, we have:

|C_1| ≥ |C_2| ≥ |C_3| ≥ ... ≥ |C_k|

The absolute-majority principle

Now we choose the large and small clusters. There are two principles; the first is the absolute-majority principle. That is, starting from the largest cluster, we add up their sizes one by one until the running total reaches an absolute majority of the total size. The "absolute majority" is a tunable parameter α, ranging from 0.5 to 1, usually set to 0.9.
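A small worked example of the α rule, with hypothetical cluster sizes: given sizes 500, 300, 150, 30, 15, 5 (total 1000) and α = 0.9, the cumulative sum crosses 900 at the third cluster, so the first three clusters are "large" by this rule alone:

```python
import numpy as np

sizes = np.array([500, 300, 150, 30, 15, 5])  # hypothetical, sorted descending
alpha = 0.9

cumulative = np.cumsum(sizes)  # [500, 800, 950, 980, 995, 1000]
# index of the first cluster where the running total reaches alpha * total
boundary = np.argmax(cumulative >= alpha * sizes.sum()) + 1

print(boundary)  # -> 3: the first 3 clusters are "large" by the alpha rule
```

The full algorithm combines this with the β rule below rather than using α alone.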

The sudden-drop principle

The second principle is the sudden-drop principle; a sudden drop is measured by a ratio β. The default value is 5, i.e. a five-fold drop between consecutive cluster sizes.

pyod works like this: the best case is to find a partition that satisfies both conditions. The next best is a partition that satisfies only the absolute-majority principle. The worst case is a partition that satisfies only the sudden-drop principle.

Here is my simple implementation.

large_clusters = []
small_clusters = []
found_b = False
count = 0
clusters = df_cluster_sizes['cluster'].values
n_clusters = len(clusters)
sizes = df_cluster_sizes['size'].values
# n_points_in_large_clusters = alpha * total number of points (defined earlier)
for i in range(n_clusters):
    print(f"----------- iteration {i} --------------")
    satisfy_alpha = False
    satisfy_beta = False
    if found_b:
        # the boundary has been found: everything after it is a small cluster
        small_clusters.append(clusters[i])
        continue
    count += sizes[i]
    print(count)
    if count > n_points_in_large_clusters:
        satisfy_alpha = True
    # check i < n_clusters - 1 first to avoid an index error on the last cluster
    if i < n_clusters - 1 and sizes[i] / sizes[i + 1] > beta:
        print(sizes[i] / sizes[i + 1])
        print("beta")
        satisfy_beta = True
    print(satisfy_alpha, satisfy_beta)
    if satisfy_alpha and satisfy_beta:
        found_b = True
    large_clusters.append(clusters[i])


We color the points by large and small cluster: points in large clusters are more likely to be normal, while points in small clusters are more likely to be outliers.

The CBLOF factor

Now we compute the CBLOF factor: the distance from each point to the nearest large cluster.

If the point belongs to a large cluster, we directly compute its distance to that cluster's center. If not, we compute its distance to every large cluster's center and take the smallest one as the factor.

def get_distance(a, b):
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def decision_function(X, labels):
    n = len(labels)
    distances = []
    for i in range(n):
        p = X[i]
        label = labels[i]
        if label in large_clusters:
            # point in a large cluster: distance to its own cluster center
            center = km.cluster_centers_[label]
            d = get_distance(p, center)
        else:
            # point in a small cluster: distance to the nearest large cluster
            # (large_cluster_centers holds the centers of the large clusters)
            d = None
            for center in large_cluster_centers:
                d_temp = get_distance(p, center)
                if d is None:
                    d = d_temp
                elif d_temp < d:
                    d = d_temp
        distances.append(d)
    return np.array(distances)

distances = decision_function(X, km.labels_)


Here we need to introduce the concept of contamination. It is the main parameter of anomaly detection algorithms: it represents the expected proportion of anomalies. Typically this value is 1%, and we use that default here.

With contamination at 1%, we can use numpy's percentile function to find the 1% of points with the largest distances and flag them as anomalies.

# contamination = 1% -> the top 1% of distances are outliers
threshold = np.percentile(distances, 99)
print(f"threshold is {threshold}")
anomaly_labels = (distances > threshold) * 1


With that, we have our outliers. I plotted my results and compared them with the figure produced by pyod's CBLOF; the two are identical.

(The code above is adapted from Susan Li's article.)

Source code

https://github.com/EricWebsmith/machine_learning_from_scrach

If you like this series, you can star my github project. All the algorithms are implemented on github first; when I have time, I write an article about them on csdn. At the moment, many implemented algorithms don't have articles yet, including XGB, CatBoost, Zhou's Isolation Forest, and more. For the Isolation Forest algorithm in particular, as in this article, I put a lot of work into the figures.

Reference resources

He, Zengyou, Xiaofei Xu, and Shengchun Deng. “Discovering cluster-based local outliers.” Pattern Recognition Letters 24.9-10 (2003): 1641-1650.

https://github.com/yzhao062/pyod

Susan Li, "Anomaly Detection for Dummies", towardsdatascience.com