Machine learning algorithms from scratch in Python -- CBLOF anomaly detection

There are several evidences 2020-11-13 12:37:47


I'm currently doing some work related to anomaly detection, and I participated in developing the CBLOF algorithm inside pyod. As a result, I'm very familiar with both CBLOF and its pyod implementation.

This article belongs to my hand-written algorithm series. If you're interested, study the code; if you're short on time, feel free to skip it. But even if you don't want to read the code, I still suggest reading the article: I implemented these algorithms myself before writing about them, and because of that I can plot and print the intermediate steps, as I do below.

Anomaly detection

Anomaly detection is generally an unsupervised learning task: there are no labels, and various algorithms are used to distinguish normal points from outliers.

Of course, if labels are available, we can solve the problem with a classification algorithm. Although that is also anomaly detection, such an algorithm is still considered a classifier; only unsupervised methods are generally regarded as anomaly detection algorithms.

Anomaly detection algorithms range from traditional statistical methods, to machine learning, to deep learning; there is something for every taste.

Many anomaly detection algorithms are built on top of existing machine learning algorithms. For example, the KNN anomaly detection algorithm looks at the distances from each point to its K nearest neighbors, then flags the few points with the largest such distances as anomalies.
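As a quick illustration of that idea, here is a minimal sketch using scikit-learn's NearestNeighbors (not pyod's actual KNN detector; the data and k are made up):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),  # a dense normal cluster
    np.array([[6.0, 6.0]]),           # one planted outlier, far from the cluster
])

k = 5
# +1 neighbors because each point is its own nearest neighbor at distance 0
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]  # distance to the k-th real neighbor = anomaly score

print(int(np.argmax(scores)))  # → 100, the index of the planted outlier
```

The point we planted far away has a much larger k-th-neighbor distance than any point inside the cluster, so it gets the top score.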

CBLOF: local outlier factors based on clustering

The CBLOF algorithm we're studying today is also built on top of another machine learning algorithm. The "based" part is exactly the B in its name. It is based on a clustering algorithm, hence Cluster-Based. The three letters LOF stand for Local Outlier Factor. Put together, CBLOF is the Cluster-Based Local Outlier Factor.

(Figure: the data forms four clusters, C1 to C4)

The basic intuition is this: data accumulates in different places, forming clusters. The closer a point is to a large cluster, the more likely it is to be normal; the farther away, the more likely it is to be an outlier. So inside CBLOF we plug in a clustering algorithm to obtain these clusters. Typically we use the k-means algorithm, with plain Euclidean distance. That gives us the four clusters in the figure above, where we can clearly see that C2 and C4 are two large clusters while C1 and C3 are two small ones. The closer a point is to the center of C2 or C4 (the k-means centers, also called centroids), the more normal it is; the farther away, the more anomalous.

Data

Let's now walk through a practical example to introduce the algorithm in detail.

We use sales and profit data here. You can find it here:

https://pyod.readthedocs.io/en/latest/_modules/pyod/models/cblof.html

(Figure: scatter plot of the sales and profit data)
We can see that most of the points are concentrated in a small area in the middle left, with very few points scattered across the other areas.

Clustering

As mentioned, the first step of the algorithm is clustering, so let's cluster the data.

Before clustering, let's normalize the data. This is a very necessary step, especially when the variables have different ranges.
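The normalization code itself is omitted in the article; a minimal sketch of what it might look like, assuming sklearn's MinMaxScaler and made-up sales/profit values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# stand-in for the Sales / Profit columns (hypothetical values)
data = np.array([
    [250.0, 20.0],
    [10000.0, -300.0],
    [500.0, 55.0],
])

# rescale each column independently to [0, 1]
X = MinMaxScaler().fit_transform(data)
print(X.min(axis=0), X.max(axis=0))  # → [0. 0.] [1. 1.]
```

Without this step, the much larger sales values would dominate the Euclidean distances that k-means uses.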

(Due to length limits, most of the code is not pasted here; the full code address is given at the end of the article.)

Then we simply run k-means clustering:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=8)
km.fit(X)

After clustering, we get the figure below, with each cluster distinguished by color.
(Figure: the data colored by k-means cluster assignment)

Big clusters, small clusters

Let me ask you a question: which cluster is the largest?

I hope you got it right: the smallest-looking blue cluster is actually the largest. The smallest cluster is the purple one; although it covers a large area, it contains only one point. So "size" here is measured by the number of members in a cluster.

As for the other clusters, which ones are big and which are small?

Let's first define $|C|$ to denote the size of a cluster $C$. Putting the clusters in order of size, we have:

$|C_1| \ge |C_2| \ge |C_3| \ge \dots \ge |C_k|$

The absolute majority

To separate big clusters from small ones, we apply two principles. The first is the absolute-majority principle: starting from the largest cluster, we add up the sizes one by one until the running total reaches an absolute majority of the total size. The "absolute majority" is a tunable parameter $\alpha$, ranging from 0.5 to 1 and usually set to 0.9.
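The implementation further down reads cluster ids and sizes from `df_cluster_sizes` and compares the running total against `n_points_in_large_clusters`; here is a plausible setup for those inputs (the labels are hypothetical, not the article's data):

```python
import numpy as np
import pandas as pd

# hypothetical cluster labels, as k-means might assign them
labels = np.array([0] * 50 + [1] * 30 + [2] * 10 + [3] * 2)

alpha = 0.9  # large clusters must jointly hold at least 90% of all points
beta = 5     # a more-than-5x size drop marks the large/small boundary

# cluster ids with their sizes, sorted from largest to smallest
df_cluster_sizes = (
    pd.Series(labels)
      .value_counts()
      .rename_axis('cluster')
      .reset_index(name='size')
)
n_points_in_large_clusters = int(alpha * len(labels))
print(df_cluster_sizes['size'].tolist(), n_points_in_large_clusters)  # → [50, 30, 10, 2] 82
```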

Sudden drop

The second principle is the sudden drop, measured as a ratio $\beta$ between consecutive cluster sizes. The default value is 5, i.e. a sudden five-fold drop in size.
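A toy illustration of the sudden-drop test (hypothetical sizes; note the comparison is strictly greater-than):

```python
# hypothetical cluster sizes, already sorted from largest to smallest
sizes = [50, 30, 10, 2]
beta = 5

# ratio between each cluster and the next one down
drops = [sizes[i] / sizes[i + 1] for i in range(len(sizes) - 1)]

# first position where the size drops by more than a factor of beta
boundary = next((i for i, r in enumerate(drops) if r > beta), None)
print(boundary)  # → None: 10/2 == 5.0 is not strictly greater than 5
```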

pyod does it like this: the best case is to find a boundary that satisfies both conditions. The next best is a boundary that satisfies only the absolute-majority principle. The worst case is a boundary that satisfies only the sudden-drop principle.

Here is my simple implementation:

large_clusters = []
small_clusters = []
found_b = False  # have we found the large/small boundary yet?
count = 0        # running total of cluster sizes, largest first
clusters = df_cluster_sizes['cluster'].values
n_clusters = len(clusters)
sizes = df_cluster_sizes['size'].values
for i in range(n_clusters):
    print(f"----------- iteration {i} --------------")
    satisfy_alpha = False
    satisfy_beta = False
    if found_b:
        # boundary already found: everything after it is a small cluster
        small_clusters.append(clusters[i])
        continue
    count += sizes[i]
    print(count)
    if count > n_points_in_large_clusters:
        satisfy_alpha = True  # the clusters so far hold an absolute majority
    if i < n_clusters - 1:
        print(sizes[i] / sizes[i + 1])
        if sizes[i] / sizes[i + 1] > beta:
            print("beta")
            satisfy_beta = True  # sudden drop between cluster i and i+1
    print(satisfy_alpha, satisfy_beta)
    if satisfy_alpha and satisfy_beta:
        found_b = True  # the boundary sits right after cluster i
    large_clusters.append(clusters[i])

We color the large and small clusters separately: points in large clusters are more likely to be normal, and points in small clusters are more likely to be outliers.

(Figure: large clusters and small clusters shown in different colors)

The factor

Now we compute the CBLOF factor, that is, the distance from each point to the nearest large cluster.

If the point belongs to a large cluster, we directly compute its distance to that cluster's center. If it does not, we compute its distance to every large cluster's center and take the smallest one as the factor.

def get_distance(a, b):
    # Euclidean distance between two 2-D points
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def decision_function(X, labels):
    n = len(labels)
    distances = []
    for i in range(n):
        p = X[i]
        label = labels[i]
        if label in large_clusters:
            # point belongs to a large cluster: distance to its own center
            center = km.cluster_centers_[label]
            d = get_distance(p, center)
        else:
            # point in a small cluster: distance to the nearest large-cluster center
            d = None
            for center in large_cluster_centers:
                d_temp = get_distance(p, center)
                if d is None or d_temp < d:
                    d = d_temp
        distances.append(d)
    return np.array(distances)

distances = decision_function(X, km.labels_)

Here we need to introduce the concept of contamination, the main parameter of anomaly detection algorithms. It simply means the expected proportion of outliers. Typically this value is 1%, and we use that default here.

With contamination at 1%, we can use the percentile function in Python to find the 1% of points with the largest distances and treat them as anomalies:

threshold=np.percentile(distances, 99)
print(f"threshold is {threshold}")
anomaly_labels = (distances>threshold)*1

At this point we have our outliers. I printed my results and compared them against the picture printed by pyod's CBLOF; the two are identical.

(Figure: the detected outliers, matching pyod's CBLOF output)
(The plotting code above is adapted from Susan Li's article.)

Source code

https://github.com/EricWebsmith/machine_learning_from_scrach

If you like this series, you can star my github project. All the algorithms are implemented on github first; when I have time, I write them up on csdn. There are currently quite a few implemented algorithms I haven't written articles about yet, including XGB, CatBoost, and Zhou's Isolation Forest. For the Isolation Forest algorithm in particular, as in this article, I put a lot of work into the plots.

Reference resources

He, Zengyou, Xiaofei Xu, and Shengchun Deng. “Discovering cluster-based local outliers.” Pattern Recognition Letters 24.9-10 (2003): 1641-1650.

https://github.com/yzhao062/pyod

Susan Li. "Anomaly Detection for Dummies." towardsdatascience.com.

Copyright notice
This article was created by [There are several evidences]. When reposting, please include a link to the original. Thank you.
