Case: a beginner's Guide to hierarchical clustering and python implementation (with links)

osc_ 8mr9obdl 2020-11-15 14:35:03
case beginner guide hierarchical clustering

author :Pulkit Sharma   translate : Chen Chao   proofreading : Wu Zhendong

this paper about 4700 word , Recommended reading 15 minute

This paper starts from the comparison between unsupervised learning and supervised learning , Combined with specific cases to introduce the concept of hierarchical clustering 、 Application scenarios 、 Main types and Python Realization .


Understanding customer behavior is essential in any industry , I didn't realize it until last year . At that time, my CMO(chief marketing officer, Chief Marketing Officer ) Ask me :“ Can you tell me , What group of users should our new product target ?”

It's a learning process for me . I soon realized , As a data scientist , How important is it to segment customers so that companies can customize their customers and establish target strategies . This is where the clustering concept comes in handy !

User classification is often tricky , Because we don't have any target variables in mind . We are now officially entering the field of unsupervised learning , Discover patterns and structures without any set results . It's challenging for data scientists, but it's exciting .

There are several different clustering methods here ( You'll see in the following section ). I'll introduce you to one of them —— Hierarchical clustering .

We're going to learn what hierarchical clustering is , It is superior to other clustering algorithms , Different levels of clustering methods and development steps . In the end, we will adopt a customer classification database and implement Python Hierarchical clustering . I love this method and I'm sure you'll like it after reading this article !

notes : As mentioned above , There are many clustering methods . I encourage you to look at our guidelines for different types of clustering :

  • An Introduction to Clustering and different methods of clustering

Want to learn more about clustering and other machine learning algorithms ( Supervised and unsupervised ) Take a look at the following project -


1.  supervise vs Unsupervised learning

2. Why use hierarchical clustering ?

3. What is hierarchical clustering ?

4.  The types of hierarchical clustering

(1)  Polymerization (Agglomerative) Hierarchical clustering

(2)  Split (Divisive) Hierarchical clustering

5.  The steps of hierarchical clustering

6. How to select the number of classes in hierarchical clustering ?

7.  Using hierarchical clustering to solve a wholesale customer classification problem

supervise vs Unsupervised learning

Before we go into hierarchical clustering , It is important to understand the difference between supervised learning and unsupervised learning . Let me explain the difference with a simple example .

Suppose I want to estimate the number of bikes that will be rented each day :

perhaps , We want to predict whether a person will survive on the Titanic :

In both cases, there is a fixed goal to achieve :

  • In the first example , Based on seasons 、 During the holiday 、 Working day 、 The weather 、 Temperature and other characteristics to predict the number of bicycle rental .

  • In the second example, predict whether the passenger will survive . stay “ To survive ” variable ,0 It means that the person is not alive ,1 It means that this person survived . The independent variables here include cabin class 、 Gender 、 Age 、 Ticket prices and so on .

So , When we have a target variable ( In both cases, the number and survival ), Based on a series of predictors or independent variables ( season , During the holiday , Gender , Age etc. ) To predict the , This problem is called supervised learning .

Let's take a look at the figure below to better understand it :

ad locum ,y It's a dependent variable or a target variable ,X For the independent variable . The target variable depends on X, So it's also called a dependent variable . We use independent variables to train the model under the supervision of the target variable , So it's called supervised learning .

Our goal in training models is to generate a function , Ability to map arguments to desired goals . Once the model training is complete , We can put new observations in , The model can predict the target itself . To make a long story short , This process is called supervised learning .

Sometimes we don't have any target variables to predict . There is no explicit target variable for this problem , It's called unsupervised learning . We have only independent variables .

We try to divide the whole data into a series of groups . These groups are called clusters , This process is called clustering .

This technique is often used to cluster populations into different groups . Common examples include customer clustering 、 Clustering similar files 、 Recommend similar songs or movies, etc .

Now there are many algorithms that can help us to cluster . The most commonly used clustering algorithm is K-means And hierarchical clustering .

Why hierarchical clustering ?

Before that , We need to know first K-means How it works . believe me , This will make the concept of hierarchical clustering easier .

There's one right here K-means An overview of how algorithms work :

1.  Determine the number of clusters (k)

2.  choice k Random points as the center

3.  Put all the points in the nearest Center

4.  Calculate the center of the new cluster

5.  Repeat step 3 and 4

This is an iterative process . It will continue to run , Until the center point of the newly formed cluster does not change , Or the maximum number of iterations .

however K-means There have also been some doubts . It usually tries to generate clusters of the same specification . also , We need to determine the number of clusters before the algorithm starts . Ideally , We don't know how many clusters we need at the beginning of the algorithm , So this is also K-means A kind of question faced by .

This is exactly the advantage of hierarchical clustering . It solves the problem of setting the number of clusters in advance . It sounds like a dream ! therefore , Let's look at what hierarchical clustering is and how it improves K-means Of .

What is hierarchical clustering ?

We have the following points , We want to cluster them :

We can take each point as a separate cluster :

Now? , Based on the similarity of these clusters , We can put the most similar clusters together , And repeat this process until you have a single cluster left :

We need to build a hierarchy of clusters , That's why this algorithm is called hierarchical clustering . We'll discuss how to determine the number of clusters in the next section . Now? , Let's look at different types of hierarchical clustering .

The types of hierarchical clustering

There are two main hierarchical clustering :

1.  Aggregate hierarchical clustering

2.  Split hierarchical clustering

Let's take a look at each one in detail :


Aggregate hierarchical clustering

Put each point in a separate cluster . Suppose there are four data points here . We divide each point into a cluster , At the beginning, there will be four clusters :

then , In each iteration , We fuse the most similar pairs of points , Then repeat the above steps until there is only a single cluster :

Every step of the way we're merging ( Or add ) cluster , Right ? therefore , This kind of clustering is also called cumulative hierarchical clustering .

Split hierarchical clustering

Split hierarchical clustering is the opposite idea . And divide it from the beginning n A cluster of (n An observation ) Different , We started with just one cluster , And put all the points in this cluster .

therefore , We have 10 A or 1000 It doesn't matter . All the points are in the same cluster at the beginning :

Now? , In each iteration , We separate the farthest points in the cluster , And repeat the process until each cluster has only one point :

We're splitting at every step of the way ( Or divide ) cluster , So it's called split hierarchical clustering .

Aggregation clustering is widely used in industry , In this paper, we will also focus on . Once we've got the aggregation , Split hierarchical clustering will also become very simple .

The steps of hierarchical clustering


We fuse the most similar points or classes in hierarchical clustering —— We already know that . Now the question is —— How to decide which points are similar and which are not ? This is one of the most important problems in clustering !

Here's a way to calculate similarity —— Calculate the distance between cluster centers . The nearest point is considered a similar point , We can fuse them . We can call this a distance based algorithm ( Because we calculated the distance between clusters ).

In hierarchical clustering, there is a proximity matrix (proximity matrix) The concept of . This matrix stores the distance between each pair of points . Let's use an example to understand the matrix and hierarchical clustering method .


Suppose a teacher wants to divide her students into different groups . She has the scores each student gets on an assignment , Based on these scores , She wants to divide the students into different groups . There is no fixed goal for grouping . Because the teacher doesn't know which kind of students should be assigned to which group , It can't be described as a supervised learning problem . therefore , We will use hierarchical clustering to divide students into different groups .

Our examples are 5 A student :

Create an adjacency matrix

First , We create an adjacency matrix , This matrix will tell us the distance between these points . We calculated the distance between every two points , You'll get one n*n The square matrix of (n It's the number of observations ).

Let's take a look at the adjacency matrix between these five points :

The diagonal of this matrix is always 0, Because the distance between each point and itself is always 0. We will use the Euclidean distance formula to calculate the remaining distance . therefore , Let's look at the points we want to calculate 1 and 2 Distance between :

√(10-7)^2 = √9 = 3

Similarly , We can calculate the distance between all the points , And fill in the adjacency matrix .



First step : First , We put all the points in a cluster :

Different colors represent different clusters . You can see in the data 5 Points make up five different clusters .

The second step : Next , Find the shortest distance point in adjacency matrix , And put these dots together . Then update the adjacency matrix .

ad locum , The minimum distance is 3, So put some 1 and 2 To merge :

Let's look at the updated clusters and update the adjacency matrix accordingly :

ad locum , We took two points (7,10) To replace the tag of the cluster . We can also use the minimum or average instead of . Now? , We will calculate the adjacency matrix of these clusters again :

The third step : Repeat step 2 Until there is only 1 A cluster of .

Let's look at the smallest value in the adjacency matrix , Then merge the closest pair in the cluster . After repeating the above steps, we will get the following fused clusters :

We started to have 5 A cluster of , In the end, there's only a single cluster . This is how aggregate hierarchical clustering works . But there are still thorny problems —— How to determine the number of clusters ? Let's look at the next part .

In hierarchical clustering , How should we choose the number of clusters ?

Are you ready to answer the question that has been asked since the beginning ? In order to obtain the number of hierarchical clusters , We used a wonderful concept called a tree .

The tree is a tree chart , Be able to record the order of fusion or division .

Let's go back to the teacher - Examples of students . Whenever you merge two classes , A tree will record the distance between these classes and represent it as a graph . Let's take a look at what the tree looks like :

We put the sample in x Axis , Distance as y Axis . Whenever two clusters merge , We're all going to join the tree , The height between the connecting points is the distance between these points . Let's create a tree of examples :

It takes some time to process the above picture . We started to fuse the samples 1 and 2, The distance between these two points is 3( It refers to the first adjacency matrix that appears in the previous section ). Let's put it on the tree :

ad locum , We can look at the fusion sample 1 and 2. The vertical line represents the distance between two points . alike , Let's draw all the steps of merging clusters onto the graph , Finally, we can get the following tree :

We can clearly visualize the steps of hierarchical clustering . The longer the vertical line is , The further apart the clusters are .

Now? , We can set a distance threshold , And draw a horizontal line ( General , We're going to set thresholds in this way , It cuts off the most common vertical line ). Let's set the threshold to 12, Then draw a horizontal line .

The number of classes is the number of vertical lines that intersect the threshold first . In the above example , Because the red line crosses two vertical lines , We will have 2 Classes . A class includes samples (1,2,4), Another class includes samples (3,5). Very clear, right ?

This is how we use the tree to determine the number of classes in hierarchical clustering . In the next section , We will apply hierarchical clustering to help you understand the concepts learned in this article .

Using hierarchical clustering to solve the wholesale customer classification problem

It's time to start Python 了 !

We're going to start solving a wholesale customer classification problem . You can download datasets here (

This data is hosted in UCI Machine learning knowledge base . The goal of this question is to a wholesaler's customers based on their different product types ( Like milk 、 The groceries 、 Area, etc ) The annual expenses are classified .

Let's explore the data first , Then we use hierarchical clustering to classify customers .

First import the required Library :

view hosted with  by GitHub

Load the dataset and look at the first few lines :

view raw

There are many kinds of products here —— fresh 、 milk 、 Groceries and so on . The number represents the quantity purchased by each customer . Our goal is to divide classes from this data , Similar customers can be grouped into the same category . We will use hierarchical clustering to solve this problem .

But before the practical application of hierarchical clustering solution , We need to normalize the data set so that all variables have the same scale . Why is this step important ? Because if the variable scales are different , The model tends to favor variables with a higher order of magnitude, such as fresh or milk ( The table above ).

therefore , First normalize the data , Put all the variables on the same scale .

ad locum , You can see that the scales of all variables are almost the same . Now? , We can start hierarchical clustering . First draw a tree to help us determine the number of clusters in this problem :

X The axis is the sample ,y The axis represents the distance between samples . The vertical line with the greatest distance is the blue line , So we can decide that the threshold is 6, And cut off the tree :

This line has two intersections , So we have two clusters . Let's use hierarchical clustering :

In our definition of 2 After a cluster , We can see in the output 0  and 1 Value .0 Represents the value that belongs to the first cluster , and 1 Represents the value belonging to the second cluster . Now visualize the two clusters :

fantastic ! We can now see two clusters clearly . This is what we use Python To achieve hierarchical clustering process .

Write it at the end

Hierarchical clustering is a very useful way to divide observations . The advantage is that there is no need to predefine the number of clusters , This makes it more than k-Means More advantage .

If you're new to data science , It is highly recommended that you take a practical machine learning course ( This is one of the most comprehensive end-to-end machine learning courses you can find anywhere . Hierarchical clustering is just one of the many topics covered in the course .

Original title :

A Beginner’s Guide to Hierarchical Clustering and how to Perform it in Python

Link to the original text :

Introduction to translator : Chen Chao , Master of applied psychology, Peking University . I majored in computer science , After that, he made unremitting efforts on the road of psychology . It is increasingly found that data analysis and programming have become two required survival skills , So make every effort to better contact and understand the relevant knowledge in daily life , But the road is long , I'm still on my way .

from :  Data pie THU official account ;


Copyright notice : Part of the content of this issue comes from the Internet , Please indicate the original link and author , If there is any infringement or source error, please contact us .

Cooperation please add QQ:365242293  

Data analysis (ID : ecshujufenxi ) Internet technology and wechat of data circle itself , It's also WeMedia A member of the We Media Alliance ,WeMedia Alliance coverage 5000 10000 people .

本文为[osc_ 8mr9obdl]所创,转载请带上原文链接,感谢

  1. 利用Python爬虫获取招聘网站职位信息
  2. Using Python crawler to obtain job information of recruitment website
  3. Several highly rated Python libraries arrow, jsonpath, psutil and tenacity are recommended
  4. Python装饰器
  5. Python实现LDAP认证
  6. Python decorator
  7. Implementing LDAP authentication with Python
  8. Vscode configures Python development environment!
  9. In Python, how dare you say you can't log module? ️
  10. 我收藏的有关Python的电子书和资料
  11. python 中 lambda的一些tips
  12. python中字典的一些tips
  13. python 用生成器生成斐波那契数列
  14. python脚本转pyc踩了个坑。。。
  15. My collection of e-books and materials about Python
  16. Some tips of lambda in Python
  17. Some tips of dictionary in Python
  18. Using Python generator to generate Fibonacci sequence
  19. The conversion of Python script to PyC stepped on a pit...
  20. Python游戏开发,pygame模块,Python实现扫雷小游戏
  21. Python game development, pyGame module, python implementation of minesweeping games
  22. Python实用工具,email模块,Python实现邮件远程控制自己电脑
  23. Python utility, email module, python realizes mail remote control of its own computer
  24. 毫无头绪的自学Python,你可能连门槛都摸不到!【最佳学习路线】
  25. Python读取二进制文件代码方法解析
  26. Python字典的实现原理
  27. Without a clue, you may not even touch the threshold【 Best learning route]
  28. Parsing method of Python reading binary file code
  29. Implementation principle of Python dictionary
  30. You must know the function of pandas to parse JSON data - JSON_ normalize()
  31. Python实用案例,私人定制,Python自动化生成爱豆专属2021日历
  32. Python practical case, private customization, python automatic generation of Adu exclusive 2021 calendar
  33. 《Python实例》震惊了,用Python这么简单实现了聊天系统的脏话,广告检测
  34. "Python instance" was shocked and realized the dirty words and advertisement detection of the chat system in Python
  35. Convolutional neural network processing sequence for Python deep learning
  36. Python data structure and algorithm (1) -- enum type enum
  37. 超全大厂算法岗百问百答(推荐系统/机器学习/深度学习/C++/Spark/python)
  38. 【Python进阶】你真的明白NumPy中的ndarray吗?
  39. All questions and answers for algorithm posts of super large factories (recommended system / machine learning / deep learning / C + + / spark / Python)
  40. [advanced Python] do you really understand ndarray in numpy?
  41. 【Python进阶】Python进阶专栏栏主自述:不忘初心,砥砺前行
  42. [advanced Python] Python advanced column main readme: never forget the original intention and forge ahead
  43. python垃圾回收和缓存管理
  44. java调用Python程序
  45. java调用Python程序
  46. Python常用函数有哪些?Python基础入门课程
  47. Python garbage collection and cache management
  48. Java calling Python program
  49. Java calling Python program
  50. What functions are commonly used in Python? Introduction to Python Basics
  51. Python basic knowledge
  52. Anaconda5.2 安装 Python 库(MySQLdb)的方法
  53. Python实现对脑电数据情绪分析
  54. Anaconda 5.2 method of installing Python Library (mysqldb)
  55. Python implements emotion analysis of EEG data
  56. Master some advanced usage of Python in 30 seconds, which makes others envy it
  57. python爬取百度图片并对图片做一系列处理
  58. Python crawls Baidu pictures and does a series of processing on them
  59. python链接mysql数据库
  60. Python link MySQL database