Sorting dating profiles using machine learning and Python

Pan Chuang AI 2020-11-15 06:23:05

Author | Marco Santos  Compiled by | Flin  Source | towardsdatascience

After endlessly browsing hundreds or even thousands of dating profiles without a single match, people may start to wonder how these profiles end up on their phones at all. None of the profiles are the type they are looking for. They have been swiping for hours or even days without any success. They may start to ask:

"Why do these dating apps keep showing me people I know I'm not a match for?"

To many people, the dating algorithm that decides which profiles to show seems broken, and they are tired of swiping left when they should be matching. Every dating website and app presumably uses its own secret algorithm to optimize matches between users. But sometimes it feels as if it is just showing random users to one another, without any explanation. How can we learn more about this problem, and do something about it? One approach is machine learning.

We can use machine learning to speed up the matchmaking process between users of a dating app. With machine learning, profiles can potentially be clustered together with other similar profiles. This reduces the number of incompatible profiles a user sees. Within these clusters, users can find other users who are more like them.

Clustering the profile data

Using the data from the previous article, we were able to collect the clustered dating profiles in a convenient pandas DataFrame.

In this DataFrame, each row holds one profile. At the end of each row, after Hierarchical Agglomerative Clustering has been applied to the dataset, we can see the cluster group the profile belongs to. Each profile belongs to a specific cluster number, or group.
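As a rough illustration of how such a "Cluster #" column could come about, here is a minimal sketch using scikit-learn's AgglomerativeClustering. The column names, ratings, index numbers and the choice of two clusters are made up for the example; the previous article's actual features and settings are not reproduced here.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Made-up numeric ratings for four profiles; real data would have many more columns.
features = pd.DataFrame(
    {"Movies": [1, 2, 9, 8], "Sports": [1, 1, 8, 9]},
    index=[101, 103, 107, 110],
)

# Agglomerative clustering starts with every profile in its own cluster
# and repeatedly merges the most similar clusters until n_clusters remain.
model = AgglomerativeClustering(n_clusters=2)
features["Cluster #"] = model.fit_predict(features[["Movies", "Sports"]])

print(features)
```

Profiles 101 and 103 have similar ratings and land in one cluster, while 107 and 110 land in the other. The label numbers themselves carry no meaning.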

However, these groups could still use some refinement.

Sorting the clustered profiles

Using the clustered profile data, we can refine the results further by sorting the profiles in each cluster according to how similar they are to one another. The process may be quicker and easier than you think.

import random
import pandas as pd

# `df` and `vectorizer` (a CountVectorizer) come from the earlier clustering step

# Randomly select a cluster
rand_cluster = random.choice(df['Cluster #'].unique())
# Assign the profiles in that cluster to a new DF
group = df[df['Cluster #']==rand_cluster].drop('Cluster #', axis=1)

## Vectorizing the bios in the selected cluster
# Fit the vectorizer to the bios
cluster_x = vectorizer.fit_transform(group['Bios'])
# Create a new DF containing the vectorized words
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cluster_v = pd.DataFrame(cluster_x.toarray(), index=group.index, columns=vectorizer.get_feature_names_out())
# Join the vectorized DF to the original DF
group = group.join(cluster_v)
# Drop the bios, since the vectorized columns replace them
group.drop('Bios', axis=1, inplace=True)

## Finding correlations among the users
# Transpose the DF so that we correlate the indexes (users)
corr_group = group.T.corr()

## Finding the top 10 most similar users
# Randomly select a user
random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')
# Create a DF with the 10 users most similar to the selected user
top_10_sim = corr_group[[random_user]].sort_values(by=[random_user],axis=0, ascending=False)[1:11]
# Print the results
print(top_10_sim)
print("\nThe most similar user to User #", random_user, "is User #", top_10_sim.index[0])

Breaking down the code

Let's break the code down into simple steps, starting with random, which is used throughout the code simply to pick a cluster and a user. This is done so that the code can be applied to any user in the dataset. Once we have randomly selected a cluster, we can narrow the entire dataset down to only the rows that belong to that cluster.
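The selection step can be tried in isolation on a toy DataFrame. The sketch below assumes a "Cluster #" column and a "Bios" column, as in the listing above; the "Movies" column, the index numbers and all values are invented for illustration.

```python
import random

import pandas as pd

# Toy stand-in for the clustered DataFrame; the values are made up.
df = pd.DataFrame(
    {
        "Bios": ["loves hiking", "coffee addict", "hiking and coffee", "movie buff"],
        "Movies": [3, 7, 4, 9],
        "Cluster #": [0, 1, 0, 1],
    },
    index=[101, 102, 103, 104],
)

random.seed(0)  # only to make the sketch repeatable

# Pick one cluster label at random, then keep only the rows in that cluster
# and drop the label column, which is constant within the group.
rand_cluster = random.choice(df["Cluster #"].unique())
group = df[df["Cluster #"] == rand_cluster].drop("Cluster #", axis=1)

print(rand_cluster)
print(group)
```

Note that the filtered rows keep their original index numbers, which is what later lets each profile be referenced by its index.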


After narrowing the data down to the selected cluster group, the next step is to vectorize the bios in that group.

The vectorizer used for this operation is the same one used to create the initial clustered DataFrame: CountVectorizer(). (The vectorizer variable was instantiated earlier, when we vectorized the first dataset, as shown in the previous article.)

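# Fit the vectorizer to the bios
cluster_x = vectorizer.fit_transform(group['Bios'])
# Create a new DF that contains the vectorized words
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cluster_v = pd.DataFrame(cluster_x.toarray(), index=group.index, columns=vectorizer.get_feature_names_out())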

By vectorizing the bios, we create a matrix that records the words in each bio. It is binary as long as no word repeats within a bio, since CountVectorizer() stores word counts by default.

Then we join this vectorized DataFrame to the DataFrame of the selected group/cluster.

# Join the vectorized DF to the original DF
group = group.join(cluster_v)
# Drop the bios, since they are no longer needed
group.drop('Bios', axis=1, inplace=True)

After joining the two DataFrames, we are left with the vectorized bios and the categorical columns.

From here we can start to find the most similar users .

Finding correlations between dating profiles

Having created a DataFrame filled with binary values and numbers, we can start looking for correlations between the dating profiles. Each dating profile has a unique index number that we can use as a reference.

At the start, we had a total of 6,600 dating profiles. After clustering and narrowing the DataFrame down to the selected cluster, the number of dating profiles can range from 100 to 1,000. Throughout the process, the index numbers of the profiles stay the same, so we can now use each index number to refer to a specific dating profile.

Since each index number represents a unique dating profile, we can find similar, or correlated, users for each profile. This is done by running a single line of code that creates a correlation matrix.

corr_group = group.T.corr()

The first thing we need to do is transpose the DataFrame so that the columns and the indexes switch places. This is done so that the correlation method we use is applied to the indexes rather than the columns. Once the DF is transposed, we can apply the .corr() method, which creates a correlation matrix among the indexes.

The correlation matrix contains values calculated with the Pearson correlation method. Values close to 1 indicate a strong positive correlation, which is why each index has a correlation of 1.0000 with itself.
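The effect of transposing before calling .corr() is easiest to see on a tiny table. In this sketch the feature columns, values and index numbers are made up; the only operation taken from the article is group.T.corr().

```python
import pandas as pd

# Toy feature table: one row per profile, one column per feature (made-up values).
group = pd.DataFrame(
    {
        "Movies": [1, 2, 9, 8],
        "Sports": [2, 3, 7, 9],
        "hiking": [1, 1, 0, 0],
        "coffee": [0, 1, 1, 0],
    },
    index=[101, 103, 107, 110],
)

# DataFrame.corr() correlates columns, so transpose first to make each
# profile a column. The result is a profile-by-profile Pearson matrix.
corr_group = group.T.corr()

print(corr_group.round(3))
```

The result is square and symmetric, with both axes labelled by profile index, and every diagonal entry is 1.0 because each profile correlates perfectly with itself.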

Finding the top 10 most similar dating profiles

Now that we have a correlation matrix containing a correlation score for every index/dating profile, we can start sorting the profiles by similarity.

random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')
top_10_sim = corr_group[[random_user]].sort_values(by=[random_user], axis=0, ascending=False)[1:11]
print(top_10_sim)
print("\nThe most similar user to User #", random_user, "is User #", top_10_sim.index[0])

The first line in the block above selects a random dating profile, or user, from the correlation matrix. From there, we select the column for that user and sort its values in descending order, then slice the result so that it returns only the top 10 most correlated users (excluding the selected index itself).
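The [1:11] slice relies on the selected user sorting into first place with a score of 1.0. If another profile is identical to the selected one, the tie could put that profile first instead. A slightly more defensive variant, sketched here with a hypothetical helper named top_similar and a made-up correlation matrix, drops the selected user by label before sorting.

```python
import pandas as pd

# Made-up correlation matrix for four profiles.
corr_group = pd.DataFrame(
    [
        [1.0, 0.9, 0.2, -0.4],
        [0.9, 1.0, 0.1, -0.3],
        [0.2, 0.1, 1.0, 0.5],
        [-0.4, -0.3, 0.5, 1.0],
    ],
    index=[101, 103, 107, 110],
    columns=[101, 103, 107, 110],
)


def top_similar(corr, user, n=10):
    """Return the n profiles most correlated with `user`, excluding `user`.

    Hypothetical helper for illustration; it is not part of the original code.
    """
    return corr[user].drop(user).sort_values(ascending=False).head(n)


print(top_similar(corr_group, 101, n=2))
```

For user 101 this returns profiles 103 and 107, in that order. With fewer than n other profiles in the cluster, head(n) simply returns all of them.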

Success! When we run the code above, we get a list of users ranked by their correlation scores. We can see the top 10 users most similar to the randomly selected user. The code can be run again with a different cluster group and a different profile or user.

If this were a dating app, the user would be able to see the top 10 users most similar to themselves. This would hopefully reduce the time spent swiping, reduce frustration, and increase the number of matches among the users of our hypothetical dating app. The hypothetical app's algorithm would apply unsupervised machine learning clustering to create groups of dating profiles. Within those groups, the algorithm would sort the profiles by their correlation scores. Finally, it would be able to show each user the dating profiles most similar to their own.
