Author: Marco Santos | Compiled by: Flin | Source: towardsdatascience
After endlessly browsing hundreds or even thousands of dating profiles without a single one matching, a person may start to wonder how these profiles even show up on their phone. None of the profiles are the type they are looking for; they have been swiping for hours, even days, without any success. They may start to ask:
"Why do these dating apps show me people I know I'm not compatible with?"
In the eyes of many, the dating algorithm behind the profiles they are shown seems to have failed them; they are tired of swiping left when they should be matching. Every dating site and app likely uses its own secret dating algorithm to optimize the matches between its users, but sometimes it feels as if it is just showing random users to one another, with no explanation. How can we learn more about this problem, and fight back? By using a method called machine learning.
We can use machine learning to speed up the matching process between users in a dating app. With machine learning, profiles can potentially be clustered together with other similar profiles, which cuts down on the number of incompatible profiles. From these clusters, users can then find other users who are more like them.
Using the data from the article above, we were able to successfully cluster the dating profiles in a convenient pandas DataFrame.
In this DataFrame, each row is a profile, and at the end, after applying Hierarchical Agglomerative Clustering (https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/) to the dataset, we can see the cluster group each profile belongs to. Every profile is assigned to a specific cluster number or group.
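For readers who have not seen the previous article, the clustering step can be sketched roughly as follows. This is a minimal, hypothetical illustration with made-up feature data; the column name 'Cluster #' matches the DataFrame used here, but the toy features and n_clusters=2 are illustrative choices, not values from the original article.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy stand-in for the vectorized profile data
df = pd.DataFrame({'hiking': [1, 1, 0, 0],
                   'movies': [0, 0, 1, 1],
                   'travel': [1, 1, 0, 0]})

# Hierarchical agglomerative clustering assigns each profile to a group
hac = AgglomerativeClustering(n_clusters=2)
df['Cluster #'] = hac.fit_predict(df)

print(df['Cluster #'].value_counts())
```

Each profile ends up with a cluster label in the 'Cluster #' column, which is the column the rest of this article relies on.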
However, these groups can still use some refinement.
Using the clustered profile data, we can further refine the results by sorting the profiles according to how similar each one is to the others. The process may be faster and easier than you think.
```python
import random

# Randomly select a cluster
rand_cluster = random.choice(df['Cluster #'].unique())

# Assign the profiles in the chosen cluster to a new DF
group = df[df['Cluster #'] == rand_cluster].drop('Cluster #', axis=1)

## Vectorizing the Bios in the selected cluster

# Fit the vectorizer to the Bios
cluster_x = vectorizer.fit_transform(group['Bios'])

# Create a new DF containing the vectorized words
cluster_v = pd.DataFrame(cluster_x.toarray(),
                         index=group.index,
                         columns=vectorizer.get_feature_names_out())

# Join the vectorized DF to the original DF
group = group.join(cluster_v)

# Drop the Bios, since they are no longer needed in place of vectorization
group.drop('Bios', axis=1, inplace=True)

## Finding correlations among the users

# Transpose the DF so that we correlate the indices (users) rather than the columns
corr_group = group.T.corr()

## Finding the top 10 most similar users

# Randomly select a user
random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')

# Create a DF of the 10 users most similar to the selected user
top_10_sim = corr_group[[random_user]].sort_values(by=[random_user],
                                                   axis=0,
                                                   ascending=False)[1:11]

# Print the results
print(top_10_sim)
print("\nThe most similar user to User #", random_user,
      "is User #", top_10_sim.index[0])
```
Let's break the code down into simple steps, starting with random. random is used throughout the code simply to select clusters and users, so that our code can be applied to any user in the dataset. Once we have randomly selected a cluster, we can narrow the entire dataset down to only the rows belonging to the selected cluster.
After narrowing the scope down to the selected cluster group, the next step involves vectorizing the bios. The vectorizer used for this operation is the same one used to create the initial clustered DataFrame: CountVectorizer(). (The vectorizer variable was instantiated earlier, when we vectorized the first dataset, as can be seen in the article above.)
```python
# Fit the vectorizer to the Bios
cluster_x = vectorizer.fit_transform(group['Bios'])

# Create a new DF containing the vectorized words
cluster_v = pd.DataFrame(cluster_x.toarray(),
                         index=group.index,
                         columns=vectorizer.get_feature_names_out())
```
By vectorizing the Bios, we create a binary matrix containing the words in each bio.
Then, we join this vectorized DataFrame to the DataFrame of the selected group/cluster.
```python
# Join the vectorized DF to the original DF
group = group.join(cluster_v)

# Drop the Bios, since they are no longer needed
group.drop('Bios', axis=1, inplace=True)
```
Once the two DataFrames are joined, what remains are the vectorized bios and the other profile columns:
From here, we can start to find the most similar users.
Having created a DataFrame filled with binary values and numbers, we can start looking for correlations among the dating profiles. Every dating profile has a unique index number, which we can use as a reference.
Initially, we had 6,600 dating profiles in all. After clustering and narrowing the DataFrame down to the selected cluster, the number of dating profiles can range anywhere from 100 to 1,000. Throughout the whole process, the index numbers of the profiles remain unchanged. Now we can use each index number to refer to each dating profile.
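The index preservation mentioned above is standard pandas behavior, and it is what lets us keep referring to each profile by its original number after filtering. A quick demonstration with made-up cluster labels:

```python
import pandas as pd

# Toy DataFrame: five profiles with assigned cluster labels
df = pd.DataFrame({'Cluster #': [0, 1, 0, 1, 1]})

# Boolean filtering keeps each surviving row's original index label
group = df[df['Cluster #'] == 1]
print(list(group.index))  # -> [1, 3, 4]
```

The rows of the filtered DataFrame still carry the labels 1, 3, and 4 rather than being renumbered from zero.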
With each index number representing a unique dating profile, we can find similar or correlated users for each one. This is done by running a single line of code to create a correlation matrix.
```python
corr_group = group.T.corr()
```
The first thing we need to do is transpose the DataFrame to switch the columns and indices. This is done so that the correlation method we use applies to the indices (users) rather than the columns. Once we have transposed the DF, we can apply the .corr() method, which creates a correlation matrix among the indices.
The correlation matrix contains values calculated using the Pearson correlation method. Values close to 1 are positively correlated with each other, which is why you will see 1.0000 wherever an index is correlated with its own index.
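To make this concrete, here is a compact check, with toy numbers rather than real profile data, that .T.corr() yields Pearson correlations between rows, with each row perfectly correlated with itself:

```python
import pandas as pd

# Three toy "users" (rows) with three binary features, indexed 10-12
group = pd.DataFrame({'a': [1, 0, 1],
                      'b': [0, 1, 0],
                      'c': [1, 0, 1]},
                     index=[10, 11, 12])

# Transposing makes the rows (users) the things being correlated
corr_group = group.T.corr()
print(corr_group.loc[10, 10])  # 1.0 on the diagonal
print(corr_group.loc[10, 12])  # users 10 and 12 are identical -> 1.0
```

Users 10 and 12 have identical feature rows, so their Pearson correlation is 1.0, while user 11 is the exact opposite and comes out perfectly negatively correlated.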
Now that we have a correlation matrix containing a correlation score for every index/dating profile, we can start sorting the profiles by their similarity.
```python
# Randomly select a user
random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')

# Create a DF of the 10 users most similar to the selected user
top_10_sim = corr_group[[random_user]].sort_values(by=[random_user],
                                                   axis=0,
                                                   ascending=False)[1:11]

print(top_10_sim)
print("\nThe most similar user to User #", random_user,
      "is User #", top_10_sim.index[0])
```
The first line in the block above selects a random dating profile, or user, from the correlation matrix. From there, we select the column for the chosen user and sort the users in that column so that only the top 10 most correlated users are returned (excluding the selected index itself).
Success! When we run the code above, we get a list of users ranked by their respective correlation scores: the 10 users most similar to the randomly selected one. This can be run again with another cluster group and another profile or user.
If this were a dating app, users would be able to see the top 10 users most similar to themselves. This would hopefully reduce the time spent swiping, lower frustration, and increase matches among our hypothetical dating app's users. The hypothetical app's algorithm would use unsupervised machine-learning clustering to create groups of dating profiles. Within those groups, the algorithm would sort the profiles by their correlation scores. Finally, it would be able to show users the dating profiles most similar to their own.
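The whole pipeline described above can be condensed into one small helper. This is a hedged sketch, not the article's actual implementation: the function name top_matches and the toy data are invented for illustration, and it assumes the profiles have already been vectorized and assigned a 'Cluster #' column.

```python
import pandas as pd

def top_matches(df, user, n=10):
    """Rank the other users in `user`'s cluster by Pearson correlation."""
    # Keep only the profiles in the same cluster as the selected user
    cluster = df.loc[df['Cluster #'] == df.loc[user, 'Cluster #']]
    # Correlate users (rows) against one another
    corr = cluster.drop('Cluster #', axis=1).T.corr()
    # Sort by similarity to `user`, skipping the user itself
    return corr[user].drop(user).sort_values(ascending=False).head(n)

# Toy clustered, vectorized profiles
df = pd.DataFrame({'hiking': [1, 1, 0, 1],
                   'movies': [0, 0, 1, 0],
                   'travel': [1, 0, 1, 1],
                   'Cluster #': [0, 0, 1, 0]})

print(top_matches(df, user=0, n=2))
```

For user 0, the function first narrows the data to cluster 0 (users 0, 1, and 3) and then ranks the other two by correlation, with user 3's identical feature row placing it first.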
Link to the original article: https://towardsdatascience.com/sorting-dating-profiles-with-machine-learning-and-python-51db7a074a25