Author: Marco Santos
Compiled by: Flin
Source: towardsdatascience
Swiping through hundreds or even thousands of dating profiles without a single match, people may start to wonder how these profiles end up on their phones at all. None of the profiles are what they are looking for, and after hours or even days of swiping they have found no success. They may start to ask:
“Why do these dating apps show me people I know I'm not a good match for?”
In the eyes of many, the matching algorithms these apps use to surface profiles have failed them, and they are tired of swiping left when they should be matching. Every dating website and app likely uses its own secret matching algorithm to optimize pairings between users. But sometimes it feels as if the app is just showing random users, with no explanation. How can we learn more about this problem, and fight back against it? With a method called machine learning.
We can use machine learning to speed up the matching process between users of a dating app. With machine learning, profiles can potentially be clustered together with other similar profiles, reducing the number of incompatible profiles a user sees. From these clusters, users can find other users who are more like them.
Clustering the profile data
Using the data from the article above, we were able to get the clustered dating profiles into a convenient pandas DataFrame.
In this DataFrame, each row is a profile, and after applying Hierarchical Agglomerative Clustering (https://www.datanovia.com/en/...) to the dataset, a final column shows the cluster group each profile was assigned to. Each profile belongs to a specific cluster number, or group.
However, these groups could use some refinement.
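As a minimal sketch of how a "Cluster #" column like the one above could be produced, here is hierarchical agglomerative clustering applied to a toy DataFrame. The feature columns, values, and cluster count are illustrative assumptions, not the article's actual data.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy profile features (e.g. scaled answers to profile questions);
# the real dataset has many more rows and columns
df = pd.DataFrame({
    'Movies': [0.1, 0.2, 0.9, 0.8, 0.15],
    'Music':  [0.9, 0.8, 0.1, 0.2, 0.85],
})

# Fit hierarchical agglomerative clustering and store the cluster labels
hac = AgglomerativeClustering(n_clusters=2)
df['Cluster #'] = hac.fit_predict(df)

print(df)
```

Each row now carries a cluster number, which is exactly the structure the sorting step below relies on.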
Sorting the clustered profiles
Using the clustered profile data, we can refine the results further by ranking profiles according to how similar they are to one another. The process may be faster and easier than you think.
import random

# Randomly select a cluster
rand_cluster = random.choice(df['Cluster #'].unique())

# Assign the profiles in that cluster to a new DF
group = df[df['Cluster #']==rand_cluster].drop('Cluster #', axis=1)

## Vectorizing the bios in the selected cluster
# Fit the vectorizer to the bios
cluster_x = vectorizer.fit_transform(group['Bios'])

# Create a new DF containing the vectorized words
cluster_v = pd.DataFrame(cluster_x.toarray(), index=group.index, columns=vectorizer.get_feature_names())

# Join the vectorized DF to the original DF
group = group.join(cluster_v)

# Drop the bios, since they are no longer needed once vectorized
group.drop('Bios', axis=1, inplace=True)

## Finding correlations among users
# Transpose the DF so that we correlate the indices (users)
corr_group = group.T.corr()

## Finding the top 10 most similar users
# Randomly select a user
random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')

# Create a DF of the top 10 users most similar to the selected user
top_10_sim = corr_group[[random_user]].sort_values(by=[random_user], axis=0, ascending=False)[1:11]

# Print the results
print(top_10_sim)
print("\nThe most similar user to User #", random_user, "is User #", top_10_sim.index[0])
Breaking down the code
Let's break the code down into simple steps, starting with random. random is used throughout the code to select clusters and users, so that the code can be applied to any user in the dataset. Once a cluster has been randomly selected, we narrow the entire dataset down to only the rows belonging to that cluster.
Vectorization
After narrowing the scope down to the selected cluster group, the next step is to vectorize the bios. The vectorizer used here is the same one used to create the initial clustered DataFrame: CountVectorizer(). (The vectorizer variable was instantiated earlier when we vectorized the first dataset, as shown in the article above.)
# Fit the vectorizer to the bios
cluster_x = vectorizer.fit_transform(group['Bios'])

# Create a new DF containing the vectorized words
cluster_v = pd.DataFrame(cluster_x.toarray(),
                         index=group.index,
                         columns=vectorizer.get_feature_names())
By vectorizing the bios, we create a binary matrix containing the words from each bio.
Then, we join this vectorized DataFrame to the selected group/cluster DataFrame.
# Join the vectorized DF to the original DF
group = group.join(cluster_v)

# Drop the bios, since they are no longer needed
group.drop('Bios', axis=1, inplace=True)
After the two DataFrames are joined, what remains is the vectorized bios alongside the other profile columns:
From here, we can begin to find the most similar users.
Finding correlations between dating profiles
Having created a DataFrame filled with binary values and numbers, we can start looking for correlations between dating profiles. Every dating profile has a unique index number that we can use as a reference.
Initially, we had 6,600 dating profiles in total. After clustering and narrowing the DataFrame down to the selected cluster, the number of dating profiles can range anywhere from 100 to 1,000. Throughout the process, the index numbers of the profiles remain unchanged, so we can use each index number to refer to each dating profile.
With each index number representing a unique dating profile, we can find similar or correlated users for each profile. This is done by running a single line of code that creates a correlation matrix.
corr_group = group.T.corr()
The first thing we need to do is transpose the DataFrame, switching the columns and the index. This is done so that the correlation method we use applies to the indices (the users) rather than the columns. Once the DF is transposed, we can apply the .corr() method, which creates a correlation matrix between the indices.
The correlation matrix contains values computed with the Pearson correlation method. Values close to 1 are strongly positively correlated with each other, which is why an index shows a correlation of 1.0000 with itself.
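A tiny worked example of the transpose-then-correlate trick. The three binary vectors and the index numbers below are made up; they stand in for real user rows.

```python
import pandas as pd

# Three users (index 101-103), each described by three binary word features
group = pd.DataFrame({'w1': [1, 0, 1],
                      'w2': [0, 1, 1],
                      'w3': [1, 1, 0]}, index=[101, 102, 103])

# Transpose so users become columns, then correlate the user vectors pairwise
# (DataFrame.corr() uses Pearson correlation by default)
corr_group = group.T.corr()
print(corr_group)
```

The diagonal is 1.0 because every user is perfectly correlated with themselves; the off-diagonal entries measure how alike two users' word vectors are.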
Finding the top 10 most similar dating profiles
Now that we have a correlation matrix containing a correlation score for every index/dating profile, we can start ranking profiles by their similarity.
random_user = random.choice(corr_group.index)
print("Top 10 most similar users to User #", random_user, '\n')

top_10_sim = corr_group[[random_user]].sort_values(by=[random_user], axis=0, ascending=False)[1:11]

print(top_10_sim)
print("\nThe most similar user to User #", random_user, "is User #", top_10_sim.index[0])
The first line in the block above selects a random dating profile, or user, from the correlation matrix. From there, we select the column for that user and sort the users within it so that it returns only the top 10 most correlated users (excluding the selected index itself).
Success! When we run the code above, we get a list of users ranked by their respective correlation scores: the 10 users most similar to the randomly selected one. This can be run again with another cluster group and another profile or user.
Conclusion
If this were a dating app, users would be able to see the 10 users most similar to themselves. This would hopefully reduce time spent swiping, reduce frustration, and increase matches among our hypothetical dating app's users. The hypothetical app's algorithm would apply unsupervised machine-learning clustering to create groups of dating profiles. Within those groups, it would rank profiles by correlation score. Finally, it would be able to show users the dating profiles most similar to their own.
Link to the original text :https://towardsdatascience.co...