## Exercise: a Python collaborative-filtering ALS model implementation — product recommendation + audience expansion

Understanding oneself 2020-11-13 10:09:13

The previous exercise, "Exercises | Douban book recommendation and search, and simple knowledge-engine construction (neo4j)", mentioned several simple ways to build recommendations.
On very large-scale sparse data, however, large-scale models are generally adopted; spark-ALS is one of them.
Here I also want to examine how workable this model is, so everything is tested with a stand-alone version; spark.mllib provides the corresponding distributed version.

The exercise code is available at: mattzheng/pyALS

# 1 The ALS algorithm (Alternating Least Squares)

## 1.1 Theory

ALS is a collaborative-filtering algorithm, integrated into Spark's MLlib library.
For a users-products-ratings dataset, ALS builds an m × n user-product rating matrix, where m is the number of users and n the number of products. Since not every user has rated every product, the matrix is usually sparse:
the rating of user i for product j is often missing. What ALS does is fill in this sparse matrix according to certain rules, so that the rating of any user for any product can be read off the matrix. The filled-in entries are the predicted ratings, and the core of the ALS algorithm is the rule used to fill them in.

Matrix factorization (such as singular value decomposition, or SVD++) maps both items and users into the same latent space, which represents the latent interactions between users and items. The rationale behind matrix factorization is that latent features capture how users rate items. Given the latent descriptions of a user and an item, we can predict how the user will rate items they have not yet rated.
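Concretely, the predicted rating is just the dot product of the user's and the item's latent vectors. A minimal sketch with made-up 2-D latent factors (the values are illustrative, not taken from any real dataset):

```python
import numpy as np

# Hypothetical 2-D latent vectors, e.g. ("luxury index", "basic-need index")
user_vec = np.array([0.9, 0.2])   # a user who leans toward luxury items
item_vec = np.array([0.8, 0.3])   # a luxury-leaning item

# Predicted rating = dot product of the two latent vectors
predicted_rating = user_vec @ item_vec
print(predicted_rating)  # ≈ 0.78
```

ALS learns these latent vectors by alternately fixing one side (users or items) and solving a least-squares problem for the other.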

Advantages:

• Training is supported from raw (user, item, rating) triples
• You do not have to feed in one huge full matrix

Disadvantages:

• New users/items are not supported: user IDs must already exist in the training data
• Incremental training is not supported

Training input:

```python
# [[1, 1, 4.0], [1, 3, 4.0], [1, 6, 4.0], [1, 47, 5.0], [1, 50, 5.0]]
# (user ID, item ID, rating)
```

Any application's user IDs / item IDs will do.

About incremental training:
The article "The implementation of an online book recommendation system, with source code (collaborative filtering)" puts it this way: we borrow the training and prediction functions of Spark's ALS algorithm; every time new data arrives, it is merged into the training dataset, and the ALS model is then retrained.
That sounds like a full retraining rather than true incremental training.

## 1.2 A recommendation scenario at 58.com

Relatively speaking, this method still works in some recommendation scenarios [reference: The application of embedding technology in real-estate recommendation]:

Two kinds of similarity computation are indispensable in these recommendation scenarios:

• the correlation between a user and a listing
• the correlation between two listings.

How are these two correlations computed? First, represent users and listings as vectors; then, by computing the distance between vectors, measure whether a user matches a listing and whether one listing is similar to another.

Both the user matrix and the item matrix have the dimensions "luxury index" and "basic-need index". Of course, these two dimension labels are summarized by hand after the matrix factorization is complete. In fact, the user matrix and the item matrix can be understood as embeddings of the users and the listings. From the user matrix you can see that User1 has a strong preference for luxury homes, so he is not very interested in 550 Yaohua Road. Meanwhile, from the item matrix you can see that the similarity between Tangchen Yipin and Shanghai Kangcheng should be greater than the similarity between Tangchen Yipin and 550 Yaohua Road.
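With made-up 2-D ("luxury index", "basic-need index") embeddings, both kinds of correlation reduce to a vector similarity. The listing names and values below are purely illustrative stand-ins, not 58.com's real model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up ("luxury index", "basic-need index") embeddings
user1 = np.array([0.9, 0.1])             # prefers luxury listings
tangchen_yipin = np.array([0.95, 0.05])  # luxury listing
yaohua_rd_550 = np.array([0.1, 0.9])     # basic-need listing

print(cosine(user1, tangchen_yipin))           # high: user-listing match
print(cosine(user1, yaohua_rd_550))            # low: user-listing mismatch
print(cosine(tangchen_yipin, yaohua_rd_550))   # low: listing-listing similarity
```

The same function covers both cases: user-to-listing and listing-to-listing similarity.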

# 2 pyALS

Here, with thanks to the article "Collaborative filtering (ALS): principle and Python implementation", a hand-written version (als.py) makes small-scale tests easy.
On that basis, I did some testing work.
Training steps:

• Data preprocessing
• Legality check of the variable k
• Generate a random matrix U
• Alternately optimize matrix U and matrix I, printing RMSE information, until the number of iterations reaches max_iter
• Save the final RMSE
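The steps above can be sketched as a minimal dense ALS in NumPy. This is my own toy `als_fit`, not the repository's als.py: fix one factor matrix, solve a ridge-regularized least-squares problem for the other, alternate, and report RMSE on the observed entries:

```python
import numpy as np

def als_fit(R, mask, k=2, max_iter=10, reg=0.1, seed=0):
    """Minimal dense ALS. R is an m x n rating matrix; mask marks observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.random((m, k))  # user factors
    I = rng.random((n, k))  # item factors
    for it in range(max_iter):
        # Fix I: solve a ridge regression for each user's factor vector
        for u in range(m):
            obs = mask[u]
            A = I[obs].T @ I[obs] + reg * np.eye(k)
            b = I[obs].T @ R[u, obs]
            U[u] = np.linalg.solve(A, b)
        # Fix U: solve for each item's factor vector
        for j in range(n):
            obs = mask[:, j]
            A = U[obs].T @ U[obs] + reg * np.eye(k)
            b = U[obs].T @ R[obs, j]
            I[j] = np.linalg.solve(A, b)
        err = (U @ I.T - R)[mask]
        print(f"Iterations: {it + 1}, RMSE: {np.sqrt((err ** 2).mean()):.6f}")
    return U, I

# Toy usage: a 3 x 3 rating matrix with zeros marking missing entries
R = np.array([[4., 0., 5.], [5., 1., 0.], [0., 1., 2.]])
mask = R > 0
U, I = als_fit(R, mask, k=2, max_iter=5)
```

The production-scale version in spark.mllib distributes exactly these per-user and per-item solves across the cluster.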

## 2.1 Recommendation

The data used is [user ID, movie ID, rating].

First, training:

```python
# Load data
path = 'data/movie_ratings.csv'
X = load_movie_ratings(path)  # 100836 ratings
# [[1, 1, 4.0], [1, 3, 4.0], [1, 6, 4.0], [1, 47, 5.0], [1, 50, 5.0]]
# (user ID, item ID, rating)

# Train the model
from ALS.pyALS import ALS
model = ALS()
model.fit(X, k=20, max_iter=2)
>>> Iterations: 1, RMSE: 3.207636
>>> Iterations: 2, RMSE: 0.353680
```

where X is an ordered list, `k` is the dimensionality of the latent representation, and `max_iter` is the number of iterations.
The larger k / max_iter are, the longer training takes.

Then prediction:

```python
# Recommendation
print("Showing the predictions of users...")
# Predictions
user_ids = range(1, 5)
predictions = model.predict(user_ids, n_items=2)
for user_id, prediction in zip(user_ids, predictions):
    _prediction = [format_prediction(item_id, score)
                   for item_id, score in prediction]
    print("User id:%d recommendation: %s" % (user_id, _prediction))
```

## 2.2 Audience expansion

This module actually experiments with using the user embeddings as user vectors to find users with high similarity.
The general procedure is:

• First save both the trained user embeddings (user_embedding) and item embeddings (item_embedding) to .txt files
• Then query for similar users
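The repository's `wordvec_save2txt` helper is not shown here; as an assumption about what it does, a minimal stand-in (my own `save_w2v_txt`) that writes an `{id: vector}` dict in word2vec text format could look like:

```python
def save_w2v_txt(embedding, path, encoding='utf-8'):
    """Write an {id: vector} dict in word2vec text format:
    a 'count dim' header line, then one 'id v1 v2 ...' line per vector."""
    dim = len(next(iter(embedding.values())))
    with open(path, 'w', encoding=encoding) as f:
        f.write(f"{len(embedding)} {dim}\n")
        for key, vec in embedding.items():
            f.write(str(key) + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```

Any file in this format can then be loaded back with gensim's `KeyedVectors.load_word2vec_format`.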

Being lazy here, I use gensim directly to solve the similarity problem.

```python
# Save the user matrix + item matrix the same way word2vec vectors are kept
user_matrix = np.array(model.user_matrix.data)
item_matrix = np.array(model.item_matrix.data)
print(user_matrix.shape, item_matrix.shape)  # ((20, 610), (20, 9724))
user_embedding = {model.user_ids[n]: user_matrix.T[n] for n in range(len(model.user_ids))}
item_embedding = {model.item_ids[n]: item_matrix.T[n] for n in range(len(model.item_ids))}
wordvec_save2txt(user_embedding, save_path='w2v/user_embedding_10w_50k_10i.txt', encoding='utf-8-sig')
wordvec_save2txt(item_embedding, save_path='w2v/item_embedding_10w_50k_10i.txt', encoding='utf-8-sig')
```

Then similar users are found from these user vectors:

```python
embedding = gensim.models.KeyedVectors.load_word2vec_format('w2v/user_embedding_10w_50k_10i.txt', binary=False)
embedding.init_sims(replace=True)  # normalizes the vectors in place: saves memory and enables most_similar
# Vector similarity
item_a = 1
simi = embedding.most_similar(str(item_a), topn=50)
# [('79', 0.9031778573989868),
#  ('27', 0.882379412651062)]
```

Of course, querying user by user is not feasible in practice, because the seed audience and the overall population are both fairly large, so clustering is needed as a first step.
There is still a lot to pay attention to in look-alike modeling.
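As a sketch of that pre-clustering step, using a tiny hand-rolled k-means on random stand-in vectors (the real input would be the saved 20-dimensional user embeddings): cluster all users, then expand the audience to every user sharing a cluster with a seed user.

```python
import numpy as np

def kmeans(X, k=10, n_iter=20, seed=0):
    """Tiny k-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centroids[c] = pts.mean(axis=0)
    return labels

rng = np.random.default_rng(42)
user_vectors = rng.random((500, 20))                 # stand-in for the user embeddings
seed_idx = rng.choice(500, size=30, replace=False)   # known seed audience

labels = kmeans(user_vectors, k=10)
seed_clusters = set(labels[seed_idx].tolist())
expanded = np.where(np.isin(labels, list(seed_clusters)))[0]
print(len(expanded))  # expanded audience: everyone in a cluster containing a seed user
```

In a real pipeline one would tune the number of clusters (or rank candidates by distance to seed centroids) to control how aggressively the audience grows.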