## Exercise: a Python collaborative-filtering ALS model implementation — product recommendation + audience expansion

Understanding oneself 2020-11-13 10:09:13

The previous exercise, "Exercises | Douban book recommendation and search, and simple knowledge-engine construction (neo4j)", mentioned several simple ways to build recommendations.
On very large-scale sparse data, however, large-scale models are generally adopted; spark-ALS is one of them.
Here I also want to examine how workable this model is, so everything is tested with a stand-alone version; spark.mllib provides the corresponding distributed version.

The exercise code is available at: mattzheng/pyALS

# 1 The ALS algorithm (Alternating Least Squares)

## 1.1 Theory

ALS is a collaborative-filtering algorithm, integrated into Spark's MLlib library.
For a users-products-ratings dataset, ALS builds an m × n user-product rating matrix, where m is the number of users and n the number of products. Since not every user has rated every product, the matrix is usually sparse:
the rating of user i for product j is often missing. What ALS does is fill in this sparse matrix according to certain rules, so that the rating of any user for any product can be read off the matrix. The filled-in entries are the predicted ratings, and the core of the ALS algorithm is the rule used to fill them in.

Matrix factorization (such as singular value decomposition, or SVD++) maps both items and users into the same latent space, which represents the latent interactions between users and items. The rationale behind matrix factorization is that latent features capture how users rate items. Given the latent descriptions of a user and an item, we can predict how the user will rate items they have not yet rated.
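Concretely, the predicted rating is just the dot product of the user's and the item's latent vectors. A minimal sketch with made-up 2-D latent factors (the values are illustrative, not taken from any real dataset):

```python
import numpy as np

# Hypothetical 2-D latent vectors, e.g. ("luxury index", "basic-need index")
user_vec = np.array([0.9, 0.2])   # a user who leans toward luxury items
item_vec = np.array([0.8, 0.3])   # a luxury-leaning item

# Predicted rating = dot product of the two latent vectors
predicted_rating = user_vec @ item_vec
print(predicted_rating)  # ≈ 0.78
```

ALS learns these latent vectors by alternately fixing one side (users or items) and solving a least-squares problem for the other.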

Advantages:

• Training is supported from raw (user, item, rating) triples
• You do not have to feed in one huge full matrix

Disadvantages:

• New users/items are not supported: user IDs must already exist in the training data
• Incremental training is not supported

Training input:

```python
# [[1, 1, 4.0], [1, 3, 4.0], [1, 6, 4.0], [1, 47, 5.0], [1, 50, 5.0]]
# (user ID, item ID, rating)
```

Any application's user IDs / item IDs will do.

About incremental training:
The article "The implementation of an online book recommendation system, with source code (collaborative filtering)" puts it this way: we borrow the training and prediction functions of Spark's ALS algorithm; every time new data arrives, it is merged into the training dataset, and the ALS model is then retrained.
That sounds like a full retraining rather than true incremental training.

## 1.2 A recommendation scenario at 58.com

Relatively speaking, this method still works in some recommendation scenarios [reference: The application of embedding technology in real-estate recommendation]:

Two kinds of similarity computation are indispensable in these recommendation scenarios:

• the correlation between a user and a listing
• the correlation between two listings.

How are these two correlations computed? First, represent users and listings as vectors; then, by computing the distance between vectors, measure whether a user matches a listing and whether one listing is similar to another.

Both the user matrix and the item matrix have the dimensions "luxury index" and "basic-need index". Of course, these two dimension labels are summarized by hand after the matrix factorization is complete. In fact, the user matrix and the item matrix can be understood as embeddings of the users and the listings. From the user matrix you can see that User1 has a strong preference for luxury homes, so he is not very interested in 550 Yaohua Road. Meanwhile, from the item matrix you can see that the similarity between Tangchen Yipin and Shanghai Kangcheng should be greater than the similarity between Tangchen Yipin and 550 Yaohua Road.
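With made-up 2-D ("luxury index", "basic-need index") embeddings, both kinds of correlation reduce to a vector similarity. The listing names and values below are purely illustrative stand-ins, not 58.com's real model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up ("luxury index", "basic-need index") embeddings
user1 = np.array([0.9, 0.1])             # prefers luxury listings
tangchen_yipin = np.array([0.95, 0.05])  # luxury listing
yaohua_rd_550 = np.array([0.1, 0.9])     # basic-need listing

print(cosine(user1, tangchen_yipin))           # high: user-listing match
print(cosine(user1, yaohua_rd_550))            # low: user-listing mismatch
print(cosine(tangchen_yipin, yaohua_rd_550))   # low: listing-listing similarity
```

The same function covers both cases: user-to-listing and listing-to-listing similarity.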

# 2 pyALS

Here, with thanks to the article "Collaborative filtering (ALS): principle and Python implementation", a hand-written version (als.py) makes small-scale tests easy.
On that basis, I did some testing work.
Training steps:

• Data preprocessing
• Legality check of the variable k
• Generate a random matrix U
• Alternately optimize matrix U and matrix I, printing RMSE information, until the number of iterations reaches max_iter
• Save the final RMSE
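The steps above can be sketched as a minimal dense ALS in NumPy. This is my own toy `als_fit`, not the repository's als.py: fix one factor matrix, solve a ridge-regularized least-squares problem for the other, alternate, and report RMSE on the observed entries:

```python
import numpy as np

def als_fit(R, mask, k=2, max_iter=10, reg=0.1, seed=0):
    """Minimal dense ALS. R is an m x n rating matrix; mask marks observed entries."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.random((m, k))  # user factors
    I = rng.random((n, k))  # item factors
    for it in range(max_iter):
        # Fix I: solve a ridge regression for each user's factor vector
        for u in range(m):
            obs = mask[u]
            A = I[obs].T @ I[obs] + reg * np.eye(k)
            b = I[obs].T @ R[u, obs]
            U[u] = np.linalg.solve(A, b)
        # Fix U: solve for each item's factor vector
        for j in range(n):
            obs = mask[:, j]
            A = U[obs].T @ U[obs] + reg * np.eye(k)
            b = U[obs].T @ R[obs, j]
            I[j] = np.linalg.solve(A, b)
        err = (U @ I.T - R)[mask]
        print(f"Iterations: {it + 1}, RMSE: {np.sqrt((err ** 2).mean()):.6f}")
    return U, I

# Toy usage: a 3 x 3 rating matrix with zeros marking missing entries
R = np.array([[4., 0., 5.], [5., 1., 0.], [0., 1., 2.]])
mask = R > 0
U, I = als_fit(R, mask, k=2, max_iter=5)
```

The production-scale version in spark.mllib distributes exactly these per-user and per-item solves across the cluster.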

## 2.1 Recommendation

The data used is [user ID, movie ID, rating].

First, training:

```python
# Load data
path = 'data/movie_ratings.csv'
X = load_movie_ratings(path)  # 100836 ratings
# [[1, 1, 4.0], [1, 3, 4.0], [1, 6, 4.0], [1, 47, 5.0], [1, 50, 5.0]]
# (user ID, item ID, rating)

# Train the model
from ALS.pyALS import ALS
model = ALS()
model.fit(X, k=20, max_iter=2)
>>> Iterations: 1, RMSE: 3.207636
>>> Iterations: 2, RMSE: 0.353680
```

where X is an ordered list, `k` is the dimensionality of the latent representation, and `max_iter` is the number of iterations.
The larger k / max_iter are, the longer training takes.

Then prediction:

```python
# Recommendation
print("Showing the predictions of users...")
# Predictions
user_ids = range(1, 5)
predictions = model.predict(user_ids, n_items=2)
for user_id, prediction in zip(user_ids, predictions):
    _prediction = [format_prediction(item_id, score)
                   for item_id, score in prediction]
    print("User id:%d recommendation: %s" % (user_id, _prediction))
```

## 2.2 Audience expansion

This module actually experiments with using the user embeddings as user vectors to find users with high similarity.
The general procedure is:

• First save both the trained user embeddings (user_embedding) and item embeddings (item_embedding) to .txt files
• Then query for similar users
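The repository's `wordvec_save2txt` helper is not shown here; as an assumption about what it does, a minimal stand-in (my own `save_w2v_txt`) that writes an `{id: vector}` dict in word2vec text format could look like:

```python
def save_w2v_txt(embedding, path, encoding='utf-8'):
    """Write an {id: vector} dict in word2vec text format:
    a 'count dim' header line, then one 'id v1 v2 ...' line per vector."""
    dim = len(next(iter(embedding.values())))
    with open(path, 'w', encoding=encoding) as f:
        f.write(f"{len(embedding)} {dim}\n")
        for key, vec in embedding.items():
            f.write(str(key) + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```

Any file in this format can then be loaded back with gensim's `KeyedVectors.load_word2vec_format`.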

Being lazy here, I use gensim directly to solve the similarity problem.

```python
# Save the user matrix + item matrix the same way word2vec vectors are kept
user_matrix = np.array(model.user_matrix.data)
item_matrix = np.array(model.item_matrix.data)
print(user_matrix.shape, item_matrix.shape)  # ((20, 610), (20, 9724))
user_embedding = {model.user_ids[n]: user_matrix.T[n] for n in range(len(model.user_ids))}
item_embedding = {model.item_ids[n]: item_matrix.T[n] for n in range(len(model.item_ids))}
wordvec_save2txt(user_embedding, save_path='w2v/user_embedding_10w_50k_10i.txt', encoding='utf-8-sig')
wordvec_save2txt(item_embedding, save_path='w2v/item_embedding_10w_50k_10i.txt', encoding='utf-8-sig')
```

Then similar users are found from these user vectors:

```python
embedding = gensim.models.KeyedVectors.load_word2vec_format('w2v/user_embedding_10w_50k_10i.txt', binary=False)
embedding.init_sims(replace=True)  # normalizes the vectors in place: saves memory and enables most_similar
# Vector similarity
item_a = 1
simi = embedding.most_similar(str(item_a), topn=50)
# [('79', 0.9031778573989868),
#  ('27', 0.882379412651062)]
```

Of course, querying user by user is not feasible in practice, because the seed audience and the overall population are both fairly large, so clustering is needed as a first step.
There is still a lot to pay attention to in look-alike modeling.
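As a sketch of that pre-clustering step, using a tiny hand-rolled k-means on random stand-in vectors (the real input would be the saved 20-dimensional user embeddings): cluster all users, then expand the audience to every user sharing a cluster with a seed user.

```python
import numpy as np

def kmeans(X, k=10, n_iter=20, seed=0):
    """Tiny k-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centroids[c] = pts.mean(axis=0)
    return labels

rng = np.random.default_rng(42)
user_vectors = rng.random((500, 20))                 # stand-in for the user embeddings
seed_idx = rng.choice(500, size=30, replace=False)   # known seed audience

labels = kmeans(user_vectors, k=10)
seed_clusters = set(labels[seed_idx].tolist())
expanded = np.where(np.isin(labels, list(seed_clusters)))[0]
print(len(expanded))  # expanded audience: everyone in a cluster containing a seed user
```

In a real pipeline one would tune the number of clusters (or rank candidates by distance to seed centroids) to control how aggressively the audience grows.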