Like and follow, and make it a habit.
In games, chat is practically a required feature, but it comes with a problem: the world channel gets chaotic, full of sensitive words and chatter that game operators would rather not see. We ran into this in our own game, and our company built player reporting and back-end monitoring for it. Today, let's implement that kind of monitoring ourselves.
I'm not strong at deep learning. I've written about reinforcement learning before, but the results weren't particularly satisfying, so this time I'll study a simpler approach.
There are ready-made solutions for this kind of classification task; spam filtering, for example, is the same problem. Several approaches exist, but I chose the simplest one, naive Bayes classification. This is mainly exploratory.
Since most of our games are in Chinese, we need Chinese word segmentation: a sentence like "I'm a handsome guy" has to be broken up into individual words first.
The naive Bayes algorithm determines the category of a new sample by looking up conditional probabilities for the sample's features in an existing data set. It assumes that (1) features are independent of one another and (2) every feature is equally important. You can also think of it as using past frequencies to judge how likely each class is when all the current features hold at once: by Bayes' theorem, the predicted class is the one that maximizes P(class) multiplied by the product of P(feature | class) over the features. The full derivation is easy to find online; a working understanding is enough here.
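To make that concrete, here is a toy, hand-rolled sketch of the calculation. The word counts and class names are invented for illustration; the real script later in this article delegates all of this to NLTK.

```python
# Toy naive Bayes: classify a message as 'ad' or 'normal' from word features.
# All counts below are invented training statistics.
counts = {
    'ad':     {'free': 8, 'win': 6, 'hello': 1},
    'normal': {'free': 1, 'win': 1, 'hello': 9},
}
priors = {'ad': 0.5, 'normal': 0.5}


def score(label, words):
    total = sum(counts[label].values())
    vocab = {w for c in counts.values() for w in c}
    p = priors[label]
    for w in words:
        # Laplace smoothing so an unseen word doesn't zero out the product
        p *= (counts[label].get(w, 0) + 1) / (total + len(vocab))
    return p


msg = ['free', 'win']
best = max(priors, key=lambda label: score(label, msg))
print(best)  # 'ad': those words are far more frequent in the ad class
```

The class with the highest score wins; that's the entire decision rule.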
Use the right algorithm at the right time .
How jieba segments words: jieba is based on a probabilistic language model. The segmentation task is, among all possible segmentations of the input, to find the segmentation scheme S that maximizes P(S).
You can see that jieba ships with a built-in dictionary of phrases; during segmentation these phrases are kept together as the base units.
Note: I've only briefly introduced the principles of the two techniques above. Explaining them fully would take another long article; material is everywhere online, so find something you can follow. If a tool works, use it first.
The best-known Chinese word segmentation package is jieba. Whether it's the best, I can't say, but popularity usually has its reasons, so let's just get going. I won't dig into jieba's internals: solve the problem first, and study the details when a problem forces you to. That learning model is the most efficient.
I've been doing speech-related work recently, and an expert recommended the library NLTK. From the material I've read, it's a well-known, very powerful library for natural language processing. Here I mainly use its classification algorithm, so I don't have to worry about the concrete implementation or reinvent the wheel; my version wouldn't be as good anyway.
Python really is pleasant: packages and ready-made wheels for everything.
Installation commands:

```shell
pip install jieba
pip install nltk
```
Run the two commands above; once they finish, the packages are installed and you can start experimenting.
```python
"""
#Author: Coriander
@time: 2021/8/5 10:26 PM
"""
import jieba

if __name__ == '__main__':
    # "我爱北京天安门" means "I love Tian'anmen Square in Beijing"
    result = " | ".join(jieba.cut("我爱北京天安门,very happy"))
    print(result)
```
Look at the segmentation result: it's really quite good. A dedicated tool is a dedicated tool for a reason.
With these simple tests done, we have basically everything we need, so let's work directly on the code. The plan:
1、 Load the initial text corpora.
2、 Strip punctuation from the text.
3、 Extract features from the text.
4、 Train on the data set to produce a model (the prediction model).
5、 Test new sentences against the model.
```python
#!/usr/bin/env python
# encoding: utf-8
"""
#Author: Coriander
@time: 2021/8/5 9:29 PM
"""
import re

import jieba
from nltk.classify import NaiveBayesClassifier

# Keep only letters, digits, and CJK characters; everything else is stripped
rule = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5]")


def delComa(text):
    """Remove punctuation and other non-word characters."""
    return rule.sub('', text)


def loadData(fileName):
    """Read a corpus file, strip punctuation, and return its jieba tokens."""
    with open(fileName, "r", encoding='utf-8') as f:
        text = delComa(f.read())
    return list(jieba.cut(text))


# Feature extraction: character-level bag-of-words features for one token
def word_feats(words):
    return dict([(word, True) for word in words])


if __name__ == '__main__':
    adResult = loadData(r"ad.txt")          # advertising corpus
    yellowResult = loadData(r"yellow.txt")  # profanity corpus
    ad_features = [(word_feats(lb), 'ad') for lb in adResult]
    yellow_features = [(word_feats(df), 'ye') for df in yellowResult]
    train_set = ad_features + yellow_features
    # Train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    # Analyze a test sentence
    sentence = delComa(input("Please enter a sentence: "))
    print("\n")
    words = list(jieba.cut(sentence))
    print(words)
    # Tally the per-token classifications
    ad = 0
    yellow = 0
    for word in words:
        classResult = classifier.classify(word_feats(word))
        if classResult == 'ad':
            ad += 1
        if classResult == 'ye':
            yellow += 1
    # Report the proportions
    x = float(ad) / len(words)
    y = float(yellow) / len(words)
    print('Probability of advertising: %.2f%%' % (x * 100))
    print('Probability of profanity: %.2f%%' % (y * 100))
```
Take a look at the results of a run.
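The script above counts hard per-token votes, but NLTK's classifier also exposes two handy tools: `show_most_informative_features` to inspect what the model learned, and `prob_classify` to get per-class probabilities you can threshold instead of counting. A small self-contained sketch (the two-sample training set is invented, with the same character-level feature shape as the script above):

```python
from nltk.classify import NaiveBayesClassifier

# Tiny invented training set: "免费" (free) as ad, "你好" (hello) as normal chat
train_set = [
    ({'免': True, '费': True}, 'ad'),
    ({'你': True, '好': True}, 'ye'),
]
classifier = NaiveBayesClassifier.train(train_set)

# Which features most strongly separate the classes?
classifier.show_most_informative_features(5)

# prob_classify returns a probability distribution instead of a hard label
dist = classifier.prob_classify({'免': True})
print(round(dist.prob('ad'), 2), round(dist.prob('ye'), 2))
```

Thresholding `dist.prob(...)` (say, flag anything above 0.9) gives finer control than the hard vote counting in the main script.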
1、 The data source can be changed: monitored chat can be stored in a database and loaded from there.
2、 More categories can be added to make handling easier for customer service, for example advertising, profanity, suggestions to the team, and so on, defined by business needs.
3、 High-probability hits can be handed off to other systems for automatic processing, speeding up response times.
4、 Player reports can be used to grow the data set.
5、 The same idea works for sensitive-word handling: provide a sensitive-word dictionary, then match and detect against it.
6、 It can be wrapped as a web service that the game calls back into.
7、 The model can learn while it predicts: cases that customer service handles manually get labeled and added straight to the data set, so the model keeps learning.
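Point 7 can be sketched very simply. As far as I know NLTK's `NaiveBayesClassifier` has no incremental-update API, so the straightforward approach is to keep the accumulated labeled set and retrain on it whenever customer service labels a new sample; naive Bayes training is cheap enough for that. The sample words and helper below are invented placeholders:

```python
from nltk.classify import NaiveBayesClassifier


def word_feats(word):
    # Character-level features, as in the main script
    return {ch: True for ch in word}


# Seed training set; "免费"/"充值" as ads, "你好"/"谢谢" as normal chat
labeled = [(word_feats(w), 'ad') for w in ('免费', '充值')]
labeled += [(word_feats(w), 'ye') for w in ('你好', '谢谢')]
classifier = NaiveBayesClassifier.train(labeled)


def add_labeled_sample(word, label):
    """Fold a hand-labeled sample back into the data set and retrain."""
    labeled.append((word_feats(word), label))
    return NaiveBayesClassifier.train(labeled)


# Customer service labels a new token as advertising; the model relearns
classifier = add_labeled_sample('广告', 'ad')
```

In production the `labeled` list would live in the database from point 1, and retraining could run on a schedule instead of on every new label.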
1、 Punctuation: if punctuation isn't removed, it gets matched as if it were a feature, which makes no sense.
2、 Encoding: the files were being read as binary, which took a long time to sort out.
3、 Technology choice: at first I wanted to solve this with deep learning, and I looked at some solutions, but training on my machine was too slow, so I chose this approach for practice first.
4、 The code is simple, but the techniques are hard to explain. The code was finished quickly, yet this article took a whole weekend to write.
When you hit a problem, go find a technical solution; once you know the plan, implement it; when you hit a bug, go investigate it. What you can't solve now will come back around later, and every attempt you make is a good chance to learn.
Original content isn't easy to produce; a share and your support would be much appreciated.