## [Python Artificial Intelligence] 23. Sentiment classification based on machine learning and TF-IDF (with detailed NLP data cleaning)

Eastmount 2020-11-13 00:01:10
Tags: python, artificial intelligence, sentiment classification

Starting with this column, the author studies Python deep learning, neural networks, and artificial intelligence. The previous article shared the process of sentiment analysis and classification based on a custom sentiment dictionary (the Dalian University of Technology sentiment lexicon). This article explains the natural language processing pipeline in detail, presents a sentiment classification method based on machine learning and TF-IDF, and compares several classification algorithms (SVM, RF, LR, Boosting). It mainly draws on the author's book 《Python Network Data Crawling and Analysis from Beginner to Proficient (Analysis Volume)》 and walks through the basic steps of Chinese text analysis in Python. This is a basic article, and I hope it is of some help to you.

This column mainly combines the author's previous blog posts, AI experience, and related videos and papers; later installments will cover more Python artificial intelligence cases and applications. These are basic articles, and I hope they help you; if there are mistakes or shortcomings, please bear with me. As a rookie in artificial intelligence, I hope to grow together with you through these blogs. After blogging for so many years, this is my first attempt at a paid column, but most posts, especially the basic ones, will continue to be shared for free. The column itself will be written carefully, to be worthy of its readers. Let's encourage each other!

TF download address: https://github.com/eastmountyxz/AI-for-TensorFlow
Keras download address: https://github.com/eastmountyxz/AI-for-Keras
Sentiment analysis code: https://github.com/eastmountyxz/Sentiment-Analysis

The author's five other Python series are also recommended. Since 2014, the author has mainly written three Python series: basic knowledge, web crawlers, and data analysis. In 2018, the Python image recognition and Python AI columns were added.

Data analysis and data mining usually involve preparation, data crawling, data preprocessing, data analysis, data visualization, evaluation, and other steps. The work before the analysis itself takes up almost half of a data engineer's time, and the quality of preprocessing directly affects the quality of the subsequent model analysis. The basic steps of data preprocessing include Chinese word segmentation, part-of-speech tagging, data cleaning, feature extraction (vector space model storage), and weight calculation (TF-IDF).

# I. Chinese Word Segmentation

After readers crawl a Chinese dataset with Python, the first step is to segment it into words. English words are separated by spaces, so text can be split directly on whitespace and no segmentation is needed. Chinese characters, by contrast, are written continuously, with semantics spanning them and no obvious boundaries between words, so we need Chinese word segmentation technology to split the sentences in the corpus into space-separated sequences of words. The following introduces Chinese word segmentation and the Jieba segmentation tool in detail.

Chinese word segmentation (Chinese Word Segmentation) refers to cutting a sequence of Chinese characters into individual words or word strings. It inserts separators, usually spaces, into a Chinese string that has no word boundaries. Here is a simple example: segmenting the sentence 「我是程序员」 ("I am a programmer").

Input: 我是程序员 (I am a programmer)
Output 1 (character level): 我 \ 是 \ 程 \ 序 \ 员
Output 2 (bigrams): 我是 \ 是程 \ 程序 \ 序员
Output 3 (word level): 我 \ 是 \ 程序员


Here is a quick example: the code imports the Jieba package and calls its functions to segment Chinese text.

#encoding=utf-8
import jieba

text = "北京理工大学生前来应聘"    # "A Beijing Institute of Technology student came to apply"

data = jieba.cut(text, cut_all=True)      # full mode
print("[Full mode]: ", " ".join(data))

data = jieba.cut(text, cut_all=False)     # precise mode
print("[Precise mode]: ", " ".join(data))

data = jieba.cut(text)                    # precise mode is the default
print("[Default mode]: ", " ".join(data))

data = jieba.cut_for_search(text)         # search engine mode
print("[Search engine mode]: ", " ".join(data))


The code prints the segmentation results of the full mode, the precise mode, and the search engine mode.
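For reference, jieba's own documentation illustrates the difference between the first two modes with the sentence 「我来到北京清华大学」:

[Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
[Precise mode]: 我/ 来到/ 北京/ 清华大学

The search engine mode further re-segments long words on top of the precise mode, which improves recall when building search indexes.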

# II. Data Cleaning

When analyzing a corpus, dirty data or noisy phrases usually interfere with the experimental results, so data cleaning (Data Cleaning) is needed after word segmentation. For example, after segmenting with the Jieba tool, the corpus may still contain dirty data or stop words such as 「我们」 (we), 「的」 (of), and 「吗」 (a question particle). These words lower the quality of the data; to get better analysis results, the dataset must be cleaned and stop words filtered out. Common problems include:

• Incomplete data
• Duplicate data
• Wrong data
• Stop words

Here we focus on stop-word filtering: deleting words that appear frequently but do not affect the topic of the text. During Jieba word segmentation, we load a stop-word dictionary stop_words.txt; if a segmented word appears in it, it is filtered out.

The following example uses comments about Huangguoshu Waterfall crawled from Dianping, Meituan, and other review websites, segmented with the Jieba tool.

• Positive reviews: 5,000
• Negative reviews: 1,000

Complete code:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse

# Load the custom user dictionary and the stop-word list
jieba.load_userdict("user_dict.txt")
stop_list = pd.read_csv('stop_words.txt',
                        engine='python',
                        encoding='utf-8',
                        delimiter="\n",
                        names=['t'])['t'].tolist()

# Chinese word segmentation with stop-word filtering
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# Open the output file for the segmentation results
# (note: the later scripts read this file as UTF-8, so utf-8 may be a safer encoding here)
fw = open('fenci_data.csv', "a+", newline='', encoding='gb18030')
writer = csv.writer(fw)
writer.writerow(['content', 'label'])

# Read the raw comments with csv.DictReader
labels = []
contents = []
file = "data.csv"
with open(file, "r", encoding="UTF-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Map the label: '好评' (positive review) -> 0, otherwise -> 1
        if row['label'] == '好评':
            res = 0
        else:
            res = 1
        labels.append(res)
        content = row['content']
        seglist = txt_cut(content)
        output = ' '.join(list(seglist))    # join the words with spaces
        contents.append(output)
        # Write one row per comment
        tlist = []
        tlist.append(output)
        tlist.append(res)
        writer.writerow(tlist)

print(labels[:5])
print(contents[:5])
fw.close()


The results are shown in the figure below. On the one hand, special punctuation and stop words are filtered out; on the other hand, the user_dict.txt dictionary is imported so that proper nouns such as 「黄果树瀑布」 (Huangguoshu Waterfall) and 「景区」 (scenic spot) are kept as single tokens; otherwise they might be split into 「黄果树」 and 「瀑布」, or 「景」 and 「区」.
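As a hedged illustration (the exact contents of the author's user_dict.txt are not shown in the source), a jieba user dictionary lists one term per line, optionally followed by a frequency and a part-of-speech tag:

黄果树瀑布 5 ns
陡坡塘瀑布 5 ns
天星桥景区 5 ns
景区 5 n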

• Before data cleaning

I remember that as a kid I always waited in front of the TV for Journey to the West to air. "You carry the load, I lead the horse, crossing mountains and fording rivers with both shoulders..." The familiar song rings out again. The water in the lyrics is the water of Guizhou; to be precise, it is Huangguoshu Waterfall in Guizhou. That curtain of waterfall flowed into our childhood and lingers with us. Huangguoshu is not just one waterfall but a large scenic area, including Doupotang Waterfall, the Tianxingqiao scenic spot, and Huangguoshu Waterfall itself, the most famous of them.

• After data cleaning

Remember When I was a child keep The TV front Wait Journey to the west Broadcast pick Dan Pull Horse Climb the mountain Wade in the water Two shoulders Double slide be familiar with song round the ear sound when The lyrics in water guizhou water accuracy say guizhou Huangguoshu Waterfall That curtain The waterfall Influx childhood linger Huangguoshu Waterfall The waterfall The scenic spot Include steep slope Pond The waterfall Star Bridge The scenic spot Huangguoshu The waterfall Huangguoshu The waterfall famous

# III. Feature Extraction and TF-IDF Calculation

## 1. Basic concepts

Weight calculation measures the importance of feature terms in document representation by assigning each feature word a weight. TF-IDF (Term Frequency-Inverse Document Frequency) is a classical weighting technique used in data analysis and information processing. It computes the importance of a feature word in the whole corpus from the word's frequency within a text and its document frequency across the corpus. Its advantage is that it filters out common but unimportant words while keeping as many high-influence feature words as possible.

The TF-IDF value is the product of the term frequency TF and the inverse document frequency IDF:

$$TF\text{-}IDF_{i,j} = TF_{i,j} \times IDF_i$$

The TF-IDF weight is proportional to the word's frequency in the document and inversely proportional to the number of documents in the corpus that contain the word. The higher the TF-IDF value, the more important the feature word is to the text.

The term frequency TF is computed as

$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of feature word $t_i$ in text $D_j$, and the denominator is the total number of feature words in $D_j$; the result is the frequency of the feature word within the text.

Inverse Document Frequency (IDF) was proposed by Spärck Jones in 1972 and is a classical method for weighting words against documents. It is computed as

$$IDF_i = \log\frac{|D|}{|D_i|}$$

where $|D|$ is the total number of texts in the corpus and $|D_i|$ is the number of texts that contain the feature word $t_i$.

With inverse document frequency, the weight varies inversely with the word's document frequency. Common words such as 「我们」 (we), 「但是」 (however), and 「的」 (of) appear in nearly all documents, so their IDF values are very low; if a word appears in every document, then $\log 1 = 0$ and it contributes nothing, which suppresses such common words. Conversely, a word like "artificial intelligence" that appears many times in one document but rarely elsewhere receives a high weight and strong discriminating power.

The core idea of TF-IDF is that if a feature word has a high frequency TF in one article but rarely appears in other articles, it is considered to have good class-discriminating ability and is suitable for weighting. The TF-IDF algorithm is simple and fast, its results match intuition, and it is a common tool in text mining, sentiment analysis, topic distribution, and other areas.
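As a hedged, toy illustration of the formulas above (the corpus and term are invented for the example; note that the logarithm base is a convention, and sklearn's implementation additionally applies smoothing and normalization):

import math

# Toy corpus of three segmented documents (invented for illustration)
corpus = [["瀑布", "壮观", "瀑布"],
          ["景区", "人多"],
          ["瀑布", "景区", "门票"]]
term = "瀑布"
doc = corpus[0]

tf = doc.count(term) / len(doc)              # TF = 2/3: the term appears twice among 3 words
df = sum(1 for d in corpus if term in d)     # the term appears in 2 of the 3 documents
idf = math.log(len(corpus) / df)             # IDF = ln(3/2) ≈ 0.405
print(tf * idf)                              # TF-IDF ≈ 0.270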

## 2. Code implementation

In Scikit-Learn we mainly use two classes, CountVectorizer and TfidfTransformer, to compute word frequencies and TF-IDF values.

• CountVectorizer
This class converts the words in the texts into a word-frequency matrix. For example, the text "I am a teacher" contains four words, each with frequency 1. For a corpus of M texts and N distinct words, CountVectorizer generates a matrix a[M][N], where a[i][j] is the frequency of word j in text i. Calling fit_transform() counts the occurrences of each word, and get_feature_names() returns all keywords in the vocabulary.

• TfidfTransformer
After CountVectorizer has produced the word-frequency matrix, the TfidfTransformer class computes the TF-IDF value of every word. The TF-IDF values are stored as a matrix in which each row represents one text of the corpus and each column holds the weight of one feature. With the TF-IDF matrix in hand, various analysis algorithms can be applied, such as cluster analysis, LDA topic distribution, and public opinion analysis. A minimal sketch of both classes on a toy corpus follows this list.
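The sketch below runs both classes on the "I am a teacher" example; the second sentence and the token_pattern tweak are assumptions added for illustration (the default pattern drops single-character tokens such as "I" and "a"):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus; token_pattern relaxed so one-letter words are kept
corpus = ["I am a teacher", "I am a student"]
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = cv.fit_transform(corpus)      # the word-frequency matrix a[M][N]
print(cv.get_feature_names())          # ['a', 'am', 'i', 'student', 'teacher']
print(counts.toarray())                # [[1 1 1 0 1], [1 1 1 1 0]]

tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))        # "teacher"/"student" (one text each) get
                                       # higher weights than the shared words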

Complete code:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])    # 0 = positive, 1 = negative
        contents.append(row['content'])
print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word-frequency matrix; element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer()
# This class computes the tf-idf weights
transformer = TfidfTransformer()
# The inner fit_transform builds the word-frequency matrix; the outer one computes tf-idf
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()    # sparse matrix; note the float32 dtype
print(X.shape)
print(X[:10])


The output is as follows:

<class 'scipy.sparse.csr.csr_matrix'>
aaaaa
achievements
amazing
ananananan
ancient
anshun
aperture
app
Number of words : 20254
(6074, 20254)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]


## 3. MemoryError: running out of memory

When the dataset is large, a dense matrix often cannot hold that much data, and errors like the following occur:

• ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.
• MemoryError: Unable to allocate array with shape (26771, 69602) and data type float64

The solutions I offer are as follows:

• Stop-word filtering to remove unnecessary feature words.
• Sparse matrices: the scipy package provides sparse matrix support; use coo_matrix(tfidf, dtype=np.float32) to convert tfidf (see the sketch after this list).
• CountVectorizer(min_df=5): increase the min_df parameter to filter out low-frequency feature words; this parameter can be tuned. max_df removes terms that appear too often, so-called corpus-specific stop words; the default max_df is 1.0, i.e., only terms appearing in 100% of documents are ignored. min_df removes infrequent terms; min_df=5 ignores terms appearing in fewer than 5 documents.
• Use a GPU or add more memory.
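A hedged sketch of the sparse-matrix route, assuming contents is the segmented corpus loaded by the earlier script (the max_df value here is illustrative): since most sklearn classifiers accept sparse input directly, the dense toarray() call can be avoided entirely.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# min_df=5 drops words appearing in fewer than 5 documents;
# max_df=0.9 drops words appearing in more than 90% of documents
vectorizer = CountVectorizer(min_df=5, max_df=0.9)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
print(tfidf.shape)    # stays a sparse CSR matrix; no MemoryError from toarray()
# LogisticRegression, LinearSVC, MultinomialNB, SGDClassifier, etc.
# accept this sparse matrix directly in fit()/predict().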

# IV. Sentiment Classification Based on Logistic Regression

Having obtained the TF-IDF values of the texts, this section briefly explains how to use them for sentiment classification, which mainly involves the following steps:

• Generate the word-frequency matrix after Chinese word segmentation and data cleaning, mainly by calling the CountVectorizer class; the resulting matrix is X.
• Call the TfidfTransformer class to compute the TF-IDF values of the word-frequency matrix X, obtaining the weight matrix.
• Call a Sklearn machine-learning classifier, train it with fit(), and assign the predicted class labels to the pre array.
• Call Sklearn's PCA() to reduce the features to two dimensions, corresponding to the X and Y axes, for visualization (a hedged sketch of this step is given after the code below).
• Optimize and evaluate the algorithm.

Complete logistic regression code:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])    # 0 = positive, 1 = negative
        contents.append(row['content'])
print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word-frequency matrix; element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)
# This class computes the tf-idf weights
transformer = TfidfTransformer()
# The inner fit_transform builds the word-frequency matrix; the outer one computes tf-idf
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()    # sparse matrix; note the float32 dtype
print(X.shape)
print(X[:10])

#---------------------------------- Step 3: split the data -------------------------------
# Use train_test_split to split the X and labels lists
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=1)

#---------------------------------- Step 4: machine learning classification --------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")


The results are shown in the following figure:
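Step 4 of the list above also mentioned PCA visualization, which the script does not yet include. A minimal hedged sketch of that step, assuming X_test and pre from the code above:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Project the high-dimensional tf-idf test vectors to 2-D and
# color each point by its predicted class (0 = positive, 1 = negative)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_test)
colors = ['green' if int(p) == 0 else 'red' for p in pre]
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=colors, s=10)
plt.title("PCA projection of tf-idf features (predicted labels)")
plt.show()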

# V. Algorithm Performance Evaluation

Much of algorithm evaluation has to be implemented by our own programs, such as drawing ROC curves, computing various statistics, and displaying results to four decimal places. Here the author implements custom Precision, Recall, and F-measure calculations, with the following formulas:

$$Precision = \frac{\text{number of samples correctly predicted as the class}}{\text{total number of samples predicted as the class}}$$

$$Recall = \frac{\text{number of samples correctly predicted as the class}}{\text{total number of samples of the class in the test set}}$$

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

Since this article deals with a binary classification problem, the evaluation covers the two classes 0 and 1. The complete code is as follows:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])    # 0 = positive, 1 = negative
        contents.append(row['content'])
print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word-frequency matrix; element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)
# This class computes the tf-idf weights
transformer = TfidfTransformer()
# The inner fit_transform builds the word-frequency matrix; the outer one computes tf-idf
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words in the bag-of-words model
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the tf-idf matrix; element w[i][j] is the tf-idf weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()    # sparse matrix; note the float32 dtype
print(X.shape)
print(X[:10])

#---------------------------------- Step 3: split the data -------------------------------
# Use train_test_split to split the X and labels lists
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=1)

#---------------------------------- Step 4: machine learning classification --------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))

#---------------------------------- Step 5: evaluation -----------------------------------
def classification_pj(name, y_test, pre):
    print("Algorithm evaluation:", name)
    # Precision = correctly identified samples / all samples identified as the class
    # Recall    = correctly identified samples / all samples of the class in the test set
    # F-measure = 2 * Precision * Recall / (Precision + Recall)
    YC_B, YC_G = 0, 0    # predicted: bad / good
    ZQ_B, ZQ_G = 0, 0    # correctly predicted: bad / good
    CZ_B, CZ_G = 0, 0    # present in the test set: bad / good
    # 0 = good, 1 = bad; both classes are computed in case the label mapping changes
    i = 0
    while i < len(pre):
        z = int(y_test[i])    # true label
        y = int(pre[i])       # predicted label
        if z == 0:
            CZ_G += 1
        else:
            CZ_B += 1
        if y == 0:
            YC_G += 1
        else:
            YC_B += 1
        if z == y and z == 0 and y == 0:
            ZQ_G += 1
        elif z == y and z == 1 and y == 1:
            ZQ_B += 1
        i = i + 1
    print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)
    print("")
    # Output the results
    P_G = ZQ_G * 1.0 / YC_G
    P_B = ZQ_B * 1.0 / YC_B
    print("Precision Good 0:", P_G)
    print("Precision Bad 1:", P_B)
    R_G = ZQ_G * 1.0 / CZ_G
    R_B = ZQ_B * 1.0 / CZ_B
    print("Recall Good 0:", R_G)
    print("Recall Bad 1:", R_B)
    F_G = 2 * P_G * R_G / (P_G + R_G)
    F_B = 2 * P_B * R_B / (P_B + R_B)
    print("F-measure Good 0:", F_G)
    print("F-measure Bad 1:", F_B)

# Function call
classification_pj("LogisticRegression", y_test, pre)


The output is as follows. The six numbers printed by classification_pj are ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G; for example, the precision for class 1 is ZQ_B / YC_B = 213 / 229 ≈ 0.930, matching classification_report.

Logistic regression classification
1823 1823
              precision    recall  f1-score   support

           0       0.94      0.99      0.97      1520
           1       0.93      0.70      0.80       303

    accuracy                           0.94      1823
   macro avg       0.94      0.85      0.88      1823
weighted avg       0.94      0.94      0.94      1823

Algorithm evaluation: LogisticRegression
213 1504 229 1594 303 1520

Precision Good 0: 0.9435382685069009
Precision Bad 1: 0.9301310043668122
Recall Good 0: 0.9894736842105263
Recall Bad 1: 0.7029702970297029
F-measure Good 0: 0.9659601798330122
F-measure Bad 1: 0.800751879699248


# VI. Algorithm Comparison Experiments

## 1. RandomForest

The code is as follows:

# Random forest classifier; n_estimators is the number of trees in the forest
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)
print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
print("\n")
pre = clf.predict(X_test)
print('Prediction results:', pre[:10])
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("RandomForest", y_test, pre)
print("\n")

Output results:

## 2. SVM

The code is as follows:

# SVM classifier: LinearSVC support vector classification
SVM = svm.LinearSVC()
SVM.fit(X_train, y_train)
print('Model accuracy: {}'.format(SVM.score(X_test, y_test)))
pre = SVM.predict(X_test)
print("SVM classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("LinearSVC", y_test, pre)
print("\n")

Output results:

## 3. Naive Bayes

The code is as follows:

# Multinomial naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Model accuracy: {}'.format(nb.score(X_test, y_test)))
pre = nb.predict(X_test)
print("Naive Bayes classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MultinomialNB", y_test, pre)
print("\n")

Output results:

## 4. KNN

This algorithm's accuracy is not high and its execution time is long, so it is not recommended for text analysis, though it is acceptable for algorithm comparison in some cases. The core code is as follows:

# K-nearest-neighbor classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print('Model accuracy: {}'.format(knn.score(X_test, y_test)))
pre = knn.predict(X_test)
print("KNN classification")
print(classification_report(y_test, pre))
classification_pj("KNeighbors", y_test, pre)
print("\n")

Output results:

## 5. Decision Tree

The code is as follows:

# Decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print('Model accuracy: {}'.format(dtc.score(X_test, y_test)))
pre = dtc.predict(X_test)
print("Decision tree classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("DecisionTreeClassifier", y_test, pre)
print("\n")

Output results:

## 6. SGD

The code is as follows (the import path is updated to the public module, since sklearn.linear_model.stochastic_gradient is a deprecated private path):

# SGD classifier: a linear model trained with stochastic gradient descent
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
print('Model accuracy: {}'.format(sgd.score(X_test, y_test)))
pre = sgd.predict(X_test)
print("SGD classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("SGDClassifier", y_test, pre)
print("\n")

Output results:

## 7. MLP

This algorithm is slow. The core code is as follows (again using the public import path rather than the deprecated private module):

# MLP: multi-layer perceptron classifier
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
print('Model accuracy: {}'.format(mlp.score(X_test, y_test)))
pre = mlp.predict(X_test)
print("MLP classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MLPClassifier", y_test, pre)
print("\n")

Output results:

## 8. GradientBoosting

This algorithm is slow. The code is as follows:

# Gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
print('Model accuracy: {}'.format(gb.score(X_test, y_test)))
pre = gb.predict(X_test)
print("GradientBoosting classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("GradientBoostingClassifier", y_test, pre)
print("\n")

Output results:

## 9. AdaBoost

The code is as follows:

# AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
AdaBoost = AdaBoostClassifier()
AdaBoost.fit(X_train, y_train)
print('Model accuracy: {}'.format(AdaBoost.score(X_test, y_test)))
pre = AdaBoost.predict(X_test)
print("AdaBoost classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("AdaBoostClassifier", y_test, pre)
print("\n")

Output results:
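To make the whole comparison reproducible in one run, the per-model snippets above can also be driven by a single loop. A hedged sketch, assuming X_train, X_test, y_train, y_test and classification_pj from the earlier scripts (the model list simply mirrors the snippets above):

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn import svm, neighbors
from sklearn.metrics import classification_report

# Fit and evaluate each classifier compared in this section
models = [
    ("LogisticRegression", LogisticRegression(solver='liblinear')),
    ("RandomForest", RandomForestClassifier(n_estimators=20)),
    ("LinearSVC", svm.LinearSVC()),
    ("MultinomialNB", MultinomialNB()),
    ("KNeighbors", neighbors.KNeighborsClassifier(n_neighbors=7)),
    ("DecisionTree", DecisionTreeClassifier()),
    ("SGDClassifier", SGDClassifier()),
    ("MLPClassifier", MLPClassifier()),
    ("GradientBoosting", GradientBoostingClassifier()),
    ("AdaBoost", AdaBoostClassifier()),
]
for name, clf in models:
    clf.fit(X_train, y_train)
    pre = clf.predict(X_test)
    print(name, "accuracy:", clf.score(X_test, y_test))
    print(classification_report(y_test, pre))
    classification_pj(name, y_test, pre)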

# VII. Summary

This article ends here. In the next article I will take you through a deep-learning (BiLSTM-CNN) sentiment classification method. I hope it helps you, and readers are welcome to point out any inadequacies or mistakes. These experiments reflect common problems from my thesis research and project evaluations; I hope you bring your own questions, think deeply about your own needs, and apply what you have learned. Finally, if the article helped you, please like, comment, and bookmark; that is my biggest motivation to share.

In summary, this article used Sklearn to implement various machine-learning sentiment classification algorithms and compared them experimentally. As the figure shows, random forest, SVM, SGD, and MLP all perform reasonably well; of course, different datasets give different results, and you should experiment with your own data. The code can be downloaded from GitHub; remember to follow.

Finally, as a rookie in artificial intelligence, I hope to keep improving and going deeper, then apply these methods to image recognition, network security, adversarial samples, and other fields, and guide you in writing simple academic papers. Let's go! Thanks to the many bloggers I have met and grown with over these years; let's encourage each other.

Recently I participated in the big data security competition held by Qi An Xin and Tsinghua University and gained a lot, while also realizing how large the gap is. My work mainly analyzed the classification of HC and malicious-family websites, identifying more than 100,000 HC websites from roughly 2 million real ones, involving data crawling, malicious traffic detection, redirect-hijack detection, NLP, and big data. In the five final directions, teams from Tsinghua University, the Institute of Information Engineering of the Chinese Academy of Sciences, and Alibaba won prizes, alongside teams from Peking University, Zhejiang University, Shanghai Jiao Tong University, and others; impressive, and I really want to study their writeups. I cherish this opportunity for hands-on practice and hope to keep pushing so that one year I can break into the top three and win a prize. Although I am still green, I will share my big data analysis methods next and make progress with everyone. Unknown attacks, and how to prevent them: on the road of security I keep asking friends and experts for advice, and I hope to advance in both academic and practical directions. A gap is not terrible; what matters is that I tried and shared. Come on. Finally, thank you all for your guidance.

(By: Eastmount, written in Wuhan at 3 p.m. on Monday, 2020-08-17. http://blog.csdn.net/eastmount/)

On August 18, 2020, I opened a new column, "Na Zhang AI Security House", centered on Python big data analysis, cyberspace security, artificial intelligence, Web penetration, and attack-and-defense techniques, while also sharing implementations of algorithms from CCF, SCI, and Chinese core journal papers. The new column will be more systematic and will restructure all of the author's articles, explaining Python and security from scratch. I have been writing for nearly ten years and really want to share what I have learned, felt, and done. Please give me your advice; I sincerely invite you to follow! Thank you.

References:
[1] Yang Xiuzhang. 《Python Network Data Crawling and Analysis from Beginner to Proficient (Analysis Volume)》
[2] https://blog.csdn.net/WANG_hl/article/details/105234432
[3] https://blog.csdn.net/qq_27590277/article/details/106894245
[4] https://www.cnblogs.com/alivinfer/p/12892147.html
[5] https://blog.csdn.net/qq_28626909/article/details/80382029
[6] https://www.jianshu.com/p/3da3f5608a7c