[Python Artificial Intelligence] 23. Sentiment classification based on machine learning and TF-IDF (with detailed NLP data cleaning)

Eastmount 2020-11-13 00:01:10
python artificial intelligence sentiment classification


Starting with this column, the author studies Python deep learning, neural networks, and artificial intelligence. The previous article shared the process of sentiment analysis and classification with a custom sentiment dictionary (the Dalian University of Technology sentiment lexicon). This article explains the natural language processing workflow in detail and a sentiment classification algorithm based on machine learning and TF-IDF, and compares several classification algorithms (SVM, RF, LR, Boosting). It mainly draws on the author's book 《Python Network Data Crawling and Analysis from Beginner to Proficient (Analysis volume)》 and reviews the basic steps of Chinese text analysis in Python. This is a basic article, and I hope it is of some help to you.


This column mainly combines the author's previous blog posts, AI experience, and related videos and papers, and later it will cover more Python artificial intelligence cases and applications. These are basic articles, and I hope they help you; if there are mistakes or shortcomings, please forgive me and point them out. As a newcomer to artificial intelligence, I hope to grow with you through these blog posts. I have been blogging for many years, and this is my first attempt at a paid column, but most posts, especially the basic ones, will continue to be shared for free. I will still write the column carefully so that it is worth the readers' while; let's encourage each other!

TensorFlow code download: https://github.com/eastmountyxz/AI-for-TensorFlow
Keras code download: https://github.com/eastmountyxz/AI-for-Keras
Sentiment analysis code download: https://github.com/eastmountyxz/Sentiment-Analysis


At the same time, five other Python article series by the author are recommended. Since 2014, the author has mainly written three Python series: basic knowledge, web crawlers, and data analysis. In 2018, the Python image recognition and Python AI columns were added.


Previous articles:
[Python Artificial Intelligence] 1. Setting up the TensorFlow 2.0 environment and an introduction to neural networks
[Python Artificial Intelligence] 2. TensorFlow basics and a univariate linear prediction case
[Python Artificial Intelligence] 3. TensorFlow basics: Session, variables, input values, and activation functions
[Python Artificial Intelligence] 4. Creating recurrent neural networks in TensorFlow and the Optimizer
[Python Artificial Intelligence] 5. Basic usage of Tensorboard for visualization and drawing the whole neural network
[Python Artificial Intelligence] 6. Classification learning in TensorFlow and an MNIST handwriting recognition case
[Python Artificial Intelligence] 7. What overfitting is and using dropout to solve overfitting in neural networks
[Python Artificial Intelligence] 8. A detailed explanation of convolutional neural network (CNN) principles and writing a CNN in TensorFlow
[Python Artificial Intelligence] 9. Installing gensim word vectors (Word2Vec) and Chinese short-text similarity calculation on 《Qing Yu Nian》
[Python Artificial Intelligence] 10. A custom CNN image classification case with TensorFlow + OpenCV, compared with the machine learning KNN image classification algorithm
[Python Artificial Intelligence] 11. How TensorFlow saves neural network parameters
[Python Artificial Intelligence] 12. A detailed explanation of recurrent neural network (RNN) and LSTM principles and writing an RNN classification case in TensorFlow
[Python Artificial Intelligence] 13. How to evaluate neural networks, plot loss curves, and compute the F value for an image classification case
[Python Artificial Intelligence] 14. A recurrent neural network (LSTM RNN) regression case: sin curve prediction
[Python Artificial Intelligence] 15. Unsupervised learning: Autoencoder principles and a clustering visualization case
[Python Artificial Intelligence] 16. Keras environment setup, introduction, and a regression neural network case
[Python Artificial Intelligence] 17. Building a classification neural network in Keras and an MNIST digit image case
[Python Artificial Intelligence] 18. Building convolutional neural networks in Keras and CNN principles
[Python Artificial Intelligence] 19. A recurrent neural network classification case in Keras and RNN principles
[Python Artificial Intelligence] 20. Keras+RNN text classification versus text classification based on traditional machine learning
[Python Artificial Intelligence] 21. Word2Vec+CNN Chinese text classification and comparison with machine learning classifiers (RF/DTC/SVM/KNN/NB/LR)
[Python Artificial Intelligence] 22. Sentiment analysis and computation based on the Dalian University of Technology sentiment lexicon
《The Wave of Artificial Intelligence》 journal article: What is artificial intelligence? (Part 1)



In data analysis and data mining, the usual steps are preparation, data crawling, data preprocessing, data analysis, data visualization, and evaluation. The work before the analysis itself takes up almost half of a data engineer's time, and data preprocessing directly affects the quality of the subsequent models. The figure below shows the basic steps of data preprocessing, including Chinese word segmentation, part-of-speech tagging, data cleaning, feature extraction (vector space model storage), and weight calculation (TF-IDF).

(Figure: basic steps of text data preprocessing)


I. Chinese Word Segmentation

After readers crawl a Chinese dataset with Python, the first step is to segment the Chinese text. English words are separated by spaces, so phrases can be split directly on spaces and no segmentation is needed. Chinese characters, however, are written continuously, carry meaning jointly, and have no obvious separators between words, so Chinese word segmentation technology is needed to cut the sentences in the corpus into word sequences separated by spaces. Below is a detailed introduction to Chinese word segmentation and the Jieba segmentation tool.

Chinese word segmentation (Chinese Word Segmentation) refers to cutting a sequence of Chinese characters into individual words or word strings; it inserts separators, usually spaces, into a Chinese string that has no word boundaries. Here is a simple example: segmenting the sentence 我是程序员 ("I am a programmer").

 Input:    我是程序员 ("I am a programmer")
 Output 1: 我 \ 是 \ 程 \ 序 \ 员          (split into single characters)
 Output 2: 我是 \ 是程 \ 程序 \ 序员        (split into overlapping two-character strings)
 Output 3: 我 \ 是 \ 程序员                (split into words)

Here is a quick example: the code imports the Jieba package and then calls its functions to segment the Chinese text.

#encoding=utf-8
import jieba

text = "北京理工大学生前来应聘"   # roughly: "A Beijing Institute of Technology student comes to apply for a job"

data = jieba.cut(text, cut_all=True)       # full mode
print("[Full mode]: ", " ".join(data))

data = jieba.cut(text, cut_all=False)      # precise mode
print("[Precise mode]: ", " ".join(data))

data = jieba.cut(text)                     # precise mode is the default
print("[Default mode]: ", " ".join(data))

data = jieba.cut_for_search(text)          # search engine mode
print("[Search engine mode]: ", " ".join(data))

The output of the above code is shown below, including the results of full mode, precise mode, default mode, and search engine mode.

(Figure: Jieba segmentation output in the four modes)



II. Data Cleaning

When analyzing a corpus, there is usually dirty data or noisy phrases that interfere with the experimental results, so data cleaning (Data Cleaning) is needed after word segmentation. For example, after segmenting with the Jieba tool, the corpus may still contain dirty data or stop words such as "we", "of", and other function words. These words lower the quality of the data, and to obtain better analysis results the dataset needs to be cleaned or filtered with a stop-word list. Typical issues include:

  • Incomplete data
  • Duplicate data
  • Wrong data
  • Stop words

Here we mainly explain stop-word filtering, which deletes words that appear frequently but do not affect the meaning of the text. During Jieba word segmentation, a stop-word dictionary stop_words.txt is loaded, and any segmented word that appears in it is filtered out.


Below we take reviews of the Huangguoshu Waterfall scenic area crawled from Dianping, Meituan, and other websites and segment them with the Jieba tool.

  • Positive reviews: 5,000
  • Negative reviews: 1,000


Complete code :

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse

# Load the custom user dictionary and the stop-word list
jieba.load_userdict("user_dict.txt")
stop_list = pd.read_csv('stop_words.txt',
                        engine='python',
                        encoding='utf-8',
                        delimiter="\n",
                        names=['t'])['t'].tolist()

# Chinese word segmentation with stop-word filtering
def txt_cut(juzi):
    return [w for w in jieba.lcut(juzi) if w not in stop_list]

# Open the output file for the segmentation results
fw = open('fenci_data.csv', "a+", newline='', encoding='gb18030')
writer = csv.writer(fw)
writer.writerow(['content', 'label'])

# Read the raw reviews with csv.DictReader
labels = []
contents = []
file = "data.csv"
with open(file, "r", encoding="UTF-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Label encoding: 0 = positive review, 1 = negative review
        if row['label'] == '好评':   # '好评' is the positive-review label in the raw data
            res = 0
        else:
            res = 1
        labels.append(res)
        # Segment the review text and join the words with spaces
        content = row['content']
        seglist = txt_cut(content)
        output = ' '.join(list(seglist))
        contents.append(output)
        # Write one row: segmented content and its label
        tlist = []
        tlist.append(output)
        tlist.append(res)
        writer.writerow(tlist)

print(labels[:5])
print(contents[:5])
fw.close()

The results are shown in the figure below. On the one hand, special punctuation and stop words are filtered out; on the other hand, the user_dict.txt dictionary is imported so that proper nouns such as "Huangguoshu Waterfall" and "scenic area" are segmented as whole words, otherwise they might be split into "Huangguoshu" and "Waterfall", or "scenic" and "area".

(Figure: segmented and cleaned reviews written to fenci_data.csv)

  • Before data cleaning

I remember that when I was a child, I always waited in front of the TV for 《Journey to the West》 to be broadcast. "You carry the load, I lead the horse, crossing mountains and wading through rivers..." When the familiar song rings out again, the water in the lyrics is the water of Guizhou; to be exact, it is Huangguoshu Waterfall in Guizhou. That curtain of waterfall flowed into our childhood and still lingers in our memory. Huangguoshu Waterfall is not just one waterfall but a large scenic area, including Doupotang Waterfall, the Tianxingqiao scenic spot, and Huangguoshu Waterfall itself, the most famous of them.

  • After data cleaning

Remember When I was a child keep The TV front Wait Journey to the west Broadcast pick Dan Pull Horse Climb the mountain Wade in the water Two shoulders Double slide be familiar with song round the ear sound when The lyrics in water guizhou water accuracy say guizhou Huangguoshu Waterfall That curtain The waterfall Influx childhood linger Huangguoshu Waterfall The waterfall The scenic spot Include steep slope Pond The waterfall Star Bridge The scenic spot Huangguoshu The waterfall Huangguoshu The waterfall famous



III. Feature Extraction and TF-IDF Calculation

1. Basic concepts

Weight calculation measures the importance of feature terms in the document representation by assigning each feature word a weight. TF-IDF (Term Frequency–Inverse Document Frequency) is a classical weighting technique widely used in data analysis and information processing. It computes the importance of a feature word in the whole corpus from the word's frequency within a text and the frequency of documents containing it across the corpus. Its advantage is that it filters out common but unimportant words while keeping as many influential feature words as possible.

The TF-IDF value is the product of the term frequency TF and the inverse document frequency IDF:

$$ tfidf_{i,j} = tf_{i,j} \times idf_{i} $$

The TF-IDF weight is proportional to how often the feature word appears in the document and inversely proportional to the number of documents in the corpus that contain it; the higher the TF-IDF value, the more important the feature word is to the text.

The term frequency TF is computed as follows, where n_{i,j} is the number of times the feature word t_i appears in text D_j and the denominator is the total number of feature words in D_j; the result is the frequency of the feature word within that text.

$$ tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} $$

The inverse document frequency (Inverse Document Frequency, IDF) was proposed by Karen Spärck Jones in 1972 and is a classical method for weighting words and documents. It is computed as follows, where |D| is the total number of texts in the corpus and |D_{t_i}| is the number of texts that contain the feature word t_i.

$$ idf_{i} = \log\frac{|D|}{|D_{t_i}|} $$

In the inverse document frequency, the weight varies inversely with the number of documents that contain the feature word. Common words such as "we", "however", and "of" appear in almost every document, so their IDF values are very low; for a word that appears in every document, log 1 = 0 and its weight vanishes, which suppresses these common words. Conversely, a word such as "artificial intelligence" that appears many times in only this document is highly discriminative.

The core idea of TF-IDF is that if a feature word has a high term frequency TF in one article but rarely appears in other articles, it has good discriminating power and is suitable for weighting. TF-IDF is simple and fast, its results match intuition, and it is a common technique in text mining, sentiment analysis, topic modeling, and related fields.
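As a quick worked example (the numbers are hypothetical, purely to illustrate the formulas above): suppose a review contains 100 words, the word "waterfall" appears in it 5 times, the corpus has 1,000 reviews, and 10 of them contain "waterfall". Taking the natural logarithm:

$$ tf = \frac{5}{100} = 0.05, \qquad idf = \log\frac{1000}{10} = \log 100 \approx 4.61, \qquad tfidf \approx 0.05 \times 4.61 \approx 0.23 $$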


2. Code implementation

Scikit-Learn provides the two classes that are mainly used here, CountVectorizer and TfidfTransformer, which compute word frequencies and TF-IDF values respectively.

  • CountVectorizer
    This class converts the words in a corpus of texts into a word frequency matrix. For example, the text "I am a teacher" contains four words, "I", "am", "a", and "teacher", each appearing once. CountVectorizer generates a matrix a[M][N] for a corpus of M texts and N distinct words, where a[i][j] is the frequency of word j in text i. Calling fit_transform() computes how many times each word appears, and get_feature_names() returns all the keywords in the vocabulary.

  • TfidfTransformer
    After CountVectorizer produces the word frequency matrix, the TfidfTransformer class computes the TF-IDF value of each word in the vectorizer variable. The TF-IDF values are stored as a matrix: each row represents one text of the corpus and each column the weight of one feature. Once the TF-IDF matrix is obtained, it can be fed into various data analysis algorithms, such as cluster analysis, LDA topic distribution, or public opinion analysis.
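Before the complete code on the real review corpus, here is a minimal, self-contained sketch of the two classes on a toy English corpus (the corpus and variable names are illustrative, not from the original post):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["I am a teacher", "I am a student", "the teacher teaches the student"]

vectorizer = CountVectorizer()              # build the word frequency matrix
counts = vectorizer.fit_transform(corpus)   # sparse matrix of shape (n_texts, n_words)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)   # TF-IDF weights, same shape as counts

# Note: the default tokenizer drops single-character tokens such as "I" and "a"
print(vectorizer.get_feature_names_out())   # vocabulary (get_feature_names() on older versions)
print(counts.toarray())                     # raw word counts per document
print(tfidf.toarray())                      # TF-IDF weight of each word in each document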

Complete code :

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0 = positive review, 1 = negative review
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word frequency matrix: element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer()

# This class computes the TF-IDF weights
transformer = TfidfTransformer()

# The inner fit_transform builds the word frequency matrix, the outer one computes the TF-IDF values
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all the words of the bag-of-words model
# (on scikit-learn >= 1.2 use vectorizer.get_feature_names_out() instead)
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix: element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()  # go through a sparse matrix and use float32 to save memory
print(X.shape)
print(X[:10])

The output is as follows :

<class 'scipy.sparse.csr.csr_matrix'>
aaaaa
achievements
amazing
ananananan
ancient
anshun
aperture
app
Number of words : 20254
(6074, 20254)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

3. MemoryError (out-of-memory) errors

When the dataset is large, a dense matrix often cannot hold that much data and the following errors occur:

  • ValueError: array is too big; arr.size * arr.dtype.itemsize is larger than the maximum possible size.
  • MemoryError: Unable to allocate array with shape (26771, 69602) and data type float64

The solutions I offer are as follows (a short sketch of the vectorizer settings follows this list):

  • Filter stop words to remove unnecessary feature words
  • Use the sparse matrix support in the scipy package, converting tfidf with coo_matrix(tfidf, dtype=np.float32)
  • Add the min_df parameter, e.g. CountVectorizer(min_df=5), to filter out feature words that appear rarely; this parameter can be tuned.
    max_df removes terms that appear too frequently, the so-called corpus-specific stop words; the default max_df=1.0 ignores only terms that appear in 100% of the documents. min_df removes infrequent terms; min_df=5 ignores terms that appear in fewer than 5 documents.
  • Use a GPU server or expand the machine's memory
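A minimal sketch of those vectorizer settings (the parameter values are illustrative and should be tuned on your own corpus; contents is the list of segmented texts from the code above):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Drop words that appear in fewer than 5 documents or in more than 80% of documents;
# this shrinks the vocabulary and therefore the TF-IDF matrix
vectorizer = CountVectorizer(min_df=5, max_df=0.8)
counts = vectorizer.fit_transform(contents)

# Keep the result as a scipy sparse matrix instead of calling .toarray()
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape, tfidf.dtype)

# Many sklearn estimators (LogisticRegression, LinearSVC, MultinomialNB, ...) accept
# the sparse matrix directly, which avoids the MemoryError of a dense array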


IV. Sentiment Classification Based on Logistic Regression

Having obtained the TF-IDF values of the texts, this section briefly explains how to use them for sentiment classification. The process mainly includes the following steps (a sketch of the PCA visualization step, which the complete code below omits, appears after the logistic regression results):

  • Generate the word frequency matrix after Chinese word segmentation and data cleaning, mainly by calling the CountVectorizer class; the resulting matrix is X.
  • Call the TfidfTransformer class to compute the TF-IDF values of the word frequency matrix X and obtain the weight matrix.
  • Call a classifier from the Sklearn machine learning package, train it with fit(), and assign the predicted class labels to the pre array.
  • Call Sklearn's PCA() to reduce the features to two dimensions, corresponding to the X and Y axes, and then visualize them.
  • Optimize and evaluate the algorithm.

Complete logistic regression code:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0 = positive review, 1 = negative review
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word frequency matrix: element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)

# This class computes the TF-IDF weights
transformer = TfidfTransformer()

# The inner fit_transform builds the word frequency matrix, the outer one computes the TF-IDF values
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all the words of the bag-of-words model
# (on scikit-learn >= 1.2 use vectorizer.get_feature_names_out() instead)
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix: element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()  # go through a sparse matrix and use float32 to save memory
print(X.shape)
print(X[:10])

#---------------------------------- Step 3: data split -----------------------------------
# Use train_test_split to split X and the label list into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=1)

#-------------------------------- Step 4: machine learning classification ----------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
print("\n")

The results are shown in the figure below.

(Figure: logistic regression accuracy and classification report)
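As noted in the step list above, the complete code omits step 4 (PCA dimensionality reduction and visualization). Here is a minimal sketch of that step; it reuses X_test and pre from the code above, and the plotting details (colors, labels) are my own assumptions rather than the original implementation:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce the TF-IDF features of the test set to two dimensions
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_test)

# Color each test sample by its predicted label ('0' = positive, '1' = negative)
colors = ['steelblue' if int(p) == 0 else 'tomato' for p in pre]
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=colors, s=10)
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.title('Test samples after PCA, colored by predicted sentiment')
plt.show()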



V. Algorithm Performance Evaluation

Much of algorithm evaluation has to be implemented in our own code, such as drawing ROC curves, computing statistics for each class, or reporting results to four decimal places. Here the author implements precision (Precision), recall (Recall), and the F value (F-measure) by hand; their formulas are as follows:

$$ Precision = \frac{\text{number of samples correctly predicted as the class}}{\text{total number of samples predicted as the class}} $$

$$ Recall = \frac{\text{number of samples correctly predicted as the class}}{\text{total number of samples of the class in the test set}} $$

$$ F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} $$

Since this article deals with a binary classification problem, the evaluation distinguishes the two classes 0 and 1. The complete code is as follows:

# -*- coding:utf-8 -*-
import csv
import pandas as pd
import numpy as np
import jieba
import jieba.analyse
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB

#---------------------------------- Step 1: read the file --------------------------------
with open('fenci_data.csv', 'r', encoding='UTF-8') as f:
    reader = csv.DictReader(f)
    labels = []
    contents = []
    for row in reader:
        labels.append(row['label'])      # 0 = positive review, 1 = negative review
        contents.append(row['content'])

print(labels[:5])
print(contents[:5])

#---------------------------------- Step 2: data preprocessing ---------------------------
# Convert the texts into a word frequency matrix: element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(min_df=5)

# This class computes the TF-IDF weights
transformer = TfidfTransformer()

# The inner fit_transform builds the word frequency matrix, the outer one computes the TF-IDF values
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all the words of the bag-of-words model
# (on scikit-learn >= 1.2 use vectorizer.get_feature_names_out() instead)
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("Number of words:", len(word))

# Extract the TF-IDF matrix: element w[i][j] is the TF-IDF weight of word j in text i
#X = tfidf.toarray()
X = coo_matrix(tfidf, dtype=np.float32).toarray()  # go through a sparse matrix and use float32 to save memory
print(X.shape)
print(X[:10])

#---------------------------------- Step 3: data split -----------------------------------
# Use train_test_split to split X and the label list into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    labels,
                                                    test_size=0.3,
                                                    random_state=1)

#-------------------------------- Step 4: machine learning classification ----------------
# Logistic regression classifier
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)
print('Model accuracy: {}'.format(LR.score(X_test, y_test)))
pre = LR.predict(X_test)
print("Logistic regression classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))

#---------------------------------- Step 5: evaluation -----------------------------------
def classification_pj(name, y_test, pre):
    print("Algorithm evaluation:", name)

    # Precision = number of samples correctly predicted as the class / number of samples predicted as the class
    # Recall    = number of samples correctly predicted as the class / number of samples of the class in the test set
    # F-measure = 2 * Precision * Recall / (Precision + Recall)

    YC_B, YC_G = 0, 0    # predicted as bad / good
    ZQ_B, ZQ_G = 0, 0    # correctly predicted as bad / good
    CZ_B, CZ_G = 0, 0    # actually bad / good in the test set

    # 0 = good (positive), 1 = bad (negative); both classes are counted in the same pass
    i = 0
    while i < len(pre):
        z = int(y_test[i])   # true label
        y = int(pre[i])      # predicted label
        if z == 0:
            CZ_G += 1
        else:
            CZ_B += 1
        if y == 0:
            YC_G += 1
        else:
            YC_B += 1
        if z == y and z == 0 and y == 0:
            ZQ_G += 1
        elif z == y and z == 1 and y == 1:
            ZQ_B += 1
        i = i + 1
    print(ZQ_B, ZQ_G, YC_B, YC_G, CZ_B, CZ_G)
    print("")

    # Per-class results
    P_G = ZQ_G * 1.0 / YC_G
    P_B = ZQ_B * 1.0 / YC_B
    print("Precision Good 0:", P_G)
    print("Precision Bad 1:", P_B)
    R_G = ZQ_G * 1.0 / CZ_G
    R_B = ZQ_B * 1.0 / CZ_B
    print("Recall Good 0:", R_G)
    print("Recall Bad 1:", R_B)
    F_G = 2 * P_G * R_G / (P_G + R_G)
    F_B = 2 * P_B * R_B / (P_B + R_B)
    print("F-measure Good 0:", F_G)
    print("F-measure Bad 1:", F_B)

# Function call
classification_pj("LogisticRegression", y_test, pre)

The output is as follows :

Logistic regression classification
1823 1823
              precision    recall  f1-score   support
           0       0.94      0.99      0.97      1520
           1       0.93      0.70      0.80       303
    accuracy                           0.94      1823
   macro avg       0.94      0.85      0.88      1823
weighted avg       0.94      0.94      0.94      1823
Algorithm evaluation: LogisticRegression
213 1504 229 1594 303 1520
Precision Good 0: 0.9435382685069009
Precision Bad 1: 0.9301310043668122
Recall Good 0: 0.9894736842105263
Recall Bad 1: 0.7029702970297029
F-measure Good 0: 0.9659601798330122
F-measure Bad 1: 0.800751879699248
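As a quick cross-check (not from the original post), scikit-learn's own metric function should reproduce the per-class values computed by classification_pj above; y_test and pre are the variables from the code above, and the class labels are assumed to be the strings '0' and '1' as stored in fenci_data.csv:

from sklearn.metrics import precision_recall_fscore_support

# Per-class precision, recall, F1, and support for labels '0' (positive) and '1' (negative)
p, r, f, support = precision_recall_fscore_support(y_test, pre, labels=['0', '1'])
print("Precision:", p)
print("Recall:   ", r)
print("F-measure:", f)
print("Support:  ", support)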


VI. Algorithm Comparison Experiments

1.RandomForest

The code is as follows :

# Random forest classifier; n_estimators is the number of trees in the forest
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)
print('Model accuracy: {}'.format(clf.score(X_test, y_test)))
print("\n")
pre = clf.predict(X_test)
print('Predicted results:', pre[:10])
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("RandomForest", y_test, pre)
print("\n")

Output results:

(Figure: RandomForest classification report and custom evaluation metrics)


2.SVM

The code is as follows :

# SVM classifier: LinearSVC (linear support vector machine)
SVM = svm.LinearSVC()
SVM.fit(X_train, y_train)
print('Model accuracy: {}'.format(SVM.score(X_test, y_test)))
pre = SVM.predict(X_test)
print("Support vector machine classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("LinearSVC", y_test, pre)
print("\n")

Output results:

(Figure: LinearSVC classification report and custom evaluation metrics)


3. Naive Bayes

The code is as follows :

# Multinomial naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Model accuracy: {}'.format(nb.score(X_test, y_test)))
pre = nb.predict(X_test)
print("Naive Bayes classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MultinomialNB", y_test, pre)
print("\n")

Output results:

(Figure: MultinomialNB classification report and custom evaluation metrics)


4.KNN

The accuracy of this algorithm is low and its running time is long, so it is not recommended for text analysis, but it is still useful in some cases for algorithm comparison. The core code is as follows:

# K-nearest-neighbor classifier
knn = neighbors.KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
print('Model accuracy: {}'.format(knn.score(X_test, y_test)))
pre = knn.predict(X_test)
print("K-nearest-neighbor classification")
print(classification_report(y_test, pre))
classification_pj("KNeighbors", y_test, pre)
print("\n")

Output results:

(Figure: KNeighbors classification report and custom evaluation metrics)


5. Decision tree

The code is as follows :

# Decision tree classifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print('Model accuracy: {}'.format(dtc.score(X_test, y_test)))
pre = dtc.predict(X_test)
print("Decision tree classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("DecisionTreeClassifier", y_test, pre)
print("\n")

Output results:

(Figure: DecisionTreeClassifier classification report and custom evaluation metrics)


6.SGD

The code is as follows :

# SGD classifier (linear model trained with stochastic gradient descent)
from sklearn.linear_model import SGDClassifier   # the old sklearn.linear_model.stochastic_gradient path is deprecated
sgd = SGDClassifier()
sgd.fit(X_train, y_train)
print('Model accuracy: {}'.format(sgd.score(X_test, y_test)))
pre = sgd.predict(X_test)
print("SGD classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("SGDClassifier", y_test, pre)
print("\n")

Output results:

(Figure: SGDClassifier classification report and custom evaluation metrics)


7.MLP

This algorithm is relatively slow; the core code is as follows:

# MLP classifier (multi-layer perceptron)
from sklearn.neural_network import MLPClassifier   # the old sklearn.neural_network.multilayer_perceptron path is deprecated
mlp = MLPClassifier()
mlp.fit(X_train, y_train)
print('Model accuracy: {}'.format(mlp.score(X_test, y_test)))
pre = mlp.predict(X_test)
print("MLP classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("MLPClassifier", y_test, pre)
print("\n")

Output results:

(Figure: MLPClassifier classification report and custom evaluation metrics)


8.GradientBoosting

This algorithm is relatively slow; the code is as follows:

# GradientBoosting classifier
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
print('Model accuracy: {}'.format(gb.score(X_test, y_test)))
pre = gb.predict(X_test)
print("GradientBoosting classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("GradientBoostingClassifier", y_test, pre)
print("\n")

Output results:

(Figure: GradientBoostingClassifier classification report and custom evaluation metrics)


9.AdaBoost

The code is as follows :

# AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
AdaBoost = AdaBoostClassifier()
AdaBoost.fit(X_train, y_train)
print('Model accuracy: {}'.format(AdaBoost.score(X_test, y_test)))
pre = AdaBoost.predict(X_test)
print("AdaBoost classification")
print(len(pre), len(y_test))
print(classification_report(y_test, pre))
classification_pj("AdaBoostClassifier", y_test, pre)
print("\n")

Output results:

(Figure: AdaBoostClassifier classification report and custom evaluation metrics)



VII. Summary

This is the end of the article; in the next one I will walk you through a deep learning (BiLSTM-CNN) sentiment classification method. I hope it helps you, and readers are welcome to point out any shortcomings or mistakes. These experiments come from common problems in my own research and project evaluations; I hope you take these questions, think carefully about your own needs, and apply what you have learned. Finally, if the article helped you, please like, comment, and bookmark it; that is my biggest motivation to keep sharing.

In short, this article uses Sklearn to implement various machine learning sentiment classification algorithms and compares them experimentally, as shown in the figure below. Random forest, SVM, SGD, and MLP all perform reasonably well, but of course different datasets give different results, so you need to tune against your own data. The code can be downloaded from the author's GitHub; remember to follow it.

(Figure: comparison of evaluation metrics across the classifiers)

Finally, as a newcomer to artificial intelligence, I hope to keep improving and digging deeper, and then apply these methods to image recognition, network security, adversarial examples, and other fields, and guide you in writing simple academic papers. Let's keep going! Thanks to the many bloggers I have met and grown with over the years; let's encourage each other.



Recently I took part in the big data security competition held by Qianxin and Tsinghua University. I gained a lot and also became aware of the chasm-like gap between me and the top teams. My main task was classifying HC and malicious-family websites: from roughly 2 million real websites I identified more than 100,000 HC websites, which involved data crawling, malicious traffic detection, redirect-hijack detection, NLP, and big data. In the five final tracks the prizes went to teams from Tsinghua University, the Institute of Information Engineering of the Chinese Academy of Sciences, and Alibaba, with teams from Peking University, Zhejiang University, Jiao Tong University, and others also competing. They are impressive, and I really want to study their writeups. I cherish this hands-on opportunity and hope to keep pushing so that one year I can break into the top three and win a prize. Although I am still a beginner, I will share my big data analysis methods next and improve together with everyone. Facing unknown attacks and how to prevent them, I ask friends and experts on the security road for advice, and I hope to make progress in both the academic and the practical directions. A gap is not terrible; what matters is that I tried and shared. Keep going. Finally, thanks to my mentors for their guidance, haha~

(By: Eastmount, written in Wuhan at 3 p.m. on Monday, 2020-08-17. http://blog.csdn.net/eastmount/ )



On August 18, 2020, I opened a new public account, "Na Zhang AI Safe House", centered on Python big data analysis, cyberspace security, artificial intelligence, Web penetration, and attack-and-defense techniques, and also sharing algorithm implementations from CCF, SCI, and Chinese core journal papers. Na Zhang's House will be more systematic and will restructure all of the author's articles, explaining Python and security from scratch. I have been writing articles for nearly ten years, and I really want to share what I have learned, felt, and done. Please give me your advice, and I sincerely invite you to follow it! Thank you.

References:
[1] Yang Xiuzhang. 《Python Network Data Crawling and Analysis from Beginner to Proficient (Analysis volume)》
[2] https://blog.csdn.net/WANG_hl/article/details/105234432
[3] https://blog.csdn.net/qq_27590277/article/details/106894245
[4] https://www.cnblogs.com/alivinfer/p/12892147.html
[5] https://blog.csdn.net/qq_28626909/article/details/80382029
[6] https://www.jianshu.com/p/3da3f5608a7c

Copyright notice
This article was written by [Eastmount]; please include a link to the original when reposting. Thank you.
