Using scikit-learn to Calculate TF-IDF in Python

Cai junshuai 2021-09-15 08:14:06



1 Scikit-learn Download and Installation

1.1 Brief Introduction

Scikit-learn is a simple and efficient tool for data mining and data analysis. It is a Python machine learning module, released under the BSD license and free to use.

The basic functionality of scikit-learn falls into six parts: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

Scikit-learn offers a very rich set of machine learning models, including SVM, decision trees, GBDT, KNN, and so on, so an appropriate model can be chosen according to the type of problem. See the official website for details; it is recommended to get the package, modules, and documentation there.

1.2 Installation

pip install scikit-learn

Then import it with "from sklearn import feature_extraction".
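
To quickly verify the installation, you can print the installed version:

import sklearn
print(sklearn.__version__)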

2 TF-IDF Basics

2.1 The TF-IDF Concept

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method that measures the importance of a word to a corpus based on how often the word appears in a text and how often it appears in documents across the whole corpus. Its advantage is that it filters out common but unimportant words while keeping the important words that characterize the text.

  • TF (Term Frequency) indicates how often a keyword appears in the article.
  • IDF (Inverse Document Frequency) measures how rare a keyword is. Document frequency is the number of documents in the whole corpus that contain the keyword; the inverse document frequency is (the logarithm of) its reciprocal. It is mainly used to reduce the influence of words that are common in all documents but say little about any particular one.

Computation: tf-idf is calculated by multiplying a local component (the term frequency) by a global component (the inverse document frequency), and the resulting document vectors are normalized to unit length. The non-normalized weight of a term in a document is:

tf-idf(t, d) = tf(t, d) × idf(t)

Step by step:

(1) Calculate the term frequency.

TF = (number of times the word appears in the article) / (total number of words in the article)

(2) Calculate the inverse document frequency.

IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))
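
As a minimal plain-Python sketch of these two formulas (the helper names tf, idf, and tf_idf are just illustrative, not part of any library):

import math

def tf(word, document):
    # term frequency: occurrences of the word / total number of words
    words = document.split()
    return words.count(word) / len(words)

def idf(word, documents):
    # inverse document frequency: log(total documents / (documents containing the word + 1))
    containing = sum(1 for doc in documents if word in doc.split())
    return math.log(len(documents) / (containing + 1))

# tf-idf is the product of the two components
def tf_idf(word, document, documents):
    return tf(word, document) * idf(word, documents)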

2.2 A Worked Example of the Calculation

Let's start with an example. Suppose we have a long article, "Bee Farming in China", and we want the computer to extract its keywords.

An easy approach that comes to mind is to find the most frequent words. If a word is important, it should appear many times in the article. So we start with "term frequency" (TF) statistics.

(Note: the most frequent words turn out to be "的", "是", "在" (roughly "of", "is", "in"), the most common Chinese function words. They are called "stop words": words that are useless for finding the result and must be filtered out.)

Suppose we filter them out and consider only the remaining meaningful words. There is another problem: we may find that the three words "China", "bees", and "farming" appear equally often. Does that mean that, as keywords, they are equally important?

Obviously not. "China" is a very common word, while "bees" and "farming" are comparatively rare. If these three words appear the same number of times in an article, there is reason to think that "bees" and "farming" are more important than "China"; in other words, in the keyword ranking, "bees" and "farming" should come before "China".

Therefore, we need an importance adjustment factor that measures whether a word is common. If a word is rare but appears many times in this article, it probably reflects what the article is about; that is exactly what we need.

In statistical terms, on top of the term frequency, we assign each word an "importance" weight. The most common words ("的", "是", "在") get the smallest weight; fairly common words ("China") get a smaller weight; and rare words ("bees", "farming") get a larger weight. This weight is called the "inverse document frequency" (Inverse Document Frequency, abbreviated IDF), and its size is inversely related to how common a word is.

Once we have the term frequency (TF) and the inverse document frequency (IDF), we multiply the two values to get the word's TF-IDF value. The more important a word is to an article, the larger its TF-IDF value. Therefore, the top-ranked words are the keywords of the article.

Here are the details of the algorithm.

  1. Step one: calculate the term frequency.

     TF = (number of times the word appears in the document) / (total number of words in the document)

  2. Step two: calculate the inverse document frequency.

     IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))

  3. Step three: calculate TF-IDF.

     TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to the number of times a word appears in a document and inversely proportional to the number of documents in the corpus that contain it. The automatic keyword-extraction algorithm is therefore very simple: compute the TF-IDF value of every word in the document, sort the values in descending order, and take the top few words.

Returning to "Bee Farming in China": suppose the article has 1,000 words, and "China", "bees", and "farming" each appear 20 times, so the term frequency (TF) of all three words is 0.02. Now, searching Google, suppose we find that 25 billion web pages contain "的", and take that as the total number of Chinese-language pages. 6.23 billion pages contain "China", 48.4 million pages contain "bees", and 97.3 million pages contain "farming". Their inverse document frequencies (IDF) and TF-IDF values are then as follows:

Word      IDF (log base 10)   TF-IDF
China     0.603               0.0121
bees      2.713               0.0543
farming   2.410               0.0482

As the table shows, "bees" has the highest TF-IDF value, "farming" comes second, and "China" the lowest. (If you also computed the TF-IDF of "的", it would be an extremely small value close to 0.) So if we had to choose a single keyword, "bees" would be the keyword of this article.
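
The numbers in the table above are easy to verify with a few lines of Python (using a base-10 logarithm, which is what these values correspond to):

import math

total_pages = 250e8  # pages containing "的", taken as the total number of pages
page_counts = {"China": 62.3e8, "bees": 0.484e8, "farming": 0.973e8}
tf = 0.02  # each word appears 20 times in a 1000-word article

for word, n in page_counts.items():
    idf = math.log10(total_pages / n)  # the +1 in the denominator is negligible here
    print(word, round(idf, 3), round(tf * idf, 4))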

3 Calculating TF-IDF with Scikit-Learn

TF-IDF weights in scikit-learn are mainly computed with two classes: CountVectorizer and TfidfTransformer.

3.1 CountVectorizer

The CountVectorizer class converts the words of a text collection into a term-frequency matrix.

For example, the matrix element a[i][j] is the frequency of word j in text i.

Its fit_transform method counts how many times each word appears, get_feature_names() returns all the keywords in the vocabulary, and toarray() shows the term-frequency matrix as a dense array.

# coding:utf-8
from sklearn.feature_extraction.text import CountVectorizer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the words in the texts into a term-frequency matrix
vectorizer = CountVectorizer()
# Count how many times each word appears
X = vectorizer.fit_transform(corpus)
# Get all the keywords in the vocabulary
# (in scikit-learn >= 1.2 this method is replaced by get_feature_names_out())
word = vectorizer.get_feature_names()
print(word)
# Inspect the term-frequency matrix
print(X.toarray())
    
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
    
As the results show, there are 9 feature words in total, namely:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

At the same time, the output gives the count of each feature word in every sentence.

For example, the first sentence, "This is the first document.", has the term-frequency vector [0, 1, 1, 1, 0, 0, 1, 0, 1].

If positions are numbered starting from 1, this vector means that the word at position 2, "document", appears once; the word at position 3, "first", appears once; the word at position 4, "is", appears once; the word at position 7, "the", appears once; and the word at position 9, "this", appears once.

Thus every sentence yields a term-frequency vector.
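
To see this mapping between positions and words directly, here is a small sketch that pairs each feature word with its count in the first sentence (reusing the word and X variables from the example above):

# Pair each feature word with its count in the first sentence
for w, count in zip(word, X.toarray()[0]):
    print(w, count)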

3.2 TfidfTransformer

TfidfTransformer computes the TF-IDF value of each word in the vectorizer output. Usage is as follows:

# coding:utf-8
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the words in the texts into a term-frequency matrix
vectorizer = CountVectorizer()
# Count how many times each word appears
X = vectorizer.fit_transform(corpus)
# Get all the keywords in the vocabulary
word = vectorizer.get_feature_names()
print(word)
# Inspect the term-frequency matrix
print(X.toarray())
# ----------------------------------------------------
# Instantiate the transformer
transformer = TfidfTransformer()
print(transformer)
# Compute the TF-IDF values from the term-frequency matrix X
tfidf = transformer.fit_transform(X)
# Inspect the result: tfidf[i][j] is the tf-idf weight of word j in text i
print(tfidf.toarray())
    

The output is shown below:

[Figure: the printed TF-IDF weight matrix for the four example sentences]
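
Note that scikit-learn also provides a TfidfVectorizer class that combines CountVectorizer and TfidfTransformer in a single step; a minimal sketch on the same corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the vocabulary and compute the tf-idf weights in one step
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus)
print(tfidf.toarray())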


4 A Complete Mini Example

Usually we need the word-frequency statistics and the TF-IDF values at the same time, in which case the core code is:

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    
The full script is as follows:
# coding:utf-8
__author__ = "liuxuejiang"
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == "__main__":
    # Pre-segmented corpus (segmentation produced with jieba):
    # each document is a segmentation result with words separated by spaces
    corpus = ["我 来到 北京 清华大学",          # "I came to Tsinghua University in Beijing"
              "他 来到 了 网易 杭研 大厦",      # "He came to the NetEase Hangyan building"
              "小明 硕士 毕业 与 中国 科学院",  # "Xiao Ming got his master's degree from the Chinese Academy of Sciences"
              "我 爱 北京 天安门"]              # "I love Tiananmen in Beijing"
    # This class converts the words in the texts into a term-frequency matrix;
    # element a[i][j] is the frequency of word j in text i
    vectorizer = CountVectorizer()
    # This class computes the tf-idf weights
    transformer = TfidfTransformer()
    # The inner fit_transform builds the term-frequency matrix,
    # the outer fit_transform computes the tf-idf values
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    # Get all the words in the bag-of-words model
    word = vectorizer.get_feature_names()
    # Extract the tf-idf matrix; element a[i][j] is the tf-idf weight of word j in text i
    weight = tfidf.toarray()
    # Print the tf-idf weights for each text: the outer loop iterates over
    # the texts, the inner loop over the words of the vocabulary
    for i in range(len(weight)):
        print("------- tf-idf weights of the words in text", i, "-------")
        for j in range(len(word)):
            print(word[j], weight[i][j])
    

The output is as follows:

------- tf-idf weights of the words in text 0 -------   # original text: "我 来到 北京 清华大学"
中国 0.0
北京 0.52640543361
大厦 0.0
天安门 0.0
小明 0.0
来到 0.52640543361
杭研 0.0
毕业 0.0
清华大学 0.66767854461
硕士 0.0
科学院 0.0
网易 0.0
------- tf-idf weights of the words in text 1 -------   # original text: "他 来到 了 网易 杭研 大厦"
中国 0.0
北京 0.0
大厦 0.525472749264
天安门 0.0
小明 0.0
来到 0.414288751166
杭研 0.525472749264
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.525472749264
------- tf-idf weights of the words in text 2 -------   # original text: "小明 硕士 毕业 与 中国 科学院"
中国 0.4472135955
北京 0.0
大厦 0.0
天安门 0.0
小明 0.4472135955
来到 0.0
杭研 0.0
毕业 0.4472135955
清华大学 0.0
硕士 0.4472135955
科学院 0.4472135955
网易 0.0
------- tf-idf weights of the words in text 3 -------   # original text: "我 爱 北京 天安门"
中国 0.0
北京 0.61913029649
大厦 0.0
天安门 0.78528827571
小明 0.0
来到 0.0
杭研 0.0
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.0
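
Following the recipe from section 2 (sort each document's TF-IDF values in descending order and take the top words), here is a small sketch that extracts the top keywords per document from the weight matrix above:

import numpy as np

top_n = 3  # number of keywords to keep per document
for i, row in enumerate(weight):
    # indices of the words, sorted by descending tf-idf weight
    top = np.argsort(row)[::-1][:top_n]
    print("text", i, "->", [word[j] for j in top if row[j] > 0])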
    


Copyright notice
This article was written by [Cai junshuai]. When reposting, please include a link to the original. Thanks.
https://pythonmana.com/2021/09/20210909135057743t.html
