Summary of Chinese word segmentation based on Jieba

itread01 2021-02-22 14:09:54


[TOC]

## Module installation

```
pip install jieba
```

jieba supports four word segmentation modes:

1. Precise mode, which tries to cut the sentence into words as precisely as possible; suitable for text analysis.
2. Full mode, which scans out every word in the sentence that can form a word. It is very fast, but it cannot resolve ambiguity, so ambiguous words are also returned.
3. Search engine mode, which further splits the long words produced by precise mode into shorter words. In a search engine, entering part of a word should still retrieve documents containing the whole word, so this mode is suited to search engine indexing.
4. Paddle mode, which uses the PaddlePaddle deep learning framework and a trained sequence labeling network model to segment words; it also supports part-of-speech tagging. This mode is only available in jieba 0.40 and above. To use it, install the PaddlePaddle module with `pip install paddlepaddle`.

## Source code

```
https://github.com/fxsjy/jieba
```

## Basic usage

```Python
>>> import jieba
>>> str1 = 'I came to the Xipu campus of Southwest Jiaotong University in Chengdu, and found that it is really nice here'
>>> seg_list = jieba.cut(str1, cut_all=True)
>>> print('Full mode result: ' + '/'.join(seg_list))
Full mode result: I/Come to/了/Chengdu/Of/southwest/traffic/University/Rhinoceros/Pu/campus/,/Find out/Here/Really not/That's great/Good
>>> seg_list = jieba.cut(str1, cut_all=False)
>>> print('Precise mode result: ' + '/'.join(seg_list))
Precise mode result: I/Come to/了/Chengdu/Of/southwest/traffic/University/Xipu/campus/,/Find out/Here/That's great
```

## Enable Paddle

> There are a couple of pitfalls here. First, the latest Python version cannot be used; I had to downgrade from 3.9.1 to 3.8.7.
> In addition, the installation kept reporting errors; it turned out that the `Microsoft Visual C++ 2017 Redistributable` (or above) is required.

```Python
import jieba
import paddle

str1 = 'I came to the Xipu campus of Southwest Jiaotong University in Chengdu, and found that it is really nice here'
paddle.enable_static()
jieba.enable_paddle()
seg_list = jieba.cut(str1, use_paddle=True)
print('Paddle mode result: ' + '/'.join(seg_list))
```

Output:

```
Paddle mode result: I/Come to/了/Chengdu/Of/Xipu campus of Southwest Jiaotong University/,/Find out/Here/That's great
```

## Part-of-speech tagging

```Python
import jieba
import paddle
# Part-of-speech tagging and word segmentation
import jieba.posseg as pseg

str1 = 'I came to the Xipu campus of Southwest Jiaotong University in Chengdu, and found that it is really nice here'
paddle.enable_static()
jieba.enable_paddle()
words = pseg.cut(str1, use_paddle=True)
for seg, flag in words:
    print('%s %s' % (seg, flag))
```

Output:

```
I r
Come to v
了 u
Chengdu LOC
Of u
Xipu campus of Southwest Jiaotong University , ORG
Find out v
Here r
That's great a
```

> Note: pseg.cut and jieba.cut return different kinds of objects!

The part-of-speech and proper-noun category labels used in paddle mode are listed in the table below: 24 part-of-speech tags (lowercase) and 4 proper-noun category tags (uppercase). A small example of filtering by these tags follows the table.

| Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
| ---- | -------- | ---- | -------- | ---- | -------- | ---- | -------- |
| n | Common noun | f | Locative noun | s | Place noun | t | Time |
| nr | Person name | ns | Place name | nt | Organization name | nw | Work title |
| nz | Other proper noun | v | Common verb | vd | Verb used as adverb | vn | Verb used as noun |
| a | Adjective | ad | Adjective used as adverb | an | Adjective used as noun | d | Adverb |
| m | Numeral | q | Measure word | r | Pronoun | p | Preposition |
| c | Conjunction | u | Auxiliary word | xc | Other function word | w | Punctuation |
| PER | Person name | LOC | Place name | ORG | Organization name | TIME | Time |
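The tag table can be used to filter segmentation results by part of speech. The snippet below is a minimal sketch, not from the original article: it keeps only noun- and verb-tagged words using pseg.cut() in the default (non-paddle) mode, whose lowercase tag set largely overlaps with the table above, and it uses the classic example sentence from the jieba README.

```Python
import jieba.posseg as pseg

# Classic example sentence from the jieba README
text = '我来到北京清华大学'
# Keep only noun- and verb-like tags (see the table above)
wanted = {'n', 'nr', 'ns', 'nt', 'nz', 'v', 'vn'}

# pseg.cut() yields pair objects with .word and .flag attributes
kept = [pair.word for pair in pseg.cut(text) if pair.flag in wanted]
print('/'.join(kept))
```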
## Adjusting the dictionary

+ Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically inside a program.
+ Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be split out (see the sketch after the custom-dictionary example below).

## Smart recognition of new words

Setting the HMM argument of jieba.cut() to True enables the HMM model to recognize new words, i.e. words that do not exist in the dictionary. I tested it, and the results were only so-so.

## Search engine mode segmentation

```python
import jieba

str1 = 'I came to the Xipu campus of Southwest Jiaotong University in Chengdu, and found that it is really nice here'
seg_list = jieba.cut_for_search(str1)
print('Search engine mode result: ' + '/'.join(seg_list))
```

Output:

```
Search engine mode result: I/Come to/了/Chengdu/Of/southwest/traffic/University/Xipu/campus/,/Find out/Here/Really not/Good/That's great
```

## Using a custom dictionary

The user dictionary file (User dictionary.txt) looks like this:

![](https://img2020.cnblogs.com/blog/2200001/202102/2200001-20210221182839297-874771894.png)

```python
import jieba

str = 'Looking back like telepathy, only then can I see the tenderness of the bow; it is also the tenderness of bowing your head, like a water lotus full of cold wind; it is also the most shameful, and only in this way can the two of them walk hand in hand.'
seg_list = jieba.cut(str)
print('Precise mode result without the custom dictionary loaded:\n', '/'.join(seg_list))
jieba.load_userdict('User dictionary.txt')
seg_list = jieba.cut(str)
print('Precise mode result with the custom dictionary loaded:\n', '/'.join(seg_list))
jieba.add_word('Most of all')
seg_list = jieba.cut(str)
print('Precise mode result after adding a custom word:\n', '/'.join(seg_list))
jieba.del_word('Bow your head')
seg_list = jieba.cut(str)
print('Precise mode result after deleting a custom word:\n', '/'.join(seg_list))
```
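As a complement to the dictionary-adjustment functions listed above, here is a minimal sketch of suggest_freq() in action. It is not from the original article but follows the usage shown in the jieba README: forcing a phrase to stay together, or forcing it to be split apart.

```Python
import jieba

# Keep '台中' together as one word (usage from the jieba README)
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))
jieba.suggest_freq('台中', True)
print('/'.join(jieba.cut('「台中」正确应该不会被切开', HMM=False)))

# Force '中将' to be split into '中' / '将'
jieba.suggest_freq(('中', '将'), True)
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
```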
## Keyword extraction

Keywords are the words that best reflect the topic and meaning of a text. Keyword extraction pulls the words most relevant to a given text and can be applied to document retrieval, classification, automatic summarization, and so on.

There are two main approaches to keyword extraction:

+ The first is supervised learning. It treats keyword extraction as a binary classification problem: candidate words are first extracted, and each candidate is then judged to be either "a keyword" or "not a keyword". A keyword classifier is trained on labeled words until the model can accurately extract keywords from new text.
+ The second is unsupervised learning. Candidate words are scored, and the highest-scoring candidates are taken as keywords. Common scoring algorithms are TF-IDF and TextRank.

The jieba module provides keyword extraction functions based on both TF-IDF and TextRank.

```python
# Keyword extraction based on the TF-IDF algorithm
from jieba import analyse
text = 'The reporter recently learned from the Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, that the early-life research team at the institute, working with American scholars, found 4 leaf-like ancient organisms in the Shibantan biota of the Three Gorges area, Hubei Province, China. These “leaves” are in fact strange early animals that lived at the bottom of the ancient ocean. The related research results have been published in the international paleontology journal “Journal of Paleontology”.'
keywords = analyse.extract_tags(text, topK = 10, withWeight = True, allowPOS = ('n', 'v'))
print(keywords)

# Keyword extraction based on the TextRank algorithm
from jieba import analyse
text = 'The reporter recently learned from the Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, that the early-life research team at the institute, working with American scholars, found 4 leaf-like ancient organisms in the Shibantan biota of the Three Gorges area, Hubei Province, China. These “leaves” are in fact strange early animals that lived at the bottom of the ancient ocean. The related research results have been published in the international paleontology journal “Journal of Paleontology”.'
keywords = analyse.textrank(text, topK = 10, withWeight = True, allowPOS = ('n', 'v'))
print(keywords)
```

Explanation:

+ extract_tags()
  + The argument sentence is the text from which keywords are to be extracted;
  + The argument topK specifies how many keywords to return; the default is 20;
  + The argument withWeight specifies whether to return the weight along with each keyword; the default is False, i.e. no weight is returned. The higher the TF-IDF weight, the higher the keyword ranks;
  + The argument allowPOS restricts the parts of speech of the returned keywords; the default is empty, meaning no filtering.
+ textrank()
  + Its arguments are basically the same as those of extract_tags(); only the default value of allowPOS differs.
  + Because the algorithms differ, the results may also differ.
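Keyword extraction can also be combined with a stop-word list (stop words are covered in the next section): jieba.analyse provides set_stop_words() for pointing extraction at a custom stop-word file. A minimal sketch, not from the original article, assuming a local stopwords.txt like the one used later in this article and a placeholder text:

```Python
from jieba import analyse

# Point keyword extraction at a custom stop-word file (the path is an assumption)
analyse.set_stop_words('stopwords.txt')

text = 'Some text to extract keywords from'  # placeholder text
print(analyse.extract_tags(text, topK=10))
```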
## Stop word filtering

Stop words are words that appear in large numbers in every document but contribute little to NLP, such as "you", "I", "of", "at", and punctuation. Filtering out stop words after word segmentation helps improve the efficiency of NLP.

```python
import jieba

with open('stopwords.txt', 'r+', encoding = 'utf-8') as fp:
    stopwords = fp.read().split('\n')
word_list = []
text = 'According to information released by the Ministry of Commerce on April 23, in the first quarter the national online retail sales of agricultural products reached 93.68 billion yuan, up 31.0%, with more than 4 million. E-commerce has brought new opportunities to farmers.'
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        word_list.append(seg)
print('Segmentation result with stop word filtering enabled:\n', '/'.join(word_list))
```

## Word frequency statistics

Word frequency is a very important concept in NLP and is the basis of word segmentation and keyword extraction. When building a segmentation dictionary, a frequency usually has to be set for each word. Word frequency statistics objectively reflect what a piece of text emphasizes.

```python
import jieba

text = 'Steamed bun pot steamed bun pot steamed bun, steamed buns in a pot, put the steamed bread on the table, there are steamed buns on the table.'
with open('stopwords.txt', 'r+', encoding = 'utf-8') as fp:
    stopwords = fp.read().split('\n')
word_dict = {}
jieba.suggest_freq(('The table'), True)
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        if seg in word_dict.keys():
            word_dict[seg] += 1
        else:
            word_dict[seg] = 1
print(word_dict)
```

Output:

```
{'steamed': 3, 'Steamed bun': 5, 'Pot pot': 1, 'A pot of': 1, 'Pot': 1, 'Shelve': 1, 'The table': 2, 'above'
```
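The manual counting loop above can also be written with collections.Counter from the standard library. A minimal sketch, not from the original article, using the same sample sentence and skipping the stop-word filtering for brevity:

```Python
from collections import Counter

import jieba

text = 'Steamed bun pot steamed bun pot steamed bun, steamed buns in a pot, put the steamed bread on the table, there are steamed buns on the table.'
# Counter performs the same per-word tallying as the manual dict above
word_counts = Counter(jieba.cut(text))
# The five most frequent tokens, as (word, count) pairs
print(word_counts.most_common(5))
```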
