Summary of Chinese word segmentation based on Jieba

Doraemon paradise 2021-02-22 13:56:50


Module installation

pip install jieba

The jieba tokenizer supports 4 segmentation modes:

  1. Precise mode, which tries to split the sentence as accurately as possible; suitable for text analysis.
  2. Full mode, which scans out every word in the sentence that exists in the dictionary. It is very fast, but it cannot resolve ambiguity, so ambiguous words are all returned.
  3. Search engine mode, which further splits the long words produced by precise mode into shorter words. In a search engine, typing part of a word should retrieve documents containing the whole word, so this mode is suitable for search engine segmentation.
  4. Paddle mode, which uses the PaddlePaddle deep learning framework and a trained sequence labeling network model for segmentation; it also supports part-of-speech tagging.
    This mode is only available in jieba 0.40 and above. To use it, install the PaddlePaddle module with "pip install paddlepaddle".
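
For orientation, here is a minimal sketch of how the four modes are invoked (the example sentence is the one used throughout this article; paddle mode is commented out because it needs the extra paddlepaddle dependency):

import jieba

text = '我来到了成都的西南交通大学犀浦校区，发现这里真不错'

# precise mode (default): cut_all=False
print('/'.join(jieba.cut(text, cut_all=False)))

# full mode: cut_all=True
print('/'.join(jieba.cut(text, cut_all=True)))

# search engine mode: further splits long words found in precise mode
print('/'.join(jieba.cut_for_search(text)))

# paddle mode: requires paddlepaddle and an explicit jieba.enable_paddle()
# jieba.enable_paddle()
# print('/'.join(jieba.cut(text, use_paddle=True)))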

Open source code

https://github.com/fxsjy/jieba

Basic usage

>>> import jieba
>>> str1 = '我来到了成都的西南交通大学犀浦校区，发现这里真不错'
>>> seg_list = jieba.cut(str1, cut_all=True)
>>> print('Full mode result: ' + '/'.join(seg_list))
Full mode result: 我/来到/了/成都/的/西南/交通/大学/犀/浦/校区/，/发现/这里/真不/真不错/不错
>>> seg_list = jieba.cut(str1, cut_all=False)
>>> print('Precise mode result: ' + '/'.join(seg_list))
Precise mode result: 我/来到/了/成都/的/西南/交通/大学/犀浦/校区/，/发现/这里/真不错

Enable Paddle

There is a pitfall here. First, you cannot use the newest Python release; I had to downgrade from 3.9.1 to 3.8.7 before it worked.
In addition, the installation kept failing with errors; it turned out that Microsoft Visual C++ 2017 Redistributable (or later) is required.

import jieba
import paddle
str1 = '我来到了成都的西南交通大学犀浦校区，发现这里真不错'
paddle.enable_static()   # switch Paddle to static-graph mode
jieba.enable_paddle()    # initialize jieba's paddle mode
seg_list = jieba.cut(str1, use_paddle=True)
print('Paddle mode result: ' + '/'.join(seg_list))
Output:
Paddle mode result: 我/来到/了/成都/的/西南交通大学犀浦校区，/发现/这里/真不错

Part of speech tagging

import jieba
import paddle
# part-of-speech tagging together with segmentation
import jieba.posseg as pseg
str1 = '我来到了成都的西南交通大学犀浦校区，发现这里真不错'
paddle.enable_static()
jieba.enable_paddle()
words = pseg.cut(str1, use_paddle=True)
for seg, flag in words:
    print('%s %s' % (seg, flag))
Output:
我 r
来到 v
了 u
成都 LOC
的 u
西南交通大学犀浦校区， ORG
发现 v
这里 r
真不错 a

Note: pseg.cut() and jieba.cut() return different kinds of objects! jieba.cut() yields plain strings, while pseg.cut() yields pair objects that carry both the word and its part-of-speech flag.
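
A minimal sketch of the difference (precise mode without paddle, so the tags come from jieba's default tag set rather than the paddle table below):

import jieba
import jieba.posseg as pseg

text = '我来到了成都'

# jieba.cut() yields plain strings
for word in jieba.cut(text):
    print(word)

# pseg.cut() yields pair objects carrying the word and its POS flag
for pair in pseg.cut(text):
    print(pair.word, pair.flag)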

The paddle-mode tag set is shown in the table below: 24 part-of-speech tags (lowercase letters) and 4 proper-noun category tags (uppercase letters).

tag  meaning               tag  meaning               tag  meaning               tag   meaning
n    common noun           f    locative noun         s    place noun            t     time
nr   person name           ns   place name            nt   organization name     nw    work title
nz   other proper noun     v    common verb           vd   verb-adverb           vn    verbal noun
a    adjective             ad   adverbial adjective   an   nominal adjective     d     adverb
m    numeral               q    measure word          r    pronoun               p     preposition
c    conjunction           u    particle              xc   other function word   w     punctuation
PER  person name           LOC  place name            ORG  organization name     TIME  time

Adjust dictionary

  • Use add_word(word, freq=None, tag=None) and del_word(word) to modify the dictionary dynamically inside a program.

  • Use suggest_freq(segment, tune=True) to adjust the frequency of a single word so that it can (or cannot) be produced as a token; see the sketch below.
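
A minimal sketch of these calls; the frequency-adjustment example is the one from jieba's own documentation, while the added and deleted words are arbitrary placeholders:

import jieba

# register a new word so it will not be split (frequency and POS tag are optional)
jieba.add_word('石墨烯', freq=None, tag=None)

# remove a word so it can no longer be produced as a single token
jieba.del_word('自定义词')

# tune the frequency so that '中' and '将' are split apart
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
jieba.suggest_freq(('中', '将'), True)
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))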

Intelligent recognition of new words

Set the HMM parameter of jieba.cut() to True to let the HMM model recognize new words, i.e. words that are not in the dictionary.

I tested it; the results are only so-so.

import jieba
str1 = '我来到了成都的西南交通大学犀浦校区，发现这里真不错'
seg_list = jieba.cut_for_search(str1)
print('Search engine mode result: ' + '/'.join(seg_list))
Output:
Search engine mode result: 我/来到/了/成都/的/西南/交通/大学/犀浦/校区/，/发现/这里/真不/不错/真不错
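
Coming back to new-word recognition, here is a minimal sketch of what toggling HMM changes, using the example sentence from jieba's documentation (杭研 is not in the dictionary and can only be produced with HMM enabled):

import jieba

text = '他来到了网易杭研大厦'

# HMM enabled (the default): the unseen word 杭研 is assembled by the HMM
print('/'.join(jieba.cut(text, HMM=True)))

# HMM disabled: only dictionary words and single characters are produced
print('/'.join(jieba.cut(text, HMM=False)))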

Use a custom dictionary

The user dictionary is a plain text file (one word per line, optionally followed by a frequency and a POS tag); it is loaded below as 'User dictionary.txt'.

import jieba
text = 'Looking back as if by telepathy, only then can one see the tenderness of that bowed head; it is also the tenderness of that bowed head, like a water lotus too shy for the cool wind; and it is also that most bashful moment that lets the two of them walk hand in hand.'
seg_list = jieba.cut(text)
print('Precise mode result without the custom dictionary:\n', '/'.join(seg_list))
jieba.load_userdict('User dictionary.txt')
seg_list = jieba.cut(text)
print('Precise mode result with the custom dictionary loaded:\n', '/'.join(seg_list))
jieba.add_word('Most of all')
seg_list = jieba.cut(text)
print('Precise mode result after adding a custom word:\n', '/'.join(seg_list))
jieba.del_word('Look down')
seg_list = jieba.cut(text)
print('Precise mode result after deleting a custom word:\n', '/'.join(seg_list))

Keyword extraction

Keywords are the words that best reflect the topic and meaning of a text. Keyword extraction pulls the words most relevant to its meaning out of a given text and can be applied to document retrieval, classification, automatic summarization, and so on.

There are two main approaches to extracting keywords from text:

  • The first is supervised learning.
    This approach treats keyword extraction as a binary classification problem: candidate words are extracted first, and each candidate is then classified as either "keyword" or "not a keyword". A keyword classifier model is built on this principle and trained continuously on text until it can accurately extract keywords from new text.
  • The second is unsupervised learning.
    This approach scores the candidate words and takes the highest-scoring ones as keywords. Common scoring algorithms are TF-IDF and TextRank, and the jieba module provides keyword extraction functions based on both.
# keyword extraction based on the TF-IDF algorithm
from jieba import analyse
text = ' The reporter recently learned from Nanjing Institute of Geology and paleontology, Chinese Academy of Sciences that , The Institute's early life research team works with American scholars , In the Shibantan biota of the Three Gorges area, Hubei Province, China , Found out 4 An ancient creature that looks like a leaf . these “ Leaf ” In fact, they were early animals with peculiar shapes , They lived at the bottom of the ancient ocean . The related research results have been published in the international professional journal of paleontology 《 Journal of paleontology 》 On .'
keywords = analyse.extract_tags(text, topK = 10, withWeight = True, allowPOS = ('n', 'v'))
print(keywords)
# keyword extraction based on the TextRank algorithm
from jieba import analyse
text = ' The reporter recently learned from Nanjing Institute of Geology and paleontology, Chinese Academy of Sciences that , The Institute's early life research team works with American scholars , In the Shibantan biota of the Three Gorges area, Hubei Province, China , Found out 4 An ancient creature that looks like a leaf . these “ Leaf ” In fact, they were early animals with peculiar shapes , They lived at the bottom of the ancient ocean . The related research results have been published in the international professional journal of paleontology 《 Journal of paleontology 》 On .'
keywords = analyse.textrank(text, topK = 10, withWeight = True, allowPOS = ('n', 'v'))
print(keywords)

Explanation:

  • extract_tags()
    • The parameter sentence is the text from which keywords are to be extracted;
    • The parameter topK specifies how many keywords to return; the default is 20;
    • The parameter withWeight specifies whether to return each keyword's weight as well; the default is False (no weights). The higher the TF-IDF weight, the earlier the keyword is returned;
    • The parameter allowPOS restricts the parts of speech of the returned keywords, filtering the results; the default is empty, meaning no filtering.
  • textrank()
    • Its parameters are basically the same as those of extract_tags(); only the default value of allowPOS differs (for textrank it defaults to ('ns', 'n', 'vn', 'v')).
    • Because the algorithms differ, the results may also differ.
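
For instance, a minimal sketch of consuming the results when withWeight=True (the text variable here is just a placeholder):

from jieba import analyse

text = '...'   # any text to analyse

# with withWeight=True each item is a (keyword, weight) tuple
for keyword, weight in analyse.extract_tags(text, topK=5, withWeight=True):
    print(keyword, weight)

# with withWeight=False (the default) a plain list of keywords is returned
print(analyse.extract_tags(text, topK=5))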

Stop word filtering

Stop words are words that occur in large numbers in almost every document but contribute little to NLP, such as "you", "I", "of", "in", and punctuation marks. Filtering out stop words after segmentation helps make NLP more efficient.

import jieba
# load the stop word list, one word per line
with open('stopwords.txt', 'r', encoding='utf-8') as fp:
    stopwords = fp.read().split('\n')
word_list = []
text = 'Data released by the Ministry of Commerce on April 23 show that in the first quarter, national online retail sales of agricultural products reached 93.68 billion yuan, up 31.0%, with more than 4 million sessions held. E-commerce has brought new opportunities to farmers.'
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        word_list.append(seg)
print('Segmentation result with stop word filtering enabled:\n', '/'.join(word_list))


Word frequency statistics

Word frequency is a very important concept in NLP and underlies both word segmentation and keyword extraction. When building a segmentation dictionary, a frequency usually has to be set for each word. Word frequency statistics objectively reflect what a text emphasizes.

import jieba
text = ' Steamed bun pot steamed bun pot steamed bun , Steamed buns in a pot , Put the steamed bread on the table , There are steamed buns on the table .'
with open('stopwords.txt', 'r', encoding='utf-8') as fp:
    stopwords = fp.read().split('\n')
word_dict = {}
# make sure ' The table ' is kept as a single word
jieba.suggest_freq(' The table ', True)
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        if seg in word_dict:
            word_dict[seg] += 1
        else:
            word_dict[seg] = 1
print(word_dict)
Output:
{' steamed ': 3, ' a steamed bun ': 5, ' Pot pot ': 1, ' a pot of ': 1, ' pan ': 1, ' Put away ': 1, ' The table ': 2, ' above ': 1}
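
The counting loop above can also be written more compactly with collections.Counter from the standard library. A minimal sketch, reusing the same sample text with an illustrative stop word set rather than the stopwords.txt file:

from collections import Counter
import jieba

text = ' Steamed bun pot steamed bun pot steamed bun , Steamed buns in a pot , Put the steamed bread on the table , There are steamed buns on the table .'
stopwords = {',', '.', ' '}   # illustrative stop words; in practice load them from a file

# count every segmented token that is not a stop word
word_counts = Counter(seg for seg in jieba.cut(text) if seg not in stopwords)
print(word_counts.most_common(5))   # the five most frequent words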
Copyright notice
This article was written by [Doraemon paradise]. Please keep a link to the original when reposting. Thanks.
https://pythonmana.com/2021/02/20210221184728078F.html
