Module installation

pip install jieba

The jieba tokenizer supports four segmentation modes:

  1. Precise mode, which tries to segment the sentence as precisely as possible; suitable for text analysis.
  2. Full mode, which scans out every word in the sentence that can form a word. It is very fast, but it cannot resolve ambiguity, so ambiguous words are scanned out as well.
  3. Search engine mode, which further splits long words on top of precise mode to produce shorter words. In a search engine, typing part of a word should retrieve documents about the whole word, so this mode is suited to search engine segmentation.
  4. Paddle mode, which uses the PaddlePaddle deep learning framework to train a sequence-labeling network model for segmentation; it also supports part-of-speech tagging.

    This mode is only available in jieba v0.40 and above. To use it, you also need to install the PaddlePaddle module with "pip install paddlepaddle".

Open source code

https://github.com/fxsjy/jieba

Basic usage

>>> import jieba
>>> str1 = '我来到了成都的西南交通大学犀浦校区，发现这儿真不错'
>>> seg_list = jieba.cut(str1, cut_all=True)
>>> print('Full mode segmentation result: ' + '/'.join(seg_list))
Full mode segmentation result: 我/来到/了/成都/的/西南/交通/大学/犀/浦/校区/，/发现/这儿/真不/真不错/不错
>>> seg_list = jieba.cut(str1, cut_all=False)
>>> print('Precise mode segmentation result: ' + '/'.join(seg_list))
Precise mode segmentation result: 我/来到/了/成都/的/西南/交通/大学/犀浦/校区/，/发现/这儿/真不错

Enable Paddle

There are a couple of pitfalls here. First, the latest Python release does not work; I had to downgrade from Python 3.9.1 to 3.8.7 before it would install.

In addition, the installation kept reporting errors; it turned out that the Microsoft Visual C++ 2017 Redistributable (or later) is required.

import jieba
import paddle

str1 = '我来到了成都的西南交通大学犀浦校区，发现这儿真不错'
paddle.enable_static()
jieba.enable_paddle()
seg_list = jieba.cut(str1, use_paddle=True)
print('Paddle mode segmentation result: ' + '/'.join(seg_list))

Output:

Paddle mode segmentation result: 我/来到/了/成都/的/西南交通大学犀浦校区，/发现/这儿/真不错

Part of speech tagging

import jieba
import paddle
import jieba.posseg as pseg  # part-of-speech tagging together with segmentation

str1 = '我来到了成都的西南交通大学犀浦校区，发现这儿真不错'
paddle.enable_static()
jieba.enable_paddle()
words = pseg.cut(str1, use_paddle=True)
for seg, flag in words:
    print('%s %s' % (seg, flag))

Output:

我 r
来到 v
了 u
成都 LOC
的 u
西南交通大学犀浦校区， ORG
发现 v
这儿 r
真不错 a

Note: pseg.cut() and jieba.cut() return different kinds of objects!
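
To make the difference concrete, here is a minimal sketch (using a common demo sentence, not the text above, and assuming the plain non-paddle API): jieba.cut() yields plain strings, while pseg.cut() yields pair objects that expose .word and .flag attributes.

```python
import jieba
import jieba.posseg as pseg

sentence = '我爱北京天安门'

# jieba.cut() yields plain strings
for w in jieba.cut(sentence):
    print(type(w), w)            # <class 'str'> 我 ...

# pseg.cut() yields pair objects carrying the word and its part-of-speech flag
for p in pseg.cut(sentence):
    print(type(p), p.word, p.flag)
```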

The part-of-speech tag mapping used in paddle mode is as follows:

Paddle mode uses the set of part-of-speech and named-entity category labels shown in the table below: 24 part-of-speech labels (lowercase letters) and 4 named-entity category labels (uppercase letters).

| Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| n | common noun | f | locative noun | s | place noun | t | time |
| nr | person name | ns | place name | nt | organization name | nw | work title |
| nz | other proper noun | v | common verb | vd | verb used adverbially | vn | verb used as noun |
| a | adjective | ad | adjective used adverbially | an | adjective used as noun | d | adverb |
| m | numeral | q | measure word | r | pronoun | p | preposition |
| c | conjunction | u | particle | xc | other function word | w | punctuation |
| PER | person name | LOC | place name | ORG | organization name | TIME | time |

Adjust dictionary

  • add_word(word, freq=None, tag=None) and del_word(word) can be used to modify the dictionary dynamically from within a program.

  • suggest_freq(segment, tune=True) can be used to adjust the frequency of a single word so that it can (or cannot) be segmented out; see the sketch after this list.
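
A minimal sketch of suggest_freq(), adapted from the example in the jieba README: by default the dictionary keeps 中将 together in the sentence below, and tuning the frequency makes jieba split it into 中/将.

```python
import jieba

print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
# expected: 如果/放到/post/中将/出错/。

jieba.suggest_freq(('中', '将'), True)  # raise the frequency of the split form
print('/'.join(jieba.cut('如果放到post中将出错。', HMM=False)))
# expected: 如果/放到/post/中/将/出错/。
```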

Intelligent recognition of new words

Setting the HMM parameter of jieba.cut() to True enables the HMM model to recognize new words, that is, words that do not exist in the dictionary.

I gave it a quick test; the results were only so-so.
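
For reference, a quick check of the HMM switch (a sketch using the sentence from the jieba README, where 杭研 is not in the dictionary):

```python
import jieba

# HMM disabled: the out-of-vocabulary word falls apart into single characters
print('/'.join(jieba.cut('他来到了网易杭研大厦', HMM=False)))
# expected: 他/来到/了/网易/杭/研/大厦

# HMM enabled (the default): 杭研 is recognized as a new word via the Viterbi algorithm
print('/'.join(jieba.cut('他来到了网易杭研大厦', HMM=True)))
# expected: 他/来到/了/网易/杭研/大厦
```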

Search engine mode segmentation

import jieba
str1 = '我来到了成都的西南交通大学犀浦校区，发现这儿真不错'
seg_list = jieba.cut_for_search(str1)
print('Search engine mode segmentation result: ' + '/'.join(seg_list))

Output:

Search engine mode segmentation result: 我/来到/了/成都/的/西南/交通/大学/犀浦/校区/，/发现/这儿/真不/不错/真不错

Use a custom dictionary

A custom dictionary file, 用户词典.txt, is prepared in advance.
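
(The file's actual entries are not reproduced in this post. For reference, load_userdict() expects one entry per line: the word, followed by an optional frequency and an optional part-of-speech tag, separated by spaces and saved as UTF-8. A hypothetical file, using the sample entries from the jieba README, might look like this:)

```
云计算 5
李小福 2 nr
凱特琳 nz
```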

import jieba

str1 = 'Looking back like telepathy, only then can I see the tenderness of the bow; it is also the tenderness of bowing your head, like a water lotus too shy for the cool wind; it is also the most bashful, only then can the two of them walk hand in hand.'

seg_list = jieba.cut(str1)
print('Precise mode result without the custom dictionary:\n', '/'.join(seg_list))

jieba.load_userdict('用户词典.txt')
seg_list = jieba.cut(str1)
print('Precise mode result with the custom dictionary loaded:\n', '/'.join(seg_list))

jieba.add_word('Most of all')
seg_list = jieba.cut(str1)
print('Precise mode result after adding a custom word:\n', '/'.join(seg_list))

jieba.del_word('Look down')
seg_list = jieba.cut(str1)
print('Precise mode result after deleting a custom word:\n', '/'.join(seg_list))

Keyword extraction

Keywords are the words that best reflect the theme and meaning of a text. Keyword extraction pulls the most relevant words out of a given text and can be applied to document retrieval, classification, automatic summarization, and so on.

There are two main approaches to extracting keywords from text:

  • The first is supervised learning;

    This approach treats keyword extraction as a binary classification problem: candidate words that might be keywords are extracted first, and each candidate is then judged to be either a "keyword" or "not a keyword". On this principle, a keyword classifier model is designed and continuously trained on text until it can accurately extract keywords from new text.
  • The second is unsupervised learning;

    This approach scores the candidate words and takes the highest-scoring candidates as the keywords. Common scoring algorithms are TF-IDF and TextRank, and the jieba module provides keyword extraction functions based on both.
# Keyword extraction based on the TF-IDF algorithm
from jieba import analyse
text = '记者日前从中国科学院南京地质古生物研究所获悉，该所早期生命研究团队与美国学者合作，在中国湖北三峡地区的石板滩生物群中，发现了4种形似树叶的远古生物。这些“树叶”实际上是形态奇特的早期动物，它们生活在远古海洋底部。相关研究成果已发表在国际古生物学专业期刊《古生物学杂志》上。'
keywords = analyse.extract_tags(text, topK=10, withWeight=True, allowPOS=('n', 'v'))
print(keywords)

# Keyword extraction based on the TextRank algorithm
keywords = analyse.textrank(text, topK=10, withWeight=True, allowPOS=('n', 'v'))
print(keywords)

Notes:

  • extract_tags()

    • The sentence parameter is the text from which to extract keywords.
    • The topK parameter specifies how many keywords to return; the default is 20.
    • The withWeight parameter specifies whether to return each keyword's weight as well; the default is False, meaning no weights are returned. The higher a keyword's TF-IDF weight, the earlier it appears in the results.
    • The allowPOS parameter restricts the parts of speech of the returned keywords, filtering the results; the default is empty, meaning no filtering.
  • textrank()
    • Its parameters are basically the same as those of extract_tags(); only the default value of allowPOS differs (see the sketch below).
    • Because the algorithms differ, the results may also differ.
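
As a rough sketch of that difference (based on the defaults in the jieba source, where extract_tags() uses allowPOS=() with no POS filtering, while textrank() defaults to allowPOS=('ns', 'n', 'vn', 'v')):

```python
from jieba import analyse

# assumes the same `text` variable defined in the example above
print(analyse.extract_tags(text, topK=5))  # TF-IDF; all parts of speech by default
print(analyse.textrank(text, topK=5))      # TextRank; place names, nouns, and verbs only by default
```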

Stop word filtering

Stop words are words that appear in large numbers in almost every document but contribute little to NLP, such as “你” (you), “我” (I), “的” (of), “在” (in), and punctuation marks. Filtering out stop words after segmentation helps improve the efficiency of NLP.

```python
import jieba

# load the stop word list: one stop word per line
with open('stopwords.txt', 'r', encoding='utf-8') as fp:
    stopwords = fp.read().split('\n')

word_list = []
text = '商务部4月23日发布的数据显示，一季度，全国农产品网络零售额达936.8亿元，增长31.0%；电商直播超过400万场。电商给农民带来了新的机遇。'
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        word_list.append(seg)
print('Segmentation result with stop word filtering enabled:\n', '/'.join(word_list))
```


Word frequency statistics

Word frequency is a very important concept in NLP and underlies both word segmentation and keyword extraction. When building a segmentation dictionary, a frequency usually has to be set for each word, and word frequency statistics can objectively reflect what a text emphasizes.
```python
import jieba

text = '蒸馍馍锅锅蒸馍馍，馍馍蒸了一锅锅，馍馍搁上桌桌，桌桌上面有馍馍。'
with open('stopwords.txt', 'r', encoding='utf-8') as fp:
    stopwords = fp.read().split('\n')

word_dict = {}
jieba.suggest_freq('桌桌', True)  # make sure 桌桌 is kept as one word
seg_list = jieba.cut(text)
for seg in seg_list:
    if seg not in stopwords:
        if seg in word_dict:
            word_dict[seg] += 1
        else:
            word_dict[seg] = 1
print(word_dict)
```

Output:

{'蒸': 3, '馍馍': 5, '锅锅': 1, '一锅': 1, '锅': 1, '搁': 1, '桌桌': 2, '上面': 1}
