On January 16, the fifth season of the Chinese Poetry Conference came to an end. The first season began on February 12, 2016, just four years earlier. On this stage, the gifted 16-year-old Wu Yishu, the food-delivery rider Lei Haiwei who works rain or shine, Chen Geng, the poised and graceful doctoral student from Peking University, and Peng Min, the three-season veteran with no regrets, all impressed us with their wonderful performances. The Chinese Poetry Conference has quietly influenced a great many Chinese people and inspired many of them to love poetry.
Because I love the show, I thought of using Python + ElasticSearch to hold a big-data version of the "poetry conference". If you like the idea, come and try it with me. After reading the "flying flower" section at the end of this article, I believe you will have enough courage and confidence to sign up for the next season of the Chinese Poetry Conference!
ElasticSearch is a distributed, highly scalable, near-real-time search and data analytics engine. It makes it convenient to search, analyze, and explore large amounts of data. Taking full advantage of ElasticSearch's horizontal scalability can make data more valuable in a production environment.
ElasticSearch is a NoSQL database, and its basic concepts differ from those of a traditional relational database. Let's look at two of these concepts:
Document
A NoSQL database of this kind is also called a document database. A document corresponds to a record (a row) in a relational database.
Index
In a relational database, an index is a data structure built to speed up queries. Unlike a relational database, Elasticsearch builds an index in this sense for every field, and that index is transparent to the user, so Elasticsearch no longer talks about indexes in this sense. Instead it gives the word two meanings: as a verb, to index a document means to store it in Elasticsearch; as a noun, an index is the collection that holds documents, roughly comparable to a database or table.
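To make the idea of a per-field index more concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration under simplifying assumptions: the two sample documents and the function name `build_inverted_index` are made up for this example, terms are split on whitespace, and nothing here reflects how Elasticsearch actually stores its data.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term of `field` to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Simplification: lowercase the value and split it on whitespace.
        for term in doc[field].lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: {"author": "Li Bai", "content": "moon over the river"},
    2: {"author": "Du Fu", "content": "river flows east"},
}

content_index = build_inverted_index(docs, "content")
print(content_index["river"])  # ids of the documents whose content contains "river"
```

A lookup in such a structure goes straight from a term to the matching documents, which is why searching does not need to scan every document.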
ElasticSearch is written in Java, so before installing ElasticSearch you first need to install a Java runtime. To lighten the load on your computer, you can skip the JDK and install only the JRE.
After installation, the environment variables need to be set. I installed jre1.8.0_241 under C:\Program Files\Java\; if you installed a different version or used a different path, fill in the values according to your actual installation.
Once the environment variables are set, you can install ElasticSearch. Installation is easy: download it from the official website, unzip it, and run elasticsearch.bat in the bin folder under the unzipped path to start the ElasticSearch service.
Install the Python client with the ever-handy pip:
pip install elasticsearch
After a successful installation, you can use this client to connect to the Elasticsearch server.
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> es.info()
{'name': 'LAPTOP-8507OGEN',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'OwrXmbSwTk6LB-q9lFDV0w',
 'version': {'number': '7.5.2',
             'build_flavor': 'default',
             'build_type': 'zip',
             'build_hash': '8bec50e1e0ad29dad5653712cf3bb580cd1afcdf',
             'build_date': '2020-01-15T12:11:52.313576Z',
             'build_snapshot': False,
             'lucene_version': '8.3.0',
             'minimum_wire_compatibility_version': '6.8.0',
             'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}
My crawl target this time is Gushiwen (the ancient poetry website), which classifies poems in many ways. I'm going to crawl the Three Hundred Tang Poems and the Three Hundred Song Ci.
ElasticSearch can index data directly without an index being created first. However, an automatically created index is not ideal for some of the advanced aggregation features, and the field types of an index cannot be changed after it is created. It is better to create the index before indexing any data.
>>> from elasticsearch import Elasticsearch, client
>>> es = Elasticsearch()
>>> ic = client.IndicesClient(es)
>>> doc = {
    "mappings": {
        "properties": {
            "title": {"type": "keyword"},      # title
            "epigraph": {"type": "keyword"},   # ci tune name (cipai)
            "dynasty": {"type": "keyword"},    # dynasty
            "author": {"type": "keyword"},     # author
            "content": {"type": "text"}        # content
        }
    }
}
>>> ic.create(index='poetry', body=doc)
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'poetry'}
Python has many libraries for HTTP access, and the most convenient one is requests. I will use requests to fetch the web pages and bs4 to parse them.
Get the list of the Three Hundred Tang Poems
Fetch the HTML of the Three Hundred Tang Poems list page:
>>> import requests
>>> html = requests.get('https://so.gushiwen.org/gushi/tangshi.aspx').text
Then parse the HTML with the BeautifulSoup module to get the list of poem titles and addresses:
>>> from bs4 import BeautifulSoup
>>> import lxml
>>> soup = BeautifulSoup(html, "lxml")
>>> typecont = soup.find_all(attrs={"class": "typecont"})
>>> index = 1
>>> for div in typecont:
    for ch in div.children:
        if ch.name == 'span':
            print(index, ch.a.text, ch.a.attrs['href'])
            index += 1
This gives me the address list of the Tang poems, 320 in total.
Get the content of a Tang poem
Visit https://so.gushiwen.org/shiwenv_c90ff9ea5a71.aspx to fetch the page for "Climb the Stork Tower", and parse it with bs4:
>>> html = requests.get('https://so.gushiwen.org/shiwenv_c90ff9ea5a71.aspx').text
>>> soup = BeautifulSoup(html, "lxml")
Get the title of the poem:
>>> cont = soup.select('.main3 .left .sons .cont')[0]
>>> title = cont.h1.text
>>> title
' Climb the stork tower '
Get the dynasty and the author:
>>> al = cont.p.select('a')
>>> dynasty = al[0].text
>>> dynasty
' Tang Dynasty '
>>> author = al[1].text
>>> author
' wang zhihuan '
Get the content of the poem:
>>> content = cont.select('.contson')[0].text
>>> content
'\n Day by day , The flow of the Yellow River into the sea . To see a thousand miles , Take it to the next level .\n'
>>> content.strip()
' Day by day , The flow of the Yellow River into the sea . To see a thousand miles , Take it to the next level .'
Now that we have the data, we can save it to ElasticSearch. In ElasticSearch terminology, this is called "indexing" the poetry data.
>>> import json
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> doc = {
    'title': title,
    'dynasty': dynasty,
    'author': author,
    'content': content.strip()
}
>>> ret = es.index(index='poetry', body=doc)
>>> print(json.dumps(ret, indent=4, separators=(',', ': '), ensure_ascii=False))
{
    "_index": "poetry",
    "_type": "_doc",
"_id": "bp3tNnABnXfYifgM_GRI",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
Using the returned id, look up the document just saved:
>>> ret = es.get(index='poetry', id='bp3tNnABnXfYifgM_GRI')
>>> print(json.dumps(ret, indent=4, separators=(',', ': '), ensure_ascii=False))
{
    "_index": "poetry",
    "_type": "_doc",
"_id": "bp3tNnABnXfYifgM_GRI",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"title": " Climb the stork tower ",
"dynasty": " Tang Dynasty ",
"author": " wang zhihuan ",
"content": " Day by day , The flow of the Yellow River into the sea . To see a thousand miles , Take it to the next level ."
}
}
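Indexing one document per request is fine for a few hundred poems, but the same documents could also be sent in batches. The sketch below only prepares the batch: `to_bulk_actions` and the two sample poems are hypothetical names and data for illustration, and the actual call to `helpers.bulk` from the `elasticsearch` package is left as a comment because it needs a running server.

```python
def to_bulk_actions(docs, index_name="poetry"):
    """Yield one bulk action per poem dictionary."""
    for doc in docs:
        # "_index" names the target index, "_source" carries the document body.
        yield {"_index": index_name, "_source": doc}


# Hypothetical sample documents with the same fields as the poetry index.
poems = [
    {"title": "Poem A", "dynasty": "Tang", "author": "Author A", "content": "..."},
    {"title": "Poem B", "dynasty": "Song", "author": "Author B", "content": "..."},
]

actions = list(to_bulk_actions(poems))
print(len(actions), actions[0]["_index"])

# With a running server, the actions could then be sent in one request:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch(), actions)
```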
The complete crawler is less than 100 lines of code, so you can copy it directly and save it as a local file. As long as the environment is installed correctly and the ElasticSearch service has started normally, the code can be run as-is.
#!/usr/bin/env python
# coding:utf-8
import lxml
import requests
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch, client

def create_index():
    es = Elasticsearch()
    ic = client.IndicesClient(es)
    # Check whether the index already exists
    if not ic.exists(index="poetry"):
        # Create the index
        doc = {
            "mappings": {
                "properties": {
                    "title": {"type": "keyword"},
                    "epigraph": {"type": "keyword"},
                    "dynasty": {"type": "keyword"},
                    "author": {"type": "keyword"},
                    "content": {"type": "text"}
                }
            }
        }
        ic.create(index='poetry', body=doc)

def get_poetry(list_url):
    es = Elasticsearch()
    # Fetch the list page
    html = requests.get(list_url).text
    soup = BeautifulSoup(html, "lxml")
    typecont = soup.find_all(attrs={"class": "typecont"})
    # Traverse the list
    for div in typecont:
        for ch in div.children:
            if ch.name == 'span':
                # Fetch the page of the poem
                print('get:', ch.a.text, ch.a.attrs['href'])
                html = requests.get('https://so.gushiwen.org' + ch.a.attrs['href']).text
                soup = BeautifulSoup(html, "lxml")
                cont = soup.select('.main3 .left .sons .cont')[0]
                # Title
                title = cont.h1.text
                # Ci tune name (cipai): the part of the title before '·', if any
                epigraph = ""
                if '·' in title:
                    epigraph = title[:title.index('·')]
                al = cont.p.select('a')
                # Dynasty
                dynasty = al[0].text
                # Author
                author = al[1].text
                # Content
                content = cont.select('.contson')[0].text.strip()
                # Index the data
                doc = {
                    "title": title,
                    "epigraph": epigraph,
                    "dynasty": dynasty,
                    "author": author,
                    "content": content
                }
                # ret = es.index(index='poetry', doc_type='poetry', body=doc)
                ret = es.index(index='poetry', body=doc)
                print(ret)

def main():
    create_index()
    get_poetry('https://so.gushiwen.org/gushi/tangshi.aspx')
    get_poetry('https://so.gushiwen.org/gushi/songsan.aspx')

if __name__ == '__main__':
    main()
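The crawler derives the tune name from the title by cutting at the first '·'. That step can be isolated into a small helper, sketched here with `str.partition`. The name `split_epigraph` and the sample titles are assumptions for illustration; unlike the script above, this version also strips surrounding whitespace and returns the remainder of the title.

```python
def split_epigraph(title, sep="·"):
    """Return (epigraph, rest); epigraph is "" when the title has no separator."""
    head, found, rest = title.partition(sep)
    if not found:
        # No separator: the whole title is the poem title, with no tune name.
        return "", title
    return head.strip(), rest.strip()


# Hypothetical sample titles in the same "tune name · subtitle" shape.
print(split_epigraph("Qingpingle · Village life"))
print(split_epigraph("Climb the Stork Tower"))
```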
With the poetry data in place, we can do some statistical analysis. Let's try it out and see how many poems have been indexed:
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch()
>>> ret = es.search(index='poetry')
>>> ret.keys() # the result is a dictionary whose parts can be inspected one by one
dict_keys(['took', 'timed_out', '_shards', 'hits'])
>>> ret['hits']['total']['value'] # the total number of documents found
613
>>> for item in ret['hits']['hits']: # some of the documents found (ElasticSearch returns only the first 10 by default)
    print(item['_source']['title'], '--', item['_source']['author'])
Early onset Baidicheng / Emperor Bai went down to Jiangling -- Li Bai
In the night, the city was sent to hear the flute -- Li Yi
Jason -- Li shangyin
Sui palace -- Li shangyin
The Jade Pool -- Li shangyin
Furong floor to send xin gradually -- wang changling
Boudoir resentment -- wang changling
Spring palace music -- wang changling
On September 9, I remember my brothers in Shandong -- Wang wei
Liangzhou CI -- Wang Han
In the result, ret['hits']['total']['value'] is the count we want, and ret['hits']['hits'] holds the documents that were found. ElasticSearch returns only the first 10 by default; you can use a parameter to specify how many to return. The following asks for 2 documents:
>>> ret = es.search(index='poetry', body={'size': 2})
>>> ret['hits']['total']['value'] # the total number of hits is still 613
613
>>> for item in ret['hits']['hits']: # 2 documents are returned
    print(item['_source']['title'], '--', item['_source']['author'])
    print(item['_source']['content'])
    print()
Early onset Baidicheng / Emperor Bai went down to Jiangling -- Li Bai
Leaving at dawn the White King crowned with rainbow cloud , I have sailed a thousand miles through Three Georges in a day . With monkeys' sad adieus the riverbanks are loud , My boat has left ten thought mountains far away .
In the night, the city was sent to hear the flute -- Li Yi
The sand is like snow in front of the beacon , The moon outside the city is like frost .( Return to Lefeng One work : Back to Lefeng ) I don't know where to blow the reed pipe , A night of conscription .
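To walk through all the hits rather than only the first few, the `size` parameter can be combined with `from`, which sets the offset of the first hit to return. The helper below is only a sketch that builds the request bodies; `page_bodies` is a hypothetical name, and the page size of 250 is an arbitrary choice. By default Elasticsearch limits `from + size` to 10,000, which is far more than this library needs.

```python
def page_bodies(total, page_size=100):
    """Yield one search body per page, together covering `total` hits."""
    for offset in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield {"from": offset, "size": min(page_size, total - offset)}


bodies = list(page_bodies(613, page_size=250))
for body in bodies:
    print(body)
```

Each body could then be passed to es.search(index='poetry', body=body) in turn.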
So how many of them were written by Li Bai?
>>> condition = {
    "query": {
        "match": {"author": "Li Bai"}
    },
    "size": 0
}
>>> ret = es.search(index='poetry', body=condition)
>>> ret['hits']['total']['value']
37
Wow, as many as 37 poems!
The Tang and Song dynasties were the heyday of Chinese poetry: the masterpieces are as brilliant as the stars, and the stars of poetry shine brightly. Among these 613 works, whose are the most numerous? Is it Li Bai? Or the poet Du Fu? Or Li Qingzhao and Su Dongpo, who pioneered a new style of ci? Let's have them battle it out.
>>> ret = es.search(index='poetry', body={
    'size': 0,
    'aggs': {
        'authors': {
            'terms': {'field': 'author'}
        }
    }
})
>>> for item in ret['aggregations']['authors']['buckets']:
    print(item['key'], item['doc_count'])
Du Fu 39
Li Bai 37
Wang wei 29
Li shangyin 24
Su shi 16
Xin qiji 16
meng haoran 15
Li qingzhao 14
Zhou bangyan 12
wei yingwu 12
Congratulations to the contestants above on making the top 10.
One characteristic of ci is the tune name (cipai), and behind almost every tune name there is a story. In our poetry library, which tune names are most popular among poets? With a slight change to the search above, we'll soon know the result. Here I take the top 11 results, because some poems have no tune name and the empty value takes up one slot.
>>> ret = es.search(index='poetry', body={
    'size': 0,
    'aggs': {
        'epigraphs': {
            'terms': {'field': 'epigraph', 'size': 11}
        }
    }
})
>>> for item in ret['aggregations']['epigraphs']['buckets']:
    print(item['key'], item['doc_count'])
273
Qingpingle 11
USES 9
recent 9
Partridge day 9
The cast operator 8
Xijiang month 8
#NAME? 7
Congratulations to the bridegroom 7
Sauvignon blanc 7
Linjiang fairy 6
The most popular tune name is Qingpingle, with 11 poems.
When the index was created, some readers will have noticed that the type of the content field differs from the others: everything else is keyword, while content is text. That is because ElasticSearch has an analysis (word segmentation) mechanism. A keyword field is treated as a single term, whereas a text field is split into terms: with the default analyzer, each English word is a term and each Chinese character is a term.
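The difference between the two field types can be imitated in a few lines of Python. This is a rough sketch, not the real analyzer: `analyze_keyword` and `analyze_text` are hypothetical names, and the regular expression only covers lowercase Latin words, digits, and the basic CJK character range.

```python
import re


def analyze_keyword(value):
    """A keyword field keeps the whole value as one term."""
    return [value]


def analyze_text(value):
    """Rough imitation of text analysis: Latin words, or single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", value.lower())


print(analyze_keyword("Li Bai"))      # one term, spaces included
print(analyze_text("Hello, World"))   # one term per word
print(analyze_text("白日依山尽"))      # one term per Chinese character
```

This is why aggregating on author yields whole names, while aggregating on content yields single characters.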
For text fields, ElasticSearch uses the fielddata cache. To aggregate on a field like this, first enable fielddata:
>>> es.indices.put_mapping(index='poetry', body={
    "properties": {
        "content": {"type": "text", "fielddata": True}
    }
})
{'acknowledged': True}
Then you can aggregate:
>>> ret = es.search(index='poetry', body={
    'size': 0,
    'aggs': {
        'content': {
            'terms': {'field': 'content', 'size': 20}
        }
    }
})
>>> for item in ret['aggregations']['content']['buckets']:
    print(item['key'], item['doc_count'])
One 288
people 287
No 280
wind 265
mountain 217
nothing 208
month 205
flowers 205
God 185
Come on 184
In the spring 182
when 178
cloud 177
Japan 171
On 168
What 166
water 166
night 163
Yes 155
rain 146
Write out the TOP 20 characters by frequency in order, and the result almost reads like a five-character quatrain:
A man is not a mountain , No moon, no sky .
It's cloudy in spring , It rains at night .
Strange? Not strange at all! Every Chinese character is in itself a picture and a story. Let's welcome the TOP 20.
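One detail about the numbers above: the doc_count of a terms aggregation is the number of documents that contain a term, not the number of times the term occurs. The sketch below reproduces that way of counting in plain Python. The function name `doc_frequency` is hypothetical, the first two sample lines are well-known verses, and the third is a made-up string chosen to show a repeated character.

```python
from collections import Counter


def doc_frequency(contents):
    """Count, for each Chinese character, how many texts contain it at least once."""
    counts = Counter()
    for text in contents:
        # set() makes a character count once per text, however often it repeats.
        counts.update({ch for ch in text if "\u4e00" <= ch <= "\u9fff"})
    return counts


sample = ["白日依山尽,黄河入海流。", "日照香炉生紫烟", "山山水水"]
freq = doc_frequency(sample)
print(freq["山"], freq["日"], freq["水"])
```

Here 山 appears three times in total but in only two of the texts, so it is counted twice.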
The most exciting part of the Chinese Poetry Conference is the "flying flower" round. In the comeback round of the eighth episode, the flying-flower contest against the hundred-member team was impressive. The character that time was "jiang" (river), so let's see which poems contain the character "jiang".
The examples above all used ElasticSearch's aggregation features; this example uses full-text search.
>>> ret = es.search(index='poetry', body={
    "query": {
        "match": {"content": " jiang "}
    },
    "highlight": {
        "fields": {"content": {}}
    }
})
>>> ret['hits']['total']['value']
138
>>> for item in ret['hits']['hits']:
    print(item['_source']['title'], item['_source']['author'])
    print(item['highlight']['content'])
    print()
The highlight option, "highlight": {"fields": {"content": {}}}, was added to the body so that the keyword is highlighted. There are 138 poems containing "jiang" in total; only the first 10 are shown here.
Yi jiangnan Bai Juyi
jiang Nan Hao , The scenery used to be familiar with . sunrise jiang Flower is better than fire , Spring comes jiang The water is as green as blue . Can not remember jiang south ?
Three poems in Jiangnan Bai Juyi
jiang Nan Hao , The scenery used to be familiar with . sunrise jiang Flower is better than fire , Spring comes jiang The water is as green as blue . Can not remember jiang south ?
jiang Nanyi , The most memorable is Hangzhou . Looking for Guizi in the middle of the mountain temple , Looking at the tide on the pillow of the County Pavilion . When will you revisit !
jiang Nanyi , Second, I remember Wu palace . A cup of spring bamboo leaves with Wu wine , Wuwa double dance drunk lotus . Meet again sooner or later !
seek distraction in writing Du Mu
Down and out jiang Nanzai wine shop , Delicate in the waist, light in the palm .( jiang south One work : jiang lake ; Slim One work : A gut break ) A dream of Yangzhou in ten years , Win the brothel .
Long dry line, · Home near Jiujiang water Cui Hao
Family nine jiang water , Come and go nine jiang Side . It's the same as Changgan , I don't know each other .
The cast operator · I live at the head of the Yangtze River Ezann lee
I live for a long time jiang head , You live long jiang tail . Every day I miss you, I don't see you , Drink together for a long time jiang water .
When will the water rest , When has this hatred . I only wish to keep my heart like my heart , Love will not be lost .
Sauvignon blanc · northern hills green Lin Yun
( Who knows the feeling of parting One work : There is a feeling of separation ) You are full of tears , I have tears in my eyes , The ribbons are not concentric , jiang The edge tide has leveled off .( jiang edge One work : jiang head )’]
Frost dawn corner · I park at night on the river Yellow machine
cold jiang Overnight . utter a long and loud cry jiang Song of . The underwater ichthyosaurs startled , The wind is rolling 、 Waves turn the house . Poetry is not enough . The wine continued to break . Don't ask about the rise and fall of grass , Tears of fame 、 I want to win .
#NAME? · Everyone says Jiangnan is good Weizhuang
Everybody says jiang Nan Hao , Visitors only like jiang Nan Lao . Spring water is green in the sky , Painting boats sleep in the rain . A man on the other side of the river is like a moon , White wrists coagulate frost and snow . Don't go home before you are old , You have to break your heart to return home .
Picking mulberry seeds · Hate you not like jianglouyue Lu benzhong
Hate you is not like jiang Lou Yue , North and south, East and West , North and south, East and West , Nothing but company . Hate you, but like jiang Lou Yue , I'll pay you back when I'm full , I'll pay you back when I'm full , When is the reunion ?
Qingpingle · Wang's nunnery in Boshan Xin qiji
In my life, I lived in the northern part of the Great Wall jiang south , Come back and have beautiful hair . Cloth is sleeping in autumn night , In front of you jiang mountain .
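By default, Elasticsearch's highlighter wraps each matched term in <em> and </em> tags. The effect can be imitated with a simple string replacement, sketched below; `highlight` is a hypothetical helper for illustration and does none of the fragment selection that the real highlighter performs.

```python
def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap every occurrence of `term` in the given tags."""
    return text.replace(term, f"{pre}{term}{post}")


print(highlight("春来江水绿如蓝", "江"))
```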
Now for a "super flying flower": which poems contain both "jiang" and "water"?
>>> condition = {
    "query": {
        "match": {
            "content": {
                "query": " jiang water ",
                "operator": "and"
            }
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
>>> ret = es.search(index='poetry', body=condition)
>>> ret['hits']['total']['value']
52
>>> for item in ret['hits']['hits']:
    print(item['_source']['title'], item['_source']['author'])
    print(item['highlight']['content'])
    print()
In our poetry library there are 52 poems containing both "jiang" and "water"; only the first 10 are shown here.
Yi jiangnan Bai Juyi
jiang Nan Hao , The scenery used to be familiar with . sunrise jiang Flower is better than fire , Spring comes jiang water Green as blue . Can not remember jiang south ?
The cast operator · I live at the head of the Yangtze River Ezann lee
I live for a long time jiang head , You live long jiang tail . Every day I miss you, I don't see you , Drink together for a long time jiang water . this water How long will take , When has this hatred . I only wish to keep my heart like my heart , Love will not be lost .
Long dry line, · Home near Jiujiang water Cui Hao
Family nine jiang water , Come and go nine jiang Side . It's the same as Changgan , I don't know each other .
Bamboo words · The mountain peach is covered with red flowers Liu yuxi
The mountain peach is covered with red flowers , Shu jiang In the spring water Beat the mountain stream . Red flowers are easy to fade like Langyi , water Flow infinite like Nong sorrow .
Three poems in Jiangnan Bai Juyi
jiang Nan Hao , The scenery used to be familiar with . sunrise jiang Flower is better than fire , Spring comes jiang water Green as blue . Can not remember jiang south ?\n jiang Nanyi , The most memorable is Hangzhou . Looking for Guizi in the middle of the mountain temple , Looking at the tide on the pillow of the County Pavilion . When will you revisit !\n jiang Nanyi , Second, I remember Wu palace . A cup of spring bamboo leaves with Wu wine , Wuwa double dance drunk lotus . Meet again sooner or later !
Frost dawn corner · I park at night on the river Yellow machine
cold jiang Overnight . utter a long and loud cry jiang Song of . water Ichthyosaurus startled , The wind is rolling 、 Waves turn the house . Poetry is not enough . The wine continued to break . Don't ask about the rise and fall of grass , Tears of fame 、 I want to win .
#NAME? · Everyone says Jiangnan is good Weizhuang
Everybody says jiang Nan Hao , Visitors only like jiang Nan Lao . In the spring water Blue in the sky , Painting boats sleep in the rain . A man on the other side of the river is like a moon , White wrists coagulate frost and snow . Don't go home before you are old , You have to break your heart to return home .
Looking at the jiangnan · Comb and wash wen tingyun
Comb and wash , Looking at jiang floor . A thousand sails is not , Twilight pulse water long . Broken intestine, white Pingzhou .
To judge hanchuo of Yangzhou Du Mu
The green mountains are hidden water Far away , Autumn is over jiang The South grass has not withered . 24 bridge moon night , Where do jade people teach flute playing ?
Park in Qinhuai Du Mu
Smoke cage cold water Moon cage sand , Night park Qinhuai near the restaurant . Business women don't know how to die , Partition jiang Still singing backyard flowers .
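The operator setting decides how the terms of a match query are combined: the default, "or", accepts a document that contains any of the terms, while "and" requires all of them. The following sketch imitates that logic for single-character terms; `matches` is a hypothetical helper, and each sample string is a single verse rather than a whole poem.

```python
def matches(content, query, operator="or"):
    """Check whether `content` contains any ("or") or all ("and") characters of `query`."""
    terms = [ch for ch in query if not ch.isspace()]
    hits = [term in content for term in terms]
    return all(hits) if operator == "and" else any(hits)


print(matches("春来江水绿如蓝", "江水", operator="and"))  # contains both characters
print(matches("烟笼寒水月笼沙", "江水", operator="and"))  # contains only one of them
print(matches("烟笼寒水月笼沙", "江水", operator="or"))
```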
With this weapon in hand, are you ready to sign up for the next Poetry Conference? Then hurry up and register!