Business Data Analysis from Beginner to Employment (9): Python Network Data Acquisition

CuterCorley 2021-04-05 22:20:11



Preface

This article covers one of the most common applications of Python: network data acquisition, i.e. web crawling. It first introduces the basics of web pages and networking, to lay the foundation for extracting data from web pages; it then uses two cases to show different ways of obtaining and processing data from the network, to deepen the understanding of Python crawlers and data processing.

I. Web and Network Basics

1. Data sources

There are many sources of data: it can be obtained from a database, from files, from the Internet, or collected directly as raw data.

Database data comes from many systems, such as RDBMS (relational database management systems), which hold structured data; typical examples include MySQL, PostgreSQL, SQL Server, Oracle and SQLite.

Common data file formats include (a short pandas reading sketch follows this list):

  • Excel: the most common, and the most problematic.
  • Delimiter-separated: the most common and most popular, including comma-separated (csv), tab-separated (tsv), |-separated, etc. Typical problems are how data fields are delimited and encoding issues. [figure: csv data file]
  • Fixed width: every column has a fixed length. Problem: columns can get too wide. [figure: fixed-width data file]
  • JSON: short for JavaScript Object Notation, a semi-structured format. The attribute name is to the left of the colon and its value to the right, attributes are separated by commas, and multi-valued attributes become nested values. [figure: JSON data file]
  • XML: short for Extensible Markup Language, also semi-structured and one of the most common data exchange formats. [figure: XML data file]
  • Parquet: columnar storage, commonly used with Spark. [figure: Parquet data file]
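As a quick illustration, here is a minimal sketch of reading several of these formats with pandas; the file names are hypothetical placeholders.

import pandas as pd

# CSV / TSV: delimiter-separated text files
df_csv = pd.read_csv('data.csv')                 # comma-separated
df_tsv = pd.read_csv('data.tsv', sep='\t')       # tab-separated

# Fixed-width text file: column widths must be known in advance
df_fwf = pd.read_fwf('data.txt', widths=[10, 20, 8])

# Semi-structured format
df_json = pd.read_json('data.json')

# Columnar format (requires pyarrow or fastparquet to be installed)
df_parquet = pd.read_parquet('data.parquet')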

Network data: mainly HTML, which is unstructured data. [figure: HTML web data]

2. Network basics

Network data transmission usually consists of a request first and then a response, possibly forwarded through many layers in between. The OSI model from computer networking theory is shown below. [figure: OSI model]

What happens at each layer may look like the following. [figure: network layers] Most websites today follow the client/server model: the client is usually the browser you use, and the server is the platform that stores the site's resources and handles network requests.

The general request flow is: (1) the user enters a URL; (2) the client sends a Request; (3) the server receives the Request; (4) the server returns a Response; (5) the client receives and parses the Response.

For a URL such as https://127.0.0.1:8000/hello, https is the protocol, 127.0.0.1 is the host, 8000 is the port and /hello is the path; together they pinpoint the resource to be accessed.
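A minimal sketch of taking such a URL apart with Python's standard urllib.parse; the query string is added here purely for illustration.

from urllib.parse import urlparse

parts = urlparse('https://127.0.0.1:8000/hello?q=python')
print(parts.scheme)    # 'https'      -> protocol
print(parts.hostname)  # '127.0.0.1'  -> host
print(parts.port)      # 8000         -> port
print(parts.path)      # '/hello'     -> path
print(parts.query)     # 'q=python'   -> query string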

The basic operations of visiting a website with a browser are shown below. [figure: basic browser usage]

You can see that when searching and filtering, the link changes as well, so the browser requests different content.

You can also use the browser's developer tools to inspect page elements, network requests, styles and so on. [figure: browser developer tools]

You can see that the content of the page is organized by many tags and styles; this is the HTML code. At the same time, both the request and the response carry many parameters.

Going further with the developer tools: [figure: browser developer tools, advanced]

You can see that you can simulate different devices through the settings, in which case the User-Agent parameter changes accordingly.

An HTTP request includes the request method, request path and HTTP version; an HTTP response includes the HTTP version, status code and response body.
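For intuition, a minimal sketch of inspecting these parts with the requests library; httpbin.org is a public testing endpoint used here only as an example.

import requests

resp = requests.get('https://httpbin.org/get')
print(resp.request.method, resp.request.path_url)   # request method and path, e.g. GET /get
print(resp.status_code, resp.reason)                # response status code, e.g. 200 OK
print(resp.headers.get('Content-Type'))             # one of the response headers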

3. HTML, CSS and ways of scraping web data

A web page is composed of HTML code, and the information is generally contained in that code; CSS consists of style files and has little effect on data extraction; JavaScript code can perform more complex logic and may have a greater impact on data acquisition.

A simple piece of HTML code looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
    <h1>Home page</h1>
    <form action="" method="post">
        <table>
            <tr>
                <td><input type="text" name="name"></td>
                <td><input type="submit" value="Submit"></td>
            </tr>
        </table>
    </form>
</body>
</html>

There are two main ways to extract data from web pages:

  • Line-by-line scanning: includes simple string processing and regular expressions. A regular expression is a special sequence of characters that makes it easy to check whether a string matches a pattern; Python's re module gives Python full regular-expression support. The principle of regular expressions is illustrated below. [figure: regular expression principle]
 Please refer to [https://www.runoob.com/python/python-reg-expressions.html](https://www.runoob.com/python/python-reg-expressions.html).
  • Tree model: use the tree structure of HTML to get the information in it; libraries such as BeautifulSoup and lxml support this. The process of requesting HTML over the network and displaying its tree structure is illustrated below. [figure: fetching HTML and its tree structure]

For example, consider the following sample code: [figure: HTML demo]

In it, the date value is Sep 13, 2014 and the message value is i didnt know that.

If you extract these two values with regular expressions, the patterns are <h2>(.+)<\/h2> and <\/span>(.+)<\/li>; if you use a tree model such as BeautifulSoup to extract them, the following structure is built: [figure: HTML demo tree]

So the data can be reached as div.h2.text and div.ul.li.text.
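A minimal sketch of both approaches on a hand-written snippet that mimics the demo above; the HTML string is an assumption reconstructed from the description.

import re
from bs4 import BeautifulSoup

html = '''<div>
  <h2>Sep 13, 2014</h2>
  <ul>
    <li><span>comment:</span>i didnt know that.</li>
  </ul>
</div>'''

# Approach 1: regular expressions, scanning the raw text
date_re = re.search(r'<h2>(.+)</h2>', html).group(1)
message_re = re.search(r'</span>(.+)</li>', html).group(1)

# Approach 2: tree model with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
date_bs = soup.div.h2.text
message_bs = soup.div.ul.li.text   # note: includes the <span> text as well

print(date_re, '|', message_re)
print(date_bs, '|', message_bs)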

II. BOSS Zhipin Data Scraping Case

1. Website Preview

Taking BOSS Zhipin (https://www.zhipin.com/) as an example, we implement a fairly complete network data scraping workflow.

A preview of the website: [figure: BOSS Zhipin page preview]

You can see that when you select an area of HTML code in the inspector, the corresponding part of the page is highlighted, and the data we need is in that HTML code. Each job result sits in a div with class job-primary, under an li, under a ul, under the div with class job-list; there are as many li elements as there are job postings. One such li looks like this:

<li> <div class="job-primary"> <div class="info-primary"> <div class="primary-wrapper"> <div class="primary-box" href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-itemid="1" data-lid="nlp-aqyTkPDQjXA.search.1" data-jobid="102127880" data-index="0" ka="search_list_1" target="_blank"> <div class="job-title"> <span class="job-name"><a href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html" title=" Data analysis " target="_blank" ka="search_list_jname_1" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-itemid="1" data-lid="nlp-aqyTkPDQjXA.search.1" data-jobid="102127880" data-index="0"> Data analysis </a></span> <span class="job-area-wrapper"> <span class="job-area"> Beijing · Chaoyang District · Bird nest </span> </span> <span class="job-pub-time"></span> </div> <div class="job-limit clearfix"> <span class="red">50-80K·14 pay </span> <p>3-5 year <em class="vline"></em> Undergraduate </p> <div class="info-publis"> <h3 class="name"><img class="icon-chat" src="https://z.zhipin.com/web/geek/resource/icon-chat-v2.png"> Mr. Cao <em class="vline"></em> data mining </h3> </div> <button class="btn btn-startchat" href="javascript:;" data-url="/wapi/zpgeek/friend/add.json?jobId=7271f2f28169375a1nR42t-6GFpQ&amp;lid=nlp-aqyTkPDQjXA.search.1" redirect-url="/web/geek/chat?id=495f7159c0c8664a1nFz39m8EA~~"> <img class="icon-chat icon-chat-hover" src="https://z.zhipin.com/web/geek/resource/icon-chat-hover-v2.png" alt=""> <span> Communicate immediately </span> </button> </div> <div class="info-detail" style="top: 0px;"></div> </div> </div> <div class="info-company"> <div class="company-text"> <h3 class="name"><a href="/gongsi/33e052361693f8371nF-3d25.html" title=" Jingdong group recruitment " ka="search_list_company_1_custompage" target="_blank"> Jingdong group </a></h3> <p><a href="/i100001/" class="false-link" target="_blank" ka="search_list_company_industry_1_custompage" title=" E-commerce industry recruitment information "> Electronic Commerce </a><em class="vline"></em> Listed <em class="vline"></em>10000 More people </p> </div> <a href="/gongsi/33e052361693f8371nF-3d25.html" ka="search_list_company_1_custompage_logo" target="_blank"><img class="company-logo" src="https://img.bosszhipin.com/beijin/mcs/bar/20191129/3cdf5ba2149e309b38868b62ae9c22cabe1bd4a3bd2a63f070bdbdada9aad826.jpg?x-oss-process=image/resize,w_100,limit_0" alt=""></a> </div> </div> <div class="info-append clearfix"> <div class="tags"> <span class="tag-item">Excel</span> <span class="tag-item">SPSS</span> <span class="tag-item">Python</span> <span class="tag-item"> data mining </span> <span class="tag-item"> Data warehouse </span> </div> <div class="info-desc"> Supplementary medical insurance , Holiday benefits , Regular check-up , Annual bonus , Meal supplement , Traffic subsidy , Free shuttle bus , Bag eating , the stock option , Staff travel , Snack afternoon tea , Five social insurance and one housing fund , Paid annual leave </div> </div> </div></li>

As you can see, all the information we need is in this code.

You can also view the job description on the page, as follows: [figure: BOSS Zhipin job description]

The complete HTML code obtained is as follows:

<li> <div class="job-primary"> <div class="info-primary"> <div class="primary-wrapper"> <div class="primary-box" href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html?ka=search_list_1" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-itemid="1" data-lid="nlp-aqyTkPDQjXA.search.1" data-jobid="102127880" data-index="0" ka="search_list_1" target="_blank"> <div class="job-title"> <span class="job-name"><a href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html" title=" Data analysis " target="_blank" ka="search_list_jname_1" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-itemid="1" data-lid="nlp-aqyTkPDQjXA.search.1" data-jobid="102127880" data-index="0"> Data analysis </a></span> <span class="job-area-wrapper"> <span class="job-area"> Beijing · Chaoyang District · Bird nest </span> </span> <span class="job-pub-time"></span> </div> <div class="job-limit clearfix"> <span class="red">50-80K·14 pay </span> <p>3-5 year <em class="vline"></em> Undergraduate </p> <div class="info-publis"> <h3 class="name"><img class="icon-chat" src="https://z.zhipin.com/web/geek/resource/icon-chat-v2.png"> Mr. Cao <em class="vline"></em> data mining </h3> </div> <button class="btn btn-startchat" href="javascript:;" data-url="/wapi/zpgeek/friend/add.json?jobId=7271f2f28169375a1nR42t-6GFpQ&amp;lid=nlp-aqyTkPDQjXA.search.1" redirect-url="/web/geek/chat?id=495f7159c0c8664a1nFz39m8EA~~"> <img class="icon-chat icon-chat-hover" src="https://z.zhipin.com/web/geek/resource/icon-chat-hover-v2.png" alt=""> <span> Communicate immediately </span> </button> </div> <div class="info-detail" style="top: -307.1px;"> <div class="info-detail-top"> <div class="detail-top-left"> <div class="detail-top-title"> Data analysis </div> <div class="detail-top-text"> Jingdong group · data mining : Mr. Cao </div> <a href="javascript:;" ka="popjob_interest_tosign_7271f2f28169375a1nR42t-6GFpQ" data-url="/geek/tag/jobtagupdate.json?jobId=7271f2f28169375a1nR42t-6GFpQ&amp;expectId=&amp;tag=4&amp;lid=nlp-aqyTkPDQjXA.search.1" class="link-like " job-id="495f7159c0c8664a1nFz39m8EA~~"> Interested in </a> </div> <div class="detail-top-right detail-top-right2"> <div class="code-des"> scan , At any time with BOSS Talk </div> <div class="code-icon"></div> </div> </div> <div class="detail-bottom"> <div class="detail-bottom-title"> Job description </div> <div class="detail-bottom-text"> Job description <br>1、 Analysis of user portraits , Through the analysis of massive data mining , Extract user characteristics 、 Behavior trajectory ;<br>2、 Participate in the research and development of algorithms , Improve the performance and business indicators of the algorithm system ;<br>3、 carding 、 Connect temporary data requirements of different business lines , And abstract customized data products ; <br>4、 Combined with the needs of the project , Comprehensive utilization of Jingdong Mall data , Build a customized index model ;<br>5、 Responsible for providing data analysis support for product operation , Such as product analysis 、 User analysis 、 Operation analysis, etc , And according to the analysis results, the paper puts forward some practical suggestions ;<br>6、 Actively promote cross departmental cooperation , Cooperate with all kinds of projects to implement quality and quantity as scheduled .<br> Post requirements <br>1、 Bachelor degree or above , statistical 、 data 、 Computer related major is preferred ;<br>2、 At least two years working experience in Internet data analysis , Experience in e-commerce companies is preferred , Experience in building composite index is 
preferred ;<br>3、 Able to process data independently , Write a special analysis report , Master the common classification 、 clustering 、 forecast 、 Association rules 、 Sequential patterns and other mining algorithms ;<br>4、 Strong learning ability , Good communication skills , Can fully understand the business logic and purpose , There are clear ideas and methods for data analysis ;<br>5、 High data sensitivity , Be good at finding problems from data , And can give a certain solution ;<br>6、 Master SQL、EXCEL, be familiar with SPSS、SAS、Clementine、R、python Any kind of professional data analysis tool , Yes Hadoop、Hive、Spark Use experience is preferred .<br>7、 There is a return 、 clustering 、 classification 、 neural network 、NLP、 Optimization theory and other related theoretical basis and project application is preferred </div> </div> </div> </div> </div> <div class="info-company"> <div class="company-text"> <h3 class="name"><a href="/gongsi/33e052361693f8371nF-3d25.html" title=" Jingdong group recruitment " ka="search_list_company_1_custompage" target="_blank"> Jingdong group </a></h3> <p><a href="/i100001/" class="false-link" target="_blank" ka="search_list_company_industry_1_custompage" title=" E-commerce industry recruitment information "> Electronic Commerce </a><em class="vline"></em> Listed <em class="vline"></em>10000 More people </p> </div> <a href="/gongsi/33e052361693f8371nF-3d25.html" ka="search_list_company_1_custompage_logo" target="_blank"><img class="company-logo" src="https://img.bosszhipin.com/beijin/mcs/bar/20191129/3cdf5ba2149e309b38868b62ae9c22cabe1bd4a3bd2a63f070bdbdada9aad826.jpg?x-oss-process=image/resize,w_100,limit_0" alt=""></a> </div> </div> <div class="info-append clearfix"> <div class="tags"> <span class="tag-item">Excel</span> <span class="tag-item">SPSS</span> <span class="tag-item">Python</span> <span class="tag-item"> data mining </span> <span class="tag-item"> Data warehouse </span> </div> <div class="info-desc"> Supplementary medical insurance , Holiday benefits , Regular check-up , Annual bonus , Meal supplement , Traffic subsidy , Free shuttle bus , Bag eating , the stock option , Staff travel , Snack afternoon tea , Five social insurance and one housing fund , Paid annual leave </div> </div> </div></li>

You can also visit the job details page, as follows: [figure: BOSS Zhipin job list]

2. Data acquisition

First import the required libraries:

## Import the necessary packages
from bs4 import BeautifulSoup as bs
import urllib
import re
import pandas as pd
import requests
import time      # used later to pause between requests
import random    # used later to randomize the pauses

To get the ipynb and data files for this section, you can join QQ group Python Geek Tribe 963624318 and download them from the group folder Business Data Analysis from Beginner to Employment.

Use the requests library to simulate a request:

response = requests.get('https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&city=100010000&industry=&position=')

View the returned content, as follows:

display(response.content[:300], response.text, response.encoding)

Output :

b'<!DOCTYPE html>\n<html>\n <head>\n <meta charset="utf-8" />\n <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />\n <title>\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90''<!DOCTYPE html>\n<html>\n <head>\n <meta charset="utf-8" />\n <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" />\n <title>请ç¨\x8då\x90\x8e</title>\n <style>\n html,\n body {\n margin: 0;\n width: 100%;\n height: 100%;\n }\n @keyframes bossLoading {\n 0% {\n transform: translate3d(0, 0, 0);\n }\n 50% {\n transform: translate3d(0, -10px, 0);\n }\n }\n .data-tips {\n text-align: center;\n height: 100%;\n position: relative;\n background: #fff;\n top: 50%;\n margin-top: -37px;\n }\n .data-tips .boss-loading {\n width: 100%;\n }\n .data-tips .boss-loading p {\n margin-top: 10px;\n color: #9fa3b0;\n }\n .boss-loading .component-b,\n .boss-loading .component-s1,\n .boss-loading .component-o,\n .boss-loading .component-s2 {\n display: inline-block;\n width: 40px;\n height: 42px;\n line-height: 42px;\n font-family: Helvetica Neue,Helvetica,Arial,Hiragino Sans GB,Hiragino Sans GB W3,Microsoft YaHei UI,Microsoft YaHei,WenQuanYi Micro Hei,sans-serif;\n font-weight: bolder;\n font-size: 40px;\n color: #eceef2;\n vertical-align: top;\n -webkit-animation-fill-mode: both;\n -webkit-animation: bossLoading 0.6s infinite linear alternate;\n -moz-animation: bossLoading 0.6s infinite linear alternate;\n animation: bossLoading 0.6s infinite linear alternate;\n }\n .boss-loading .component-o {\n -webkit-animation-delay: 0.1s;\n -moz-animation-delay: 0.1s;\n animation-delay: 0.1s;\n }\n .boss-loading .component-s1 {\n -webkit-animation-delay: 0.2s;\n -moz-animation-delay: 0.2s;\n animation-delay: 0.2s;\n }\n .boss-loading .component-s2 {\n -webkit-animation-delay: 0.3s;\n -moz-animation-delay: 0.3s;\n animation-delay: 0.3s;\n }\n </style>\n </head>\n <body>\n <div class="data-tips">\n <div class="tip-inner">\n <div class="boss-loading">\n <span class="component-b">B</span><span class="component-o">O</span><span class="component-s1">S</span><span class="component-s2">S</span>\n <p class="gray">æ\xad£å\x9c¨å\x8a\xa0è½½ä¸\xad...</p>\n </div>\n </div>\n </div>\n <script>\n var securityPageName="securityCheck";!function(){var a=new Image;a.src="https://t.zhipin.com/f.gif?pk="+securityPageName+"&r="+document.referrer}(),function(){function e(c){var l,m,n,o,p,q,r,e=function(){var a=location.hostname;return"localhost"===a||/^(\\d+\\.){3}\\d+$/.test(a)?a:"."+a.split(".").slice(-2).join(".")}(),f=function(a,b){var f=document.createElement("script");f.setAttribute("type","text/javascript"),f.setAttribute("charset","UTF-8"),f.onload=f.onreadystatechange=function(){d&&"loaded"!=this.readyState&&"complete"!=this.readyState||b()},f.setAttribute("src",a),"IFRAME"!=c.tagName?c.appendChild(f):c.contentDocument?c.contentDocument.body?c.contentDocument.body.appendChild(f):c.contentDocument.documentElement.appendChild(f):c.document&&(c.document.body?c.document.body.appendChild(f):c.document.documentElement.appendChild(f))},g=function(a){var b=new RegExp("(^|&)"+a+"=([^&]*)(&|$)"),c=window.location.search.substr(1).match(b);return null!=c?unescape(c[2]):null},h={get:function(a){var b,c=new RegExp("(^| 
)"+a+"=([^;]*)(;|$)");return(b=document.cookie.match(c))?unescape(b[2]):null},set:function(a,b,c,d,e){var g,f=a+"="+encodeURIComponent(b);c&&(g=new Date(c).toGMTString(),f+=";expires="+g),f=d?f+";domain="+d:f,f=e?f+";path="+e:f,document.cookie=f}},i=function(a){window.location.replace(a)},j=function(a,c){c||a.indexOf("security-check.html")>-1?i(c):i(a);var d=new Image;d.src="https://t.zhipin.com/f.gif?pk="+securityPageName+"&ca=securityCheckJump_"+Math.round(((new Date).getTime()-b)/1e3)+"&r="+document.referrer};window.location.href,l=g("seed")||"",m=g("ts"),n=g("name"),o=g("callbackUrl"),p=g("srcReferer")||"","null"!==n&&l&&n&&o||(q=new Image,q.src="https://t.zhipin.com/f.gif?pk="+securityPageName+"&ca=securityCheckUrlFile&url="+window.location.href),l&&m&&n&&(r=setInterval(function(){a++,a>5&&clearInterval(r);var c=new Image;c.src="https://t.zhipin.com/f.gif?pk="+securityPageName+"&ca=securityCheckTimer_"+Math.round(((new Date).getTime()-b)/1e3)+"&r="+document.referrer},1e4),f("security-js/"+n+".js",function(){var n,a=(new Date).getTime()+2304e5,d="",f={},g=window.ABC||c.contentWindow.ABC;try{d=(new g).z(l,parseInt(m)+1e3*60*(480+(new Date).getTimezoneOffset()))}catch(k){}d&&o?(h.set("__zp_stoken__",d,a,e,"/"),"undefined"!=typeof window.wst&&"function"==typeof wst.postMessage&&(f={name:"setWKCookie",params:{url:e,name:"__zp_stoken__",value:encodeURIComponent(d),expiredate:a,path:"/"}},window.wst.postMessage(JSON.stringify(f))),j(p,o)):(n=new Image,n.src="https://t.zhipin.com/f.gif?pk="+securityPageName+"&ca=securityCheckNoCode_"+Math.round(((new Date).getTime()-b)/1e3)+"&r="+document.referrer,i("/"))}))}function j(a){if(!f&&!g&&document.addEventListener)return document.addEventListener("DOMContentLoaded",a,!1);if(!(h.push(a)>1))if(f)!function(){try{document.documentElement.doScroll("left"),i()}catch(a){setTimeout(arguments.callee,0)}}();else if(g)var b=setInterval(function(){/^(loaded|complete)$/.test(document.readyState)&&(clearInterval(b),i())},0)}var d,f,g,h,i,a=0,b=(new Date).getTime(),c=window.navigator.userAgent;c.indexOf("MSIE ")>-1&&(d=!0),f=!(!window.attachEvent||window.opera),g=/webkit\\/(\\d+)/i.test(navigator.userAgent)&&RegExp.$1<525,h=[],i=function(){for(var a=0;a<h.length;a++)h[a]()},j(function(){var b,a=window.navigator.userAgent.toLowerCase();return"micromessenger"==a.match(/micromessenger/i)||"wkwebview"==a.match(/wkwebview/i)?(e(document.getElementsByTagName("head").item(0)),void 0):(b=document.createElement("iframe"),b.style.height=0,b.style.width=0,b.style.margin=0,b.style.padding=0,b.style.border="0 none",b.name="zhipinFrame",b.src="about:blank",b.attachEvent?b.attachEvent("onload",function(){e(b)}):b.onload=function(){e(b)},(document.body||document.documentElement).appendChild(b),void 0)})}();\n\n var _hmt = _hmt || [];\n (function() {\n var hm = document.createElement("script");\n hm.src = "https://hm.baidu.com/hm.js?194df3105ad7148dcf2b98a91b5e727a";\n var s = document.getElementsByTagName("script")[0];\n s.parentNode.insertBefore(hm, s);\n })();\n </script>\n </body>\n</html>\n''ISO-8859-1'

Here, response.content gets the raw bytes returned, response.text gets the decoded text, and response.encoding gets the detected encoding.
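Note that the encoding was guessed as ISO-8859-1 above, which is why the Chinese text in response.text looks garbled; a minimal sketch of overriding the encoding before reading the text, assuming the page is actually UTF-8 as its meta charset suggests:

response.encoding = 'utf-8'    # override the guessed ISO-8859-1 encoding
print(response.text[:200])     # the Chinese characters should now decode correctly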

But obviously the page information is not fully returned. This is because a real request does not carry only the link; it also carries other request information such as User-Agent, Referer and Cookie. For example:

header = {
    'Cookie': 'lastCity=100010000; __g=-; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1601602464; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1601603899; __zp_stoken__=cb83bGgJhaiViDXQAITlxUxFkf1pCNVEpEwUhZztsI15sAmVWQCkEKnUxcRpDISgGPFcSd0wHd11lKGM1Pn80J0RbEhEvayU6GXYcUwQVSThRFWM6IQ4gLwRCG2wAHE59OgYYZFcOBlsQA3VWJQ%3D%3D; __c=1601602461; __l=l=%2Fwww.zhipin.com%2F&r=&g=&friend_source=0&friend_source=0; __a=80430348.1601602461..1601602461.7.1.7.7',
    'Host': 'www.zhipin.com',
    'Referer': 'https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&city=100010000&industry=&position=',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
}
res = requests.get('https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&city=100010000&industry=&position=', headers=header)
res.text

At this time, the output information is more complete .

At the same time, you can save the requested content to a file , as follows :

html_file = open('bosspage.html', 'w', encoding='utf-8')
html_file.write(res.text)
html_file.close()

After running this, a new file bosspage.html appears in the current directory. Opening it in a browser shows the same page as before, and the information we need is stored in its HTML code.

3. Extract list information

With the page code in hand, we can extract information. Earlier we used string methods to extract substrings; here we choose BeautifulSoup to select the information we need.

Simple use as follows :

html = """<html><head><title>The Dormouse's title</title></head><body><p class="title" name="dromouse"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>"""soup_first = bs(html, 'html.parser')soup_first.prettify()

Output :

'<html>\n <head>\n <title>\n The Dormouse\'s title\n </title>\n </head>\n <body>\n <p class="title" name="dromouse">\n <b>\n The Dormouse\'s story\n </b>\n </p>\n <p class="story">\n Once upon a time there were three little sisters; and their names were\n <a class="sister" href="http://example.com/elsie" id="link1">\n <!-- Elsie -->\n </a>\n ,\n <a class="sister" href="http://example.com/lacie" id="link2">\n Lacie\n </a>\n and\n <a class="sister" href="http://example.com/tillie" id="link3">\n Tillie\n </a>\n ;\nand they lived at the bottom of a well.\n </p>\n <p class="story">\n ...\n </p>\n </body>\n</html>\n'

You can get all the text in the tag , as follows :

soup_first.text

Output :

"\nThe Dormouse's title\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\n,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n\n\n"

You can also get the attributes of the tag , as follows :

all_a = soup_first.find_all("a")
all_a[0]["href"]

Output :

'http://example.com/elsie'

You can see that we got the href attribute of the a tag, i.e. the link.

You can also get all the links , as follows :

[a['href'] for a in soup_first.find_all("a")]

Output :

['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']

There are other uses :

display(soup_first.title, soup_first.head, soup_first.a, soup_first.p.string, soup_first.find_all("a"))

Output :

<title>The Dormouse's title</title><head><title>The Dormouse's title</title></head><a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>"The Dormouse's story"[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Initialize BeautifulSoup with the BOSS Zhipin page:

soup = bs(res.text, 'lxml')
soup.prettify()

Locate the information you need , as follows :

all_jobs = soup.find_all("div", class_="job-primary")
all_jobs[0]

Output :

<div class="job-primary"><div class="info-primary"><div class="primary-wrapper"><div class="primary-box" data-index="0" data-itemid="1" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-jobid="102127880" data-lid="nlp-arJU8s0LBOW.search.1" href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html" ka="search_list_1" target="_blank"><div class="job-title"><span class="job-name"><a data-index="0" data-itemid="1" data-jid="7271f2f28169375a1nR42t-6GFpQ" data-jobid="102127880" data-lid="nlp-arJU8s0LBOW.search.1" href="/job_detail/7271f2f28169375a1nR42t-6GFpQ.html" ka="search_list_jname_1" target="_blank" title=" Data analysis "> Data analysis </a></span><span class="job-area-wrapper"><span class="job-area"> Beijing · Chaoyang District · Bird nest </span></span><span class="job-pub-time"></span></div><div class="job-limit clearfix"><span class="red">50-80K·14 pay </span><p>3-5 year <em class="vline"></em> Undergraduate </p><div class="info-publis"><h3 class="name"><img class="icon-chat" src="https://z.zhipin.com/web/geek/resource/icon-chat-v2.png"/> Mr. Cao <em class="vline"></em> data mining </h3></div><button class="btn btn-startchat" data-url="/wapi/zpgeek/friend/add.json?jobId=7271f2f28169375a1nR42t-6GFpQ&amp;lid=nlp-arJU8s0LBOW.search.1" href="javascript:;" redirect-url="/web/geek/chat?id=495f7159c0c8664a1nFz39m8EA~~"><img alt="" class="icon-chat icon-chat-hover" src="https://z.zhipin.com/web/geek/resource/icon-chat-hover-v2.png"/><span> Communicate immediately </span></button></div><div class="info-detail"></div></div></div><div class="info-company"><div class="company-text"><h3 class="name"><a href="/gongsi/33e052361693f8371nF-3d25.html" ka="search_list_company_1_custompage" target="_blank" title=" Jingdong group recruitment "> Jingdong group </a></h3><p><a class="false-link" href="/i100001/" ka="search_list_company_industry_1_custompage" target="_blank" title=" E-commerce industry recruitment information "> Electronic Commerce </a><em class="vline"></em> Listed <em class="vline"></em>10000 More people </p></div><a href="/gongsi/33e052361693f8371nF-3d25.html" ka="search_list_company_1_custompage_logo" target="_blank"><img alt="" class="company-logo" src="https://img.bosszhipin.com/beijin/mcs/bar/20191129/3cdf5ba2149e309b38868b62ae9c22cabe1bd4a3bd2a63f070bdbdada9aad826.jpg?x-oss-process=image/resize,w_100,limit_0"/></a></div></div><div class="info-append clearfix"><div class="tags"><span class="tag-item">Excel</span><span class="tag-item">SPSS</span><span class="tag-item">Python</span><span class="tag-item"> data mining </span><span class="tag-item"> Data warehouse </span></div><div class="info-desc"> Supplementary medical insurance , Holiday benefits , Regular check-up , Annual bonus , Meal supplement , Traffic subsidy , Free shuttle bus , Bag eating , the stock option , Staff travel , Snack afternoon tea , Five social insurance and one housing fund , Paid annual leave </div></div></div>

You can see that this is the detail block of one job posting.

First, extract the information of a single job posting:

base_boss_url ="https://www.zhipin.com"job_link= base_boss_url + all_jobs[0].a["href"]job_title = all_jobs[0].a.textjob_salary = all_jobs[0].find('span',class_='red').textother_detail = all_jobs[0].find("div", class_="info-detail").textcompany_url = base_boss_url + all_jobs[0].select(".info-company")[0].a["href"]company = all_jobs[0].select(".info-company")[0].a.textcompany_info = all_jobs[0].select(".info-company")[0].p.textpublish_info = all_jobs[0].find("div",class_="info-publis").h3.text"{}-{}-{}-{}-{}-{}-{}-{}".format(job_link,job_title,job_salary,other_detail,company_url,company,company_info,publish_info)

Output :

'https://www.zhipin.com/job_detail/7271f2f28169375a1nR42t-6GFpQ.html- Data analysis -50-80K·14 pay --https://www.zhipin.com/gongsi/33e052361693f8371nF-3d25.html- Jingdong group - E-commerce is on the market 10000 More people - Mr. Cao data mining '

Obviously, the details of one job have been extracted.

Going further, use a for loop to extract the information of all jobs on the current page, as follows:

jobs_index = []
for job_ in all_jobs:
    job_link = base_boss_url + job_.a["href"]
    job_title = job_.a.text
    job_salary = job_.find('span', class_='red').text
    other_detail = job_.find("div", class_="info-detail").text
    company_url = base_boss_url + job_.select(".info-company")[0].a["href"]
    company = job_.select(".info-company")[0].a.text
    company_info = job_.select(".info-company")[0].p.text
    publish_info = job_.find("div", class_="info-publis").h3.text
    jobs_index.append([job_link, job_title, job_salary, other_detail, company_url, company, company_info, publish_info])
jobs_index

Output :

[['https://www.zhipin.com/job_detail/7271f2f28169375a1nR42t-6GFpQ.html', ' Data analysis ', '50-80K·14 pay ', '', 'https://www.zhipin.com/gongsi/33e052361693f8371nF-3d25.html', ' Jingdong group ', ' E-commerce is on the market 10000 More people ', ' Mr. Cao data mining '], ['https://www.zhipin.com/job_detail/1fe1d55e100e19d43nR509m-E1Q~.html', ' Data analysis ', '18-35K·15 pay ', '', 'https://www.zhipin.com/gongsi/918159f26789c3891nV53dQ~.html', ' The little red book ', ' Internet D Wheel and above 1000-9999 people ', ' Mr. Liu, business data center '], ['https://www.zhipin.com/job_detail/4423d7c2eda602351nR-09u0EVs~.html', ' Data analysis ', '25-40K·16 pay ', '', 'https://www.zhipin.com/gongsi/fa2f92669c66eee31Hc~.html', 'BOSS Direct employment ', ' Human resources services D Wheel and above 1000-9999 people ', ' Mr. alifan data analysis '], ['https://www.zhipin.com/job_detail/9c2e41ed166d74bd03J-29u0F1s~.html', ' Business data analysis ', '25-40K·15 pay ', '', 'https://www.zhipin.com/gongsi/980f48937a13792b1nd63d0~.html', ' Drops travel ', ' Mobile Internet D Wheel and above 1000-9999 people ', ' Mr. Wang, senior manager of business analysis '], ['https://www.zhipin.com/job_detail/27d069780b8cc5c53nV62dS8EFE~.html', ' Post data analysis ', '20-40K·14 pay ', '', 'https://www.zhipin.com/gongsi/6e19637143bd80ad1HV_3N26GQ~~.html', ' Jianxin Jinke ', ' Banks don't need financing 1000-9999 people ', ' Mr. Wang, architect / researcher '], ... ['https://www.zhipin.com/job_detail/4c408eec4076e9d80nV73NW5FVU~.html', ' Data Analyst ', '30-50K·16 pay ', '', 'https://www.zhipin.com/gongsi/ea9c5680f57d53d71HV90ty5.html', ' A lot of spelling ', ' Mobile Internet is on the market 1000-9999 people ', ' Ms. Wang, data team of commercialization Department leader'], ['https://www.zhipin.com/job_detail/9f44d60c7097321033142tu4FVI~.html', ' Business data analysis ', '20-30K', '', 'https://www.zhipin.com/gongsi/92674acda23901841nd_292-EQ~~.html', ' Many car groups ', ' Internet D Wheel and above 10000 More people ', ' Ms. Li HR'], ['https://www.zhipin.com/job_detail/a6df576d9539ad810HN439i7Flo~.html', ' Data Analyst ', '30-50K·14 pay ', '', 'https://www.zhipin.com/gongsi/48e6b3630a48ccdb03N-2di9.html', ' Share the momentum ', ' The Internet doesn't need financing 500-999 people ', ' Ms. Chen recruiter '], ['https://www.zhipin.com/job_detail/89713a5a1647e44e0XF63dW8F1Y~.html', ' Data Analyst ', '20-30K·13 pay ', '', 'https://www.zhipin.com/gongsi/d6f0653b1a4d44740XB_29W0.html', ' Ape counseling ', ' Online education D Wheel and above 1000-9999 people ', ' Ms. Mao hrbp Senior Manager '], ['https://www.zhipin.com/job_detail/7585af83791f132833F639u7Flo~.html', ' Data Analyst ', '15-25K', '', 'https://www.zhipin.com/gongsi/f12428f4426b92a033V52tU~.html', '360', ' Mobile Internet is on the market 1000-9999 people ', ' Ms. Zhang HRBP']]

Obviously, the extracted information is now usable.

Because there are many pages, we need to turn pages to get the information from each one, which means getting the link to the next page from the current page, as follows:

next_page = base_boss_url + soup.find("a", class_="next")['href']
next_page

Output :

'https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=2'

Obviously, we got the link to the next page.

This can then be wrapped into a function:

def extract_jobs(page):
    page_soup = bs(page, 'lxml')
    all_jobs = page_soup.find_all("div", class_="job-primary")
    jobs_index = []
    print("parseing page ", page_soup.title.text)
    for job_ in all_jobs:
        job_link = base_boss_url + job_.a["href"]
        job_title = job_.a.text
        job_salary = job_.find('span', class_='red').text
        other_detail = job_.find("div", class_="info-detail").text
        company_url = base_boss_url + job_.select(".info-company")[0].a["href"]
        company = job_.select(".info-company")[0].a.text
        company_info = job_.select(".info-company")[0].p.text
        publish_info = job_.find("div", class_="info-publis").h3.text
        jobs_index.append([job_link, job_title, job_salary, other_detail, company_url, company, company_info, publish_info])
    next_page = base_boss_url + page_soup.find("a", class_="next")['href']  # look up the next-page link in the page just parsed
    print("next page is ", next_page)
    return jobs_index, next_page

Then loop to crawl multiple pages:

next_page = "https://www.zhipin.com/job_detail/?query= Data analysis &city=100010000&industry=&position="header = { 'Cookie': 'Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1601622554; lastCity=100010000; __g=-; toUrl=https%3A%2F%2Fwww.zhipin.com%2Fc100010000%2F%3Fquery%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%26page%3D2%26ka%3Dpage-2; t=CPzVdSehDMWYI0ch; wt=CPzVdSehDMWYI0ch; _bl_uid=70kkOfndrz2x09b2wqjXvwRw7CXh; __c=1601622556; __l=l=%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%26city%3D100010000%26industry%3D%26position%3D&r=&g=&friend_source=0&friend_source=0; __a=10559958.1598103978.1598103978.1601622556.20.2.19.20; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1601625518; __zp_stoken__=cb83bGmgSGWkpKCwDKD94UGNacAUEGlI0IiUsTFEZOkpsdHcVUH9dZWN0U3hoOykGPFcSd0wHeyVlID01OwRMXh5NPCtDNBRnZXAZTAIVSThRFWM6IQ86BGZgXnpPRRhtOgYYZFcOBlsQA3VWJQ%3D%3D', 'Host': 'www.zhipin.com', 'Referer': 'https://www.zhipin.com/', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',}counter = 0all_jobs = []while next_page != "javascript:;": print("start to fecth url ",next_page) boss_response = requests.get(next_page, headers=header) jobs, next_page = extract_jobs(boss_response.text) counter +=1 if len(jobs) > 0: all_jobs = all_jobs + jobs if counter > 3: break time.sleep(random.randint(5,12))

Output is as follows :

start to fecth url https://www.zhipin.com/job_detail/?query= Data analysis &city=100010000&industry=&position=parseing page 「 National data analysis recruitment 」-2020 National data analysis of the latest recruitment information - BOSS Direct employment start to fecth url https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=2parseing page 「 National data analysis recruitment 」-2020 National data analysis of the latest recruitment information - BOSS Direct employment start to fecth url https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=3parseing page 「 National data analysis recruitment 」-2020 National data analysis of the latest recruitment information - BOSS Direct employment start to fecth url https://www.zhipin.com/c100010000/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&page=4parseing page 「 National data analysis recruitment 」-2020 National data analysis of the latest recruitment information - BOSS Direct employment 

Because BOSS Zhipin's anti-crawling measures are now fairly strict, Cookie information (which identifies basic information about a user) has to be added to the request headers. How to get the Cookie in the browser is shown below. [figure: getting the Cookie in the browser]

Note that you need the Cookie of the first network request (i.e. the request to a URL like https://www.zhipin.com/job_detail/?query=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90&city=100010000&industry=&position=), because different requests may carry different Cookies. Also, a Cookie may only work once, so each new crawl should capture a fresh Cookie; registering a user and using a logged-in Cookie works even better. Finally, to control the access frequency, execution is paused with time.sleep() after each page is processed.
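One way to manage cookies with less manual copying is a requests.Session, which stores any cookies the server sets and sends them on later requests. A minimal sketch follows; whether the site issues usable cookies this way is an assumption, and a hand-copied Cookie may still be needed.

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
    'Referer': 'https://www.zhipin.com/',
})
# The first request may set cookies that the Session then reuses automatically
first = session.get('https://www.zhipin.com/')
print(session.cookies.get_dict())      # cookies collected so far
listing = session.get(next_page)       # later requests carry those cookies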

At this point, view the data obtained , as follows :

all_jobs

Output :

[['https://www.zhipin.com/job_detail/e1bde0976de53e081nR43dm5EFRU.html', ' Data Analyst ( Internship )', '200-300 element / God ', '', 'https://www.zhipin.com/gongsi/2e64a887a110ea9f1nRz.html', ' tencent ', ' The Internet is on the market 10000 More people ', ' Ms. Huang HRBP'], ['https://www.zhipin.com/job_detail/062a0a30e8b663103nJy2N65E1E~.html', '【 The school recruit 】 Data Analyst ', '20-30K·16 pay ', '', 'https://www.zhipin.com/gongsi/fa2f92669c66eee31Hc~.html', 'BOSS Direct employment ', ' Human resources services D Wheel and above 1000-9999 people ', 'BOSS Direct employment School Recruitment Campus Recruitment '], ['https://www.zhipin.com/job_detail/5b132f8291af536d3nN42di7F1Y~.html', ' The weekend double cease Data analysis ', '7-12K', '', 'https://www.zhipin.com/gongsi/aa07960c21a559c61nV_3N24GFs~.html', ' Beijing Pancheng ', ' Electronic Commerce 100-499 people ', ' Ms. Wang, personnel manager '],... ['https://www.zhipin.com/job_detail/3bcf1023eea94e363nN_3d65GFM~.html', ' Data Analyst ', '7-8K', '', 'https://www.zhipin.com/gongsi/c58313ff6a0317b10HN83d-0.html', ' Nut power ', ' game A round 100-499 people ', ' Producer Li Qiang '], ['https://www.zhipin.com/job_detail/0cf98b59a339fd4603R90tS5FlQ~.html', ' Data Analyst ', '8-10K', '', 'https://www.zhipin.com/gongsi/90ffbb07580a82d203d73d-5Fw~~.html', ' Beijing stands for innovation and technology ...', ' academic / Unfunded research 0-20 people ', ' Gao Xiaoling, designer ']]

Obviously, we have got the data we need.

It can also be further saved to a file , as follows :

fout = open('job_data.csv', 'wt')
for info in all_jobs:
    fout.write(",".join(info) + "\n")
fout.close()

After this runs successfully, there is one more file in the directory: job_data.csv.
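Since pandas was already imported, here is a minimal sketch of loading the file back for later analysis; the column names are hypothetical, chosen to match the fields written above.

import pandas as pd

columns = ['job_link', 'job_title', 'job_salary', 'other_detail',
           'company_url', 'company', 'company_info', 'publish_info']
jobs_df = pd.read_csv('job_data.csv', names=columns)   # the file was written without a header row
jobs_df.head()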

4. Get job details

To get the job details, we can use the detail links obtained earlier, simulate the requests with requests and parse them with BeautifulSoup.

Take one job detail link as an example first. Inspect the web page: [figure: BOSS Zhipin job detail page]

You can see that the job details are in the div with class detail-content.

Get the details of a position details page , as follows :

detail_link = all_jobs[0][0]
header = {
    'Cookie': 'lastCity=100010000; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1601602464,1601624966; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1601627370; __zp_stoken__=cb83bGmgSGWkpKFVye2gnUGNacAVQeH5ZeQEsTFEZOiALeWBKTX9dZWN0eHZBaRkGPFcSd0wHey9kCTc1M2kdDjAjby9CXQRiHX9yWnsLSThRFWM6IT9oLWhLXnpPRRhwOgYYZFcOBlsQA3VWJQ%3D%3D; __fid=7627d554a7f83f762fe906cbda0d7906; __g=-; __c=1601602461; __l=l=%2Fwww.zhipin.com%2Fc100010000%2F%3Fquery%3D%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590%26page%3D5&r=http%3A%2F%2F127.0.0.1%3A8888%2Fnotebooks%2Fcrawl_boss.ipynb&g=&friend_source=0&friend_source=0; __a=80430348.1601602461..1601602461.23.1.23.23',
    'Host': 'www.zhipin.com',
    'Referer': 'https://www.zhipin.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}
job_detail = requests.request("GET", detail_link, headers=header)
job_soup = bs(job_detail.text, "lxml")
detail_text = job_soup.find("div", class_="job-sec")
detail_text.text

Output :

'\n Job description \n\n 【 Tencent video Number 】 Data analysis - Daily interns 【 Only accept 985/211 Or overseas university counterpart professional resume 】 Job responsibilities: responsible for the data analysis of wechat video number products and related job requirements 1. Data Science / Statistics / Mathematics / Trust tube / Bachelor degree or above in computer science or related field ;2. Good data structure and algorithm foundation , Excellent coding ability ;3. Data driven , Skillfully use sql、excel, Efficient use of data to guide and optimize solutions ,Python or R( must ),SQL( must ),Tableau( Bonus points , Suggest )Excel( must ) Experience in massive data processing ;4. Good communication skills 、 Frank and direct 、 Value teamwork ;5. Have talent, other interests and hobbies , Internet companies have the same internship experience , Self management account is preferred 6. One week internship 4 Days or more , Be able to arrive at the post immediately , Internship 3 A month or more 7. It must be a student with a school status , Give priority to 2021 Session and 2021 After graduation students . other :1. place : Beijing offline ;2. treatment : High salaries , Travel and transportation reimbursement , Free meals , Large space , Team atmosphere nice\n \n'

Further processing of the text :

detail_text = detail_text.text.replace("\n", "").replace(" ", "")
detail_text

Output :

' Job description 【 Tencent video Number 】 Data analysis - Daily interns 【 Only accept 985/211 Or overseas university counterpart professional resume 】 Job responsibilities: responsible for the data analysis of wechat video number products and related job requirements 1. Data Science / Statistics / Mathematics / Trust tube / Bachelor degree or above in computer science or related field ;2. Good data structure and algorithm foundation , Excellent coding ability ;3. Data driven , Skillfully use sql、excel, Efficient use of data to guide and optimize solutions ,Python or R( must ),SQL( must ),Tableau( Bonus points , Suggest )Excel( must ) Experience in massive data processing ;4. Good communication skills 、 Frank and direct 、 Value teamwork ;5. Have talent, other interests and hobbies , Internet companies have the same internship experience , Self management account is preferred 6. One week internship 4 Days or more , Be able to arrive at the post immediately , Internship 3 A month or more 7. It must be a student with a school status , Give priority to 2021 Session and 2021 After graduation students . other :1. place : Beijing offline ;2. treatment : High salaries , Travel and transportation reimbursement , Free meals , Large space , Team atmosphere nice'

Obviously, the text is much cleaner now.

Then loop over several detail links to get the detailed description of each:

job_desc = []
header = {
    'Cookie': 'Cookie: lastCity=100010000; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1601602464,1601624966; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1601628313; __zp_stoken__=cb83bGmgSGWkpKF9eQW0WUGNacAVVDB9sNDssTFEZOlIDHXcKU39dZWN0enMzK2IGPFcSd0wHeyAzIGM1LHd1KFU0Y1BHPxZtbHF0XH4cSThRFWM6IUQqX21JXnpPRRhuOgYYZFcOBlsQA3VWJQ%3D%3D; __fid=7627d554a7f83f762fe906cbda0d7906; __g=-; ___gtid=729532789; __c=1601602461; __l=l=%2Fwww.zhipin.com%2Fjob_detail%2F7271f2f28169375a1nR42t-6GFpQ.html%3Fka%3Dsearch_list_jname_1_blank%26lid%3Dnlp-axWMPTPcuB6.search.1&r=http%3A%2F%2F127.0.0.1%3A8888%2Fnotebooks%2Fcrawl_boss.ipynb&g=&friend_source=0&friend_source=0; __a=80430348.1601602461..1601602461.28.1.28.28',
    'Host': 'www.zhipin.com',
    'Referer': 'https://www.zhipin.com/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
}
for job in all_jobs[:4]:
    print(".", end="")
    job_detail = requests.request("GET", job[0], headers=header)
    job_soup = bs(job_detail.text, "lxml")
    detail_text = job_soup.find("div", class_="job-sec").text.replace("\n", "").replace(" ", "")
    job_desc.append([job[0], detail_text])
    time.sleep(random.random() * 3)
job_desc

Output :

[['https://www.zhipin.com/job_detail/e1bde0976de53e081nR43dm5EFRU.html', ' Job description 【 Tencent video Number 】 Data analysis - Daily interns 【 Only accept 985/211 Or overseas university counterpart professional resume 】 Job responsibilities: responsible for the data analysis of wechat video number products and related job requirements 1. Data Science / Statistics / Mathematics / Trust tube / Bachelor degree or above in computer science or related field ;2. Good data structure and algorithm foundation , Excellent coding ability ;3. Data driven , Skillfully use sql、excel, Efficient use of data to guide and optimize solutions ,Python or R( must ),SQL( must ),Tableau( Bonus points , Suggest )Excel( must ) Experience in massive data processing ;4. Good communication skills 、 Frank and direct 、 Value teamwork ;5. Have talent, other interests and hobbies , Internet companies have the same internship experience , Self management account is preferred 6. One week internship 4 Days or more , Be able to arrive at the post immediately , Internship 3 A month or more 7. It must be a student with a school status , Give priority to 2021 Session and 2021 After graduation students . other :1. place : Beijing offline ;2. treatment : High salaries , Travel and transportation reimbursement , Free meals , Large space , Team atmosphere nice'], ['https://www.zhipin.com/job_detail/062a0a30e8b663103nJy2N65E1E~.html', ' Job description our daily work :1、 writing code( Include SQL/Shell/Python/R etc. ), extract 、 Processing related business data 2、 application Excel/Python And some visualization tools for data visualization related analysis 3、 Write an analysis report , The relevant analysis conclusions are given . It is mainly divided into several aspects : Evaluation of the effect of product business optimization iteration , Data validation of business assumptions , Business optimization implementation strategy formulation , Research orientation of possible business problems , Explore the strategic direction of business development 4、 With the product 、 market 、 operating 、 sales 、 Design and other departments to communicate business analysis conclusions , So that the discovery of data analysis can drive the relevant optimization of business, you in our eyes :1、 This position needs to write a lot of code , So hopefully you're not a person who's afraid to deal with code .2、 This position needs data analysis , Will need to master a lot of mathematics related knowledge , So I hope you are not a person who didn't like learning mathematics when you were young .3、 This position involves a lot of communication with people , So I hope you're not too shy .4、 The purpose of this position is to understand the internal mechanism of user behavior through data analysis , So we want you to be a better empathic person , Usually, I am willing to understand 、 And someone who can take care of the feelings of others .5、 In order to better understand the business , We will need to learn extensively , Economics, for example 、 psychology 、 system theory 、 Information theory and so on , So I hope you are one who can challenge yourself constantly , People who are curious enough about all the unknowns .6、 Knowledge can be learned 、 Ability can be exercised 、 Mind can cultivate , In the end, you still need to be a person with enough career pursuit .'], ['https://www.zhipin.com/job_detail/5b132f8291af536d3nN42di7F1Y~.html', ' Job description skill requirements : Data analysis , Data warehouse 1、 Collect industry related information , 
Provide more accurate data information for relevant demanders ;2、 Enrich market analysis ability , Make a daily analysis plan , Master all kinds of analytical techniques ;3、 To the market 、 industry 、 Company operations, etc. provide data analysis plan , Support strategic decisions ;4、 Publish research findings or analytical reviews , Cooperate with the promotion and training of the company .5、 Assist department manager to improve department management system ;'], ['https://www.zhipin.com/job_detail/b8f8a877b20685010XF62Nm1FVs~.html', ' Job description 【 Responsibilities 】\u20281、 Participate in raw data 、 Data extraction 、 government 、 The whole process from statistical analysis to report presentation 2、 Deep understanding of business , Discover business characteristics , Mining the value of derivative data \u2028【 Qualifications 】1、 Computer and other related majors are preferred ;2、 be familiar with SQL, Have use HIVESQL perhaps SPARKSQL Experienced person , Skillfully use java,python perhaps scala first ;3、 Open and flexible , Sensitive to numbers , Good at finding problems from data and grasping the key points ;4, Good data sensitivity 、 Good logical thinking , Can discover and analyze the hidden changes and problems in the data in time ;5、 Good logical thinking ability , Be able to find valuable laws from massive data 6、 understand spark ecology , Experience in big data processing is preferred 7、 Internship period at least 6 Months .']]

Further save the data as follows :

fout = open('job_desc.csv', 'wt', encoding='gbk')
for info in job_desc:
    fout.write("{},\"{}\"\n".format(info[0], info[1].encode('gbk', 'ignore').decode('gbk', 'ignore')))
fout.close()

Looking at the current directory again, there is one more file: job_desc.csv.

5. Word frequency statistics and word cloud display

To analyse the Chinese job descriptions, we first need word segmentation, i.e. splitting the text into shorter words. We use the jieba library; before using it, install it with a command such as conda install -c conda-forge jieba.

jieba The simple use of the library is as follows :

import jieba

fenci = jieba.cut(" I went to university in Beijing , I went to Peking University, which is better than Tsinghua University ", cut_all=True)
'/'.join(fenci)

Output :

' I / stay / Beijing / On / university /// I / On / Of / yes / Than / tsinghua / good / Of / Beijing / Peking University, / university '

Further use the following :

fenci = jieba.cut(" I am studying in Beijing , I went to Peking University, which is better than Tsinghua University ",cut_all=False)print("/ ".join(fenci))fenci = jieba.cut(" I love tian 'anmen square in Beijing , Five rings are one less than six rings , Learn from good examples python It's not a low-end labor force , Sobbing ",cut_all=False)print("/ ".join(fenci))jieba.suggest_freq(" Six rings ", tune=True)fenci = jieba.cut(" I love tian 'anmen square in Beijing , There is one more ring in the five rings than in the six rings , Learn from good examples python It's not a low-end labor force , Sobbing ",cut_all=False)print("/ ".join(fenci))

Output :

 I / stay / Beijing / read / university / ,/ I read / Of / yes / Than / tsinghua / good / Of / Beijing University I / Love / Beijing / The tiananmen square / ,/ Five rings / Less than six rings / One ring / ,/ Learn from good examples / python/ Just / No / Low-end / Labour / 了 / ,/ Sobbing at me / Love / Beijing / The tiananmen square / ,/ Five rings / Than / Six rings / One more step / ,/ Learn from good examples / python/ Just / No / Low-end / Labour / 了 / ,/ Sobbing 

Processing job_desc in the same way:

combined_job_desc = " ".join([j[1] for j in job_desc])fenci_job_desc = jieba.cut(combined_job_desc,cut_all=False)space = " ".join(fenci_job_desc)space

Before using word clouds, install the wordcloud library with a command such as conda install -c conda-forge wordcloud.

The word cloud object is created as follows:

from wordcloud import WordCloud

# Generate the WordCloud object
wc = WordCloud(
    # width=800,
    # height=600,
    background_color="white",   # set the background color
    max_words=200,              # maximum number of words (default 200)
    colormap='viridis',         # string or matplotlib colormap, default="viridis"
    random_state=10,            # number of random states, i.e. how many color schemes
    font_path='STLITI.TTF'      # set the font path
)
my_wordcloud = wc.generate(space)

Note: when initializing, if the text contains Chinese words you need to specify font_path, i.e. the font path, and the corresponding font file must exist at that path.

To get a font file for testing, you can join QQ group Python Geek Tribe 963624318 and download one from the group folder Business Data Analysis from Beginner to Employment; on Windows you can also pick a font that supports Chinese from C:\Windows\Fonts and copy it to the project path.

Show word cloud :

import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.figure()

Result: [figure: first BOSS Zhipin word cloud]

You can see that the size of each word reflects the weight of the keyword, which makes the word cloud discriminative.

But it can be optimized further: to get rid of repetitive or meaningless words, the stopwords parameter can be set when initializing the WordCloud object so that these words are ignored. As follows:

wc = WordCloud(
    # width=800,
    # height=600,
    background_color="white",   # set the background color
    # max_words=200,            # maximum number of words (default 200)
    colormap='viridis',         # string or matplotlib colormap, default="viridis"
    random_state=10,            # number of random states, i.e. how many color schemes
    font_path='STKAITI.TTF',
    stopwords=(' data ', ' Data analysis ', ' Job description ', ' Work ', ' Job content ', ' duty ', ' operating duty ', ' Job requirements ', ' Position ', ' describe ', ' product ', ' Experience ', ' skilled ',
               ' Conduct ', ' operating ', ' relevant ', ' Above education ', ' Use ', ' Tools ', ' Undergraduate ', ' Provide ', ' be responsible for ', ' Business ', ' be familiar with ', ' analysis ', ' first ', ' Ability ', ' Strategy ',
               ' In office ', ' be familiar with ', ' Development ', ' project ', ' company ', ' demand ', ' Support ', ' Responsibilities ', ' industry ', ' problem ', ' Research ', ' Logic ', ' have ', ' build ', ' can ',
               ' Decision making ', ' complete ', ' technology ', ' monitor ', ' Customer ', ' be based on ', ' Method ', ' Design ', ' understand ', ' good ', ' department ', ' daily ', ' adopt ', ' The team ', ' Internet ', ' according to ',
               ' establish ', ' as well as ', ' Have ', ' Find out ', ' application ', ' Business unit ', ' To develop ', ' master ', ' requirement ', ' platform ', ' Basics ', ' above ', ' Push ', ' system ', ' management ',
               ' Stronger ', ' Study ', ' management ', ' Qualifications ', ' Suggest ', ' major ', ' to ground ', ' assist ', ' perform ', ' value ', ' programme ', ' Put forward ', ' solve ', ' Fast ', ' good ', ' Participate in ',
               ' Direction ', ' improvement ', ' Building ', ' assessment ', ' Research and development ', ' Information ', ' extract ', ' thorough ', ' Commonly used ', ' Include ', ' Position ', ' understand ', ' user ')
)
my_wordcloud = wc.generate(space)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.figure()

Show :data web boss wordcloud second

You can see that, compared with before, the result is more informative.

You can also go further and extract keywords together with their weights, as follows:

from jieba import analyse

keywords = analyse.extract_tags(combined_job_desc, topK=300, withWeight=True, allowPOS=('n',))
keywords

Output :

[(' data ', 0.3934027646776582), (' Business ', 0.35639458308689875), (' Position ', 0.19911889163164556), (' Position ', 0.19509550027518988), (' Ability ', 0.1561817429800633),... (' All ', 0.03118962121886076), (' Conditions ', 0.030984291967974684), (' Basics ', 0.030147029148860763), (' technology ', 0.0298699821428481), (' aspect ', 0.026981713165253163)]
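Incidentally, since extract_tags already returns weights, the word cloud could also be built directly from these weights instead of from the space-joined text. A minimal sketch, assuming the keywords list and the wc object created above:

# Turn the (word, weight) pairs into a dict and feed it to the word cloud
freq = dict(keywords)
my_wordcloud = wc.generate_from_frequencies(freq)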

Because many skills are expressed in English, such as MySQL, Python, etc., the Chinese characters can be removed and the analysis repeated. A small demonstration first:

import re

s = 'hi 新手 oh'  # '新手' means "novice"
remove_chinese = re.compile(r'[\u4e00-\u9fa5]')  # [\u4e00-\u9fa5] matches all Chinese characters
remove_chinese.split(s)

Output :

['hi', '', 'oh']

You can see that the Chinese characters in the string are removed. Applying the same processing to the combined job descriptions:

all_english = ''.join(remove_chinese.split(combined_job_desc))
all_english

Output :

'【】-【985/211】1.////;2.,;3.,sql、excel,,PythonR(),SQL(),Tableau(,)Excel();4.、、;5.,,6.4,,37.,20212021.:1.:;2.:,,,,nice :1、code(SQL/Shell/Python/R),、2、Excel/Python3、,.:,,,,4、、、、、,:1、,.2、,,.3、,.4、,,、.5、,,、、、,,.6、、、,. :,1、,;2、,,;3、、、,;4、,.5、; 【】\u20281、、、、2、,,\u2028【】1、;2、SQL,HIVESQLSPARKSQL,java,pythonscala;3、,,;4,、,;5、,6、spark,7、6.'

Meanwhile, many English terms differ only in case but express the same meaning, such as SQL and sql; their counts can be combined:

combined_job_desc.count("SQL") + combined_job_desc.count("sql")

Output :

6
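The same idea extends to other skills: lower-casing the text once lets you merge counts regardless of case. A small sketch; the skill list here is illustrative, not from the original analysis:

# Count several skills case-insensitively in the combined job descriptions
skills = ["SQL", "Python", "Excel", "Tableau", "Hive", "Spark"]
lower_desc = combined_job_desc.lower()
skill_counts = {skill: lower_desc.count(skill.lower()) for skill in skills}
skill_counts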

The keyword analysis on the English-only text is as follows:

keywords = jieba.analyse.extract_tags(all_english, topK=300, withWeight=True, allowPOS=())
keywords

Output :

[('SQL', 1.5593175003782607), ('Excel', 1.0395450002521738), ('985', 0.5197725001260869), ('211', 0.5197725001260869), ('sql', 0.5197725001260869), ('excel', 0.5197725001260869), ('PythonR', 0.5197725001260869), ('Tableau', 0.5197725001260869), ('6.4', 0.5197725001260869), ('37', 0.5197725001260869), ('20212021', 0.5197725001260869), ('nice', 0.5197725001260869), ('code', 0.5197725001260869), ('Shell', 0.5197725001260869), ('Python', 0.5197725001260869), ('Python3', 0.5197725001260869), ('HIVESQLSPARKSQL', 0.5197725001260869), ('java', 0.5197725001260869), ('pythonscala', 0.5197725001260869), ('spark', 0.5197725001260869)]

Here, the weights of the words vary greatly.

Then draw the word cloud for these English terms as follows:

eng_job_desc = jieba.cut(all_english, cut_all=False)
en_space = " ".join(eng_job_desc)
wc_eng = WordCloud(
    # width=1600,
    # height=800,
    background_color="white",   # Set the background color
    max_words=300,              # The maximum number of words (default 200)
    colormap='viridis',         # string or matplotlib colormap, default="viridis"
    random_state=10,            # Number of random states, i.e. how many color schemes there are
    # font_path='./fonts/cn/msyh.ttc'
)
my_wordcloud = wc_eng.generate(en_space)
plt.imshow(wc_eng, interpolation="bilinear")
plt.axis("off")
plt.figure()

Show :data web boss wordcloud third
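If you want to keep the generated picture rather than only display it inline, the WordCloud object can be written straight to an image file; the file name below is just an example:

# Save the English-keyword word cloud to a PNG file
wc_eng.to_file("job_desc_wordcloud_en.png")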

Three 、 King's glory hero list integration case

The King's glory hero list page is https://pvp.qq.com/web201605/herolist.shtml, which shows the basic information of each hero.

So far we have been digging useful information out of large amounts of web page data, but for some websites there is a simpler way: they provide a data API, i.e. data is delivered to the front end as JSON and then rendered for display. Obviously, getting data directly from a JSON API is easier and more efficient.

For example, the King's glory hero list page loads its data as JSON, as follows :data web king honor json

You can see that its address is https://pvp.qq.com/web201605/js/herolist.json and that it contains the basic information of all heroes. You can download this JSON file and read the information directly from it, without parsing it out of the web page, then integrate it with the information on the page to form richer records, and finally support querying a hero's information by keyword.

1. Obtain JSON data

First import the required libraries and fetch the JSON data, as follows:

import json
import requests
from bs4 import BeautifulSoup as bs

rongyao_response = requests.request("GET", "https://pvp.qq.com/web201605/js/herolist.json")
rongyao_response.text

Save it to a local file , as follows :

r = requests.get('https://pvp.qq.com/web201605/js/herolist.json', stream=True)
with open("herolist.json", 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

Operations on JSON objects can be implemented with the json library. Converting a JSON string to a dictionary works as follows:

json_obj = """{ "zoo_animal": "Lion", "food": ["Meat", "Veggies", "Honey"], "fur": "Golden", "clothes": null, "diet": [{"zoo_animal": "Gazelle", "food":"grass", "fur": "Brown"}]}"""data = json.loads(json_obj)data

Output :

{'zoo_animal': 'Lion', 'food': ['Meat', 'Veggies', 'Honey'], 'fur': 'Golden', 'clothes': None, 'diet': [{'zoo_animal': 'Gazelle', 'food': 'grass', 'fur': 'Brown'}]}

You can also turn a dictionary back into a JSON string, as follows:

json.dumps(data)

Output :

'{"zoo_animal": "Lion", "food": ["Meat", "Veggies", "Honey"], "fur": "Golden", "clothes": null, "diet": [{"zoo_animal": "Gazelle", "food": "grass", "fur": "Brown"}]}'

You can also read a JSON file directly into Python objects, as follows:

hero_list = None
with open('herolist.json', 'rb') as json_data:
    hero_list = json.load(json_data)
    print(hero_list[:5])

Output :

[{'ename': 105, 'cname': ' Lian po ', 'title': ' Justice booms ', 'new_type': 0, 'hero_type': 3, 'skin_name': ' Justice booms | Hell rock soul '}, {'ename': 106, 'cname': ' Little Joe ', 'title': ' The breeze of love ', 'new_type': 0, 'hero_type': 2, 'skin_name': ' The breeze of love | The night before all saints | Swan dream | Pure white flowers marry | Colorful unicorns '}, {'ename': 107, 'cname': ' zhaoyun ', 'title': ' The sky is full of dragons ', 'new_type': 0, 'hero_type': 1, 'hero_type2': 4, 'skin_name': ' The sky is full of dragons | endure ● Burning shadow | The future era | The Royal admiral | Hip hop King | Deacon white | The heart of the engine '}, {'ename': 108, 'cname': ' mozi ', 'title': ' Peace watch ', 'new_type': 0, 'hero_type': 2, 'hero_type2': 1, 'skin_name': ' Peace watch | Metal Storm | Eragon | Attack Mozi '}, {'ename': 109, 'cname': ' Daji ', 'title': ' Fox of charm ', 'pay_type': 11, 'new_type': 0, 'hero_type': 2, 'skin_name': ' Fox of enchantment | Maid coffee | Glamour Vegas | Alice in Wonderland | Girl Ali | Warm Samba '}]

The first 5 hero records are printed.
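Note that json.dumps escapes non-ASCII characters by default, so re-serialized hero records would show escape sequences instead of Chinese. A small sketch for pretty-printing one record with the Chinese kept readable:

# ensure_ascii=False keeps Chinese characters readable; indent pretty-prints the record
print(json.dumps(hero_list[0], ensure_ascii=False, indent=2))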

Get the type of each hero , as follows :

hero_type = [" All "," warrior "," Master "," tanks "," The assassin "," striker "," auxiliary "]for hero in hero_list: combine_type = [] if "hero_type" in hero: combine_type.append(hero_type[hero["hero_type"]]) if "new_type" in hero: combine_type.append(hero_type[hero["new_type"]]) if "hero_type2" in hero: combine_type.append(hero_type[hero["hero_type2"]]) print(hero["cname"] +" "+('|').join(combine_type))

Output :

Lian po   tanks | All 
Little Joe   Master | All 
Zhao Yun   warrior | All | The assassin 
Mozi   Master | All | Soldier 
Daji   Master | All 
...
Mengyu   striker | All 
mirror   The assassin | All 
Mengtian   warrior | All 
agodo   tanks | All 
Charlotte   warrior | warrior 
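To get a quick overview of how heroes are distributed across the primary types, the same fields can be fed into a Counter. A minimal sketch, assuming the hero_list and hero_type defined above:

from collections import Counter

# Count heroes by their primary type (the hero_type field)
primary_type_counts = Counter(hero_type[h["hero_type"]] for h in hero_list if "hero_type" in h)
primary_type_counts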

2. Get hero information from the web page

Now you can get the information in https://pvp.qq.com/web201605/herolist.shtml, including image links and so on. Try the following:

html_hero_response = requests.request("GET", "https://pvp.qq.com/web201605/herolist.shtml")
html_hero_response.content.decode('gbk')

As you can see from the output, the hero list in the response is incomplete and does not match what the web page actually shows. This may be because part of the information is rendered into the page by JavaScript and is not present in the page source, so it is not returned by a plain request. You can use the selenium library to simulate browser access, operating the browser the way a human would, and obtain the complete hero list. Before using it, install the selenium library with the command conda install -c conda-forge selenium; you also need to download a browser driver. Both Chrome and Firefox can be driven. Taking Chrome as an example, first check your Chrome browser version, as follows :data web hero Google version

After getting the version, go to http://chromedriver.storage.googleapis.com/index.html and choose a driver version close to your Chrome version, such as 83.0.4103.14. Click it, select chromedriver_win32.zip under that version, download and unzip it to get the chromedriver.exe file, and move it to the Scripts directory under the Anaconda installation directory, such as E:\Anaconda3\Scripts. If you are not using Anaconda but an ordinary Python environment, move it to the Scripts directory under the Python installation directory, such as E:\Python\Python38-32\Scripts. After that you can use selenium for simulated access.
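If you would rather not copy chromedriver.exe into the Scripts directory, you can also pass the driver path explicitly when creating the browser. The sketch below uses the selenium 3.x style API that matches this article's era, and the path is only an illustration (selenium 4 passes the path through a Service object instead):

from selenium import webdriver

# Illustrative path: wherever chromedriver.exe was unzipped on your machine
driver_path = r"D:\tools\chromedriver.exe"

browser = webdriver.Chrome(executable_path=driver_path)  # selenium 3.x style
browser.get("https://pvp.qq.com/web201605/herolist.shtml")
browser.quit()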

Since downloading from the official site is slow, the driver for Chrome 83.0.4103.14 has already been downloaded: you can join the QQ group Python Geek tribe 963624318 and download it from the group folder Python Related installation package; if you need another version, you can also ask the group owner.

The simulated access is as follows :

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://pvp.qq.com/web201605/herolist.shtml")
html = browser.page_source
browser.quit()

Execute it, as follows :data web king honor selenium simulation

You can see that a Chrome browser pops up, visits the website, and closes automatically after the information is retrieved.
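If the hero list is injected by JavaScript, browser.page_source may occasionally be read before rendering finishes. A hedged sketch using an explicit wait (assuming, as the parsing code below does, that the list lives in a ul element with class herolist):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get("https://pvp.qq.com/web201605/herolist.shtml")
# Wait up to 10 seconds for <ul class="herolist"> to be present before reading the source
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "herolist"))
)
html = browser.page_source
browser.quit()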

Now use BeautifulSoup to parse the page and get the hero list:

hero_soup = bs(html, 'lxml')
hero_html_list = hero_soup.find("ul", class_="herolist")
all_hero_list = hero_html_list.find_all("li")
print(all_hero_list[0].text)
print("https://" + all_hero_list[0].img["src"].strip("/"))

Output :

Charlotte 
https://game.gtimg.cn/images/yxzj/img201606/heroimg/536/536.jpg

Obviously, the basic information has been obtained.

Integrating further, get a list of all hero names and image links:

gen_heros=[[info.text, "https://"+info.img["src"].strip("/")] for info in all_hero_list]gen_heros

Output :

[[' Charlotte ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/536/536.jpg'], [' Agudor ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/533/533.jpg'], [' Meng tien ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/527/527.jpg'], [' mirror ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/531/531.jpg'], [' Mengyu ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/524/524.jpg'],... [' Daji ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/109/109.jpg'], [' mozi ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/108/108.jpg'], [' zhaoyun ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/107/107.jpg'], [' Little Joe ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/106/106.jpg'], [' Lian po ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/105/105.jpg']]
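With these image links you could also save an avatar locally with requests; a small sketch (the file name is illustrative):

# Download the first hero's avatar image
name, img_url = gen_heros[0]
resp = requests.get(img_url)
with open(f"{name.strip()}.jpg", "wb") as f:
    f.write(resp.content)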

3. Data integration

Now we need to integrate the data in the two lists hero_list and gen_heros, and implement search by keyword.

First, define a function that builds the type string for a hero in hero_list, and a function that searches hero_list for a hero by name, as follows:

def build_hero_type(hero):
    combine_type = []
    if "hero_type" in hero:
        combine_type.append(hero_type[hero["hero_type"]])
    if "new_type" in hero:
        combine_type.append(hero_type[hero["new_type"]])
    if "hero_type2" in hero:
        combine_type.append(hero_type[hero["hero_type2"]])
    return ('|').join(combine_type)


def search_for_hero_info(name=None):
    for hero in hero_list:
        if "cname" in hero:
            if hero["cname"] == name:
                return hero
    return None

The simple use of these two functions is as follows :

su_lie = search_for_hero_info(" Su lie ")
print(su_lie)
hero_detail = search_for_hero_info(gen_heros[0][0])
print(hero_detail)
hero_detail["skin_name"].strip("\n'")
build_hero_type(hero_detail)

Output is as follows :

{'ename': 194, 'cname': ' Su lie ', 'title': ' Unyielding iron wall ', 'pay_type': 10, 'new_type': 0, 'hero_type': 3, 'hero_type2': 1, 'skin_name': ' Unyielding iron wall | Love and peace | The power of toughness | Xuanwuzhi '}
{'ename': 536, 'cname': ' Charlotte ', 'title': ' Rose swordsman ', 'new_type': 1, 'hero_type': 1, 'skin_name': ' Rose swordsman '}
' warrior | warrior '

Now the function that merges the two lists:

def merge_hero_info(hero_html, hero_json):
    all_heros = []
    for hero in hero_html:
        hero_detail = search_for_hero_info(hero[0])
        all_heros.append([hero[0],
                          build_hero_type(hero_detail),
                          hero_detail.get("skin_name", '').strip("\n'"),
                          hero[1]])
    return all_heros

Use this function to merge the two lists as follows :

combined_heros = []
combined_heros = merge_hero_info(gen_heros, hero_list)
combined_heros[:5]

Output :

[[' Charlotte ', ' warrior | warrior ', ' Rose swordsman ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/536/536.jpg'], [' Agudor ', ' tanks | All ', ' The son of the forest ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/533/533.jpg'], [' Meng tien ', ' warrior | All ', ' Order rules | Order dragon Hunter ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/527/527.jpg'], [' mirror ', ' The assassin | All ', ' The blade of the broken mirror | Ice blade Wonderland ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/531/531.jpg'], [' Mengyu ', ' striker | All ', ' Hot gun boy | Guixu dream performance ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/524/524.jpg']]
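Since combined_heros is a list of fixed-length records, it can also be loaded into a pandas DataFrame for filtering or export; a sketch, with column names chosen here only for illustration:

import pandas as pd

hero_df = pd.DataFrame(combined_heros, columns=["name", "type", "skins", "img_url"])
hero_df.head()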

4. Index

Now let's go a step further and build an index for fast lookup, so that heroes can be searched by hero name, hero type or hero skin. For example, when the input keyword is 'Su lie', results like the following can be returned:

[[' Su lie ', [[' Su lie ', ' tanks | warrior | warrior ', ' Unyielding iron wall | Love and peace ', 'http://game.gtimg.cn/images/yxzj/img201606/heroimg/194/194.jpg']]],
 [' tanks ', [[' Su lie ', ' tanks | warrior | warrior ', ' Unyielding iron wall | Love and peace ', 'http://game.gtimg.cn/images/yxzj/img201606/heroimg/194/194.jpg'],
              [' armoured ', ' warrior | All | tanks ', ' Break the edge | Lord of dragon Kingdom ', 'http://game.gtimg.cn/images/yxzj/img201606/heroimg/193/193.jpg'],
              ...]],
 ...]

First, generate the keyword list from a hero's information:

# Generate the keyword list from the hero information
def get_keywords_array(hero):
    keywords = []
    if hero[0]:
        keywords.append(hero[0])
    if hero[1]:
        keywords += hero[1].split('|')
    if hero[2]:
        keywords += hero[2].split('|')
    return keywords


get_keywords_array(combined_heros[12])

Output :

[' Pig eight quit ', ' tanks ', ' All ', ' Worry free warrior ', ' May you always get more than you wish for ']

Then implement the functions that add entries to the index and build the search list:

# Add a piece of search data to the index
def add_to_index(index, keyword, info):
    for entry in index:
        if entry[0] == keyword:
            entry[1].append(info)
            return
    # not found: create a new entry
    index.append([keyword, [info]])


# Build the search data list
def build_up_index(index_array):
    for hero_info in combined_heros:
        keywords = get_keywords_array(hero_info)
        for key in keywords:
            add_to_index(index_array, key, hero_info)

Finally, implement the function that looks up information in the index by keyword:

# Search the index by keyword
def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    # not found
    return None

The retrieval test is as follows :

search_index = []
build_up_index(search_index)
lookup(search_index, " The assassin ")

Output :

[[' mirror ', ' The assassin | All ', ' The blade of the broken mirror | Ice blade Wonderland ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/531/531.jpg'], [' d ', ' warrior | All | The assassin ', '', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/518/518.jpg'], [' The king of clouds ', ' The assassin | All | warrior ', ' The eye of Horus ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/506/506.jpg'], [' Waner shangguan ', ' Master | All | The assassin ', ' Jinghong's pen | Xiuzhu Mohist ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/513/513.jpg'], [' Sima yi ', ' The assassin | All | Master ', ' The dying heart | Master of nightmare language ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/137/137.jpg'], ... [' Han xin ', ' The assassin | All ', ' a state scholar of no equal | Street Fighter | The Vatican envoy | White dragon chant | The shadow of dreams ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/150/150.jpg'], [' The sable cicada ', ' Master | All | The assassin ', ' The best dancer | Exotic dancer | Christmas Love Song | The sound of dreams | Midsummer night dream ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/141/141.jpg'], [' Li Bai ', ' The assassin | All ', ' Green lotus Sword Fairy | Van Helsing | The fox of the Millennium | The Phoenix courted her | Sharpness ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/131/131.jpg'], [' Ako ', ' The assassin | All ', ' The blade of faith | Love care | Night owl | Fatal glamour | Rhythm heat wave ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/116/116.jpg'], [' zhaoyun ', ' warrior | All | The assassin ', ' The sky is full of dragons | endure ● Burning shadow | The future era | The Royal admiral | Hip hop King | Deacon white | The heart of the engine ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/107/107.jpg']]

You can see that heroes whose name, type or skin contains 'The assassin' are all retrieved.

At this point, check the data structure after the index is established :

display(len(search_index),search_index[4])

Output :

446[' tanks ', [[' Agudor ', ' tanks | All ', ' The son of the forest ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/533/533.jpg'], [' Pig eight quit ', ' tanks | All ', ' Worry free warrior | May you always get more than you wish for ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/511/511.jpg'], [' The goddess of the moon ', ' Master | All | tanks ', ' Princess hanyue | Reflection of dew flower ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/515/515.jpg'], [' Sun ce ', ' tanks | All | warrior ', ' The sea of light | The journey of the sea | Dog and cat Diaries ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/510/510.jpg'], [' Dream, ', ' tanks | All ', ' Dream spirit | A dream come true | Fat Da Rong Rong ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/198/198.jpg'], ... [' White ', ' tanks | All ', ' last generation | White death | Ferocious | Prince of starry night ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/120/120.jpg'], [' Zhong Wu Yan ', ' warrior | All | tanks ', ' The hammer of barbarism | Biochemical alert | The hammer of the king | Beach beauty ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/117/117.jpg'], [' Liu chan ', ' auxiliary | All | tanks ', ' Riot mechanism | Ying meow looks out on the wild | Gentleman bear meow | The talented goalkeeper ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/114/114.jpg'], [' ZhuangZhou ', ' auxiliary | All | tanks ', ' Happy dream | The dream of carp | Mirage King | Cloud dream builder ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/113/113.jpg'], [' Lian po ', ' tanks | All ', ' Justice booms | Hell rock soul ', 'https://game.gtimg.cn/images/yxzj/img201606/heroimg/105/105.jpg']]]

You can see that all keywords are indexed, so the length of the index is much larger than the number of heroes.
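The index built above is a plain list that lookup scans linearly. For larger data sets, the same idea could use a dictionary keyed by keyword, giving constant-time lookups on average; a sketch under that assumption (not the article's original structure):

from collections import defaultdict

# keyword -> list of hero records
dict_index = defaultdict(list)
for hero_info in combined_heros:
    for key in get_keywords_array(hero_info):
        dict_index[key].append(hero_info)

dict_index.get(" The assassin ", [])  # same keyword form as the lookup call above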

Summary

Crawling is one of the most widely used applications of Python: it lets you quickly obtain large amounts of data from web pages. Python provides many libraries for acquiring, extracting and processing network data, such as requests, selenium, BeautifulSoup, re, jieba and wordcloud; using these tools reasonably and flexibly makes crawler development efficient.

This article was first published in the blog column Data analysis and forwarded by me to https://www.helloworld.net/p/8kovtpmS6wF8X; other platforms carrying it are infringing. Click https://blog.csdn.net/CUFEECR/article/details/108907733 to view the original, or click https://blog.csdn.net/CUFEECR to browse more quality original content.

