Python crawler: setting a Cookie to get past website blocking and crawl Mayi (Ant) Short Rent

Love to learn 2021-02-21 11:22:15


Preface

The text and images in this article come from the Internet and are for learning and communication only, not for any commercial purpose. The copyright belongs to the original author. If you have any questions, please contact us promptly so we can handle them.

Author: Eastmount

When writing Python crawlers, we sometimes run into anti-crawling measures such as a website refusing access. For example, when we try to crawl listings from Mayi Short Rent (mayi.com), the site responds with the notice "The current visit is suspected to be a hacker attack and has been blocked by the webmaster", as shown in the figure below. In that case we need to set a Cookie in order to crawl the site, and this article explains how in detail. Many thanks to my student Cheng Feng for the idea; the new waves push the old ones forward!

One. Website analysis and crawler blocking

When we open the Mayi Short Rent search page for Guiyang City, the result looks like the figure below. The short-term rental listings follow a regular layout, and that is the data we want to crawl.

Inspecting the elements in the browser, we can see that every rental listing we need to crawl sits under a <dd></dd> node.

The name of each listing, as shown in the figure below, is located under the <div class="room-detail clearfloat"></div> node.

Now let's write a simple BeautifulSoup crawler.
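To make the node layout concrete, here is a minimal offline sketch. It is not the article's crawler: it uses Python 3's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra installs, and the HTML snippet, the listing names, and the class name `RoomNameParser` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet that imitates the layout described above
SAMPLE_HTML = """
<dl>
  <dd>
    <div class="room-detail clearfloat"><p>Sample listing A</p></div>
  </dd>
  <dd>
    <div class="room-detail clearfloat"><p>Sample listing B</p></div>
  </dd>
</dl>
"""


class RoomNameParser(HTMLParser):
    """Collect the text of the first <p> inside each room-detail <div> under a <dd>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_dd = False
        self._in_detail = False
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "dd":
            self._in_dd = True
        elif tag == "div" and self._in_dd and "room-detail" in classes:
            self._in_detail = True
        elif tag == "p" and self._in_detail:
            self._in_p = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.names.append("".join(self._buffer).strip())
            self._in_p = False
            self._in_detail = False  # only the first <p> holds the name
        elif tag == "dd":
            self._in_dd = False
            self._in_detail = False


parser = RoomNameParser()
parser.feed(SAMPLE_HTML)
print(parser.names)
```

With BeautifulSoup the same lookup is a `find_all('dd')` loop followed by a search for the `room-detail clearfloat` class, which is what the article's own code does.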

```python
# -*- coding: utf-8 -*-
# Python 2 script (urllib.urlopen and the print statement)
import urllib
from bs4 import BeautifulSoup

url = 'http://www.mayi.com/guiyang/?map=no'
response = urllib.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print soup.title
print soup

# Name of each short-term rental
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print u'[Short-term rental name]', fname.replace('\n', '').strip()
```

Unfortunately, the script fails with an error, which shows that the site's anti-crawler measures are working.

Two. A BeautifulSoup crawler that sets a Cookie

The code with a request header added is shown below. The code and its results come first; after that I'll show how to obtain the Cookie.

```python
# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement)
import urllib2
import re
from bs4 import BeautifulSoup


# Crawler function: fetch one search-result page and print every listing on it
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Name of the short-term rental
        for name in tag.find_all(attrs={"class": "room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[Short-term rental name]', fname.replace('\n', '').strip()
        # Price of the short-term rental
        for price in tag.find_all(attrs={"class": "moy-b"}):
            string = price.find('p').get_text()
            # Strip the currency sign and keep at most five characters
            fprice = re.sub("[¥]+".decode("utf8"), "".decode("utf8"), string)
            fprice = fprice[0:5]
            print u'[Short-term rental price]', fprice.replace('\n', '').strip()
            # Rating, number of reviews, and occupancy
            fscore = name.find('ul').get_text()
            print u'[Rating / reviews / occupancy]', fscore.replace('\n', '').strip()
            # Link to the detail page
            url_dzf = tag.find(attrs={"target": "_blank"})
            urls = url_dzf.attrs['href']
            print u'[Web link]', urls.replace('\n', '').strip()
            urlss = 'http://www.mayi.com' + urls
            print urlss


# Main function
if __name__ == '__main__':
    i = 1
    while i < 10:
        print u'Page number', i
        url = 'http://www.mayi.com/guiyang/' + str(i) + '/?map=no'
        gydzf(url)
        i = i + 1
    else:
        print u"end"
```

The output is shown below:

```
Page number 1
[Short-term rental name] Tang Dongyuan Fortune Plaza -- City simple duplex B&B
[Short-term rental price] 298
[Rating / reviews / occupancy] 5.0 points · 5 reviews · 2 bedrooms · sleeps 3
[Web link] /room/851634765
http://www.mayi.com/room/851634765
[Short-term rental name] Tang Dongyuan Fortune Plaza -- Fresh lemon duplex B&B
[Short-term rental price] 568
[Rating / reviews / occupancy] 2 reviews · 3 bedrooms · sleeps 6
[Web link] /room/851634467
http://www.mayi.com/room/851634467

...

Page number 9
[Short-term rental name] [Next to the park of the north high-speed railway station] American style + Super large and comfortable
[Short-term rental price] 366
[Rating / reviews / occupancy] 3 reviews · 2 bedrooms · sleeps 5
[Web link] /room/851018852
http://www.mayi.com/room/851018852
[Short-term rental name] Dayingpo (near the international shopping center of CUHK) Nordic small fresh three rooms
[Short-term rental price] 298
[Rating / reviews / occupancy] 3 bedrooms · sleeps 6
[Web link] /room/851647045
http://www.mayi.com/room/851647045
```
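For readers on Python 3, where `urllib2` has been folded into `urllib.request`, the header setup might look roughly like the sketch below. It only builds the request object and never contacts the site. The helper names `build_request` and `page_url` are hypothetical, and the optional `cookie` argument stands in for the Cookie header this article is about.

```python
import urllib.request

# User-Agent string copied from the article's crawler
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)


def page_url(page):
    """Build the search-result URL for a page, following the article's URL pattern."""
    return "http://www.mayi.com/guiyang/" + str(page) + "/?map=no"


def build_request(url, user_agent=DEFAULT_USER_AGENT, cookie=None):
    """Return a Request with browser-like headers; nothing is sent yet."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


req = build_request(page_url(1))
print(req.full_url)
# urllib.request stores header names via str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen(req)` would actually send the request; that step is left out here so the sketch runs offline.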

Next we want to fetch each listing's detail page. Here I mainly describe how to find the Cookie. Open the page in the browser, right-click and choose "Inspect", then refresh the page. Find the page's request under "Network" and click it; the information we need is in the Headers panel that pops up. The two most commonly used parameters are Cookie and User-Agent, as shown in the figure below. Then set these parameters in the Python code and submit the request with urllib2.Request(). The core code is as follows:

<section id="nice" data-tool="mdnice Editor " data-website="https://www.mdnice.com" style="font-size: 16px; color: black; padding: 0 10px; line-height: 1.6; word-spacing: 0px; letter-spacing: 0px; word-break: break-word; word-wrap: break-word; text-align: left; font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, 'PingFang SC', Cambria, Cochin, Georgia, Times, 'Times New Roman', serif;"><pre class="custom" data-tool="mdnice Editor " style="margin-top: 10px; margin-bottom: 10px; border-radius: 5px; box-shadow: rgba(0, 0, 0, 0.55) 0px 2px 10px;"><span style="display: block; background: url(https://files.mdnice.com/point.png); height: 30px; width: 100%; background-size: 40px; background-repeat: no-repeat; background-color: #282c34; margin-bottom: -7px; border-radius: 5px; background-position: 10px 10px;"></span><code class="hljs" style="overflow-x: auto; padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px; -webkit-overflow-scrolling: touch; padding-top: 15px; background: #282c34; border-radius: 5px;">&nbsp;user_agent=<span class="hljs-string" style="color: #98c379; line-height: 26px;">"Mozilla/5.0&nbsp;(Windows&nbsp;NT&nbsp;10.0;&nbsp;Win64;&nbsp;x64)&nbsp;...&nbsp;Chrome/61.0.3163.100&nbsp;Safari/537.36"</span><br>&nbsp;cookie=<span class="hljs-string" style="color: #98c379; line-height: 26px;">"mediav=%7B%22eid%22%3A%22387123...b3574ef2-21b9-11e8-b39c-1bc4029c43b8"</span><br>&nbsp;headers={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"User-Agent"</span>:user_agent,<span class="hljs-string" style="color: #98c379; line-height: 26px;">"Cookie"</span>:cookie}<br>&nbsp;request=urllib2.Request(url,headers=headers)<br>&nbsp;response=urllib2.urlopen(request)<br>&nbsp;contents&nbsp;=&nbsp;response.read()<br>&nbsp;soup&nbsp;=&nbsp;BeautifulSoup(contents,&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 
26px;">"html.parser"</span>)<br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;tag1&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;soup.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"main"</span>}):<br></code></pre> </section> Be careful , Every hour Cookie It will be updated once , We need to modify it manually Cookie value , That's the code above cookie Variables and user_agent Variable . The full code is shown below : <section id="nice" data-tool="mdnice Editor " data-website="https://www.mdnice.com" style="font-size: 16px; color: black; padding: 0 10px; line-height: 1.6; word-spacing: 0px; letter-spacing: 0px; word-break: break-word; word-wrap: break-word; text-align: left; font-family: Optima-Regular, Optima, PingFangSC-light, PingFangTC-light, 'PingFang SC', Cambria, Cochin, Georgia, Times, 'Times New Roman', serif;"><pre class="custom" data-tool="mdnice Editor " style="margin-top: 10px; margin-bottom: 10px; border-radius: 5px; box-shadow: rgba(0, 0, 0, 0.55) 0px 2px 10px;"><span style="display: block; background: url(https://files.mdnice.com/point.png); height: 30px; width: 100%; background-size: 40px; background-repeat: no-repeat; background-color: #282c34; margin-bottom: -7px; border-radius: 5px; background-position: 10px 10px;"></span><code class="hljs" style="overflow-x: auto; padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px; -webkit-overflow-scrolling: touch; padding-top: 15px; background: #282c34; border-radius: 5px;">import&nbsp;urllib2<br>import&nbsp;re<br>from&nbsp;bs4&nbsp;import&nbsp;BeautifulSoup<br>import&nbsp;codecs<br>import&nbsp;csv<br>&nbsp;<br>&nbsp;<br>c&nbsp;=&nbsp;open(<span class="hljs-string" style="color: #98c379; line-height: 
26px;">"ycf.csv"</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">"wb"</span>)&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;">#write&nbsp; Write </span><br>c.write(codecs.BOM_UTF8)<br>writer&nbsp;=&nbsp;csv.writer(c)<br>writer.writerow([<span class="hljs-string" style="color: #98c379; line-height: 26px;">" The name of the short rental house "</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">" Address "</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">" Price "</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">" score "</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">" The number of people who can live "</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">" Per capita price "</span>])<br>&nbsp;<br>&nbsp;<br><span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Crawling for details </span><br>def&nbsp;getInfo(url,fname,fprice,fscore,users):<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Use browser developer mode to view the user_agent And cookie Set the access header (headers) Avoid anti crawlers , And run every other period of time according to cookie Change... 
In the code cookie</span><br>&nbsp;user_agent=<span class="hljs-string" style="color: #98c379; line-height: 26px;">"Mozilla/5.0&nbsp;(Windows&nbsp;NT&nbsp;10.0;&nbsp;Win64;&nbsp;x64)&nbsp;AppleWebKit/537.36&nbsp;(KHTML,&nbsp;like&nbsp;Gecko)&nbsp;Chrome/61.0.3163.100&nbsp;Safari/537.36"</span><br>&nbsp;cookie=<span class="hljs-string" style="color: #98c379; line-height: 26px;">"mediav=%7B%22eid%22%3A%22387123%22eb7;&nbsp;mayi_uuid=1582009990674274976491;&nbsp;sid=42200298656434922.85.130.130"</span><br>&nbsp;headers={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"User-Agent"</span>:user_agent,<span class="hljs-string" style="color: #98c379; line-height: 26px;">"Cookie"</span>:cookie}<br>&nbsp;request=urllib2.Request(url,headers=headers)<br>&nbsp;response=urllib2.urlopen(request)<br>&nbsp;contents&nbsp;=&nbsp;response.read()<br>&nbsp;soup&nbsp;=&nbsp;BeautifulSoup(contents,&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">"html.parser"</span>)<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Short rental address </span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;tag1&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;soup.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"main"</span>}):<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">' Short rental address :'</span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;tag2&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;tag1.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 
26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"desWord"</span>}):<br>&nbsp;address&nbsp;=&nbsp;tag2.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'p'</span>).get_text()<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;address<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># The number of people who can live </span><br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">' The number of people who can live :'</span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;tag4&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;tag1.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"w258"</span>}):<br>&nbsp;yy&nbsp;=&nbsp;tag4.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'span'</span>).get_text()<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;yy<br>&nbsp;fname&nbsp;=&nbsp;fname.encode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf-8"</span>)<br>&nbsp;address&nbsp;=&nbsp;address.encode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf-8"</span>)<br>&nbsp;fprice&nbsp;=&nbsp;fprice.encode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf-8"</span>)<br>&nbsp;fscore&nbsp;=&nbsp;fscore.encode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf-8"</span>)<br>&nbsp;fpeople&nbsp;=&nbsp;yy[2:3].encode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf-8"</span>)<br>&nbsp;ones&nbsp;=&nbsp;int(<span 
class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">float</span>(fprice))/int(<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">float</span>(fpeople))<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Store locally </span><br>&nbsp;writer.writerow([fname,address,fprice,fscore,fpeople,ones])<br>&nbsp;<br>&nbsp;<br><span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Crawler functions </span><br>def&nbsp;gydzf(url):<br>&nbsp;user_agent=<span class="hljs-string" style="color: #98c379; line-height: 26px;">"Mozilla/5.0&nbsp;(Windows&nbsp;NT&nbsp;10.0;&nbsp;Win64;&nbsp;x64)&nbsp;AppleWebKit/537.36&nbsp;(KHTML,&nbsp;like&nbsp;Gecko)&nbsp;Chrome/51.0.2704.103&nbsp;Safari/537.36"</span><br>&nbsp;headers={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"User-Agent"</span>:user_agent}<br>&nbsp;request=urllib2.Request(url,headers=headers)<br>&nbsp;response=urllib2.urlopen(request)<br>&nbsp;contents&nbsp;=&nbsp;response.read()<br>&nbsp;soup&nbsp;=&nbsp;BeautifulSoup(contents,&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">"html.parser"</span>)<br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;tag&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;soup.find_all(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'dd'</span>):<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># The name of the short rental house </span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;name&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;tag.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; 
line-height: 26px;">"room-detail&nbsp;clearfloat"</span>}):<br>&nbsp;fname&nbsp;=&nbsp;name.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'p'</span>).get_text()<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">'[ The name of the short rental house ]'</span>,&nbsp;fname.replace(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'\n'</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">''</span>).strip()<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Short rental prices </span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;price&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;tag.find_all(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"class"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"moy-b"</span>}):<br>&nbsp;string&nbsp;=&nbsp;price.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'p'</span>).get_text()<br>&nbsp;fprice&nbsp;=&nbsp;re.sub(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"[¥]+"</span>.decode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf8"</span>),&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">""</span>.decode(<span class="hljs-string" style="color: #98c379; line-height: 26px;">"utf8"</span>),string)<br>&nbsp;fprice&nbsp;=&nbsp;fprice[0:5]<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">'[ Short rental prices ]'</span>,&nbsp;fprice.replace(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'\n'</span>,<span class="hljs-string" 
style="color: #98c379; line-height: 26px;">''</span>).strip()<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Rating and number of comments </span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">for</span>&nbsp;score&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">in</span>&nbsp;name.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'ul'</span>):<br>&nbsp;fscore&nbsp;=&nbsp;name.find(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'ul'</span>).get_text()<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">'[ Short rental rating / Comment on / The number of residents ]'</span>,&nbsp;fscore.replace(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'\n'</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">''</span>).strip()<br>&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Web link url</span><br>&nbsp;url_dzf&nbsp;=&nbsp;tag.find(attrs={<span class="hljs-string" style="color: #98c379; line-height: 26px;">"target"</span>:<span class="hljs-string" style="color: #98c379; line-height: 26px;">"_blank"</span>})<br>&nbsp;urls&nbsp;=&nbsp;url_dzf.attrs[<span class="hljs-string" style="color: #98c379; line-height: 26px;">'href'</span>]<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">'[ Web link ]'</span>,&nbsp;urls.replace(<span class="hljs-string" style="color: #98c379; line-height: 26px;">'\n'</span>,<span class="hljs-string" style="color: #98c379; line-height: 26px;">''</span>).strip()<br>&nbsp;urlss&nbsp;=&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 
26px;">'http://www.mayi.com'</span>&nbsp;+&nbsp;urls&nbsp;+&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">''</span><br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;urlss<br>&nbsp;getInfo(urlss,fname,fprice,fscore,user_agent)<br>&nbsp;<br><span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Main function</span><br><span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">if</span>&nbsp;__name__&nbsp;==&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">'__main__'</span>:<br>&nbsp;i&nbsp;=&nbsp;0<br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">while</span>&nbsp;i&lt;33:<br>&nbsp;<span class="hljs-built_in" style="color: #e6c07b; line-height: 26px;">print</span>&nbsp;u<span class="hljs-string" style="color: #98c379; line-height: 26px;">' Page number '</span>,&nbsp;(i+1)<br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">if</span>(i==0):<br>&nbsp;url&nbsp;=&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">'http://www.mayi.com/guiyang/?map=no'</span><br>&nbsp;<span class="hljs-keyword" style="color: #c678dd; line-height: 26px;">if</span>(i&gt;0):<br>&nbsp;num&nbsp;=&nbsp;i+2&nbsp;<span class="hljs-comment" style="color: #5c6370; font-style: italic; line-height: 26px;"># Page 1 has no number in its URL; later pages are numbered in sequence</span><br>&nbsp;url&nbsp;=&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">'http://www.mayi.com/guiyang/'</span>&nbsp;+&nbsp;str(num)&nbsp;+&nbsp;<span class="hljs-string" style="color: #98c379; line-height: 26px;">'/?map=no'</span><br>&nbsp;gydzf(url)<br>&nbsp;i=i+1<br>&nbsp;<br>c.close()<br></code></pre> </section>

The output is shown below; the results are stored in a local CSV file:

![file](https://oscimg.oschina.net/oscnet/up-a99c96e8e914edcc1997d6ee51516f0fb84.png)
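The paging loop and the CSV output in the script above can be sketched in Python 3 roughly as follows. This is only a minimal sketch, not the author's code: `build_page_urls` and `write_rows` are hypothetical helper names, the CSV column names are assumptions, and the page-number rule (`i + 2`) is copied from the original loop without being re-checked against the site.

```python
import csv
import io

BASE = "http://www.mayi.com/guiyang/"


def build_page_urls(pages=33):
    """Return the listing URLs in the same order as the original while loop."""
    urls = []
    for i in range(pages):
        if i == 0:
            # First page: no page number in the path
            urls.append(BASE + "?map=no")
        else:
            # Later pages: the original code uses num = i + 2
            urls.append(BASE + str(i + 2) + "/?map=no")
    return urls


def write_rows(rows, out):
    """Write (name, price, score, url) tuples as CSV to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["name", "price", "score", "url"])  # assumed header
    writer.writerows(rows)


# Demo: one placeholder row (not real data) written to memory instead of a file
buffer = io.StringIO()
write_rows([("Sample room", "150", "5.0", "http://www.mayi.com/placeholder")], buffer)
```

To write a real file instead, pass a handle opened with `open("result.csv", "w", newline="", encoding="utf-8")` and close it when the crawl finishes, as the original script does with `c.close()`.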
Meanwhile, you can also try crawling Mayi short-term rentals with Selenium, which should work as well. Finally, I hope this article helps you. If it has any shortcomings, please bear with me.
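For the Selenium route, a minimal sketch might look like the following. It assumes Selenium 4 and a local Chrome installation, and it assumes the page still uses the `room-detail clearfloat` class from the BeautifulSoup version. `collect_room_names` and `run_with_chrome` are hypothetical names, and none of this has been run against the live site.

```python
# String values of selenium's By.CSS_SELECTOR and By.TAG_NAME locators
CSS_SELECTOR = "css selector"
TAG_NAME = "tag name"


def collect_room_names(driver, url):
    """Open a listing page and return the text of the first <p> in each room block."""
    driver.get(url)
    names = []
    # "room-detail clearfloat" is the class pair used in the BeautifulSoup version
    for block in driver.find_elements(CSS_SELECTOR, "div.room-detail.clearfloat"):
        names.append(block.find_element(TAG_NAME, "p").text.strip())
    return names


def run_with_chrome(url="http://www.mayi.com/guiyang/?map=no"):
    """Hypothetical entry point; needs selenium and a local Chrome install."""
    from selenium import webdriver  # imported lazily so the helper stays importable

    driver = webdriver.Chrome()
    try:
        return collect_room_names(driver, url)
    finally:
        # Always close the browser, even if a lookup fails
        driver.quit()
```

Calling `run_with_chrome()` launches a real browser, so the page is fetched with whatever cookies and headers Chrome sends by default, which is why this route may get past the interception without setting a Cookie header by hand.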

* Statement: This article was compiled from the Internet, and the copyright belongs to the original author. If the source information is wrong or infringes any rights, please contact us for deletion or authorization.

Copyright notice
This article was created by [Love to learn]. Please include a link to the original when reposting. Thank you.
https://pythonmana.com/2021/02/20210221112115776l.html
