Python Web Scraping: Setting a Cookie to Get Past Site Blocking and Crawl Mayi Duanzu (蚂蚁短租)

爱学习的豆包 2021-02-21 11:21:54
Python selenium BeautifulSoup github dwz


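Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial use. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can handle it.

Author: Eastmount

When writing Python crawlers, we sometimes run into anti-scraping measures such as a site refusing access. For example, when we try to scrape data from Mayi Duanzu (蚂蚁短租, a short-term rental site), it responds with the message “当前访问疑似黑客攻击,已被网站管理员设置为拦截” ("This visit is suspected to be a hacker attack and has been blocked by the site administrator"). In this case we need to set a Cookie in order to crawl the site, and the steps are described in detail below. Many thanks to my student Chengfeng (承峰) for the idea; the younger generation really is catching up fast!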

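1. Site Analysis and Crawler Blocking

When we open Mayi Duanzu and search for Guiyang, the site returns a page of search results. The short-term rental listings on that page follow a regular layout, and these listings are the information we want to scrape.

Using the browser's Inspect Element tool, we can see that each rental listing we need to scrape sits under a <dd></dd> node.

The listing name, in turn, is located under a <div class="room-detail clearfloat"></div> node.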
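To make the node structure concrete, here is a small offline sketch in Python 3. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and it matches only the `room-detail clearfloat` div, not the surrounding `<dd>`. `SAMPLE_HTML` is a hypothetical, simplified stand-in for the real page, and `ListingNameParser` and `extract_names` are illustrative names, not part of the original article.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the listing markup described above.
SAMPLE_HTML = """
<dl>
  <dd>
    <div class="room-detail clearfloat">
      <p>Sample listing A</p>
      <ul><li>5.0</li></ul>
    </div>
  </dd>
  <dd>
    <div class="room-detail clearfloat">
      <p>Sample listing B</p>
    </div>
  </dd>
</dl>
"""


class ListingNameParser(HTMLParser):
    """Collect the text of the first <p> inside each room-detail <div>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._div_depth = 0       # > 0 while inside a room-detail div
        self._in_p = False        # True while reading the name paragraph
        self._took_name = False   # True once the current div's first <p> is read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._div_depth:
                self._div_depth += 1          # nested div inside the listing
            elif "room-detail" in classes and "clearfloat" in classes:
                self._div_depth = 1
                self._took_name = False
        elif tag == "p" and self._div_depth and not self._took_name:
            self._in_p = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "p" and self._in_p:
            self._in_p = False
            self._took_name = True

    def handle_data(self, data):
        if self._in_p:
            self.names[-1] += data


def extract_names(html):
    """Return the listing names found in the given HTML text."""
    parser = ListingNameParser()
    parser.feed(html)
    return [name.strip() for name in parser.names]


print(extract_names(SAMPLE_HTML))
```

The real page is likely to contain more markup than this sample, so the class names are the only part taken from the article.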

Next, we write a simple BeautifulSoup crawler.

```python
# -*- coding: utf-8 -*-
# Python 2 code: fetch the search page without any custom headers.
import urllib
import re
from bs4 import BeautifulSoup
import codecs

url = 'http://www.mayi.com/guiyang/?map=no'
response = urllib.urlopen(url)
contents = response.read()
soup = BeautifulSoup(contents, "html.parser")
print soup.title
print soup
# Listing names (the label 短租房名称 means "short-term rental name")
for tag in soup.find_all('dd'):
    for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
        fname = name.find('p').get_text()
        print u'[短租房名称]', fname.replace('\n','').strip()
```

Unfortunately, this fails with an error, which shows that Mayi Duanzu's anti-scraping protection is fairly effective.
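Before parsing a response, a crawler can check whether it received the interception page instead of real listings. The following Python 3 sketch assumes the blocked page contains the message quoted in the preface; the exact wording the site returns may differ, and `looks_blocked` is a hypothetical helper name.

```python
# Fragments of the interception message quoted in the article.
# Assumption: the blocked page contains at least one of them verbatim.
BLOCK_MARKERS = ("当前访问疑似黑客攻击", "已被网站管理员设置为拦截")


def looks_blocked(html_text):
    """Return True if the page text contains a known interception marker."""
    return any(marker in html_text for marker in BLOCK_MARKERS)
```

Calling `looks_blocked(contents)` right after decoding the response body would let the crawler stop early and prompt for fresh headers instead of silently printing nothing.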

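2. A BeautifulSoup Crawler with Cookie Settings

The code with request headers added is shown below. We give the code and its results first, and then explain how to obtain the Cookie.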

```python
# -*- coding: utf-8 -*-
# Python 2 code: the same crawler, now sending a browser-like User-Agent header.
import urllib2
import re
from bs4 import BeautifulSoup


# Crawler function: print name, price, summary, and link for every listing on a page
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent":user_agent}
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Listing name
        for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[短租房名称]', fname.replace('\n','').strip()
        # Listing price
        for price in tag.find_all(attrs={"class":"moy-b"}):
            string = price.find('p').get_text()
            # Strip the currency sign, then keep at most five characters
            fprice = re.sub("[¥]+".decode("utf8"), "".decode("utf8"),string)
            fprice = fprice[0:5]
            print u'[短租房价格]', fprice.replace('\n','').strip()
            # Score, review count, and capacity (all inside the <ul>)
            fscore = name.find('ul').get_text()
            print u'[短租房评分/评论/居住人数]', fscore.replace('\n','').strip()
            # Link to the detail page
            url_dzf = tag.find(attrs={"target":"_blank"})
            urls = url_dzf.attrs['href']
            print u'[网页链接]', urls.replace('\n','').strip()
            urlss = 'http://www.mayi.com' + urls
            print urlss

# Main: crawl pages 1 to 9
if __name__ == '__main__':
    i = 1
    while i<10:
        print u'页码', i
        url = 'http://www.mayi.com/guiyang/' + str(i) + '/?map=no'
        gydzf(url)
        i = i+1
    else:
        print u"结束"
```

The output looks like this:

```
页码 1
[短租房名称] 大唐东原财富广场--城市简约复式民宿
[短租房价格] 298
[短租房评分/评论/居住人数] 5.0分·5条评论·二居·可住3人
[网页链接] /room/851634765
http://www.mayi.com/room/851634765
[短租房名称] 大唐东原财富广场--清新柠檬复式民宿
[短租房价格] 568
[短租房评分/评论/居住人数] 2条评论·三居·可住6人
[网页链接] /room/851634467
http://www.mayi.com/room/851634467

...

页码 9
[短租房名称] 【高铁北站公园旁】美式风情+超大舒适安逸
[短租房价格] 366
[短租房评分/评论/居住人数] 3条评论·二居·可住5人
[网页链接] /room/851018852
http://www.mayi.com/room/851018852
[短租房名称] 大营坡(中大国际购物中心附近)北欧小清新三室
[短租房价格] 298
[短租房评分/评论/居住人数] 三居·可住6人
[网页链接] /room/851647045
http://www.mayi.com/room/851647045
```

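Next, we want to fetch the detail page of each listing. The main point here is the method for finding the Cookie: open the page in a browser, right-click and choose "Inspect", then refresh the page. In the "Network" panel, find the page request and click it; the Headers tab that opens contains this information. The two most commonly used parameters are Cookie and User-Agent. Set these parameters in the Python code and then call urllib2.Request() to submit the request. The core code is as follows: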

```python
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ... Chrome/61.0.3163.100 Safari/537.36"
    cookie = "mediav=%7B%22eid%22%3A%22387123...b3574ef2-21b9-11e8-b39c-1bc4029c43b8"
    headers = {"User-Agent":user_agent,"Cookie":cookie}
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag1 in soup.find_all(attrs={"class":"main"}):
        # ... loop body continues in the complete code below
```

Note that the Cookie is refreshed every hour, so we need to update it by hand: change the cookie variable (and the user_agent variable if needed) in the code above. The complete code is as follows:

```python
# -*- coding: utf-8 -*-
# Python 2 code. The coding line above is required because the source contains Chinese literals.
import urllib2
import re
from bs4 import BeautifulSoup
import codecs
import csv


c = open("ycf.csv","wb")  # open the CSV file for writing
c.write(codecs.BOM_UTF8)  # BOM so that spreadsheet tools detect UTF-8
writer = csv.writer(c)
# Header row: name, address, price, score, capacity, price per person
writer.writerow(["短租房名称","地址","价格","评分","可住人数","人均价格"])


# Scrape the detail page of one listing and write a CSV row
def getInfo(url,fname,fprice,fscore,users):
    # Copy the user_agent and cookie from the browser's developer tools into the
    # request headers to avoid the anti-scraping block. The cookie expires, so
    # refresh it from the browser before running the script again later.
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    cookie = "mediav=%7B%22eid%22%3A%22387123%22eb7; mayi_uuid=1582009990674274976491; sid=42200298656434922.85.130.130"
    headers = {"User-Agent":user_agent,"Cookie":cookie}
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    # Listing address
    for tag1 in soup.find_all(attrs={"class":"main"}):
        print u'短租房地址:'
        for tag2 in tag1.find_all(attrs={"class":"desWord"}):
            address = tag2.find('p').get_text()
            print address
        # Guest capacity
        print u'可住人数:'
        for tag4 in tag1.find_all(attrs={"class":"w258"}):
            yy = tag4.find('span').get_text()
            print yy
        fname = fname.encode("utf-8")
        address = address.encode("utf-8")
        fprice = fprice.encode("utf-8")
        fscore = fscore.encode("utf-8")
        fpeople = yy[2:3].encode("utf-8")  # single digit taken from the capacity text
        # Price per person (integer division in Python 2)
        ones = int(float(fprice))/int(float(fpeople))
        # Save the row to the local CSV file
        writer.writerow([fname,address,fprice,fscore,fpeople,ones])


# Crawler function: walk one search page and fetch details for every listing
def gydzf(url):
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    headers = {"User-Agent":user_agent}
    request = urllib2.Request(url,headers=headers)
    response = urllib2.urlopen(request)
    contents = response.read()
    soup = BeautifulSoup(contents, "html.parser")
    for tag in soup.find_all('dd'):
        # Listing name
        for name in tag.find_all(attrs={"class":"room-detail clearfloat"}):
            fname = name.find('p').get_text()
            print u'[短租房名称]', fname.replace('\n','').strip()
        # Listing price
        for price in tag.find_all(attrs={"class":"moy-b"}):
            string = price.find('p').get_text()
            fprice = re.sub("[¥]+".decode("utf8"), "".decode("utf8"),string)
            fprice = fprice[0:5]
            print u'[短租房价格]', fprice.replace('\n','').strip()
            # Score, review count, and capacity
            fscore = name.find('ul').get_text()
            print u'[短租房评分/评论/居住人数]', fscore.replace('\n','').strip()
            # Link to the detail page
            url_dzf = tag.find(attrs={"target":"_blank"})
            urls = url_dzf.attrs['href']
            print u'[网页链接]', urls.replace('\n','').strip()
            urlss = 'http://www.mayi.com' + urls
            print urlss
            getInfo(urlss,fname,fprice,fscore,user_agent)

# Main: crawl 33 search pages
if __name__ == '__main__':
    i = 0
    while i<33:
        print u'页码', (i+1)
        if(i==0):
            url = 'http://www.mayi.com/guiyang/?map=no'
        if(i>0):
            num = i+2  # the first page has no number in its URL; later pages are numbered in sequence
            url = 'http://www.mayi.com/guiyang/' + str(num) + '/?map=no'
        gydzf(url)
        i=i+1

c.close()
```

The output is shown below and is saved to a local CSV file:

![file](https://oscimg.oschina.net/oscnet/up-a99c96e8e914edcc1997d6ee51516f0fb84.png)

You can also try scraping Mayi Duanzu with Selenium, which should work as well. Finally, I hope this article is helpful to you; if it has shortcomings, please bear with me.

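*Notice: this article was compiled from online sources and copyright belongs to the original author. If the source information is wrong or your rights are infringed, please contact us for removal or authorization.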


Copyright notice
This article was written by [爱学习的豆包]. Please include a link to the original when reposting. Thank you.
https://my.oschina.net/u/4630617/blog/4958194
