Python web crawler
A web-crawler mind map; project files are updated from time to time.
import requests                              # HTTP download tool
from bs4 import BeautifulSoup                # tool for parsing web page data

from scrapy.cmdline import execute           # lets you run scrapy commands from a script
execute(["scrapy", "crawl", "jobbole"])      # runs `scrapy crawl jobbole` inside the scrapy environment

import sys
sys.path.append(path)                        # add the project directory to the module search path

import re                                    # regular expressions, used for data cleaning and normalization
re.compile(r"/view/\d+\.htm")                # pre-compile a URL pattern

import os                                    # OS module, used to build paths for result files, etc.
os.path.abspath(__file__)                    # absolute path of the current file
os.path.dirname(os.path.abspath(__file__))   # directory containing the current file

import codecs                                # used to create result files and write to them

import urllib.request                        # the Python 3 equivalent of urllib2 in Python 2
urllib.request.urlopen(url)                  # open a web page
urllib.request.urlopen(url).getcode()        # get the page's HTTP status code
urllib.parse.urljoin(base_url, url)          # join a relative url onto the base url
parse.urljoin(response.url, url)             # in scrapy, join a url onto the response's base url
Python modules
Logic for writing to a JSON file
import codecs
import json

class ScrapyJsonPipeline(object):
    def __init__(self):
        self.file = codecs.open('scrapy_jobbole.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()
Generating an MD5 hash
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/9/6 17:58
# @Author: jecht
# @File  : get_md5.py
import hashlib

def get_md5(url):
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

if __name__ == "__main__":
    print(get_md5("http://jobbole.com"))
Setting file paths
Add the directory one level above the current file's directory to the path
path = os.path.dirname(os.path.abspath(os.path.dirname(__file__)))
sys.path.insert(0, path)
Get the directory path of the current file (useful when non-absolute paths are in use)
path1 = os.path.abspath(os.path.dirname(__file__))
Join to get the absolute path of a file in that directory
os.path.join(path1,'filename')
Filtering a list with filter()
page_urls = filter(lambda x: x.startswith("https"), page_urls)
Turning list elements into a comma-separated string
raink_item = ",".join(str(i.strip()) for i in self["raink_item"])
Converting between dates, times, and timestamps
Timestamp to date
datetime.datetime.fromtimestamp(self["ans_create_time"]).strftime("%Y/%m/%d %H:%M:%S")
String time to datetime
datetime.datetime.strptime(date_value, "%Y/%m/%d").date()
# e.g. 2017/11/11
datetime.datetime.strptime(date_value, "%Y/%m/%d %H:%M:%S")
# e.g. 2017/11/11 17:00:00
Chinese year-month-day string to datetime
datetime.datetime.strptime(date_value, "%Y年%m月%d日").date()
# e.g. 2017年11月11日
Data type conversions
str to int (with thousands separators removed via replace)
int(get_number("".join(self["comment_num"])).replace(",", ""))
Converting a tuple or list to str
"".join(list(x))
Building a dict from two lists
a = list()
b = list()
c = dict(zip(a, b))
for (x, y) in zip(a, b):
    print(x, y)   # iterate over the paired elements
Generating random numbers
from random import randint
randint(0, 20)   # random integer from 0 to 20; randint includes both endpoints
Crawler architecture
URL manager
Prevents duplicate crawling (minimum feature set):
1. Add new URLs to the to-crawl collection
2. Check whether a URL to be added is already in the container
3. Fetch a URL that is waiting to be crawled
4. Check whether any URLs are still waiting to be crawled
5. Move a URL from the to-crawl set to the crawled set
Implementation options (a Redis sketch follows below):
1. In memory: store the to-crawl and crawled URLs in Python set() collections
2. Relational database: MySQL table urls(url, is_crawled), where is_crawled marks whether a URL has been crawled
3. Cache database: Redis, storing the to-crawl and crawled URLs in sets
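The in-memory set() version appears in the url_manager code later in this outline; below is a minimal sketch of the Redis-backed variant, assuming the redis-py package and a local Redis server (the key names new_urls/old_urls are illustrative).

import redis

class RedisUrlManager(object):
    def __init__(self):
        self.r = redis.Redis(host="127.0.0.1", port=6379)

    def add_new_url(self, url):
        # only queue a URL that is neither waiting nor already crawled
        if not self.r.sismember("new_urls", url) and not self.r.sismember("old_urls", url):
            self.r.sadd("new_urls", url)

    def has_new_url(self):
        return self.r.scard("new_urls") > 0

    def get_new_url(self):
        # pop one waiting URL and move it to the crawled set
        url = self.r.spop("new_urls")
        self.r.sadd("old_urls", url)
        return url.decode("utf-8")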
URL downloader
import urllib2
import cookielib

url = "http://www.baidu.com"

# Method 1
response = urllib2.urlopen(url)
print response.getcode()    # status code returned for the url
print response.read()       # page content

# Method 2 (add request headers)
request = urllib2.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

# Method 3 (with cookie handling)
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
URL/page parser
Beautiful Soup
Beautiful Soup is a third-party Python library for extracting data from HTML or XML.
pip install beautifulsoup4
from bs4 import BeautifulSoup
1. Create a BeautifulSoup object, passing three arguments: the HTML document string, the parser, and the document encoding
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf8')
2. Search for nodes (find_all, find)
find: returns the first matching node
find_all: returns all matching nodes
soup.find_all('a')
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
soup.find_all('div', class_='abc', string='python')   # find div nodes whose class is abc and whose text is python
3. Access node information
node.name          # tag name of the matched node
node['href']       # href attribute of a matched a node
node.get_text()    # text under the node
Given an html_doc document, extract all link information:
html_doc = " "
soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf8')
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()
The difference between sessions and cookies
cookie
Cookies are the browser's local storage mechanism, kept as dict-like key/value pairs, e.g. {"sessionkey": "value"}; the value is a piece of text that the browser parses automatically.
The browser sends a stateless request to the server; the server returns the requested data together with an identifying id, which the browser stores in a local cookie. The next time the browser sends a request carrying that id, the server can tell what this browser requested previously.
session
Sessions exist because cookies alone are insecure: the username and password must not be stored in the local cookie. Instead, when returning the requested data, the server keeps the login information on its own side and sends the browser only a session_id. On the next visit the browser sends its local session_id back, and the server looks up the stored login information by that session_id to authenticate the user (see the requests sketch below).
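A minimal requests sketch of this round trip, assuming the requests library and using placeholder URLs: the Session object plays the browser's role, storing the server-issued cookie locally and sending it back on later requests.

import requests

session = requests.Session()

# first request: the server sets its identifying cookie (e.g. a session id) in the response
resp = session.get("https://example.com/login")
print(session.cookies.get_dict())    # cookies kept locally, dict-like

# later requests automatically carry the stored cookie, so the server can
# look up the state (login information, previous requests) it keeps for this id
resp = session.get("https://example.com/profile")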
Common HTTP status codes
200     : request processed successfully
301/302 : permanent / temporary redirect
403     : access forbidden
404     : no matching resource found
500     : internal server error
503     : server down or temporarily unavailable
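A small sketch of acting on these codes with requests (placeholder URL; allow_redirects=False so 301/302 are visible instead of being followed automatically):

import requests

resp = requests.get("https://example.com/page", allow_redirects=False)
if resp.status_code == 200:
    html = resp.text                                    # request processed successfully
elif resp.status_code in (301, 302):
    print("redirected to", resp.headers.get("Location"))
elif resp.status_code == 403:
    print("access forbidden")
elif resp.status_code == 404:
    print("no matching resource")
else:
    print("server-side problem:", resp.status_code)     # 500 / 503 etc.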
The Scrapy framework
Scrapy architecture diagram
1. A spider issues a request with yield Request; the engine forwards the request to the scheduler.
2. The scheduler hands the request back to the engine, which passes it through the downloader middlewares to the downloader.
3. The downloader returns a response through the downloader middlewares; the engine forwards it, through the spider middlewares, to the spiders.
4. The spiders (again through the spider middlewares) return items and new requests. The engine routes items to the item pipelines for persistence, and sends requests back to the scheduler, re-entering the loop at step 2. (A minimal spider illustrating this flow follows below.)
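A minimal spider sketch of this flow (spider name and selectors are illustrative): each yielded Request re-enters the loop through the engine and scheduler, while each yielded item is routed to the item pipelines.

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://blog.jobbole.com/all-posts/"]

    def parse(self, response):
        # yielded Requests go engine -> scheduler -> downloader middlewares -> downloader
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # yielded items go engine -> item pipelines
        yield {"url": response.url, "title": response.css("h1::text").extract_first("")}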
The source code is in site-packages/scrapy/core.
Request and Response parameters
request
url: the url
callback=None: callback function
method='GET'/'POST'
headers=None: header information
body=None
cookies=None: cookie information (can be a dict or a list)
meta=None: parameters passed on to the next callback
encoding='utf-8': encoding
priority=0: affects the scheduler's scheduling priority
dont_filter=False: whether to bypass the duplicate filter (False = duplicate requests are filtered out; True = never filtered)
errback=None: error callback function
errback example:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError

yield scrapy.Request(url, callback=self.parse_httpbin, errback=self.errback_httpbin, dont_filter=True)

def errback_httpbin(self, failure):
    self.logger.error(repr(failure))

    if failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    elif failure.check(DNSLookupError):
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    elif failure.check(TimeoutError, TCPTimedOutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)
Methods: copy(), replace() (a small sketch follows below)
See the official documentation at scrapy.org
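A small sketch of the two methods (URL and arguments are illustrative): copy() returns an identical new Request, and replace() returns a new Request with the given attributes swapped out.

import scrapy

req = scrapy.Request("http://blog.jobbole.com/all-posts/")
req_copy = req.copy()                                     # identical new Request object
req_post = req.replace(method="POST", dont_filter=True)   # same url, different settings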
response
url: the url
status=200: status code
headers=None: header information
body=b' ': page content
flags=None
request=None
Methods:
copy()
urljoin()
replace()
Subclasses:
TextResponse
HtmlResponse
XmlResponse
HtmlResponse:
HtmlResponse inherits from TextResponse,
and TextResponse provides the two selector methods xpath() and css() (see the sketch below).
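A small sketch of these helpers inside a spider callback (spider name and selectors are illustrative): xpath() and css() come from TextResponse/HtmlResponse, and urljoin() resolves a relative href against response.url.

import scrapy

class ResponseDemoSpider(scrapy.Spider):
    name = "response_demo"
    start_urls = ["http://blog.jobbole.com/all-posts/"]

    def parse(self, response):
        title = response.xpath("//h1/text()").extract_first("")
        first_link = response.css("a::attr(href)").extract_first("")
        # follow the first link, resolved to an absolute url
        yield scrapy.Request(response.urljoin(first_link), callback=self.parse)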
Building a word cloud
Create a working directory and enter it:
mkdir demo ; cd demo
Install the wheel package:
pip install wheel
Download and install the word cloud module:
http://www.lfd.uci.edu/~gohlke/pythonlibs/
pip install wordcloud-1.3.2-cp36-cp36m-win32.whl
Install Jupyter Notebook:
pip install jupyter
jupyter notebook
Save the data crawled from the web as jay.txt
filename = "jay.txt"
with open(filename) as f:
    mytext = f.read()
mytext
from wordcloud import WordCloud
wordcloud = WordCloud().generate(mytext)
For Chinese text, bring in a word segmentation tool and download a Chinese font into the demo directory.
Chinese font:
https://s3-us-west-2.amazonaws.com/notion-static/b869cb0c7f4e4c909a069eaebbd2b7ad/simsun.ttf
Install jieba:
pip install jieba
import jieba
mytext = " ".join(jieba.cut(mytext))
Change the WordCloud call to: wordcloud = WordCloud(font_path="simsun.ttf").generate(mytext)
%pylab inline
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
1. Crawling Baidu Baike pages with Python and BeautifulSoup
spider_main (crawler entry point)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/8/29 2:33
# @Author: jecht
# @File  : spider_main.py

from baidubaike_spider import url_manager, url_download, url_parser, spider_output

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = url_download.UrlDownload()
        self.parser = url_parser.UrlParser()
        self.output = spider_output.SpiderOutput()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        # while there are URLs waiting to be crawled
        while self.urls.has_new_url():
            try:
                # fetch a new url
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                # download the page for the new url
                html_count = self.downloader.download(new_url)
                # parse the url and its page content into new urls and new data
                new_urls, new_data = self.parser.parser(new_url, html_count)
                # feed the new urls back into the url manager
                self.urls.add_new_urls(new_urls)
                # collect the new data in the output component
                self.output.collect_data(new_data)
                if count == 1000:
                    break
                count = count + 1
            except:
                print('craw failed!')
        self.output.output_html()

if __name__ == "__main__":
    root_url = "http://baike.baidu.com/view/21087.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
url_manager (URL manager)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/8/29 2:35
# @Author: jecht
# @File  : url_manager.py

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    # add a single new url to the manager
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # add a batch of new urls to the manager
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # check whether the manager still has urls waiting to be crawled
    def has_new_url(self):
        return len(self.new_urls) != 0

    # fetch one url waiting to be crawled and mark it as crawled
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
url_download (URL downloader)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/8/29 2:36
# @Author: jecht
# @File  : url_download.py

import urllib.request

class UrlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
url_parser (URL parser)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/8/29 15:41
# @Author: jecht
# @File  : url_parser.py

import re
from urllib import parse
from bs4 import BeautifulSoup

class UrlParser(object):

    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link['href']
            # join the relative link onto the page url
            new_full_url = parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        # url
        res_data['url'] = page_url

        # <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()

        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()

        return res_data

    def parser(self, page_url, html_count):
        if page_url is None or html_count is None:
            return
        soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
spider_output (content output)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/8/29 12:33
# @Author: jecht
# @File  : spider_output.py

class SpiderOutput(object):

    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')

        fout.write("<html>")
        fout.write("<body>")
        fout.write("<table>")
        # the file is written as ascii by default, so title/summary are encoded as utf-8
        for data in self.datas:
            fout.write("<tr>")
            fout.write("<td>%s</td>" % data['url'])
            fout.write("<td>%s</td>" % data['title'].encode('utf-8'))
            fout.write("<td>%s</td>" % data['summary'].encode('utf-8'))
            fout.write("</tr>")
        fout.write("</table>")
        fout.write("</body>")
        fout.write("</html>")

        fout.close()
Installing virtualenv and Scrapy
Windows (virtualenvwrapper)
Windows (recommended: the virtual environment manager package)
Install it: pip install virtualenvwrapper-win

Change the virtualenv directory: Computer -> right-click Properties -> Advanced system settings -> Environment Variables -> New variable -> name: WORKON_HOME / value: D:\python_project\

Create a virtualenv and enter it: mkvirtualenv scrapy_virtualenv (use --python c:\*** to pick a different Python version)
List virtualenvs: workon
Enter a virtualenv: workon scrapy_test
Leave the virtualenv: deactivate

Install Scrapy inside the virtualenv (requires Visual C++ Build Tools):
pip install -i https://pypi.douban.com/simple scrapy
List installed packages: pip list
Enter the virtualenv: workon scrapy_virtualenv
Create a new Scrapy project: scrapy startproject ArticleSpider
(New Scrapy project 'ArticleSpider', using template directory 'd:\\envs\\article_scrapytest\\lib\\site-packages\\scrapy\\templates\\project', created in: e:\virtualenv_project\ArticleSpider)

The Scrapy project can now be opened in PyCharm to write the spider.
Inside the project, generate a spider: scrapy genspider jobbole blog.jobbole.com (creates jobbole.py from the basic template)

Using PyCharm
(set the project interpreter) File -> Settings -> Project Interpreter -> Add Local -> select the virtualenv's python.exe
scrapy shell http://blog.jobbole.com/110287/   # debug Scrapy selectors against a page in a shell
    response.body        # view the full page content
    response.css("")
    response.xpath("")
windows
Installing virtualenv
Windows (requires pip and a Python install; after downloading and unpacking pip, run python setup.py install in cmd and add python\Scripts to PATH)
Install virtualenv: pip install virtualenv
Create a virtualenv: virtualenv --python /usr/bin/python3 scrapy_test (point --python at the real Python executable)
Enter and activate/deactivate the environment:
    cd scrapy_test\
    activate.bat
    deactivate.bat

Install Scrapy in the virtualenv: pip install -i https://pypi.douban.com/simple scrapy
In the directory of your choice inside the virtualenv, create the project: scrapy startproject scrapy_bolezaixian
In the project directory inside the virtualenv, install pypiwin32: pip install pypiwin32
linux
Linux:
Install virtualenv: yum -y install python-virtualenv
Create a virtualenv: virtualenv -p /usr/bin/python3 scrapy_test (or simply virtualenv scrapy_test)
Enter and activate/deactivate the environment:
    cd scrapy_test/bin
    source activate
    source deactivate

Linux (Scrapy via easy_install)
Install dependencies:
yum install libxslt-devel libffi libffi-devel python-devel gcc openssl openssl-devel

Install setuptools and Twisted:
wget http://pypi.python.org/packages/source/s/setuptools/setuptools-0.6c11.tar.gz -P /opt/
tar zxvf setuptools-0.6c11.tar.gz -C /usr/local/src/
python setup.py install
easy_install Twisted

Install w3lib and lxml:
easy_install -U w3lib
easy_install lxml

Install pyOpenSSL:
wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt/
tar xf pyOpenSSL-0.11.tar.gz -C /usr/local/src/
python setup.py install

Install Scrapy:
easy_install -U Scrapy
Field extraction syntax
XPath syntax
XPath helpers
text()                        : return the node's text as the data value
extract() (Selector method)   : return only the data values
strip()                       : remove newlines, carriage returns, spaces, etc.
replace("。", "")              : replace, e.g. swap "。" for an empty string
contains()                    : matches when the node contains the given value
@href                         : equivalent to ::attr(href) in CSS

title = response.xpath('//*[@id="post-111585"]/div[1]/h1/text()').extract()[0]
creat_date = response.xpath('//*[@id="post-111585"]/div[2]/p/text()').extract()[0].replace("·","").strip()
praise = response.xpath('//span[contains(@class,"vote-post-up")]/h10/text()').extract()[0]
page_url = response.xpath('//div[@class="grid-8"]/div/div/a/@href').extract()

XPath syntax
article                    select all child nodes of the article element
/article                   select the article element
//*                        select all elements
//article                  select all article elements
//@class                   select all class attributes
article/a                  select all a elements that are children of article
article//a                 select all a elements that are descendants of article
//article/a | //article/b  select all a and b elements under article

XPath predicates
/article/a[1]              the first a element under article
/article/a[last()]         the last a element under article
/article/a[last()-1]       the second-to-last a element under article
//a[@class]                all a elements that have a class attribute
//a[@class='eng']          all a elements whose class attribute equals eng
//a[@*]                    all a elements that have at least one attribute
CSS selector syntax
CSS selector helpers
::text                            : equivalent to text() in XPath
::attr(href)                      : get the value of the href attribute
extract()[0] / extract_first("")  : extract()[0] raises when the result is empty, so prefer extract_first("")
yield                             : hand the result to Scrapy for downloading
Request(url="", callback="")      : request a url and pass the response to another (or the same) parse callback
parse.urljoin(response.url, post_url) : join a relative url onto the base url

title = response.css(".entry-header h1::text").extract()
datetime = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace(".","").strip()
page_url = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()

CSS selectors
*                          select all nodes
#container                 select all nodes with id="container"
.container                 select all nodes whose class contains container
container a                select all a nodes inside a container element (descendants)
container + ul             select the ul node immediately following container (sibling)
div#container > ul         select the ul elements that are direct children of the div with id="container"

ul ~ p                     select all p elements that follow ul as siblings
a[title]                   select all a elements that have a title attribute
a[href="http://jobbole.com"]   select all a elements whose href equals http://jobbole.com
a[href*="jobole"]          select all a elements whose href contains jobole
a[href^="http"]            select all a elements whose href starts with http
a[href$=".jpg"]            select all a elements whose href ends with .jpg
input[type=radio]:checked  select input elements of type radio that are checked

div:not(#container)        select all div elements whose id is not container
li:nth-child(3)            select the third li element
tr:nth-child(2n)           select the even-numbered tr elements
tr:nth-child(-n+4)         select the first four elements
tr:nth-child(n+4)          select elements from the fourth onward
tr:last-child              select the last tr element
2. Crawling the Jobbole (伯乐在线) site with Scrapy
Crawling Jobbole with Scrapy
main function (in the top-level directory of the scrapy_spide project)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/9/3 14:21
# @Author: jecht
# @File  : main.py

import sys
import os
from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "jobbole"])
# Set ROBOTSTXT_OBEY = False in settings.py; otherwise many URLs will be filtered out.
from scrapy.cmdline import execute          # lets you run scrapy commands from a script
execute(["scrapy", "crawl", "jobbole"])     # runs `scrapy crawl jobbole` inside the scrapy environment
import sys
sys.path.append(path)                       # add the project directory to the module search path
import os                                   # OS module, used to build file paths
os.path.abspath(__file__)                   # absolute path of the current file
os.path.dirname(os.path.abspath(__file__))  # directory containing the current file
spiders (the directory containing all spiders)
jobbole.py (created from the template via scrapy genspider jobbole blog.jobbole.com)
# -*- coding: utf-8 -*-
import re
from urllib import parse
import scrapy
from scrapy import Request

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        pass

    def parse_page(self, response):
        pass
Crawl the article list page, handing each article's URL and the next-page URL to Scrapy for downloading
def parse(self, response):
    post_nodes = response.xpath('//div[@class="grid-8"]/div/div/a')
    for post_node in post_nodes:
        post_url = post_node.xpath('@href').extract_first("")
        image_url = post_node.xpath('img/@src').extract_first("")
        # page_url = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()
        yield Request(url=parse.urljoin(response.url, post_url),
                      meta={"front_image_url": image_url},
                      callback=self.parse_page)
    next_url = response.xpath('//div[contains(@class,"navigation")]/a[@class="next page-numbers"]/@href').extract()[0]
    if next_url:
        yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)
.extract()[0]                 # take the first matched field ([1] would be the second); can be replaced by extract_first("")
yield                         # hand the result to Scrapy for downloading
Request(url=, callback=)      # request the url and pass the response to the callback for parsing or further downloading
parse.urljoin(base_url, url)  # join the base url and a relative url into a complete url
Extracting the item from each detail page
Populate the item by assigning each field individually
def parse_page(self, response):
    jobboleItem = JobboleScrapyItem()

    front_image_url = response.meta.get("front_image_url", "")
    title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first("")
    create_date = response.xpath('//div[@class="entry-meta"]/p/text()').extract_first("").strip().replace("·", "").strip()
    prase_number = response.xpath('//div[@class="post-adds"]/span[contains(@class,"vote-post-up")]/h10/text()').extract_first("")
    collections_number = response.xpath('//div[@class="post-adds"]/span[contains(@class,"bookmark-btn")]/text()').extract_first("")
    collections_re = re.match(".*?(\d+).*", collections_number)
    if collections_re is None:
        collections_number = 0
    else:
        collections_number = collections_re.group(1)
    comments_number = response.xpath('//div[@class="post-adds"]/a/span[contains(@class,"hide-on-480")]/text()').extract_first("")
    # extract the numeric part (one or more digits)
    comments_re = re.match(".*?(\d+).*", comments_number)
    if comments_re is None:
        comments_number = 0
    else:
        comments_number = comments_re.group(1)
    targe_list = response.xpath('//div[@class="entry-meta"]/p/a/text()').extract()
    targes = [targe for targe in targe_list if not targe.strip().endswith('评论')]
    targe_list = ",".join(targes)
    content = response.xpath('//div[@class="entry"]').extract()

    # populate the item
    jobboleItem["title"] = title
    try:
        create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    jobboleItem["create_date"] = create_date
    jobboleItem["prase_number"] = prase_number
    jobboleItem["collections_number"] = collections_number
    jobboleItem["comments_number"] = comments_number
    jobboleItem["targe_list"] = targe_list
    jobboleItem["content"] = content

    jobboleItem["front_image_url"] = [front_image_url]
    jobboleItem["url"] = response.url
    jobboleItem["url_object_id"] = get_md5.get_md5(response.url)

    yield jobboleItem
Field extraction in the callback parse_page(self, response):
.extract_first("")        # take the first matched field (equivalent to extract()[0]; [1] would be the second)
.strip()                  # remove newlines and other whitespace
.replace("。", "")         # replace "。" with an empty string
comments_number = comments_re.group(1)   # group(1) takes the first captured group
[ele for ele in targe_list if not ele.strip().endswith('评论')]   # drop list elements that end with '评论' (comments)
Loading the item with an ItemLoader
# load the item through an ItemLoader
front_image_url = response.meta.get("front_image_url", "")
item_loaders = JobboleLoaderItem(item=JobboleScrapyItem(), response=response)
item_loaders.add_xpath("title", '//div[@class="entry-header"]/h1/text()')
item_loaders.add_xpath("create_date", '//div[@class="entry-meta"]/p/text()')
item_loaders.add_value("front_image_url", [front_image_url])
item_loaders.add_xpath("comments_number", '//div[@class="post-adds"]/a/span[contains(@class,"hide-on-480")]/text()')
item_loaders.add_xpath("collections_number", '//div[@class="post-adds"]/span[contains(@class,"bookmark-btn")]/text()')
item_loaders.add_xpath("prase_number", '//div[@class="post-adds"]/span[contains(@class,"vote-post-up")]/h10/text()')
item_loaders.add_xpath("targe_list", '//div[@class="entry-meta"]/p/a/text()')
item_loaders.add_xpath("content", '//div[@class="entry"]')
item_loaders.add_value("url", response.url)
item_loaders.add_value("url_object_id", get_md5.get_md5(response.url))

jobboleItem = item_loaders.load_item()

yield jobboleItem
Define the item template; the populated items yielded by the spider are handed to the pipelines for persistence
1. Define the jobbole item class in items.py
Declaring the item fields with a plain custom Item
class JobboleScrapyItem(scrapy.Item):
    front_image_url = scrapy.Field()
    title = scrapy.Field()
    create_date = scrapy.Field()
    prase_number = scrapy.Field()
    collections_number = scrapy.Field()
    comments_number = scrapy.Field()
    targe_list = scrapy.Field()
    content = scrapy.Field()

    front_image_path = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
Add every field that needs to be populated to the item and declare it as a Field; Field is the only data type Scrapy items use.
Item definition for loading with an ItemLoader
import datetime
import re

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join

class ScrapyProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

# Custom ItemLoader: turn each list into a str by taking the first element via TakeFirst

def date_convert(value):
    try:
        create_date = datetime.datetime.strptime(value, "%Y/%m/%d").date()
    except Exception as e:
        create_date = datetime.datetime.now().date()
    return create_date


def get_number(value):
    collections_re = re.match(".*?(\d+).*", value)
    if collections_re is None:
        value_number = "0"
    else:
        value_number = collections_re.group(1)
    return value_number


def delect_comment(value):
    if "评论" in value:
        return ""
    else:
        return value


def return_value(value):
    return value


class JobboleLoaderItem(ItemLoader):
    default_output_processor = TakeFirst()


class JobboleScrapyItem(scrapy.Item):
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert)
    )
    prase_number = scrapy.Field(
        input_processor=MapCompose(get_number)
    )
    collections_number = scrapy.Field(
        input_processor=MapCompose(get_number)
    )
    comments_number = scrapy.Field(
        input_processor=MapCompose(get_number)
    )
    targe_list = scrapy.Field(
        input_processor=MapCompose(delect_comment),
        output_processor=Join(",")
    )
    content = scrapy.Field()

    front_image_path = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
2. Enable the pipelines in settings.py:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# image saving
IMAGES_URLS_FIELD = 'front_image_url'
project_path = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_path, 'images/jobbole_images')
# (downloading images requires the PIL/Pillow library; install it inside the virtualenv)
# (pip install -i https://pypi.douban.com/simple pillow)
1. In settings.py, give each pipeline its priority (lower numbers run earlier).
2. Use os.path to build the image storage path.
3. 'scrapy.pipelines.images.ImagesPipeline': to customize image handling, define your own class in pipelines.py that inherits ImagesPipeline; its hooks let you control the image file layout and filter out unwanted images.
4. Configure IMAGES_URLS_FIELD and IMAGES_STORE.
Hand the scraped data to the pipelines to be written to files or a database
Configure the pipeline execution order in settings.py (see the sketch below)
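A sketch of the ITEM_PIPELINES entry in settings.py that sets the order (a lower number runs earlier); the module path scrapy_project.pipelines is an assumption about the project layout, and the class names mirror the pipelines defined below.

ITEM_PIPELINES = {
    'scrapy_project.pipelines.ScrapyImagePipeline': 1,    # the ImagesPipeline subclass runs first
    'scrapy_project.pipelines.ScrapyJsonPipeline': 300,   # then the item is written to JSON
}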
Saving images to a local folder
1. In settings.py, define where downloaded images are stored locally:
IMAGES_URLS_FIELD = 'front_image_url'
project_path = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_path, 'images/jobbole_images')

2. Inherit from ImagesPipeline and override item_completed to record where each file was actually saved:
class ScrapyImagePipeline(ImagesPipeline):
    # get_media_requests loops over the image urls and yields a download Request for each
    # (see the sketch below)

    # record the actual download path of the file
    def item_completed(self, results, item, info):
        for ok, value in results:
            value_image_path = value["path"]
            item["front_image_path"] = value_image_path
        return item
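A hedged sketch of the get_media_requests step referenced in the comment above, filling out the full pipeline (the field names front_image_url/front_image_path match the item defined earlier; whether you need to override get_media_requests at all depends on whether IMAGES_URLS_FIELD already covers your case).

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class ScrapyImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # turn every image url stored on the item into a download Request
        for image_url in item["front_image_url"]:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # record where each successfully downloaded file ended up
        for ok, value in results:
            if ok:
                item["front_image_path"] = value["path"]
        return item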
Writing the items to a local JSON file
Custom pipeline writing a local JSON file
import codecs
import json

class ScrapyJsonPipeline(object):
    def __init__(self):
        self.file = codecs.open('scrapy_jobbole.json', 'w', encoding="utf-8")

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()
Using Scrapy's exporter to write a local JSON file
from scrapy.exporters import JsonItemExporter

class ScrapyItemExportersPipeline(object):
    def __init__(self):
        self.file = open("scrapy_exporter.json", "wb")
        self.exporters = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporters.start_exporting()

    def close_spider(self, spider):
        self.exporters.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporters.export_item(item)
        return item
Saving the data to a MySQL database
Writing to MySQL synchronously
import MySQLdb

class ScrapyMysqlExporterPipline(object):
    def __init__(self):
        self.conn = MySQLdb.connect("127.0.0.1", "root", "wuting123", "scrapy_jobbole", charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = '''
            insert into jobbole(title, create_date, url, url_object_id, front_image_url, front_image_path,
                comments_number, prase_number, collections_number, content, targe_list)
            VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        self.cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"], item["url_object_id"],
                                         item["front_image_url"], item["front_image_path"], item["comments_number"],
                                         item["prase_number"], item["collections_number"], item["content"],
                                         item["targe_list"]))
        self.conn.commit()
        return item
Writing to MySQL asynchronously (Twisted adbapi)
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

class ScrapyMysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_IP"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWD"],
            db=settings["MYSQL_DBNAME"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # use twisted to turn the mysql insert into an asynchronous operation
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)   # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        print(failure)

    def do_insert(self, cursor, item):
        # run the actual insert
        insert_sql = '''
            insert into jobbole(title, create_date, url, url_object_id, front_image_url, front_image_path,
                comments_number, prase_number, collections_number, content, targe_list)
            VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        '''
        cursor.execute(insert_sql, (item["title"], item["create_date"], item["url"], item["url_object_id"],
                                    item["front_image_url"], item["front_image_path"], item["comments_number"],
                                    item["prase_number"], item["collections_number"], item["content"],
                                    item["targe_list"]))
3. Logging into Zhihu with Scrapy and crawling the site
Zhihu login
Logging into Zhihu with the requests library
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time  : 2017/9/20 9:59
# @Author: jecht
# @File  : zhihu_login_request.py

import time
import requests
from PIL import Image
import re
try:
    import cookielib
except:
    import http.cookiejar as cookielib


session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except:
    print("failed to load cookies")

url = "https://www.zhihu.com"
# agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36"
header = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': agent,
}

# zhihu_login_request("17512009387", "wuting123")
# get_captcha()
# get_xsrf()
# get_index()
# get_ingore()
Getting the xsrf token
def get_xsrf():
    response = session.get(url, headers=header)
    # text = '<input type="hidden" name="_xsrf" value="9af460db3704806af07819ad14626e08"/>'
    match = re.match('[\s\S]*name="_xsrf" value="?(.*)"', response.text)
    if match:
        return match.group(1)
    else:
        return ""
Getting the captcha
def get_captcha():
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login&lang=cn"
    response = session.get(captcha_url, headers=header)
    with open("captcha_image.gif", 'wb') as f:
        f.write(response.content)
    try:
        img = Image.open("captcha_image.gif")
        img.show()
        img.close()
    except:
        pass
    points = [[20.7735, 22.7614], [45.7835, 22.6225], [66.7824, 22.6114], [90.7914, 21.6136],
              [118.7937, 23.6114], [143.7936, 22.6185], [160.7935, 22.6125]]
    # points = ["%5B20.77%2C22.76%5D", "%5B45.78%2C22.62%5D", "%5B66.78%2C22.61%5D", "%5B90.79%2C21.61%5D", "%5B118.79%2C23.61%5D", "%5B143.79%2C22.61%5D", "%5B160.79%2C22.61%5D"]
    seq = input('请输入倒立文字的位置\n>')   # enter the positions of the upside-down characters
    s = ""
    for i in seq:
        # s += str(points[int(i) - 1]) + "%2C"
        s += str(points[int(i) - 1]) + ", "
    # captcha_base = '%7B%22img_size%22%3A%5B200%2C44%5D%2C%22input_points%22%3A%5B' + s[:-3] + '%5D%7D'
    captcha_base = '{"img_size":[200,44],"input_points":[' + s[:-2] + ']}'
    return captcha_base
Form login
def zhihu_login_request(account, password):
    if re.match("^1\d{10}", account):
        print("手机号码登录")   # logging in with a phone number
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "captcha": get_captcha(),
            "captcha_type": 'cn',
            "password": password,
            "phone_num": account,
            "remember_me": 'true',
        }
    else:
        if "@" in account:
            print("邮箱登录")   # logging in with an email address
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "captcha": get_captcha(),
                "captcha_type": 'cn',
                "password": password,
                "email": account,
                "remember_me": 'true',
            }
    response = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()
Loading from the saved cookies
def get_index():
    response = session.get(url, headers=header)
    with open("zhihu_index.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("已从cookie加载登录信息")   # login state loaded from the saved cookies
Checking whether we are logged in
def get_ingore():
    ignore_url = "https://www.zhihu.com/inbox"
    response = session.get(ignore_url, headers=header)
    if response.status_code != 200:
        return False
    else:
        return True
Logging into Zhihu with Scrapy requests
# -*- coding: utf-8 -*-
import json
import re
import scrapy
import time
from PIL import Image
import scrapy_project
import requests


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']
    header = {
        "Host": "www.zhihu.com",
        "Referer": "https://www.zhihu.com/",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36",
    }

    def parse(self, response):
        pass
def start_requests(self):
    return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.header, callback=self.login)]
def login(self, response):
    response_text = response.text
    re_xsrf = re.match('[\s\S]*name="_xsrf" value="?(.*)"', response_text)
    get_xsrf = ""
    if re_xsrf:
        get_xsrf = re_xsrf.group(1)
    if get_xsrf:
        post_url = 'https://www.zhihu.com/login/phone_num'
        post_data = {
            "_xsrf": get_xsrf,
            "captcha": "",
            "captcha_type": 'cn',
            "password": "wuting123",
            "phone_num": "17512009387",
            "remember_me": 'true'
        }
        captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + str(int(time.time() * 1000)) + "&type=login&lang=cn"
        return [scrapy.Request(captcha_url, headers=self.header, meta={'post_data': post_data}, callback=self.login_after)]
Write the login function: it extracts the xsrf token needed for the login form and leaves the captcha field blank for now.
It then issues a scrapy.Request() for the captcha image, passing post_data along through meta, with login_after as the callback.
def login_after(self, response):
    with open("captcha.gif", 'wb') as f:
        f.write(response.body)
    try:
        img = Image.open("captcha.gif")
        img.show()
        img.close()
    except:
        pass
    points = [[20.7735, 22.7614], [45.7835, 22.6225], [66.7824, 22.6114], [90.7914, 21.6136],
              [118.7937, 23.6114], [143.7936, 22.6185], [160.7935, 22.6125]]
    seq = input('请输入倒立文字的位置\n>')   # enter the positions of the upside-down characters
    s = ""
    for i in seq:
        s += str(points[int(i) - 1]) + ", "
    captcha = '{"img_size":[200,44],"input_points":[' + s[:-2] + ']}'
    post_data = response.meta.get("post_data", {})
    post_data['captcha'] = captcha
    # post_data = json.dumps(post_data)
    if post_data:
        return [scrapy.FormRequest(
            url='https://www.zhihu.com/login/phone_num',
            formdata=post_data,
            headers=self.header,
            callback=self.check_login
        )]
The PIL Image module is used to display the captcha image so its answer can be typed in manually.
By capturing the login request in the browser's network tools, you can see ...
def check_login(self, response):
    login_view = response.text
    if "errcode" in login_view:
        print("登入失败")   # login failed
    else:
        print("登入成功")   # login succeeded
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, headers=self.header)
Crawling Zhihu data
1. Crawling the overview (feed) page
import json
import re
from urllib import parse
import scrapy
import time
from PIL import Image
import scrapy_project
import requests
from scrapy.loader import ItemLoader
from scrapy_project.items import ZhihuQuestionItem, ZhihuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com/']
    # template for the first answers-API request
    start_answer_url = 'http://www.zhihu.com/api/v4/questions/{0}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit={1}&offset={2}'
    header = {
        "Host": "www.zhihu.com",
        "Referer": "https://www.zhihu.com/",
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36",
    }

    def parse(self, response):
        page_url = response.css('a[data-za-detail-view-element_name="Title"]::attr(href)').extract()

        for url in page_url:
            url = parse.urljoin(response.url, url)
            if "question" in url:
                re_url = re.match('(.*)answer.*', url)
                url = re_url.group(1)
                question_id = re.match('.*/(\d+).*', url).group(1)
                # pass each extracted question url on to question_parse
                yield scrapy.Request(url, headers=self.header, meta={"questions_id": question_id}, callback=self.question_parse)
1. Default headers can be defined in settings; otherwise headers have to be set on every request (see the settings sketch below).
2. parse() crawls the overview page, extracts every detail-page url and yields it to the detail-page parser.
3. It also yields the next-page url back to itself, looping until the last page is reached.
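A minimal sketch of point 1, assuming the project settings module is scrapy_project/settings.py; the values are the ones used in the spider's header above:

# settings.py -- applied to every request unless overridden per request
DEFAULT_REQUEST_HEADERS = {
    "Host": "www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/60.0.3112.90 Safari/537.36"),
}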
2. Scrape the detail pages
def question_parse(self, response):
    if "QuestionHeader" in response.text:
        que_question_id = response.meta.get('questions_id', '')
        Item_Que_Loaders = ItemLoader(item=ZhihuQuestionItem(), response=response)

        Item_Que_Loaders.add_value('que_question_id', que_question_id)
        Item_Que_Loaders.add_css('que_topic', 'div.QuestionTopic span a div div::text')
        Item_Que_Loaders.add_css('que_title', 'h1.QuestionHeader-title::text')
        Item_Que_Loaders.add_css('que_content', 'div.QuestionHeader-detail div div span')
        Item_Que_Loaders.add_css('que_attention_num', 'button.NumberBoard-item div.NumberBoard-value::text')
        Item_Que_Loaders.add_css('que_view_num', 'div.NumberBoard-item div.NumberBoard-value::text')
        Item_Que_Loaders.add_css('que_comment_num', 'div.QuestionHeader-Comment button::text')
        Item_Que_Loaders.add_css('que_answer_num', 'h4.List-headerText span::text')
        Item_Que_Loaders.add_value('que_url', response.url)

        Question_Item = Item_Que_Loaders.load_item()

        yield scrapy.Request(self.start_answer_url.format(que_question_id, 20, 0), headers=self.header, callback=self.answer_parse)
        yield Question_Item

def answer_parse(self, response):
    json_data = json.loads(response.text)
    is_end = json_data['paging']['is_end']
    next_url = json_data['paging']['next']
    total_answer_num = json_data['paging']['totals']

    for ans_data in json_data['data']:
        Item_Ans_Loaders = ZhihuAnswerItem()
        Item_Ans_Loaders['ans_author_name'] = ans_data['author']['name']
        Item_Ans_Loaders['ans_author_idname'] = ans_data['author']['url_token'] if "url_token" in ans_data["author"] else None
        Item_Ans_Loaders['ans_data_url'] = ans_data['url']
        Item_Ans_Loaders['ans_question_id'] = ans_data['question']['id']
        Item_Ans_Loaders['ans_voters_num'] = ans_data['voteup_count']
        Item_Ans_Loaders['ans_comment_num'] = ans_data['comment_count']
        Item_Ans_Loaders['ans_content'] = ans_data['content']
        Item_Ans_Loaders['ans_create_time'] = ans_data['created_time']
        Item_Ans_Loaders['ans_update_time'] = ans_data['updated_time']

        yield Item_Ans_Loaders

    if not is_end:
        yield scrapy.Request(next_url, headers=self.header, callback=self.answer_parse)
1. The detail pages consist of a question page and its answer pages.
2. Question fields are collected with scrapy.loader's ItemLoader; the values a loader collects are lists (see the output-processor sketch below for collapsing them).
3. For each question_id, the answers page is yielded on to answer_parse; that page is a JSON document containing the answer data.
4. Because the answers come back as JSON, json.loads() turns the body into json_data, and json_data['paging']['is_end'] etc. give the paging information.
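One way to collapse those per-field lists is a custom loader with a TakeFirst output processor; a minimal sketch (the spider above uses the plain ItemLoader, so the loader name here is an assumption):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class ZhihuItemLoader(ItemLoader):
    # every add_css/add_value collects a list; keep only the first element
    default_output_processor = TakeFirst()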
3. Write the items
Subtopic
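A minimal sketch of the two items, using only the field names referenced in question_parse/answer_parse above:

import scrapy


class ZhihuQuestionItem(scrapy.Item):
    que_question_id = scrapy.Field()
    que_topic = scrapy.Field()
    que_title = scrapy.Field()
    que_content = scrapy.Field()
    que_attention_num = scrapy.Field()
    que_view_num = scrapy.Field()
    que_comment_num = scrapy.Field()
    que_answer_num = scrapy.Field()
    que_url = scrapy.Field()


class ZhihuAnswerItem(scrapy.Item):
    ans_author_name = scrapy.Field()
    ans_author_idname = scrapy.Field()
    ans_data_url = scrapy.Field()
    ans_question_id = scrapy.Field()
    ans_voters_num = scrapy.Field()
    ans_comment_num = scrapy.Field()
    ans_content = scrapy.Field()
    ans_create_time = scrapy.Field()
    ans_update_time = scrapy.Field()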
4. Write the pipeline
4. Crawl Lagou with scrapy's CrawlSpider
Use the crawl template instead of the default basic template
1. cd into the project directory
cd d:\python_project\scrapy_project\scrapy_project

2. Activate the virtualenv
workon scrapy_virtualenv

3. List the available spider templates
scrapy genspider --list

4. Generate a spider from the crawl template (the generated skeleton is sketched below)
scrapy genspider -t crawl lagou www.lagou.com
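For reference, the crawl template generates roughly the following skeleton in lagou.py; the Rule shown is the template's placeholder and has to be adapted to the real url patterns:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        # placeholder rule from the template: change allow= to the real url pattern
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item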
CrawlSpider source-code analysis
class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _build_request(self, rule, link):
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=rule, link_text=link.text)
        return r

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
1. CrawlSpider subclasses Spider and calls Spider's __init__ via super(); Spider's entry point is start_requests()
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
2. start_requests() uses parse() as its default callback, and parse() delegates to _parse_response()
def parse(self, response):
    return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
`if callback` checks whether parse_start_url was supplied; if so, the response and cb_kwargs are passed to it, and its result is passed on to process_results. Both hooks are empty by default, so cb_res is empty, but both can be overridden. The for loop then iterates over cb_res and yields each request or item for downloading. If neither hook is overridden, this first half of _parse_response effectively does nothing.
If the default follow=True in _parse_response and the setting CRAWLSPIDER_FOLLOW_LINKS (default True) are left unchanged, the second for loop runs and the links found by the LinkExtractor are followed; otherwise the rules are never applied.
3. _parse_response lets you customise response handling by overriding parse_start_url and process_results (a sketch follows below).
4. _requests_to_follow applies the rules: the response is handed to each rule's LinkExtractor.extract_links() to pull out all the links, and each link is then requested, much like a yield scrapy.Request(r).
5. Before yielding, _build_request attaches _response_downloaded as the callback, which adds the extra processing around the rule's own callback.
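A minimal sketch of point 3, overriding the two hooks in a CrawlSpider subclass (they are deliberately left as pass-throughs, matching the defaults):

from scrapy.spiders import CrawlSpider


class LagouSpider(CrawlSpider):
    # name / start_urls / rules as in the generated skeleton above

    def parse_start_url(self, response):
        # runs on the responses of start_urls themselves;
        # return an iterable of items/requests (the default returns [])
        return []

    def process_results(self, response, results):
        # last chance to post-process whatever the rule callbacks produced
        return results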
def parse_start_url(self, response):
    return []
def process_results(self, response, results):
    return results
rules = (
    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
)
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)
1. First check that the response is an HtmlResponse.
2. seen is a set() used to deduplicate the urls extracted from the response.
3. _rules is built by _compile_rules(), which resolves each rule's callback and pre-processing hooks.
4. enumerate() turns the rules into iterable (n, rule) pairs.
5. links is the result of running rule.link_extractor.extract_links() on the response, keeping only links not already in seen.
6. If the extracted links still need filtering, rule.process_links can filter them one more time.
7. Every accepted link is added to seen.
8. Each link is built into a Request by _build_request and yielded through process_request for downloading.
6. The LinkExtractor filters links according to the arguments it was given, e.g. allow=r'Items/', deny, allow_domains and so on (see the process_links sketch below).
7. The _build_request → _response_downloaded step stores the rule index in request.meta so the matching rule can be pulled back out of the response.
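A minimal sketch of hooking process_links into a Rule; the url pattern and the https filter are illustrative assumptions (the string names are resolved to methods by _compile_rules):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'),   # assumed url pattern
             process_links='filter_links',             # extra filtering after extract_links
             callback='parse_job',
             follow=True),
    )

    def filter_links(self, links):
        # drop any extracted link that is not served over https (illustrative filter)
        return [link for link in links if link.url.startswith('https')]

    def parse_job(self, response):
        pass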
def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
             tags=('a', 'area'), attrs=('href',), canonicalize=False,
             unique=True, process_value=None, deny_extensions=None, restrict_css=(),
             strip=True)
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
Finally control is handed back to _parse_response again
Rule and LinkExtractor parameters and usage
rules = (
    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
)
Rule parameters
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None,
                 process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
1. link_extractor: the extractor handed to extract_links.
2. callback: the callback defined in the spider (e.g. lagou.py).
3. cb_kwargs: extra keyword arguments passed to the callback.
4. follow: whether to keep following urls that match this rule.
5. process_links: a pre-processing hook applied to the extracted links.
6. process_request=identity: identity is a no-op that can be replaced with a custom function, similar to process_links.
LinkExtractor
LinkExtractor parameters
class LxmlLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True):
1. allow: keep urls that match these regular expressions.
2. deny: drop urls that match these regular expressions.
3. allow_domains: keep urls belonging to these domains (cf. allowed_domains in lagou.py).
4. deny_domains: drop urls belonging to these domains.
5. restrict_xpaths: only extract from the page regions matched by these XPath expressions.
6. tags=('a', 'area'): default; look for urls inside <a> and <area> tags.
7. attrs=('href',): default; extract the href attribute.
8. restrict_css=(): only extract from the page regions matched by these CSS selectors.
XPath works on XML and CSS on HTML; when restrict_css is used, the inherited FilteringLinkExtractor calls HTMLTranslator() to convert the CSS selectors into XPath. An example LinkExtractor is sketched below.
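A minimal sketch combining the common arguments; the url patterns and the restrict_css region are assumptions for illustration:

from scrapy.linkextractors import LinkExtractor

link_extractor = LinkExtractor(
    allow=r'gongsi/\d+\.html',               # assumed: keep company-page urls
    deny=r'/login',                           # assumed: never follow login urls
    allow_domains=('www.lagou.com',),
    restrict_css=('div#s_position_list',),    # assumed page region to search
)
# links = link_extractor.extract_links(response)   # returns a list of Link objects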
LinkExtractor's extract_links function
def extract_links(self, response):
    base_url = get_base_url(response)
    if self.restrict_xpaths:
        docs = [subdoc
                for x in self.restrict_xpaths
                for subdoc in response.xpath(x)]
    else:
        docs = [response.selector]
    all_links = []
    for doc in docs:
        links = self._extract_links(doc, response.url, response.encoding, base_url)
        all_links.extend(self._process_links(links))
    return unique_list(all_links)
1. get_base_url() derives the base url from the response.
2. If restrict_xpaths is set, each XPath expression is evaluated against the response.
3. The matched regions form a list of documents.
4. Each document is run through _extract_links(), which applies the filters configured on the LinkExtractor; the combined result is deduplicated and returned.
Write the rules and LinkExtractor to crawl the whole Lagou site
Write the items
class LagouItemLoader(ItemLoader):
    default_output_processor = TakeFirst()


class LagouItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object = scrapy.Field()
    degree_need = scrapy.Field()
    crawl_time = scrapy.Field()
    publish_time = scrapy.Field()
    tags = scrapy.Field()
    company_name = scrapy.Field()
    company_url = scrapy.Field()
    job_city = scrapy.Field()
    job_type = scrapy.Field()
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field()
    job_addr = scrapy.Field()
    salary_max = scrapy.Field()
    salary_min = scrapy.Field()
    work_years_max = scrapy.Field()
    work_years_min = scrapy.Field()
Parse the fields in lagou.py
Subtopic
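A minimal sketch of the parsing method, assuming the LagouItem/LagouItemLoader defined above; the import path and CSS selectors are assumptions and must be checked against the live page:

import datetime

from scrapy_project.items import LagouItem, LagouItemLoader   # assumed module path


def parse_job(self, response):
    item_loader = LagouItemLoader(item=LagouItem(), response=response)
    item_loader.add_css('title', '.job-name::attr(title)')            # assumed selector
    item_loader.add_css('job_advantage', '.job-advantage p::text')    # assumed selector
    item_loader.add_css('job_desc', '.job_bt div::text')              # assumed selector
    item_loader.add_value('url', response.url)
    item_loader.add_value('crawl_time', datetime.datetime.now())
    yield item_loader.load_item()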
5. Crawling vs. anti-crawling strategies
1. Rotate the User-Agent randomly
Every request passes through the downloader middlewares on its way to becoming a response, so a middleware can swap the User-Agent on each request (scrapy's own site-packages/scrapy/downloadermiddlewares/useragent.py is a useful reference).

Activate the middleware in settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_project.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
1. Build your own user_agent_list and pick a random User-Agent with random.randint
A user_agent_list must be defined in settings. Then in middlewares.py:

import random

class RandomUserAgentMiddlware(object):
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.user_agent_list = crawler.settings.get("user_agent_list", [])

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        random_num = random.randint(0, len(self.user_agent_list) - 1)
        random_agent = self.user_agent_list[random_num]
        request.headers.setdefault('User-Agent', random_agent)
(Recommended!) 2. Use fake-useragent from GitHub to pick a random User-Agent
pip install fake-useragent

from fake_useragent import UserAgent

class RandomUserAgentMiddlware(object):
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        # RANDOM_UA_TYPE in settings chooses the browser family, e.g. "random", "chrome", "ie"
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)
        request.headers.setdefault('User-Agent', get_ua())
1. super(RandomUserAgentMiddlware, self).__init__() calls the parent class's initializer so the subclass inherits and initializes its attributes
2. Use proxy IPs
Using a single proxy IP
Use a free high-anonymity proxy from Xici (xicidaili) to hide the host IP.
Fill in the proxy IP and port:
request.meta["proxy"] = "https://183.71.136.98:8118"
Using a proxy IP pool
Write a small crawler, crawl_xici_ip.py, that scrapes the free high-anonymity proxies from the Xici proxy site
import datetime
import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="wuting123", db="scrapy_lagou", charset="utf8")
cursor = conn.cursor()

def crawl_ips():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"}
    for i in range(1, 5):
        re = requests.get("http://www.xicidaili.com/nn/{0}".format(i), headers=headers)
        selector = Selector(text=re.text)
        all_trs = selector.css("#ip_list tr")
        ip_list = []
        for tr in all_trs[1:]:
            speed_str = tr.css(".bar::attr(title)").extract()[0]
            if speed_str:
                speed = float(speed_str.split("秒")[0])
            text = tr.css("td::text").extract()
            all_text = list()
            for i in text:
                if '\n' not in i:
                    all_text.append(i)
            ip = all_text[0]
            port = all_text[1]
            proxy_type = all_text[3]
            validate = "20" + all_text[5]
            validate_time = datetime.datetime.strptime(validate, "%Y-%m-%d %H:%M")
            ip_list.append((ip, port, proxy_type, speed, validate_time))
        for ip_add in ip_list:
            cursor.execute(
                "INSERT INTO proxy_ip(ip, port, proxy_type, speed, validate_time) VALUES('{0}','{1}','{2}','{3}','{4}')".format(
                    ip_add[0], ip_add[1], ip_add[2], ip_add[3], ip_add[4]
                )
            )
            conn.commit()

# fetch proxies back out of the database
class GetIP(object):
    # delete an ip that failed the test
    def delete_ip(self, ip):
        delete_sql = """
            DELETE FROM proxy_ip WHERE ip = '{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()

    # check whether an ip is still usable
    def judge_ip(self, ip, port, proxy_type):
        http_url = "http://www.baidu.com"
        proxy_url = "{0}://{1}:{2}".format(proxy_type, ip, port)
        try:
            proxy_dict = {
                "http": proxy_url,
                "https": proxy_url
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception as e:
            print("invalid ip and port")
            print(proxy_url)
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print(proxy_url)
                print("effective ip")
                return True
            else:
                print("invalid ip and port")
                self.delete_ip(ip)
                return False

    # pick a random row from the database with SQL
    def get_random_ip(self):
        random_sql = '''SELECT ip, port, proxy_type FROM proxy_ip
            ORDER BY RAND()
            LIMIT 1
        '''
        result = cursor.execute(random_sql)
        for ip_add in cursor.fetchall():
            ip = ip_add[0]
            port = ip_add[1]
            proxy_type = ip_add[2]
            judge_ip = self.judge_ip(ip, port, proxy_type)
            if judge_ip:
                return "{0}://{1}:{2}".format(proxy_type, ip, port)
            else:
                return self.get_random_ip()

if __name__ == "__main__":
    get_ip = GetIP()
    get_ip.get_random_ip()
Call the Xici proxy IPs from settings and the middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_project.middlewares.RandomUserAgentMiddlware': 10,
    'scrapy_project.middlewares.RandomProxyMiddleware': 11,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
In middlewares.py:

class RandomProxyMiddleware(object):
    # GetIP is the class defined in crawl_xici_ip.py above
    def process_request(self, request, spider):
        get_ip = GetIP()
        request.meta["proxy"] = get_ip.get_random_ip()
Using the open-source scrapy-proxies library from GitHub
More powerful and complete than a hand-rolled version
Using scrapy-crawlera from GitHub (the official, paid proxy service)
Using the Tor (onion) network
Hides the IP by relaying traffic through multiple hops (but a VPN is needed to reach it)
3. Captcha recognition
1. Google's open-source tool tesseract-ocr
Drawback: easily thrown off by noise in the image
2. Online captcha-solving services
e.g. 云打码 (Yundama)
3. Manual captcha solving
e.g. 超速打码
4. Rate limiting
http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/autothrottle.html
1. Set a minimum download delay in settings:
DOWNLOAD_DELAY
2. Set a higher concurrency in settings:
CONCURRENT_REQUESTS_PER_DOMAIN 或( CONCURRENT_REQUESTS_PER_IP )
3. Enable AutoThrottle in settings
AUTOTHROTTLE_ENABLED — enable the AutoThrottle extension
AUTOTHROTTLE_START_DELAY — initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY — maximum download delay (seconds)
AUTOTHROTTLE_DEBUG — enable AutoThrottle debug mode
A combined settings sketch follows.
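A minimal settings.py sketch combining the options above; the numbers are illustrative, not recommendations:

DOWNLOAD_DELAY = 1                     # minimum delay between requests, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # illustrative concurrency cap

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial delay before any latency data exists
AUTOTHROTTLE_MAX_DELAY = 60            # never wait longer than this
AUTOTHROTTLE_DEBUG = True              # log every throttling decision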
5. Use different settings in different situations
1. By default no login state is needed, so cookies are not required; set COOKIES_ENABLED = False in settings
2. A site such as Zhihu that does need cookies can re-enable them per spider, directly in zhihu.py:
custom_settings = {
    "COOKIES_ENABLED": True,
}
6. Using selenium to visit dynamic websites
1. Getting to know selenium
1. Install selenium
Inside the virtualenv:
pip install selenium
Search for the selenium Python API documentation and find the drivers for the different browsers.
(The Firefox driver failed repeatedly; driver and browser versions must match.)
(The Chrome driver worked, downloaded from http://npm.taobao.org/mirrors/chromedriver/)
(Environment: Python 3.6.1, selenium 3.11, Chrome 65.0, chromedriver_win32.zip 2.37)
2. Use selenium's webdriver module to load pages automatically
from selenium import webdriver

browser = webdriver.Chrome(executable_path="D:\python_project\selenium_drviers\chromedriver.exe")
browser.get("https://item.taobao.com/item.htm?spm=a230r.1.14.71.31207d9buGsqEv&id=561063544221&ns=1&abbucket=8#detail")

print(browser.page_source)
3. Parsing fields out of selenium's page source
1. Use scrapy's own Selector for field parsing
from scrapy.selector import Selector

t_selector = Selector(text=browser.page_source)
t_selector.xpath( )
t_selector.css( )
2. Use selenium's own element lookup
browser.find_element_by_css_selector( ), browser.find_element_by_xpath( ), etc.
2. Load pages with selenium's webdriver.Chrome()
1. Simulate logging in to Zhihu with selenium
from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="D:\python_project\selenium_drviers\chromedriver.exe")
browser.get("https://www.zhihu.com/signin")
browser.find_element_by_css_selector(".Login-content input[name = 'username']").send_keys("17512009387")
browser.find_element_by_css_selector(".Login-content input[name = 'password']").send_keys("wuting123")
browser.find_element_by_css_selector("button.SignFlow-submitButton").click()
2. Scrape Weibo with selenium
1. Use the Weibo open platform: http://open.weibo.com/wiki/%E9%A6%96%E9%A1%B5
2. Problem encountered: elements cannot be located
1. Because the page uses an iframe and redirects, the elements may not be locatable right away; time.sleep(5) gives the page 5 seconds to load the data. An explicit wait is sketched below as an alternative.
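As an alternative to a fixed sleep, selenium's explicit waits block only until the element actually appears; a minimal sketch, using a selector similar to the username field in the login step below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait at most 10 seconds for the username input to show up
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "input[name='username']"))
)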
3. Log in to Weibo
(if a captcha appears it can be typed in manually)
def weibo():
    url = "https://weibo.com/"
    browser.get("https://weibo.com/")
    time.sleep(5)
    browser.find_element_by_css_selector("div.WB_miniblog div[node-type='username_box'] input[name = 'username']").send_keys("15870635250")
    browser.find_element_by_css_selector("div.WB_miniblog div[node-type='password_box'] input[name = 'password']").send_keys("tumeihong")
    browser.find_element_by_css_selector("div.WB_miniblog div[node-type='normal_form'] div.info_list.login_btn a[node-type = 'submitBtn']").click()
4. Simulate scrolling to load more content (for JavaScript-driven pages) by executing JavaScript:
for i in range(3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight); var lenOfPage=document.body.scrollHeight; return lenOfPage")
    time.sleep(3)
window.scrollTo(): scrolls the window to a given position
document.body.scrollHeight: the maximum coordinate the body element can be scrolled to
3. Configure chromedriver through selenium so that images are not loaded
# do not load images
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)
browser = webdriver.Chrome(executable_path="D:\python_project\selenium_drviers\chromedriver.exe", chrome_options=chrome_opt)
browser.get("http://www.taobao.com")
chrome_opt: a ChromeOptions instance.
prefs: sets chromedriver's image preference to 2, which means do not display images.
add_experimental_option() loads the extra preferences into the ChromeOptions object.
Passing chrome_options= hands the options to the browser instance.
Note: webdriver.Chrome() should not be created inside the function, or the window flashes open and quits; several browser instances can be managed with try...except.
3. Use the headless browser phantomjs
(its performance degrades badly under multiprocessing)
1. Install phantomjs
1. Download from http://phantomjs.org/ and unzip to D:/python_scrapy/phantomjs
2. Configure phantomjs
# executable_path must point at phantomjs.exe inside the directory unzipped above
# (the bin/ sub-path assumes the standard archive layout), not at chromedriver
browser = webdriver.PhantomJS(executable_path="D:/python_scrapy/phantomjs/bin/phantomjs.exe")
browser.get("http://www.baidu.com")

print(browser.page_source)
browser.quit()
No browser window is shown; everything runs automated
4. Integrate selenium into scrapy
Subtopic
By writing a downloader middleware, as sketched below
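A minimal sketch of such a middleware; the spider name is an assumption, and the spider is assumed to hold a webdriver instance in spider.browser. Returning an HtmlResponse short-circuits scrapy's own downloader for that request:

import time

from scrapy.http import HtmlResponse


class JSPageMiddleware(object):
    # hand selected requests to selenium instead of scrapy's downloader
    def process_request(self, request, spider):
        if spider.name == "zhihu":                 # assumed: only the spider that needs js rendering
            spider.browser.get(request.url)        # spider.browser assumed to be a webdriver.Chrome()
            time.sleep(3)                          # crude wait for dynamic content to render
            # returning a Response here stops scrapy from downloading the url itself
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8",
                                request=request)

Register it in DOWNLOADER_MIDDLEWARES in settings, just like the User-Agent and proxy middlewares above.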