
Python Development: Web Crawling and Information Extraction (Requests, Beautiful Soup 4, Scrapy)

大大的周 3291


Requests

    import requests

    url = 'http://blog.wpnet.info'
    r = requests.get(url)
    print('Page content:', r.text)
    print('Response type:', type(r))
    print('Status code:', r.status_code)
    print('Headers:', r.headers)
    print('Encoding (from headers):', r.encoding)
    print('Encoding (detected from content):', r.apparent_encoding)
    print('Raw bytes of the HTTP response:', r.content)

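`r = requests.get(url)` returns a `Response` object.

`r.status_code` holds the HTTP status code.

`r.headers` holds the response headers.

These are the same headers the browser shows under "Response Headers".

The `Response` object contains everything the server returned, and it also keeps the request we sent to the server (`r.request`).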

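Commonly used Response attributes

| Attribute | Description |
| --- | --- |
| r.status_code | HTTP status code of the response |
| r.text | Content of the requested page, decoded to a string |
| r.encoding | Response encoding guessed from the HTTP headers |
| r.apparent_encoding | Response encoding detected by analysing the content (fallback) |
| r.content | HTTP response body as raw bytes (images and so on) |

Difference between `r.encoding` and `r.apparent_encoding`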

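`r.encoding` is taken from the response headers, while `r.apparent_encoding` is inferred by analysing the content. Not every server sends the charset in its headers; when it is missing, `r.encoding` falls back to the standard encoding ISO-8859-1, which cannot represent Chinese.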
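The effect of the ISO-8859-1 fallback can be reproduced without any network access. The following is a minimal sketch using only built-in string methods; the sample text is made up for illustration and stands in for a page body that a server sent as UTF-8.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = "网络爬虫".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but yields mojibake.
wrong = raw.decode("ISO-8859-1")

# Decoding with the encoding detected from the content recovers the text.
right = raw.decode("utf-8")

# The damage is reversible: ISO-8859-1 maps every byte to one character.
repaired = wrong.encode("ISO-8859-1").decode("utf-8")

print(repr(wrong))
print(right)
print(repaired == right)
```

This is why assigning `r.encoding = r.apparent_encoding` before reading `r.text` usually fixes garbled Chinese pages.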

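Requests exceptions

| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection error (connection refused and so on) |
| requests.HTTPError | HTTP error |
| requests.URLRequired | Missing URL |
| requests.TooManyRedirects | Maximum number of redirects exceeded |
| requests.ConnectTimeout | Timed out while connecting to the server |
| requests.Timeout | Timed out while requesting the URL |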

`requests.Timeout` covers a timeout anywhere in the whole request.

`requests.ConnectTimeout` covers only the connection phase.
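Because these exception classes overlap, the order of the checks matters. The sketch below is one possible way to tell them apart; `classify` and `fetch` are hypothetical helper names introduced here, not part of requests.

```python
import requests


def classify(exc):
    """Map a requests exception to a short label, most specific first."""
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    return "other"


def fetch(url, timeout=5):
    """Return the page text, or None after reporting what went wrong."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return r.text
    except requests.RequestException as exc:
        print("request failed:", classify(exc))
        return None


# Offline check of the classification, no request is sent here.
print(classify(requests.ReadTimeout()))
```

`requests.ConnectTimeout` inherits from both `ConnectionError` and `Timeout`, so it has to be tested before either of them.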

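A general-purpose code framework for requests

    import sys

    import requests

    def getText(url):
        try:
            r = requests.get(url, timeout=5)
            # Raise HTTPError if the status code signals an error (4xx or 5xx)
            r.raise_for_status()
            # Use apparent_encoding so the body is decoded more accurately
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException as exc:
            print('Error:', exc)
            sys.exit(1)

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info'
        text = getText(url)
        print(text)

The 7 main requests methods and the 13 access-control parameters

| Method | Description |
| --- | --- |
| requests.request() | Builds a request; the base method that all the methods below rely on |
| requests.get() | HTTP GET |
| requests.post() | HTTP POST |
| requests.head() | HTTP HEAD, fetches only the headers |
| requests.put() | HTTP PUT |
| requests.patch() | HTTP PATCH, submits a partial modification |
| requests.delete() | HTTP DELETE, submits a delete request |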

The requests.request() method

requests.request() takes three arguments:

method: the request method (GET, POST, PUT, and so on)

url: the target URL

`**kwargs`: the 13 access-control parameters

params: (dict or byte sequence) appended to the URL as the query string (GET)

    import requests

    url = 'http://blog.wpnet.info'
    mdict = {
        'key1': 'value1',
        'key2': 'value2'
    }
    r = requests.request('GET', url, params=mdict)
    print(r.url)
    # http://blog.wpnet.info/?key1=value1&key2=value2

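data: (dict, byte sequence, or file object) serves a purpose similar to params, but is sent in the body of the request (POST)

json: (JSON-serializable data) sent as the JSON body of the request

headers: (dict) custom HTTP headers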

    import requests

    url = 'http://blog.wpnet.info'
    mdict = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
    r = requests.request('get', url, headers=mdict)
    print(r.text)

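cookies: (dict or CookieJar) custom cookies

auth: (tuple) HTTP authentication

files: (dict) files to upload

    import requests

    url = 'http://blog.wpnet.info'
    # Open the file in binary mode and close it once the upload is done
    with open('test.txt', 'rb') as f:
        r = requests.request('POST', url, files={'file': f})
    print(r.text)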

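timeout: (number) timeout in seconds

proxies: (dict) proxy servers

allow_redirects: (bool) default True; switch for following redirects

stream: (bool) default False, which downloads the response body immediately; set it to True to defer the download

verify: (bool) default True; switch for verifying the SSL certificate

cert: (string) path to a local SSL certificate

requests.get()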

requests.get() takes three arguments:

url: (string) the target URL

params: (dict or byte sequence) query parameters

`**kwargs`: the remaining 12 access-control parameters

    import requests

    url = 'http://blog.wpnet.info'
    mdict = {
        'key1': 'value1',
        'key2': 'value2'
    }
    r = requests.get(url, params=mdict)
    print(r.text)

The other methods take essentially the same arguments and control parameters as requests.request().

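The robots protocol

The robots protocol uses a simple syntax to tell crawlers which directories they may access and which they may not. robots.txt must be placed in the root directory of the site. If a site has no robots.txt, crawlers are allowed to access and crawl its content without restriction.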
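Checking a URL against robots.txt can be done with the standard library. The sketch below parses an inline robots.txt instead of downloading one; the rules, the crawler name `my-crawler`, and the `example.com` host are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no download is needed.
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/index.html", "/private/data.html"):
    allowed = parser.can_fetch("my-crawler", "http://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the real file from the site root instead.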

https://·/

    import os
    import sys
    from urllib.parse import urljoin

    import bs4
    import requests

    def getimage(url):
        header = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.3538.77 Safari/537.36'
        }
        r = requests.request('GET', url, headers=header, timeout=10)
        # Check the status first, then fix the encoding if needed
        if r.status_code != 200:
            print('Bad status: ', r.status_code)
            sys.exit(1)
        if r.encoding == 'ISO-8859-1':
            r.encoding = r.apparent_encoding
        src = bs4.BeautifulSoup(r.text, 'html.parser')
        # urljoin resolves relative paths and leaves absolute http(s) URLs unchanged
        listimg = [urljoin(url, i['src']) for i in src.find_all('img') if i.get('src')]
        folder = os.path.join(os.getcwd(), 'images')
        os.makedirs(folder, exist_ok=True)
        for i in listimg:
            image = os.path.join(folder, i.split('/')[-1])
            if os.path.exists(image):
                continue
            print('Downloading: ', i)
            image_download = requests.get(i, headers=header, timeout=10)
            if image_download.status_code == 200:
                try:
                    with open(image, 'wb') as file:
                        file.write(image_download.content)
                except OSError as exc:
                    print('Error:', exc)
        print('ok')

    if __name__ == '__main__':
        url = 'https://cc.cqcet.edu.cn/'
        getimage(url)

Parsing API responses

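API endpoints usually do not need custom headers, and the data they return is usually JSON; deserialize the JSON and read out the fields you need.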

    import json
    import sys

    import requests

    def getipaddr(ip):
        key = 'null'  # placeholder, replace with your own API key
        url = f'https://binstd.apistd.com/ip/location?ip={ip}&key={key}'
        r = requests.request('GET', url, timeout=5)
        if r.status_code != 200:
            print('Network failure or invalid key')
            sys.exit(1)
        # Deserialize the JSON body and pick out the fields of interest
        info = json.loads(r.text)
        country = info['result']['country']
        ip = info['result']['ip']
        addr = info['result']['area']
        types = info['result']['type']
        return [country, ip, addr, types]

    if __name__ == '__main__':
        ip = input('IP to look up: ')
        info = getipaddr(ip)
        print('Country: ' + info[0] + ' IP: ' + info[1] + ' Address: ' + info[2] + ' Type: ' + info[3])
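The deserialization step can be tried offline. In the sketch below only the field names (`result`, `country`, `ip`, `area`, `type`) follow the code above; the values in `sample` are invented, and `parse_location` is a hypothetical helper that also tolerates a malformed or unexpected payload.

```python
import json

# Made-up payload for illustration; only the field names follow the code above.
sample = (
    '{"result": {"ip": "203.0.113.7", "country": "Exampleland", '
    '"area": "Example City", "type": "ISP"}}'
)


def parse_location(text):
    """Return (country, ip, area, type), or None if the payload is unusable."""
    try:
        result = json.loads(text)["result"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(result, dict):
        return None
    # Missing fields become empty strings instead of raising KeyError.
    return tuple(result.get(k, "") for k in ("country", "ip", "area", "type"))


print(parse_location(sample))
print(parse_location("not json"))
```

Returning `None` instead of exiting leaves the decision about how to report the failure to the caller.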

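Beautiful Soup 4

bs4 is usually paired with requests, or used to parse an .html file directly. Once a BeautifulSoup object has been created, working with that object is working with the HTML content.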

Basic usage

    import bs4
    import requests

    def getimage(url):
        r = requests.request("GET", url)
        url_text = r.text
        soup = bs4.BeautifulSoup(url_text, 'html.parser')
        print(soup.prettify())

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info/'
        getimage(url)

`soup = bs4.BeautifulSoup(url_text, 'html.parser')` uses the BeautifulSoup class from bs4 to parse url_text with the HTML parser.

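The four parsers of the bs4 library

| Parser | Usage | Requirement |
| --- | --- | --- |
| bs4's HTML parser | bs4.BeautifulSoup(url_text, 'html.parser') | None beyond bs4 itself (it uses Python's built-in html.parser) |
| lxml's HTML parser | bs4.BeautifulSoup(url_text, 'lxml') | Install lxml |
| lxml's XML parser | bs4.BeautifulSoup(url_text, 'xml') | Install lxml |
| html5lib's parser | bs4.BeautifulSoup(url_text, 'html5lib') | Install html5lib |

Basic elements of the bs4 library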

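The five basic elements

| Basic element | Description |
| --- | --- |
| Tag | A tag |
| Name | The tag's name, accessed as .name |
| Attributes | The tag's attributes, accessed as .attrs |
| NavigableString | The non-attribute string inside a tag, accessed as .string |
| Comment | The comment inside a tag, a special type |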
    import bs4
    import requests

    def getimage(url):
        r = requests.request("GET", url)
        url_text = r.text
        soup = bs4.BeautifulSoup(url_text, 'html.parser')
        # Print the first <a> tag
        print(soup.a)
        # Print the name of the <a> tag's parent
        print(soup.a.parent.name)
        # Print the attributes of the <a> tag
        print(soup.a.attrs)
        # Print the string inside the <a> tag
        print(soup.a.string)

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info/'
        getimage(url)

Traversal with bs4

Traversal of HTML tags in bs4 falls into three kinds:

Downward traversal
Upward traversal
Sibling (parallel) traversal

[Diagram: an HTML tag tree with the nodes <html>, <head>, <body>, <title>, <p>, <b>, and <a>, showing the downward, upward, and parallel traversal directions]

Downward traversal

| Attribute | Description |
| --- | --- |
| .contents | List of a node's children; all child nodes are stored in a list |
| .children | Iterator over a node's children, similar to .contents |
| .descendants | Iterator over all of a node's descendants, children and deeper |
    import bs4
    import requests

    def getimage(url):
        r = requests.request("GET", url)
        url_text = r.text
        soup = bs4.BeautifulSoup(url_text, 'html.parser')
        # Print the <head> tag
        # print(soup.head)
        # Print the child nodes of <head>
        print(soup.head.contents)
        # Print the number of child nodes of <head>
        print(len(soup.head.contents))
        # Print the child of <head> at index 1
        print(soup.head.contents[1])
        # Iterate over the children of <head>
        for label in soup.head.children:
            print(label)

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info/'
        getimage(url)

Upward traversal

| Attribute | Description |
| --- | --- |
| .parent | The node's parent tag |
| .parents | Iterator over the node's ancestors, used for traversal |
    import bs4
    import requests

    def getimage(url):
        r = requests.request("GET", url)
        url_text = r.text
        soup = bs4.BeautifulSoup(url_text, 'html.parser')
        # <html> has no parent tag, so its parent is the whole document
        # and this prints the entire page
        print(soup.html.parent)
        print('--------------------------------')
        # Print the names of all ancestors of the first <a> tag
        for label in soup.a.parents:
            if label is None:
                print(label)
            else:
                print(label.name)

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info/'
        getimage(url)

Sibling (parallel) traversal

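Sibling traversal only happens among nodes that share the same parent tag. The next sibling is not necessarily a tag; it can also be a string.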

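| Attribute | Description |
| --- | --- |
| .next_sibling | The next sibling node in HTML text order |
| .previous_sibling | The previous sibling node in HTML text order |
| .next_siblings | Iterator over all following siblings in HTML text order |
| .previous_siblings | Iterator over all preceding siblings in HTML text order |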
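The point that a sibling need not be a tag is easy to see on a small document. The following sketch assumes bs4 is installed; the HTML snippet and the `id` values are made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = "<p><a id='first'>one</a> and <a id='second'>two</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="first")

# The immediate next sibling is the text " and ", not the second <a>.
print(repr(first.next_sibling))

# Filtering by type skips text nodes and keeps only tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])
```

When only tags matter, `find_next_sibling()` performs the same filtering in one call.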
prettify: formatting HTML for readability

The prettify() function adds a line break after each tag, which tidies up irregular HTML and makes the data much more readable.

    import bs4
    import requests

    def getimage(url):
        r = requests.request("GET", url)
        url_text = r.text
        soup = bs4.BeautifulSoup(url_text, 'html.parser')
        print(soup)
        print('--------------------------------------------')
        print(soup.prettify())

    if __name__ == '__main__':
        url = 'https://python123.io/ws/demo.html'
        getimage(url)

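Organizing and extracting information

Marked-up information forms an organized structure, which adds a dimension to the information
Marked-up information can be used for communication and storage
Marked-up information is easier for people to understand

Information markup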

There are three common forms of information markup:

XML
JSON
YAML

Information extraction

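General approaches to information extraction

Parse the markup completely, then extract the key information. Advantage: the information is parsed accurately. Disadvantage: it is inefficient and requires a full understanding of the information structure.

Ignore the markup and search for the key information directly, for example with regular expressions. Advantage: extraction is fast. Disadvantage: the patterns need tuning before the results are correct.

Combine the two approaches above.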
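The first two approaches can be compared side by side on the same input. This is a minimal standard-library sketch; the HTML snippet is invented, and `LinkCollector` is a hypothetical class name used only here.

```python
import re
from html.parser import HTMLParser

HTML = '<p><a href="/a.html">A</a> text <a class="x" href="http://example.com/b">B</a></p>'


class LinkCollector(HTMLParser):
    """Approach 1: parse the markup fully and read href from each <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(HTML)

# Approach 2: ignore the structure and search the raw text with a regex.
regex_links = re.findall(r'href="([^"]+)"', HTML)

print(collector.links)
print(regex_links)
```

Both give the same result here, but the regex would also match an `href` inside a comment or a non-link tag, which is the kind of tuning the second approach needs.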

Example: extracting all hyperlinks from an HTML page

http://blog.wpnet.info

    import sys
    from urllib.parse import urljoin

    import bs4
    import requests

    def getinfo(url):
        r = requests.request('GET', url, timeout=5)
        if r.status_code != 200:
            print('status error:', r.status_code)
            sys.exit(1)
        soup = bs4.BeautifulSoup(r.text, 'html.parser')
        for i in soup.find_all('a'):
            href = i.get('href')
            if href is None:
                continue  # skip <a> tags without an href attribute
            # urljoin keeps absolute URLs as they are and resolves relative ones
            print(urljoin(url, href))

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info'
        getinfo(url)

Searching HTML content with bs4: find_all

The bs4 library provides a find_all method for searching the document. It takes five parameters:

find_all(name,attrs,recursive,string,**kwargs)

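| Parameter | Purpose |
| --- | --- |
| name | Search by tag name |
| attrs | Search by tag attribute value |
| recursive | Whether to search all descendants (default True) |
| string | Search the string content inside tags |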

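Shorthand for the find_all method

<tag>(...) is equivalent to <tag>.find_all(...)
soup(...) is equivalent to soup.find_all(...)

This shorthand works, but it is not recommended because it hurts readability.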

    import sys

    import bs4
    import requests

    def getinfo(url):
        r = requests.request('GET', url, timeout=5)
        if r.status_code != 200:
            print('status error:', r.status_code)
            sys.exit(1)
        r_text = r.text
        soup = bs4.BeautifulSoup(r_text, 'html.parser')
        # Find all <a> tags whose CSS class is hover-underline
        print(soup.find_all('a', 'hover-underline'))
        # Find strings equal to '信息安全'
        print(soup.find_all(string='信息安全'))
        # Shorthand for find_all
        print(soup(string='信息安全'))

    if __name__ == '__main__':
        url = 'http://blog.wpnet.info'
        getinfo(url)

The other seven methods of the bs4 find family

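Besides the commonly used find_all(), there are seven more methods:

| Method | Description |
| --- | --- |
| find() | Searches and returns only the first result, as a single element (None if nothing matches) |
| find_parents() | Searches the ancestors; returns a list |
| find_parent() | Searches the ancestors and returns only the first result, as a single element |
| find_next_siblings() | Searches the following siblings; returns a list |
| find_next_sibling() | Searches the following siblings and returns only the first result, as a single element |
| find_previous_siblings() | Searches the preceding siblings; returns a list |
| find_previous_sibling() | Searches the preceding siblings and returns only the first result, as a single element |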
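The difference in return types can be checked on a small document. The sketch below assumes bs4 is installed; the HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p class='x'>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

hit = soup.find("p")           # a single Tag
all_hits = soup.find_all("p")  # a list-like ResultSet
miss = soup.find("table")      # None, because nothing matches

print(type(hit).__name__, hit.string)
print(len(all_hits))
print(miss)
print(hit.find_next_sibling("p").string)
print(hit.find_parent("div").name)
```

Because `find()` can return `None`, its result is worth checking before accessing attributes such as `.string`.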
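The CSS selector of the bs4 library: select

    soup.select('title')                 # select elements by tag name
    soup.select('html body p')           # <p> tags inside <body> inside <html>
    soup.select('div[class="text"]')     # <div> tags with class="text"; the result is a ResultSet
    soup.select('div[class="text"]')[0]  # the first match, of type Tag

Examples

Targeted crawl of the 校友会 (Alumni Association) 2021 ranking of Chinese higher vocational colleges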

Note: as of 2021/07/10 the 校友会 site has no robots.txt, so the robots protocol places no restriction on crawling it.

http://·[…]

    […]
        rank = re.compile(r'>(.*\d)<')
        rank = rank.findall(getvalue)
        name = re.compile(r'<td nowrap=\"\" width=\"40%\">\n<p align=\"center\">(.*)</p></td>')
        name = name.findall(getvalue)
        name.pop(0)
        num = re.compile(r'<p align=\"center\">(\d*.*) </p></td>')
        num = num.findall(getvalue)
        chenci = re.compile(r'<p align=\"center\">(.*)</p></td></tr>')
        chenci = chenci.findall(getvalue)
        chenci.pop(0)
        for n in range(len(name)):
            info_dict.append({"school": name[n], "rank": rank[n], "score": num[n], "tier": chenci[n]})
        # Sort in place; a bare sorted() call would discard its result
        info_dict.sort(key=lambda i: i["rank"])
        return info_dict

    if __name__ == '__main__':
        url = 'http://·[…]

[…]/search?q=鼠标&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20210714&ie=utf8

Page 2:

https://s.taobao.com/search?q=鼠标&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20210714&ie=utf8&bcoffset=3&ntoffset=3&p4ppushleft=1,48&s=44

Page 3:

https://s.taobao.com/search?q=鼠标&imgfile=&js=1&stats_click=search_radio_all:1&initiative_id=staobaoz_20210714&ie=utf8&bcoffset=3&ntoffset=0&p4ppushleft=1,48&s=88
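The two URLs above differ mainly in the `s` parameter (44 for page 2, 88 for page 3), which suggests an offset of 44 items per page. The sketch below builds page URLs on that assumption; `page_url` and `PAGE_SIZE` are hypothetical names, and the other query parameters are left out on the assumption that they are optional.

```python
from urllib.parse import urlencode

BASE = "https://s.taobao.com/search"
PAGE_SIZE = 44  # inferred from s=44 (page 2) and s=88 (page 3) above


def page_url(keyword, page):
    """Build the search URL for a 1-based page number."""
    query = {"q": keyword, "s": PAGE_SIZE * (page - 1)}
    # urlencode percent-encodes the keyword as UTF-8
    return BASE + "?" + urlencode(query)


for page in (1, 2, 3):
    print(page_url("鼠标", page))
```

Building each URL from the base avoids the pitfall of appending `&s=` to an already extended URL inside a loop.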

    import re
    import sys

    import requests

    def get_text(url):
        cookie_str = 'null'  # placeholder, paste your own Cookie header here
        header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Brave Chrome/90.0.4430.72 Safari/537.36'
        }
        cookie = {}
        for i in cookie_str.split(';'):
            if '=' in i:  # ignore fragments that are not key=value pairs
                key, value = i.strip().split('=', 1)
                cookie[key] = value
        r = requests.request('GET', url, headers=header, cookies=cookie, timeout=10)
        if r.status_code != 200:
            print('status_error:', r.status_code)
            sys.exit(1)
        r.encoding = r.apparent_encoding
        return r.text

    def get_info(url_text):
        # Price
        price = re.findall(r'\"view_price\":\"(\d+\.\d*)', url_text)
        # Location
        area = re.findall(r'\"item_loc\":\"(\D+)\",', url_text)
        # Number of buyers
        paynum = re.findall(r'\"view_sales\":\"(.+?)\"', url_text)
        # Shop name
        name = re.findall(r'\"nick\":\"(\D*)\",', url_text)
        print(price)
        print(area)
        print(name)
        print(paynum)
        print(len(name))

    def main():
        depth = 2
        findname = '鼠标'
        start_url = "https://s.taobao.com/search?q={}".format(findname)
        for i in range(depth):
            try:
                # Build each page URL from start_url so '&s=' is not appended repeatedly
                url = start_url + '&s=' + str(44 * i)
                url_text = get_text(url)
                get_info(url_text)
            except Exception as exc:
                print('error:', exc)

    if __name__ == '__main__':
        main()

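The Scrapy crawler framework

Request class

Scrapy's Request class is not the same thing as the requests library, although the two are similar.

| Attribute or method | Description |
| --- | --- |
| .url | URL that the request is for |
| .method | Request method, 'GET', 'POST', and so on |
| .headers | Request headers |
| .body | Request body (bytes) |
| .meta | Extra information added by the user, passed between Scrapy's internal modules |
| .copy() | Copies the request |

Response class

Represents an HTTP response; generated by the DOWNLOADER and processed by the SPIDERS.

| Attribute or method | Description |
| --- | --- |
| .url | URL that the response is for |
| .status | HTTP status code |
| .headers | Response headers |
| .body | Response body (bytes) |
| .flags | Flags |
| .request | The Request object that produced this response |
| .copy() | Copies the response |

Item class

Represents the information extracted from an HTML page.
Generated by the SPIDERS and processed by the ITEM PIPELINES.
An Item behaves like a dict.

Ways for a Scrapy crawler to extract information

Beautiful Soup (bs4)
lxml
re
XPath selectors
CSS selectors

The framework's 5+2 structure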

The five main modules, of which ENGINE, SCHEDULER, and DOWNLOADER generally need no changes from the user

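SPIDERS (most commonly edited): parse the responses returned by the DOWNLOADER and produce scraped items and further requests
ENGINE (core): controls the data flow between all modules
SCHEDULER: schedules and manages the requests
DOWNLOADER: downloads pages from the internet according to the requests
ITEM PIPELINES: process the items scraped by the SPIDERS as a pipeline

The two middlewares

DOWNLOADER MIDDLEWARE: user-configurable control between the ENGINE, SCHEDULER, and DOWNLOADER modules (modify, drop, or add requests and responses)
SPIDER MIDDLEWARE: user-configurable control between the ENGINE and the SPIDERS (reprocess requests and scraped items)

Data paths in the framework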

Path 1

text{fill:#9370db;fill:#333;stroke:none;font-size:10px}#mermaid-svg-egJXP537Nf5lOVrf g.statediagram-cluster .cluster-label text{fill:#333}#mermaid-svg-egJXP537Nf5lOVrf g.stateGroup .state-title{font-weight:bolder;fill:#000}#mermaid-svg-egJXP537Nf5lOVrf g.stateGroup rect{fill:#ECECFF;stroke:#9370db}#mermaid-svg-egJXP537Nf5lOVrf g.stateGroup line{stroke:#9370db;stroke-width:1}#mermaid-svg-egJXP537Nf5lOVrf .transition{stroke:#9370db;stroke-width:1;fill:none}#mermaid-svg-egJXP537Nf5lOVrf .stateGroup .composit{fill:white;border-bottom:1px}#mermaid-svg-egJXP537Nf5lOVrf .stateGroup .alt-composit{fill:#e0e0e0;border-bottom:1px}#mermaid-svg-egJXP537Nf5lOVrf .state-note{stroke:#aa3;fill:#fff5ad}#mermaid-svg-egJXP537Nf5lOVrf .state-note text{fill:black;stroke:none;font-size:10px}#mermaid-svg-egJXP537Nf5lOVrf .stateLabel .box{stroke:none;stroke-width:0;fill:#ECECFF;opacity:0.7}#mermaid-svg-egJXP537Nf5lOVrf .edgeLabel text{fill:#333}#mermaid-svg-egJXP537Nf5lOVrf .stateLabel text{fill:#000;font-size:10px;font-weight:bold;font-family:'trebuchet ms', verdana, arial;font-family:var(--mermaid-font-family)}#mermaid-svg-egJXP537Nf5lOVrf .node circle.state-start{fill:black;stroke:black}#mermaid-svg-egJXP537Nf5lOVrf .node circle.state-end{fill:black;stroke:white;stroke-width:1.5}#mermaid-svg-egJXP537Nf5lOVrf #statediagram-barbEnd{fill:#9370db}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-cluster rect{fill:#ECECFF;stroke:#9370db;stroke-width:1px}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-cluster rect.outer{rx:5px;ry:5px}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-state .divider{stroke:#9370db}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-state .title-state{rx:5px;ry:5px}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-cluster.statediagram-cluster .inner{fill:white}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-cluster.statediagram-cluster-alt .inner{fill:#e0e0e0}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-cluster .inner{rx:0;ry:0}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-state 
rect.basic{rx:5px;ry:5px}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-state rect.divider{stroke-dasharray:10,10;fill:#efefef}#mermaid-svg-egJXP537Nf5lOVrf .note-edge{stroke-dasharray:5}#mermaid-svg-egJXP537Nf5lOVrf .statediagram-note rect{fill:#fff5ad;stroke:#aa3;stroke-width:1px;rx:0;ry:0}:root{--mermaid-font-family: '"trebuchet ms", verdana, arial';--mermaid-font-family: "Comic Sans MS", "Comic Sans", cursive}#mermaid-svg-egJXP537Nf5lOVrf .error-icon{fill:#522}#mermaid-svg-egJXP537Nf5lOVrf .error-text{fill:#522;stroke:#522}#mermaid-svg-egJXP537Nf5lOVrf .edge-thickness-normal{stroke-width:2px}#mermaid-svg-egJXP537Nf5lOVrf .edge-thickness-thick{stroke-width:3.5px}#mermaid-svg-egJXP537Nf5lOVrf .edge-pattern-solid{stroke-dasharray:0}#mermaid-svg-egJXP537Nf5lOVrf .edge-pattern-dashed{stroke-dasharray:3}#mermaid-svg-egJXP537Nf5lOVrf .edge-pattern-dotted{stroke-dasharray:2}#mermaid-svg-egJXP537Nf5lOVrf .marker{fill:#333}#mermaid-svg-egJXP537Nf5lOVrf .marker.cross{stroke:#333} :root { --mermaid-font-family: "trebuchet ms", verdana, arial;} #mermaid-svg-egJXP537Nf5lOVrf { color: rgba(0, 0, 0, 0.75); font: ; } REQUESTS 转发 SPIDERS ENGINE SCHEDULER 对爬取请求进行调度
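The first data path (SPIDERS hand REQUESTS to the ENGINE, which forwards them to the SCHEDULER) can be mimicked in a few lines of plain Python. This is only a toy model to make the direction of the flow concrete. It is not how Scrapy is implemented, and the names `ToyScheduler`, `toy_spider_requests`, and `toy_engine_forward`, as well as the second URL, are made up for illustration.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for the SCHEDULER: a plain FIFO queue of URLs."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, request):
        self._queue.append(request)

    def next_request(self):
        # Return the oldest pending request, or None when the queue is empty
        return self._queue.popleft() if self._queue else None


def toy_spider_requests():
    """Hypothetical SPIDERS side: yields the URLs it wants crawled."""
    yield 'http://blog.wpnet.info/index.html'
    yield 'http://example.com/page2.html'  # placeholder URL, not from the article


def toy_engine_forward(spider_requests, scheduler):
    """Hypothetical ENGINE step: forward every spider request to the scheduler."""
    for request in spider_requests:
        scheduler.enqueue(request)


scheduler = ToyScheduler()
toy_engine_forward(toy_spider_requests(), scheduler)
print(scheduler.next_request())
```

In this sketch the scheduler hands requests back in the order it received them. A real scheduler may prioritise or filter requests, which the toy model deliberately leaves out.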

Path 2

[Diagram: SCHEDULER → ENGINE (step 1, REQUESTS) → DOWNLOADER (step 2, REQUESTS) → ENGINE (step 3, RESPONSE) → SPIDERS (step 4, RESPONSE).]

requests and scrapy

Similarities

- Both are easy to use and well documented.
- Both are used to write crawlers.

Differences

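| requests | scrapy |
|---|---|
| Page-level crawler | Site-level crawler |
| Library | Framework |
| Weak concurrency | High performance |
| Focus is on downloading pages | Focus is on the crawler's structure |
| Flexible to customize | Flexible for ordinary customization, difficult for deep customization |
| Easy to learn | Harder to learn than requests |
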
The scrapy command line

scrapy <command> [options] [args]


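Common commands

| Command | Description | Format |
|---|---|---|
| startproject | Create a new project | scrapy startproject <name> [dir] |
| genspider | Create a spider | scrapy genspider [options] <name> <domain> |
| settings | Show the crawler's configuration | scrapy settings [options] |
| crawl | Run a spider | scrapy crawl <spider> |
| list | List all spiders in the project | scrapy list |
| shell | Start an interactive shell for debugging a URL | scrapy shell [url] |
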
Example

Create a new Scrapy project

    scrapy startproject wpsecblog

Create a Scrapy spider (run this inside the project directory)

    scrapy genspider wpsec blog.wpnet.info

Creating a project generates the following files and directories:

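    wpsecblog/                  outer project directory
        scrapy.cfg              configuration file for deploying the Scrapy crawler
        wpsecblog/              Python package that holds the project's code
            __init__.py         initialization script
            items.py            Items code template (a class to subclass)
            middlewares.py      Middlewares code template (a class to subclass)
            pipelines.py        Pipelines code template (a class to subclass)
            settings.py         configuration file for the Scrapy crawler
            spiders/            the spiders of this project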

After the spider is created, a file for it (wpsec.py in this example) appears in the spiders directory.

Edit the spider

Crawl one page and save it:

    import scrapy


    class WpsecSpider(scrapy.Spider):
        name = 'wpsec'
        allowed_domains = ['blog.wpnet.info']
        start_urls = ['http://blog.wpnet.info/index.html']

        def parse(self, response):
            # Use the last segment of the URL as the local file name
            fname = response.url.split('/')[-1]
            # response.body is bytes, so the file is opened in binary mode
            with open(fname, 'wb') as file:
                file.write(response.body)

Run it

    scrapy crawl wpsec
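The `parse` method above takes the file name from the last segment of the URL. That works for `http://blog.wpnet.info/index.html`, but a URL that ends in `/` would give an empty name, and a query string would end up in the file name. Below is a minimal sketch of a more defensive helper, using only the standard library. The function name `filename_from_url` and the `index.html` fallback are assumptions for illustration and are not part of the original spider.

```python
from urllib.parse import urlparse


def filename_from_url(url, default='index.html'):
    """Return the last path segment of url, or default when it is empty."""
    # urlparse separates the path from the query string and the host
    path = urlparse(url).path
    name = path.rsplit('/', 1)[-1]
    return name or default


print(filename_from_url('http://blog.wpnet.info/index.html'))
print(filename_from_url('http://blog.wpnet.info/'))
```

Inside the spider, this helper could replace the `response.url.split('/')[-1]` expression if pages without an explicit file name need to be saved as well.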

The crawled page is saved to a local file.

The yield keyword

yield <-> generator
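A small sketch of how `yield` turns a function into a generator: each call to `next()` runs the function up to the next `yield`, hands back one value, and pauses there until the following value is requested. The function name `squares` is only an example.

```python
def squares(n):
    """Generator: yield i * i for i in 0..n-1, one value per resumption."""
    for i in range(n):
        yield i * i  # execution pauses here until the next value is requested


gen = squares(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(list(gen))  # [4]  (only the values not yet consumed)
```

Because values are produced one at a time, a generator does not need to hold the whole sequence in memory. This is the same mechanism a spider's `parse` method can use to hand results back one by one.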

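- A generator is a function that keeps producing values.
- Any function that contains a yield statement is a generator.
- Each time, the generator produces one value (at the yield statement) and the function is frozen; when it is woken up, it produces the next value.

Excerpted from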

Beijing Institute of Technology, Song Tian (嵩天): "Python网络爬虫与信息提取" (Python Web Crawlers and Information Extraction)

