Scrapy – Spider Contracts

牛青 • April 22, 2021, 7:00 am • Scrapy 2.3 Chinese Documentation - The Best Python Web Crawling Library

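Testing spiders can get particularly annoying. Nothing prevents you from writing unit tests, but the task becomes cumbersome quickly. Scrapy offers an integrated way of testing your spiders by means of contracts.

Contracts let you test each callback of your spider by hardcoding a sample URL and checking various constraints on how the callback processes the response. Each contract is prefixed with an @ and included in the docstring. See the following example: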

def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

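This callback is tested using three built-in contracts (@url, @returns and @scrapes). The built-in contracts are:

class scrapy.contracts.default.UrlContract

This contract (@url) sets the sample URL used when checking other contract conditions for this spider. This contract is mandatory. All callbacks lacking this contract are ignored when running the checks: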

@url url

class scrapy.contracts.default.CallbackKeywordArgumentsContract

This contract (@cb_kwargs) sets the cb_kwargs attribute for the sample request. It must be a valid JSON dictionary:

@cb_kwargs {"arg1": "value1", "arg2": "value2", ...}
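The "valid JSON dictionary" requirement is stricter than a Python dict literal. The following is a minimal sketch using only the standard library; parse_cb_kwargs is a hypothetical helper written for illustration, not a Scrapy function, and it only approximates the kind of validation the contract implies.

```python
import json


def parse_cb_kwargs(text):
    """Parse the text after @cb_kwargs and insist on a JSON object."""
    value = json.loads(text)  # raises ValueError on invalid JSON
    if not isinstance(value, dict):
        raise ValueError("cb_kwargs must be a JSON object")
    return value


# Double-quoted keys and values form valid JSON.
ok = parse_cb_kwargs('{"arg1": "value1", "arg2": "value2"}')

# A Python-style literal with single quotes is not valid JSON.
try:
    parse_cb_kwargs("{'arg1': 'value1'}")
    single_quotes_ok = True
except ValueError:
    single_quotes_ok = False
```

In practice this means writing the argument with double quotes, as in the syntax line above.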

class scrapy.contracts.default.ReturnsContract

This contract (@returns) sets lower and upper bounds for the items and requests returned by the spider. The upper bound is optional:

@returns item(s)|request(s) [min [max]]
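To illustrate how the bounds might be interpreted, here is a small standard-library sketch. parse_bounds and within_bounds are hypothetical helpers, and the defaults (a minimum of 1 and no maximum when omitted) are assumptions made for this example rather than something stated above.

```python
import math


def parse_bounds(args):
    """Turn ['items', '1', '16'] into ('items', 1, 16)."""
    kind = args[0]
    # Assumed defaults for this sketch: min 1, no upper limit.
    min_bound = int(args[1]) if len(args) > 1 else 1
    max_bound = int(args[2]) if len(args) > 2 else math.inf
    return kind, min_bound, max_bound


def within_bounds(count, min_bound, max_bound):
    """Both bounds are inclusive."""
    return min_bound <= count <= max_bound


kind, low, high = parse_bounds("items 1 16".split())
```

Under this reading, "@returns requests 0 0" from the earlier docstring asserts that the callback yields no requests at all.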

class scrapy.contracts.default.ScrapesContract

This contract (@scrapes) checks that all the items returned by the callback have the specified fields:

@scrapes field_1 field_2 ...
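The check this contract describes can be pictured as a field-presence test on each item. The sketch below uses plain dicts as items and a hypothetical missing_fields helper; the sample items are made up for illustration.

```python
def missing_fields(item, required):
    """Return the required field names that the item does not contain."""
    return [field for field in required if field not in item]


required = ["Title", "Author", "Year", "Price"]

# Made-up sample items; the second one lacks two fields.
items = [
    {"Title": "The Selfish Gene", "Author": "Richard Dawkins",
     "Year": 1976, "Price": 9.99},
    {"Title": "Untitled", "Author": "Unknown"},
]

# Map each item index to the fields it is missing.
report = {i: missing_fields(item, required) for i, item in enumerate(items)}
```

A contract of the form "@scrapes Title Author Year Price" would be expected to fail for the second item here.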

Use the check command to run the contract checks.
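As a rough illustration of the docstring convention (each contract on its own line, prefixed with @), the sketch below pulls the contract names and arguments out of the example docstring. extract_contract_lines is a hypothetical helper built on the standard library and is not presented as Scrapy's own parsing code.

```python
import re


def extract_contract_lines(docstring):
    """Return (name, args) pairs for each line that starts with '@'."""
    found = []
    for line in (docstring or "").splitlines():
        match = re.match(r"@(\w+)\s*(.*)", line.strip())
        if match:
            name, rest = match.groups()
            found.append((name, rest.split()))  # whitespace-separated args
    return found


SAMPLE = """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """

contracts = extract_contract_lines(SAMPLE)
```

Lines without a leading @ are treated as ordinary documentation and ignored.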

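Custom Contracts

If you find you need more power than the built-in Scrapy contracts, you can create and load your own contracts in the project by using the SPIDER_CONTRACTS setting: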

SPIDER_CONTRACTS = {
    'myproject.contracts.ResponseCheck': 10,
    'myproject.contracts.ItemValidate': 10,
}

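Each contract must inherit from Contract and can override three methods:

class scrapy.contracts.Contract(method, *args)

Parameters

method (collections.abc.Callable) -- the callback function to which the contract is associated
args (list) -- list of arguments passed into the docstring (whitespace separated)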

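adjust_request_args(args)

This receives a dict as an argument containing default arguments for the request object. Request is used by default, but this can be changed with the request_cls attribute. If multiple contracts in the chain define this attribute, the last one is used.

Must return the same dict or a modified version of it.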
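The shape of this hook can be sketched as a plain function that takes the argument dict and returns a modified copy. Inside a contract it would be a method taking self as well. The header name and the default values below are hypothetical, chosen only for illustration.

```python
def adjust_request_args(args):
    """Return a copy of `args` with an extra header merged in."""
    headers = dict(args.get("headers") or {})
    headers.setdefault("X-Contract-Check", "1")  # hypothetical header
    # Build a new dict so the caller's defaults are left untouched.
    return {**args, "headers": headers}


# Hypothetical default arguments for the sample request.
defaults = {"url": "http://www.example.com/", "method": "GET"}
adjusted = adjust_request_args(defaults)
```

Returning a new dict rather than mutating the input is a design choice of this sketch; the text above allows returning either the same dict or a modified version.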

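pre_process(response): This allows hooking in various checks on the response received from the sample request, before it is passed to the callback.

post_process(output): This allows processing the output of the callback. Iterators are converted to lists before being passed to this hook.

Raise ContractFail from pre_process or post_process if expectations are not met:

class scrapy.exceptions.ContractFail: Error raised in case of a failing contract

Here is a demo contract which checks the presence of a custom header in the response received: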

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class HasHeaderContract(Contract):
    """ Demo contract which checks the presence of a custom header
        @has_header X-CustomHeader
    """

    name = 'has_header'

    def pre_process(self, response):
        # self.args holds the header names listed after @has_header
        for header in self.args:
            if header not in response.headers:
                raise ContractFail(f'{header} not present')
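The demo contract uses pre_process. A post_process hook could be sketched along the same lines, as below. UniqueFieldContract, its @unique_field name and the duplicated_values helper are all hypothetical, and the fallback classes exist only so that the sketch runs where Scrapy is not installed. It assumes items are plain dicts.

```python
from collections import Counter

try:
    from scrapy.contracts import Contract
    from scrapy.exceptions import ContractFail
except ImportError:
    # Minimal stand-ins so this sketch runs without Scrapy installed.
    class ContractFail(AssertionError):
        pass

    class Contract:
        def __init__(self, method, *args):
            self.args = args


def duplicated_values(output, field):
    """Return values of `field` that occur in more than one dict item."""
    counts = Counter(
        item[field] for item in output
        if isinstance(item, dict) and field in item
    )
    return sorted(value for value, n in counts.items() if n > 1)


class UniqueFieldContract(Contract):
    """ Hypothetical contract: listed fields must be unique across items
        @unique_field url
    """

    name = 'unique_field'

    def post_process(self, output):
        # `output` arrives as a list, so it can be scanned more than once.
        for field in self.args:
            dupes = duplicated_values(output, field)
            if dupes:
                raise ContractFail(f'duplicate {field} values: {dupes}')
```

Keeping the comparison in a separate function makes the logic easy to exercise without building a spider or a response.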

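Detecting check runs

When scrapy check is running, the SCRAPY_CHECK environment variable is set to the string true. You can use os.environ to make any changes to your spiders or your settings when scrapy check is used: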

import os
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if os.environ.get('SCRAPY_CHECK'):
            pass  # Do some scraper adjustments when a check is running
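One way to keep such adjustments testable is to isolate the environment lookup in a small function. The sketch below uses only the standard library; is_check_run, pick_page_limit and the limit values are hypothetical names and numbers chosen for illustration.

```python
import os


def is_check_run(environ=None):
    """True when SCRAPY_CHECK is set to the string 'true'."""
    env = os.environ if environ is None else environ
    return env.get("SCRAPY_CHECK") == "true"


def pick_page_limit(environ=None):
    """Crawl fewer pages during a check run (hypothetical limits)."""
    return 1 if is_check_run(environ) else 50


# Passing a dict instead of os.environ makes the behaviour easy to try out.
limit_during_check = pick_page_limit({"SCRAPY_CHECK": "true"})
limit_normally = pick_page_limit({})
```

A spider could call pick_page_limit() with no arguments in its constructor, in which case the real process environment is consulted.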

