Scrapy – 常用做法

yiyan • 2021年4月21日 pm10:59 • Scrapy2.3中文文档 - 最优秀的Python网络爬虫库

本节记录使用Scrapy时的常见做法。这些内容涵盖了许多主题，通常不属于任何其他特定部分。

从脚本中运行Scrapy¶

你可以使用 API 从脚本运行scrapy，而不是运行scrapy via的典型方式 scrapy crawl .

记住，scrappy构建在TwistedAsynchronicNetworkLibrary之上，所以需要在TwistedReactor中运行它。

你能用来运行蜘蛛的第一个工具是 scrapy.crawler.CrawlerProcess . 这个类将为您启动一个扭曲的反应器，配置日志记录和设置关闭处理程序。这个类是所有slapy命令使用的类。

下面是一个示例，演示如何使用它运行单个蜘蛛。

import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess(settings={
"FEEDS": {
"items.json": {"format": "json"},
},
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

在CrawlerProcess中定义字典中的设置。一定要检查 CrawlerProcess 了解其使用细节的文档。

如果您在一个零碎的项目中，有一些额外的帮助器可以用来导入项目中的那些组件。你可以自动输入蜘蛛的名字 CrawlerProcess 及使用 get_project_settings 得到一个 Settings 具有项目设置的实例。

下面是一个如何使用 testspiders 以项目为例。

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

还有另一个Scrapy实用程序，它提供了对爬行过程的更多控制： scrapy.crawler.CrawlerRunner . 这个类是一个薄包装器，它封装了一些简单的帮助器来运行多个爬行器，但是它不会以任何方式启动或干扰现有的反应器。

使用这个类，在调度spider之后应该显式地运行reactor。建议您使用 CrawlerRunner 而不是 CrawlerProcess 如果您的应用程序已经在使用Twisted，并且您希望在同一个反应器中运行Scrapy。

请注意，蜘蛛完成后，您还必须自己关闭扭曲的反应堆。这可以通过将回调添加到由 CrawlerRunner.crawl 方法。

下面是一个使用它的例子，以及在 MySpider 已完成运行。

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider(scrapy.Spider):
# Your spider definition
...
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

参见

Reactor Overview

在同一进程中运行多个spider¶

默认情况下，当您运行时，scrapy为每个进程运行一个spider scrapy crawl . 但是，Scrapy支持使用 internal API .

下面是一个同时运行多个蜘蛛的示例：

import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

使用相同的示例 CrawlerRunner ：

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

同样的例子，但是通过链接延迟来按顺序运行spider：

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished

参见

从脚本中运行Scrapy .

分布式爬行¶

Scrapy不提供任何以分布式（多服务器）方式运行爬虫的内置工具。但是，有一些分发爬行的方法，这取决于您计划如何分发爬行。

如果您有许多蜘蛛，那么分配负载的最明显的方法就是设置许多ScrapyD实例，并将蜘蛛运行分布在这些实例中。

如果您想在多台机器上运行一个（大）蜘蛛，通常需要对URL进行分区，以便爬行并将它们发送到每个单独的蜘蛛。下面是一个具体的例子：

首先，准备要爬网的URL列表并将其放入单独的文件/URL:：

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list

然后在3个不同的ScrapyD服务器上启动一个蜘蛛运行。蜘蛛会收到一个（蜘蛛）论点 part 使用要爬网的分区的编号：：

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3

避免被禁止¶

一些网站实施了某些措施，以防止僵尸爬行他们，不同程度的复杂度。绕开这些措施既困难又棘手，有时可能需要特殊的基础设施。请考虑联系 commercial support 如果有疑问。

以下是处理此类网站时要记住的一些提示：

将你的用户代理从浏览器中的一个著名的池中轮换出来（用google搜索以获得一个列表）。
禁用cookies（请参见 COOKIES_ENABLED ）因为有些网站可能会使用cookie来发现机器人行为
使用下载延迟（2或更高）。见 DOWNLOAD_DELAY 设置。
如果可能，使用 Google cache 获取页面，而不是直接访问站点
使用一个旋转的IP池。例如，自由 Tor project 或者像这样的付费服务 ProxyMesh . 开源替代方案是 scrapoxy ，可以将自己的代理附加到的超级代理。
使用一个在内部绕过BAN的高度分布式下载程序，这样您就可以专注于解析干净的页面。这种下载器的一个例子是 Crawlera

如果您仍然无法阻止您的bot被禁止，请考虑联系 commercial support .

从脚本中运行Scrapy¶

你可以使用 API 从脚本运行scrapy，而不是运行scrapy via的典型方式 scrapy crawl .

记住，scrappy构建在TwistedAsynchronicNetworkLibrary之上，所以需要在TwistedReactor中运行它。

下面是一个示例，演示如何使用它运行单个蜘蛛。

import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess(settings={
"FEEDS": {
"items.json": {"format": "json"},
},
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

在CrawlerProcess中定义字典中的设置。一定要检查 CrawlerProcess 了解其使用细节的文档。

下面是一个如何使用 testspiders 以项目为例。

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished

请注意，蜘蛛完成后，您还必须自己关闭扭曲的反应堆。这可以通过将回调添加到由 CrawlerRunner.crawl 方法。

下面是一个使用它的例子，以及在 MySpider 已完成运行。

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider(scrapy.Spider):
# Your spider definition
...
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

参见

Reactor Overview

在同一进程中运行多个spider¶

默认情况下，当您运行时，scrapy为每个进程运行一个spider scrapy crawl . 但是，Scrapy支持使用 internal API .

下面是一个同时运行多个蜘蛛的示例：

import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

使用相同的示例 CrawlerRunner ：

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

同样的例子，但是通过链接延迟来按顺序运行spider：

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
configure_logging()
runner = CrawlerRunner()
@defer.inlineCallbacks
def crawl():
yield runner.crawl(MySpider1)
yield runner.crawl(MySpider2)
reactor.stop()
crawl()
reactor.run() # the script will block here until the last crawl call is finished

参见

从脚本中运行Scrapy .

分布式爬行¶

Scrapy不提供任何以分布式（多服务器）方式运行爬虫的内置工具。但是，有一些分发爬行的方法，这取决于您计划如何分发爬行。

如果您有许多蜘蛛，那么分配负载的最明显的方法就是设置许多ScrapyD实例，并将蜘蛛运行分布在这些实例中。

如果您想在多台机器上运行一个（大）蜘蛛，通常需要对URL进行分区，以便爬行并将它们发送到每个单独的蜘蛛。下面是一个具体的例子：

首先，准备要爬网的URL列表并将其放入单独的文件/URL:：

http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list

然后在3个不同的ScrapyD服务器上启动一个蜘蛛运行。蜘蛛会收到一个（蜘蛛）论点 part 使用要爬网的分区的编号：：

curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3

避免被禁止¶

以下是处理此类网站时要记住的一些提示：

将你的用户代理从浏览器中的一个著名的池中轮换出来（用google搜索以获得一个列表）。
禁用cookies（请参见 COOKIES_ENABLED ）因为有些网站可能会使用cookie来发现机器人行为
使用下载延迟（2或更高）。见 DOWNLOAD_DELAY 设置。
如果可能，使用 Google cache 获取页面，而不是直接访问站点
使用一个旋转的IP池。例如，自由 Tor project 或者像这样的付费服务 ProxyMesh . 开源替代方案是 scrapoxy ，可以将自己的代理附加到的超级代理。
使用一个在内部绕过BAN的高度分布式下载程序，这样您就可以专注于解析干净的页面。这种下载器的一个例子是 Crawlera

如果您仍然无法阻止您的bot被禁止，请考虑联系 commercial support .

以上是Scrapy – 常用做法的全部内容。

THE END

二维码

C语言输入输出 -C语言预处理命令：#include和#define

< <上一篇

如何隐藏 Nginx 版本号和 Web 服务器名称

下一篇>>

搜索内容

Scrapy – 常用做法

从脚本中运行Scrapy¶

在同一进程中运行多个spider¶

分布式爬行¶

避免被禁止¶

从脚本中运行Scrapy¶

在同一进程中运行多个spider¶

分布式爬行¶

避免被禁止¶

目录

目录

推荐文章

最新文章