Preface
Scrapy is an excellent crawler framework built on the Twisted asynchronous programming framework, and its use of yield is elegant. Through its scheduler and downloader it can be extended programmatically, its plugin ecosystem is rich, and it integrates fairly easily with Selenium and Playwright.
Of course, Scrapy by itself can do nothing about the AJAX requests inside a page, but combined with mitmproxy it can handle almost anything: Scrapy + Playwright simulates user clicks, while mitmproxy captures the traffic in the background and extracts the data. Log in once, run for a whole day.
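As a rough illustration of that division of labor, a minimal mitmproxy addon that dumps matching AJAX responses could look like the sketch below; the URL filter and output file are hypothetical placeholders, not the script actually used here.
# save_ajax.py: minimal mitmproxy addon sketch, run with: mitmdump -s save_ajax.py
import json
from mitmproxy import http

class SaveAjax:
    def response(self, flow: http.HTTPFlow) -> None:
        # Keep only JSON responses whose URL looks like an API call (hypothetical filter)
        content_type = flow.response.headers.get("content-type", "")
        if "api" in flow.request.pretty_url and "json" in content_type:
            with open("captured.jsonl", "a", encoding="utf-8") as f:
                record = {"url": flow.request.pretty_url, "body": flow.response.get_text()}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

addons = [SaveAjax()]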
In the end I tied these tools together with asyncio and reached essentially stable, unattended, automated operation: article after article is fed into my Elasticsearch cluster and, after passing through the knowledge-factory pipeline, becomes a knowledge product.
"Crawlers plus data, algorithms plus intelligence": that is one technologist's ideal.
Configuration and running
Installation:
pip install scrapy
With a scrapy.cfg and a settings.py in the current directory, Scrapy is ready to run.
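For reference, a minimal scrapy.cfg only needs to point at the project's settings module; assuming the project package is named ispider (matching the settings shown later), it would look like this:
[settings]
default = ispider.settings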
Run from the command line:
scrapy crawl ArticleSpider
There are three ways to run it from a program:
# Using scrapy.cmdline.execute: equivalent to invoking the scrapy command line from Python
from scrapy.cmdline import execute
execute('scrapy crawl ArticleSpider'.split())
Using CrawlerRunner:
# Using CrawlerRunner: install the asyncio reactor before importing twisted.internet.reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.reactor import install_reactor

install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
from twisted.internet import reactor

runner = CrawlerRunner(get_project_settings())
d = runner.crawl(ArticleSpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes
reactor.run()
Using CrawlerProcess:
# Using CrawlerProcess: it manages the reactor itself, so no manual reactor handling is needed
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(ArticleSpider)
process.start()  # blocks until the crawl finishes
Integration with Playwright
A major benefit of Playwright is automated data collection with a headless browser. A headless browser is a special web browser that exposes an API for automation. Installing the asyncio reactor lets Scrapy work with asyncio-based libraries, which is what drives the headless browser.
import scrapy
from playwright.async_api import async_playwright

class PlaywrightSpider(scrapy.Spider):
    name = "playwright"
    start_urls = ["data:,"]  # avoid using the default Scrapy downloader

    async def parse(self, response):
        async with async_playwright() as pw:
            browser = await pw.chromium.launch()
            page = await browser.new_page()
            await page.goto("https://example.org")
            title = await page.title()
            await browser.close()
            return {"title": title}
Using playwright-python directly as in the example above bypasses most Scrapy components (middlewares, dupefilter, and so on). For a proper integration, the scrapy-playwright plugin is recommended.
Installation
pip install scrapy-playwright
playwright install                     # install all supported browsers
playwright install firefox chromium    # or install only the browsers you need
settings.py configuration
BOT_NAME = 'ispider'
SPIDER_MODULES = ['ispider.spider']

# Required by scrapy-playwright: the asyncio reactor and its download handlers
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
DOWNLOAD_HANDLERS = {
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

CONCURRENT_REQUESTS = 32
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4   # limit concurrent pages per browser context
CLOSESPIDER_ITEMCOUNT = 100            # stop the spider after this many scraped items

# Attach to an already-running browser over the Chrome DevTools Protocol
PLAYWRIGHT_CDP_URL = "http://localhost:9900"
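PLAYWRIGHT_CDP_URL attaches to a browser that is already running with remote debugging enabled (for Chrome/Chromium, a browser started with --remote-debugging-port=9900). If you prefer to let scrapy-playwright launch its own browser, drop that line and use launch options along these lines; the values below are illustrative, not taken from the original setup:
# Alternative: let scrapy-playwright launch and manage its own browser
PLAYWRIGHT_BROWSER_TYPE = "chromium"                # "chromium", "firefox" or "webkit"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "timeout": 20 * 1000,                           # browser launch timeout, in milliseconds
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000   # per-navigation timeout, in milliseconds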
Spider definition
import asyncio
import logging
from typing import Generator, Optional

from scrapy import Request, Spider
from scrapy.http import Response

logger = logging.getLogger(__name__)

class ArticleSpider(Spider):
    name = "ArticleSpider"
    custom_settings = {
        # "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        # "DOWNLOAD_HANDLERS": {
        #     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        #     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        # },
        # "CONCURRENT_REQUESTS": 32,
        # "PLAYWRIGHT_MAX_PAGES_PER_CONTEXT": 4,
        # "CLOSESPIDER_ITEMCOUNT": 100,
    }
    start_urls = ["https://blog.csdn.net/nav/lang/javascript"]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        logger.debug('ArticleSpider initialized.')

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_context": "first",
                    "playwright_include_page": True,
                    "playwright_page_goto_kwargs": {
                        "wait_until": "domcontentloaded",
                    },
                },
            )

    async def parse(self, response: Response, current_page: Optional[int] = None) -> Generator:
        content = response.text
        page = response.meta["playwright_page"]
        context = page.context
        title = await page.title()
        while True:
            # Scroll down so the page keeps loading fresh data
            await page.mouse.wheel(delta_x=0, delta_y=200)
            await asyncio.sleep(3)
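As written, the parse loop never terminates and never yields anything. A bounded variant is sketched below under a few assumptions: it scrolls a fixed number of times, re-reads the rendered HTML, closes the Playwright page (you are responsible for closing it when playwright_include_page is True), and yields items. The CSS selectors and scroll count are guesses for illustration, not values from the original spider.
    # Sketch only: bounded scrolling, extraction, and page cleanup
    # (requires: from scrapy.selector import Selector, in addition to the imports above)
    async def parse(self, response: Response, current_page: Optional[int] = None) -> Generator:
        page = response.meta["playwright_page"]
        try:
            for _ in range(10):                          # scroll a fixed number of times
                await page.mouse.wheel(delta_x=0, delta_y=2000)
                await asyncio.sleep(1)                   # give the page time to load new items
            html = await page.content()                  # re-read the DOM after scrolling
            for article in Selector(text=html).css("div.article-item-box"):
                yield {
                    "title": article.css("h4 a::text").get(),
                    "url": article.css("h4 a::attr(href)").get(),
                }
        finally:
            await page.close()                           # close the page explicitly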
References
- Scrapy documentation
- The official scrapy-playwright plugin
- GerapyPlaywright, a plugin written by Cui Qingcai (Germey)