Scrapy CrawlSpider: the callback defined in rules is never called

  •   gsz2015 · 2020-03-13 13:14:22 +08:00 · 1865 views
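    • With a Scrapy CrawlSpider, I defined a callback in rules, but the spider never enters that callback function, parse_item
    • Renaming parse_item to parse does enter the callback (parse is Scrapy's default callback, and Scrapy advises against overriding it)
    • Scrapy shell prints the response for these pages without problems
    • Why does the spider never reach the parse_item callback I defined?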
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class CrspiderSpider(CrawlSpider):
        name = 'crSpider'
        allowed_domains = ['china-railway.com.cn']
        start_urls = ['http://www.china-railway.com.cn/xwzx/ywsl/']
    
        rules = (
            Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/'), follow=True),
            Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html'), follow=True),
            Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html'), callback='parse_item')
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            print('-' * 40, 'entered callback', '-' * 40)
            newsName = response.xpath('//h1').get()
            print(newsName)
           
    
        # def parse(self, response):
        #     item = {}
        #     print('-' * 40, 'entered parse callback', '-' * 40)
        #     print(response.text)
        #     newsName = response.xpath('//h1').get()
        #     return item
    
    
    • Here is part of the output: pages matching the rules are fetched, but nothing from the parse_item callback is printed
    2020-03-13 12:38:25 [scrapy.core.engine] INFO: Spider opened
    2020-03-13 12:38:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2020-03-13 12:38:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
    2020-03-13 12:38:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: None)
    2020-03-13 12:38:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:26 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.china-railway.com.cn/xwzx/ywsl/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
    2020-03-13 12:38:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200304_101019.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101067.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101100.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200306_101120.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200307_101174.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200310_101326.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    2020-03-13 12:38:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200311_101362.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
    
    kasper4649 · #1 · 2020-03-13 13:24:35 +08:00 via Android
    Try adding a comma after the third rule as well?
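On the comma suggestion: in a tuple with more than one element, a trailing comma makes no difference, so it could not have been the cause here. A minimal sketch using placeholder strings in place of the Rule objects (the names below are illustrative, not from the original spider):

```python
# Placeholder strings stand in for the three Rule(...) objects.
rules_without_comma = ('rule1', 'rule2', 'rule3')
rules_with_comma = ('rule1', 'rule2', 'rule3',)

# Both spellings build the same 3-element tuple.
same = rules_without_comma == rules_with_comma

# The trailing comma only matters when there is a single element:
single_without = ('rule1')   # just a parenthesized string
single_with = ('rule1',)     # a 1-element tuple
```

The comma would only matter if rules contained a single Rule, where omitting it leaves a bare Rule object instead of a tuple.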
    gsz2015 (OP) · #2 · 2020-03-13 13:30:51 +08:00
    @kasper4649 I've tried it both with and without the comma 😂. Could this be a Scrapy 2.0 problem?
    IanPeverell · #3 · 2020-03-13 16:12:40 +08:00
    Try removing the single quotes. You should be passing the function, not a string.
    IanPeverell · #4 · 2020-03-13 16:26:53 +08:00
    @IanPeverell Oh, a string works too (covers face and runs away)
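The Scrapy documentation states that a Rule's callback may be either a callable or a string naming a method on the spider. The sketch below illustrates that idea with plain Python; DemoSpider and resolve_callback are hypothetical names for illustration and are not Scrapy's actual internals.

```python
class DemoSpider:
    """Hypothetical stand-in for a spider; not a real Scrapy class."""

    def parse_item(self, response):
        return f'parsed {response}'


def resolve_callback(spider, callback):
    """Return a callable for callback, given a callable or a method name.

    Illustrative helper only: it mimics the documented behaviour that a
    string callback is looked up as a method on the spider object.
    """
    if callable(callback):
        return callback
    return getattr(spider, callback, None)


spider = DemoSpider()
from_string = resolve_callback(spider, 'parse_item')
from_method = resolve_callback(spider, spider.parse_item)
```

Both forms end up calling the same method, so the quotes around 'parse_item' in the original spider are not the problem.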
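    gsz2015 (OP) · #5 · 2020-03-13 16:35:12 +08:00
    @IanPeverell Just solved it. It was a regex problem: the first rule's regex also matches the URLs that the third rule's regex is meant for, so the callback was never called 😂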
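The fix described above follows from how CrawlSpider handles overlapping rules: when several rules match the same link, the first one in definition order is used. The sketch below reproduces this with the three patterns from the spider and a sample URL from the log, using only the re module. first_matching_rule is a hypothetical helper that simplifies the real behaviour, and reordering the rules is only one possible fix (anchoring the first pattern with `$` would be another).

```python
import re

# Patterns copied from the spider's three rules, in their original order.
# The boolean records whether the rule has a callback (only the third does).
RULES = [
    (r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/', False),
    (r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html', False),
    (r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html', True),
]


def first_matching_rule(url, rules):
    """Return the index of the first rule whose pattern is found in url.

    Simplified model (an assumption for illustration): the real spider
    extracts links rule by rule and skips links an earlier rule already took.
    """
    for index, (pattern, _has_callback) in enumerate(rules):
        if re.search(pattern, url):
            return index
    return None


# An item page URL taken from the log output in the question.
item_url = ('http://www.china-railway.com.cn/xwzx/ywsl/202003/'
            't20200304_101019.html')

# Original order: the first, callback-less rule claims the item page.
original = first_matching_rule(item_url, RULES)

# One possible fix: put the most specific pattern first.
reordered = [RULES[2], RULES[1], RULES[0]]
fixed = first_matching_rule(item_url, reordered)
has_callback = reordered[fixed][1]
```

With the original order the item URL is claimed by rule 0, which has no callback, so the page is crawled but parse_item never runs. After reordering, the same URL is claimed by the rule that carries the callback.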