Sharing AntNest, an asyncio-based crawler micro-framework. Forks welcome!

    WuMingyu · 2018-07-06 09:11:31 +08:00 · 2442 views

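    I worked on crawlers for a while, and since Python's asyncio was still fairly new, I looked around for a framework built on it but did not find one I liked. So I built this wheel myself. Enough talk; here is an example: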

    from ant_nest import *
    from yarl import URL
    
    
    class GithubAnt(Ant):
        """Crawl trending repositories from github"""
        item_pipelines = [
            ItemFieldReplacePipeline(
                ('meta_content', 'star', 'fork'),
                excess_chars=('\r', '\n', '\t', '  '))
        ]
        concurrent_limit = 1  # save the website's and your bandwidth!
    
        def __init__(self):
            super().__init__()
            self.item_extractor = ItemExtractor(dict)
            self.item_extractor.add_pattern(
                'xpath', 'title', '//h1/strong/a/text()')
            self.item_extractor.add_pattern(
                'xpath', 'author', '//h1/span/a/text()', default='Not found')
            self.item_extractor.add_pattern(
                'xpath', 'meta_content',
                '//div[@class="repository-meta-content col-11 mb-1"]//text()',
                extract_type=ItemExtractor.EXTRACT_WITH_JOIN_ALL)
            self.item_extractor.add_pattern(
                'xpath',
                'star', '//a[@class="social-count js-social-count"]/text()')
            self.item_extractor.add_pattern(
                'xpath', 'fork', '//a[@class="social-count"]/text()')
    
        async def crawl_repo(self, url):
            """Crawl information from one repo"""
            response = await self.request(url)
            # extract item from response
            item = self.item_extractor.extract(response)
            item['origin_url'] = response.url
    
            await self.collect(item)  # send the item through the pipelines (to be cleaned)
            self.logger.info('*' * 70 + 'I got one hot repo!\n' + str(item))
    
        async def run(self):
            """App entrance, our play ground"""
            response = await self.request('https://github.com/explore')
            for url in response.html_element.xpath(
                    '/html/body/div[4]/div[2]/div/div[2]/div[1]/article//h1/a[2]/'
                    '@href'):
                # crawl many repos with our coroutines pool
                self.schedule_coroutine(
                    self.crawl_repo(response.url.join(URL(url))))
            self.logger.info('Waiting...')
    

    Output:

    $ ant_nest -a ants.example2.GithubAnt
    INFO:GithubAnt:Opening
    INFO:GithubAnt:Waiting...
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'NLP-progress', 'author': 'sebastianruder', 'meta_content': 'Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.', 'star': '3,743', 'fork': '327', 'origin_url': URL('https://github.com/sebastianruder/NLP-progress')}
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'material-dashboard', 'author': 'creativetimofficial', 'meta_content': 'Material Dashboard - Open Source Bootstrap 4 Material Design Adminhttps://demos.creative-tim.com/materi …', 'star': '6,032', 'fork': '187', 'origin_url': URL('https://github.com/creativetimofficial/material-dashboard')}
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'mkcert', 'author': 'FiloSottile', 'meta_content': "A simple zero-config tool to make locally-trusted development certificates with any names you'd like.", 'star': '2,311', 'fork': '60', 'origin_url': URL('https://github.com/FiloSottile/mkcert')}
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'pure-bash-bible', 'author': 'dylanaraps', 'meta_content': '📖 A collection of pure bash alternatives to external processes.', 'star': '6,385', 'fork': '210', 'origin_url': URL('https://github.com/dylanaraps/pure-bash-bible')}
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'flutter', 'author': 'flutter', 'meta_content': 'Flutter makes it easy and fast to build beautiful mobile apps.https://flutter.io', 'star': '30,579', 'fork': '1,337', 'origin_url': URL('https://github.com/flutter/flutter')}
    INFO:GithubAnt:**********************************************************************I got one hot repo!
    {'title': 'Java-Interview', 'author': 'crossoverJie', 'meta_content': '👨\u200d🎓 Java related : basic, concurrent, algorithm https://crossoverjie.top/categories/J …', 'star': '4,687', 'fork': '409', 'origin_url': URL('https://github.com/crossoverJie/Java-Interview')}
    INFO:GithubAnt:Closed
    INFO:GithubAnt:Get 7 Request in total
    INFO:GithubAnt:Get 7 Response in total
    INFO:GithubAnt:Get 6 dict in total
    INFO:GithubAnt:Run GithubAnt in 18.157656 seconds
    

    There are a few interesting concepts here: Item, Pipeline and ItemExtractor.

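    An Item is the container that holds crawled data. It can take many forms: any object that supports getting and setting by item or by attribute, such as a dict, a plain class, an attrs class, a dataclass, or even an ORM class instance.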
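
To illustrate the "any object that supports item or attribute access" idea, here is a minimal sketch. `set_value`, `get_value` and the `Repo` dataclass are hypothetical names made up for this example; they are not AntNest's API, only one way such duck typing could work.

```python
from dataclasses import dataclass
from typing import Any


def set_value(item: Any, key: str, value: Any) -> None:
    """Store value on item: by key for mappings, by attribute otherwise."""
    if hasattr(item, "__setitem__"):
        item[key] = value
    else:
        setattr(item, key, value)


def get_value(item: Any, key: str) -> Any:
    """Read a value back the same way it was stored."""
    if hasattr(item, "__getitem__"):
        return item[key]
    return getattr(item, key)


@dataclass
class Repo:
    # Hypothetical item class used only for this illustration.
    title: str = ""
    star: str = ""


dict_item: dict = {}
class_item = Repo()
for item in (dict_item, class_item):
    # The same calls fill both a dict and a dataclass instance.
    set_value(item, "title", "mkcert")
    set_value(item, "star", "2,311")
```

With helpers like these, the extraction code does not need to know which kind of container it is filling.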

    A Pipeline is similar to scrapy's Pipeline: every Request, Response and Item can pass through a series of Pipelines.
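
The pipeline idea can be sketched roughly as follows. This is a simplified stand-in, not AntNest's implementation: `strip_fields`, `drop_without_title` and `run_pipelines` are hypothetical names, and the convention that returning `None` drops an object is an assumption made for the example.

```python
from typing import Any, Callable, Iterable, Optional

# Assumption: a pipeline is any callable that returns the (possibly modified)
# object, or None to drop it.
Pipeline = Callable[[Any], Optional[Any]]


def strip_fields(fields: Iterable[str], excess_chars: Iterable[str]) -> Pipeline:
    """Build a pipeline that deletes unwanted substrings from the given fields."""
    fields = tuple(fields)
    excess_chars = tuple(excess_chars)

    def process(item: dict) -> dict:
        for field in fields:
            value = item.get(field)
            if isinstance(value, str):
                for chars in excess_chars:
                    value = value.replace(chars, "")
                item[field] = value
        return item

    return process


def drop_without_title(item: dict) -> Optional[dict]:
    """Returning None drops the item, so later pipelines never see it."""
    return item if item.get("title") else None


def run_pipelines(obj: Any, pipelines: Iterable[Pipeline]) -> Optional[Any]:
    """Pass obj through each pipeline in order, stopping once it is dropped."""
    for pipeline in pipelines:
        obj = pipeline(obj)
        if obj is None:
            return None
    return obj


pipelines = [
    strip_fields(("star", "fork"), ("\r", "\n", "\t", "  ")),
    drop_without_title,
]
cleaned = run_pipelines(
    {"title": "flutter", "star": "\n  30,579\n", "fork": "\t1,337"}, pipelines)
dropped = run_pipelines({"title": "", "star": "1"}, pipelines)
```

Here `cleaned` keeps its fields with the whitespace removed, while `dropped` is `None` because the second pipeline rejected it.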

    An ItemExtractor declares how to parse an Item out of the source data. It supports xpath, jpath and regex.
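
The declarative style can be sketched with the standard library alone. The `RegexExtractor` class below is a hypothetical, regex-only stand-in written for this illustration; it does not reproduce AntNest's `ItemExtractor` and its signatures are assumptions.

```python
import re
from typing import Any, Callable, Dict, Optional, Tuple


class RegexExtractor:
    """Hypothetical, regex-only stand-in for a declarative item extractor."""

    def __init__(self, item_factory: Callable[[], Any] = dict) -> None:
        self.item_factory = item_factory
        self.patterns: Dict[str, Tuple[re.Pattern, Optional[str]]] = {}

    def add_pattern(self, key: str, pattern: str,
                    default: Optional[str] = None) -> None:
        """Declare that field `key` is the first capture group of `pattern`."""
        self.patterns[key] = (re.compile(pattern), default)

    def extract(self, text: str) -> Any:
        """Apply every declared pattern to text and fill a fresh item."""
        item = self.item_factory()
        for key, (regex, default) in self.patterns.items():
            match = regex.search(text)
            item[key] = match.group(1) if match else default
        return item


# Made-up HTML snippet, used only to exercise the sketch.
html = '<h1><strong><a href="/FiloSottile/mkcert">mkcert</a></strong></h1>'
extractor = RegexExtractor(dict)
extractor.add_pattern("title", r"<strong><a[^>]*>([^<]+)</a>")
extractor.add_pattern(
    "author", r'<span class="author">([^<]+)</span>', default="Not found")
item = extractor.extract(html)
```

The patterns are declared once up front and `extract` can then be reused for every response, which is the same shape as the `add_pattern` calls in the example above.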

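    Other parts, such as the coroutine workflow and the HTTP wrapper, are left out for brevity. The core of the framework is only about 600 lines of code and the ideas are simple, so if anything is unclear you can read the source directly. If you have crawling needs, give it a try and see whether it speeds up your coding. Suggestions are welcome too. Here is the repository: https://github.com/strongbugman/ant_nest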
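
For readers curious about how a concurrency limit on coroutines might work in general, here is a minimal sketch using `asyncio.Semaphore`. It is not taken from AntNest's source: `run_limited` and `fake_fetch` are hypothetical names, and the request is simulated with `asyncio.sleep`.

```python
import asyncio
from typing import Awaitable, List


async def run_limited(coros: List[Awaitable], limit: int) -> list:
    """Run all coroutines with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro: Awaitable):
        async with semaphore:
            return await coro

    # gather returns results in the same order as its arguments.
    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_fetch(url: str, stats: dict) -> str:
    """Stand-in for an HTTP request; tracks the peak number of concurrent calls."""
    stats["running"] += 1
    stats["peak"] = max(stats["peak"], stats["running"])
    await asyncio.sleep(0.01)
    stats["running"] -= 1
    return url


stats = {"running": 0, "peak": 0}
urls = [f"https://example.com/{i}" for i in range(5)]
results = asyncio.run(
    run_limited([fake_fetch(u, stats) for u in urls], limit=2))
```

All five coroutines are scheduled immediately, but the semaphore ensures that no more than two are inside `fake_fetch` at the same time.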

    3 replies    2018-07-08 23:06:26 +08:00
    yuanfnadi · #1 · 2018-07-06 23:25:10 +08:00
    I think the hard parts of crawling are IP-based anti-scraping and distributed scheduling. Do you have any wheels to recommend for those?
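    WuMingyu (OP) · #2 · 2018-07-08 12:02:10 +08:00
    @yuanfnadi If the point of going distributed is to make use of multiple IPs, you might as well deploy a distributed proxy pool and keep the crawler itself running as a single process.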
    yuanfnadi · #3 · 2018-07-08 23:06:26 +08:00 via iPhone
    @WuMingyu By distributed I mean multiple crawlers sharing progress and tasks, so that no work is repeated.
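
The "shared progress, no repeated tasks" idea from this reply can be sketched as a deduplicating task queue. The `SharedFrontier` class is hypothetical and keeps everything in memory; in a real multi-process setup the seen-set and the queue would presumably live in an external store that all workers can reach, which is an assumption and not something the thread specifies.

```python
from collections import deque
from typing import Deque, Optional, Set


class SharedFrontier:
    """In-memory stand-in for a task queue shared by several crawler workers."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._queue: Deque[str] = deque()

    def add(self, url: str) -> bool:
        """Enqueue url unless some worker already submitted it."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next(self) -> Optional[str]:
        """Hand out the next pending url, or None when nothing is left."""
        return self._queue.popleft() if self._queue else None


frontier = SharedFrontier()
# Two workers discover "/a"; only the first submission is accepted.
added = [frontier.add(u) for u in ("/a", "/b", "/a")]
handed_out = [frontier.next(), frontier.next(), frontier.next()]
```

Each URL is handed out exactly once, no matter how many workers report it.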