https://github.com/intohole/xspider is yet another reinvented wheel! But let's get familiar with it together.

xspider: a simple Python crawling framework

xspider

Single-threaded crawling
Simple API
XPath/CSS/JSON extractors
Several queue types
Clear architecture and code logic, so you can follow how a spider crawls
Easy to crawl and extract web pages
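
To make the "single-threaded crawling" and "queue" points concrete, here is a minimal sketch of such a crawl loop. It is not xspider's actual code: `FAKE_PAGES`, `fetch` and `crawl` are hypothetical names, and the "web" is an in-memory dict so the sketch runs without network access.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch needs no network access.
FAKE_PAGES = {
    "http://example.test/": '<a href="http://example.test/a.html">a</a> '
                            '<a href="http://example.test/b.html">b</a>',
    "http://example.test/a.html": '<a href="http://example.test/b.html">b</a>',
    "http://example.test/b.html": "no links here",
}

LINK_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Stand-in for an HTTP download; returns the page text or None."""
    return FAKE_PAGES.get(url)


def crawl(start_urls):
    """Single-threaded crawl: pop a URL, fetch it, enqueue unseen links."""
    queue = deque(start_urls)  # FIFO queue, so pages are visited breadth-first
    seen = set(start_urls)     # URLs already queued, to avoid revisiting
    visited = []
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        visited.append(url)
        for link in LINK_RE.findall(body):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Swapping the `deque` for a stack or a priority queue changes the visiting order without touching the rest of the loop, which is presumably why a framework would offer several queue types.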

main.py:

    from xspider.spider.spider import BaseSpider
    from xspider.filters import urlfilter
    from kuailiyu import KuaiLiYu

    if __name__ == "__main__":
        # Crawl only kuailiyu.cyzone.cn, starting from its home page.
        spider = BaseSpider(name="kuailiyu", page_processor=KuaiLiYu(),
                            allow_site=["kuailiyu.cyzone.cn"],
                            start_urls=["http://kuailiyu.cyzone.cn/"])
        # Follow only article pages and paginated index pages.
        patterns = [r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
                    r"kuailiyu.cyzone.cn/index_[0-9]+\.html$"]
        spider.url_filters.append(urlfilter.UrlRegxFilter(patterns))
        spider.start()
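
To see which URLs the two patterns in main.py would let through, here is a small standalone sketch using only the standard `re` module. The patterns are written as raw strings with the dot before `html` escaped in both. How `UrlRegxFilter` actually applies its patterns is not shown in the post, so this assumes a URL passes if any pattern is found in it with `re.search`; `url_allowed` is a hypothetical helper, not part of xspider.

```python
import re

# The two patterns from main.py, compiled once up front.
PATTERNS = [
    r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
    r"kuailiyu.cyzone.cn/index_[0-9]+\.html$",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def url_allowed(url):
    """Assumed semantics: accept the URL if any pattern matches somewhere in it."""
    return any(p.search(url) for p in COMPILED)


for candidate in [
    "http://kuailiyu.cyzone.cn/article/12345.html",
    "http://kuailiyu.cyzone.cn/index_2.html",
    "http://kuailiyu.cyzone.cn/about.html",
]:
    print(candidate, url_allowed(candidate))
```

Because both patterns end in `$`, a URL with a trailing query string would be rejected under this assumption.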

kuailiyu.py:

    from xspider import processor
    from xspider.selector import xpath_selector
    from xspider import model

    class KuaiLiYu(processor.PageProcessor.PageProcessor):
        """Extracts the title and URL from each crawled page."""

        def __init__(self):
            super(KuaiLiYu, self).__init__()
            # XPath selector that picks the text of the <title> element.
            self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

        def process(self, page, spider):
            # Collect the extracted fields for this page.
            items = model.fileds.Fileds()
            items["title"] = self.title_extractor.find(page)
            items["url"] = page.url
            return items
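
For anyone who wants to try the extraction step without installing xspider, here is a rough standard-library equivalent of what the processor does: pull the `<title>` text and pair it with the URL. It uses `html.parser` instead of XPath, and `TitleParser` and `process` are hypothetical names. Returning the title as a list mirrors what an XPath `text()` query usually yields, which is an assumption about `XpathSelector.find`, not something the post confirms.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text content of every <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data


def process(url, html):
    """Return a dict of extracted fields, like the processor's items."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.titles, "url": url}
```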

The crawling part consists of the following project code

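Addendum 1 · 2017-11-28 10:43:01 +08:00

Bumping this again. I'd like to spend some time on this project and turn it into a crawler framework with built-in crawl strategies.

10 replies • 2017-12-01 12:56:34 +08:00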

1

xiaozizayang

2017-11-23 16:25:48 +08:00

Chiming in with https://github.com/howie6879/talonspider

2

tamlok

2017-11-23 16:49:58 +08:00 via Android

Chiming in with https://github.com/tamlok/vnote

3

intohole

OP

2017-11-23 19:52:24 +08:00

@xiaozizayang I'll study it.

4

intohole

OP

2017-11-23 19:52:42 +08:00

@tamlok Awesome~

5

j1wu

2017-11-23 20:00:21 +08:00

Chiming in with a JavaScript version, learning from everyone here Orz https://github.com/j1wu/cli-scraper

6

zhangysh1995

2017-11-23 21:39:56 +08:00

I happen to be learning web scraping right now, so I'm bookmarking this. Keep it up, OP!

7

intohole

OP

2017-11-24 10:16:11 +08:00

@j1wu Pretty awesome.

8

intohole

OP

2017-11-24 15:08:22 +08:00

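@zhangysh1995 The API inside hasn't been tidied up yet. This crawler was developed specifically for the case where machines are scarce, trading time for hardware.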

9

coolloves

2017-12-01 11:12:16 +08:00

Bookmarked, will study.

10

intohole

OP

2017-12-01 12:56:34 +08:00

@coolloves Thanks for following.
