Installation

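1. Install Twisted before installing Scrapy

Scrapy is built on Twisted, so download the Twisted whl package and install it first.
Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

Pick the file that matches your Python version and platform.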

2. Then install Scrapy

Scrapy whl package page: http://www.lfd.uci.edu/~gohlke/pythonlibs/

Search the page for scrapy.

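3. Install from the command line

# Install Twisted (use the name of the file you downloaded)
pip install xxxxxxx.whl
# Install Scrapy; the file name depends on what you downloaded
pip install Scrapy.....whl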
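
To confirm that both packages ended up in the active Python environment, a quick check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "twisted" and "scrapy" are the import names of the two packages
    for name in ("twisted", "scrapy"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either line reports "missing", pip most likely installed into a different interpreter than the one running the check.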

Usage

1. Create a project

scrapy startproject scrapy_1 # scrapy_1 is the project name; choose your own

2. Enter the spiders folder

cd F:\others\Desktop\crawler\scrapy_1\scrapy_1\spiders

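Project files

items.py: defines the crawler's data model (the item fields)

middlewares.py: defines the spider and downloader middlewares

pipelines.py: the pipeline file, which processes the items the spider returns

settings.py: the crawler's settings; where a setting takes an order value (pipelines and middlewares, for example), a smaller value means a higher priority

scrapy.cfg: Scrapy's basic project configuration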
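
To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline. The class name CleanContentPipeline and the assumption that items carry a content field holding a list of text fragments are illustrative, not taken from the generated project. Scrapy calls process_item once for every item the spider yields.

```python
class CleanContentPipeline:
    """Hypothetical pipeline: joins text fragments into one cleaned string."""

    def process_item(self, item, spider):
        fragments = item.get("content") or []
        if isinstance(fragments, str):
            fragments = [fragments]
        # Drop surrounding whitespace and skip fragments that are empty
        cleaned = [part.strip() for part in fragments if part.strip()]
        item["content"] = "\n".join(cleaned)
        # Returning the item hands it on to the next pipeline, if any
        return item
```

A pipeline only runs once it is listed in ITEM_PIPELINES in settings.py, for example under the key "scrapy_1.pipelines.CleanContentPipeline" with an order value such as 300.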

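Note that while learning, it helps to understand a few options in settings.py, because they affect how the crawler behaves and how efficiently it runs:

ROBOTSTXT_OBEY = True

This line controls whether the crawler obeys the site's robots.txt rules. For learning exercises, change it to False.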
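
Besides ROBOTSTXT_OBEY, a few other settings commonly influence crawl speed and item handling. The fragment below is a sketch of what settings.py might contain; the numeric values are illustrative assumptions, and Scrapy1Pipeline is assumed to be the class name in the project's pipelines.py.

```python
# settings.py (fragment): values are illustrative assumptions

# Do not consult robots.txt before requesting pages
ROBOTSTXT_OBEY = False

# Seconds to wait between requests to the same site
DOWNLOAD_DELAY = 1

# Upper bound on requests Scrapy performs concurrently
CONCURRENT_REQUESTS = 8

# Enabled pipelines and their order: lower numbers run first
ITEM_PIPELINES = {
    "scrapy_1.pipelines.Scrapy1Pipeline": 300,
}
```

Raising CONCURRENT_REQUESTS or lowering DOWNLOAD_DELAY makes a crawl faster but puts more load on the target site.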

3. Create the spider file

scrapy genspider <spider name> <domain>

scrapy genspider novel_spider aixiawx.com

This generates the spider file:

import scrapy


class NovelSpiderSpider(scrapy.Spider):
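    name = 'novel_spider'                  # name: identifies the spider; must be unique within the project
    allowed_domains = ['aixiawx.com']      # allowed_domains: domains the spider may crawl; if unset, any domain is allowed
    start_urls = ['http://aixiawx.com/']   # start_urls: where crawling starts; may hold several URLs

    def parse(self, response):
        # parse: callback that handles the response and returns the extracted data and any URLs to follow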
        pass

4. items.py: define the item fields

import scrapy


class Scrapy1Item(scrapy.Item):
    title = scrapy.Field()    # page title
    content = scrapy.Field()  # text fragments taken from the content div

5. Write the spider file novel_spider.py

import scrapy
from scrapy_1.items import Scrapy1Item


class NovelSpiderSpider(scrapy.Spider):
    name = 'novel_spider'
    allowed_domains = ['aixiawx.com']
    start_urls = ['http://www.aixiawx.com/15/15576/9642715.html']

    def parse(self, response):
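        item = Scrapy1Item()
        # Page title: the first text node of <title>
        item['title'] = response.xpath("/html/head/title/text()").extract()[0]
        # Body text: every text node directly inside <div id="content">, as a list
        item['content'] = response.xpath('//div[@id="content"]/text()').extract()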

        yield item

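6. Run the spider

scrapy crawl <spider name>

-o saves the scraped data to the given file (if -t is omitted, the format is inferred from the file extension)
-t sets the output format; supported formats include
xml,
json,
csv,
pickle,    Python's binary serialization format
marshal    the binary serialization format of Python's marshal module

Example: scrapy crawl novel_spider -o 2.csv -t csv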
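
The export target can also be declared in settings.py instead of on the command line. The fragment below is a sketch that assumes a Scrapy version that supports the FEEDS setting (2.1 or later); the file name 2.csv simply mirrors the example above.

```python
# settings.py (fragment): assumes a Scrapy release that supports FEEDS
FEEDS = {
    # Output file mapped to its export options
    "2.csv": {
        "format": "csv",
        "encoding": "utf8",
    },
}
```

With this in place, running scrapy crawl novel_spider without -o should write the items to 2.csv.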

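Introduction

Scrapy ships with a built-in Selector, so no third-party package needs to be imported.

Selector has four basic methods

xpath( )
- returns a SelectorList of all nodes matching the XPath expression, which is how we narrow down to the elements we want.
css( )
- returns a SelectorList of all nodes matching the CSS expression
- uses CSS selector syntax, the same as in BeautifulSoup4
re( )
- applies a regular expression
- extracts data and returns a list of Unicode strings.
extract( )
- serializes the matched nodes to Unicode strings and returns them as a list.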
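
The four methods can be tried outside a spider by building a Selector from a string. The following is a minimal sketch; the HTML snippet and the function name selector_demo are invented for illustration, and the import sits inside the function so the file still loads where Scrapy is not installed.

```python
HTML = """
<html><body>
  <div id="content">
    <a href="/ads/1.html">Chapter 1</a>
    <a href="/promote.html">Chapter 2</a>
  </div>
</body></html>
"""


def selector_demo(html):
    # Requires Scrapy; imported here so this file loads without it
    from scrapy.selector import Selector

    sel = Selector(text=html)
    return {
        # xpath() + extract(): text of every <a> as a list of strings
        "titles": sel.xpath("//a/text()").extract(),
        # css() + extract(): href attribute of every <a>
        "hrefs": sel.css("a::attr(href)").extract(),
        # re(): apply a regex to each match and return the captured groups
        "numbers": sel.xpath("//a/text()").re(r"Chapter (\d+)"),
    }
```

Inside a spider's parse method the same calls are available directly on response, as in response.xpath(...).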

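XPath selectors in Scrapy

/ absolute path, starting from the root node
- /html/body/form/input
- # finds all input nodes under html > body > form
// relative path: matches nodes anywhere in the document, at any depth
- //input
- # finds all input nodes
The wildcard * selects nodes whose name is unknown
//form/*
- # finds all child nodes of form nodes
//*
- # finds all nodes
//*/input
- # finds all input nodes that have a parent element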
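
The path expressions above can be tried without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs paths written relative to the root element, so "/html/..." becomes "./..." and "//" becomes ".//". The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <body>
    <form>
      <input name="user" value="1"/>
      <input value="2"/>
      <button>Go</button>
    </form>
    <div><input name="outside"/></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/body/form/input: walk step by step down from the root
direct = root.findall("./body/form/input")

# //input: every input at any depth
everywhere = root.findall(".//input")

# //form/*: all element children of any form
form_children = root.findall(".//form/*")

print(len(direct), len(everywhere), [e.tag for e in form_children])
```

The absolute path finds only the two inputs inside the form, while //input also picks up the one nested in the div.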

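@ is the attribute operator

//input[@name]
- # selects all input nodes that have a name attribute
//input[@*]
- # selects all input nodes that have at least one attribute
//input[@value='2']
- # selects all input nodes whose value is 2
//div[@id]
- # selects all div elements that have an id attribute
//div[@class='eng']
- # selects all div elements whose class attribute equals eng
Built-in functions make matching more precise
//a[contains(@href,'promote.html')]
- # selects all a nodes whose href contains "promote.html"
//a[text()='应用推广']
- # selects all a nodes whose text is "应用推广" ("App promotion")
//a[starts-with(@href,'/ads')]
- # selects all a nodes whose href starts with "/ads"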
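
To see what the attribute predicates and functions select, here is a sketch using the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree understands [@name] and [@value='2'], but not contains(), starts-with() or text()=, so those three are emulated with plain Python string checks. The sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div id="nav" class="eng">
  <input name="q" value="2"/>
  <input value="3"/>
  <a href="/ads/banner.html">Ad</a>
  <a href="/app/promote.html">App promotion</a>
</div>
"""

root = ET.fromstring(PAGE)

# //input[@name]: inputs that carry a name attribute
with_name = root.findall(".//input[@name]")

# //input[@value='2']: inputs whose value attribute equals 2
value_two = root.findall(".//input[@value='2']")

links = list(root.iter("a"))

# //a[contains(@href,'promote.html')], emulated in Python
contains_promote = [a for a in links if "promote.html" in a.get("href", "")]

# //a[starts-with(@href,'/ads')], emulated in Python
starts_with_ads = [a for a in links if a.get("href", "").startswith("/ads")]

# //a[text()='App promotion'], emulated in Python
exact_text = [a for a in links if a.text == "App promotion"]

print(len(with_name), len(value_two), len(contains_promote))
```

In a Scrapy spider the original expressions can be passed to response.xpath() unchanged, since its XPath engine supports these functions.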

