scrapy爬虫项目的建立_麻衣带我去上学_新建scrapy项目

网络投稿 02-07 5927

文章目录前言一、什么是爬虫？二、什么是scrapy三、新建一个scrapy项目四、各模块的作用4.1 item.py4.2 pipelines.py4.3 qkhousespider.py4.4 settings.py4.5 其他文件五、启动scrapy

前言

本文只涉及到scrapy爬虫项目的基础知识,不涉及网页信息的提取和反爬机制的处理等具体技术,由于本人对爬虫的学习比较浅显,有出错的地方欢迎指正.

一、什么是爬虫？

我相信能点进这份博客的人可能已经对爬虫有了大致了解,简单来说爬虫就是为了信息的获取,这些信息通常也能通过人工的方式获取,但是数据量一旦庞大起来,通过人工的方式就显得十分的不切实际,而爬虫程序可以为我们处理这些简单的重复无意义的操作.

通常而言,我们不会直接说我们在"爬数据",会比较高大上的说我们在做"数据规整"的工作.

关于爬虫的具体原理本人也不是特别清楚,我的理解是,通过爬虫程序向指定网站发送请求信息request,网站会给你一个回复报文response,熟悉web的同学应该知到,这个response中就包含了我们需要的信息,通常来说,这个response会被浏览器进行编译和执行,转换成我们熟悉的前端界面,我们需要的信息就包含在这个前端界面中,因此我们需要的信息也存在于这个response中.

简单来说,爬虫的工作是"伪装"成浏览器访问某一站点,再"伪装"成用户在这个站点中查询相关信息,最后拦截得到站点返回给浏览器的response,从中筛选得到我们需要的信息.

可能这一部分的讲解有点不太清楚,感兴趣的同学可以查阅其他资料

二、什么是scrapy

scrapy是python自带的一个爬虫框架,scrapy已经帮我们完成了许多工作,而我们只需要定制开发几个模块即可,这就使得开发一个爬虫程序变得十分简单.

三、新建一个scrapy项目

在新建一个项目之前,需要使用anaconda新建一个虚拟环境,然后下载需要使用的包,这里需要使用的包是scrapy.

在虚拟环境建立完成后,可以通过命令行的方式新建一个scrapy项目,主要分为三个步骤:

进入保存项目的文件目录激活虚拟环境scrapy startproject name 来新建scrapy项目.

上图中我省去了步骤1.

成功创建scrapy项目后,目录如下: 为了介绍更加方便,接下来我将以一个已经完成的完整的scrapy项目进行描述.一个完整的scrapy项目目录如下: 通过对比两张图可以看出,一个完整的scrapy项目只是多了几个py文件,大体结构并未改变.

四、各模块的作用 4.1 item.py

将需要爬取的信息封装成一个类,方便后期数据的保存,我的爬虫代码中需要保存房屋的价格,位置,楼层,社区,朝向,周围地铁等信息.

代码示例如下:

import scrapy class QkhouseItem(scrapy.Item): price = scrapy.Field() location = scrapy.Field() floor = scrapy.Field() community = scrapy.Field() orientation = scrapy.Field() subway = scrapy.Field() pass 4.2 pipelines.py

pipelines.py是随着项目的创建而自动创建的,该模块主要是完成数据保存到外部文件中.在完整的代码中,一共有三个pipelines模块:pipelines.py将数据保存到txt文件中,pipelines_excel.py将数据保存到excel表格中,pipelines_mysql.py将数据保存到mysql数据库中.

pipelines.py代码如下:

class QkhousePipeline: def process_item(self, item, spider): global index with open("house.txt", "a", encoding='utf8') as fp: fp.write(item['price'] + "\t") fp.write(item['location'] + "\t") fp.write(item['floor'] + "\t") fp.write(item['community'] + "\t") fp.write(item['orientation'] + "\t") fp.write(item['subway']+ "\t") fp.write("\n") fp.close() return item

pipelines_excel.py代码如下:

from openpyxl import Workbook index = 2 wb = Workbook() sheet = wb.create_sheet("租房信息") sheet.cell(row=1, column=1, value="位置") sheet.cell(row=1, column=2, value="小区") sheet.cell(row=1, column=3, value="价格") sheet.cell(row=1, column=4, value="楼层") sheet.cell(row=1, column=5, value="地铁") sheet.cell(row=1, column=6, value="朝向") class QkhousePipeline: def process_item(self, item, spider): global index """保存到excel中""" sheet.cell(row=index, column=1, value=item['location']) sheet.cell(row=index, column=2, value=item['community']) sheet.cell(row=index, column=3, value=item['price']) sheet.cell(row=index, column=4, value=item['floor']) sheet.cell(row=index, column=5, value=item['subway']) sheet.cell(row=index, column=6, value=item['orientation']) wb.save("house.xlsx") index += 1 return item

pipelines_mysql.py代码如下:

import pymysql pc = pymysql.connect(host='', user="", password="", database="test", port=3306, charset='UTF8MB4') cs = pc.cursor() sql = "show tables" cs.execute(sql) # print(cs.fetchall()) for i in cs.fetchall(): if i[0] == 'qkhouse': print('表已经存在') break else: sql = ''' create table qkhouse( sno int primary key auto_increment, location varchar(100), community varchar(100), floor varchar(20), Orientation varchar(10), price varchar(200), subway varchar(100) )character set 'UTF8MB4' ''' cs.execute(sql) class QkhousePipeline: def process_item(self, item, spider): global pc, cs """保存到数据库中""" try: pc.ping(reconnect=True) with pc.cursor() as cs2: sql = "insert into qkhouse (location, community, floor, Orientation, price, subway) values (%s, " \ "%s, %s, %s, %s, %s) " cs2.execute(sql, [item['location'], item['community'], item['floor'], item['orientation'], item['price'], item['subway']]) pc.commit() except Exception as e: print(e) pc.rollback() finally: pc.close() return item

你也可以将这些保存方式写在一个pipelines.py文件中,但是不方便代码的管理.

分成多个pipelines时,你需要修改setting.py中的代码,在65行附件有一段注释掉的代码:

#ITEM_PIPELINES = { # 'qkhouse.pipelines.QkhousePipeline': 300, #}

若你只有一个pipelines,直接取消注释即可,代码中的300表示优先级,因为只有一个pipelines,所以优先级设置的大小没有影响.

若有多个pipelines时,你就需要设置pipelines优先级的大小:

ITEM_PIPELINES = { 'qkhouse.pipelines.QkhousePipeline': 1, 'qkhouse.pipelines_excel.QkhousePipeline': 2, 'qkhouse.pipelines_mysql.QkhousePipeline': 3, } 4.3 qkhousespider.py

在刚创建的scrapy项目的spider目录下,只有一个init文件,我们需要在该目录下创建一个新的模块完成核心程序的编写.

import scrapy from bs4 import BeautifulSoup from ..items import QkhouseItem class QkhouseSpider(scrapy.Spider): name = 'qkhousespider' allowed_domains = ['wh.qk365.com'] start_urls = ['https://wh.qk365.com/list/p1'] def parse(self, response): house_data = BeautifulSoup(response.body.decode('utf8'), 'html.parser') elem = house_data.find_all("div", {"class": "w1170 clearfix"})[0].ul.children index = 1 for i in elem: if index % 2 == 1: index += 1 else: url = i.a['href'] index += 1 yield scrapy.Request(url=url, callback=self.parse_detail) next = house_data.find_all("p", {"class": "easyPage"})[0] next_url = next.contents[len(next.contents)-2]['href'] if next.contents[len(next.contents)-2]['onclick'] != "next.contents[len(next.contents)-2]['onclick']": yield scrapy.Request(url=next_url, callback=self.parse) print(next_url) pass def parse_detail(self, response): item = QkhouseItem() house_detail = \ BeautifulSoup(response.body.decode('utf8'), 'html.parser').find_all("dl", {"class": "houSurvey clearfix"})[ 0].contents house_detail_lf = house_detail[1] house_detail_rg = house_detail[3] item['price'] = house_detail_lf.contents[1].text[5:12] item["orientation"] = house_detail_lf.contents[5].text[6:8] item['floor'] = house_detail_lf.contents[7].text[5:] item['community'] = house_detail_lf.contents[9].text[5:] item['subway'] = house_detail_rg.contents[3].text.replace(u'\xa0\r\n\t\t\t\t', u',').replace(u'\n', u',')[6:] item['location'] = house_detail_rg.contents[7].text.replace(u'\xa0\r\n\t\t\t\t', u',')[6:] yield item

因为涉及到爬取二级网页所以有parse和parse_detail两个函数,关于信息的提取我使用的是BeautifulSoup,关于数如何提取,建议看看其他大佬的博客,如果以后还有机会用到scrapy我也会补全这份内容(这部分知识丢的太久了,忘记了).

在parse_detail函数中,返回值用的是yield,这是python中的特有语法,你可以看成是每一次保存的item对象都"传输"到pipelines中进行保存.

4.4 settings.py

在配置文件中,我们之前提到过的pipelines需要对其进行修改外,还需要做其他的变动. 20行的代码改为:

ROBOTSTXT_OBEY = False

另外,很多网站具备反爬机制,也需要修改相应的配置文件.

4.5 其他文件

middlewares.py是项目自动生成的,目前我没有遇到过修改该文件的情形. 最后四个py文件,是算法可视化的模块,利用matplotlib来绘制表格图形,这是建立在将爬取的数据保存为外部文件的基础上进行的,算法可视化也不是必须的操作.

五、启动scrapy

scrapy的启动需要在终端输入命令:scrapy crawl name

scrapy crawl qkhousespider

我的程序中,在spider目录下面的qkhousespider.py文件中,可以看到name设置的值为qkhousespider.