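I got started with Scrapy a while back by crawling the Douban Top 250. I'm now planning to learn scrapy-redis for distributed crawling, so before diving in I'm reviewing plain Scrapy once more. This summary is heavily condensed; for more background, see my earlier doubanmovie write-up.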
Hands-on practice
Open CMD and run scrapy startproject maoyan, then define the item in items.py:
import scrapy


class MaoyanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    movie_name = scrapy.Field()
    movie_ename = scrapy.Field()
    movie_type = scrapy.Field()
    movie_publish = scrapy.Field()
    movie_time = scrapy.Field()
    movie_star = scrapy.Field()
    movie_total_price = scrapy.Field()
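First, import Scrapy.
Next, create a class that inherits from scrapy.Item. It is the container that holds the scraped data, and it is written much like an ORM model.
What we want to record: the movie's name, its rating, its release date, its genre, and its English title.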
Fetching the page data
With the item in place, the next step is to write the spider.
from scrapy.spiders import Rule, CrawlSpider
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from maoyan.items import MaoyanItem


class MaoyanmovieSpider(CrawlSpider):
    name = 'my'
    # allowed_domains takes bare domains, not URLs
    # allowed_domains = ['maoyan.com']
    start_urls = ['http://maoyan.com/films']
    rules = (
        # Follow the paginated list pages (no callback, so links are just followed)
        Rule(LinkExtractor(allow=(r'http://maoyan\.com/films\?offset=\d+',))),
        # Hand every movie detail page to parse_item
        Rule(LinkExtractor(allow=(r'http://maoyan\.com/films/\d+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # print(response.body)
        sel = Selector(response)
        movie_name = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/h3/text()').extract()
        movie_ename = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/div/text()').extract()
        movie_type = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[1]/text()').extract()
        movie_publish = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[2]/text()').extract()
        movie_time = sel.xpath('/html/body/div[3]/div/div[2]/div[1]/ul/li[3]/text()').extract()
        movie_star = sel.xpath('/html/body/div[3]/div/div[2]/div[3]/div[1]/div/span/span/text()').extract()
        # movie_total_price = sel.xpath('/html/body/div[3]/div/div[2]/div[3]/div[2]/div/span[1]/text()').extract()
        # movie_introd = sel.xpath('//*[@id="app"]/div/div[1]/div/div[2]/div[1]/div[1]/div[2]/span/text()').extract()
        # print(movie_name)
        # print(movie_ename)
        # print(movie_type)
        # print(movie_publish)
        # print(movie_time)
        # print(movie_star)
        # print(movie_total_price)
        item = MaoyanItem()
        item['movie_name'] = movie_name
        item['movie_ename'] = movie_ename
        item['movie_type'] = movie_type
        item['movie_publish'] = movie_publish
        item['movie_time'] = movie_time
        item['movie_star'] = movie_star
        # item['movie_total_price'] = movie_total_price
        # item['movie_introd'] = movie_introd
        yield item
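To see which links each of the two rules would pick up, here is a minimal sketch that uses only the standard re module. It assumes, as far as I understand Scrapy, that LinkExtractor tests each `allow` pattern with a regex search against the absolute URL and that a link goes to the first rule that matches. The patterns are written with the dots escaped so that `.` matches only a literal dot. The `classify` helper and the sample URLs are made up for illustration.

```python
import re

# The two `allow` patterns from the spider's rules.
LIST_PAGE = re.compile(r'http://maoyan\.com/films\?offset=\d+')
DETAIL_PAGE = re.compile(r'http://maoyan\.com/films/\d+')


def classify(url):
    """Return which rule (if any) a URL would fall under (hypothetical helper)."""
    if LIST_PAGE.search(url):
        return 'follow'      # pagination link: followed, no callback
    if DETAIL_PAGE.search(url):
        return 'parse_item'  # detail page: handed to parse_item
    return None              # matched by neither rule, so ignored


# Hypothetical URLs, invented for this example.
for url in ('http://maoyan.com/films?offset=30',
            'http://maoyan.com/films/1203',
            'http://maoyan.com/news/42'):
    print(url, '->', classify(url))
```

Checking the patterns this way before a long crawl is cheap, and it makes it obvious when a pattern is too loose or never matches.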
With the spider written, the data needs to go into a MongoDB database, so edit the pipeline:
import logging

import pymongo
from scrapy.exceptions import DropItem

logger = logging.getLogger(__name__)


class MongoDBPipeline(object):
    def __init__(self, server, port, db_name, collection_name):
        client = pymongo.MongoClient(server, port)
        self.target = '%s/%s' % (db_name, collection_name)
        self.collection = client[db_name][collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MONGODB_* values from settings.py; this replaces
        # scrapy.conf.settings, which current Scrapy releases no longer ship
        settings = crawler.settings
        return cls(settings.get('MONGODB_SERVER'), settings.getint('MONGODB_PORT'),
                   settings.get('MONGODB_DB'), settings.get('MONGODB_COLLECTION'))

    def process_item(self, item, spider):
        # item: the scraped Item object
        # spider: the Spider that scraped the item
        # Validation (not deduplication): drop any item with an empty field
        for field, value in item.items():
            if not value:
                raise DropItem('Missing %s in item from %s' % (field, spider.name))
        movie = {
            'movie_name': item['movie_name'],
            'movie_ename': item['movie_ename'],
            'movie_type': item['movie_type'],
            'movie_publish': item['movie_publish'],
            'movie_time': item['movie_time'],
            'movie_star': item['movie_star'],
        }
        # Insert into the database collection
        # (insert_one replaces Collection.insert, which PyMongo 4 removed)
        self.collection.insert_one(movie)
        logger.debug('Item written to MongoDB database %s', self.target)
        return item
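One detail of the empty-field check is easy to get wrong: looping over an item (or any dict) directly yields only the field names, never the scraped values, so a test like `if not data` on the loop variable can never fire. The sketch below shows the idea with a plain dict standing in for the Scrapy item. The `missing_fields` helper and the sample data are hypothetical.

```python
def missing_fields(item):
    """Return the names of fields whose scraped value is empty."""
    # Iterating a dict directly yields its keys only, so the values
    # have to be fetched explicitly through .items().
    return [name for name, value in item.items() if not value]


# Hypothetical scraped data; .extract() gives a list of strings per field.
scraped = {
    'movie_name': ['Example Movie'],
    'movie_ename': [],      # the XPath matched nothing
    'movie_star': ['9.0'],
}

print(missing_fields(scraped))
```

A pipeline could raise DropItem whenever this list is non-empty, or, if some fields are optional, only check the ones that really matter.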
Configuration file (settings.py)
BOT_NAME = 'maoyan'
SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = True
DOWNLOAD_DELAY = 3
LOG_LEVEL = 'DEBUG'
RANDOMIZE_DOWNLOAD_DELAY = True
# Disable redirects
REDIRECT_ENABLED = False
# Treat a 302 response as a normal response so cookies can still be written
HTTPERROR_ALLOWED_CODES = [302]
ITEM_PIPELINES = {
    'maoyan.pipelines.MongoDBPipeline': 300,
}
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'maoyan'
MONGODB_COLLECTION = 'movies'
Now start the crawler with scrapy crawl my.
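When running this crawler you will probably hit 302 redirects or get flagged by the site as a bot. Lengthening the download delay helps, but crawling then becomes very slow. There are 23,110 pages with 30 entries each, 693,300 entries in total, so even without getting banned the crawl would take forever. Enough said, time to go learn distributed crawling!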
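To put a rough number on "forever", here is a back-of-the-envelope estimate. It is only a sketch under stated assumptions: one request per list page plus one per movie detail page, requests to the site issued one after another with the configured delay between them, and zero time spent on the download itself. The page and item counts come from the text above and the delay from settings.py.

```python
PAGES = 23110            # list pages, from the text
ITEMS_PER_PAGE = 30      # movies per list page, from the text
DOWNLOAD_DELAY = 3       # seconds, from settings.py

total_items = PAGES * ITEMS_PER_PAGE
# Assumption: one request per list page plus one per detail page.
total_requests = PAGES + total_items

# With RANDOMIZE_DOWNLOAD_DELAY the wait varies between 0.5x and 1.5x
# DOWNLOAD_DELAY, so the average wait per request is still DOWNLOAD_DELAY.
seconds = total_requests * DOWNLOAD_DELAY
days = seconds / 86400

print(total_items)
print(round(days, 1))
```

Under these assumptions a single crawler needs roughly 25 days, and that is a lower bound because real download time comes on top.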