
Using Scrapy to crawl a website, scraping only pages that contain a keyword


I am trying to crawl various websites looking for particular keywords of interest, and to scrape only the pages that contain them. I have written the script to run as a standalone Python script (following an example) rather than with the conventional Scrapy project structure, and it uses the CrawlSpider class. The idea is that, starting from a given home page, the spider will crawl pages within that domain, but harvest links only from pages that contain the keyword. When I find a page containing the keyword, I also try to save a copy of it. The previous version of this question concerned a syntax error (see the comments below, and thanks to @tegancp for helping me clear that up); now, although my code runs, I still cannot get it to crawl links only on the pages of interest, as intended.

I think I want to either i) remove the call to LinkExtractor in the __init__ function, or ii) keep calling LinkExtractor from within __init__ but base the rules on what I find when I visit a page rather than on some attribute of the URL. I can't do i) because the CrawlSpider class requires a rule, and I can't do ii) because LinkExtractor doesn't offer that kind of option the way the old SgmlLinkExtractor (which now seems to be deprecated) did. I'm new to Scrapy, so I'm wondering whether my only option is to write my own LinkExtractor. My code is below.

import sys  # used for the command-line arguments below

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor

# define an item class
class GenItem(Item):
    url = Field()

# define a spider
class GenSpider(CrawlSpider):

    name = "genspider3"

    # requires 'start_url', 'allowed_domains' and 'folderpath' to be passed
    # as string arguments IN THIS PARTICULAR ORDER!!!
    def __init__(self):
        self.start_urls = [sys.argv[1]]
        self.allowed_domains = [sys.argv[2]]
        self.folder = sys.argv[3]
        self.writefile1 = self.folder + 'hotlinks.txt'
        self.writefile2 = self.folder + 'pages.txt'
        self.rules = [Rule(LinkExtractor(allow_domains=(sys.argv[2],)),
                           follow=True, callback='parse_links')]
        super(GenSpider, self).__init__()

    def parse_start_url(self, response):
        # get list of links on start_url page and process using parse_links
        list(self.parse_links(response))

    def parse_links(self, response):
        # if this page contains a word of interest, save the HTML to file
        # and crawl the links on this page
        theHTML = response.body
        if 'keyword' in theHTML:
            with open(self.writefile2, 'a+') as f2:
                f2.write(theHTML + '\n')
            with open(self.writefile1, 'a+') as f1:
                f1.write(response.url + '\n')
            for link in LinkExtractor(allow_domains=(sys.argv[2],)).extract_links(response):
                linkitem = GenItem()
                linkitem['url'] = link.url
                log.msg(link.url)
                with open(self.writefile1, 'a+') as f1:
                    f1.write(link.url + '\n')
                yield linkitem  # yield, rather than return, so every extracted link produces an item

# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?
    # stop the reactor
    reactor.stop()

# instantiate settings and provide a custom configuration
settings = Settings()
#settings.set('DEPTH_LIMIT', 2)
settings.set('DOWNLOAD_DELAY', 0.25)

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = GenSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start(loglevel=log.DEBUG)

# start the reactor (blocks execution)
reactor.run()
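
For what it's worth, the behaviour the question asks for (following links only from pages whose content contains the keyword) does not require a custom LinkExtractor. One way out is to drop CrawlSpider, and with it the mandatory rules, in favour of a plain scrapy.Spider that schedules each followed link explicitly, so the keyword test itself gates the crawl. Below is a minimal sketch of that approach; it assumes Scrapy 1.1+ import paths, and KeywordSpider, the literal 'keyword' and the sys.argv handling are illustrative placeholders, not code from the original post.

import sys

import scrapy
from scrapy.linkextractors import LinkExtractor

class KeywordSpider(scrapy.Spider):

    name = "keywordspider"

    def __init__(self, *args, **kwargs):
        super(KeywordSpider, self).__init__(*args, **kwargs)
        self.start_urls = [sys.argv[1]]
        self.allowed_domains = [sys.argv[2]]
        self.link_extractor = LinkExtractor(allow_domains=(sys.argv[2],))

    def parse(self, response):
        # Links are extracted only when the current page matches, so pages
        # without the keyword are fetched but contribute nothing to the crawl.
        if 'keyword' in response.text:
            yield {'url': response.url}
            for link in self.link_extractor.extract_links(response):
                yield scrapy.Request(link.url, callback=self.parse)

Because every scheduled Request routes back through parse, links are only ever harvested from matching pages, and Scrapy's built-in duplicate filter stops the crawl from revisiting URLs.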


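Separately, the driver section of the script uses APIs from Scrapy 0.2x: the scrapy.contrib.* import paths, scrapy.log, and the Crawler(settings) / configure() / manual-reactor pattern were deprecated and later removed. On current Scrapy the equivalent standalone driver is CrawlerProcess, which owns the reactor and logging itself; a short sketch under that assumption, reusing the hypothetical KeywordSpider from above:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'DOWNLOAD_DELAY': 0.25, 'LOG_LEVEL': 'DEBUG'})
process.crawl(KeywordSpider)  # takes the spider class; CrawlerProcess instantiates it
process.start()               # starts the Twisted reactor and blocks until the crawl ends

Note that process.crawl() accepts the spider class plus any constructor arguments, rather than a pre-built spider instance.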
