Scrapy Tutorial - (2) Writing a Simple Spider

Goal: scrape the title, price, URL, stock availability, rating, and cover image of every book on the example site, books.toscrape.com.

Check ROBOTSTXT_OBEY

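After creating the Scrapy project, open settings.py, find ROBOTSTXT_OBEY, and set it to False.
(This tells Scrapy to ignore the site's robots.txt, so only do it with the site owner's permission. Note: the site used here is a practice site meant for scraping exercises.)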
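
To make it concrete what kind of check is being switched off, here is a minimal sketch of a robots.txt lookup using Python's standard urllib.robotparser. This is not Scrapy's own implementation; the robots.txt content, the user agent name, and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path outside /private/ is allowed, a path inside it is not.
print(is_allowed(ROBOTS_TXT, "my-spider", "http://example.com/catalogue/index.html"))
print(is_allowed(ROBOTS_TXT, "my-spider", "http://example.com/private/data.html"))
```

With ROBOTSTXT_OBEY left at True, a crawler that honors these rules would skip the second URL; setting it to False means no such check is applied.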

Locate the elements

Go back to the example site and press F12 to open the developer tools.
Let's start with two small exercises to get familiar with XPath.
First, each book title sits in an a tag inside an h3. Its XPath is:

# parse book titles
response.xpath('//h3/a/@title').extract()

# extract() returns a list with every matching title
# extract_first() returns only the first matching title
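
The difference between "all matches" and "first match" can be tried without running Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (it has no `/@title` step, so the attribute is read with .get()). The HTML snippet and the helper names all_titles and first_title are hypothetical and only mimic the shape of the book list.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet shaped like the book list markup.
SAMPLE = """
<ol>
  <li><h3><a href="catalogue/book-one_1/index.html" title="Book One">Book O...</a></h3></li>
  <li><h3><a href="catalogue/book-two_2/index.html" title="Book Two">Book T...</a></h3></li>
</ol>
"""

root = ET.fromstring(SAMPLE)
empty_root = ET.fromstring("<ol></ol>")  # a page with no books


def all_titles(tree):
    """Analogous to extract(): every match as a list (empty if none)."""
    return [a.get("title") for a in tree.findall(".//h3/a")]


def first_title(tree):
    """Analogous to extract_first(): the first match, or None if none."""
    a = tree.find(".//h3/a")
    return a.get("title") if a is not None else None


print(all_titles(root))
print(first_title(root))
print(first_title(empty_root))
```

The last call shows why the "first" variant is convenient for single values: a missing element gives None rather than an index error on an empty list.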

Next, look at where the price is located. Its XPath is:

# parse book prices
response.xpath('//p[@class="price_color"]/text()').extract()
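
The price comes back as text, currency symbol included. If a number is needed later, one possible cleanup step is sketched below. The helper parse_price and the sample strings are assumptions for illustration, not part of the original spider.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Pull the first decimal number out of a price string, or return None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return Decimal(match.group()) if match else None


# Sample strings in the shape the price XPath returns (illustrative values).
samples = ["£51.77", "£53.74"]
print([parse_price(s) for s in samples])
```

Decimal is used instead of float so that amounts such as 51.77 keep their exact value.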

Finding the URLs is the key step: we first need the URL of every book, then request each one and pull the data we want from the response. The XPath is:

response.xpath('//h3/a/@href').extract_first()

# output: 'catalogue/a-light-in-the-attic_1000/index.html'

Request the book pages

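Looking at the URL, what we just extracted is only the trailing part of the book's address (a relative path), so we have to prepend the site prefix to get a complete URL. With that in mind, we can write our first function.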

def parse(self, response):
    # collect the relative URL of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # join the site prefix with the relative path
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    pass
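
The prefix-plus-suffix joining described above can be tried in isolation with the standard library's urllib.parse.urljoin. This is only a sketch of the idea: inside a spider the base is the URL of the current response, whereas here the base URL is written out by hand as an assumption.

```python
from urllib.parse import urljoin

# Assumed base: the page the relative link was found on.
base = "http://books.toscrape.com/index.html"
# The relative path extracted earlier in this article.
href = "catalogue/a-light-in-the-attic_1000/index.html"

full_url = urljoin(base, href)
print(full_url)
# prints: http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
```

Joining against the page URL, rather than concatenating strings, keeps working when the relative path contains segments such as `../`.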

Parse Data

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()

    # the src is relative ('../../...'), so swap in the site root
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')

    # the class looks like 'star-rating Three'; keep only the rating word
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()

    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
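
The two cleanup steps in parse_book (the rating class and the relative image path) can also be written as small standalone helpers. The sketch below is one possible variant, not the article's method: rating_from_class and absolute_image_url are hypothetical names, the word-to-number table is an assumption about how the rating might be used later, and the image path is a made-up sample.

```python
from urllib.parse import urljoin

# Assumed mapping from the rating word in the class attribute to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(class_attr):
    """Turn a class string such as 'star-rating Three' into an int, or None."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


def absolute_image_url(page_url, src):
    """Resolve a relative src such as '../../media/x.jpg' against the page URL."""
    return urljoin(page_url, src)


page = "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
print(rating_from_class("star-rating Three"))
print(absolute_image_url(page, "../../media/cache/ab/cd/cover.jpg"))
```

Resolving against the page URL avoids hard-coding the number of `../` segments, which the string replace relies on.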

Check the parsed results

Here we can use yield to emit the parsed values and inspect them:

# inside the parse_book function
yield {'title': title,
       'price': price,
       'image_url': image_url,
       'rating': rating,
       'description': description}

The complete simple spider

def parse(self, response):
    # collect the relative URL of every book on the page
    books = response.xpath('//h3/a/@href').extract()
    for book in books:
        # join the site prefix with the relative path
        url = response.urljoin(book)
        yield response.follow(url=url, callback=self.parse_book)

def parse_book(self, response):
    title = response.xpath('//h1/text()').extract_first()
    price = response.xpath('//*[@class="price_color"]/text()').extract_first()

    # the src is relative ('../../...'), so swap in the site root
    image_url = response.xpath('//img/@src').extract_first()
    image_url = image_url.replace('../../', 'http://books.toscrape.com/')

    # the class looks like 'star-rating Three'; keep only the rating word
    rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
    rating = rating.replace('star-rating', '').strip()

    description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()

    yield {'title': title,
           'price': price,
           'image_url': image_url,
           'rating': rating,
           'description': description}