The code that finally worked is as follows:
import scrapy


class CnblogSpider(scrapy.Spider):
    name = 'cnblog'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/qiyeboy/default.html?page=1']

    def parse(self, response):
        # Select every block whose class is "day"; the fields below are read relative to it
        papers = response.xpath(".//*[@class='day']")
        for paper in papers:
            url = paper.xpath(".//*[@class='postTitle']/a/@href").extract_first()
            title = paper.xpath(".//*[@class='postTitle']/a/span/text()").extract()[0]
            time = paper.xpath(".//*[@class='dayTitle']/a/text()").extract()[0]
            content = paper.xpath(".//*[@class='postTitle']/a/span/text()").extract()[0]
            # Print the four fields on one comma-separated line
            print(f'{url},{title},{time},{content}')
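Two things came out of debugging while I wrote this.
1. Writing the XPath expression like this
title = paper.xpath(".//*[@class='postTitle']/a/span").extract()[0]
does not give the right result: the extracted title is the whole <span> element, tags included. The correct approach is to append text() to the XPath expression, as follows:
title = paper.xpath(".//*[@class='postTitle']/a/span/text()").extract()[0]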
2. To print several strings in one formatted line, besides format() you can also use an f-string, as in the example above:
f'{a},{b},{c}'
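A small side-by-side sketch of the two styles. The variable values here are placeholders chosen for illustration, not data scraped from the site. f-strings need Python 3.6 or later.

```python
# Placeholder values, assumed only for this example
url = '/p/1.html'
title = 'Hello Scrapy'
time_str = '2017-01-01'

# str.format(): values are passed as arguments and fill the {} slots in order
with_format = '{},{},{}'.format(url, title, time_str)

# f-string: the variable names are written directly inside the braces
with_fstring = f'{url},{title},{time_str}'

print(with_format)
print(with_fstring)
```

Both lines build the same string; the f-string keeps each name next to the place where it is printed, which is easier to read when there are several fields.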