2021SC@SDUSC
Finally, as the wrap-up for the Software Engineering Applications course, I stayed up late and worked through the response part of the code, analyzing it alongside the official documentation:
Response objects:
Class: scrapy.http.Response(*args, **kwargs)    (source code attached at the end)
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
Parameters:
- url (str) – the URL of this response
- status (int) – the HTTP status of the response. Defaults to 200.
- headers (dict) – the headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).
- body (bytes) – the response body. To access the decoded text as a string, use response.text from an encoding-aware Response subclass, such as TextResponse.
- flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
- request (scrapy.http.Request) – the initial value of the Response.request attribute. This represents the Request that generated this response.
- certificate (twisted.internet.ssl.Certificate) – an object representing the server's SSL certificate.
- ip_address (ipaddress.IPv4Address or ipaddress.IPv6Address) – the IP address of the server from which the Response originated.
- protocol (str) – the protocol that was used to download the response. For instance: "HTTP/1.0", "HTTP/1.1", "h2".
Method: follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) → Generator[scrapy.http.request.Request, None, None]
New in version 2.0.
Returns an iterable of Request instances to follow all links in urls. It accepts the same arguments as the Request.__init__ method, but the elements of urls can be relative URLs or Link objects, not only absolute URLs.
TextResponse provides a follow_all() method that supports selectors in addition to absolute/relative URLs and Link objects.
def follow_all(self, urls, callback=None, method='GET', headers=None, body=None,
               cookies=None, meta=None, encoding='utf-8', priority=0,
               dont_filter=False, errback=None, cb_kwargs=None, flags=None):
    # type: (...) -> Generator[Request, None, None]
    if not hasattr(urls, '__iter__'):
        raise TypeError("'urls' argument must be an iterable")
    return (
        self.follow(
            url=url,
            callback=callback,
            method=method,
            headers=headers,
            body=body,
            cookies=cookies,
            meta=meta,
            encoding=encoding,
            priority=priority,
            dont_filter=dont_filter,
            errback=errback,
            cb_kwargs=cb_kwargs,
            flags=flags,
        )
        for url in urls
    )
replace([url, status, headers, body, request, flags, cls])
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urljoin(); it is merely an alias for calling urljoin(response.url, url).
Response subclasses:
The following is a list of the available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.
TextResponse objects:
Class: scrapy.http.TextResponse(url[, encoding[, ...]])
class TextResponse(Response):
    _DEFAULT_ENCODING = 'ascii'
    _cached_decoded_json = _NONE

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super().__init__(*args, **kwargs)

    def _set_url(self, url):
        if isinstance(url, str):
            self._url = to_unicode(url, self.encoding)
        else:
            super()._set_url(url)

    def _set_body(self, body):
        self._body = b''  # used by encoding detection
        if isinstance(body, str):
            if self._encoding is None:
                raise TypeError('Cannot convert unicode body - '
                                f'{type(self).__name__} has no encoding')
            self._body = body.encode(self._encoding)
        else:
            super()._set_body(body)

    def replace(self, *args, **kwargs):
        kwargs.setdefault('encoding', self.encoding)
        return Response.replace(self, *args, **kwargs)
TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds, or any media file.
In addition to the base Response objects, TextResponse objects support a new __init__ method argument (encoding); the remaining functionality is the same as for the Response class.
HtmlResponse objects:
Class: scrapy.http.HtmlResponse(url[, ...])
The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute.
XmlResponse objects:
Class: scrapy.http.XmlResponse(url[, ...])
The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the XML declaration line.
Appendix: Response class source code:
from typing import Generator
from urllib.parse import urljoin

from scrapy.exceptions import NotSupported
from scrapy.http.common import obsolete_setter
from scrapy.http.headers import Headers
from scrapy.http.request import Request
from scrapy.link import Link
from scrapy.utils.trackref import object_ref
class Response(object_ref):
def __init__(
self,
url,
status=200,
headers=None,
body=b"",
flags=None,
request=None,
certificate=None,
ip_address=None,
protocol=None,
):
self.headers = Headers(headers or {})
self.status = int(status)
self._set_body(body)
self._set_url(url)
self.request = request
self.flags = [] if flags is None else list(flags)
self.certificate = certificate
self.ip_address = ip_address
self.protocol = protocol
@property
def cb_kwargs(self):
try:
return self.request.cb_kwargs
except AttributeError:
raise AttributeError(
"Response.cb_kwargs not available, this response "
"is not tied to any request"
)
@property
def meta(self):
try:
return self.request.meta
except AttributeError:
raise AttributeError(
"Response.meta not available, this response "
"is not tied to any request"
)
def _get_url(self):
return self._url
def _set_url(self, url):
if isinstance(url, str):
self._url = url
else:
raise TypeError(f'{type(self).__name__} url must be str, '
f'got {type(url).__name__}')
url = property(_get_url, obsolete_setter(_set_url, 'url'))
def _get_body(self):
return self._body
def _set_body(self, body):
if body is None:
self._body = b''
elif not isinstance(body, bytes):
raise TypeError(
"Response body must be bytes. "
"If you want to pass unicode body use TextResponse "
"or HtmlResponse.")
else:
self._body = body
body = property(_get_body, obsolete_setter(_set_body, 'body'))
def __str__(self):
return f"<{self.status} {self.url}>"
__repr__ = __str__
    def copy(self):
"""Return a copy of this Response"""
return self.replace()
    def replace(self, *args, **kwargs):
"""Create a new Response with the same attributes except for those
given new values.
"""
for x in [
"url", "status", "headers", "body", "request", "flags", "certificate", "ip_address", "protocol",
]:
kwargs.setdefault(x, getattr(self, x))
cls = kwargs.pop('cls', self.__class__)
return cls(*args, **kwargs)
    def urljoin(self, url):
"""Join this Response's url with a possible relative url to form an
absolute interpretation of the latter."""
return urljoin(self.url, url)
@property
def text(self):
"""For subclasses of TextResponse, this will return the body
as str
"""
raise AttributeError("Response content isn't text")
def css(self, *a, **kw):
"""Shortcut method implemented only by responses whose content
is text (subclasses of TextResponse).
"""
raise NotSupported("Response content isn't text")
def xpath(self, *a, **kw):
"""Shortcut method implemented only by responses whose content
is text (subclasses of TextResponse).
"""
raise NotSupported("Response content isn't text")
    def follow(self, url, callback=None, method='GET', headers=None, body=None,
cookies=None, meta=None, encoding='utf-8', priority=0,
dont_filter=False, errback=None, cb_kwargs=None, flags=None):
# type: (...) -> Request
"""
Return a :class:`~.Request` instance to follow a link ``url``.
It accepts the same arguments as ``Request.__init__`` method,
but ``url`` can be a relative URL or a ``scrapy.link.Link`` object,
not only an absolute URL.
:class:`~.TextResponse` provides a :meth:`~.TextResponse.follow`
method which supports selectors in addition to absolute/relative URLs
and Link objects.
.. versionadded:: 2.0
The *flags* parameter.
"""
if isinstance(url, Link):
url = url.url
elif url is None:
raise ValueError("url can't be None")
url = self.urljoin(url)
return Request(
url=url,
callback=callback,
method=method,
headers=headers,
body=body,
cookies=cookies,
meta=meta,
encoding=encoding,
priority=priority,
dont_filter=dont_filter,
errback=errback,
cb_kwargs=cb_kwargs,
flags=flags,
)
    def follow_all(self, urls, callback=None, method='GET', headers=None, body=None,
cookies=None, meta=None, encoding='utf-8', priority=0,
dont_filter=False, errback=None, cb_kwargs=None, flags=None):
# type: (...) -> Generator[Request, None, None]
"""
.. versionadded:: 2.0
Return an iterable of :class:`~.Request` instances to follow all links
in ``urls``. It accepts the same arguments as ``Request.__init__`` method,
but elements of ``urls`` can be relative URLs or :class:`~scrapy.link.Link` objects,
not only absolute URLs.
:class:`~.TextResponse` provides a :meth:`~.TextResponse.follow_all`
method which supports selectors in addition to absolute/relative URLs
and Link objects.
"""
if not hasattr(urls, '__iter__'):
raise TypeError("'urls' argument must be an iterable")
return (
self.follow(
url=url,
callback=callback,
method=method,
headers=headers,
body=body,
cookies=cookies,
meta=meta,
encoding=encoding,
priority=priority,
dont_filter=dont_filter,
errback=errback,
cb_kwargs=cb_kwargs,
flags=flags,
)
for url in urls
)