TCP Connection / Sockets
TCP Port Number
In a client-server application on the web using sockets, the server must come up first.
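The ordering can be seen in a small loopback experiment. This is only a sketch using the standard library: the helper name `serve_once`, the greeting text, and the use of port 0 (so the OS picks a free port) are illustrative choices, not part of the original notes.

```python
import socket
import threading

def serve_once(server_sock):
    # Accept a single client, send a greeting, then close the connection.
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(b'hello from server')

# The server comes up first: bind to a port and start listening.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))   # port 0 asks the OS for any free port
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# Only now can a client connect to that host and port number.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
reply = b''
while True:
    chunk = client.recv(512)
    if len(chunk) < 1:          # empty read means the server closed
        break
    reply += chunk
client.close()
worker.join()
server.close()
print(reply.decode())
```

If the client tried to connect before anything was listening on the port, the connection attempt would typically be refused.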
Sockets in Python
Use the socket package to create a socket.
Fetching a file from a web server returns metadata (the response headers) along with the page content.
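import socket

# Open a TCP connection to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# Send an HTTP GET request; the blank line (\r\n\r\n) ends the request
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

# Read the response in chunks of up to 512 bytes until the server closes
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()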
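The metadata and the page content arrive in one byte stream, separated by the first blank line. Below is a rough sketch of pulling them apart. The function name `split_response` and the sample response bytes are made up for illustration; a real response from data.pr4e.org will carry different headers.

```python
def split_response(raw):
    # Headers and body are separated by the first blank line (CRLF CRLF).
    head, _sep, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Hypothetical response, standing in for the bytes collected from recv()
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 5\r\n'
          b'\r\n'
          b'hello')

status, headers, body = split_response(sample)
print(status)
print(headers['content-type'])
print(body.decode())
```

To use it with the socket loop, the chunks returned by `recv` would first be concatenated into a single bytes object.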
Application Protocol
The most important thing HTTP specifies is which application talks first, the client or the server. In HTTP, the client sends the request first.
A URL (Uniform Resource Locator) consists of three parts: protocol, host, and document.
When a browser uses HTTP to load a file or page from a server and display it, the exchange is called the Request/Response Cycle.
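The three parts of a URL can be inspected with `urllib.parse.urlsplit` from the standard library. A small sketch, using the romeo.txt URL from the socket example; the variable names are just for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://data.pr4e.org/romeo.txt')

protocol = parts.scheme     # which protocol to speak
host = parts.hostname       # which server to connect to
document = parts.path       # which document to ask for

print(protocol)
print(host)
print(document)
```

The host is what the socket connects to, and the document is what goes into the GET line of the request.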
Write a Web Browser
Unicode and UTF-8
ASCII
The ord() function returns the ASCII code of a character; wrap it in print() to display the value.
Unicode
In Python 3, all strings are Unicode.
Python Strings to Bytes
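A short sketch of moving between strings and bytes. The sample text is an arbitrary choice for illustration; it contains one non-ASCII character (written as an escape) so the difference between characters and bytes is visible.

```python
text = 'H\u00e9llo'            # 'e' with an acute accent, one Unicode character

print(ord('H'))                # numeric code of a single character

data = text.encode()           # str -> bytes, UTF-8 by default
print(type(data))
print(len(text), len(data))    # character count vs. byte count

restored = data.decode()       # bytes -> str, UTF-8 by default
print(restored == text)
```

This is why the socket example calls `encode()` before `send` and `decode()` after `recv`: the network carries bytes, while Python 3 strings are Unicode.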
urllib
With urllib, web page content can be handled like a file.
Reading a web page:
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
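Because the object returned by `urlopen` is iterated line by line like a file, the usual file-processing loops carry over. The sketch below counts words; it uses `io.BytesIO` as an offline stand-in for the network response, and both the function name `count_words` and the sample text are assumptions made for the example.

```python
import io

def count_words(fhand):
    # fhand is any file-like object that yields lines of bytes
    counts = {}
    for line in fhand:
        for word in line.decode().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Stand-in for urllib.request.urlopen(...); it also yields bytes lines
fake_response = io.BytesIO(b'the quick fox\nthe lazy dog\n')
counts = count_words(fake_response)
print(counts)
```

Passing the result of `urllib.request.urlopen(url)` in place of `fake_response` should work the same way, since both yield lines of bytes.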
It can also be used to retrieve the links in a web page.
When you click on an anchor tag in a web page like below, what HTTP request is sent to the server?
<p>Please click <a href="page1.htm">here</a>.</p>
==> GET
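To see what that GET looks like, the relative href has to be resolved against the address of the page that contains it. A minimal sketch with `urllib.parse.urljoin`; the base URL is hypothetical, since the notes do not say where the page lives.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical address of the page that contains the anchor tag
base = 'http://www.example.com/docs/index.htm'

# Resolve the relative href the way a browser would
target = urljoin(base, 'page1.htm')
parts = urlsplit(target)

# The request the browser would send for the resolved URL
request = 'GET ' + parts.path + ' HTTP/1.0\r\nHost: ' + parts.hostname + '\r\n\r\n'

print(target)
print(request)
```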
Web Scraping
Some sites do not allow scraping; check their terms of service first.
The purpose of BeautifulSoup: it repairs and parses HTML to make it easier for a program to understand.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors (this disables certificate verification)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags and print each href
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
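# Look at the parts of each anchor tag
tags = soup('a')
for tag in tags:
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    # An empty anchor has no contents, so guard the index
    print('Contents:', tag.contents[0] if tag.contents else None)
    print('Attrs:', tag.attrs)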