TCP Connection / Sockets
TCP Port Number
In a client-server application on the web using sockets, the server must come up first
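The ordering described above can be sketched on a single machine. This is only an illustration under assumptions: the loopback address 127.0.0.1, the OS-chosen port, the helper name run_server, and the message texts are all made up for the example.

```python
import socket
import threading

def run_server(server_sock, results):
    # Accept one client, record its message, and send a reply.
    conn, _addr = server_sock.accept()
    with conn:
        results.append(conn.recv(1024).decode())
        conn.sendall(b'hello from server')

# Step 1: the server comes up first (bind to a port and start listening).
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(('127.0.0.1', 0))   # port 0 lets the OS pick a free port
server_sock.listen(1)
port = server_sock.getsockname()[1]

received = []
worker = threading.Thread(target=run_server, args=(server_sock, received))
worker.start()

# Step 2: only now is there something for the client to connect to.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', port))
client.sendall(b'hello from client')
reply = client.recv(1024).decode()
client.close()

worker.join()
server_sock.close()
print(received[0])
print(reply)
```

If the client tried to connect before the server was listening, connect() would typically fail with ConnectionRefusedError.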
Sockets in Python
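Use the socket package to create a socket and fetch a file from a web server. The response contains metadata (the HTTP headers) followed by the page content.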
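import socket
# Open a TCP connection to the web server on port 80 (HTTP)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# The request ends with a blank line; encode() converts the string to bytes
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
# Read the response in chunks of up to 512 bytes until the server closes the connection
while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')
mysock.close()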
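Since the response carries metadata first and content second, the two parts can be separated at the first blank line. The sketch below works on a hard-coded sample response (an assumption made so that it runs without a network connection), not on real output from data.pr4e.org.

```python
# Sample response bytes, made up for illustration.
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain\r\n'
       b'Content-Length: 11\r\n'
       b'\r\n'
       b'Hello world')

# Headers and body are separated by the first blank line (\r\n\r\n).
head, _sep, body = raw.partition(b'\r\n\r\n')
header_lines = head.decode().split('\r\n')
status_line = header_lines[0]

# Turn the remaining "Name: value" lines into a dictionary.
headers = {}
for line in header_lines[1:]:
    name, _colon, value = line.partition(':')
    headers[name.strip().lower()] = value.strip()

print(status_line)
print(headers)
print(body.decode())
```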
Application Protocol
The most important thing HTTP specifies is which application talks first, the client or the server. In HTTP, the client talks first by sending a request.
A URL (Uniform Resource Locator) consists of three parts: protocol, host, and document.
When a browser uses the HTTP protocol to load a file or page from a server and display it in the browser, this is called the Request/Response Cycle.
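The three parts of a URL can be pulled apart with the standard library. A minimal sketch, using the romeo.txt URL from these notes; the mapping of urlparse field names onto "protocol, host, document" is my own labeling.

```python
from urllib.parse import urlparse

parts = urlparse('http://data.pr4e.org/romeo.txt')

print(parts.scheme)   # protocol
print(parts.netloc)   # host
print(parts.path)     # document
```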
Write a Web Browser
Unicode and UTF-8
ASCII
Use the ord() function to look up the ASCII code of a character: print(ord(c)), where c is a single character.
Unicode
In Python 3, all strings are Unicode.
Python Strings to Bytes
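A small sketch of ord() and of converting between strings and bytes. The sample characters and the variable names are chosen only for illustration.

```python
# ord() gives the numeric code of a character; chr() goes the other way.
code = ord('H')
print(code)        # 72
print(chr(code))   # H

# A Python 3 string is Unicode; encode() turns it into bytes (UTF-8 by default).
text = 'caf\u00e9'          # the word "café"
data = text.encode()
print(type(text), len(text))   # str, 4 characters
print(type(data), len(data))   # bytes, 5 bytes because é takes two bytes in UTF-8

# decode() turns the bytes back into a string.
back = data.decode()
print(back == text)
```

This is why data read from a socket or from urllib has to be decoded before it is treated as text.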
urllib
With urllib, web page content can be handled like a file. Reading a web page:
import urllib.request
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())
urllib can also be used to retrieve the links in a web page.
When you click on an anchor tag in a web page like the one below, what HTTP request is sent to the server?
<p>Please click <a href="page1.htm">here</a>.</p>
==> GET
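To make the answer concrete, here is a rough sketch of how a relative href turns into a GET request. The page address http://www.example.com/index.htm is a hypothetical stand-in for wherever the anchor tag lives, and the request text shows one common form rather than exactly what any particular browser sends.

```python
from urllib.parse import urljoin, urlparse

page_url = 'http://www.example.com/index.htm'   # hypothetical page containing the link
href = 'page1.htm'                               # value of the href attribute

# The browser resolves the relative link against the current page's URL.
target = urljoin(page_url, href)
parts = urlparse(target)

# Then it sends a GET request for that document to the host.
request = f'GET {parts.path} HTTP/1.0\r\nHost: {parts.netloc}\r\n\r\n'

print(target)
print(request)
```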
Web Scraping
Some websites do not allow scraping; check their terms of service first.
Purpose of BeautifulSoup: it repairs and parses HTML to make it easier for a program to understand.
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags and print each href
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
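# Look at the parts of each anchor tag
tags = soup('a')
for tag in tags:
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    # An empty tag has no contents, so guard against an IndexError
    print('Contents:', tag.contents[0] if tag.contents else None)
    print('Attrs:', tag.attrs)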