6.1. HTTP Protocol
- Retrieving Web Pages with HTTP (Chapter 6) Lecture
- Video by Lucas Holland on using the urllib2 Python module
The HTTP protocol defines a specific format for the message that a client sends
to request information from a web server. A simple static page is retrieved
with a GET request. Dynamic page requests that need only a small amount of data
to be sent as part of the request also use the GET request and embed the data
in the URL. A zip code or a part number are examples of the type of data that
might be embedded inside a GET request. When a larger amount of data is sent to
the server, such as when a form is filled out or a file is uploaded, a POST
request is sent instead.
6.1.1. HTTP Basics
With HTTP, the client sends a message requesting data, which may be a static page or a page that the server generates dynamically. The server then sends data back, usually in the form of an HTML, XHTML, or similar document. HTTP is a stateless, connectionless protocol. Both of these terms relate to the one-request, one-reply nature of HTTP.
Stateless means that the server keeps no record of earlier requests. With most protocols, the client and server send several messages back and forth, so the server can keep track of the state of the overall conversation with each client. This is not the case with HTTP. Each client request stands on its own as a request for information. Web servers often host server-side applications, such as a storefront, which treat the sequence of messages to and from each client as a session and thus track the state of each client. However, here we are talking only about the web server proper, which uses the HTTP protocol.
Connectionless has a very similar meaning to stateless. When you connect to an SSH, FTP, or Telnet server, you have an ongoing connection (session) to the server. With HTTP in its basic form, as soon as the request is received and the reply sent, the socket connection is closed. So if you are using a web-based application, such as webmail to read your e-mail, the overall session with the server-side application actually consists of many distinct socket connections.
HTTP was really designed for simple web page retrieval, not for ongoing interactions with a server-side application. For this reason, some have questioned whether HTTP is really the protocol that should be used for such activity. However, it seems to work well as a protocol designed for the simplest case that can still be combined with other technologies for more complex applications.
6.1.2. Basic GET
Here is how to retrieve a simple web page using socket programming. Notice that we have to concern ourselves not only with the socket connection, but also with the syntax of the HTTP protocol.
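```python
import socket

# Open a TCP connection to the web server on the standard HTTP port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('www.sal.ksu.edu', 80))

# Each header line ends with CRLF; an empty line ends the request.
request = ("GET /faculty/tim/index.html HTTP/1.0\r\n"
           "From: email@example.com\r\n"
           "User-Agent: Python\r\n"
           "\r\n")
s.sendall(request.encode("ascii"))  # sockets carry bytes, not str

# Save the reply exactly as received (status line, headers, and page).
with open("index.html", "wb") as fp:
    while True:
        data = s.recv(1024)
        if not data:  # empty result: the server closed the connection
            break
        fp.write(data)
s.close()
```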
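Because the socket code above saves everything the server sends, the file holds the status line and header lines as well as the page itself. The following is a minimal sketch of separating those parts, assuming the whole reply has already been read into a bytes object. The function name `split_response` and the sample reply are hypothetical, made up for illustration.

```python
def split_response(raw):
    """Split a raw HTTP reply into (status_line, headers, body)."""
    # An empty line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Header names are case-insensitive, so store them in lowercase.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical reply, made up for illustration.
sample = (b"HTTP/1.0 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<html></html>")
status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
print(body)
```

The body is left as bytes on purpose, since a reply may carry an image or other binary data rather than text.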
6.1.3. Submitting with GET
A GET request with data embedded in the URL uses a question mark symbol (?)
to separate the web address from the data. In a web browser's address bar, you
can often see URLs that carry information this way.
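As a rough sketch of how such a URL can be put together, the example below uses `urlencode` from Python 3's standard `urllib.parse` module to build the data portion. The helper name `build_get_url`, the address, and the field names are all hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode


def build_get_url(base, params):
    """Append form data to a web address after a question mark."""
    # urlencode joins name=value pairs with '&' and escapes characters
    # that are not allowed in a URL (a space becomes '+').
    return base + "?" + urlencode(params)


# Hypothetical address and field names, for illustration only.
url = build_get_url("http://www.example.com/lookup",
                    {"zip": "67401", "part": "A 100"})
print(url)  # http://www.example.com/lookup?zip=67401&part=A+100
```

Everything after the question mark would then be sent to the server as part of the path on the GET request line.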
6.1.4. Submitting with POST
A POST request is used when additional information needs to be sent as part
of the request, but the volume of data is too large to be included as part
of the URL. The POST request is used when you complete a form on a web page and
then click on a “submit” button.
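To make the difference from GET concrete, here is a minimal sketch of what a POST request message might look like when form data travels in the body instead of the URL. It assumes the common `application/x-www-form-urlencoded` encoding that browsers use for simple forms. The function `build_post_request`, the host, the path, and the field names are hypothetical.

```python
from urllib.parse import urlencode


def build_post_request(host, path, fields):
    """Return the bytes of an HTTP/1.0 POST request carrying form data."""
    # The form data goes in the body, after the empty line that ends the headers.
    body = urlencode(fields).encode("ascii")
    head = ("POST {} HTTP/1.0\r\n"
            "Host: {}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            "Content-Length: {}\r\n"  # tells the server how many body bytes follow
            "\r\n").format(path, host, len(body))
    return head.encode("ascii") + body


# Hypothetical host, path, and form fields, for illustration only.
request = build_post_request("www.example.com", "/order",
                             {"name": "Ann Lee", "qty": "3"})
print(request.decode("ascii"))
```

The resulting bytes could be passed to a socket's `sendall` method in the same way as a GET request.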