Making HTTP requests with sockets in Python

A practical introduction to network programming, from socket configuration to network buffers and HTTP connection modes.

Welcome to the 7th episode of the Networking 101 series! In the previous chapter I spent some time digging into the concept of sockets and the Berkeley sockets interface. This time I want to explore the practical side of network programming by issuing an HTTP request through Python and its socket module.

Sockets: a quick refresher

A socket is a software object that allows programs to exchange data. The most popular socket API is the Berkeley sockets interface, usually implemented by operating systems in low-level languages such as C. I will be using Python for this experiment as its socket module follows very closely the original C implementation, without the memory-related complexities imposed by the C language.

Hypertext Transfer Protocol (HTTP) is a protocol for fetching resources such as HTML documents and is the foundation of the World Wide Web as we know it today. The point of this experiment is to use sockets to send an HTTP request to a web server out there and read its response: in other words, I will write an ultra-primitive web browser.

I haven't touched HTTP in this series yet, but don't worry: it's just a matter of sending and receiving text strings. I do assume, however, that you know how Python works and have some familiarity with the TCP/IP protocol stack. Let's get started!

Python's HTTP request: first attempts

As explained in the previous chapter, a socket must be created and configured first. Then you connect it to a host and start sending/receiving data. Finally you close the socket when you are done with it.

1. Configuring the socket

First, import the socket module:

import socket

Now it's time to create a new socket object through the socket() constructor. It expects two parameters: the socket family and the socket type, chosen from a set of constants that start with the AF_ prefix for the family and the SOCK_ prefix for the type. The full list is available in the Python documentation for the socket module.

What constants should we pick? HTTP is based on the Transmission Control Protocol (TCP), which in turn is based on the Internet Protocol (IP). This means that HTTP is stream-oriented (because of TCP) and wants an IP address to work (because of IP). Those requirements are fulfilled by picking:

  1. the AF_INET constant for the socket family. It stands for Internet Protocol with IPv4 addresses;

  2. the SOCK_STREAM constant for the socket type. We want a stream-based protocol because of TCP.

Those constants are passed to the socket() constructor:

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

We now have a working socket object, sock, configured for HTTP transmission. This type of socket is also known as a stream socket.

2. Connecting the socket to a server

The next step is to choose a web address to connect to. I will be using www.example.com but you can pick whatever you like — just don't abuse it!

In the 5th episode of this series I mentioned how TCP uses special numbers called ports to determine the type of service you want from the server. Web servers usually provide HTTP services on port 80, so I'll pick that too.

The web address and the port number are then passed to the connect() method as a tuple:

sock.connect(("www.example.com", 80))

At this point our socket has established a connection to the web server that is responsible for serving web pages from www.example.com on port 80. Where's the IP address, anyway? The connect() method automatically translates the string www.example.com into the corresponding IP address by issuing a DNS lookup. Don't worry about it for now; I will write an article on how the DNS mechanism works in the future.
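If you are curious about the address hidden behind a name, you can run a lookup yourself with socket.getaddrinfo(). The following is just a minimal sketch: the helper name resolve_ipv4 is my own invention, not something provided by the socket module.

```python
import socket

def resolve_ipv4(host, port):
    """Return the IPv4 addresses that 'host' resolves to, without duplicates."""
    # getaddrinfo() returns tuples of (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is an (ip_address, port) pair.
    results = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for family, socktype, proto, canonname, sockaddr in results:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses
```

Calling resolve_ipv4("www.example.com", 80) should give you a list with one or more IP addresses as strings; the exact values depend on the DNS answer you get at that moment.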

3. Sending data to the server

Here comes the fun. An HTTP communication always starts with a request made by the client (i.e. us!) for the page we want to obtain, followed by some additional information. Such a request is sent as a normal text string and looks like this:

GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n

In words: give me (GET) the index page (/) through HTTP version 1.1 (HTTP comes in multiple versions, 1.1 is OK for our purpose) from the host called www.example.com. Each line ends with \r\n and the request is terminated by an empty line, hence the final \r\n\r\n.

It's now time to send this string to the web server, by calling the send() method on our socket. Normally, data is sent over the Internet in binary form, that is as a bunch of 0s and 1s packed together: this is why the send() method wants bytes as input. So our text string must be converted to bytes first. For a string written directly in the code, this is done by prepending a b to it, which turns it into a bytes literal:

b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n"

The string now has been turned into a sequence of bytes, ready for transmission:

sock.send(b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n")
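The b prefix only works for strings written directly in the code. If you want to assemble the request at runtime, for example to ask a different host for a different page, you can build a regular string and convert it with encode(). Here is a small sketch of the idea: build_request is a hypothetical helper of mine, and I have put a space after "Host:", which is the customary way of writing headers (the version without the space works too).

```python
def build_request(host, path="/"):
    """Assemble a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # the only header HTTP/1.1 requires
        "",                       # empty line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

The result can be passed to send() as it is, for example sock.send(build_request("www.example.com")).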

4. Receiving data from the server

At this point the server should have received our request and is ready to reply back with some data, that is the page we asked for. The data is obtained by calling the recv() method:

response = sock.recv(4096)

The recv() method takes the maximum amount of data to be received at once, in bytes: 4096 should be good enough for now. It returns the data as a byte string, the same format we used for the request. So we can either print it as it is, or turn it into a regular text string with the decode() method:

print(response)             # raw byte string
print(response.decode())    # text string, decoded as UTF-8 (the default)

Either way we should end up with a text string made of the response headers followed by the response body. The former are additional metadata, while the latter is the actual HTML code of the web page you have requested.
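To make the distinction between headers and body more concrete, here is a rough sketch that splits a response into its parts. The split_response helper is hypothetical, and the sample below is a response I wrote by hand for illustration, not what www.example.com actually returns.

```python
def split_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The header section and the body are separated by the first empty line.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # header names are case-insensitive
    return status_line, headers, body

# Hand-written sample response, for illustration only
sample = (b"HTTP/1.1 200 OK\r\n"
          b"Content-Type: text/html\r\n"
          b"Content-Length: 13\r\n"
          b"\r\n"
          b"<p>Hello</p>\n")

status_line, headers, body = split_response(sample)
print(status_line)              # HTTP/1.1 200 OK
print(headers["content-type"])  # text/html
print(body)                     # b'<p>Hello</p>\n'
```

The body is left as bytes on purpose: how to decode it depends on what the server declares in the Content-Type header.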

5. Closing the socket

Once the entire response has been received, close() the socket:

sock.close()

A socket should always be closed properly once you're done with it. And now the code we have produced so far:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("www.example.com", 80))
sock.send(b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n")
response = sock.recv(4096)
sock.close()
print(response.decode())
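If you are afraid of forgetting the close() call, note that socket objects can be used in a with statement, which closes them automatically when the block ends. A possible variant of the program above, wrapped in a function I have called fetch_once (a made-up name), could look like this:

```python
import socket

def fetch_once(host, port, request):
    """Send 'request' and return the first chunk of the reply (up to 4096 bytes)."""
    # The 'with' block closes the socket on exit, even if an exception is raised.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, port))
        sock.send(request)
        return sock.recv(4096)
```

You would call it as fetch_once("www.example.com", 80, b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n") and print the decoded result as before.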

It's a good start, but we can do better: let's see how.

Understanding network buffers

Two things can go wrong in the program above: 1) the request might be sent incompletely, 2) the response might be received incompletely. Let's introduce the concept of network buffers to understand why.

When you move stuff through a socket, data is not transmitted right away through the network card one byte at a time. Instead, the operating system temporarily puts it inside a buffer — a chunk of memory used to hold data while it is being moved from one place to another.

So when you send() something, the operating system copies a piece of your message into the buffer, then transmits it to the outside world when it sees fit. Receiving data works similarly, just the other way around. As bytes sent by the server arrive at your network card, the operating system collects them into another network buffer, where they wait for your app to recv() them.

You have no control over those buffers: they might be empty, partially filled, or completely filled with more data still waiting to be sent or received, for multiple reasons: slow network, busy operating system, servers down, and so on. If you want to be 100% sure you are not missing anything, the only thing you can do is to keep send()ing and recv()ing for as long as there is data to move.
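While you can't peek at what is sitting inside the buffers at any given moment, you can at least ask the operating system how big they are. A quick sketch with getsockopt() follows; the numbers you get depend on your operating system and its configuration, so don't expect any specific value.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    # SO_SNDBUF / SO_RCVBUF report the size, in bytes, of the send and
    # receive buffers the operating system has set up for this socket.
    send_buffer_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    recv_buffer_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print("send buffer:", send_buffer_size, "bytes")
print("receive buffer:", recv_buffer_size, "bytes")
```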

A better send

The send() method returns the number of bytes actually sent: since we know the length of the message we want to send, let's keep track of the total bytes sent on each send() call, compare that value to the length of the message and then send what's left. Something like this:

request = b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n"
sent    = 0
while sent < len(request):
    sent = sent + sock.send(request[sent:])    # Send a portion of 'request', starting from 'sent' byte

Additionally, Python features the sendall() method, which does the same job as the snippet above: use it if you don't want to be bothered by internals.
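One difference worth knowing: sendall() doesn't return a byte count. It returns None when everything has been sent and raises an exception otherwise. The sketch below tries it on a pair of locally connected sockets created with socket.socketpair(), so that no web server is involved; the variable names are mine.

```python
import socket

# socketpair() returns two sockets already connected to each other:
# handy for local experiments that don't need the network.
client, server = socket.socketpair()
with client, server:
    request = b"GET / HTTP/1.1\r\nHost:www.example.com\r\n\r\n"
    result = client.sendall(request)    # None on success, an exception on error
    received = server.recv(4096)        # what arrived on the other side

print(result)
print(received == request)
```

If sendall() fails halfway, there is no way to know how many bytes went out, which is rarely a problem for a one-shot request like ours.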

A better receive

The original code has an additional problem: 4096 bytes might not be enough to store the full response. We can fix that and the buffering issue by looping over recv() as we did above with send(). However things are trickier here, because we don't know exactly how long the incoming message will be beforehand. This is due to the stream-oriented nature of TCP, where data is seen as an unlimited stream of bytes, with no delimiters or other message boundaries.

Not all hope is lost, though. The recv() method returns the bytes it has received, and it returns an empty byte string when the server has terminated the connection. So we can loop over recv() until we get 0 bytes in return. Something like this:

# [...]
response = b""
while True:
    chunk = sock.recv(4096)
    if len(chunk) == 0:     # No more data received, quitting
        break
    response = response + chunk
# [...]

The code should now be ready to handle messages of any length, or at least should be capable of receiving all the data sent by the server. However if you try to run it you will notice that the program gets stuck before printing the final response. Why?

HTTP connection mode versus blocking sockets

The first release of HTTP (version 1.0) uses one socket per transfer. You send a request, the server sends a reply, then it closes the connection. When the connection is closed your socket can be trashed: there's no way to use it again. Want to issue a new request? Just create a new socket and start over again. Over time this approach turned out to be very limited, so in HTTP version 1.1 — the version we are using! — new connection models were created. In HTTP 1.1 all connections are persistent unless declared otherwise: the server keeps the connection alive so that the socket can be reused for additional transmission.

Unfortunately the persistent mode clashes a bit with sockets, which are blocking by default: they pause the program waiting for data to be sent or received. Or better, they block until some data — even a single byte — is available in the network buffers. Since the server never closes the connection in persistent mode, the socket just waits for more data to arrive. No more data will be sent by the server (unless we issue a new request), so the socket hangs forever.

There are three ways to fix the problem, besides reverting to HTTP/1.0: 1) disable the persistent HTTP connection, 2) set a timeout on the socket or 3) read the HTTP response headers to determine when to quit. Let's take a look.

1. Disable the persistent HTTP connection

The Connection header controls whether the connection with the server stays open after the transmission finishes. In HTTP/1.1 the default value is keep-alive. Change it to close in the HTTP request string to mimic the HTTP/1.0 default behavior:

sock.send(b"GET / HTTP/1.1\r\nHost:www.example.com\r\nConnection: close\r\n\r\n")

This way the server will gracefully close the connection once all the data is sent. The socket detects it, recv() returns an empty byte string and the code can make progress.
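Putting the pieces together, a version of our primitive browser that relies on Connection: close might look like the sketch below. The function names recv_until_closed and http_get are my own, and error handling is left out on purpose.

```python
import socket

def recv_until_closed(sock, chunk_size=4096):
    """Collect data from 'sock' until the other side closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:           # empty byte string: the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)

def http_get(host, path="/", port=80):
    """Fetch 'path' from 'host' over plain HTTP, asking the server to close afterwards."""
    request = (f"GET {path} HTTP/1.1\r\n"
               f"Host: {host}\r\n"
               "Connection: close\r\n"
               "\r\n").encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((host, port))
        sock.sendall(request)
        return recv_until_closed(sock)
```

To try it out, call print(http_get("www.example.com").decode()). Collecting the chunks in a list and joining them at the end is just a small optimization over concatenating byte strings on every iteration.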

2. Set a timeout on the socket

You can decide how long the socket should block before giving up. This is done by calling settimeout() on the socket object during configuration. Now all socket operations (connect(), send(), recv(), ...) will raise a socket.timeout exception if they take more time than requested. You can catch that error while calling recv() in the while loop and interpret it as the end of data. For example:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)  # In seconds. Choose a value that makes sense to you
# [...]
response = b""
try:
    while True:
        chunk = sock.recv(4096)
        if len(chunk) == 0:     # Connection closed by the server, quitting
            break
        response = response + chunk
except socket.timeout:
    print("Time out!")
# [...]

Note: in Python 3.10 and later, socket.timeout is just a deprecated alias of the TimeoutError exception.

3. Read the HTTP response headers to determine when to quit

The Content-Length response header indicates the size of the message body, in bytes, that the server is sending you back. The idea here is to keep checking the incoming data for the presence of that header and read its value when available. Then you can stop recv()ing data from the socket as soon as the body you have received is as long as the value reported by Content-Length.

Parsing HTTP responses is a bit annoying, but doable even without a library. Headers are separated from each other by \r\n and the body (i.e. the actual HTML code) starts after the first \r\n\r\n. Obviously this approach only works with HTTP: other protocols might send the content length in a different format, or might not send it at all.
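Here is how the idea could be sketched in code. Take it as a starting point rather than a complete HTTP parser: read_http_response is a hypothetical helper, it assumes the server always sends a Content-Length header, and it gives up with an error when the header is missing (which happens, for instance, when the server uses chunked transfer encoding).

```python
import socket

def read_http_response(sock, chunk_size=4096):
    """Read one HTTP response from 'sock' using Content-Length; return (head, body)."""
    data = b""
    # 1. Keep receiving until the empty line that ends the headers shows up.
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(chunk_size)
        if not chunk:
            raise ConnectionError("connection closed before the headers were complete")
        data += chunk
    head, _, body = data.partition(b"\r\n\r\n")

    # 2. Look for the Content-Length header (header names are case-insensitive).
    content_length = None
    for line in head.split(b"\r\n")[1:]:        # [1:] skips the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            content_length = int(value.strip())
    if content_length is None:
        raise ValueError("no Content-Length header in the response")

    # 3. Receive the part of the body that is still missing, if any.
    while len(body) < content_length:
        chunk = sock.recv(chunk_size)
        if not chunk:
            raise ConnectionError("connection closed before the body was complete")
        body += chunk
    return head, body[:content_length]          # anything past the body is dropped
```

Since the function stops as soon as the body is complete, it returns even if the server keeps the connection open, which is exactly what we need in persistent mode.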

Final notes

This article is meant as a practical introduction to Berkeley sockets over a stream-based protocol. I've just scratched the surface of the topic; the following is a list of cool things to keep in mind:

  • All socket methods used along the way can raise other exceptions besides socket.timeout. I left error handling out for brevity, but you should catch them and act accordingly;

  • A socket can be set to non-blocking mode. All socket operations (connect(), send(), recv(), ...) no longer wait in this mode: they return immediately as soon as you call them, or raise an exception if the operation can't be completed right away. This solves the problem of hanging sockets, however writing a correct program with non-blocking functions is quite tricky. I will try to rethink the code seen so far in a non-blocking way in one of the next articles;

  • Our socket was configured to run over IPv4. You can switch it to IPv6 by setting the socket family to AF_INET6 during the configuration. The code changes a little bit though, especially the connect() part. This example in the official documentation shows how to adapt it;

  • We issued an HTTP request, but nowadays almost everybody uses HTTPS, the secure (that is, encrypted) version of HTTP. The code doesn't change much, but you need to encrypt and decrypt the data. Python's ssl module helps with that and I will probably write an article about it in the future.
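As a small taste of the non-blocking mode mentioned in the list above, the sketch below shows what happens when you recv() from a non-blocking socket that has nothing to offer yet. It uses socket.socketpair() to get two connected sockets without touching the network; the variable names are mine.

```python
import socket

# socketpair() gives two connected sockets, so no network access is needed
left, right = socket.socketpair()
with left, right:
    left.setblocking(False)             # switch 'left' to non-blocking mode

    try:
        left.recv(4096)                 # nothing has been sent yet...
        would_block = False
    except BlockingIOError:             # ...so recv() raises instead of waiting
        would_block = True

    right.sendall(b"ping")
    left.setblocking(True)              # back to blocking mode for a plain read
    data = left.recv(4096)

print(would_block, data)                # True b'ping'
```

A real non-blocking program would not switch back to blocking mode: it would retry later or wait for the socket to become readable, which is where things get tricky.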

Sources

Python docs — socket — Low-level networking interface
Python docs — Socket Programming HOWTO
Beej's Guide to Network Programming
MDN Web Docs — Connection
RPG IV Socket Tutorial — 6.5. Blocking vs. non-blocking sockets
StackOverflow — What is the difference between socket.send() and socket.sendall()?
StackOverflow — How does the python socket.recv() method know that the end of the message has been reached?
StackOverflow — How large should my recv buffer be when calling recv in the socket library
StackOverflow — Why is it assumed that send may return with less than requested data transmitted on a blocking socket?
SuperUser — Do socket and buffer mean the same thing?
