Skip to main content

TIL: Checking if a URL has changed when you fetch it over HTTP

When you make an HTTP request, you can use the If-Modified-Since header to get a 304 Not Modified if nothing has changed since your last request.

I was writing a small crawler for my RSS feed, and I know my RSS feed doesn’t change that often, so I was looking for a way to speed up fetches when nothing has changed.

I had a vague recollection that there’s something for this in vanilla HTTP, and that led me to rediscover the If-Modified-Since header. If you send this header in your request, and the URL hasn’t changed, you get an HTTP 304 Not Modified rather than a 200 OK:

$ curl -s -o /dev/null -w "%{http_code}\n" "https://alexwlchan.net/atom.xml"
200

$ curl -s -o /dev/null -w "%{http_code}\n" -H "If-Modified-Since: Tue, 2 Apr 2024 19:05:00 GMT" "https://alexwlchan.net/atom.xml"
304

In my limited testing, adding the If-Modified-Since header made my requests a bit faster – not a lot, but not zero. Handy!

Getting the format of the header right

The header is very particular about how you format the date. Here’s the syntax from MDN:

If-Modified-Since: <day-name>, <day> <month> <year> <hour>:<minute>:<second> GMT

And here’s their example:

If-Modified-Since: Wed, 21 Oct 2015 07:28:00 GMT

One thing of note is that the timezone is always GMT: HTTP dates are always expressed in GMT, never in local time. When I was testing against my site hosted on Netlify, it would ignore the timezone and treat it as GMT regardless.

Here’s how to construct the header in various languages:

Some Python snippets

I was writing my feed crawler in Python, so I ended up building this behaviour around httpx, the HTTP client I was using at the time.

Here’s a little snippet I wrote to test this behaviour:

import datetime

import httpx


client = httpx.Client()

resp = client.get("https://alexwlchan.net/atom.xml")
print(resp)  # <Response [200 OK]>

now = datetime.datetime.now(datetime.UTC)
client.headers["If-Modified-Since"] = now.strftime("%a, %d %b %Y %H:%M:%S %Z")

resp = client.get("https://alexwlchan.net/atom.xml")
print(resp)  # <Response [304 Not Modified]>

I haven’t settled on a good pattern for using these 304 responses. In some cases, it’s enough to know that I got a 304 and nothing needs to be done. In other cases, I might need to retrieve a previously cached response from somewhere. I’m not entirely sure how I’ll use it.