Implementing an URL shortener using Django/Python

Recently, I implemented an URL shortener service and this post describes the design decisions that I took while implementing it. I have always been in love with Python because of it’s offerings that enable developers to write beautiful code. I decided to use the Django web framework for writing the URL shortener as it is simple enough to easily get started with.

The specs are as below. This webservice will expose the following APIs:

POST /shorten
Parameters: Parameter link contains the link to shorten.
Returns: Id for the shortened link in text/plain format.

GET /{id}
Returns: 301 redirects the user agent to a previously stored URL. 404 error if no link stored with given id.

The core functionality that such a service provides is that it shortens an URL as much as possible, while still being fast in it’s lookups. The obvious way to implement the shortening is to use a classical computer science concept — Hashing. The advantages of using a hashtable are:

  • All traditional hash functions such as MD5, SHA-1, CRC32 provide collision resistance to a significant level. This guarantees that no two different urls will have the same shortened url until we accumulate a considerable amount of URLs in our service. For example, MD5 takes any arbitrary length of string and outputs 128 bits. The Birthday problem tells us that there will be a collision only after 2^64 URLs. That is, after 18,446,744,073,709,551,616 URLs.
  • If we maintain a hashtable, the lookups are really fast. If we maintain an in-memory datastructure, the lookup is O(1). If we decide to store it in an RDBMS, we can create an index for the shortened id column.

But the obvious problem is that the output is not short enough. MD5 provides a 32 length hex string. This makes hashing unusable for our URL shortening.

Another approach is to use a simple rolling index, where we start with value 1, and we keep incrementing by 1 for every new URL. This index can be converted to an alphanumeric string (a-zA-Z0-9) by doing a simple base 62 encoding. The vice-versa is also simple, we just need to decode the string to get the index. This way, we can start with 1 letter shortened URL and when we run out of one-letter URLs, we can move to 2-letter URLs and so on.

Implementing the base 62 encoding/decoding is simple:

import math

ALLOWED_ALPHABETS = [
	'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s',
	't','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L',
	'M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','1','2','3','4','5',
	'6','7','8','9','0'
]

string_to_digit_map = {}

ctr = 0
for alpha in ALLOWED_ALPHABETS:
	string_to_digit_map[alpha] = ctr
	ctr = ctr + 1

def convert_to_base62(id):
	base62_string = ''
	while id > 0:
		rem = id % 62
		id = id / 62
		base62_string = ALLOWED_ALPHABETS[rem] + base62_string
	return base62_string

def convert_from_base62(string):
	res = 0
	digit_ctr = len(string) - 1
	for char in string:
		if char not in string_to_digit_map:
			return -1
		res = res + string_to_digit_map[char] * int(math.pow(62, digit_ctr))
		digit_ctr = digit_ctr - 1
	return res

Using this approach, the rolling index doesn’t need any special logic. An auto-increment primary key in an RDBMS will do the job. So, the model for our URL class would look like this:

from django.db import models

class Url(models.Model):
	"""
	Model representing each url shortened. The primary key (auto-generated) becomes
	the shortened url.
	"""

	url = models.TextField()
	date_created = models.DateTimeField()
	date_last_accessed = models.DateTimeField(null=True)
	visits = models.IntegerField(null=True)

With the models in place, we come to the controller (which are called ‘views’ in Django). Our controller just has two methods, one to shorten a given URL and another to redirect to the original URL, given a shortened URL. In the shorten method, we have to make sure that the URL contains a scheme (http/https) and if it doesn’t we add a default ‘http’ scheme. Now, we also need to check if we already have a shortened URL for the given URL, so that we don’t have duplicate shortened URLs for the same URLs.

import urllib2
from django.shortcuts import render, get_object_or_404
from django.shortcuts import redirect as http_redirect
from django.views.decorators.http import require_http_methods
from django.http import HttpResponse, HttpResponsePermanentRedirect, HttpResponseBadRequest
from shorten.models import Url
from shorten.conversions import convert_to_base62, convert_from_base62
from datetime import datetime
from django.http import Http404
from urlparse import urlparse

def redirect(request, query):
	print query
	id = convert_from_base62(query)
	if id < 0:
		#Unknown shortened url
		raise Http404("Url does not exist")

	url = get_object_or_404(Url, id=id)
	return HttpResponsePermanentRedirect(url.url)


@require_http_methods(["POST"])
def shorten(request):
	if request.META.get('CONTENT_TYPE') != "application/x-www-form-urlencoded":
		print "Not application/x-www-form-urlencoded"

	submitted_url = request.POST.get('link')
	if submitted_url == None or len(submitted_url) == 0:
		return HttpResponseBadRequest("No url specified.")

	#Make sure we have a scheme for the url. Else add http as default.
	url_parsed = urlparse(submitted_url.strip())
        if url_parsed.scheme == '':
            submitted_url = "http://"+ submitted_url
	print "Shorten request:", submitted_url

	#Check if we already have already shortened this URL
	url = None

	try:
		url = Url.objects.get(url=submitted_url)
	except Url.DoesNotExist:
		print "New Url"

	if url != None:
		#We already have a shortened url for the given url
		return HttpResponse(convert_to_base62(url.id))

	url = Url()
	url.url = submitted_url
	url.date_created = datetime.now()
	url.save()
	return HttpResponse(convert_to_base62(url.id))

And finally for the routes (URL mapping for the web service):

from django.conf.urls import patterns, include, url
import shorten.views

urlpatterns = patterns('',
    url(r'shorten/$', 'shorten.views.shorten'),
    url(r'shorten$', 'shorten.views.shorten'),
    url(r'(?P<query>\w+)', 'shorten.views.redirect'),
)

The last regex matches any ‘word’ (represented by \w+) and should be the last route, since the other mappings should take precedence.

The full project is available in GitHub at https://github.com/c05mic/django-url-shortener

Would love your comments on the implementation.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s