Recently, I implemented an URL shortener service and this post describes the design decisions that I took while implementing it. I have always been in love with Python because of it’s offerings that enable developers to write beautiful code. I decided to use the Django web framework for writing the URL shortener as it is simple enough to easily get started with.
The specs are as below. This webservice will expose the following APIs:
POST /shorten
Parameters: Parameter link contains the link to shorten.
Returns: Id for the shortened link in text/plain format.
GET /{id}
Returns: 301 redirects the user agent to a previously stored URL. 404 error if no link stored with given id.
The core functionality that such a service provides is that it shortens an URL as much as possible, while still being fast in it’s lookups. The obvious way to implement the shortening is to use a classical computer science concept — Hashing. The advantages of using a hashtable are:
- All traditional hash functions such as MD5, SHA-1, CRC32 provide collision resistance to a significant level. This guarantees that no two different urls will have the same shortened url until we accumulate a considerable amount of URLs in our service. For example, MD5 takes any arbitrary length of string and outputs 128 bits. The Birthday problem tells us that there will be a collision only after 2^64 URLs. That is, after 18,446,744,073,709,551,616 URLs.
- If we maintain a hashtable, the lookups are really fast. If we maintain an in-memory datastructure, the lookup is O(1). If we decide to store it in an RDBMS, we can create an index for the shortened id column.
But the obvious problem is that the output is not short enough. MD5 provides a 32 length hex string. This makes hashing unusable for our URL shortening.
Another approach is to use a simple rolling index, where we start with value 1, and we keep incrementing by 1 for every new URL. This index can be converted to an alphanumeric string (a-zA-Z0-9) by doing a simple base 62 encoding. The vice-versa is also simple, we just need to decode the string to get the index. This way, we can start with 1 letter shortened URL and when we run out of one-letter URLs, we can move to 2-letter URLs and so on.
Implementing the base 62 encoding/decoding is simple:
import math ALLOWED_ALPHABETS = [ 'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s', 't','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L', 'M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','1','2','3','4','5', '6','7','8','9','0' ] string_to_digit_map = {} ctr = 0 for alpha in ALLOWED_ALPHABETS: string_to_digit_map[alpha] = ctr ctr = ctr + 1 def convert_to_base62(id): base62_string = '' while id > 0: rem = id % 62 id = id / 62 base62_string = ALLOWED_ALPHABETS[rem] + base62_string return base62_string def convert_from_base62(string): res = 0 digit_ctr = len(string) - 1 for char in string: if char not in string_to_digit_map: return -1 res = res + string_to_digit_map[char] * int(math.pow(62, digit_ctr)) digit_ctr = digit_ctr - 1 return res
Using this approach, the rolling index doesn’t need any special logic. An auto-increment primary key in an RDBMS will do the job. So, the model for our URL class would look like this:
from django.db import models class Url(models.Model): """ Model representing each url shortened. The primary key (auto-generated) becomes the shortened url. """ url = models.TextField() date_created = models.DateTimeField() date_last_accessed = models.DateTimeField(null=True) visits = models.IntegerField(null=True)
With the models in place, we come to the controller (which are called ‘views’ in Django). Our controller just has two methods, one to shorten a given URL and another to redirect to the original URL, given a shortened URL. In the shorten method, we have to make sure that the URL contains a scheme (http/https) and if it doesn’t we add a default ‘http’ scheme. Now, we also need to check if we already have a shortened URL for the given URL, so that we don’t have duplicate shortened URLs for the same URLs.
import urllib2 from django.shortcuts import render, get_object_or_404 from django.shortcuts import redirect as http_redirect from django.views.decorators.http import require_http_methods from django.http import HttpResponse, HttpResponsePermanentRedirect, HttpResponseBadRequest from shorten.models import Url from shorten.conversions import convert_to_base62, convert_from_base62 from datetime import datetime from django.http import Http404 from urlparse import urlparse def redirect(request, query): print query id = convert_from_base62(query) if id < 0: #Unknown shortened url raise Http404("Url does not exist") url = get_object_or_404(Url, id=id) return HttpResponsePermanentRedirect(url.url) @require_http_methods(["POST"]) def shorten(request): if request.META.get('CONTENT_TYPE') != "application/x-www-form-urlencoded": print "Not application/x-www-form-urlencoded" submitted_url = request.POST.get('link') if submitted_url == None or len(submitted_url) == 0: return HttpResponseBadRequest("No url specified.") #Make sure we have a scheme for the url. Else add http as default. url_parsed = urlparse(submitted_url.strip()) if url_parsed.scheme == '': submitted_url = "http://"+ submitted_url print "Shorten request:", submitted_url #Check if we already have already shortened this URL url = None try: url = Url.objects.get(url=submitted_url) except Url.DoesNotExist: print "New Url" if url != None: #We already have a shortened url for the given url return HttpResponse(convert_to_base62(url.id)) url = Url() url.url = submitted_url url.date_created = datetime.now() url.save() return HttpResponse(convert_to_base62(url.id))
And finally for the routes (URL mapping for the web service):
from django.conf.urls import patterns, include, url import shorten.views urlpatterns = patterns('', url(r'shorten/$', 'shorten.views.shorten'), url(r'shorten$', 'shorten.views.shorten'), url(r'(?P<query>\w+)', 'shorten.views.redirect'), )
The last regex matches any ‘word’ (represented by \w+) and should be the last route, since the other mappings should take precedence.
The full project is available in GitHub at https://github.com/c05mic/django-url-shortener
Would love your comments on the implementation.