Why MySQL full text search WAS good enough

If you are an “afetionated” PornArk user, surely you noticed some slowdown when waiting for search results happening during last days.

Well that’s story: we have just deployed the new search system with best and faster results, try it!

In this post, we will disclose the “good enough” approach we used to serve search results till a week ago and why it became insufficient. Server side PornArk has started as a classic Django+MySQL web application, so that using MySQL full-text search capabilities was a reasonable and rapid-to-develop solution for our needs. MySQL offers an interesting full-text search on text fields once a particular index is specified. We started indexing fields such as “title” and then added further informations like votes and categorization. We wanted essentially the following characteristics:

  1. full-text search on titles/categories;
  2. ranking results on full-text match quality, votes and categorization quality for category matches;
  3. a certain degree of tolerance for wrong spelling

Our schema contained two main tables:

  • Video (title, votes balance);
  • Category (video ref, name, weight);

Each video could have associated zero or more Category entries, where there was uniqueness on the pair (video ref, name), with a weight; the weight counted the amount of check on the category check box made by the users in the details of the video.

This was enough for point 1. and 2., but not for 3.. Looking around there are two main ways to make approximated search:

  • soundex (a function witch takes a word and outputs its n-soundex, that is a string long n characters giving a phonetic representation of the word; eg: booooobs and boobs both have 4-soundex B200);
  • stemming (considering the “stemmed” form of words, that is the common root to words; eg: boob is common root to boobies and boobs).

Both methods depend on the language of reference. PornArk indexed videos mostly have English titles and categories were in English too. Implementations are commonly available for the English language. In order to use soundex we introduced a new field into the Video table, called title_soudnex, containing the title with 4-soundex of its words. As an example, given a title such as “nice busty woman” the corresponding title_soundex was “N200 B230 W500”.

A first step is the creation of a fulltext index with:

CREATE FULLTEXT INDEX index_name ON table_name (field1_name, field2_name);

We created such an index on title and title_soundex fields of the Video table.

Now the final search query (with some Python glue):

cat_part = ''
if cats is not None:
    cat_part = (' + (SELECT COALESCE(SUM(c.counter), 0) FROM category c WHERE v.id=c.vid_id AND (' +
                ' OR '.join(['c.name="' + cats[i:i + 2] + '"' for i in range(0, len(cats), 2)]) +
                '))')
title_fullsearch = 'MATCH(v.title) AGAINST ("' + kwds + '" IN NATURAL LANGUAGE MODE)'
soundex_fullsearch = 'MATCH(v.soundex) AGAINST ("' + (' '.join([soundex(k) for k in kwds_list])) + '" IN BOOLEAN MODE)'
sql = ('SELECT v.*, v.votes' + cat_part + ' + ' + title_fullsearch + ' + ' + soundex_fullsearch +
           ' AS score FROM video v ' +
           'WHERE ' + title_fullsearch + ' OR ' + soundex_fullsearch + ' ' +
           'ORDER BY score DESC LIMIT ' + str(base) + ', ' + str(MAX_RESULTS))
q = Video.objects.raw(sql)
  • kwds_list is the list of words in the query string;
  • cats is a (possibly empty) list of category names found (we had a phase of preprocessing) within keywords;
  • kwds is the phase to search;
  • soundex() the function that implement soundex in Python.

A typical generated query would be:

SELECT v.*, v.votes + (
  SELECT COALESCE(SUM(c.counter), 0)
    FROM category c
    WHERE v.id=c.vid_id AND (c.name="anal" OR c.name="busty") + 
  MATCH(v.title) AGAINST ("nice anal busty" IN NATURAL LANGUAGE MODE) +
  MATCH(v.soundex) AGAINST ("N200 A540 B230" IN BOOLEAN MODE) AS score
FROM video v
WHERE MATCH(v.title) AGAINST ("nice anal busty" IN NATURAL LANGUAGE MODE)
  OR MATCH(v.soundex) AGAINST ("N200 A540 B230" IN BOOLEAN MODE) AS score
ORDER BY score DESC LIMIT 0, 50;

We obtain results score as the sum of votes, match rank, soundex match rank, categories matches counter. The score components were weighted with constants. We used Django raw queries facilities in the end.

With our video db growing and growing this method became slower and slower making the exposed solution no more a viable one. The PornArk you can see right now is using a full-fledged full-text search engine with both stemming and soundex and ranking tags (are obviously more dynamic) instead of categories. We chose Sphinx to support our new search system. It turned out to be a great tool that well fits our needs. Details to follow in near future ;).

How to make Google index AJAX content (with Django)

Being Pornark a single GWT application that communicates via AJAX with the Django backend, we absolutely needed a way for our contents to be indexed by search engines. Looking around we found the Google AJAX crawling specification. The idea is to have a separated dummy version of the website, that bots can find via www.pornark.com?_escaped_fragment_= that is crawlable by Google. The specification is clear enough and is not our intention explain further. Rather we want to share how we followed it using Django.

Having the all content generated client side, we needed a plain version of the site, accessible through the _escaped_fragment_ way proposed by Google. The naive way is to add a Django application to manage that kind of plain web pages, with a master view to manage explicitly the presence of the _escaped_fragment_ url variable. Anyway we wanted to build the “dummy” site as if it was a normal site with its own clean urls.py and so on.

A tipical link generated by Pornark is in the form:

http://www.pornark.com/#!/videos/165120/Good-Outdoor-Fun (NSFW)

so that, following the specification, Google indexes:

http://www.pornark.com/?_escaped_fragment_=/videos/165120/Good-Outdoor-Fun (NSFW)

Essentially we wanted to create a Django application agnostic of the presence of the _escaped_fragment_ url parameter.

The solution, a function we called queryresolve follows:

from django.core.urlresolvers import resolve

def queryresolve(request, queryparam, urlpatterns):
    '''
    Allow url dispatching based on content of a query string parameter.
    '''
    r = resolve(request.GET[queryparam] or '/', urlpatterns)
    return r[0](request, *r[1], **r[2])

this could make us free to implement our dummy site as if it was a normal site, building the urls.py only considering the value of _escaped_fragment_.

Given the urls.py in the dummy site application package (say pornark.dummy) and the queryresolve being defined within it, this is how we enabled the magic in our project-level urls.py:

from django.conf.urls.defaults import patterns, include

urlpatterns = patterns('',
    (r'^', include('pornark.video.urls')),
    (r'^', 'pornark.dummy.urls.queryresolve', {'queryparam': '_escaped_fragment_', 'urlpatterns': 'pornark.dummy.urls'}),
)

Note that the queryresolve function is absolutely generic toward the chosen url parameter. You can use it to dispatch views of a Django application based on the value of an arbitrary url parameter.

PornArk - Starting up in porn with Python and Django

I started playing with aggregators few years ago. Initially I was building crawlers and HTML scrapers with PHP, then I fell in love with Python and everything changed.

I was bored of preparing computer science exams and I started working on the crazy idea of building something that really could get lots of visitors: everything started as a joke with friends, and nobody was taking it very seriously. The first version of PornArk was just a boring list of all videos available on youtube-like porn websites.

A year after, I met a guy at my university that shared with me great passion for web development and modern technologies (those things your are not exposed to during hard - and usefull - theoretical computer science courses). We were working at the time on various ideas with Django framework and building up what now is a quite strong knowledge of how to set up a modern website ready to host a lot of users (caching, db optimization, ful text search, facebook connect, open auth, extending Django models with MongoDB and so on).

The big step has been to join the power of "post ruby-on-rails frameworks" with Google Widget Toolkit: GWT is an amazing tool, and after a long time squeezing it capabilities to build PornArk, I realize how great it is. Writing in Java and compiling in Javascript allowed us to test every interface we wanted, without exploding the complexity of the source, up to the current version of the graphic interface, which still has problems (with Explorer for example...) but which is absolutely great.

We ended up, in few months, with what we really like to call the GMail of porn  :) actually we tried to do with porn what Hipmunk has done for flight search: most tube porn websites are slow, filled up with banners and confusing. Pornark has some great features:

  • It's freaking fast, without banners
  • Doesn't reload a single page: it dinamically attachs and detachs flash player objects into the DOM 
  • Allows multiple video players on the same page :)
  • Has fast and cathegorized search of videos

We are still working on that, and we have really good feedback from users up to now.  

We didn't really understand whether and how we could make money out of it, specially avoiding banners of all sorts. But, after a short time, and being mentioned by tblop.com (a big list of good porn website) we received few requests of affiliation, from companies producing video contents willing to be on pornark. We are also trying to make a premium version with more videos and additional services, but it's still very experimental.

In the next posts we are going to explain in detail how PornArk is technically designed and how we can manage everything with a single Linode VPS with good performances (short answer: memcache everything we can).

Hope to hear some other feedbacks for now :)  If you have suggestions or question, don't hesitate! feel free to mail to pornark on gmail.