Why MySQL full text search WAS good enough
If you are an “afetionated” PornArk user, surely you noticed some slowdown when waiting for search results happening during last days.
Well that’s story: we have just deployed the new search system with best and faster results, try it!
In this post, we will disclose the “good enough” approach we used to serve search results till a week ago and why it became insufficient. Server side PornArk has started as a classic Django+MySQL web application, so that using MySQL full-text search capabilities was a reasonable and rapid-to-develop solution for our needs. MySQL offers an interesting full-text search on text fields once a particular index is specified. We started indexing fields such as “title” and then added further informations like votes and categorization. We wanted essentially the following characteristics:
- full-text search on titles/categories;
- ranking results on full-text match quality, votes and categorization quality for category matches;
- a certain degree of tolerance for wrong spelling
Our schema contained two main tables:
- Video (title, votes balance);
- Category (video ref, name, weight);
Each video could have associated zero or more Category entries, where there was uniqueness on the pair (video ref, name), with a weight; the weight counted the amount of check on the category check box made by the users in the details of the video.
This was enough for point 1. and 2., but not for 3.. Looking around there are two main ways to make approximated search:
- soundex (a function witch takes a word and outputs its n-soundex, that is a string long n characters giving a phonetic representation of the word; eg: booooobs and boobs both have 4-soundex B200);
- stemming (considering the “stemmed” form of words, that is the common root to words; eg: boob is common root to boobies and boobs).
Both methods depend on the language of reference. PornArk indexed videos mostly have English titles and categories were in English too. Implementations are commonly available for the English language. In order to use soundex we introduced a new field into the Video table, called title_soudnex, containing the title with 4-soundex of its words. As an example, given a title such as “nice busty woman” the corresponding title_soundex was “N200 B230 W500”.
A first step is the creation of a fulltext index with:
CREATE FULLTEXT INDEX index_name ON table_name (field1_name, field2_name);
We created such an index on title and title_soundex fields of the Video table.
Now the final search query (with some Python glue):
cat_part = '' if cats is not None: cat_part = (' + (SELECT COALESCE(SUM(c.counter), 0) FROM category c WHERE v.id=c.vid_id AND (' + ' OR '.join(['c.name="' + cats[i:i + 2] + '"' for i in range(0, len(cats), 2)]) + '))') title_fullsearch = 'MATCH(v.title) AGAINST ("' + kwds + '" IN NATURAL LANGUAGE MODE)' soundex_fullsearch = 'MATCH(v.soundex) AGAINST ("' + (' '.join([soundex(k) for k in kwds_list])) + '" IN BOOLEAN MODE)' sql = ('SELECT v.*, v.votes' + cat_part + ' + ' + title_fullsearch + ' + ' + soundex_fullsearch + ' AS score FROM video v ' + 'WHERE ' + title_fullsearch + ' OR ' + soundex_fullsearch + ' ' + 'ORDER BY score DESC LIMIT ' + str(base) + ', ' + str(MAX_RESULTS)) q = Video.objects.raw(sql)
kwds_listis the list of words in the query string;catsis a (possibly empty) list of category names found (we had a phase of preprocessing) within keywords;kwdsis the phase to search;soundex()the function that implement soundex in Python.
A typical generated query would be:
SELECT v.*, v.votes + ( SELECT COALESCE(SUM(c.counter), 0) FROM category c WHERE v.id=c.vid_id AND (c.name="anal" OR c.name="busty") + MATCH(v.title) AGAINST ("nice anal busty" IN NATURAL LANGUAGE MODE) + MATCH(v.soundex) AGAINST ("N200 A540 B230" IN BOOLEAN MODE) AS score FROM video v WHERE MATCH(v.title) AGAINST ("nice anal busty" IN NATURAL LANGUAGE MODE) OR MATCH(v.soundex) AGAINST ("N200 A540 B230" IN BOOLEAN MODE) AS score ORDER BY score DESC LIMIT 0, 50;
We obtain results score as the sum of votes, match rank, soundex match rank, categories matches counter. The score components were weighted with constants. We used Django raw queries facilities in the end.
With our video db growing and growing this method became slower and slower making the exposed solution no more a viable one. The PornArk you can see right now is using a full-fledged full-text search engine with both stemming and soundex and ranking tags (are obviously more dynamic) instead of categories. We chose Sphinx to support our new search system. It turned out to be a great tool that well fits our needs. Details to follow in near future ;).
