Ranking Items in a Database
At work recently, I was tasked to create a system that ranked items in a database, from least to greatest based on time. Originally I was quite blinded by the original code, which had used three nested queries, and a dozen variables of impenetrable names, a set of code I won’t be posting. Now, this isn’t a very hard task – but let me document the path through with I got to the simple solution. This idea wasn’t immediately obvious to me, or someone else who had quite a bit more experience on the topic – but here we are.
The Dataset
The dataset that I’m ranking is much more complex, however the concept is still the same.
response ----------------------------------- id | INT(11) AUTO_INCREMENT NOT NULL time | FLOAT NOT NULL meta | VARCHAR(255)
Temporary Tables…Ish.
My first idea was to have a table that would be regenerated every $increment (probably an hour, maybe a minute when we were small.) These tables would be pretty simple, just:
response_ranking ----------------------------------------- id | INT(11) AUTO_INCREMENT NOT NULL response_id | INT(11) NOT NULL
And then in order to populate the database with the appropriate ranks, we would run:
-- Truncate the table TRUNCATE TABLE response_rank; -- Put the data into the table in order by time. INSERT INTO response_rank (response_id, time) SELECT id AS response_id, time FROM response ORDER BY time;
The SQL code would take the data from the response table in order of time, fastest to slowest, and then insert it into the response_rank table. The response_rank table’s ID would then directly be their rank amongst that dataset. The first record would be the fastest, the 257th record is the 257th fastest, etc. In order to find the rank for a particular record, the query is quite simple too:
SELECT id AS rank FROM response_rank WHERE response_id = ?;
Why this is a Bad Idea ™
That’s a terrible idea. There are a number of problems with it, and to just name a few:
- The dataset in the
response_ranktable is immediately out of date, even the first time someone runs a report. - With a large number of records in the response table, regenerating
response_rankcould take ages. - It’s just too complicated for what I want to do.
How To Do It… Better.
Now this may not be the most efficient way to do it, however I imagine it’s much closer to a Good Idea ™.
Instead of regenerating indexes or pushing massive amounts of data, let’s use a simple database operation which the database can be easily optimized to do, if it isn’t already by adding an index to the time column.
I present to you, The Better Way:
SELECT count(1) AS rank FROM response WHERE time < ?;
When you run the query, rank is exactly the rank within the current dataset, with nothing waiting to be aggregated into the database, or an index needing to be regenerated.
Now I know, this is stupidly simple, but it wasn’t extremely obvious to me or my coworker – so maybe someone else is stuck on such a simple problem too. However, this also may not be the most efficient way to do it either – if there’s a better way, I’d love to know.