Every once in a while we come across a problem that requires a different way of thinking to solve it. I ran into this tonight with what should have been a simple database query. The solution took me in an unexpected direction, but seems to be aligning nicely with the rest of the application.
I'm working on a system that has "jobs", and each job record will have a number of "activities". Each activity will have a date and an "activitytype" relationship. The activitytype says what kind of activity has occurred, and whether that activity type is considered a "status". The task at hand was to find the "status" for each job and when that status changed.
With a traditional SQL approach we would create a query to join these tables, and then do a sub-select to find the most recent activity that is a status. If we only ever needed to do this sub select for a single job at any given time, the performance hits of a sub-select would be a non issue. But, when you have a half second delay for the sub-select (for example), spread over hundreds of thousands of possible calls (i.e. all the jobs), then performance becomes a problem. A different way needs to be found.
And here is where I turned to the #postgresql IRC channel looking for tips. I was quickly pointed to a new topic called "window functions". Window functions are available in the more recent versions of PostgreSQL (version 8.4 or newer), and "performs calculations across a set of table rows that are related to the current row" (http://www.postgresql.org/docs/8.4/static/tutorial-window.html). That is a very abstract statement that becomes clearer after working with window functions a little.
I think the biggest benefit to window functions (for my immediate needs) is that we can now do aggregate functions on a field WITHOUT needing to use Group By for the non-aggregate fields. Consider this:
select id,
count(id)
from mytable
This kind of query throws an error. The "id" column needs to also be an aggregate function, or captured in a "group by". So for the above to work, it has to be changed to
select id,
count(id)
from mytable
group by id
But, sometimes this is not quite what we want. Especially when we just need a simple count, or average within a long list of non-aggregate columns. Windowed Functions allows us to change that query to look like this:
select id,
count(id) OVER ( PARTITIONED BY id )
from mytable
And that works like a charm. (Ignore the details of the syntax for now - it makes more sense when you (carefully) read through the documentation.)
I'll let you go off and read up on Windowed Functions in as much detail as you need, because I'm sure I would steer you wrong. I'm still too new to the topic. But, I will mentioned that there are pre-built windowed functions you can call as needed - see http://www.postgresql.org/docs/8.4/static/functions-window.html for more details.
So armed with this knowledge, I spent some time seeing if I could resolve my status issue. And well, WOW! I'm very impressed. With the traditional sub-select method, I was seeing a 700+ second query (I stopped it before it completed) to gather the status info. With the query similar to the one below I was able to get the status and status date for EVERY job (approx 300k records) in about 4 seconds. Here's the magic:
select s.job,
s.name as "status",
s.status_date
from (
select t.name,
a.job,
a.activity_date,
max(a.activity_date) OVER ( PARTITION BY a.job ) as "status_date"
from activities a
left join activitytypes t on a.activitytype = t.id
where
t.status is true
) as s
where s.activity_date = s.status_date
First, we create a select statement to give us the information we need - the activitytype name, the job id (pulled from the activities table), the date of the activity, and the maximum date in the activities table for the current job. Then we limit that query to say we only care about activitytypes that have the status flag set to true. This base query gives us a row for every status activity that has happened for the job. And does so very quickly.
Next we take that base query and treat it as a sub-select. We ask for the main information we are after - the job id, the activity name (aka the status), and the date that activity was done. Then we restrict the results to only the most recent activity by saying the activity date must match the "status_date" we calculated in the core query. So a columnA = columnB type of limit.
That's it. The query works. And it is the windowed function we use to get the "status date" that makes it possible. Even the use of a sub-query like this is suggested on the documentation pages (which is pretty much like a query of a temp table). There may be even better ways to accomplish this query, but I'm happy with this approach for now. Still, feel free to drop me a note if you know of a better way.
From there I wrapped that query up into a view, and can now just join to the view any time I need to know the status of a job.
I'm impressed. A rather quick response from the community pointing me in the right direction. Even when I didn't think it was the right direction at first. New topics take time to get your head around. But I'm already applying the windowed query technique in other queries. They are very useful, and quick.
Here are the links I used to get started:
- http://umitanuki.net/pgsql/wfv08/design.html
- http://www.postgresql.org/docs/8.4/static/tutorial-window.html
- http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTA...
I think you need to DO some of the examples and play with the ideas before it "clicks". The docs are useful, but make use of abstractions, wording, and an inherent core understanding that you may or may not understand or possess. After playing with it for a little while, the docs make much more sense.