My Jabber server has a nice feature: it logs all messages that get sent through it.

This is pretty cool, and useful. And now I've been asked to use it to create a list of conversations.

What exactly does that mean? Whenever I send (or receive) something, a record lands in the database with the local user, the communication type (send/recv), the correspondent, when it happened, and the message body.

Based on this, we want to group the messages into chats. How?
The table looks like this:
$ \d message_log
            Table "public.message_log"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 local_user    | text                        |
 correspondent | character varying(250)      |
 msg_type      | character varying(250)      |
 logged_when   | timestamp without time zone |
 msg_body      | text                        |
And the data in it looks like this:
$ SELECT md5(local_user), md5(correspondent), msg_type, logged_when, md5(msg_body) FROM message_log LIMIT 3;
               md5                |               md5                | msg_type |        logged_when         |               md5
----------------------------------+----------------------------------+----------+----------------------------+----------------------------------
 d5ecebb85919589a7cbf7f90eb34eacf | 09cc8925ddb803a1eb756001ad984c1c | sent     | 2007-07-23 23:23:38.802531 | 875d829b9d90d52dc62ea9e8aa7e0afc
 d5ecebb85919589a7cbf7f90eb34eacf | 09cc8925ddb803a1eb756001ad984c1c | recv     | 2007-07-23 23:23:45.314665 | d7567c72c61e30334500b029ec7e2033
 d5ecebb85919589a7cbf7f90eb34eacf | d59345ff3c632a9ae921c7e3aab661fe | sent     | 2007-07-23 23:43:00.96291  | eb43de8a086c848c0907bcff462154e6
(3 rows)
(sorry, I can't disclose sensitive information, hence the md5() hashing).
Now, there is a question: what is a conversation? Of course, we don't want to do any language analysis or inspect message subjects. Instead, let's try a simpler approach:
Disclaimer: the idea for how to do this is not my own: it's based on a RhodiumToad blog post.
if 2 messages are less than x minutes apart, they are both part of the same conversation

Of course, this applies only to messages between the same 2 parties.
So, assuming we'd have messages logged at:
- 2007-07-23 23:23:38
- 2007-07-23 23:24:01
- 2007-07-24 05:01:20
- 2007-07-24 08:24:59
- 2007-07-24 08:29:40
- 2007-07-24 08:33:15
and we assume that the conversation threshold is 5 minutes, then we have 3 conversations here:
- from 2007-07-23 23:23:38 to 2007-07-23 23:24:01 (2 messages)
- a single message on 2007-07-24 05:01:20
- from 2007-07-24 08:24:59 to 2007-07-24 08:33:15 (3 messages)
Seems easy enough. But how do we write a query to do it? Luckily, with PostgreSQL 8.4 we got window functions, which will be *very* useful here.
First, let's make a smaller table with the test values from above:
CREATE TABLE test (ts TIMESTAMP);
INSERT INTO test (ts) VALUES
    ('2007-07-23 23:23:38'),
    ('2007-07-23 23:24:01'),
    ('2007-07-24 05:01:20'),
    ('2007-07-24 08:24:59'),
    ('2007-07-24 08:29:40'),
    ('2007-07-24 08:33:15');
The first part of the query is a way to mark which messages are within our threshold of the previous message. We can do it like this:
WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
)
SELECT
    t.ts,
    CASE
        WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold
            THEN NULL
        ELSE t.ts
    END
FROM test t, setup s
ORDER BY t.ts;
The results are as expected:
         ts          |         ts
---------------------+---------------------
 2007-07-23 23:23:38 | 2007-07-23 23:23:38
 2007-07-23 23:24:01 | [null]
 2007-07-24 05:01:20 | 2007-07-24 05:01:20
 2007-07-24 08:24:59 | 2007-07-24 08:24:59
 2007-07-24 08:29:40 | [null]
 2007-07-24 08:33:15 | [null]
(6 rows)
The name of the 2nd column is bad, but that's not relevant.
In case you don't understand the CASE expression I used:

WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold THEN NULL

here is what it does:

From the value of t.ts (the current ts in a given row), it subtracts the result of lag(t.ts) OVER ( ORDER BY t.ts ), which is the value of the ts column in the previous row, with "previous" defined by ORDER BY ts.
Technically, this means it computes the time difference between the current and the previous row:
$ SELECT ts, ts - lag(ts) OVER ( ORDER BY ts ) FROM test ORDER BY ts;
         ts          | ?column?
---------------------+----------
 2007-07-23 23:23:38 | [null]
 2007-07-23 23:24:01 | 00:00:23
 2007-07-24 05:01:20 | 05:37:19
 2007-07-24 08:24:59 | 03:23:39
 2007-07-24 08:29:40 | 00:04:41
 2007-07-24 08:33:15 | 00:03:35
(6 rows)
Then we compare the value with the threshold from setup (5 minutes). If the gap is within the threshold, I just return NULL: the message is part of the previous conversation. If it's larger than 5 minutes, it's clearly a new chat, and I need to mark it somehow; I chose to use the ts itself as the mark.
With this resultset, I can now fill in the NULLs with values that will let me group the messages into chats:
WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
), starts_marked AS (
    SELECT
        t.ts,
        CASE
            WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold
                THEN NULL
            ELSE t.ts
        END AS chat_start
    FROM test t, setup s
)
SELECT ts, MAX(chat_start) OVER (ORDER BY ts)
FROM starts_marked
ORDER BY ts;
Results now:
         ts          |         max
---------------------+---------------------
 2007-07-23 23:23:38 | 2007-07-23 23:23:38
 2007-07-23 23:24:01 | 2007-07-23 23:23:38
 2007-07-24 05:01:20 | 2007-07-24 05:01:20
 2007-07-24 08:24:59 | 2007-07-24 08:24:59
 2007-07-24 08:29:40 | 2007-07-24 08:24:59
 2007-07-24 08:33:15 | 2007-07-24 08:24:59
(6 rows)
Again, the column name is bad, but it's irrelevant.
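In case the fill-in trick isn't obvious: MAX() used as a window function with ORDER BY computes a running maximum over all rows from the start up to the current one, and aggregates ignore NULLs. Since chat starts only grow as ts grows, the running max is simply the most recent non-NULL chat_start. A minimal standalone demonstration (not from the original post, just illustrative values):

SELECT o, v, MAX(v) OVER (ORDER BY o) AS filled
FROM ( VALUES (1, 10), (2, NULL), (3, NULL), (4, 42), (5, NULL) ) AS x (o, v);
-- filled comes out as: 10, 10, 10, 42, 42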
As you can see, all rows that belong to the same conversation now have the same value in the max column, which makes it possible to aggregate them:
WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
), starts_marked AS (
    SELECT
        t.ts,
        CASE
            WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold
                THEN NULL
            ELSE t.ts
        END AS chat_start
    FROM test t, setup s
), grouped AS (
    SELECT ts, MAX(chat_start) OVER (ORDER BY ts) AS group_start
    FROM starts_marked
)
SELECT
    group_start,
    MAX(ts) AS group_end,
    COUNT(*) AS messages,
    MAX(ts) - group_start AS duration
FROM grouped
GROUP BY group_start
ORDER BY group_start;
Results:
     group_start     |      group_end      | messages | duration
---------------------+---------------------+----------+----------
 2007-07-23 23:23:38 | 2007-07-23 23:24:01 |        2 | 00:00:23
 2007-07-24 05:01:20 | 2007-07-24 05:01:20 |        1 | 00:00:00
 2007-07-24 08:24:59 | 2007-07-24 08:33:15 |        3 | 00:08:16
(3 rows)
Which is pretty cool. But how do we change it to work with multiple correspondents? That's actually pretty trivial:
WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
), starts_marked AS (
    SELECT
        local_user,
        correspondent,
        t.logged_when,
        CASE
            WHEN t.logged_when - lag(t.logged_when) OVER (
                    PARTITION BY local_user, correspondent
                    ORDER BY t.logged_when
                ) <= s.threshold
                THEN NULL
            ELSE t.logged_when
        END AS chat_start
    FROM message_log t, setup s
), grouped AS (
    SELECT
        local_user,
        correspondent,
        logged_when,
        MAX(chat_start) OVER (
            PARTITION BY local_user, correspondent
            ORDER BY logged_when
        ) AS group_start
    FROM starts_marked
)
SELECT
    md5(local_user),
    md5(correspondent),
    group_start,
    MAX(logged_when) AS group_end,
    COUNT(*) AS messages,
    MAX(logged_when) - group_start AS duration
FROM grouped
GROUP BY md5(local_user), md5(correspondent), group_start
ORDER BY md5(local_user), md5(correspondent), group_start;
Results look pretty cool:
               md5                |               md5                |        group_start         |         group_end          | messages |    duration
----------------------------------+----------------------------------+----------------------------+----------------------------+----------+-----------------
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-03 21:17:29.281889 | 2010-09-03 21:25:59.092128 |       32 | 00:08:29.810239
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 05:21:00.081023 | 2010-09-04 05:21:00.081023 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 16:31:50.569392 | 2010-09-04 16:41:42.637592 |       58 | 00:09:52.0682
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-04 21:25:56.50246  | 2010-09-04 21:25:56.50246  |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-05 00:09:59.265376 | 2010-09-05 00:09:59.265376 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-05 08:41:11.776131 | 2010-09-05 08:41:11.776131 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-10 19:49:30.163026 | 2010-09-10 19:49:45.565422 |        3 | 00:00:15.402396
 20a80d47259e2f59315e6e46850e2cff | 020582827f3be456a4d79d4da81ade32 | 2010-09-10 20:54:53.288434 | 2010-09-10 20:55:27.811443 |        6 | 00:00:34.523009
 20a80d47259e2f59315e6e46850e2cff | 09351174dfb598850656184da17446a9 | 2010-09-02 18:25:51.179718 | 2010-09-02 18:25:51.179718 |        1 | 00:00:00
 20a80d47259e2f59315e6e46850e2cff | 09351174dfb598850656184da17446a9 | 2010-09-02 18:34:35.861093 | 2010-09-02 18:36:56.641371 |        6 | 00:02:20.780278
...
Now I could toss in some rounding (for example: conversation durations rounded up to the nearest 5, 10, or 15 minutes) and some grouping (all conversations with a given person during a month), and I'd have a nice system for automatically generating invoices for consulting-over-Jabber services. How to do that, though, I'll leave as an exercise for readers 🙂
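If you want a head start on the rounding part, here is a minimal sketch against the small test table from above; the 15-minute granularity is purely my assumption:

WITH setup AS (
    SELECT '5 minutes'::INTERVAL AS threshold
), starts_marked AS (
    SELECT
        t.ts,
        CASE
            WHEN t.ts - lag(t.ts) OVER ( ORDER BY t.ts ) <= s.threshold
                THEN NULL
            ELSE t.ts
        END AS chat_start
    FROM test t, setup s
), grouped AS (
    SELECT ts, MAX(chat_start) OVER (ORDER BY ts) AS group_start
    FROM starts_marked
)
SELECT
    group_start,
    MAX(ts) - group_start AS duration,
    -- round the duration UP to full 15-minute (900-second) blocks;
    -- single-message chats come out as 00:00:00 here, and enforcing a
    -- minimum billable unit would be one GREATEST() away
    ceil( extract(epoch FROM MAX(ts) - group_start) / 900 )::int
        * '15 minutes'::INTERVAL AS billed
FROM grouped
GROUP BY group_start
ORDER BY group_start;

The trick is simply converting the interval to seconds with extract(epoch FROM ...), dividing by the unit size, and taking ceil().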
@depesz, what about performance: index usage, dependence on the amount of data, and so on?
Well, it is mostly rescanning of a recordset (i.e. the list of messages to be grouped into chats). So, as long as your dataset is not too big, it will be fast. Generally, we are speaking about one seq scan (or index scan) of the source table, plus 3 materializations of changes to the recordset. With appropriate work_mem it should work really nicely.
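As an aside (not benchmarked here): an index that matches the PARTITION BY / ORDER BY of the window functions lets the planner read the rows pre-sorted instead of sorting the whole set. A sketch, with an index name that is just an example:

CREATE INDEX message_log_chat_idx
    ON message_log (local_user, correspondent, logged_when);
-- same columns, in the same order, as:
-- PARTITION BY local_user, correspondent ORDER BY logged_when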
@depesz, thanks for the extremely useful example! You made my day.