Some time ago someone on irc asked interesting question. One that I couldn't answer then (didn't have an immediate idea, and didn't have time to spend on looking into it).
Now, I have some more time, and despite the fact that the person that had this problem no longer cares about it (he found some solution himself if I recall correctly), decided to look into it.
The problem is: having simple table:
$ CREATE TABLE test ( id INT NOT NULL PRIMARY KEY, sample_date DATE NOT NULL, VALUE INT );
And data there, like:
$ INSERT INTO test (id, sample_date, VALUE) VALUES (1, '2015-01-01', 100), (2, '2015-01-03', 120), (3, '2015-01-06', 140), (4, '2015-01-15', 100);
can we somehow return data that exists in the table, but also data for missing days, somehow “interpolating" the values?
So, we'd want to get:
sample_date | VALUE | REAL -------------+-------+------ 2015-01-01 | 100 | t 2015-01-02 | 110 | f 2015-01-03 | 120 | t 2015-01-04 | 126 | f 2015-01-05 | 134 | f 2015-01-06 | 140 | t 2015-01-07 | 136 | f 2015-01-08 | 131 | f 2015-01-09 | 127 | f 2015-01-10 | 122 | f 2015-01-11 | 118 | f 2015-01-12 | 113 | f 2015-01-13 | 109 | f 2015-01-14 | 104 | f 2015-01-15 | 100 | t (4 ROWS)
“real" column has true for values coming from test table, and false for interpolated.
Since the values are rounded to integer, final results might be slightly different, but that's the general idea.
So, how can I do it?
First, I need somehow to get all rows, even with nulls. For this I will use generate_series. But hardcoding min/max values is not really nice, so let's make it check range from the table itself:
WITH range AS ( SELECT MIN(sample_date), MAX(sample_date) FROM test ), all_days AS ( SELECT g.day::DATE AS DAY FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY) ) SELECT * FROM all_days; DAY ------------ 2015-01-01 2015-01-02 2015-01-03 2015-01-04 2015-01-05 2015-01-06 2015-01-07 2015-01-08 2015-01-09 2015-01-10 2015-01-11 2015-01-12 2015-01-13 2015-01-14 2015-01-15 (15 ROWS)
This is cool. Now, I need to join the test table, to get real data:
WITH range AS ( SELECT MIN(sample_date), MAX(sample_date) FROM test ), all_days AS ( SELECT g.day::DATE AS DAY FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY) ), all_with_real AS ( SELECT a.day, t.value, t.value IS NOT NULL AS REAL FROM all_days a LEFT OUTER JOIN test t ON a.day = t.sample_date ) SELECT * FROM all_with_real ORDER BY 1; DAY | VALUE | REAL ------------+--------+------ 2015-01-01 | 100 | t 2015-01-02 | [NULL] | f 2015-01-03 | 120 | t 2015-01-04 | [NULL] | f 2015-01-05 | [NULL] | f 2015-01-06 | 140 | t 2015-01-07 | [NULL] | f 2015-01-08 | [NULL] | f 2015-01-09 | [NULL] | f 2015-01-10 | [NULL] | f 2015-01-11 | [NULL] | f 2015-01-12 | [NULL] | f 2015-01-13 | [NULL] | f 2015-01-14 | [NULL] | f 2015-01-15 | 100 | t (15 ROWS)
Now, how about actually filing in the values. I could try with subselects:
WITH range AS ( SELECT MIN(sample_date), MAX(sample_date) FROM test ), all_days AS ( SELECT g.day::DATE AS DAY FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY) ), all_with_real AS ( SELECT a.day, t.value, t.value IS NOT NULL AS REAL FROM all_days a LEFT OUTER JOIN test t ON a.day = t.sample_date ), all_with_prev_and_next AS ( SELECT a.day, a.value, a.real, CASE WHEN a.real THEN NULL ELSE ( SELECT array[a.day - t.sample_date, t.value] FROM test t WHERE t.sample_date < a.day AND t.value IS NOT NULL ORDER BY t.sample_date DESC LIMIT 1 ) END AS prev_day, CASE WHEN a.real THEN NULL ELSE ( SELECT array[t.sample_date - a.day, t.value] FROM test t WHERE t.sample_date > a.day AND t.value IS NOT NULL ORDER BY t.sample_date ASC LIMIT 1 ) END AS next_day FROM all_with_real a ) SELECT a.day, CASE WHEN a.real THEN a.value ELSE ( a.next_day[2] - a.prev_day[2] ) * a.prev_day[1] / ( a.next_day[1] + a.prev_day[1] ) + a.prev_day[2] END AS VALUE, a.real FROM all_with_prev_and_next a ORDER BY a.day; DAY | VALUE | REAL ------------+-------+------ 2015-01-01 | 100 | t 2015-01-02 | 110 | f 2015-01-03 | 120 | t 2015-01-04 | 126 | f 2015-01-05 | 133 | f 2015-01-06 | 140 | t 2015-01-07 | 136 | f 2015-01-08 | 132 | f 2015-01-09 | 127 | f 2015-01-10 | 123 | f 2015-01-11 | 118 | f 2015-01-12 | 114 | f 2015-01-13 | 109 | f 2015-01-14 | 105 | f 2015-01-15 | 100 | t (15 ROWS)
But this will be relatively slow, as it has to do two scans of test table for every day we're missing data for.
Also, while I wrote it, and understand how it works, I wouldn't call it the nicest possible query.
So, is there, perhaps, any other approach we could do? Perhaps window functions could help?
Sure, they can. I found this solution. It is most likely not the best possible, but it works for me. Let's see:
WITH range AS ( SELECT MIN(sample_date), MAX(sample_date) FROM test ), all_days AS ( SELECT g.day::DATE AS DAY FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY) ), all_with_real AS ( SELECT a.day, t.value, t.value IS NOT NULL AS REAL FROM all_days a LEFT OUTER JOIN test t ON a.day = t.sample_date ), all_with_previous_and_next AS ( SELECT a.*, MAX( CASE WHEN a.value IS NULL THEN NULL ELSE array[ a.day::TEXT, a.value::TEXT ] END ) OVER ( ORDER BY a.day ) AS previous_real, MIN( CASE WHEN a.value IS NULL THEN NULL ELSE array[ a.day::TEXT, a.value::TEXT ] END ) OVER ( ORDER BY a.day range BETWEEN CURRENT ROW AND unbounded following ) AS next_real FROM all_with_real a ) SELECT a.day, COALESCE( a.value, a.previous_real[2]::INT4 + ( a.next_real[2]::INT4 - a.previous_real[2]::INT4 ) * ( a.day - a.previous_real[1]::DATE ) / ( a.next_real[1]::DATE - a.previous_real[1]::DATE ) ) AS VALUE, a.real FROM all_with_previous_and_next a ORDER BY a.day;
Result is as expected, but how it works?
Let's see what each of the CTEs returns:
range:
MIN | MAX ------------+------------ 2015-01-01 | 2015-01-15 (1 ROW)
all_days:
SELECT * FROM all_days; DAY ------------ 2015-01-01 2015-01-02 2015-01-03 2015-01-04 2015-01-05 2015-01-06 2015-01-07 2015-01-08 2015-01-09 2015-01-10 2015-01-11 2015-01-12 2015-01-13 2015-01-14 2015-01-15 (15 ROWS)
all_with_real:
DAY | VALUE | REAL ------------+--------+------ 2015-01-01 | 100 | t 2015-01-02 | [NULL] | f 2015-01-03 | 120 | t 2015-01-04 | [NULL] | f 2015-01-05 | [NULL] | f 2015-01-06 | 140 | t 2015-01-07 | [NULL] | f 2015-01-08 | [NULL] | f 2015-01-09 | [NULL] | f 2015-01-10 | [NULL] | f 2015-01-11 | [NULL] | f 2015-01-12 | [NULL] | f 2015-01-13 | [NULL] | f 2015-01-14 | [NULL] | f 2015-01-15 | 100 | t (15 ROWS)
all_with_previous_and_next
DAY | VALUE | REAL | previous_real | next_real ------------+--------+------+------------------+------------------ 2015-01-01 | 100 | t | {2015-01-01,100} | {2015-01-01,100} 2015-01-02 | [NULL] | f | {2015-01-01,100} | {2015-01-03,120} 2015-01-03 | 120 | t | {2015-01-03,120} | {2015-01-03,120} 2015-01-04 | [NULL] | f | {2015-01-03,120} | {2015-01-06,140} 2015-01-05 | [NULL] | f | {2015-01-03,120} | {2015-01-06,140} 2015-01-06 | 140 | t | {2015-01-06,140} | {2015-01-06,140} 2015-01-07 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-08 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-09 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-10 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-11 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-12 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-13 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-14 | [NULL] | f | {2015-01-06,140} | {2015-01-15,100} 2015-01-15 | 100 | t | {2015-01-15,100} | {2015-01-15,100} (15 ROWS)
Of course this is the most magical. The previous_real and next_real, as expected, show previous date/value, and next date/value.
This is done using one nasty trick. Normally I could use:
MAX( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day )
To get previous day (and similar query with different frame to get next real day).
But – this doesn't give me value. I can't use “min/max" on the values, as the values can decrease in time.
I didn't want to do subselects to get the values for previous/next day, so instead I decided to abuse arrays. I cast both needed values to text, and merge them into array, but putting date as first column.
PostgreSQL date format YYYY-MM-DD has the great benefit of being sortable. So assuming the dates are unique, value will not influence ordering.
This is still not really nice – the dirty trick with arrays of text – it just doesn't look ok.
Can it be done in any other way?
Sure. First of all, maybe someone else will find a better way, but I figured that I can simply add custom aggregate (don't worry, this is not scary at all):
CREATE FUNCTION last_agg ( anyelement, anyelement ) RETURNS anyelement AS $$ SELECT COALESCE( $2, $1 ); $$ LANGUAGE SQL; CREATE aggregate last_agg ( anyelement ) ( sfunc = last_agg, stype = anyelement );
That's all. What it does is that it returns last not-null value from a group. This doesn't really make sense in case of normal grouping, as “last" is not really well defined there, but since PostgreSQL can have aggregates with “order by" clause, suddenly last makes a lot of sense.
For example, given these values, in this order:
- 10
- NULL
- 20
- NULL
- 15
- NULL
last_agg will return value 15.
Having this, we can change our monstrous query into something much nicer:
WITH range AS ( SELECT MIN(sample_date), MAX(sample_date) FROM test ), all_days AS ( SELECT g.day::DATE AS DAY FROM range r, lateral generate_series( r.min, r.max, '1 day'::INTERVAL ) AS g(DAY) ), all_with_real AS ( SELECT a.day, t.value, t.value IS NOT NULL AS REAL FROM all_days a LEFT OUTER JOIN test t ON a.day = t.sample_date ), all_with_previous_and_next AS ( SELECT a.*, last_agg( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day ) AS previous_day, last_agg( a.value ) OVER ( ORDER BY a.day ) AS previous_value, last_agg( CASE WHEN a.value IS NULL THEN NULL ELSE a.day END ) OVER ( ORDER BY a.day DESC ) AS next_day, last_agg( a.value ) OVER ( ORDER BY a.day DESC ) AS next_value FROM all_with_real a ) SELECT a.day, COALESCE( a.value, a.previous_value + ( a.day - a.previous_day ) * ( a.next_value - a.previous_value ) / ( a.next_day - a.previous_day ) ) AS VALUE, a.real FROM all_with_previous_and_next a ORDER BY a.day;
all_with_previous_and_next contains now different values, and they are even properly typed:
DAY | VALUE | REAL | previous_day | previous_value | next_day | next_value ------------+--------+------+--------------+----------------+------------+------------ 2015-01-01 | 100 | t | 2015-01-01 | 100 | 2015-01-01 | 100 2015-01-02 | [NULL] | f | 2015-01-01 | 100 | 2015-01-03 | 120 2015-01-03 | 120 | t | 2015-01-03 | 120 | 2015-01-03 | 120 2015-01-04 | [NULL] | f | 2015-01-03 | 120 | 2015-01-06 | 140 2015-01-05 | [NULL] | f | 2015-01-03 | 120 | 2015-01-06 | 140 2015-01-06 | 140 | t | 2015-01-06 | 140 | 2015-01-06 | 140 2015-01-07 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-08 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-09 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-10 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-11 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-12 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-13 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-14 | [NULL] | f | 2015-01-06 | 140 | 2015-01-15 | 100 2015-01-15 | 100 | t | 2015-01-15 | 100 | 2015-01-15 | 100 (15 ROWS)
All this, and we still have just one scan of the underlying table. Not bad. Anyone has better idea?
You don’t need the subqueries. Just use lead on the real data before joining it with all_days:
@Anders:
Nice. I knew that there has to be a simpler approach. Just didn’t get to it myself :/
Argh, apparently I didn’t either: I didn’t like the range in the join (it looked slow), but I guess you can just produce the range in pieces:
FWIW, if you used temp views it would make the examples easier to follow. IE, replace all the WITH range, … all_days … with
CREATE TEMP VIEW all_days AS
with range as (
select min(sample_date), max(sample_date) from test
) SELECT …
This is especially handy as you build up examples; if each example is a temp view then you can just keep adding them to your JOIN clause.
@Jim:
True, will check how it goes next time.
The point in the queries is to shown that it can be done with a query, without any special objects. I’m sure you know that view/cte are interchangeable, but I’m not so sure if this is common knowledge.
Nice food for thought. Here is my attempt:
with recursive
bounds as (select min(sample_date),max(sample_date) from test),
day(current) as (select bounds.min from bounds union all select (current + interval ‘1 day’)::date from day,bounds where current < bounds.max),
dates as (select current, max(test.sample_date) filter (where test.sample_date = current) as hi_date from day, test group by current),
samples as (select current,lo_date,hi_date,min(value) filter (where test.sample_date = lo_date) as lo_val,min(value) filter (where test.sample_date = hi_date) as hi_val from dates, test group by 1,2,3)
select *, case when hi_date=lo_date then lo_val else lo_val + (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) end as intra from samples order by 1;
Scratch my previous post.
with recursive
bounds as (select min(sample_date),max(sample_date) from test),
day(current) as (select bounds.min from bounds union all select (current + interval ‘1 day’)::date from day,bounds where current < bounds.max),
dates as (select current, max(test.sample_date) filter (where test.sample_date = current) as hi_date from day, test group by current),
samples as (select current,lo_date,hi_date,min(value) filter (where test.sample_date = lo_date) as lo_val,min(value) filter (where test.sample_date = hi_date) as hi_val from dates, test group by 1,2,3)
select *, case when hi_date=lo_date then lo_val else lo_val + (current-lo_date) * (hi_val-lo_val) / (hi_date-lo_date) end as intra from samples order by 1;
Last attempt to copy-paste, what is wrong with this thing 🙂
doesn’t work for me – doesn’t even run – complains about lo_date column not existing.
I don’t know, the lo_date is declared in the the “dates” CTE, which is mungled each time i posted it. Simmilar to hi_date, only reversed.
dates as (
select current,
max(test.sample_date) filter (where test.sample_date = current) as hi_date
from day, test group by current),
Final version, working:
Just put it on some paste site. Or use html pre tags in your comment, that should be allowed by wordpress, and it will not change anything inside.
@Pg23193:
Looks like it’s working, but your solution requires 3 scans overt test table, while mine need 2, and Andres – just one.