On 3rd of November, Heikki Linnakangas committed patch:
Support range DATA types. Selectivity estimation functions are missing FOR SOME range TYPE operators, which IS a TODO. Jeff Davis
Well, the commit log is quite laconic. So let's dig a bit deeper.
Every once in a while you need to represent range of values. For example – when you have e-commerce database, you might have rebates system based on how much given buyer previously bought from you – if between 10 and 100 items – you give him 1% rebate, if 100-1000 – 5%, and so on.
In auction sites you might have defined time range for when the auction should be visible on site.
Previously these things were defined usually using 2 fields ( visible_from, visible_to ), but this was suboptimal for at least 2 reasons:
- readability – given: rebate_from = 10, rebate_to = 100, rebate = 1% and rebate_from = 100, rebate_to = 1000, rebate = 5% – what should be the rebate for user with 100 purchased items?
- indexing such cases can be problematic
Both of above problems are solved, quite nicely.
As for first problem – while two values for range “10 , 100" can have four possible meanings:
- neither 10 nor 100 is included in range
- 10 is included, but 100 not
- both are included
- 10 is not included, but 100 is
But we have proper tools to denote it in value: [] and () characters.
Above four options can be written now as:
- (10,100)
- [10,100)
- [10,100]
- (10,100]
Every datatype can have now range subtype. Built-in we get 6:
- int4range based on integer values
- numrange based on numeric values
- tsrange based on timestamp without time zone values
- tstzrange based on timestamp with time zone values
- daterange based on date values
- int8range based on bigint values
To create range types, one can simply use text format and cast:
$ SELECT '(10.1,123.2345]'::numrange; numrange ----------------- (10.1,123.2345] (1 ROW)
or new constructors, which can be used like this:
$ SELECT tstzrange( ); tstzrange ----------- empty (1 ROW) $ SELECT tstzrange( now() ); tstzrange ------------------------------------------------------------------- ["2011-11-07 15:21:56.366521+01","2011-11-07 15:21:56.366521+01"] (1 ROW) $ SELECT tstzrange( now(), now() + '1 day'::INTERVAL ); tstzrange ------------------------------------------------------------------- ["2011-11-07 15:22:07.535156+01","2011-11-08 15:22:07.535156+01") (1 ROW) $ SELECT tstzrange( now(), now() + '1 day'::INTERVAL, '(]' ); tstzrange ------------------------------------------------------------------- ("2011-11-07 15:22:18.096187+01","2011-11-08 15:22:18.096187+01"] (1 ROW)
First call generated empty range. I'm not entirely sure if there is sensible usecase for it, but at the very least – it's important to have it for completeness sake.
Second – with only one value generate single-value range. I.e. only this single value will be within the range, no other.
Third call showed basic range based on two values, and fourth is a variation of this, but with changed inclusiveness of ends – from “[x,y)" to “(x,y]".
Of course ability to specify ranges as single value is not so thrilling, until we'll add indexing.
To index these we will need to use GiST, like this:
CREATE INDEX some_name ON some_table USING gist ( COLUMN );
but that's just the beginning.
Now, we can easily add exclusion constraint – which is kind-of like unique.
For example, we can make the rebates table, so that only one row can be made to match given number of items:
CREATE TABLE rebates (
codename text primary key,
items_range int4range not null,
rebate_pct int4
);
ALTER TABLE rebates ADD EXCLUDE USING gist (items_range WITH &&);
And now, I can insert some rows:
INSERT INTO rebates (codename, items_range, rebate_pct) VALUES ( 'bronze', '[10,100)', 1 ), ( 'silver', '[100,1000)', 5 ), ( 'gold', '[1000, 5000)', 7 ), ( 'platinum', '[5000,)', 10 );
The last row is interesting because it shows range with no upper bound – i.e. it will be from 5000 to anything.
Now we can check if the matching works:
$ SELECT * FROM rebates WHERE items_range @> 15::int4; codename | items_range | rebate_pct ----------+-------------+------------ bronze | [10,100) | 1 (1 ROW) $ SELECT * FROM rebates WHERE items_range @> 100::int4; codename | items_range | rebate_pct ----------+-------------+------------ silver | [100,1000) | 5 (1 ROW) $ SELECT * FROM rebates WHERE items_range @> 1000000::int4; codename | items_range | rebate_pct ----------+-------------+------------ platinum | [5000,) | 10 (1 ROW)
All looks good. But now I can try to insert bad, overlapping range:
$ INSERT INTO rebates (codename, items_range, rebate_pct) VALUES ('test', '[50,500)', 3); ERROR: conflicting KEY VALUE violates exclusion CONSTRAINT "rebates_items_range_excl" DETAIL: KEY (items_range)=([50,500)) conflicts WITH existing KEY (items_range)=([10,100)).
It showed immediately first row that the new row conflicts with. Sweet.
Of course we also have other options than just checking if value is within range. We can do unions, differences, intersections, checks location (left-of/right-of), whether two ranges are adjacent, and so on. All details in the fine manual.
All in all – this is not strictly speaking new functionality – we could have done it all by ourselves earlier, but the new syntax is nicer, easier to understand and less prone to errors. And that's definitely something good.
The killer feature here that is not easily replicable with greater than less than is with finding where two ranges overlap. Think resource scheduling — if you have a room reserved from 10am-1pm and you want to see if there is a conflict with a request to have the room from 9am to 11am.
@Anonymous:
I showed how to do it in 9.0 in the linked blogpost.
As said – it was possible before, just not as nicely.
Great blog entry as usual Depesz! I really hope many people will use this in real production when 9.2 arrives. Does anyone happen to know if other database systems have such types?
Interesting. One suggestion für increased “intuitivity” – instead of
neither 10 nor 100 is included in range
10 is included, but 100 not
both are included
10 is not included, but 100 is
Above four options can be written now as:
instead of (10,100) i’d use: )10,100(
similiar:
[10,100) -> (10,100(
[10,100] -> (10,100)
(10,100] -> )10,100)
just an idea
@FFF:
hmm … using reverse parentheses is not a widely known approach – at least that’s the first time I hear about it.
On the other hand using () and [] is (afaik) widely known, and agreed upon mathematical convention (I learned about it in high school). Or perhaps it’s () and <> ? Not sure now. Anyway – never before heard of notation with reversed parentheses.
I deal with ranges all the time in the sciences, often requiring two “from” and “to” columns, with a number of checks and trigger functions to ensure valid, non-overlapping ranges.
The closed/open [,(,),] specification is a really useful feature which I couldn’t exactly do with two “from”/”to” columns without predefined assumptions of which bound was closed or open.
The predicate functions are also really useful, and intuitive from their 2/3D spatial counterparts. I’ll be using it all.
Just a word on the parentheses, in Italy while in highschool i did not use ( but only [ and the reversed form.
I used both in university. I’ll check with a friend russian math PhD and tell you about the standard if there is one.
Depesz,
naturally it is not known, as i just invented it
If [ vs ( is “mathematical convention” i can’t comment, but given that SQL is no math, and parentheses of all sorts are omnipresent in hacking (and has there other meaning), i think anything intuitive understandable is better. You have to “learn” what’s the meanining of [ or (, as you have to with reversed parentheses, but i for one would remember the latter’s meaning more easily…
Nevermind, it was only a suggestion
@Tiziano:
interesting. I guess my school skewed my view (not really shocking, just was *sure* it’s common convention). Thanks for info.
Hey,
+1 with Tiziano, I learned the reversed notation with [ only at school, in France. So [10, 100) is actually [10, 100[ in my mind.
I was actually wondering while reading your post why the patch don’t used this notation, but reading the comments here, I realize it isn’t actually a standard…
I’ve only just seen this, and I’m very pleased to read it! It makes the functionality for creating range-exclusions much nicer, especially with it handling half-open ranges.
The “[x,y)” notation for a half-open range is familiar to me from maths too. I think both “[x,y)” and “[x,y[” both look like typos unless you’re already familiar with the convention.
Here in Finland, we use [1,100] for inclusive and ]0,inf[ for exclusive (that is, square brackets all the time). Apparently the same as in France. This will need a set of alternative input notations, or a lot of people will be confused.