On 18th of December, Teodor Sigaev committed patch:
Introduce distance operators over cubes: <#> taxicab distance <-> euclidean distance <=> chebyshev distance Also add kNN support of those distances in GiST opclass. Author: Stas Kelvich
I once worked for a company that was using unofficial patch to add this, so I'm very happy that we finally got kNN for cubes.
What this is?
Well, for starters – it's contrib extension that you can load to your database using simple:
$ CREATE extension cube; CREATE EXTENSION
Within the extension there are 61 elements, but the basic is “cube" datatype.
It's basically a point, or a “cube" in n-dimensional space.
For example, you can have following cubes:
- ‘(12)' – single point, in one dimension
- ‘(12,34)' – single point in two dimensions
- ‘(12,34,56,78,91)' – single point in 5 dimensions
- ‘(12),(24)' – single dimension “cube" from 12 to 24
- ‘(12,34),(56,78)' – two-dimensional cube (rectangle) defined by providing opposing points
- ‘(1,2,3,4,5),(6,7,8,9,10)' – cube in five dimensions
There is no real limit on number of dimensions.
Now, what are the different distances defined by the new patch? Let's see. For example purposes, let's start with one-dimensional points:
$ WITH x AS ( SELECT '(1)'::cube AS point_a, '(5)'::cube AS point_b ) SELECT point_a <#> point_b AS taxicab, point_a <-> point_b AS euclidean, point_a <=> point_b AS chebyshev FROM x; taxicab | euclidean | chebyshev ---------+-----------+----------- 4 | 4 | 4 (1 ROW)
Not really interesting. What about more dimensions?
$ WITH x AS ( SELECT '(0,0)'::cube AS point_a, '(16,9)'::cube AS point_b ) SELECT point_a <#> point_b AS taxicab, point_a <-> point_b AS euclidean, point_a <=> point_b AS chebyshev FROM x; taxicab | euclidean | chebyshev ---------+------------------+----------- 25 | 18.3575597506858 | 16 (1 ROW)
This shows pretty clearly what's what. taxicab is basically sum of distances across all axes, euclidean is distance in straight line, and chebyshev is largest distance from distances on all axes.
For me, euclidean looks the most interesting. So let's stick to it.
Now, let's imagine, we have a table with 1 million 3 dimensional points, where x, y, and z are all within -5000 .. 5000 range. Or better yet, let's not image, just make it:
$ CREATE TABLE test_cubes AS SELECT i AS id, cube(array[random() * 10000 - 5000, random() * 10000 - 5000, random() * 10000 - 5000]) FROM generate_series(1,1000000) i; SELECT 1000000 $ ALTER TABLE test_cubes ADD PRIMARY KEY (id); ALTER TABLE
some sample data:
$ SELECT * FROM test_cubes tablesample system ( 0.01 ) LIMIT 10; id | cube --------+---------------------------------------------------------- 542641 | (-4178.8482805714, 4531.55778348446, -4325.33879298717) 542642 | (-4550.62797293067, 3511.47029548883, -220.161587931216) 542643 | (-4966.07152279466, 1456.31569903344, -2061.16470042616) 542644 | (-1129.34995442629, -550.684169866145, 2578.38271558285) 542645 | (4474.94647931308, 2999.37592353672, 1136.93070597947) 542646 | (-1516.54057670385, 1515.51792863756, -4341.86951257288) 542647 | (795.945581048727, 2787.07716614008, 2160.27483809739) 542648 | (1535.92750895768, -2401.27934142947, -726.663023233414) 542649 | (763.151990249753, -995.386703871191, -2518.4234790504) 542650 | (2935.75372546911, 1408.53805933148, -1084.11618042737) (10 ROWS)
Finding rows that are close to some specific point (let's say – (0,0,0)) can be done by something like this:
SELECT * FROM ( SELECT id, cube, cube <-> cube(array[0.0,0.0,0.0]) AS distance FROM test_cubes tc ) AS x ORDER BY distance ASC LIMIT 10; id | cube | distance --------+-----------------------------------------------------------+------------------ 915947 | (15.2430403977633, -53.2670319080353, 48.1952121481299) | 73.4336805754876 280346 | (9.85732767730951, -80.3589634597301, -41.3774838671088) | 90.9220880118409 243882 | (8.24892893433571, 74.6928667649627, 67.0726876705885) | 100.726434492086 195082 | (-15.9435207024217, 69.5352349430323, -78.6104751750827) | 106.155318087336 239258 | (-61.3826420158148, -46.3476357981563, -98.9516917616129) | 125.329044468573 661586 | (-103.007820434868, 76.9877433776855, -1.76172703504562) | 128.611147974336 363224 | (27.2627454251051, -116.961468011141, 53.6130089312792) | 131.520329280688 336093 | (68.2466197758913, -35.0401923060417, -123.154250904918) | 145.094402730191 168317 | (120.665831491351, 83.5728365927935, -18.6012778431177) | 147.954957480518 11121 | (24.2509506642818, 133.48447624594, -62.4999403953552) | 149.373547042966 (10 ROWS)
(I know I don't need subselect, just wanted to show the distance too, not only order by it.
But it's relatively slow:
QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- LIMIT (cost=42443.64..42443.67 ROWS=10 width=36) (actual TIME=352.614..352.617 ROWS=10 loops=1) -> Sort (cost=42443.64..44943.64 ROWS=1000000 width=36) (actual TIME=352.612..352.614 ROWS=10 loops=1) Sort KEY: ((tc.cube <-> '(0, 0, 0)'::cube)) Sort Method: top-N heapsort Memory: 25kB -> Seq Scan ON test_cubes tc (cost=0.00..20834.00 ROWS=1000000 width=36) (actual TIME=0.052..195.534 ROWS=1000000 loops=1) Planning TIME: 0.403 ms Execution TIME: 352.636 ms (7 ROWS)
As you can see it's fetching all rows, calculates the distance, and sorts. All expensive.
But, since 9.1 we had the ability to use kNN searches, with proper indexing. And this new patch adds this capability to cubes. So how do I use it?
$ CREATE INDEX magic ON test_cubes USING gist (cube); CREATE INDEX
And now, the same select, will have “slightly" different explain:
$ EXPLAIN analyze SELECT * FROM ( SELECT id, cube, cube <-> cube(array[0.0,0.0,0.0]) AS distance FROM test_cubes tc ) AS x ORDER BY distance ASC LIMIT 10; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- LIMIT (cost=0.41..1.37 ROWS=10 width=36) (actual TIME=1.213..1.656 ROWS=10 loops=1) -> INDEX Scan USING magic ON test_cubes tc (cost=0.41..96296.41 ROWS=1000000 width=36) (actual TIME=1.210..1.649 ROWS=10 loops=1) ORDER BY: (cube <-> '(0, 0, 0)'::cube) Planning TIME: 0.497 ms Execution TIME: 1.808 ms (5 ROWS)
Of course the same index would be used for taxicab and chebyshev distances.
This is great, thanks guys.