Picking random element, with weights – select * from depesz;

Whenever I'm doing some testing I need sample data. Easiest way to do it is to generate data using some random/generate_series queries.

But what if I need specific frequencies?

For example, I need to generate 10,000,000 rows, where there will be 10% of ‘a', 20% of ‘b', and the rest will be split equally between ‘c' and ‘d'?

“online user" on Discord had similar problem, so figured I'll write a function that does that.

Since we need to pass, to the function, any number of args, and weigths for them, the simplest thing I can pass it as are 2 arrays:

elements to randomly pick from
their weights

Ready function is:

CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) RETURNS ANYELEMENT language plpgsql as $$
DECLARE
    v_no_choices INT4 := array_length(p_choices, 1);
    v_no_weights INT4 := array_length(p_weights, 1);
    v_weight_sum FLOAT8;
    v_random     FLOAT8;
    v_i          INT4;
BEGIN
    IF v_no_choices IS NULL OR v_no_weights IS NULL OR v_no_choices <> v_no_weights THEN
        RETURN NULL;
    END IF;
 
    SELECT sum(x) INTO v_weight_sum FROM unnest(p_weights) as q(x);
 
    v_random := random() * v_weight_sum;
 
    for v_i in array_lower(p_weights, 1) .. array_upper(p_weights, 1) loop
        IF v_random < p_weights[v_i] THEN
            RETURN p_choices[v_i];
        END IF;
        v_random := v_random - p_weights[v_i];
    END loop;
    RETURN NULL;
END;
$$;

It needs two arrays, they shouldn't be empty/null, and their number of elements have to be the same.

Example use:

=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{.5,.5,.1,.9}' );
 weighted_random
─────────────────
 d
(1 row)

In this example, it picks from four values:

a with weight 0.5
b with weight 0.5
c with weight 0.1
d with weight 0.9

Since the weights are any float8, we can pick virtually any number to represent weight. For example we could:

=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' );

to generate one of the values with probabilities from the head of this post 🙂

We can easily check if the percentages are what we wanted by doing:

=$ WITH x AS (
    SELECT
        weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' ) as v
    FROM
        generate_series(1,10000) i
)
SELECT
    v,
    count(*),
    count(*)::float8 * 100 / 10000 as count_pct
FROM
    x
GROUP BY
    v
ORDER BY
    v;
 v │ count │ count_pct
───┼───────┼───────────
 a │   995 │      9.95
 b │  2014 │     20.14
 c │  3471 │     34.71
 d │  3520 │      35.2
(4 rows)

Of course, since these are randomly picked, the percentages fluctuate a bit, but should be close enough 🙂

Hope it could be useful for someone 🙂