Whenever I'm doing some testing I need sample data. Easiest way to do it is to generate data using some random/generate_series queries.
But what if I need specific frequencies?
For example, I need to generate 10,000,000 rows, where there will be 10% of ‘a', 20% of ‘b', and the rest will be split equally between ‘c' and ‘d'?
“online user" on Discord had similar problem, so figured I'll write a function that does that.
Since we need to pass, to the function, any number of args, and weigths for them, the simplest thing I can pass it as are 2 arrays:
- elements to randomly pick from
- their weights
Ready function is:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) RETURNS ANYELEMENT LANGUAGE plpgsql AS $$ DECLARE v_no_choices INT4 := array_length(p_choices, 1); v_no_weights INT4 := array_length(p_weights, 1); v_weight_sum FLOAT8; v_random FLOAT8; v_i INT4; BEGIN IF v_no_choices IS NULL OR v_no_weights IS NULL OR v_no_choices <> v_no_weights THEN RETURN NULL; END IF; SELECT SUM(x) INTO v_weight_sum FROM unnest(p_weights) AS q(x); v_random := random() * v_weight_sum; FOR v_i IN array_lower(p_weights, 1) .. array_upper(p_weights, 1) loop IF v_random < p_weights[v_i] THEN RETURN p_choices[v_i]; END IF; v_random := v_random - p_weights[v_i]; END loop; RETURN NULL; END; $$;
It needs two arrays, they shouldn't be empty/null, and their number of elements have to be the same.
Example use:
=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{.5,.5,.1,.9}' ); weighted_random ───────────────── d (1 ROW)
In this example, it picks from four values:
- a with weight 0.5
- b with weight 0.5
- c with weight 0.1
- d with weight 0.9
Since the weights are any float8, we can pick virtually any number to represent weight. For example we could:
=$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' );
to generate one of the values with probabilities from the head of this post 🙂
We can easily check if the percentages are what we wanted by doing:
=$ WITH x AS ( SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' ) AS v FROM generate_series(1,10000) i ) SELECT v, COUNT(*), COUNT(*)::float8 * 100 / 10000 AS count_pct FROM x GROUP BY v ORDER BY v; v │ COUNT │ count_pct ───┼───────┼─────────── a │ 995 │ 9.95 b │ 2014 │ 20.14 c │ 3471 │ 34.71 d │ 3520 │ 35.2 (4 ROWS)
Of course, since these are randomly picked, the percentages fluctuate a bit, but should be close enough 🙂
Hope it could be useful for someone 🙂