UPDATE
Please also read the posts about this change in Pg 9.1 and this change in Pg 9.2, as they explain that since Pg 9.1 some of the limitations listed in this post no longer apply.
END OF UPDATE
Fight!
But more seriously – people tend to use various data types, and there have been some myths about them, so let's see what it really boils down to.
First of all – all of these data types are stored internally using the same C data structure: varlena.
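As a quick illustrative check (not part of the benchmarks below), you can compare the stored size of the same value in each type with pg_column_size(); the exact byte counts vary between versions, but varchar, varchar(n) and text should report the same size, while char(n) comes out bigger because of the padding:

SELECT pg_column_size('depesz'::text)        AS text_size,
       pg_column_size('depesz'::varchar)     AS varchar_size,
       pg_column_size('depesz'::varchar(15)) AS varchar_n_size,
       pg_column_size('depesz'::char(15))    AS char_n_size;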
Thanks to this, we can be nearly sure that there are no performance differences between them. But are there really no differences in practice? Let's test.
First, let's create a simple table, fill it with 500k rows, and then add an index. We'll also test the other order: create the table with the index from the start, and then load the data.
For generating test data, I wrote this simple script:
#!/usr/bin/perl -w
use strict;

print gen_random_word(100) . "\n" for 1 .. 500_000;

exit;

sub gen_random_word {
    my $len   = shift;
    my @chars = ( "a" .. "z", "A" .. "Z", "0" .. "9" );
    return join '', map { $chars[ rand @chars ] } 1 .. $len;
}
And ran it like this:
$ perl ~/z.pl > words.100.txt
$ cut -b 1-50 words.100.txt > words.50.txt
$ cut -b 1-10 words.50.txt > words.10.txt
$ ls -l *.txt
-rw-r--r-- 1 depesz users 50500000 Mar  1 14:59 words.100.txt
-rw-r--r-- 1 depesz users  5500000 Mar  1 14:59 words.10.txt
-rw-r--r-- 1 depesz users 25500000 Mar  1 14:59 words.50.txt
As you can see, we have 3 test data files – with “words” of 10, 50 and 100 characters. Each data file contains 500k rows.
Now, let's test the data load using this script:
#!/bin/bash
export TIME="%e"

echo "Testing of: create table, load data and create index."
for LEN in 10 50 100
do
    echo " - testing word length $LEN"
    for DATATYPE in "char($LEN)" "varchar($LEN)" "varchar" "text"
    do
        export DATATYPE
        for i in 1 2 3 4 5
        do
            dropdb test 2> /dev/null
            createdb test
            psql -d test -qAtX -c "create table test ( field $DATATYPE )"
            (
                echo "\\copy test from /var/tmp/test/words.$LEN.txt"
                echo "CREATE INDEX q on test (field)"
            ) | time psql -qAtX -d test
        done 2>&1 | awk '
            BEGIN{MAX=0; MIN=1000}
            ($1 > MAX) {MAX=$1}
            ($1 < MIN) {MIN=$1}
            {C++;S+=$1}
            END{printf "   - %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
        '
    done
done

echo "Testing of: create table with index and then load data."
for LEN in 10 50 100
do
    echo " - testing word length $LEN"
    for DATATYPE in "char($LEN)" "varchar($LEN)" "varchar" "text"
    do
        export DATATYPE
        for i in 1 2 3 4 5
        do
            dropdb test 2> /dev/null
            createdb test
            psql -d test -qAtX -c "create table test ( field $DATATYPE )"
            psql -d test -qAtX -c "create index q on test (field)"
            (
                echo "\\copy test from /var/tmp/test/words.$LEN.txt"
            ) | time psql -qAtX -d test
        done 2>&1 | awk '
            BEGIN{MAX=0; MIN=1000}
            ($1 > MAX) {MAX=$1}
            ($1 < MIN) {MIN=$1}
            {C++;S+=$1}
            END{printf "   - %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
        '
    done
done
Basically, this script will test data loading 5 times for each datatype and each word length, using 2 methods:
- Method #1:
CREATE TABLE test ( field $DATATYPE );
COPY test (field) FROM '/var/tmp/test/words.$LEN.txt';
CREATE INDEX q ON test (field);
- Method #2:
CREATE TABLE test ( field $DATATYPE );
CREATE INDEX q ON test (field);
COPY test (field) FROM '/var/tmp/test/words.$LEN.txt';
The script might look complicated, but it's not really. And its output is even simpler:
Testing of: create table, load data and create index.
 - testing word length 10
   - char(10)     : avg:  4.34s (min: 4.25, max: 4.41)
   - varchar(10)  : avg:  4.17s (min: 4.09, max: 4.24)
   - varchar      : avg:  4.23s (min: 4.20, max: 4.27)
   - text         : avg:  4.21s (min: 4.19, max: 4.22)
 - testing word length 50
   - char(50)     : avg:  5.67s (min: 5.54, max: 5.80)
   - varchar(50)  : avg:  5.40s (min: 5.26, max: 5.62)
   - varchar      : avg:  5.41s (min: 5.35, max: 5.58)
   - text         : avg:  5.47s (min: 5.34, max: 5.65)
 - testing word length 100
   - char(100)    : avg:  7.83s (min: 7.72, max: 7.99)
   - varchar(100) : avg:  7.65s (min: 7.51, max: 7.81)
   - varchar      : avg:  7.64s (min: 7.48, max: 7.78)
   - text         : avg:  7.62s (min: 7.47, max: 7.78)
Testing of: create table with index and then load data.
 - testing word length 10
   - char(10)     : avg:  5.03s (min: 4.95, max: 5.07)
   - varchar(10)  : avg:  4.90s (min: 4.80, max: 5.02)
   - varchar      : avg:  4.99s (min: 4.87, max: 5.09)
   - text         : avg:  4.90s (min: 4.82, max: 5.02)
 - testing word length 50
   - char(50)     : avg: 12.56s (min: 11.33, max: 14.04)
   - varchar(50)  : avg: 12.35s (min: 11.31, max: 14.07)
   - varchar      : avg: 11.96s (min: 10.93, max: 13.09)
   - text         : avg: 11.45s (min: 10.30, max: 12.41)
 - testing word length 100
   - char(100)    : avg: 57.01s (min: 54.35, max: 59.11)
   - varchar(100) : avg: 55.12s (min: 53.91, max: 57.11)
   - varchar      : avg: 58.38s (min: 54.46, max: 62.88)
   - text         : avg: 55.52s (min: 52.73, max: 58.43)
As you can see, the times are very similar – there are differences between data types, but there is no clear “winner”, and I'm much more inclined to attribute these differences to random fluctuations of load on my machine than to anything else.
So, we know that storing the data takes the same time.
What about searching?
So, I recreated the big tables:
CREATE TABLE test_char ( field char(100) );
\copy test_char from /var/tmp/test/words.100.txt
CREATE INDEX q_char ON test_char (field);

CREATE TABLE test_varchar ( field varchar(100) );
\copy test_varchar from /var/tmp/test/words.100.txt
CREATE INDEX q_varchar ON test_varchar (field);

CREATE TABLE test_varcharu ( field varchar );
\copy test_varcharu from /var/tmp/test/words.100.txt
CREATE INDEX q_varcharu ON test_varcharu (field);

CREATE TABLE test_text ( field text );
\copy test_text from /var/tmp/test/words.100.txt
CREATE INDEX q_text ON test_text (field);
And wrote a simple script to test the searches:
#!/bin/bash
export TIME="%e"

for DATATYPE in char varchar varcharu text
do
    export DATATYPE
    for i in 1 2 3 4 5
    do
        for j in $( seq 1 20 )
        do
            echo "SELECT COUNT(*) FROM test_${DATATYPE} WHERE field = any('{"$( shuf words.100.txt | head -n 50 | xargs echo | tr ' ' , )"}');"
        done | time psql -qAtX -d test > /dev/null
    done 2>&1 | awk '
        BEGIN{MAX=0; MIN=1000}
        ($1 > MAX) {MAX=$1}
        ($1 < MIN) {MIN=$1}
        {C++;S+=$1}
        END{printf "- %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
    '
done
In case you can't read it – for each data type, it runs the test 5 times. Each test consists of 20 selects, each fetching 50 rows from the relevant test table, using (of course) an index scan.
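To give an idea of what the generated queries look like, a single statement produced by the script is roughly this (the values below are short placeholders – the real lists contain 50 random 100-character words):

SELECT COUNT(*)
FROM test_text
WHERE field = any('{word1,word2,word3}');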
Results:
- char         : avg: 2.47s (min: 2.44, max: 2.49)
- varchar      : avg: 2.50s (min: 2.48, max: 2.51)
- varcharu     : avg: 2.48s (min: 2.47, max: 2.50)
- text         : avg: 2.49s (min: 2.44, max: 2.52)
This (I believe) proves that the performance of all 4 datatypes is the same.
So, what other points might there be when considering which datatype to use?
Ease of use? In this area char(n) gets really low marks. Why? Can you spot the problem?
# CREATE TABLE x (y CHAR(15));
CREATE TABLE

# INSERT INTO x(y) VALUES ('depesz');
INSERT 0 1

# SELECT * FROM x WHERE y LIKE '%sz';
 y
---
(0 rows)
Why wasn't the record selected? The reason is simple: char(n) values are right-padded with spaces.
This also means that this will work:
# SELECT * FROM x WHERE y = 'depesz ';
        y
-----------------
 depesz
(1 row)
Given this, remember that char(n) will actually use more disk space for strings – if your strings are shorter than “n” – because it right-pads them to the required length.
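You can see the padding on the example table above (a side check added here, not from the original post) – length() on a char(n) column ignores the trailing spaces, while octet_length() shows what is actually stored:

# SELECT y, length(y), octet_length(y) FROM x;

With the 'depesz' row from above, length() should report 6, while octet_length() should report 15 – the value is space-padded to the declared width.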
So, what about varchar, varchar(n) and text?
Varchar and text are the same, so we can treat them as such; but to avoid confusion with varchar(n), and because “text” is simply shorter (in terms of characters in its name), I prefer text.
So, we're left with 2 data types: varchar(n) and text.
The obvious benefit of varchar(n) is that it has a built-in size limit.
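As a quick illustration (table v below is a throwaway example, not one of the benchmark tables), the limit is enforced on every insert and update:

# CREATE TABLE v (y varchar(15));
CREATE TABLE

# INSERT INTO v (y) VALUES ('depesz');
INSERT 0 1

# INSERT INTO v (y) VALUES ('this is way more than fifteen characters');
ERROR:  value too long for type character varying(15)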
This is all cool, until you have to change the limit. Which happens a lot.
Of course, one could say that you can simply run:
ALTER TABLE xxx ALTER COLUMN yyy TYPE varchar(new_limit);
Well, first, let me say that I am only discussing making the limit larger here. If the new limit has to be smaller than before (I've never seen such a case, but it's technically possible), the table has to be scanned to make sure that all existing values fit within the new limit.
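Continuing the throwaway example from above: if any existing value is longer than the new, smaller limit, the ALTER simply fails while checking/rewriting the table:

# ALTER TABLE v ALTER COLUMN y TYPE varchar(3);
ERROR:  value too long for type character varying(3)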
So, what happens when you make the limit larger?
PostgreSQL has to rewrite the table. Which has 2 very important drawbacks:
- It requires an exclusive lock on the table for the duration of the operation
- In the case of non-trivial tables, it will take a considerable amount of time
These 2 points together make it (in my opinion) a no-go.
I'd rather use text without any limitation than design a database that will require (for example) shutting down the site to change the limit of an email field from 75 to 100 characters (been there, done that, cursed a lot).
So, what is the best way to limit field size – in a way that will not lock the table when increasing the limit? Or at least one that does its job without a table rewrite, as that takes too long.
First idea: a CHECK constraint. So, let's test it:
# CREATE TABLE x (y text);
CREATE TABLE

# INSERT INTO x SELECT 'depesz' FROM generate_series(1,10000000);
INSERT 0 10000000

# SELECT pg_relation_size('x'), pg_total_relation_size('x');
 pg_relation_size | pg_total_relation_size
------------------+------------------------
        362479616 |              362594304
(1 row)
OK, we have some data in it. Now, let's add a check, so we will have something to alter 🙂
# \timing
Timing is on.

# ALTER TABLE x ADD CONSTRAINT y_length CHECK ( length(y) <= 16 );
ALTER TABLE
Time: 4038.007 ms
Now, let's alter it. Unfortunately, you can't change a constraint – you have to drop it and create a new one. OK, let's play:
# BEGIN;
BEGIN
Time: 0.100 ms

*# ALTER TABLE x DROP CONSTRAINT y_length;
ALTER TABLE
Time: 0.284 ms

*# ALTER TABLE x ADD CONSTRAINT y_length CHECK ( length(y) <= 32 );
ALTER TABLE
Time: 2333.937 ms

*# commit;
COMMIT
Time: 0.642 ms
OK, it took about 2.5s in total – for a relatively small table.
What about rewrite and locks?
Well, just before I ran this transaction I ran this SQL:
# SELECT oid, relfilenode FROM pg_class WHERE relname = 'x';
  oid  | relfilenode
-------+-------------
 27238 |       27238
(1 row)
Afterwards, the numbers were the same, so the table was not rewritten. But while the transaction was going on, I checked its locks:
# SELECT * FROM pg_locks WHERE pid = 16874;
   locktype    | database | relation |  page  | tuple  | virtualxid | transactionid | classid | objid  | objsubid | virtualtransaction |  pid  |        mode         | granted
---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+-------+---------------------+---------
 relation      |    27219 |    27238 | [null] | [null] | [null]     |        [null] |  [null] | [null] |   [null] | 2/29               | 16874 | AccessExclusiveLock | t
 virtualxid    |   [null] |   [null] | [null] | [null] | 2/29       |        [null] |  [null] | [null] |   [null] | 2/29               | 16874 | ExclusiveLock       | t
 object        |    27219 |   [null] | [null] | [null] | [null]     |        [null] |    2606 |  27244 |        0 | 2/29               | 16874 | AccessExclusiveLock | t
 transactionid |   [null] |   [null] | [null] | [null] | [null]     |          2720 |  [null] | [null] |   [null] | 2/29               | 16874 | ExclusiveLock       | t
(4 rows)
Unfortunately, as you can see, this transaction obtained an AccessExclusiveLock on table x. This means that for 2.5 seconds nobody could use it. That might not sound bad, but it is – my table is relatively small. What's more, trying to get the lock automatically blocks all subsequent transactions that try to touch the table.
So, is there any other option?
Well, we could use a trigger – like this:
CREATE FUNCTION y_length_check() RETURNS TRIGGER AS $$
BEGIN
    IF length(NEW.y) > 16 THEN
        RAISE EXCEPTION 'Too long value in x.y';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER y_length
    BEFORE UPDATE OR INSERT ON x
    FOR EACH ROW
    EXECUTE PROCEDURE y_length_check();
This will of course work, but it looks like overkill. The good thing about this approach is that a limit change is instant – you just do “CREATE OR REPLACE FUNCTION”, and you're done.
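For example (my sketch, mirroring the trigger function above), raising the limit from 16 to 32 characters is just a matter of replacing the function body – no ALTER TABLE, no table scan, no AccessExclusiveLock on x:

CREATE OR REPLACE FUNCTION y_length_check() RETURNS TRIGGER AS $$
BEGIN
    -- only the limit changes; the trigger definition stays untouched
    IF length(NEW.y) > 32 THEN
        RAISE EXCEPTION 'Too long value in x.y';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;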
Is there any other way that would not require an exclusive lock on the table?
Sure, let's try to use DOMAINs:
# CREATE DOMAIN depesz_text AS text;
CREATE DOMAIN

# ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 16 );
ALTER DOMAIN

# CREATE TABLE x (y depesz_text);
CREATE TABLE

# INSERT INTO x SELECT 'depesz' FROM generate_series(1,10000000);
INSERT 0 10000000
Of course, trying to insert a too-long value will fail:
# INSERT INTO x (y) VALUES ('abcdefghhijlndlkagjhflsagfljsahflsahdflsadjhlsd');
ERROR:  value for domain depesz_text violates check constraint "max_length"
And how about changing the limit?
# BEGIN;
BEGIN
Time: 0.088 ms

*# ALTER DOMAIN depesz_text DROP CONSTRAINT max_length;
ALTER DOMAIN
Time: 0.140 ms

*# ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 32 );
ALTER DOMAIN
Time: 2052.200 ms

*# SELECT * FROM pg_locks WHERE pid = 16874;
   locktype    | database | relation |  page  | tuple  | virtualxid | transactionid | classid | objid  | objsubid | virtualtransaction |  pid  |        mode         | granted
---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+-------+---------------------+---------
 relation      |    27219 |    10988 | [null] | [null] | [null]     |        [null] |  [null] | [null] |   [null] | 2/45               | 16874 | AccessShareLock     | t
 virtualxid    |   [null] |   [null] | [null] | [null] | 2/45       |        [null] |  [null] | [null] |   [null] | 2/45               | 16874 | ExclusiveLock       | t
 relation      |    27219 |     1247 | [null] | [null] | [null]     |        [null] |  [null] | [null] |   [null] | 2/45               | 16874 | RowExclusiveLock    | t
 object        |    27219 |   [null] | [null] | [null] | [null]     |        [null] |    2606 |  27249 |        0 | 2/45               | 16874 | AccessExclusiveLock | t
 relation      |    27219 |    27250 | [null] | [null] | [null]     |        [null] |  [null] | [null] |   [null] | 2/45               | 16874 | ShareLock           | t
 transactionid |   [null] |   [null] | [null] | [null] | [null]     |          2732 |  [null] | [null] |   [null] | 2/45               | 16874 | ExclusiveLock       | t
(6 rows)
Time: 0.874 ms

*# SELECT 27250::regclass, 10988::regclass;
 regclass | regclass
----------+----------
 x        | pg_locks
(1 row)
Time: 0.191 ms

*# commit;
COMMIT
Time: 0.966 ms
As you can see, there is only a ShareLock on the table while changing the domain – this means that writes will be blocked, but selects will work. That is a huge gain compared with “ALTER TABLE” and its AccessExclusiveLock on the table, which blocked everything.
Is that all we can do? Basically – yes. We could theoretically make a check that gets its limit from a custom GUC, but that looks slow, and it is definitely error-prone, as the value can be modified per session.
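For the record, such a GUC-based check would look something like the sketch below (myapp.y_max_length is a made-up setting name, and all the caveats above apply – in particular, current_setting() errors out if the setting is not defined at all, and anybody can change it per session):

-- hypothetical custom setting, e.g. set in postgresql.conf or with:
--   SET myapp.y_max_length = '32';
ALTER TABLE x ADD CONSTRAINT y_length_guc
    CHECK ( length(y) <= current_setting('myapp.y_max_length')::int );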
So, while there is no clear winner, I believe that TEXT + DOMAIN is really good enough for most cases, and if you want truly transparent limit changes, it looks like a trigger is the only choice.
To sum it all up:
- char(n) – takes too much space when dealing with values shorter than n, and can lead to subtle errors because of adding trailing spaces, plus it is problematic to change the limit
- varchar(n) – it's problematic to change the limit in live environment
- varchar – just like text
- text – for me the winner: over the (n) data types because it lacks their problems, and over varchar because it has a distinct name
UPDATE
Based on Caleb's comment, I tested the speed of data loading for various ways of getting a length-limited text datatype.
I used this script to create test tables:
CREATE TABLE t_varchar ( field varchar(100) );

CREATE TABLE t_trigger_text ( field text );

CREATE FUNCTION ttt_field_length_trigger() RETURNS TRIGGER AS $$
BEGIN
    IF length(NEW.field) > 100 THEN
        RAISE EXCEPTION 'Too long value in t_trigger_text.field';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ttt_field_length_trigger
    BEFORE UPDATE OR INSERT ON t_trigger_text
    FOR EACH ROW
    EXECUTE PROCEDURE ttt_field_length_trigger();

CREATE TABLE t_constraint_text ( field text CHECK ( length(field) <= 100 ) );

CREATE OR REPLACE FUNCTION check_length(text) RETURNS bool AS $$
    SELECT length($1) <= 100;
$$ LANGUAGE sql;

CREATE TABLE t_function_constraint_text ( field text CHECK ( check_length(field) ) );

CREATE DOMAIN depesz_text AS text;
ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 100 );
CREATE TABLE t_domain ( field depesz_text );

CREATE DOMAIN depesz_text2 AS text;
ALTER DOMAIN depesz_text2 ADD CONSTRAINT max_length CHECK ( check_length(VALUE) );
CREATE TABLE t_function_domain ( field depesz_text2 );
and this test script:
#!/bin/bash
export TIME="%e"

for table in "t_varchar" "t_trigger_text" "t_constraint_text" "t_function_constraint_text" "t_domain" "t_function_domain"
do
    export table
    for i in 1 2 3 4 5
    do
        dropdb test 2> /dev/null
        createdb test
        psql -qAtX -d test -f create.sql
        echo "\\copy $table from /var/tmp/test/words.100.txt" | time psql -qAtX -d test
    done 2>&1 | awk '
        BEGIN{MAX=0; MIN=1000}
        ($1 > MAX) {MAX=$1}
        ($1 < MIN) {MIN=$1}
        {C++;S+=$1}
        END{printf " - %-26s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["table"], S/C, MIN, MAX}
    '
done
As you can see we have 6 different tables:
- t_varchar : basic table with varchar(100) field
- t_trigger_text : table with text field, and trigger function which checks length
- t_constraint_text : table with text field, and simple check for length
- t_function_constraint_text : table with text field, and check, but check is done using function (thanks go to Marko Tiikkaja for the idea)
- t_domain : table with domain field, and domain has simple check
- t_function_domain : table with domain field, and domain has function based check.
Both function-based constraints (t_function_constraint_text and t_function_domain) have the same great ability as t_trigger_text – changing the limit is instantaneous.
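For example (my sketch, using an illustrative new limit of 200), both t_function_constraint_text and t_function_domain pick up a new limit with a single statement – no constraint is dropped and no existing rows are re-checked:

-- every check constraint and domain that calls check_length()
-- uses the new limit immediately
CREATE OR REPLACE FUNCTION check_length(text) RETURNS bool AS $$
    SELECT length($1) <= 200;
$$ LANGUAGE sql;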
Results of the test:
- t_varchar                  : avg: 2.21s (min: 2.06, max: 2.41)
- t_trigger_text             : avg: 4.37s (min: 4.29, max: 4.46)
- t_constraint_text          : avg: 2.33s (min: 2.28, max: 2.42)
- t_function_constraint_text : avg: 2.52s (min: 2.35, max: 2.91)
- t_domain                   : avg: 2.62s (min: 2.33, max: 2.83)
- t_function_domain          : avg: 2.52s (min: 2.29, max: 2.76)
After 2 years of using PostgreSQL in our project, I have switched from varchar(n) to text completely.
My experience is that varchar(n) not only makes changing the length painful, it is also not helpful. Users are never pleased with a document title limited to 50 characters! What's the drawback if they want a title of up to 80 characters? I don't see the value in limiting such a need. One may argue: how can we show it in a report if it's over the 50-character limit? No problem – I design the report to wrap such text, and it doesn't matter.
I am very impressed with PostgreSQL. It's a very powerful RDBMS and helps a lot in developing my project, a GWT project.
Our demo project, iLegal, UAT: url=http://ilegal-uat.cloudora.net, user=user12_48, password=p@ssword
Hi Depesz,
after reading your article I've done several tests on a real-world application which I've been working on for several years. It's a sort of data warehouse that works with about 1.5 billion rows.
I have two systems with different hardware and OSs. I've done several tests varying only the table definitions. The original tables had character(n) and character varying(n) columns; in the new tables the same columns were changed to text. So basically on each system I had two schemas that differed only in the text/non-text columns. Over those tables I created partitions and indexes, clustered and vacuumed them. I didn't use triggers or domains, so my scenario is simpler than yours and focuses only on pure text vs non-text string definitions.
PostgreSQL 9.0.13 (OS X Server)
——————————————-
I saw that the data loading process (COPY, INDEX, CLUSTER and VACUUM) was ~4% faster using text, but my transactions (which involve partitions and many indexes over string columns) were ~12% slower compared to the non-text tables.
PostgreSQL 9.4.1 (Ubuntu)
——————————————-
Results are promising and aligned with your findings. The loading-process time difference is the same as on PostgreSQL 9.0, but transactions are now ~1% faster than with the non-text tables.
What about size? Indexes are smaller on both systems, but the overall size gain is negligible (a few MB against 15GB of tables). This may only slightly increase the probability of fitting the indexes in RAM.
I hope these findings may help others.
Pietro
Now we have a description of this question in the PostgreSQL manual: http://www.postgresql.org/docs/9.1/static/datatype-character.html