UPDATE
Please also read about this change in Pg 9.1 and this change in Pg 9.2, as they explain that, starting with Pg 9.1, some of the limitations listed in this post no longer apply.
END OF UPDATE
Fight!
But more seriously – people tend to use various data types, and there have been some myths about them, so let's see what it really boils down to.
First of all – all these data types are stored internally using the same C data structure – varlena.
Thanks to this, we can be nearly sure that there are no performance differences. But are there really none? Let's test.
First, let's create a simple table, fill it with 500k rows, and then add an index. And also – create the table with the index from the start, and only then load the data.
For generating test data, I wrote this simple script:
#!/usr/bin/perl -w
use strict;
print gen_random_word(100)."\n" for 1..500_000;
exit;

sub gen_random_word {
    my $len = shift;
    my @chars = ("a".."z", "A".."Z", "0".."9");
    return join '', map { $chars[ rand @chars ] } 1 .. $len;
}
And ran it like this:
$ perl ~/z.pl > words.100.txt
$ cut -b 1-50 words.100.txt > words.50.txt
$ cut -b 1-10 words.50.txt > words.10.txt
$ ls -l *.txt
-rw-r--r-- 1 depesz users 50500000 Mar  1 14:59 words.100.txt
-rw-r--r-- 1 depesz users  5500000 Mar  1 14:59 words.10.txt
-rw-r--r-- 1 depesz users 25500000 Mar  1 14:59 words.50.txt
As you can see we have 3 test data files – with “words" of 10, 50 and 100 characters. Each datafile contains 500k rows.
Now, let's test the data load using this script:
#!/bin/bash
export TIME="%e"

echo "Testing of: create table, load data and create index."
for LEN in 10 50 100
do
    echo " - testing word length $LEN"
    for DATATYPE in "char($LEN)" "varchar($LEN)" "varchar" "text"
    do
        export DATATYPE
        for i in 1 2 3 4 5
        do
            dropdb test 2> /dev/null
            createdb test
            psql -d test -qAtX -c "create table test ( field $DATATYPE )"
            (
                echo "\\copy test from /var/tmp/test/words.$LEN.txt"
                echo "CREATE INDEX q on test (field)"
            ) | time psql -qAtX -d test
        done 2>&1 | awk '
            BEGIN{MAX=0; MIN=1000}
            ($1 > MAX) {MAX=$1}
            ($1 < MIN) {MIN=$1}
            {C++;S+=$1}
            END{printf " - %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
        '
    done
done

echo "Testing of: create table with index and then load data."
for LEN in 10 50 100
do
    echo " - testing word length $LEN"
    for DATATYPE in "char($LEN)" "varchar($LEN)" "varchar" "text"
    do
        export DATATYPE
        for i in 1 2 3 4 5
        do
            dropdb test 2> /dev/null
            createdb test
            psql -d test -qAtX -c "create table test ( field $DATATYPE )"
            psql -d test -qAtX -c "create index q on test (field)"
            (
                echo "\\copy test from /var/tmp/test/words.$LEN.txt"
            ) | time psql -qAtX -d test
        done 2>&1 | awk '
            BEGIN{MAX=0; MIN=1000}
            ($1 > MAX) {MAX=$1}
            ($1 < MIN) {MIN=$1}
            {C++;S+=$1}
            END{printf " - %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
        '
    done
done
Basically, this script will test data loading 5 times for each datatype and each word length, using 2 methods:
- Method #1:
CREATE TABLE test ( field $DATATYPE );
copy test (field) FROM '/var/tmp/test/words.$LEN.txt';
CREATE INDEX q ON test (field);
- Method #2:
CREATE TABLE test ( field $DATATYPE );
CREATE INDEX q ON test (field);
copy test (field) FROM '/var/tmp/test/words.$LEN.txt';
The script might look complicated, but it's not really. And its output is even simpler:
Testing of: create table, load data and create index.
 - testing word length 10
 - char(10)     : avg:    4.34s (min: 4.25, max: 4.41)
 - varchar(10)  : avg:    4.17s (min: 4.09, max: 4.24)
 - varchar      : avg:    4.23s (min: 4.20, max: 4.27)
 - text         : avg:    4.21s (min: 4.19, max: 4.22)
 - testing word length 50
 - char(50)     : avg:    5.67s (min: 5.54, max: 5.80)
 - varchar(50)  : avg:    5.40s (min: 5.26, max: 5.62)
 - varchar      : avg:    5.41s (min: 5.35, max: 5.58)
 - text         : avg:    5.47s (min: 5.34, max: 5.65)
 - testing word length 100
 - char(100)    : avg:    7.83s (min: 7.72, max: 7.99)
 - varchar(100) : avg:    7.65s (min: 7.51, max: 7.81)
 - varchar      : avg:    7.64s (min: 7.48, max: 7.78)
 - text         : avg:    7.62s (min: 7.47, max: 7.78)
Testing of: create table with index and then load data.
 - testing word length 10
 - char(10)     : avg:    5.03s (min: 4.95, max: 5.07)
 - varchar(10)  : avg:    4.90s (min: 4.80, max: 5.02)
 - varchar      : avg:    4.99s (min: 4.87, max: 5.09)
 - text         : avg:    4.90s (min: 4.82, max: 5.02)
 - testing word length 50
 - char(50)     : avg:   12.56s (min: 11.33, max: 14.04)
 - varchar(50)  : avg:   12.35s (min: 11.31, max: 14.07)
 - varchar      : avg:   11.96s (min: 10.93, max: 13.09)
 - text         : avg:   11.45s (min: 10.30, max: 12.41)
 - testing word length 100
 - char(100)    : avg:   57.01s (min: 54.35, max: 59.11)
 - varchar(100) : avg:   55.12s (min: 53.91, max: 57.11)
 - varchar      : avg:   58.38s (min: 54.46, max: 62.88)
 - text         : avg:   55.52s (min: 52.73, max: 58.43)
As you can see, the times are very similar – there are differences between data types, but there is no clear “winner", and I'm much more inclined to attribute these differences to random fluctuations of load on my machine than to anything else.
So, we know that storing the data takes the same time.
What about searching?
So, I recreated the big tables:
CREATE TABLE test_char ( field CHAR(100) );
\copy test_char FROM /var/tmp/test/words.100.txt
CREATE INDEX q_char ON test_char (field);

CREATE TABLE test_varchar ( field VARCHAR(100) );
\copy test_varchar FROM /var/tmp/test/words.100.txt
CREATE INDEX q_varchar ON test_varchar (field);

CREATE TABLE test_varcharu ( field VARCHAR );
\copy test_varcharu FROM /var/tmp/test/words.100.txt
CREATE INDEX q_varcharu ON test_varcharu (field);

CREATE TABLE test_text ( field text );
\copy test_text FROM /var/tmp/test/words.100.txt
CREATE INDEX q_text ON test_text (field);
And wrote simple script to test searches:
#!/bin/bash
export TIME="%e"
for DATATYPE in char varchar varcharu text
do
    export DATATYPE
    for i in 1 2 3 4 5
    do
        for j in $( seq 1 20 )
        do
            echo "SELECT COUNT(*) FROM test_$DATATYPE where field = any('{"$( shuf words.100.txt | head -n 50 | xargs echo | tr ' ' , )"}');"
        done | time psql -qAtX -d test > /dev/null
    done 2>&1 | awk '
        BEGIN{MAX=0; MIN=1000}
        ($1 > MAX) {MAX=$1}
        ($1 < MIN) {MIN=$1}
        {C++;S+=$1}
        END{printf "- %-12s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["DATATYPE"], S/C, MIN, MAX}
    '
done
In case you can't read it – for each data type, it runs the test 5 times. Each test consists of 20 selects, each getting 50 rows from the test table, using (of course) an index scan.
Results:
- char         : avg:    2.47s (min: 2.44, max: 2.49)
- varchar      : avg:    2.50s (min: 2.48, max: 2.51)
- varcharu     : avg:    2.48s (min: 2.47, max: 2.50)
- text         : avg:    2.49s (min: 2.44, max: 2.52)
This (I believe) proves that performance of all 4 datatypes is the same.
So, what other points might there be when considering which datatype to use?
Ease of use? In this area char(n) gets really low marks. Why? Can you spot the problem?
# CREATE TABLE x (y CHAR(15));
CREATE TABLE
# INSERT INTO x(y) VALUES ('depesz');
INSERT 0 1
# SELECT * FROM x WHERE y LIKE '%sz';
 y
---
(0 rows)
Why wasn't the record selected? The reason is simple: char(n) values are right-padded with spaces.
This also means that this will work:
# SELECT * FROM x WHERE y = 'depesz ';
        y
-----------------
 depesz
(1 row)
Given this – remember that char(n) will actually use more disk space than necessary whenever your strings are shorter than “n", because it right-pads them to the required length.
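A quick way to see this is to compare pg_column_size() of the same word stored as different types (exact numbers depend on the Pg version, since newer versions use shortened varlena headers):

SELECT pg_column_size('depesz'::char(15))    AS char_size,
       pg_column_size('depesz'::varchar(15)) AS varchar_size,
       pg_column_size('depesz'::text)        AS text_size;

The char(15) value has to carry its padding spaces, so it will always be the largest of the three.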
So, what about varchar, varchar(n) and text.
Varchar and text are the same, so we can treat them as such. But to avoid confusion with varchar(n), and because “text" is simply shorter (as a name) – I prefer text.
So, we're left with 2 data types: varchar(n) and text.
The obvious benefit of varchar(n) is that it has a built-in size limit.
Which is all cool, until you have to change this limit. Which happens a lot.
Of course, one can say, you can:
ALTER TABLE xxx ALTER COLUMN yyy TYPE varchar(new_limit);
Well, first – let me say that I am only discussing making the limit larger here. If the new limit has to be smaller than the previous one (I've never seen such a case, but it's technically possible) – the table has to be scanned to make sure that all values fit within the new limit.
So, what happens when you make the limit larger?
PostgreSQL has to rewrite the table. Which has 2 very important drawbacks:
- It requires an exclusive lock on the table for the duration of the operation
- For non-trivial tables, it takes a considerable amount of time
These 2 points together make it (in my opinion) a no-go.
I'd rather use text without any limitation than design a database that will require (for example) a site shutdown to change the limit of an email field from 75 to 100 characters (been there, done that, cursed a lot).
So, what is the best way to limit field size – in a way that will not lock the table when increasing the limit? Or at least – one that will do its job without a table rewrite, as that takes too long.
First idea: a CHECK constraint. So, let's test it:
# CREATE TABLE x (y text);
CREATE TABLE
# INSERT INTO x SELECT 'depesz' FROM generate_series(1,10000000);
INSERT 0 10000000
# SELECT pg_relation_size('x'), pg_total_relation_size('x');
 pg_relation_size | pg_total_relation_size
------------------+------------------------
        362479616 |              362594304
(1 row)
OK, we have some data in it. Now, let's add a check, so we will have something to alter 🙂
# \timing
Timing is on.
# ALTER TABLE x ADD CONSTRAINT y_length CHECK (length(y) <= 16);
ALTER TABLE
Time: 4038.007 ms
Now, let's alter it. Unfortunately, you can't change a constraint – you have to drop it and create a new one. OK, let's play:
# BEGIN;
BEGIN
Time: 0.100 ms
*# ALTER TABLE x DROP CONSTRAINT y_length;
ALTER TABLE
Time: 0.284 ms
*# ALTER TABLE x ADD CONSTRAINT y_length CHECK (length(y) <= 32);
ALTER TABLE
Time: 2333.937 ms
*# commit;
COMMIT
Time: 0.642 ms
OK, it took about 2.5s in total – for a relatively small table.
What about rewrite and locks?
Well, just before I ran this transaction I ran this SQL:
# SELECT oid, relfilenode FROM pg_class WHERE relname = 'x';
  oid  | relfilenode
-------+-------------
 27238 |       27238
(1 row)
Afterwards, the numbers are the same, so the table was not rewritten. But, while the transaction was going on, I checked its locks:
# SELECT * FROM pg_locks WHERE pid = 16874;
   locktype    | database | relation |  page  | tuple  | virtualxid | transactionid | classid | objid  | objsubid | virtualtransaction |  pid  |        mode         | granted
---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+-------+---------------------+---------
 relation      |    27219 |    27238 | [NULL] | [NULL] | [NULL]     |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/29               | 16874 | AccessExclusiveLock | t
 virtualxid    |   [NULL] |   [NULL] | [NULL] | [NULL] | 2/29       |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/29               | 16874 | ExclusiveLock       | t
 object        |    27219 |   [NULL] | [NULL] | [NULL] | [NULL]     |        [NULL] |    2606 |  27244 |        0 | 2/29               | 16874 | AccessExclusiveLock | t
 transactionid |   [NULL] |   [NULL] | [NULL] | [NULL] | [NULL]     |          2720 |  [NULL] | [NULL] |   [NULL] | 2/29               | 16874 | ExclusiveLock       | t
(4 rows)
Unfortunately, as you can see, this transaction obtained an AccessExclusiveLock on table x. This means that for 2.5 seconds nobody could use the table. That doesn't sound bad, does it? Unfortunately, it is bad – my table is relatively small, so on a bigger table the lock would be held much longer. What's more – a transaction waiting to get this lock automatically blocks all subsequent transactions trying to reach the table.
So, is there any other option?
Well, we could use a trigger – like this:
CREATE FUNCTION y_length_check() RETURNS TRIGGER AS $$
BEGIN
    IF length(NEW.y) > 16 THEN
        RAISE EXCEPTION 'Too long value in x.y';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER y_length BEFORE UPDATE OR INSERT ON x FOR EACH ROW EXECUTE PROCEDURE y_length_check();
Which will of course work, but it looks like overkill. The good thing about this approach is that the limit change is instant – you just do “CREATE OR REPLACE FUNCTION", and you're done.
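For example, raising the limit to 32 characters is just:

CREATE OR REPLACE FUNCTION y_length_check() RETURNS TRIGGER AS $$
BEGIN
    IF length(NEW.y) > 32 THEN
        RAISE EXCEPTION 'Too long value in x.y';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;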
Is there any other way that would not require an exclusive lock on the table?
Sure, let's try to use DOMAINs:
# CREATE DOMAIN depesz_text AS text;
CREATE DOMAIN
# ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 16 );
ALTER DOMAIN
# CREATE TABLE x (y depesz_text);
CREATE TABLE
# INSERT INTO x SELECT 'depesz' FROM generate_series(1,10000000);
INSERT 0 10000000
Of course, trying to insert a value that is too long will fail:
# INSERT INTO x (y) VALUES ('abcdefghhijlndlkagjhflsagfljsahflsahdflsadjhlsd');
ERROR:  value for domain depesz_text violates check constraint "max_length"
And how about changing limit?
# BEGIN;
BEGIN
Time: 0.088 ms
*# ALTER DOMAIN depesz_text DROP CONSTRAINT max_length;
ALTER DOMAIN
Time: 0.140 ms
*# ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 32 );
ALTER DOMAIN
Time: 2052.200 ms
*# SELECT * FROM pg_locks WHERE pid = 16874;
   locktype    | database | relation |  page  | tuple  | virtualxid | transactionid | classid | objid  | objsubid | virtualtransaction |  pid  |        mode         | granted
---------------+----------+----------+--------+--------+------------+---------------+---------+--------+----------+--------------------+-------+---------------------+---------
 relation      |    27219 |    10988 | [NULL] | [NULL] | [NULL]     |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/45               | 16874 | AccessShareLock     | t
 virtualxid    |   [NULL] |   [NULL] | [NULL] | [NULL] | 2/45       |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/45               | 16874 | ExclusiveLock       | t
 relation      |    27219 |     1247 | [NULL] | [NULL] | [NULL]     |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/45               | 16874 | RowExclusiveLock    | t
 object        |    27219 |   [NULL] | [NULL] | [NULL] | [NULL]     |        [NULL] |    2606 |  27249 |        0 | 2/45               | 16874 | AccessExclusiveLock | t
 relation      |    27219 |    27250 | [NULL] | [NULL] | [NULL]     |        [NULL] |  [NULL] | [NULL] |   [NULL] | 2/45               | 16874 | ShareLock           | t
 transactionid |   [NULL] |   [NULL] | [NULL] | [NULL] | [NULL]     |          2732 |  [NULL] | [NULL] |   [NULL] | 2/45               | 16874 | ExclusiveLock       | t
(6 rows)
Time: 0.874 ms
*# SELECT 27250::regclass, 10988::regclass;
 regclass | regclass
----------+----------
 x        | pg_locks
(1 row)
Time: 0.191 ms
*# commit;
COMMIT
Time: 0.966 ms
As you can see – there is only a ShareLock on the table while changing the domain – this means that writes will not work, but selects will. Which is a huge gain in comparison with “ALTER TABLE" and its AccessExclusiveLock on the table – which blocked everything.
Is that all we can do? Basically – yes. We could theoretically make a check that gets the limit from a custom GUC, but it looks slow, and it is definitely error prone – as the value can be modified per session.
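Just to sketch what such a GUC-based check could look like (the GUC name here is made up, and on older versions custom GUCs additionally need custom_variable_classes in postgresql.conf):

-- not recommended – shown only to illustrate the idea; assumes the
-- (made-up) GUC myapp.y_max_length is set, e.g. in postgresql.conf
ALTER TABLE x ADD CONSTRAINT y_length
    CHECK ( length(y) <= current_setting('myapp.y_max_length')::int );
-- any session can then do: SET myapp.y_max_length = 5;
-- and silently change the effective limit for its own inserts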
So, while there is no clear winner, I believe that TEXT+DOMAIN is really good enough for most cases, and if you want fully transparent limit changes – it looks like a trigger is the only choice.
To sum it all up:
- char(n) – takes too much space when dealing with values shorter than n, and can lead to subtle errors because of adding trailing spaces, plus it is problematic to change the limit
- varchar(n) – it's problematic to change the limit in live environment
- varchar – just like text
- text – for me the winner – over the (n) data types because it lacks their problems, and over varchar – because it has a distinct name
UPDATE
Based on Caleb's comment, I tested the speed of data loading for various ways of getting a text datatype with limited length.
I used this script to create test tables:
CREATE TABLE t_varchar ( field VARCHAR(100) );

CREATE TABLE t_trigger_text ( field TEXT );
CREATE FUNCTION ttt_field_length_trigger() RETURNS TRIGGER AS $$
BEGIN
    IF length(NEW.field) > 100 THEN
        RAISE EXCEPTION 'Too long value in t_trigger_text.field';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER ttt_field_length_trigger BEFORE UPDATE OR INSERT ON t_trigger_text FOR EACH ROW EXECUTE PROCEDURE ttt_field_length_trigger();

CREATE TABLE t_constraint_text ( field TEXT CHECK ( length(field) <= 100 ) );

CREATE OR REPLACE FUNCTION check_length (TEXT) RETURNS bool AS $$
    SELECT length($1) <= 100;
$$ LANGUAGE sql;
CREATE TABLE t_function_constraint_text ( field TEXT CHECK ( check_length(field) ) );

CREATE DOMAIN depesz_text AS TEXT;
ALTER DOMAIN depesz_text ADD CONSTRAINT max_length CHECK ( length(VALUE) <= 100 );
CREATE TABLE t_domain ( field depesz_text );

CREATE DOMAIN depesz_text2 AS TEXT;
ALTER DOMAIN depesz_text2 ADD CONSTRAINT max_length CHECK ( check_length(VALUE) );
CREATE TABLE t_function_domain ( field depesz_text2 );
and this test script:
#!/bin/bash
export TIME="%e"
for table in "t_varchar" "t_trigger_text" "t_constraint_text" "t_function_constraint_text" "t_domain" "t_function_domain"
do
    export table
    for i in 1 2 3 4 5
    do
        dropdb test 2> /dev/null
        createdb test
        psql -qAtX -d test -f create.sql
        echo "\\copy $table from /var/tmp/test/words.100.txt" | time psql -qAtX -d test
    done 2>&1 | awk '
        BEGIN{MAX=0; MIN=1000}
        ($1 > MAX) {MAX=$1}
        ($1 < MIN) {MIN=$1}
        {C++;S+=$1}
        END{printf " - %-26s : avg: %7.2fs (min: %.2f, max: %.2f)\n", ENVIRON["table"], S/C, MIN, MAX}
    '
done
As you can see we have 6 different tables:
- t_varchar : basic table with varchar(100) field
- t_trigger_text : table with text field, and trigger function which checks length
- t_constraint_text : table with text field, and simple check for length
- t_function_constraint_text : table with text field, and check, but check is done using function (thanks go to Marko Tiikkaja for the idea)
- t_domain : table with domain field, and domain has simple check
- t_function_domain : table with domain field, and domain has function based check.
Both function-based constraints (t_function_constraint_text and t_function_domain) have the same great ability as t_trigger_text – changing the limit is instantaneous.
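For example, raising the limit for both of them to 150 characters is just a matter of:

CREATE OR REPLACE FUNCTION check_length (TEXT) RETURNS bool AS $$
    SELECT length($1) <= 150;
$$ LANGUAGE sql;

(Just like with the trigger, rows that are already in the table are not re-checked.)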
Results of the test:
 - t_varchar                  : avg:    2.21s (min: 2.06, max: 2.41)
 - t_trigger_text             : avg:    4.37s (min: 4.29, max: 4.46)
 - t_constraint_text          : avg:    2.33s (min: 2.28, max: 2.42)
 - t_function_constraint_text : avg:    2.52s (min: 2.35, max: 2.91)
 - t_domain                   : avg:    2.62s (min: 2.33, max: 2.83)
 - t_function_domain          : avg:    2.52s (min: 2.29, max: 2.76)
Nice. I was always a fan of DOMAINs, now I have one more reason to use them. This decoupling of constraints from data is a great feature. And it’s standards-compliant, so should be portable to other databases.
You could also use a function in the CHECK constraint, e.g. CHECK(foo_bar_check(bar)). It’s not as expensive as a trigger and you can easily replace the function when you want to change the limit.
Nice to see that. It’s a good comparison and has a good point of view. I also prefer the text data type.
In SQL Server they have deprecated the “text” datatype and advise use of “varchar(max)” instead.
http://msdn.microsoft.com/en-us/library/ms187993.aspx
I just wish Oracle would do the same. The lob datatypes have way too many restrictions compared with a varchar2.
I think types like “char” and “text” expose implementation details.
I’ve been bitten too many times by needing to change the size of a varchar(n) as well. I've been going with text for a while now based on mailing list info and some more basic measurements. Glad to see it looks like the right call.
I guess I’m the only one that disapproves of the overuse of text.
First of all, the fact that your table becomes unusable while changing the length is something I consider a bit of a problem and one of my pet peeves with PostgreSQL. I don’t have this much of an issue with other databases, so I consider it something that must be improved on in PostgreSQL. Cause I get asked that question all the time — why is it so damn slow to increase a field in PostgreSQL?
Secondly, hmm, I use tools to view data that really consider text and varchar(x) to be very different animals. They rely on that to determine the width to show a field and also (gasp!) whether a field is searchable or not. I know it’s sad, but we must live in a world that has to support other databases in addition to PostgreSQL. In most other databases – text storage is very different from varchar(n), and so generic tools such as ODBC drivers and so forth treat them differently. Not to mention I just like the fact that I can generate a create table statement and have the meta data quickly there to know what the maximum size is I can expect in a table. I like the genericness of that approach to seeing max length.
@Regina:
You’re most likely not the only one that doesn’t like “text”.
Compatibility (whether it will work with other databases) was one of the angles that I wanted to write about, but then I realized that in the most important (in my, not so enlightened, opinion) database – Oracle – it’s also not varchar.
As for using tools that have problems with text, and don’t have it with varchar – well, whatever I write expresses my opinions, and, thanks $DEITY, I don’t have to deal with such tools/interfaces/whatever.
As with every “rule” in every field – there are exceptions. What I write is just general guideline (according to me) that can be ignored, bent, or ridiculed 🙂
Nice to know about the ‘domain’ trick, thanks.
But I’m surprised postgres isn’t smart enough to take a SharedLock instead of an ExclusiveLock when altering the table in this manner. Is there a good reason for that, or could postgres use a SharedLock in the future? Failing that, it would be good to update the “character types” documentation. I guess 99% of people use varchar(n) instead of domain/trigger, because they don’t know any better.
@Hubert,
Yap I know. What would be really cool though, is to have my cake and eat it too. You can have your text with your domains, but I’m lazy — I just want to limit the amount of text people put in my fields (very easily) and have my meta data view show it.
If the performance is exactly the same, then why can’t PostgreSQL internally do the domain thing but publish the meta data I wanna see and other tools rely on? Then it wouldn’t need to rewrite tables, right?
Really I don’t care about implementation details — I just like my meta data the way I like my meta data and don’t want to go sniffing in domain logic to find it. My ER diagrammer lists the sizes the way I like it. It’s so important for ETL. Okay, now you can ridicule me too :).
@Regina – well, I think that Pg should have an even simpler way to update max length – after all – since the binary format of ‘xx’::varchar(10) and ‘xx’::varchar(1000) is (afaik) the same, it should be (theoretically) enough to update the catalog and set the new max length (of course only in the case when the new length is greater than the previous one). But I’m not sure if it will ever happen.
Can ALTER TABLE actually use a temporary index on length(value) for the constraint check? Never tried it, actually…
And what about changing a varchar(n) limit in the system catalog directly for the needed table column? Increasing the limit is always safe, actually. Do you see any problems with that, other than it not being good to play with the catalog? What ALTER TABLE actually does is check that the new limit it is going to write into the catalog is not violated by the current data. And it probably drops current execution plans.
I think varchar(<=255) should take a bit less space than text (text uses 4 bytes for length, varchar(<=255) should use 1 byte).
@Tomas: afaik, it’s not true in Pg.
@depesz,
The real issue around constraints which you kinda touched on is that “must be length n or shorter” isn’t a real constraint, or at least hasn’t been one any time I’ve seen so far. Real constraints are more complicated. Perhaps pointing this out in your DOMAIN example would help.
Maybe I missed it… or maybe, like the difference between the data types, it’s negligible. But what, if any, difference is there for inserts and updates with the different kinds of constraints? I’m kinda a fan of thinking that varchar(x) will be the fastest, and that inserts into my table are probably being done more often than I’m increasing that limit. If there is a difference, I may want to use varchar, or even char if I know the width is unlikely to change for a very long time (like a zip code, which is not a number). Basically you’ve shown it’s a pain if you ever need to change… yeah, ok, but what if it adds a lot of overhead to a write-heavy server?
Also worth noting that although this code works in Postgres, many less mature databases don’t support this part of the standard (and by “many” I’m looking at MySQL). CHECKs and DOMAINs are simply not an option there at this time, so if you’re building something to support both…
Post updated to include test suggested by Caleb.
@David:
Sure. I know that usually you should have more complex constraints, but in here, I was just testing simplest possible case.
@Caleb:
Sorry, but MySQL compliance is not high on my priority list.
@Valentine Gogichashvili
It’s actually not always safe. You have to do a bit more if your change pushes the table from not needing a toast table to needing one. Apparently talking PG into creating the toast table for you at the same time you change the catalog is not so simple in later versions (although not impossible). And if you don’t realize you need to do it, things get all mucked up.
Great post as always. Now I won’t have to explain why char(x) is a bad idea – I will point here if it’s necessary.
Great post – thanks for taking the time to share with everyone. Not that it matters, since you’re creating words with random characters anyway, but for those trying to follow along, I believe
my @chars = ("a".."A", "z".."Z", "0".."9");
should really be
my @chars = ("a".."z", "A".."Z", "0".."9");
@Pete:
You’re absolutely right. Fixing it now. Thanks.
Hi,
could you provide the specs of the libraries, operating system and hardware that you are using for the tests?
Great! Thanks for all the hard work 😉
@Charles:
The system is Arch Linux 32 bit, updated daily – that also basically answers the libraries question.
Hardware – i7 920, 12GB of RAM, and 2 (standalone, not RAID) 300GB VelociRaptor disks.
Is default storage the same for char() and varchar/text? So far I thought that storage for fixed size char is faster than varchar/text. Is it relevant at all?
@TEOŚ:
Have you read the blogpost? I mean – it’s one of the first things: I stress that the binary formats are exactly the same, and I test it to show that there is no visible performance difference.
OK, just wanted to make sure. Thx.
Formally you could also use RULE. It should be faster than TRIGGER and comparable to all other options.
Rules are a very bad idea. If at any point in time you think “well, a rule can do it just fine”, it means you’re in a very bad situation, and you should definitely reconsider.
I don’t see justification for such a general statement (yet). Recently I adapted single-thread code to parallel computation in a nearly real-time app in such a way that the original single-threaded version is still usable. Rules work just fine in integration of data streams, despite their limited capabilities. The syntax of RULE is less complex than that of a trigger (define function + define trigger) and the execution time seems OK at first glance (I didn’t test in-depth).
Concerning the speed it seems that rules are actually faster than triggers, at least in the specific situation described: http://filip.rembialkowski.net/postgres-partitioning-performance-rules-vs-triggers/
Tell us why rules are bad, please. Otherwise I don’t see any technical justification for the statement.
@depesz “MySQL compliance” is an oxymoron.
You can increase the max length of a varchar column without re-writing the table … but it requires modifying the system tables.
It’s instant, it’s safe (if you only increase), and I’ve done it many times to avoid having to re-write a TB of data.
@Scott:
No, it’s not safe. I don’t know the details (I know it involves creation of the related toast table, but that’s about it), but you can ask RhodiumToad on IRC about what can go wrong with it.
Besides – touching internal pg data seems like a very bad idea anyway.
Is there a reason, when doing create domain, not to do
CREATE DOMAIN foo AS VARCHAR(100);
?
@Caleb:
hmm… – because then you can’t change the limit at all? You can’t change the base type of a domain.
That’s a limitation I was not aware of.
PostgreSQL 9.1 introduced a new feature to alleviate some pain. I quote the release notes here:
http://www.postgresql.org/docs/current/interactive/release-9-1.html#AEN107778
> Allow ALTER TABLE … SET DATA TYPE to avoid table rewrites in appropriate cases (Noah Misch, Robert Haas)
> For example, converting a varchar column to text no longer requires a rewrite of the table. However, increasing the length constraint on a varchar column still requires a table rewrite.
Removes the advantage of the domain solution in v9.1, doesn’t it? Better go with text + constraint now?
@Erwin:
Personally, I go with straight-up text, with no limitation.
If you positively need a limitation, then using varchar(x) would be a good idea now, or text, with a limitation based on a check constraint that runs a function().
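Roughly, on 9.1+, migrating away from varchar(x) could then look like this (table, column and limit here are just made-up examples):

-- per the release note quoted above, this no longer rewrites the table on 9.1+:
ALTER TABLE client ALTER COLUMN email TYPE text;
-- the limit then lives in a function that can be replaced at any time
-- (note: adding the check still scans the whole table once):
CREATE OR REPLACE FUNCTION email_length_ok(text) RETURNS bool AS $$
    SELECT length($1) <= 100;
$$ LANGUAGE sql;
ALTER TABLE client ADD CONSTRAINT email_ok CHECK ( email_length_ok(email) );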
I go with just text in 99% of the cases, too.
Where a limitation is needed, varchar(n) still seems a poor choice (a change of limit still requires a table rewrite and would break views, etc.).
I was only considering text + constraint vs. text-domain with constraint, and whether the changes in 9.1 affect the advantage of the domain – less aggressive lock while changing the constraint. Maybe I just have to run some tests.
Thanks for the research. I was looking for performance differences between varchar and text and you showed me that there is no real difference.
Your post has a technical point of view. For the decision which datatype to use, I would however consider first functional requirements. The real problem for me with constraints (domains or varchar(n) alike) is that they come in too late. Functional data corruption and SQL-injection are the first concerns. The second requires input validation before any execution of SQL. The first might be the reason why ALTER TABLE puts an exclusive lock on the table.
Hence I prefer the use of ALTER TABLE instead of domains, because alteration of a domain does not affect rows already in the database. That means that, with domains, the same column can be used for different types of functional data. That makes it very hard to understand reports based on those columns.
You can change varchar(n) on the fly by hacking the catalog:
http://sniptools.com/databases/resize-a-column-in-a-postgresql-table-without-changing-data
Why PG doesn’t do this itself when you do an ALTER TABLE, I don’t know.
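For reference, the hack from that link boils down to something like the following – shown only for completeness, since, as discussed above, poking the catalogs directly is risky (atttypmod for varchar(n) stores n plus 4 bytes of header overhead):

-- DANGEROUS: direct catalog edit, only for *increasing* the limit;
-- can leave things mucked up (e.g. a missing toast table), as noted earlier
UPDATE pg_attribute
   SET atttypmod = 100 + 4        -- new limit: 100 characters
 WHERE attrelid = 'x'::regclass
   AND attname = 'y';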
Excellent article, and wonderful to see opinion backed by data.
In your discussion of using DOMAIN+CONSTRAINT to flexibly limit TEXT length I note your adverse comment about the time taken to check a large table if the constraint limit is changed.
From the PostgreSQL 9.2 documentation I note:
ALTER DOMAIN
ADD domain_constraint [ NOT VALID ]
These checks can be suppressed …
Newly inserted or updated rows are always checked…
Seems to be the best of both worlds… assuming the existing data is still valid it does not need checking; and, all newly entered/altered data, i.e., the reason for the design change, will be checked against the updated constraint.
Untested: Could this be an instantaneous ‘no locks’ solution?
On the general question of text vs. varchar(n) I infer from the following that at least Joe Celko favours setting lower limits in suitable (most?) contexts:
These large sizes tend to invite bad data. You give someone … an NVARCHAR(255) column and eventually it will get a Buddhist sutra in Chinese Unicode.
[http://joecelkothesqlapprentice.blogspot.com.au/2007/06/db-table-design-question.html]
@Gavan:
While using invalid checks is an interesting hack, I wouldn’t really use it in production.
Of course – the fact that you can have invalid checks, and later on validate them makes changing constraints easier.
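On 9.2 the sequence Gavan describes would look more or less like this (domain and constraint names taken from the post; the new limit is just an example):

-- add the new, wider limit without scanning existing rows:
ALTER DOMAIN depesz_text ADD CONSTRAINT max_length_new CHECK ( length(VALUE) <= 36 ) NOT VALID;
-- drop the old, narrower one:
ALTER DOMAIN depesz_text DROP CONSTRAINT max_length;
-- and, whenever convenient, optionally validate the new one:
ALTER DOMAIN depesz_text VALIDATE CONSTRAINT max_length_new;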
As for Celko. Well. I appreciate a lot of stuff that he did and does, but I definitely don’t agree with him on many things (starting from favorite structure for trees).
Thanks for the ‘interesting hack’ compliment. 🙂
I’m unsure why you have reservations: likely I’m reading the docs differently and/or expressing myself badly.
The way I read the specification the checks are not ‘invalid’ (despite the semantics saying this), rather the checks are not applied to existing data, only new data.
If the original domain text had a constraint of 24, and design changes to 36, then it is provably certain that the new constraint (36) will not invalidate any existing data.
Importantly, as you mention, it would be good to avoid the wasted effort of a big production table getting rechecked for existing valid data, i.e., data that had already passed the more restrictive constraint.
Obviously this is not the case for a 36->24 design change. But moving the limit down is a much bigger problem that just checking constraints, and it’s almost certainly never going to happen, i.e., not a likely problem ‘in production’.
I would see the sequence as follows.
1. Implementation: text field with unqualified length constraint of 24, with the belief/hope this will never change.
2. Transition: (Design change needed) Add the qualified with “NOT VALID” constraint for 36, i.e., both constraints active, so no invalid data can sneak in.
3. Complete: drop the original constraint of 24 so the longer data can now enter the table.
4. Ongoing: (whenever design changes) repeat steps 2 & 3.
As for Celko and trees I can see the place for aesthetics in the selection of one’s trees. At least until application specific performance data is collected. 🙂
Great job. Thanks for your cool work. I am moving from MySQL to PostgreSQL. According to the test, there is no need to expect that char would outperform varchar. But I still prefer varchar to text since it makes the metadata more understandable.
Since I think we were talking about performance in this blog post, I find it interesting that your last set of numbers does seem to show that varchar is the fastest on average.
                             Avg Time   % difference
t_varchar                        2.21             0%
t_trigger_text                   4.37            98%
t_constraint_text                2.33             5%
t_function_constraint_text       2.52            14%
t_domain                         2.62            19%
t_function_domain                2.52            14%
So yes, for small row counts this wouldn’t make much difference, but for large table sizes 5% matters.
@mattman:
please note that it measures not the speed of the datatype, but of the constraint.