The love of Data, Database Engineering, Architecture, Entrepreneurship, and other assorted bits
18 April 2008
So which one is a better use of space given that an application uses 32bit integers? In PostgreSQL, the int column takes only 4 bytes, and the bigint takes 8 bytes. So one would think there is significant space savings to be had by using int. In the real world with disk space costs where they are at, it just does not matter in a practical sense. Furthermore, if you are running on an <a href=http://en.wikipedia.org/wiki/Extent_(file_systems)>extent based filesystem</a> like ext3 or vxfs, then you really don’t see any practical benefits. Why? Well, your extent size is likely to be much larger then (number of tuples * 4 bytes) so the difference in size is masked because the FS sizes up to the next extent size. Consider the below example:
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz 2.6.16.46-0.12-smp #1 SMP Thu May 17 14:00:09 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux pg_ctl (PostgreSQL) 8.2.4 postgres=# \d testtable2; Table "public.testtable2" Column | Type | Modifiers ----------+------------------------+----------- id | bigint | not null text | character varying(200) | thecount | integer | Indexes: "testtable2_pk" PRIMARY KEY, btree (id) postgres=# \d testtable3; Table "public.testtable3" Column | Type | Modifiers ----------+------------------------+----------- id | integer | not null text | character varying(200) | thecount | integer | Indexes: postgres=# select count(*) from testtable2; count --------- 2253530 (1 row) postgres=# select count(*) from testtable3; count --------- 2253530 postgres=# analyze testtable2; ANALYZE postgres=# analyze testtable3; ANALYZE postgres=# select tablename, attname, avg_width,n_distinct from pg_stats where tablename = 'testtable2'; tablename | attname | avg_width | n_distinct ------------+----------+-----------+------------ testtable2 | text | 56 | 1 testtable2 | thecount | 4 | 0 testtable2 | id | 8 | -1 postgres=# select tablename, attname, avg_width,n_distinct from pg_stats where tablename = 'testtable3'; tablename | attname | avg_width | n_distinct ------------+----------+-----------+------------ testtable3 | thecount | 4 | 0 testtable3 | id | 4 | -1 testtable3 | text | 56 | 1 (3 rows) postgres=# SELECT * FROM pgstattuple('testtable3'); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent -----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+-------------- 227917824 | 2253530 | 207324760 | 90.96 | 0 | 0 | 0 | 1897096 | 0.83
Two tables with the only difference being the testtable2 is int and testtable3 is bigint. Same number of rows. Different row sizes. Even pgstattuple shows different row lengths.
postgres=# SELECT schemaname, tablename, pg_size_pretty(size) AS size_pretty, pg_size_pretty(total_size) AS total_size_pretty FROM (SELECT *, pg_relation_size(schemaname||'.'||tablename) AS size, pg_total_relation_size(schemaname||'.'||tablename) AS total_size FROM pg_tables where tablename like '%testtable%') AS TABLES ORDER BY total_size DESC; schemaname | tablename | size_pretty | total_size_pretty ------------+------------+-------------+------------------- public | testtable2 | 217 MB | 266 MB public | testtable3 | 217 MB | 266 MB select relname, relpages from pg_class where relname in ('testtable2','testtable3'); relname | relpages ------------+---------- testtable2 | 27822 testtable3 | 27822 (2 rows) postgres=# select 27822*8192; ?column? ----------- 227917824 (1 row) postgres=# select 27822*8192/1024/1024; ?column? ---------- 217
But since we are on an extent based filesystem, we see the exact same size on disk. So for anything other than the most extravagant setups, it’s smart to use bigint and not limit yourself to 32bit integers. There is no space penalty for doing so at least. There are <a href=http://archives.postgresql.org/pgsql-general/2007-06/msg00299.php>performance penalties in performing arithmetic on these columns</a>, but no downside in most practical applications. For instance, a row access by index.
postgres=# select id from testtable2 where id = 2003; id ------ 2003 (1 row) Time: 0.131 ms postgres=# select id from testtable3 where id = 2003; id ------ 2003 (1 row) Time: 0.165 ms insert 100000 rows into testtable2 and commit: Elapsed Seconds: 21.438544035 TPS: 4664.49586487 insert 100000 rows into testtable3 and commit: Elapsed Seconds: 21.7363810539 TPS: 4600.58184258
If you perform calculations on these id’s you do see a performance hit, so be careful:
postgres=# select id*10000+45/10 from testtable2 where id = 2003; ?column? ---------- 20030004 (1 row) Time: 0.268 ms postgres=# select id*10000+45/10 from testtable3 where id = 2003; ?column? ---------- 20030004 (1 row) Time: 0.133 ms