Monthly Archives: July 2012

Myth: COUNT(*) is slower than COUNT(1) or COUNT(column name)

There is a strongly held myth by some that in Greenplum, COUNT(*) is somehow slower than COUNT(1) or COUNT(column name).  However, it is a myth and there isn’t a difference.  If you are interested for proof, keep reading.

The 4 arguments I’ve heard that COUNT(*) is bad and COUNT(1) or COUNT(column name) is better are:
1. COUNT(*) is slower
2. COUNT(1) is faster when there is an index on a column
3. COUNT(column name) is faster when you count a unique column
4. COUNT(1) is faster when the table is column oriented

1. COUNT(*) is slower

Setup

create schema demo;

create table demo.foo
(id int not null,
fname text,
lname text)
distributed by (id);

insert into demo.foo
select i, 'jon' || i, 'roberts' || i
from (select generate_series(1, 1000000) as i) as sub;

analyze demo.foo;

COUNT(*)

explain analyze select count(*) from demo.foo;

Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
  Rows out:  1 rows with 483 ms to end, start offset by 0.474 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
        Rows out:  2 rows at destination with 481 ms to first row, 483 ms to end, start offset by 0.486 ms.
        ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 483 ms to end, start offset by 0.908 ms.
              ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.071 ms to first row, 398 ms to end, start offset by 0.908 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 200K bytes avg x 2 workers, 200K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 484.379 ms

COUNT(1)

explain analyze select count(1) from demo.foo;

Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
  Rows out:  1 rows with 504 ms to end, start offset by 0.532 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
        Rows out:  2 rows at destination with 502 ms to first row, 504 ms to end, start offset by 0.564 ms.
        ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 499 ms to end, start offset by 4.029 ms.
              ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 398 ms to end, start offset by 4.029 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 505.237 ms

Conclusion: No difference.

2. COUNT(1) is faster when there is an index on a column

COUNT(*) with PK Index

alter table demo.foo add primary key (id);

analyze demo.foo;

explain analyze select count(*) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 511 ms to end, start offset by 0.459 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 511 ms to end, start offset by 0.471 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 510 ms to end, start offset by 0.836 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.039 ms to first row, 420 ms to end, start offset by 0.837 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 511.958 ms

COUNT(1) with PK Index

explain analyze select count(1) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 511 ms to end, start offset by 0.563 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 507 ms to first row, 511 ms to end, start offset by 0.596 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 507 ms to end, start offset by 1.022 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.040 ms to first row, 404 ms to end, start offset by 1.023 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 511.875 ms

Conclusion: No difference.  The cost went down slightly with an Index but COUNT(*) versus COUNT(1) didn’t make a difference.

3. COUNT(column name) is faster when you count a unique column

COUNT(ID)

explain analyze select count(id) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 533 ms to end, start offset by 0.609 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 531 ms to first row, 533 ms to end, start offset by 0.647 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 529 ms to end, start offset by 4.696 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=4)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 418 ms to end, start offset by 4.697 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 533.647 ms

Conclusion: No difference

4. COUNT(1) is faster when the table is column oriented

Setup

create table demo.foo_column
(id int not null,
 fname text,
 lname text)
 WITH (appendonly=true, orientation=column)
 distributed by (id);

insert into demo.foo_column
select i, 'jon' || i, 'roberts' || i
from (select generate_series(1, 1000000) as i) as sub;

analyze demo.foo_column;

COUNT(*)

explain analyze select count(*) from demo.foo_column;
Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
  Rows out:  1 rows with 368 ms to end, start offset by 0.544 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
        Rows out:  2 rows at destination with 362 ms to first row, 368 ms to end, start offset by 0.573 ms.
        ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 359 ms to end, start offset by 4.160 ms.
              ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.182 ms to first row, 266 ms to end, start offset by 4.160 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 368.748 ms

COUNT(1)

explain analyze select count(1) from demo.foo_column;

Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
  Rows out:  1 rows with 277 ms to end, start offset by 0.557 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
        Rows out:  2 rows at destination with 204 ms to first row, 277 ms to end, start offset by 0.586 ms.
        ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 273 ms to end, start offset by 4.296 ms.
              ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.195 ms to first row, 174 ms to end, start offset by 4.297 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 277.571 ms

Conclusion: No difference. The cost is lower to count the number of rows in a column oriented table but again, makes no difference if it is COUNT(1) or COUNT(*).

Summary

Don’t use silly code tricks like COUNT(1) because it doesn’t make any difference.