Notes on pg_stats
- To collect statistics, the analyzer randomly selects 300 × default_statistics_target rows (the default value is 100, so 30,000 rows in total).
- Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and are always approximate even when freshly updated. After you complete an engine major version upgrade, you should run the ANALYZE operation to refresh the pg_statistic table
Statistics Used by the Planner
SELECT * FROM tenk1
- The planner determines the cardinality of tenk1
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
- Basic relation-level statistics are stored in the table pg_class in the system catalog.
- reltuples: Relation’s row count
- relpages: Relation’s size in pages
WHERE unique1 < 1000
- The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the
histogram
for unique1 from pg_statistic. - For equality estimation the histogram is not useful; instead the list of
most common values (MCVs)
is used to determine the selectivity.
WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2
- The restriction on tenk1, unique1 < 50, is evaluated before the nested-loop join
- The restriction for the join is t2.unique2 = t1.unique2. The operator is just our familiar =, however the selectivity function is obtained from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g.,
null_frac,n_distinct, most_common_vals
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
rows = (outer_cardinality * inner_cardinality) * selectivity
pg_stats
- pg_stats is designed to be more easily readable and is readable by all, whereas pg_statistic is only readable by a superuser.
- For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistics) on disk (physical blocks are the same on the replica in Amazon RDS for PostgreSQL and in the case of Aurora, the reader is reading from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run an ANALYZE on a replica or a reader node (both can read from the pg_statistics relation, but can’t update it).
ALTER TABLE SET STATISTICS
- sets the per-column statistics-gathering target for subsequent ANALYZE operations. The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default statistics target (default_statistics_target)
correlation
- When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it’s near 0
most_common_vals(MCV) and most_common_freqs
- MCV lists are also used for selectivity estimations of inequalities: to find the selectivity of “column < value”, the planner searches most_common_vals for all the values lower than the given value, and then adds together their frequencies from most_common_freqs.
- Common value statistics work best when the number of distinct values is low. The maximum size of the MCV arrays is defined by default_statistics_target, the same parameter that governs row sample size during analysis.
Histogram
- When the number of distinct values grows too large to store them all in an array, the system starts using the histogram representation. A histogram employs several buckets to store values in. The number of buckets is limited by the same default_statistics_target parameter.
- histograms are used to estimate selectivity of “greater than” and “less than” operations together with MCV lists.
n_distinct
ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)
the only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations, i.e., the n_distinct change will not be in effect until you run ANALYZE again
How is pg_stats build
FROM (((pg_statistic s
JOIN pg_class c ON ((c.oid = s.starelid)))
JOIN pg_attribute a ON (((c.oid = a.attrelid) AND (a.attnum = s.staattnum))))
LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
- tablename: pg_class.relname
- attname: from pg_attribute, identified by oid + attnum
- null_frac: stanullfrac
- avg_width: stawidth
- n_distinct: stadistinct
CREATE STATISTICS
- Extended statistics metadata is stored in the pg_statistic_ext table in the system catalog, while the statistics data itself is stored in a separate table pg_statistic_ext_data
- CREATE STATISTICS will create a new extended statistics object tracking data about the specified table, foreign table or materialized view. The statistics object will be created in the current database and will be owned by the user issuing the command.
- The CREATE STATISTICS command has two basic forms. The first form allows univariate statistics for a single expression to be collected, providing benefits similar to an expression index without the overhead of index maintenance. This form does not allow the statistics kind to be specified, since the various statistics kinds refer only to multivariate statistics. The second form of the command allows multivariate statistics on multiple columns and/or expressions to be collected, optionally specifying which statistics kinds to include. This form will also automatically cause univariate statistics to be collected on any expressions included in the list.
- If a schema name is given (for example, CREATE STATISTICS myschema.mystat …) then the statistics object is created in the specified schema. Otherwise it is created in the current schema. The name of the statistics object must be distinct from the name of any other statistics object in the same schema.
- Extended statistics are not currently used by the planner for selectivity estimations made for table joins. This limitation will likely be removed in a future version of PostgreSQL.
- Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE (either a manual command, or background auto-analyze). The collected values can be examined in the pg_statistic_ext_data catalog.
- ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics.
Dependencies
- By default, the statistics from ANALYZE are stored on a per-column per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running bad run plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.