Chapter 05: SQL Server Statistics Analysis

Applies to: SQL Server (2005-2022), Azure SQL Database, Azure Synapse Analytics, Microsoft Fabric.

By now, you should have a good understanding of the importance of indexes. It is equally important for the optimizer to have the necessary statistics on the data distribution so that it can choose indexes effectively. In SQL Server, this information is maintained in the form of statistics on the index key.

SQL Server's query optimizer is a cost-based optimizer: it decides on the best data access mechanism and join strategy by identifying the selectivity of the data, how unique it is, and which columns are used to filter it. Statistics exist with an index, but they also exist on columns without an index that are used as part of a predicate. Up-to-date information on the data distribution in the columns referenced as predicates helps the optimizer determine the query strategy to use. In SQL Server, this information is maintained in the form of statistics, which are essential for the cost-based optimizer to create an effective query execution plan. Through the statistics, the optimizer can make reasonably accurate estimates of how long it will take to return a result set or an intermediate result set and can therefore determine the most effective operations to use. As long as the default statistics settings for the database are in place, the optimizer will be able to determine effective processing strategies dynamically. As a safety measure while troubleshooting performance, you should also ensure that the automatic statistics maintenance routine is doing its job as desired. When necessary, you may even have to take manual control over the creation and/or maintenance of statistics.

Statistics on Indexed Columns

The usefulness of an index is fully dependent on the statistics of the indexed columns; without statistics, SQL Server's cost-based query optimizer cannot decide on the most effective way of using an index. To meet this requirement, SQL Server automatically creates statistics on an index key whenever the index is created. It is not possible to turn this feature off.

As data changes, the data retrieval mechanism required to keep the cost of a query low may also change. For example, if a table has only one matching row for a certain column value, then it makes sense to retrieve the matching rows from the table by going through the non-clustered index on the column. But if the data in the table changes so that a large number of rows are added with the same column value, then using the non-clustered index no longer makes sense. For SQL Server to detect this change in the appropriate processing strategy as the data changes over time, it is vital to have up-to-date statistics.

Updating statistics consumes extra CPU cycles. To optimize the update process, SQL Server uses an efficient algorithm to decide when to run the update, based on factors such as the number of modifications and the size of the table:

  • When a table with no rows gets a row.
  • When a table has fewer than 500 rows and is increased by 500 or more rows.
  • When a table has more than 500 rows and the number of modifications reaches 500 plus 20 percent of the number of rows (prior to SQL Server 2016).
  • When a table has more than 500 rows and the number of modifications reaches the lesser of 500 plus 20 percent of the number of rows, or the square root of 1,000 times the number of rows (SQL Server 2016 and later, under database compatibility level 130 or higher). This dynamic threshold means large tables get their statistics refreshed much sooner than under the old fixed 20 percent rule.
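If you want to see how close a set of statistics is to this threshold, the sys.dm_db_stats_properties dynamic management function exposes the modification counter alongside the row count. A sketch, using the dbo.TempTable example table created later in this chapter:

Select s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
From sys.stats As s
Cross Apply sys.dm_db_stats_properties(s.object_id, s.stats_id) As sp
Where s.object_id = OBJECT_ID(N'dbo.TempTable');

When modification_counter crosses the threshold for the table's row count, the next query that needs those statistics triggers the update.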

This built-in intelligence keeps the CPU utilization of each process very low. It's also possible to update the statistics asynchronously: when a query would normally cause statistics to be updated, that query instead proceeds with the old statistics, and the statistics are updated in the background.

Benefits of Updating Statistics

The benefits of performing an auto update usually outweigh its cost on the system resources. To control the behavior of the data more directly, instead of using the tables in AdventureWorks, for this set of examples you will create a table manually.

Use tempdb;
Go
If (OBJECT_ID(N'TempTable')) Is Not Null
  Drop Table TempTable;
Go
Create Table TempTable (C1 int, C2 int);
Go
Select Top (1500) IDENTITY(int,1,1) as N
Into #temp From master.dbo.syscolumns;
Go
Insert Into TempTable(C1) Select N From #temp;
Go
Drop Table #temp;

If you execute the SELECT statement with a very selective filter criterion on the indexed column to retrieve only one row, as shown in the following line of code, then the optimizer uses a non-clustered index seek as shown in the execution plan below.

Create Index TempIndex on TempTable (C1);
Go
Select * From TempTable Where C1 = 2;

To understand the effect of small data modifications on a statistic update, create a trace using Profiler. In the trace, add the event Auto Stats, which captures statistics update and create events, and add SQL:BatchCompleted with a filter on the TextData column. The filter should look like Not Like Set% when you are done.
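As a sketch of a more modern alternative (SQL Trace and Profiler are deprecated), an Extended Events session can capture the same information. This assumes the sqlserver.auto_stats and sqlserver.sql_batch_completed events, available in SQL Server 2012 and later:

Create Event Session AutoStatsTrace On Server
  Add Event sqlserver.auto_stats,
  Add Event sqlserver.sql_batch_completed
  Add Target package0.ring_buffer;
Go
Alter Event Session AutoStatsTrace On Server State = Start;

You can then watch the session's live data in SSMS or query the ring buffer target to see statistics creation and update activity.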

When you re-execute the preceding SELECT statement, you get the same execution plan as shown above.

Select Top (1500) IDENTITY(int,1,1) as N
Into #temp From master.dbo.syscolumns;
Go
Insert Into TempTable(C1) Select 2 From #temp;
Drop Table #temp;

For a small modification, the trace output does not contain any SQL activity representing a statistics update, because the number of changes falls below the threshold: a table with more than 500 rows must see an increase of 500 rows plus 20 percent of its row count before its statistics are marked out-of-date. To see the effect of a large data modification on the statistics update, add 1500 rows to the table using the preceding statements.

Select * From TempTable Where C1 = 2;

 

A large result set will be retrieved. Since a large result set is requested, scanning the table directly is preferable to going through the non-clustered index to the base table 1502 times: accessing the base table directly avoids the overhead of the bookmark lookups associated with the non-clustered index. The execution plan above reflects this choice.

The Profiler trace output includes an Auto Stats event, since this time the large-scale update exceeded the threshold. These SQL activities consume some extra CPU cycles; however, in return, the optimizer determines a better data-processing strategy and keeps the overall cost of the query low.

Drawbacks of Outdated Statistics

As explained in the previous section, the auto update statistics feature allows the optimizer to decide on an efficient processing strategy for a query as the data changes. If the statistics become outdated, however, then the processing strategies chosen by the optimizer may no longer be appropriate for the current data set and will degrade performance.

To understand the detrimental effect of having outdated statistics, follow these steps:

  1. Re-create the preceding test table with 1500 rows only and the corresponding non-clustered index.
  2. Prevent SQL Server from updating statistics automatically as the data changes. To do so, disable the auto update statistics feature by executing the following SQL statement:

Alter Database tempdb Set Auto_Update_Statistics Off;

  3. Add 1500 rows to the table as before.

With the auto update statistics feature switched off, the query optimizer has selected a different execution plan from the one it selected with this feature on. Based on the outdated statistics, which have only one row for the filter criterion (C1=2), the optimizer decided to use a non-clustered index seek. The optimizer could not make its decision based on the current data distribution in the column. For performance reasons, it would have been better to hit the base table directly instead of going through the non-clustered index, since a large result set (1501 rows out of 3000 rows) is requested.

The number of logical reads and the CPU utilization are significantly higher when the statistics are out of date, even though the data returned is nearly identical and the query was precisely the same. Therefore, it is recommended that you keep the auto update statistics feature on. The benefits of keeping statistics updated outweigh the costs of performing the update.
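If you want to gather these numbers yourself rather than from a trace, SSMS can report them per statement:

Set Statistics IO On;
Set Statistics Time On;
Go
Select * From TempTable Where C1 = 2;
-- Logical reads and CPU time for the statement appear on the Messages tab.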

Select Top (1500) IDENTITY(int,1,1) as N
Into #temp From master.dbo.syscolumns;
Go
Insert Into TempTable(C1) Select 2 From #temp;
Drop Table #temp;
Go
Alter Database tempdb Set Auto_Update_Statistics Off;
Go
Select Top (1500) IDENTITY(int,1,1) as N
Into #temp From master.dbo.syscolumns;
Go
Insert Into TempTable(C1) Select 2 From #temp;
Drop Table #temp;
Go
Select * From TempTable Where C1 = 2;

Statistics on Non-Indexed Columns

Sometimes you may have columns in join or filter criteria without any index. Even for such non-indexed columns, the query optimizer is more likely to make the best choice if it knows the data distribution (or statistics) of those columns.

In addition to statistics on indexes, SQL Server can build statistics on columns with no indexes. The information on data distribution, or the likelihood of a particular value occurring in a non-indexed column, can help the query optimizer determine an optimal processing strategy. This benefits the query optimizer even if it cannot use an index to actually locate the values. SQL Server automatically builds statistics on non-indexed columns if it deems this information valuable in creating a better plan, usually when the columns are used in a predicate.

In general, you should not disable the automatic creation of statistics on non-indexed columns. One of the scenarios in which you may consider disabling this feature is while executing a series of ad hoc SQL activities that you will not execute again. In such a case, you must decide whether you want to pay the cost of automatic statistics creation to get a better plan in this one case and affect the performance of other SQL Server activities. It is worthwhile noting that SQL Server eventually removes statistics when it realizes that they have not been used for a while. So, in general, you should keep this feature on and not be concerned about it.

Use tempdb;
Go
If (OBJECT_ID(N'TempTable')) Is Not Null
  Drop Table TempTable;
Go
Create Table TempTable (C1 int Identity, C2 int);
Insert Into TempTable(C2) Values (1);
Select Top (10000) IDENTITY(int,1,1) as N
Into #num From master.dbo.syscolumns;
Go
Insert Into TempTable(C2) Select 2 From #num;
Go
Drop Table #num;
Go
Create Clustered Index IX_1 on TempTable(C1);

Benefits of Statistics on Non-Indexed Columns

To understand the benefit of having statistics on a column with no index, create two test tables with disproportionate data distributions, as shown in the preceding and following code. Both tables contain 10001 rows. Table TempTable contains only one row with a value of 1 in the second column, while the remaining 10000 rows contain the value 2. Table TempTable2 contains exactly the opposite data distribution.

Use tempdb;
Go
If (OBJECT_ID(N'TempTable2')) Is Not Null
  Drop Table TempTable2;
Go
Create Table TempTable2 (C1 int Identity, C2 int);
Insert Into TempTable2(C2) Values (2);
Select Top (10000) IDENTITY(int,1,1) as N
Into #num2 From master.dbo.syscolumns;
Go
Insert Into TempTable2(C2) Select 1 From #num2;
Go
Drop Table #num2;
Go
Create Clustered Index IX_2 on TempTable2(C1);

To understand the importance of statistics on a non-indexed column, use the default setting for the auto create statistics feature. By default, this feature is on. You can verify this using the DATABASEPROPERTYEX function.
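For example, the following check should return 1 when the feature is on:

Select DATABASEPROPERTYEX(N'tempdb', 'IsAutoCreateStatistics') as AutoCreateStats;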

Select T.C2, T2.C2 From TempTable T
Join TempTable2 T2 On T.C2 = T2.C2
Where T.C2 = 2;

 

Use the above SELECT statement to access a large result set from table TempTable and a small result set from table TempTable2. Table TempTable has 10000 rows for the column value of C2 = 2 and table TempTable2 has 1 row for C2 = 2. Note that these columns used in the join and filter criteria have no index on either table.

Capture a Profiler trace with all completed events and the Auto Stats event for this query; you can use it to evaluate some of the added costs of the query. The trace output includes Auto Stats events creating statistics on the non-indexed C2 columns referred to in the JOIN and WHERE clauses. This activity consumes a few extra CPU cycles, but in exchange the optimizer decides on a better processing strategy that keeps the overall cost of the query low.

Select OBJECT_NAME(object_id) as 'Table', name, stats_id, auto_created
From sys.stats
Where object_id In (OBJECT_ID(N'TempTable2'), OBJECT_ID(N'TempTable')) And auto_created = 1;

To verify the statistics automatically created by SQL Server on the non-indexed columns of each table, run the above SELECT statement against the sys.stats catalog view.

Missing statistics lead the query optimizer and cardinality estimator to choose an improper processing strategy; without distribution information, the optimizer falls back on simple guesses about how many rows will match the JOIN or WHERE criteria. With missing statistics, the number of logical reads and the CPU utilization increase significantly, and the optimizer cannot produce a cost-effective plan.

Drawbacks of Missing Statistics on Non-Indexed Columns

To understand the detrimental effect of not having statistics on non-indexed columns, drop the statistics automatically created by SQL Server and prevent SQL Server from automatically creating statistics on columns with no index, using the steps below. (The autogenerated statistics names will be different on your system.)

Drop Statistics TempTable2._WA_Sys_00000002_6FE99F9F;
Go
Drop Statistics TempTable2._WA_Sys_00000001_6FE99F9F;
Go
Drop Statistics TempTable._WA_Sys_00000001_6FE95698;
Go
Alter Database tempdb set auto_create_statistics off;

Now re-execute the SELECT statement as below:

Select T.C2,T2.C2  From TempTable T
Join TempTable2 T2 On T.C2 = T2.C2
Where T.C2 = 1;

With the auto create statistics feature off, the query optimizer selected a different execution plan compared to the one it selected with the feature on. On not finding statistics on the relevant columns, the optimizer chose the first table (TempTable) in the FROM clause as the outer table of the nested loop join operation. The optimizer could not make its decision based on the actual data distribution in the column. Not only that, but the optimizer and the query engine determined that this query passed the threshold for parallelism, making this a parallel execution (the little arrows on the operators mark them as parallel).

For example, if you modify the query to reference table TempTable2 as the first table in the FROM clause, as in the following statement:

Select T.C2,T2.C2  From TempTable2 T2
Join TempTable T On T.C2 = T2.C2
Where T.C2 = 1;

then the optimizer selects table TempTable2 as the outer table of the nested loop join operation.

You can see that turning off the auto create statistics feature has a negative effect on performance by comparing the cost of this query with and without statistics on a non-indexed column.


The number of logical reads and the CPU utilization are very high with no statistics on the non-indexed columns. Without these statistics, the optimizer cannot create a cost-effective plan. A query execution plan highlights the missing statistics by placing an exclamation point on the operator that would have used them; you can see this on the clustered index scan operators in the above execution plan.

In a database application, there is always the possibility of queries using columns with no indexes. Therefore, for performance reasons, leaving the auto create statistics feature of SQL Server databases on is recommended.
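If you disabled the settings in tempdb while following along, restore the defaults before continuing:

Alter Database tempdb Set Auto_Create_Statistics On;
Go
Alter Database tempdb Set Auto_Update_Statistics On;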

Analyzing Statistics

Statistics are collections of information stored as histograms. A histogram is a statistical construct that shows how often data falls into varying categories. The histogram stored by SQL Server consists of a sampling of the data distribution for a column or an index key, captured in up to 200 values. The range of index key values between two consecutive samples is called a step. These steps consist of varying-size intervals between the 200 values stored. A step provides the following information:

  • The top value of a given step (RANGE_HI_KEY).
  • The number of rows equal to RANGE_HI_KEY (EQ_ROWS).
  • The range of rows between the previous top value and the current top value, without counting either of these samples (RANGE_ROWS).
  • The number of distinct rows in the range (DISTINCT_RANGE_ROWS). If all values in the range are unique, then RANGE_ROWS equals DISTINCT_RANGE_ROWS.
  • The average number of rows equal to a key value within a range (AVG_RANGE_ROWS).
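While working through these values, you can limit the DBCC Show_Statistics output to just the histogram (TableA and IX_A are the example objects used below):

DBCC Show_Statistics ('TableA', 'IX_A') With Histogram;

For a predicate equal to a RANGE_HI_KEY value, the optimizer estimates EQ_ROWS matching rows; for a value that falls inside a step, it estimates AVG_RANGE_ROWS.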

The value of EQ_ROWS for an index key value (RANGE_HI_KEY) helps the optimizer decide how (and whether) to use the index when the indexed column is referred to in a WHERE clause. Because the optimizer can perform a SEEK or SCAN operation to retrieve rows from a table, the optimizer can decide which operation to perform based on the number of matching rows (EQ_ROWS) for the index key value. 

DBCC Show_Statistics ('TableA', 'IX_A');

To understand how the optimizer’s data retrieval strategy depends on the number of matching rows, create a test table with different data distributions on an indexed column. When the preceding non-clustered index is created, SQL Server automatically creates statistics on the index key. You can obtain statistics for this non-clustered index key by executing the above command.

Besides the information on steps, other useful information in the statistics includes the following:

  • The time statistics were last updated.
  • The number of rows in the table.
  • The average index key length.
  • The number of rows sampled for the histogram.
  • Densities for combinations of columns.
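Some of this header information can also be queried directly; for example, the STATS_DATE function returns the last update time for each statistics object on a table (again using the example TableA):

Select name, STATS_DATE(object_id, stats_id) As LastUpdated
From sys.stats
Where object_id = OBJECT_ID(N'TableA');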

Density

When creating an execution plan, the query optimizer analyzes the statistics of the columns used in the filter and JOIN clauses. A filter criterion with high selectivity limits the number of rows from a table to a small result set and helps the optimizer keep the query cost low. A column with a unique index will have very high selectivity, since it can limit the number of matching rows to one.

On the other hand, a filter criterion with low selectivity will return a large result set from the table. A filter criterion with very low selectivity makes a non-clustered index on the column ineffective. Navigating through a non-clustered index to the base table for a large result set is usually costlier than scanning the base table (or clustered index) directly because of the cost overhead of bookmark lookups associated with the non-clustered index.

Statistics track the selectivity of a column in the form of a density value, computed as 1 divided by the number of distinct values in the column. A column with high selectivity will have low density. A column with low density is suitable for a non-clustered index, because it helps the optimizer retrieve a small number of rows very fast. This is also the principle on which filtered indexes operate, since the filter's goal is to increase the selectivity, that is, to lower the density, of the index.
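The density values can be viewed on their own through the density vector section of the statistics output; a sketch with the example objects:

DBCC Show_Statistics ('TableA', 'IX_A') With Density_Vector;
-- Each "All density" value is 1 / (number of distinct values in the listed column combination).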

Statistics on Multi-Column Index

In the case of an index with one column, statistics consist of a histogram and a density value for that column. Statistics for a composite index with multiple columns consist of one histogram for the first column only and multiple density values. This is one reason why it's wise to put the more selective column, the one with the lowest density, first when building a compound index or compound statistics. The density values include the density for the first column and for each prefix combination of the index key columns. Multiple density values help the optimizer find the selectivity of the composite index when multiple columns are referred to by predicates in the WHERE and JOIN clauses. Although the first column determines the histogram, the final density of that column itself is the same regardless of column order.

Create Index IX_A on TableA(C1,C2) With Drop_Existing;
Go
DBCC Show_Statistics ('TableA', 'IX_A');

To better understand the density values maintained for a multicolumn index, re-create the non-clustered index with two key columns, as shown above, and examine its statistics with DBCC Show_Statistics.

For a multicolumn index with two columns, the statistics for the index also contain the density value of the (first + second) column combination. The statistics won't contain a density value for any other combination of columns; therefore, this index won't be very useful for filtering rows only on the second column (C2), because the density value of the second column (C2) alone isn't maintained in the statistics.

Statistics on Filtered Index

The purpose of a filtered index is to change the data that makes up the index and therefore change the density and histogram to make the index more performant. Instead of a test table, this example uses the AdventureWorks database. Create an index on the Sales.SalesOrderHeader table on the PurchaseOrderNumber column.

Create Index IX_B on Sales.SalesOrderHeader(PurchaseOrderNumber);
Go
DBCC Show_Statistics ('Sales.SalesOrderHeader', 'IX_B');

If the same index is re-created to deal only with non-NULL values in the column, it would look something like this:

Create Index IX_B on Sales.SalesOrderHeader(PurchaseOrderNumber)
Where PurchaseOrderNumber Is Not Null
With Drop_Existing;
Go
DBCC Show_Statistics ('Sales.SalesOrderHeader', 'IX_B');

First, you can see that the number of rows that compose the statistics has dropped radically for the filtered index, because a filter is in place. Notice also that the average key length has increased, since you are no longer dealing with zero-length strings. A filter expression is now defined, rather than the NULL value visible in the earlier figure. The density measurements are interesting: the density is close to the same for both versions, but the filtered density is slightly lower, meaning the index is more selective. The filtered data is more accurate, eliminating the empty values that won't contribute to searches. The density of the second value, which represents the key column plus the clustered index pointer, is identical to the density of PurchaseOrderNumber alone, because each represents the same amount of unique data.

One other option open to you is to create filtered statistics. These allow you to create even more finely tuned histograms, which is especially useful on partitioned tables, because statistics are maintained at the table level rather than per partition. You can create filtered indexes by partition and get their statistics, or create filtered statistics specifically by partition.
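A sketch of filtered statistics, assuming the AdventureWorks Sales.SalesOrderHeader table from the previous example (the statistics name here is arbitrary):

Create Statistics Stats_PONumber_Filtered
On Sales.SalesOrderHeader (PurchaseOrderNumber)
Where PurchaseOrderNumber Is Not Null;

The WHERE clause on CREATE STATISTICS works just like the one on a filtered index: the histogram and density are built only from the rows that satisfy the predicate.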

Statistics Maintenance

SQL Server allows a user to manually override the maintenance of statistics in an individual database. The four main configurations controlling automatic statistics maintenance behavior of SQL Server are as follows:

  • New statistics on columns with no index.
  • Updating existing statistics.
  • The degree of sampling used to collect statistics.
  • Asynchronous updating of existing statistics.

 

You can control the preceding configurations at the level of a database or on a case-by-case basis on individual indexes or statistics. The auto create statistics setting is applicable to non-indexed columns only, because SQL Server always creates statistics for an index key when the index is created. The auto update statistics setting, and its asynchronous version, are applicable to statistics on both indexes and WHERE clause columns with no index.

Automatic Maintenance

By default, SQL Server automatically takes care of statistics. Both the auto create statistics and auto update statistics settings are on by default. These two features together are referred to as autostats. As explained previously, it is usually better to keep these settings on. The auto update statistics asynchronously setting is off by default.

The auto create statistics feature automatically creates statistics on non-indexed columns when they are referred to in the WHERE clause of a query.

The auto update statistics feature automatically updates existing statistics on the indexes and columns of a permanent table when the table is referred to in a query, provided the statistics have been marked as out-of-date. Statistics are marked out-of-date by data-modification statements such as INSERT, UPDATE, and DELETE.

If auto update statistics asynchronously is set to on, the basic behavior of statistics in SQL Server is not changed radically. When a set of statistics is marked as out-of-date and a query is then run against those statistics, the statistics update does not interrupt the query as normally happens. Instead, the query finishes execution using the older set of statistics. Once the query completes, the statistics are updated. The reason this may be attractive is that when statistics are updated, query plans in the procedure cache are invalidated, and the queries being run must be recompiled.
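Enabling it is a single database-level setting; note that Auto_Update_Statistics must also be on for the asynchronous behavior to take effect:

Alter Database tempdb Set Auto_Update_Statistics_Async On;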

Manual Maintenance Situations

Manual statistics update should occur during one or more of the following situations:

After upgrading from a previous version of SQL Server to a newer one: since the statistics maintenance of SQL Server may have changed between versions, you should manually update the statistics of the complete database immediately after the upgrade instead of waiting for the engine to update them over time through automatic statistics. While executing a series of ad hoc queries that you will not execute again: in such cases, you must decide whether you want to pay the cost of automatic statistics maintenance to get a better plan in that one specific case at the expense of other SQL Server activity. In general, you do not need to be concerned with such one-timers.

When you come upon an issue with automatic statistics maintenance and the only workaround for the time being is to keep the feature off: even in these cases, you can turn the feature off for the specific table that faces the problem instead of disabling it for the complete database. While analyzing the performance of a query, you may also realize that statistics are missing for a few of the database objects referred to by the query; this can be seen in the graphical and XML execution plans.

While analyzing the effectiveness of statistics, you may realize that they are inaccurate; this can be determined when poor execution plans are created from what should be good sets of indexes. SQL Server lets you control many of its automatic statistics maintenance features through the auto create statistics and auto update statistics settings, as shown in the following statements.

Alter Database tempdb Set Auto_Create_Statistics Off;
Go
Alter Database tempdb Set Auto_Update_Statistics Off;
Go
Alter Database tempdb Set Auto_Update_Statistics_Async Off;
Go
Exec sp_autostats 'TempTable', 'Off';
Go
Exec sp_autostats 'TempTable', 'Off', 'IX_A';
Go
Update Statistics dbo.TableA With Sample 50 Percent, NoRecompute;

Statistics Recommendations

Throughout this module, we covered various recommendations for statistics. For easy reference, the following is a consolidated list of those recommendations.

Backward Compatibility of Statistics

The statistics format in a new version of SQL Server can differ from that of prior versions, so it is best to update statistics manually for the whole database after every successful upgrade.

Auto Create Statistics

Keep this feature on so that statistics are created for non-indexed columns used in predicates; this benefits most ad hoc SQL Server activity.

Auto Update Statistics

This feature helps the optimizer decide on the appropriate execution plan, and the performance benefit it provides usually outweighs its overhead. If you run into an issue with the Auto_Update_Statistics feature, create a job that updates the statistics regularly instead.

Automatic Update Statistics Asynchronously

With this feature on, a query that finds out-of-date statistics proceeds with the old statistics while the update runs in the background, so the current query may still suffer from the out-of-date statistics. To enable this feature, Auto_Update_Statistics must be on as well.

Amount of Sampling to Collect Statistics

It is better to let SQL Server decide the sampling rate based on the data size. If you prefer the most accurate statistics, use the FULLSCAN option; because it reduces performance, run the UPDATE STATISTICS command during off-peak hours via a SQL Agent job.
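A sketch of such a manual update, using the example table from this chapter:

Update Statistics dbo.TableA With FullScan;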
