Microsoft – SQL Server – Compression?
One of the biggest drags in Very Large Databases (VLDB) is I/O throughput. I/O throughput is constrained by the underlying storage subsystem's throughput, the network pipe between the hosts and the SAN (in our case), and the host's memory headroom.
If you have already tuned extensively and your wait statistics still lead down into the storage wilderness, then it might be time to look at data compression.
What are the requirements?
- MS SQL Server 2008 and later (2008, 2008 R2, 2012)
- Enterprise Edition
- Think hard about the requirement for “Enterprise Edition”: once you employ compression, that database can no longer be restored on a down-level edition (Standard, Desktop, etc.)
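You can verify which Enterprise-only features a database has persisted, and hence whether it can be restored to a lower edition, via the sys.dm_db_persisted_sku_features DMV. A minimal sketch (the database name is hypothetical; run it in the database you intend to restore):

```sql
-- Lists Enterprise-only features (e.g. Compression, Partitioning)
-- persisted in the current database. An empty result set means
-- the database can be restored on lower editions.
USE [YourDatabase];  -- hypothetical database name
GO
SELECT feature_name
     , feature_id
FROM sys.dm_db_persisted_sku_features;
```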
When should you consider data compression?
- If you have data that repeats and that you think might benefit from coalescing
- If your tables are not very well normalized and you repeat values inline rather than cross-referencing key columns
What are the benefits of data compression?
- In the storage subsystem the data is compressed, and thus you will be moving less data from your storage subsystem to the hosts
- You will also be writing less data, as data is compressed before being written to the storage subsystem
- In memory, the data is kept compressed, and so your memory footprint per record should be a bit smaller
In summary, you will have less data at rest on the storage subsystem, less data in memory, and less data in transit.
To me, if there is a selling point to high-end database systems, and in this case I am including MS SQL Server, it is their inherent flexibility in configuration. They are very general-purpose workhorses.
So when you think of data compression in SQL Server, do not feel boxed in. You have a wealth of choices. Those choices include:
- You are able to apply compression at a very detailed level — individual indexes. That is, you do not make the compression decision at the database or the table level, but at individual index (and partition) granularity
- You are able to choose between page and record level compression
- You are able to apply your compression choice and later un-apply it or choose another compression choice
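Un-applying compression, or switching to a different option, is just another index rebuild. A sketch, using the same hypothetical index and table names as the examples later in this post:

```sql
-- Switch an index from page to row compression
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD WITH (DATA_COMPRESSION = ROW);

-- Remove compression entirely
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD WITH (DATA_COMPRESSION = NONE);
```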
Knowing that we have a choice as to how we want to dip into the proverbial water, we can review what objects and indexes matter the most.
Identify what is churning in your memory:
Glenn Berry has a really good lead-in. As you think about whether compression is worth it, first determine what is in your buffer cache.
The truth is you might be surprised. User queries change all the time: an index that you added for John L. Smith in Accounting a while ago might now be heavily used, while indexes that are needed might not be present, so you are instead bringing in entire tables.
So look again at your memory space and try to account for each table/index combination.
Here is Glenn Berry’s query:
/*
-- Breaks down buffers used by the current database
-- by object (table, index) in the buffer cache
*/
SELECT [object] = OBJECT_SCHEMA_NAME(p.[object_id]) + '.' + OBJECT_NAME(p.[object_id])
     , tblIndex.name
     , p.index_id
     , IndexType = MAX(tblIndex.type_desc)
     , COUNT(*)/128 AS [Buffer size(MB)]
     , COUNT(*) AS [BufferCount]
     , [CompressionType] = p.data_compression_desc
FROM sys.allocation_units AS a
INNER JOIN sys.dm_os_buffer_descriptors AS b
        ON a.allocation_unit_id = b.allocation_unit_id
INNER JOIN sys.partitions AS p
        ON a.container_id = p.hobt_id
INNER JOIN sys.indexes AS tblIndex
        ON p.object_id = tblIndex.object_id
       AND p.index_id  = tblIndex.index_id
WHERE b.database_id = DB_ID()
  AND p.[object_id] > 100
GROUP BY p.[object_id]
       , tblIndex.name
       , p.index_id
       , p.data_compression_desc
ORDER BY [BufferCount] DESC
Review your Application workload
Microsoft has done a commendable job in its published white-paper:
Data Compression: Strategy, Capacity Planning and Best Practices
Among other things, MSFT listed out when compression and decompression actually occurs and did a good comparative analysis on various workloads; thus one can actually better determine which queries and access patterns will likely benefit from compression.
As an aside, anytime I see Sunil Agarwal’s name on a document, I know it is top quality.
Here are some published papers by Sunil Agarwal:
Back to MSFT White-paper – Here is my read:
- When is data compressed? Data at rest is obviously compressed
- The data is kept compressed in memory
- It falls to the SQL engine to determine which data to decompress; generally that occurs when filtering initially takes place, or at query completion when data is being grouped and sorted
- To me, one has to be pretty good at reading query plans and at understanding where your cost is; that leads to identifying which tables and indexes are involved in the costly operations
- In essence, look at your access patterns (query plans) and try to see which ones are seeks and which are scans. Scans, as they typically churn through quite a bit of data, will probably benefit from compression. Compression indirectly offers you a bigger I/O: you can bring in more compressed data than un-compressed data per read
- Because MSFT’s design and implementation of compression is so granular and targeted at individual indexes and partitions, one has a lot of choices. You can compress indexes that are used during scan operations and leave indexes employed during seeks un-compressed
- Again, understand your queries and try to use persisted computed columns and aggregated tables rather than pay a price during each retrieval. It is often better to pay that price once, when the record is initially persisted
- Along the same lines, your database engine is not where you want to perform complex query analysis. It is better to do so on your application or analysis servers, or whatever you call them. For one thing, you can distribute that load across many machines, and you have better-targeted tools to profile and tune them
- Look at your CPU utilization. If it is over 40%, then you probably do not want to introduce compression, as it leans on CPU. And when CPU is high, make sure you do not have “Power Savings” enabled. If there are boxes where you do not want Power Savings, it is your DB and VM hosts
And, more precisely, this is where Microsoft nailed it. They talk about your U and S.
Here is the U:
- The percentage of U, that is, the percentage of updates. The lower this number, the more likely compression should be considered
- If you have a lot of updates, then you are going back and changing data, and thus re-introducing the need for reads (seeks) and re-compression
- These days, I rarely see an application that actually performs a high ratio of updates. Mostly I see a lot of Appends
Here is the S:
- The percentage of S, that is, scans. If scans make up a large share of all access, then you will probably benefit from compression
Here are queries nicely prepared in MSFT’s white paper:
U: Percent of Update Operations on the Object
SELECT o.name AS [Table_Name]
     , x.name AS [Index_Name]
     , i.partition_number AS [Partition]
     , i.index_id AS [Index_ID]
     , x.type_desc AS [Index_Type]
     , i.leaf_update_count * 100.0 /
       (  i.range_scan_count
        + i.leaf_insert_count
        + i.leaf_delete_count
        + i.leaf_update_count
        + i.leaf_page_merge_count
        + i.singleton_lookup_count
       ) AS [Percent_Update]
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (  i.range_scan_count
       + i.leaf_insert_count
       + i.leaf_delete_count
       + i.leaf_update_count
       + i.leaf_page_merge_count
       + i.singleton_lookup_count) != 0
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND o.name = @objectName
  AND x.name = @indexName
ORDER BY [Percent_Update] ASC
S: Percent of Scan and Seek Operations on the Object
SELECT o.name AS [Table_Name]
     , x.name AS [Index_Name]
     , i.partition_number AS [Partition]
     , i.index_id AS [Index_ID]
     , x.type_desc AS [Index_Type]
     , (i.range_scan_count * 100.0) /
       (  i.range_scan_count
        + i.leaf_insert_count
        + i.leaf_delete_count
        + i.leaf_update_count
        + i.leaf_page_merge_count
        + i.singleton_lookup_count
       ) AS [Percent_Scan]
     , (i.singleton_lookup_count * 100.0) /
       (  i.range_scan_count
        + i.leaf_insert_count
        + i.leaf_delete_count
        + i.leaf_update_count
        + i.leaf_page_merge_count
        + i.singleton_lookup_count
       ) AS [Percent_Seek]
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (  i.range_scan_count
       + i.leaf_insert_count
       + i.leaf_delete_count
       + i.leaf_update_count
       + i.leaf_page_merge_count
       + i.singleton_lookup_count) != 0
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND o.name = @objectName
  AND x.name = @indexName
ORDER BY [Percent_Scan] DESC
What are the different types of compression?
There are two different types of compression, page and row. Here is how they differ:
Compression -> Row
- Row compression works at the individual record level. It looks at each record independently and goes against some of what we assumed was hard-wired into MS SQL Server data storage
- Ross Mistry has a good short article at http://www.informit.com/articles/article.aspx?p=1946159&seqNum=6 that made it easy for me to understand row-level compression. It uses the example of defining a column such as StreetAddress char(2000). Doing so allocates 2000 bytes for each StreetAddress irrespective of the actual contents. Row-level compression “saves you” in that only the byte size actually needed is used.
- From the same article I was able to determine that the data is not actually compressed; unneeded bytes are simply not taken up
Compression -> Page
- Actual compression based on the data page’s contents
- You get row-level compression as well
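You can see which compression setting is currently in effect for each index and partition by querying sys.partitions, which carries a data_compression_desc column (NONE, ROW, or PAGE):

```sql
-- Current compression setting per index and partition
SELECT OBJECT_SCHEMA_NAME(p.object_id) + '.' + OBJECT_NAME(p.object_id) AS [object]
     , i.name AS index_name
     , p.partition_number
     , p.data_compression_desc
FROM sys.partitions AS p
INNER JOIN sys.indexes AS i
        ON p.object_id = i.object_id
       AND p.index_id  = i.index_id
WHERE p.object_id > 100   -- skip system objects
ORDER BY [object], i.name, p.partition_number;
```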
How much cost savings from compression?
-- Estimate data compression difference on clustered index
-- if compression type is NONE
EXEC sp_estimate_data_compression_savings 'SchemaName', 'TableName', 1, NULL, 'NONE';

-- Estimate data compression difference on clustered index
-- if compression type is ROW
EXEC sp_estimate_data_compression_savings 'SchemaName', 'TableName', 1, NULL, 'ROW';

-- Estimate data compression difference on clustered index
-- if compression type is PAGE
EXEC sp_estimate_data_compression_savings 'SchemaName', 'TableName', 1, NULL, 'PAGE';
Pasted below are speculative savings based on whether we employ Row or Page Compression:
Row Compression Savings?
Page Compression Savings?
- Because our table is already partitioned, we have a lot more sample data, as savings are calculated for each partition in a table
- On average, it appears that we will reap 50% savings from row-level compression
- On the other hand, for page-level compression, based on representative data, our compressed size will be about 15% of the un-compressed size; thus we will reap upwards of 85% savings
Actual Compression Steps:
Compress – Index (Clustered and Non-Clustered)
In the example below, we are employing Page Level Compression on a partitioned table. Here are the other characteristics of this particular processing:
- Sort in tempdb, as this is an Enterprise Edition instance. Sorting in tempdb is faster, as we have tempdb on fast storage
- Make sure to repeat this step for all partitions in the table
-- Syntax
ALTER INDEX indexName ON [schema-name].[object_name]
REBUILD PARTITION = [partitionNumber]
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);

-- Actual
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD PARTITION = 1
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);
Compress – Index (Clustered and Non-Clustered)
Page-level compression applied to an un-partitioned index.
-- Syntax
ALTER INDEX indexName ON [schema-name].[object_name]
REBUILD
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);

-- Actual
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);
Compress – Heap
In cases where we do not have a clustered index but a heap, you can apply compression to the table directly.
-- Syntax
ALTER TABLE [schema-name].[object_name]
REBUILD PARTITION = [partitionNumber]
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);

-- Actual
ALTER TABLE [dbo].[customer]
REBUILD PARTITION = 1
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);
Compress – Heap
In cases where your heap is not partitioned, leave out the partition option.
-- Syntax
ALTER TABLE [schema-name].[object_name]
REBUILD
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);

-- Actual
ALTER TABLE [dbo].[customer]
REBUILD
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);
- Ensure that Database Mirroring is set to high performance — You really do not want to pay the price for High Safety
- Dedicate fast IO and abundant storage for Log Space
- Dedicate fast IO for TempDB
What should you expect to see:
SELECT StatMan([SC0], [SB0000])
FROM
(
    SELECT TOP 100 PERCENT [SC0]
         , step_direction([SC0]) OVER (ORDER BY NULL) AS [SB0000]
    FROM
    (
        SELECT [column1] AS [SC0]
        FROM [dbo].[object-name] TABLESAMPLE SYSTEM ([percentile] PERCENT)
        WITH (INDEX(1), READUNCOMMITTED)
        WHERE $PARTITION.partitionFunction([partitionFunction]) = 21
    ) AS _MS_UPDSTATS_TBL_HELPER
    ORDER BY [SC0], [SB0000]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)
- The SQL Engine gathers statistics
- As always, one learns a lot when one is able to look under the hood a bit
- In our case, I am able to confirm that we have chosen the wrong first column for our partitioned index — more on that in a later post
- The table hint — INDEX(1), READUNCOMMITTED — is also new to me. I am used to using NOLOCK rather than the ANSI-standard READUNCOMMITTED
- Usage of MAXDOP 1 — asking for no parallelism and thus reducing the likelihood that we will starve the system for resources
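MAXDOP can also be capped explicitly on the rebuild itself, so the compression operation does not monopolize CPUs. A sketch, reusing the hypothetical index and table names from the earlier examples:

```sql
-- Rebuild with page compression, limited to 2 parallel workers
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD
WITH
(
      SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
    , MAXDOP           = 2
);
```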
“select *, %%bmk%% from tablename” while rebuilding the index
- This is one of the signature statements that occurs when an index is being rebuilt, as is the case when turning on compression
Quite a bit of session blocking
- Index rebuilding needs a bit of locking, such as schema locks (LCK_M_SCH_S)
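On Enterprise Edition you can reduce (though not eliminate) that blocking by rebuilding online; the restrictive locks are then held only briefly at the start and end of the operation:

```sql
-- Online rebuild: concurrent queries and DML remain possible
-- for most of the operation (Enterprise Edition only)
ALTER INDEX idx_Customer_Name ON [dbo].[customer]
REBUILD
WITH
(
      ONLINE           = ON
    , SORT_IN_TEMPDB   = ON
    , DATA_COMPRESSION = PAGE
);
```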
Performance Profiling – Dynamic Management Views
To track compression at the individual index level, you can use the sys.dm_db_index_operational_stats DMV. Once you rebuild your indexes, the existing data is compressed; to track ongoing compression attempts and successes, query the same view. The relevant columns are page_compression_attempt_count and page_compression_success_count:
USE [dbHR];

DECLARE @dbid int;
DECLARE @objectID int;

SET @dbid = DB_ID();
SET @objectID = OBJECT_ID('dbo.employee');

SELECT OBJECT_NAME(tblOperationalStats.object_id) AS objectName
     , tblIndex.name
     , tblOperationalStats.partition_number
     , tblOperationalStats.page_compression_attempt_count
     , tblOperationalStats.page_compression_success_count
FROM sys.dm_db_index_operational_stats(@dbid, @objectID, NULL, NULL) tblOperationalStats
INNER JOIN sys.indexes tblIndex
        ON tblOperationalStats.object_id = tblIndex.object_id
       AND tblOperationalStats.index_id  = tblIndex.index_id
ORDER BY tblOperationalStats.page_compression_success_count DESC
       , tblOperationalStats.page_compression_attempt_count DESC
Here is the result:
I was surprised to see that our page_compression_success_count is very low relative to our page_compression_attempt_count.
But, luckily Jonathan Kehayias blogged about this:
An xevent a day 28 of 31 – Tracking Page compression operations
The sqlserver.page_compression_attempt_failed is really poorly named, and doesn’t provide information about failures in the sense that something broke, but provides information for why a page compression attempt did not actually change the compression of the data in the page.
The PageModCountBelowThreshold failure isn’t really a failed attempt at compression. This reason shows that the page was evaluated for recalculation, and the modified counter for the page hadn’t passed the internal threshold for recalculation so the actual compression operation wasn’t performed.
So based on Jonathan’s public commentary, it appears that MSFT rightfully considers whether compression is actually needed/beneficial, and if so, notches up the page_compression_success_count.
If not, the page_compression_success_count is left as is.
This makes more sense once one considers that overall compression has already occurred during the index rebuild operation and that this current tally is just ongoing compression consideration.
Performance Profiling – OS (MS Windows) – Performance Monitor
At a MS SQL Server Instance level, one can get a high level view on how frequently page compression is attempted and actually fulfilled.
- SQLServer:Access Methods –> Page compression attempts/sec
- SQLServer:Access Methods –> Page compressed/sec
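The same counters are exposed through sys.dm_os_performance_counters, which is handy if you prefer T-SQL over Performance Monitor. The LIKE filters are deliberately loose, since the object name carries an instance prefix on named instances and counter names may vary slightly across versions:

```sql
-- Instance-wide page compression counters
SELECT object_name
     , counter_name
     , cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Access Methods%'
  AND counter_name LIKE '%compress%';
```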
What to keep in mind
What to keep in mind – Table
- We have concentrated on index compression so far, and really not said much about actual table compression. If you have a clustered index, then you can concentrate solely on the clustered index, as the clustered index is representative of the data (the clustered index is index_id = 1)
- On the other hand, if you have a heap, then apply compression to the table and look for index_id = 0
- If you have chosen to compress more than one index on a table and your candidate indexes include a clustered index, then apply compression to the clustered index first, as all indexes are rebuilt whenever the clustered index is rebuilt
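To tell whether a table is a heap or has a clustered index (and therefore whether to compress the table or the index), index_id in sys.indexes is enough: 0 means heap, 1 means clustered index. A sketch, assuming the hypothetical dbo.customer table from the earlier examples:

```sql
-- index_id = 0 => heap (compress the table)
-- index_id = 1 => clustered index (compress that index)
SELECT OBJECT_NAME(i.object_id) AS table_name
     , i.index_id
     , i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID('dbo.customer')   -- hypothetical table
  AND i.index_id IN (0, 1);
```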
References:
- Data Compression Strategy
- Data Compression
- Some Data compression commands for SQL Server
- Data Compression in SQL 2012
- Managing Compression in SQL Server 2008
- An XEvent a Day (28 of 31) – Tracking Page Compression Operations