Technical: Microsoft – SQL Server – Data I/O – Computed Columns
For the last couple of weeks, I have been struggling with a little SQL script that transfers data between two SQL Server instances.
The Transact-SQL is simple enough. It uses SQL Server’s highly regarded Linked Server feature (heterogeneous database access) to copy data from a legacy system to a new one.
Original Query – Linked Server
Here is the original query, which uses the Linked Server:
```sql
declare @logdateDestMax datetime
declare @logdateDestMaxDefault datetime

set @logdateDestMaxDefault = '1/1/1900'

set @logdateDestMax = (select max(createDate) from dbo.employee)

if (@logdateDestMax is null)
begin
    set @logdateDestMax = @logdateDestMaxDefault
end

insert into dbo.employee
(
      [username]
    , [createDate]
    , [createdBy]
)
select
      [username]
    , [createDate]
    , [createdBy]
from [DBLABJ_NC_10_ODBC].[dbHR].dbo.employee tblSrc
where [logdate] > @logdateDestMax
```
Rewrite Query – OpenQuery
As I know that 4-part name queries can sometimes be problematic and slow, I thought maybe I should rewrite the query to use OPENQUERY.
Here is the rewritten query, which uses the Linked Server via OPENQUERY:
```sql
declare @logdateDestMax datetime
declare @logdateDestMaxDefault datetime

set @logdateDestMaxDefault = '1/1/1900'

set @logdateDestMax = (select max(createDate) from dbo.employee)

if (@logdateDestMax is null)
begin
    set @logdateDestMax = @logdateDestMaxDefault
end

insert into dbo.employee
(
      [username]
    , [createDate]
    , [createdBy]
)
select
      [username]
    , [createDate]
    , [createdBy]
from openquery( [DBLABJ_NC_10_ODBC], 'select * from [dbHR].dbo.employee' ) tblSrc
where [logdate] > @logdateDestMax
```
OpenQuery – Poor Auditioning
The example above is a very poor staging of the benefits of OPENQUERY.
Usually, we want a query that can be serviced entirely on the remote server. The query above still brings all the data over to the querying server.
Once all the data is brought over, the local SQL instance then applies the comparison.
There are things we can do to better ‘position’ the Linked Server. Our alternative choices include:
- Creating a Stored Procedure on the Linked Server. The stored procedure will accept parameters (for example, our watermark date) so that the filtering happens remotely.
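As a rough sketch of that stored-procedure option (the procedure name, parameter name, and the RPC setup shown here are illustrative assumptions, not part of the original script):

```sql
-- Hypothetical procedure created on the remote server, in the dbHR database.
-- It applies the watermark filter remotely, so only qualifying rows
-- cross the wire back to the calling server.
create procedure dbo.usp_GetEmployeesSince
    @logdateMin datetime
as
begin
    set nocount on;

    select [username], [createDate], [createdBy]
    from dbo.employee
    where [logdate] > @logdateMin;
end
go

-- On the calling server, run the procedure at the Linked Server.
-- Note: the Linked Server must have the 'rpc out' option enabled, e.g.:
--   exec master.dbo.sp_serveroption
--       @server = N'DBLABJ_NC_10_ODBC', @optname = N'rpc out', @optvalue = N'true';
insert into dbo.employee ([username], [createDate], [createdBy])
exec [DBLABJ_NC_10_ODBC].[dbHR].dbo.usp_GetEmployeesSince
    @logdateMin = @logdateDestMax;
```

With this shape, the comparison runs where the data lives, and the calling server only receives the rows it actually needs.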
Compare 4 part name and OpenQuery
Compare Execution Plans
Here is a quick comparison of both plans.
One can quickly see that the plan that uses the 4-part name is costed at 53%, while the one that uses OPENQUERY is at 36%.
Compare IO Stats
With “set statistics io on”, we can see the IO stats for the first query:
Table 'employee'. Scan count 1, logical reads 1369, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
We do not have corresponding IO stats for the query that uses OPENQUERY, as that query is completely ‘materialized’ on the Linked Server, and it seems profiling stats are not returned for externally run queries.
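For reference, the stats above were captured by turning on IO statistics for the session before running each batch; a minimal pattern looks like this:

```sql
set statistics io on;

-- run the insert ... select batch here

set statistics io off;
```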
Remote Scan Stats
If we focus on the remote Scan Stats, we are able to dig a bit deeper into our estimated and actual costs:
Linked Server Cost
Open Query Cost
Here is the breakdown:
| Item | 4 Part Name | OpenQuery |
| --- | --- | --- |
| Actual Number of Rows | 5 | 49821 |
| Estimated CPU Cost | 4.9921 | 3.3633 |
| Estimated I/O Cost | 0 | 0 |
| Estimated Number of Executions | 1 | 1 |
| Estimated Number of Rows | 14936.3 | 10000 |
| Estimated Operator Cost | 4.9921 (100%) | 3.36333 (100%) |
| Estimated Row Size | 448 Bytes | 448 Bytes |
| Estimated SubTree Cost | 4.9921 | 3.36333 |
| Logical Operation | Remote Query | Remote Scan |
| Number of Executions | 1 | 1 |
| Output List | [DBLAB].[dbo].[UpdateLog].col1, [DBLAB].[dbo].[UpdateLog].col2, [DBLAB].[dbo].[UpdateLog].logdate, [DBLAB].[dbo].[UpdateLog].col3 | SQLNCLI11 |
| Physical Operation | Remote Query | Remote Scan |
| Remote Object | SELECT "Tbl1001"."col1" "Col1003", "Tbl1001"."col2" "Col1004", "Tbl1001"."logdate" "Col1005", "Tbl1001"."col3" "Col1006" FROM "DBLAB"."dbo"."UpdateLog" "Tbl1001" WHERE "Tbl1001"."logdate">? | select * from [DBLAB].dbo.UpdateLog tblSrc |
| Remote Source | Named instance name | Actual name of the Linked Server |
- “Actual Number of Rows” – OPENQUERY is bringing back all the records and comparing them against our data filter locally. The 4-part name, it seems, only reports the number of records that will be returned by the select statement
- For the 4-part name, we have rebinds. For a good explanation of what rebinds are, please check out Scary DBA’s blog posting – Rebind and Rewind in Execution Plans (http://www.scarydba.com/2011/06/15/rebind-and-rewind-in-execution-plans/)
- Estimated CPU Cost – for the 4-part name it is a bit higher
- Estimated Number of Rows – In the case of the “4 Part Name”, the engine can consult the Linked Server and ask it for the number of records in the table; whereas for “Open Query”, the query would have to be run. And so we get a placeholder count of 10,000
- Estimated Operator Cost – Same as Estimated CPU Cost; it seems most of our cost is attributed to CPU; very little I/O cost
- Estimated SubTree Cost – all of the query’s cost is absorbed by this operation
- Logical Operation – “Remote Query” (Linked Server) vs “Remote Scan” ( openquery)
- Output List :- Openquery lists our SQL Server Provider (SQLNCLI11)
- Parallel :- Our record count is relatively low, so there is no parallelism
- Physical Operation – Same as Logical Operation: “Remote Query” (Linked Server) vs “Remote Scan” (OPENQUERY)
- In this case our Linked Server is on the same box as our calling server. For the 4-part name, our entry is just the instance name itself (SQLExpress), not the full instance name.
And the OPENQUERY entry has the full instance name of our Linked Server
- Linked Server offers a bit more reliable instrumentation
- I/O Costs for the remote scan that is employed when we use OpenQuery are hidden
We decided to try bcp, as the Linked Server selects were taking about 1.5 hours for a measly 15K (15,000) records.
Now that we are out of MS SQL Server, we will use Gammadyne’s timer.exe (http://www.gammadyne.com/cmdline.htm#timer) to time the performance of our bcp sessions.
As we are only profiling for performance, we will tweak our code by making a couple of adjustments.
- Add -L <N> (-L 1000) :- only copy the first 1000 records
Bcp Performance – Table
Get data from the entire table.
```shell
bcp "dbo.employee" out employee.txt -c -S hrDB -d hr -T -L 1000
```
- So we are only getting 11.63 rows per second.
- And it has taken us 90 seconds to get 1000 records.
Bcp Performance – QueryOut
Get data for all columns.
```shell
bcp "select * from dbo.employee" queryout employee.txt -c -S hrDB -d hr -T -L 1000
```
- So we are only getting 11.56 rows per second.
- And it has taken us 86.8 seconds to get 1000 records.
So our numbers are still nothing to jump up and down about.
We looked at the following areas on the Linked Server:
- Wait Stats
Bcp Performance – QueryOut – Filtered ColumnList
We reviewed our column list and noticed that one of the columns we were bringing back is a computed column.
We removed that column by explicitly listing the columns that we want:
```shell
bcp "select FIRSTNAME, LASTNAME, USERNAME, PASSWORD from dbo.employee" queryout employee.txt -c -S hrDB -d hr -T -L 1000
```
- So we are now getting 62,500 rows per second.
- And it has taken us only 62 milliseconds to get 1000 records.
So our biggest drag was the “computed column”. We do not even need it, as it will be regenerated on the other side.
Find Computed Columns
Find Computed Columns – MS SQL Server 2000
```sql
select
      tblUser.name   as schemaName
    , tblObject.name as objectName
    , tblColumn.name as columnName
from syscolumns tblColumn
    inner join sysobjects tblObject
        on tblColumn.id = tblObject.id
    inner join sysusers tblUser
        on tblObject.uid = tblUser.uid
where tblObject.[type] = 'U'
and   tblColumn.iscomputed = 1
```
Find Computed Columns – MS SQL Server 2005 and above
```sql
select
      tblSchema.name as schemaName
    , tblObject.name as objectName
    , tblColumn.name as columnName
    , tblColumnComputed.definition
    , tblColumnComputed.is_computed  as [isComputed]
    , tblColumnComputed.is_persisted as isPersisted
from sys.columns tblColumn
    inner join sys.objects tblObject
        on tblColumn.object_id = tblObject.object_id
    inner join sys.schemas tblSchema
        on tblObject.schema_id = tblSchema.schema_id
    inner join sys.computed_columns tblColumnComputed
        on  tblColumn.object_id = tblColumnComputed.object_id
        and tblColumn.column_id = tblColumnComputed.column_id
where tblObject.type = 'U'
and   tblColumn.is_computed = 1
```
So again keep an eye on computed columns.
They were introduced in MS SQL Server 2000, but at that time they could not be persisted.
From v2005 onwards, they can be persisted.
Persisted means their values are computed upon initial creation or subsequent updates; thereafter the saved value is simply read, and so “readers” are not forced to pay the price of re-calculation.
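A minimal sketch of a persisted computed column (the table and column names here are illustrative, not from the HR system above):

```sql
create table dbo.employeeDemo
(
      FIRSTNAME varchar(50)
    , LASTNAME  varchar(50)
      -- Computed and PERSISTED: calculated on insert/update, stored on disk,
      -- and simply read back on subsequent selects.
    , FULLNAME  as (FIRSTNAME + ' ' + LASTNAME) persisted
);
```

Without the `persisted` keyword, FULLNAME would instead be re-evaluated every time it is selected.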
But even then, make sure that you only include them when you need them. Be especially careful when you use * (select * from <schema-name>.<object-name>), or when you otherwise reference the entire table.