Dennis Forbes – The Real Advantage of Column-Oriented Databases

Background

About this same time last year, I had a conversation with someone about columnar databases.  He was trying to gauge my experience with them.

Unfortunately, though a year has passed, I still don’t know much about them.

 

Smart Phone and Commute

Using public transit bears fruit, and so I brought out the smart phone and googled “use cases for Columnar Databases“.

Found some good hits. Here is the best one so far.

 

Dennis Forbes – The Real Advantage of Column-Oriented Databases

Sorry, Mr. Forbes, I have to post this one in its entirety.

 

Link

COLUMN-ORIENTED DATABASES ARE TRENDING…GET THEM WHILE THEY’RE HOT

Column-oriented databases are making a regular appearance on technology sites as a bit of a silver bullet for database performance issues.

It is often presented and perceived as an evolution of database designs, much as was seen with the emergent NoSQL options (on the topic of NoSQL, some claim that MongoDB is column-oriented. It is in actuality moving in exactly the opposite direction).

In this case many seem to believe that column-oriented databases correct the mistake of row-oriented storage.

So why are row-oriented databases still so common? How different are they in common practice? It seemed worth a quick post on the topic.

Note that the below is focused purely on read/analysis workloads, glossing over the increased complexity and cost of changes to data in a column-oriented database. Column-oriented databases are not a fit for OLTP workloads.

THE COMMON, BUT UNREALISTIC EXAMPLE

Column-oriented storage intuitively makes sense for certain problems, the easiest to understand being range aggregates: Isn’t it easier to average an array of a billion floats than to iterate a billion records, which may comprise many columns, and pull out the related value?

To think of this from a code perspective (the ways that we would naively solve these same issues in code are similar to how they actually have been solved in database systems; it isn’t as foreign as it may seem, though of course I am simplifying out notions like b-trees, pages, and extents), imagine dealing with row versus column storage in C code.


struct person_row {
        int id;
        char name[32];
        char sex[1];
        short year_of_birth;
        float income;
        char province[2];
        char country[3];
};

struct person_row rows[COUNT];

…populate COUNT rows…

Note that this is fairly close to how a database like SQL Server actually does store row data if you used fixed-width types, albeit with a bit of extra bookkeeping data, null flags and padding. If you used variable-width types things get a little more complex, but the concept is generally similar.

Oracle, as a curious aside, stores data in an effectively delimited form, which is how you end up with oddities like the empty string ('') being treated as null.

Imagine that you have a billion of these 48-byte records (for simplicity, there is no padding), most of it sitting in paged out virtual memory. For the purposes of this example your server has 8GB of available memory, so only a minority of your dataset can be in RAM at one time.

You are tasked with calculating the average income.

You need to iterate over those records — over 48GB of data, ignoring overhead and bookkeeping — swapping in page after page from glacially slow external storage, “extracting” the income and accumulating it in the average.

Contrast the “column-oriented” solution.


struct person_columns {
        int id[COUNT];
        char name[COUNT][32];
        char sex[COUNT][1];
        short year_of_birth[COUNT];
        float income[COUNT];
        char province[COUNT][2];
        char country[COUNT][3];
};

struct person_columns columns;

…populate COUNT column sets…

For the same one billion records example, you would have 4GB of id data, followed by 32GB of name data, and so on. The income would be one linear series of 4GB of floats that you could speedily iterate over if it was the only data you concerned yourself with. Not only would it minimize the IO that needs to happen, data locality maximizes CPU cache coherency and enables trivial vectorization.

Were you performing this query multiple times, the entirety of the income data could be paged into and stay resident in memory, yielding extremely speedy repeat runs and turning in results orders of magnitude better than the “row” design.

Such an example is oft used to pitch column-oriented databases. It makes for a pretty compelling comparison, and there is absolutely no question that for the purposes of performing calculations on a range of that single column, it is a dramatically superior layout.

This is the sort of example given in most pitches extolling the virtues of column-oriented databases.

HOW DO YOU REALLY USE DATA?

How realistic of an example is it, though? How often do we ever aggregate the entire population of a single column? Aside from management COUNT operations (which themselves can often be served purely through statistics), it is, in my experience, a very rare case.

Instead we filter and group and join. What if, for instance, we wanted average income by country, province and year_of_birth, where year_of_birth is less than 1998? For the purposes of estimating population counts, in our imaginary dataset 80% of the records have a year_of_birth pre-1998.
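Expressed as SQL (a sketch against a hypothetical person table shaped like the struct above), the query being described would look like this:


select
          country
        , province
        , year_of_birth
        , avg(income) as average_income
from    person
where   year_of_birth < 1998
group by
          country
        , province
        , year_of_birth;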

Using your C chops, consider how you would do that with both the column- and row-oriented data structures to generate results.

You might create accumulator buckets (sum and count) for [country, province, and year of birth]. You then iterate over the entire rowset, checking whether a bucket exists for each combination, creating one if it doesn’t, and then adding to the sum and incrementing the count on matches that satisfy the filter. You’re table scanning linearly through the entire 48GB of data, but because it’s end to end it is processed as fast as it can be. It is making the best of the worst situation.

Now do the same with the column-oriented data. You iterate over the year_of_birth, and if the filter matches — the year is less than 1998 — you pull the income and perform the bucket operations based upon the data pulled from country and province. In this C example, you’re dealing with data that is separated by GBs, which means address jumping that is more expensive both in memory and with the processor, and when pulling from storage. In the end you will have iterated over less total data (11GB or so), but performance will likely be similar to a whole “table scan” (or, in our case, struct array scan), which, as an aside, is why most RDBMS platforms will ignore indexes and just do table scans if the statistics hint that the result set will exceed some threshold (which is much lower than most people expect).

If it happened to be that you had 12GB of free memory in your server, and you ran the query multiple times, it would hit a sweet spot of high in-memory performance, so long as you did nothing else that needed to evict those pages.

Of course this is all very contrived. What if our rowset was bizarrely huge? What if we wanted to draw more columns into the groups and filters? What if we wanted to group by age? All dramatically skew how we would look up data.

Change the quantity of data, the hot columns and data, the cardinality of a column, the stored sort order, the amount of memory in the machine, the type of storage, and everything changes.

As is the case with much of this industry, it really depends. There are no clear-cut answers in general. Only for a specific purpose can the two types of databases be meaningfully compared.

Of course both of these styles of databases can have indexes (I’ve written about rowset indexes before). Notably we could have a covering index on the rowset for the specific query – year_of_birth, province, country, INCLUDE(income). Such an index would allow a query like the above to be satisfied with only the index, much more efficiently than the separated column layout. Vertica, on the other hand, offers the notion of projections, which are effectively indexes going in the opposite direction: It pre-joins and pre-sorts a number of columns, essentially building a rowset.
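For illustration, the covering index mentioned above might be declared as follows (a sketch in SQL Server syntax, against the same hypothetical person table):


create nonclustered index INDX_person_yob_province_country
    on dbo.person (year_of_birth, province, country)
    include (income);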

For a properly indexed database, both Vertica and SQL Server would be working with indexes that are extremely similar, both simply optimizing in different directions (SQL Server prunes down the columns to allow a much smaller set of data to be loaded and processed. Vertica combines the columns).

SO WHERE IS THE BIG COLUMN-ORIENTED ADVANTAGE?

None of that is to say that column-oriented storage doesn’t have a place, as clearly it does.

In data warehouses, for instance, data is populated into dimensions and facts, with the fact components stored in a specific order such that the values used in computations can often be found in sequential ranges. e.g. facts 1027 – 2087 are the daily sales of widgets from a specific store. OLAP cubes are generally column-oriented, but prescribed for a very specific, singular use.

Financial systems, such as a kdb+ HFT system, are generally column-oriented because each set of data is a specific, linear set of values, such as streaming bids for AAPL.

But those benefits don’t carry over to general database systems and standard use.

There is, however, one area where column-oriented databases almost always demonstrate significantly better performance than row-oriented databases: highly compressible data that would otherwise be much larger than RAM but can be squeezed to fit, typically where many columns have a very low degree of cardinality.

In such a case the actual stored data (on disk and in memory) can be dramatically smaller than the data represented, and this is a tactic that works especially well for column-oriented storage.
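As a rough worked example: the sex column in the earlier struct holds a billion one-character values, about 1 GB raw; dictionary-encode a two-value column down to a single bit per row and it shrinks to roughly 125 MB, before run-length encoding of repeated values compresses it further.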

This is a significant reason why, in those benchmarks where the working set is conceptually much larger than memory but, because of significant redundancy, can be compressed to smaller than memory, column-oriented databases often lap traditional databases. And it’s worth noting that demonstrative benchmarks often purposefully choose exactly such a contrast (product A needing to hit slow I/O endlessly, while the working set of product B stays entirely resident in memory).

This benchmark on flight data compares some column-oriented databases, and it’s notable that on those benchmarks that touched significant parts of the dataset, Infobright easily won because it had compressed the data so significantly that it didn’t need to ever page. In those benchmarks that could just work with what was in memory, MonetDB won.

But that specific dataset is a data dump that could be reduced dramatically through simple, basic normalization. Instead of just storing airport code, for instance, it stores the entire hierarchy of jurisdictions that lead down to the airport code, and in this benchmark the entirety of wholly redundant data was imported for each row.

In any case, some products, such as Vertica mentioned above, can take it further and fuse data that is used together and fits the entropy profile to magnify the impact of compression.

The overall total volume of data actually increases (you still have all of the individual columns, but now have the fused projections as well), but the likely working data — what you repeatedly interact with — will be much more efficient to work with. The same concept applies to rowset databases, where you might add GBs of targeted indexes, but in the end that means that the hot data used for the majority of transactions is much smaller, while the disk data is much larger.

Of course you might be looking at that compression and thinking “Isn’t that just normalizing? Why would I ever store all of that redundant data?”, and indeed to a degree (although not nearly to the same magnitude) the savings are similar. Had your person/driver-license-number rows simply referenced a city row, itself referencing a county row, and so on, you would yield some of the space savings that you do with RLE compression and a column-oriented database.

But row-oriented is still far less compressible (products like SQL Server do offer row compression, which again — to pound on this theme — is primarily beneficial if your hot data exceeds your available memory) than column-oriented in the majority of cases.

Again, the “it depends” factor. When I discussed Digg’s database architecture, it was in regards to exactly that sort of choice: Digg exploded a small amount of relational data into an enormous amount of denormalized data. For cases where they have limited IOPS and very limited memory, that may be the right choice. For many other cases it could absolutely work against them.

CONCLUSION

Column-oriented databases are very cool, and they have a role to play in data that is much larger than available RAM, or where naive aggregates or stream processing is paramount. They’re often found for specific purposes in the financial industry, and there are some absolutely fantastic products.

It is less conclusively a fit if your write load is at all significant, or if the usage of the data is general if not arbitrary. Column-oriented databases need to be even more purposefully used than row-oriented if you hope to have efficient operations.

Sorry

Once again, sorry, Mr. Forbes, that I did not observe proper ethics and simply post a link to your piece.

Like Tears for Fears, I fall all over good solid engineering.

And, added on, in this case, good legal cross examination.

 

DBeaver

Background

Here I am, having used SQL Server Management Studio (SSMS) for so long that I am satisfied with it.

Its GUI is vastly superior to SQL 2000 Query Analyzer, and it exposes just about every SQL functionality that I use on a daily basis.

 

Stumped

I spent hours trying to capture a query’s grid display into Excel, but Excel’s column size was getting in the way.

And so I googled for ways to display columns as rows.

 

DBeaver

Googled and found DBeaver.

The nice thing about it is that I can export a query’s result set into XML or HTML.

 

Download

DBeaver is available here.

 

Export Query

Screen Shot

Data Transfer Target Type and Format

Image

Data Transfer – Extraction Settings

Image

Data Transfer – Settings – See export settings

Image

 

Data Transfer – Output – Configure export output parameters

Image

Data Transfer – Confirm – Check results

Image

 

Sample Files

HTML

XML

 

Other Functionalities

Database Diagram

Database Modeling is also easy and nice.

 

Table Structure

Properties

References

DDL

 

Data

 

Database Supported

DBeaver supports both SQL and NoSQL Databases.

For SQL databases, it relies on JDBC.

Driver Manager

Transact SQL – Identify In-Memory Objects

Introduction

Let us identify In Memory Objects in our SQL Server Instance.
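For a quick playground, here is a minimal sketch of a memory-optimized table that the queries below would find (the table and column names are hypothetical, and it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup):


-- Hypothetical sample table; assumes the database has a MEMORY_OPTIMIZED_DATA filegroup
create table dbo.inMemorySample
(
      [id]      int           not null
        primary key nonclustered hash with (bucket_count = 1024)

    , [payload] nvarchar(100) null
)
with (memory_optimized = on, durability = schema_and_data);
go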

 

User Tables – sys.tables

Overview

In the sample code below, we identify In-Memory objects by checking the is_memory_optimized column of the sys.tables catalog view.

Code



select
		 [table]
			= quotename(tblSS.[name])
				+ '.'
				+ quotename(tblST.[name])

from   sys.tables tblST

inner join sys.schemas tblSS

		on tblST.[schema_id] = tblSS.[schema_id]

where  tblST.is_memory_optimized = 1

order by
		[table] asc


Output

identifyinmemorytables-systables

User Tables & Indexes – Based on FileGroups

Overview

In the sample code below, we identify In Memory Objects by seeking out objects that are located on In-Memory Filegroups.

Code



select 

		  [object]
			= tblSS.[name]
				+ '.'
				+ tblSO.[name]

		, [objectType]
			= tblSO.[type_desc]


		, [indexName]
			= tblSI.[name]


		, [isPrimaryKey]
			= case
				when tblSI.[is_primary_key] = 1 then 'Yes'
				else 'No'
			  end

		, [indexType]
			= tblSI.[type_desc]
	
		, [fileGroup]
			= tblSFG.[name]

		, [fileGroupType]
			= tblSFG.[type_desc]

from   sys.objects tblSO

INNER JOIN sys.schemas tblSS

	ON tblSO.schema_id = tblSS.schema_id

inner join sys.indexes tblSI

	on tblSO.object_id = tblSI.object_id

inner join sys.partitions tblSP

	on  tblSI.object_id = tblSP.object_id
	and tblSI.index_id = tblSP.index_id

		
INNER JOIN sys.allocation_units tblSAU

	on tblSAU.container_id = tblSP.hobt_id
		 
INNER JOIN sys.filegroups tblSFG
 
	ON tblSFG.data_space_id = tblSAU.data_space_id 

where tblSFG.[type] = 'FX'



Output

identifyinmemoryobjects

 

 

User Tables

DML Stats

Overview

Review Statistics of DML Operations against In-Memory Tables.

Code


SELECT
	  [object]
		= quoteName(tblS.[name])
		   + '.'
		   + quoteName(tblO.[name])

        , tblXTPOS.*

FROM sys.dm_db_xtp_object_stats tblXTPOS

INNER JOIN sys.objects tblO

	ON tblXTPOS.object_id = tblO.object_id

INNER JOIN sys.schemas tblS

	ON tblO.schema_id = tblS.schema_id


 

Output

dmlstats

 

Index Stats

Overview

Review statistics of index usage against In-Memory Table Indexes.

Code





SELECT
        [object]
        = quoteName(tblS.[name])
           + '.'
           + quoteName(tblO.[name])
 
		, [index]
			= tblSI.[name]

		, [indexType]
			= tblSI.[type_desc]

		, [isPrimaryKey]
			= case
				when ( tblSI.[is_primary_key] = 1 ) then 'Y'
				else 'N'
			  end

        , [scansStarted]
			= tblXTPIS.scans_started	

		, [scansRetried]
			= tblXTPIS.scans_retries
			
		, [rowsReturned]
			= tblXTPIS.rows_returned

		, [rowsTouched]
			= tblXTPIS.rows_touched	

		/*

		--- Expiring -----
		, tblXTPIS.rows_expiring
		, tblXTPIS.rows_expired
		, tblXTPIS.rows_expired_removed	
		--- Expiring -----

		-- Phantom ----
		, tblXTPIS.phantom_scans_started	
		, tblXTPIS.phantom_scans_retries	
		, tblXTPIS.phantom_rows_touched	
		, tblXTPIS.phantom_expiring_rows_encountered	
		--, phantom_expired_rows_encountered	
		, tblXTPIS.phantom_expired_removed_rows_encountered	
		, tblXTPIS.phantom_expired_rows_removed
		-- Phantom ----

		*/

FROM sys.dm_db_xtp_index_stats tblXTPIS
 
INNER JOIN sys.objects tblO
 
    ON tblXTPIS.object_id = tblO.object_id

INNER JOIN sys.indexes tblSI
 
    ON  tblXTPIS.object_id = tblSI.object_id 
    AND tblXTPIS.index_id = tblSI.index_id 

INNER JOIN sys.schemas tblS
 
    ON tblO.schema_id = tblS.schema_id

order by
		  tblS.[name]
		, tblO.[name]
		, tblSI.[name]

Output

imtabledmlindexstats

Explanation:

  1. In-Memory tables do not support Clustered Indexes, as all data is stored in memory
    • If clustered, memory would have to be continuously shuffled to maintain the proper sequence
  2. It makes sense to review ScansStarted
    • Doing so is informative in terms of which indexes are actually being used
    • And, to gauge popularity
  3. The RowsReturned Column
    • Reviewing the RowsReturned column is also important to follow trends in terms of access methods

Memory Usage Stats

Memory Usage Stats – Object

Overview

Review Memory allocated and in-use by In-Memory Tables.

Code


SELECT

	[object]
		= quotename(tblSS.name)
			+'.'
			+ quotename(tblSO.name)

    , [allocatedMB]
		= SUM(tblXTPMC.[allocated_bytes]) / (1024* 1024)

	, [usedMB]
		= SUM(tblXTPMC.[used_bytes]) / (1024 * 1024)

FROM  sys.objects tblSO

INNER JOIN sys.schemas tblSS
 
	ON tblSO.schema_id = tblSS.schema_id 

INNER JOIN sys.indexes tblSI
 
	ON tblSO.object_id = tblSI.object_id 

INNER JOIN sys.dm_db_xtp_memory_consumers tblXTPMC 

	ON  tblSI.object_id = tblXTPMC.object_id
	AND tblSI.index_id  = tblXTPMC.index_id

GROUP BY 
		  quotename(tblSS.name)
		, tblSO.schema_id
		, tblSO.object_id
		, tblSO.[name]
		
ORDER BY 
		  quotename(tblSS.name)
		, tblSO.[name]
;


Output

memoryallocatedandinusedbyobject

Memory Usage Stats – Index

Overview

Review Memory allocated and in-use by In-Memory Table Indexes.

Code


SELECT

	[object]
		= quotename(tblSS.name)
			+'.'
			+ quotename(tblSO.name)

	, [index]
		= tblSI.name

	, [indexType]
		= tblSI.[type_desc]

    , [isPrimaryKey]
        = case
            when tblSI.[is_primary_key] = 1 then 'Yes'
            else 'No'
            end

    , [allocatedMB]
		= SUM(tblXTPMC.[allocated_bytes]) / (1024* 1024)

	, [usedMB]
		= SUM(tblXTPMC.[used_bytes]) / (1024 * 1024)

FROM  sys.objects tblSO

INNER JOIN sys.schemas tblSS
 
	ON tblSO.schema_id = tblSS.schema_id 

INNER JOIN sys.indexes tblSI
 
	ON tblSO.object_id = tblSI.object_id 

INNER JOIN sys.dm_db_xtp_memory_consumers tblXTPMC 

	ON  tblSI.object_id = tblXTPMC.object_id
	AND tblSI.index_id  = tblXTPMC.index_id

GROUP BY 
		  quotename(tblSS.name)
		, tblSO.schema_id
		, tblSO.object_id
		, tblSO.[name]
		, tblSI.name
		, tblSI.[type_desc]
		, tblSI.[is_primary_key]
		
ORDER BY 
		  quotename(tblSS.name)
		, tblSO.[name]
		, tblSI.name


Output

memoryallocatedandinusedbyobjectandindex

Programmable Objects – sys.sql_modules

Overview

In the sample code below, we identify natively compiled (In-Memory) programmable objects by checking the uses_native_compilation column of the sys.sql_modules view.

Code



/*

	sys.sql_modules

		execute_as_principal_id

			Value of -2 indicates that the batch submitted does not depend on implicit name resolution and can be shared among different users. 
			This is the preferred method. Any other value represents the user ID of the user submitting the query in the database.

*/

select 
		  [object]
			= quoteName(tblSS.[name])
				+ '.'
				+ quoteName(object_name(tblSSM.[object_id]))

		, [type]
			= tblSO.type_desc

		, [createDate]
			= tblSO.[create_date]

		, [isSchemaBound]
			= case
					when tblSSM.is_schema_bound = 1 then 'Y'
					else 'N'
			  end	

		, [principal]
			= case
			 
					when ( tblSSM.execute_as_principal_id is null ) 
						then tblSS.[name]
			  
					when  ( tblSSM.execute_as_principal_id = -2) 
						then '--Shared--'

					else user_name(tblSSM.execute_as_principal_id)

						  
			  end			 
		      
		, [definition]
			= tblSSM.[definition]

from  sys.objects tblSO

inner join sys.schemas tblSS

		on tblSO.[schema_id] = tblSS.[schema_id]

inner join sys.sql_modules tblSSM

		on tblSO.[object_id] = tblSSM.[object_id]

where tblSSM.[uses_native_compilation] = 1

Output

identifynativecompiledobjects

Loaded Modules – sys.dm_os_loaded_modules

Overview

In the sample code below, we identify In Memory Objects by reviewing loaded OS Modules.

The relevant DMVs are sys.dm_os_loaded_modules and sys.dm_os_virtual_address_dump.

 

Code



; with cteOSLM
(
	  [modulename]
	, [description]
    , [base_address]
	, [filenameFull]
	, [filename]
	, [filenameMassaged]
)
as
(

	SELECT
			  [name]

			, [description]

			, [base_address]

			, [filenameFull] = tblOSLM.[name]

			, [filename]
				= reverse(left(reverse(tblOSLM.[name]),
                    charindex('\',reverse(tblOSLM.[name]), 1) - 1))

			, [filenameMassaged]
				= replace(
							replace(
									 reverse(left(reverse(tblOSLM.[name]),
										charindex('\',reverse(tblOSLM.[name]), 1) - 1))
										, '_'
									, '.'
									)
							, '.dll'
							, ''
						)				

	
	FROM   sys.dm_os_loaded_modules tblOSLM 

	WHERE  tblOSLM.[description] = 'XTP Native DLL'

	
)

, cteOSLMObject
(
	  [modulename]
	, [description]
    , [base_address]
	, [filenameFull]
	, [filename]
	, [filenameMassaged]

	, [objectID] 
	, [databaseID] 
	, [objectType]

)
as
(

	SELECT 

		  	  [modulename]
			, [description]
			, [base_address]
			, [filenameFull]
			, [filename]
			, [filenameMassaged]


			, [objectID] 
				= PARSENAME([filenameMassaged], 1)

			, [databaseID] 
				= PARSENAME([filenameMassaged], 2)
			
			, [objectType]
				= case PARSENAME([filenameMassaged], 3)
						when 't' then 'Table'
						when 'p' then 'Procedure'
						when 'f' then 'Function'
						else PARSENAME([filenameMassaged], 3)
				  end 

	from   cteOSLM

)
, cteVirtualAddress
(
	  [region_allocation_base_address]
	, [regionSizeInBytes]
)
as
(
	select 
			  tblOSVAD.[region_allocation_base_address]
			, [regionSizeInBytes]
				= sum(tblOSVAD.region_size_in_bytes)
	
	from   sys.dm_os_virtual_address_dump tblOSVAD
	
	group by
	 
			tblOSVAD.[region_allocation_base_address]

)	 
SELECT 

		  tblOSLM.[description]

		, tblOSLM.[modulename]

		, tblOSLM.[filename]


		, [database]
			= case
				when tblOSLM.[databaseID] = 32767 then 'Resource DB'
				else db_name(tblOSLM.[databaseID])
			   end

		, [objectName]
			= quoteName
				(
				    object_schema_name
					(
					  tblOSLM.objectID
					, tblOSLM.databaseID
					)
				)
				+ '.'
				+ quoteName
				(
					object_name
					(
					   tblOSLM.objectID
					 , tblOSLM.databaseID
					)
				)

		, tblOSLM.[objectType]

		, [sizeInKB]
			= (tblOSVAD.[regionSizeInBytes])
				/ ( 1024 )

FROM   cteOSLMObject tblOSLM 

INNER JOIN cteVirtualAddress tblOSVAD

		on tblOSLM.[base_address] = tblOSVAD.[region_allocation_base_address]

order by

			  [database]
			, [objectName]


Output

osloadedmodules

 

Explanation

  1. We can see that the SQL Server Engine generates database-object-specific modules (DLLs)
  2. The modules are aptly named once one knows where to look
  3. And, they are very small in size

 

Summary

In summary, to identify In Memory Objects we have a couple of tracks we can take.

Those pathways include:

  1. Checking Table – sys.tables
    • Column :- is_memory_optimized
  2. Checking objects that are sitting on a Memory Optimized Filegroup by relying on sys.filegroups

We reviewed usage by digging into sys.dm_db_xtp_object_stats and sys.dm_db_xtp_index_stats.

We also touched on memory consumed by In Memory Objects and Indexes by exploring the sys.dm_db_xtp_memory_consumers view.

To identify natively compiled objects, we narrow in on the uses_native_compilation column of the sys.sql_modules view.

To dig a bit deeper and identify OS modules that are produced once In-Memory objects are compiled, we look into sys.dm_os_loaded_modules, and to size them up we check the region_size_in_bytes column in the sys.dm_os_virtual_address_dump view.

 

Source Code Repository

GitHub

Blogs are not the place to keep code, and so for easier and more polished consumption it has been shipped out to GitHub.

Here is the Repository.

 

Dedicated

Again, the good thing about blogging is that one publicly acknowledges the source of Information.

Here is mine:

  1. SqlHints.com, Basavaraj Biradar
    • Working with In-Memory OLTP (a.k.a. Hekaton) enabled Databases, Memory Optimized Tables and Natively Compiled Stored Procedures and it’s Internals with extensive list of Examples – Sql Server 2014
      Link
  2.  Help: SQL Server, Balmukund
    • A-Z of In-Memory OLTP : Behind the scenes
      Link

 

References

Microsoft

    1. Dynamic Management Views and Objects
      • Transact-SQL Reference (Database Engine) > System Views (Transact-SQL) > Dynamic Management Views and Functions (Transact-SQL)
        Memory-Optimized Table Dynamic Management Views (Transact-SQL)
        Link

        • sys.dm_db_xtp_memory_consumers (Transact-SQL)
          Link
        • sys.dm_db_xtp_index_stats ( Transact SQL )
          Link
        • sys.dm_db_xtp_object_stats ( Transact-SQL )
          Link
        • sys.memory_optimized_tables_internal_attributes ( Transact-SQL )
          Link
    2. Developer Network
      • Database Features > In-Memory OLTP (In-Memory Optimization)  > Memory-Optimized Tables
        Native Compilation of Tables and Stored Procedures
        Link

MySQL :- Secondary Indexes and the Clustering Keys – Day 2

Preface

In our last post, we spoke about how one might not need to add the Clustering columns when defining a Secondary Index.

At that point, we touched on the fact that just reviewing the Query Plan via “Visual Explain” might not fully reveal whether the Index used is “covering” or whether the Clustered Index is also accessed.

 

Metadata

Indexes

Index – Key name & Columns

List the index names and the corresponding columns

Code


SHOW INDEX FROM dblab.errorLog;

Output

ShowIndex-20160620-1114AM

Explanation

  1. Clustered Index
    • PRIMARY
      • Column Names :- id
      • Cardinality :- 111209
  2. Non Clustered Indexes
    • INDX_DBA_RECORDCREATED
      • Column Names :- record_created
      • Cardinality :- 184

 

Index – Get Index Size

Get Index Size

Code


set @database := 'dblab';
set @table := 'errorlog';
set @convertToMB := 1024 * 1000;

select 
           database_name
         , table_name
         , index_name
         , stat_name
         , @@innodb_page_size as innodb_page_sizeInBytes
         , stat_value*@@innodb_page_size as IndexSizeInBytes
         , (stat_value*@@innodb_page_size) / (@convertToMB) as IndexSizeMB

from mysql.innodb_index_stats 

where stat_name = 'size'

and   database_name = @database

and   table_name = @table

;

Output

IndexStats-20160620-0111PM

Explanation

  1. Clustered Index
    • PRIMARY
      • Size :- 28.2240 MB
  2. Non Clustered Indexes
    • INDX_DBA_RECORDCREATED
      • Size :- 2.5760 MB
  3. Note
    • Note that our table is INNODB, and as such
      • The default page size is 16384 bytes
      • Clustered ( PRIMARY )
        • The data is saved within the Clustered Index

 

Instrumentation

 

Visual Explain

Index Covered

IndexCovered

Index Not Covered

IndexNotCovered

 

Explanation

In the screenshots above, the Visual Explain:

  1. Shows the exact same single operator for a covering and a non-covering Index Scan
  2. The cost is different
    • Covered Index :- 1.41
    • Non Covering Index :- 2.41

 

 

Tabular Explain

Covering Index Exists

Guide

In the query below, we are filtering on record_created and fetching the same column (record_created) and the Clustering Column (id).

SQL


set @currentTime := Now();
set @currentDate := curdate();
	 
explain select  
            tblEL.record_created
          , tblEL.id
from    dblab.ErrorLog tblEL
where   tblEL.record_created 
         between @currentDate and @currentTime
;

 

Output

Explain-Tabular-20160620-1111AM

 

 

Non-Covering Index Exists

Guide

In the query below, we are filtering on record_created and fetching the same column (record_created), the Clustering Column (id), and an additional column (userid) that is neither part of the indexed columns nor part of the clustering keys.

SQL


set @currentTime := Now();
set @currentDate := curdate();
	 
explain select  
            tblEL.record_created
          , tblEL.id
          , tblEL.userid
from    dblab.ErrorLog tblEL
where   tblEL.record_created 
         between @currentDate and @currentTime
;

Output

Explain-Tabular-20160620-1107AM

 

Explanation

  1. The column of the index that we are using, INDX_DBA_RECORDCREATED, is record_created
  2. Here is what Explain returns as tabulated output
  3. Columns
    • Extra
      • In the first query, we are covering, and the Extra column indicates so by stating ‘Using where; Using index’
      • In the second query, we are not covering, as the Extra column reads ‘Using index condition’

Summary

With the Explain command, both the graphical and the tabulated output have merits.

The graphical output has costing information, while the tabulated output furthers our understanding of whether the index employed fully covers our need or whether additional work needs to be done after processing the index.
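A third option worth noting is the JSON form of Explain (a sketch; field names may vary by MySQL version), which spells coverage out explicitly, via a using_index flag, rather than leaving it to the Extra column:


set @currentTime := Now();
set @currentDate := curdate();

explain format = json
select  tblEL.record_created
      , tblEL.id
from    dblab.ErrorLog tblEL
where   tblEL.record_created
         between @currentDate and @currentTime
;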

MySQL :- Secondary Indexes and the Clustering Keys

Preface

RDBMS database tables can either be stored as a Heap or Clustered. When stored as a Heap, data is appended as it comes in.  When Clustered, data is sequenced based on the Clustering Columns.

Secondary Indexes

For Clustered tables, the clustering key is written as additional data on each Secondary Index record.  On the other hand, for Heaps, the RID (row ID) is recorded.

We will soon see that this is an important decision when selecting indexing columns for both Clustered and Secondary indexes.
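As a concrete illustration, here is a sketch of how such a table might be declared in MySQL (the column list is partly hypothetical, mirroring the errorLog table used below); the secondary index names only record_created, yet InnoDB silently appends the clustering key (id) to each of its entries:


create table dblab.errorLog
(
      id             int      not null auto_increment
    , record_created datetime not null
    , userid         int      null

    , primary key (id)                               -- the clustered index
    , index INDX_DBA_RECORDCREATED (record_created)  -- secondary index; id is carried implicitly
) engine = InnoDB;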

 

Database

Let us consider the implication of storing the Clustering columns for Secondary Indexes.

MySQL

We will use the same table we used for our last post.

Here is what the table looks like.

Table Columns

TableColumns

Table Indexes

And, here are the Indexes.

TableIndexes

Explanation

We have two indexes:

  1. PRIMARY
    • Columns
      • id
  2. INDX_DBA_RecordCreated
    • Columns
      • record_created

 

Query

Let us issue a query against the table and run Explain to determine whether an index is employed and whether any additional operators are included.

SQL

SQL – Fetch on Secondary Index Columns and Clustered Index Columns

SQL Code


set @currentTime := Now();
set @currentDate := curdate();
	 
select  tblEL.record_created, id
from    dblab.ErrorLog tblEL
where   tblEL.record_created 
         between @currentDate and @currentTime
;

 

Output

Visual Explain

VisualQueryPlan-20160620-0351AM

SQL – Fetch on Secondary Index Columns, Clustered Index Column, and an additional column

SQL Code


set @currentTime := Now();
set @currentDate := curdate();
	 
select  
            tblEL.record_created
          , tblEL.id
          , tblEL.userid
from    dblab.ErrorLog tblEL
where   tblEL.record_created 
         between @currentDate and @currentTime
;

 

Output

Visual Explain

VisualQueryPlan

Index Range Scan

VisualQueryPlan_Clipped

Explanation

  1. Secondary Index and Clustering Columns
    • Fully satisfied with index on Secondary Index Columns
      • Operation
        • Index Range Scan
          • Index Name – INDX_DBA_RecordCreated
      • Secondary Index columns consulted for where clause
      • Secondary Index / Clustering Key consulted for projected clustering columns
      • Query Cost :- 1.41
  2. Secondary Index, Clustering Columns, and additional column(s)
    • Though not all columns that need to be presented are available from the index, the index is still very useful for filtering, and it is used
    • Query Cost :- 2.41

 

Summary

Secondary Indexes do not need to include the Clustering Columns, as those columns are automatically hard-wired in.

When all filtering and projected columns are referenced in the Secondary Indexes, they can offer fast and sole access to the needed result.

On the other hand, when a query references other columns besides the columns that make up the Secondary Index and the Clustering Columns, the full access path does not appear to be revealed through the visual Explain output.

But if we pay close attention to Query Costs, we are able to further our comparison.

Technical: Microsoft – SQL Server – Analysis Services – MDX Query – Error – hierarchy already appears in the Axis1 axis


Introduction

I am getting up to speed with SQL Server Analysis Services cube browsing and reporting, and finding that I need to get comfortable with MDX.

MDX is a query language for querying OLAP cubes.

Data Source View

DataSourceView


Fact Table Browser

What we are trying to do, that is, filter on specific Country IDs and Names, is easy to do when using a query or reporting tool.

Before Filtering

Browser-Before-Filtering

Post Filtering

Browser-After-Filtering

Query

But when we resort to MDX and try to do the same, we find that we have to work a bit harder. In the next two examples, we attempt to filter by using the where clause.

Error Message :- The MDX function CURRENTMEMBER failed because current coordinate is empty.

In this scenario, we pass in a non-existent Country ID (47).


select

	{
		[Measures].[GDP Amount]

	} on COLUMNS

	, {

		[Country].[Country ID]

	  } on ROWS

from  [DBLAB]

where (

		{

			 [Country].[Country ID].&[47]

		}

	  )

Here is our error message:

Executing the query ...
The MDX function CURRENTMEMBER failed because current coordinate is empty.
Execution complete

Error Message – The Country ID hierarchy already appears in the Axis1 axis.

In this scenario, we pass in an existing Country ID (4)


select

	{
		[Measures].[GDP Amount]

	} on COLUMNS

	, {

		[Country].[Country ID]

	  } on ROWS

from  [DBLAB]

where (

		{

			 [Country].[Country ID].&[4]

		}

     )

 

Error Message:

The Country ID hierarchy already appears in the Axis1 axis.

Resolution

There are a couple of ways to address our little problem.

Our proposed solution involves introducing our filtering on the ROWS axis instead.

To do so, we can employ the EXISTS or the FILTER functions.

Fix Problem by using exists keyword



select

	{
		[Measures].[GDP Amount]

	} on COLUMNS

	, EXISTS
		(
			  [Country].[Country ID].Members
			, {
			         [Country].[Country ID].&[3]			
			       , [Country].[Country ID].&[4]
                           }

	       ) on ROWS

from  [DBLAB]

Fix Problem by using “filter” keyword


select

	{
	    [Measures].[GDP Amount]

	} on COLUMNS

	, FILTER
		(
	 	    [Country].[Country ID].Members
			, (
				    ([Country].[Country ID].Member_value = 2)
				 or ([Country].[Country ID].Member_value = 3)
			  )	 

	    ) on ROWS

from  [DBLAB]

References

References – MDX Query

References – Restricting the Query with Query and Slicer Axes

References – Blogs

 

References – MDX Query – Q/A

Technical: Database – Dimension Modeling – Dimension – Time (Populating DimTime – SQL Server Implementation)


Introduction

Hopefully, this is the first of many posts that track my foray into learning “Online analytical processing” (OLAP).

In this post, we will cover the Time Dimension.

Inspiration

As I read broadly about Data Marts, Data Warehouses, and Dimension Modeling, the person who speaks most to me is Ralph Kimball.

The company he founded shares generously on the Internet and leaves one with little reason not to read a bit.

Here is a beautifully written piece about the Time Dimension.

Time for Time

http://www.kimballgroup.com/1997/07/10/its-time-for-time/

Schema – Design

We are going to have one Dimension table called dimTime, and two lookup tables that we will use to populate our lone dimension table.

Schema – Design – Calendar – Weekday

dbo.lookupCalendarWeekday

Schema – Design – Calendar – Month

dbo.lookupCalendarMonth

Schema – Design – Calendar – Time

From Microsoft AdventureWorksDW database.

dbo.DimTime

Schema – SQL

Schema – SQL – Calendar – Weekday



set noexec off;
set nocount on;

use [DBLabDW]
go

if object_id('dbo.lookupCalendarWeekday') is not null
begin

	set noexec on;

end
go

create table dbo.lookupCalendarWeekday
(

      [id] tinyint not null

    , [dateAdded] datetime not null
		constraint [defaultlookupCalendarWeekdayDateAdded] default getdate()

    , [addedBy]   nvarchar(100) not null
		constraint [defaultlookupCalendarWeekdayAddedBy] default SYSTEM_USER

    , [dateModified] datetime null
    , [modifiedBy]   nvarchar(100) null

    , [weekday] tinyint not null
    , [sortOrder] tinyint not null 
		constraint [defaultlookupCalendarWeekdaySortOrder] default 0

    , [weekDayInEnglish] nvarchar(100) null
    , [weekDayInFrench] nvarchar(100) null
    , [weekDayInSpanish] nvarchar(100) null

)

ALTER TABLE dbo.lookupCalendarWeekday
    ADD CONSTRAINT [AK_lookupCalendarWeekday_weekday] UNIQUE NONCLUSTERED 
    (
	[weekday] ASC
    )

set noexec off;
go

Quick Points

Here are a couple of quick points:

  • We created a unique  constraint AK_lookupCalendarWeekday_weekday on the dbo.lookupCalendarWeekday table
  • Having this constraint will equip us with the uniqueness requirement that is needed for us to create a foreign key against this table (when we create the Time Dimension table)

Schema – SQL – Calendar – Month



set noexec off;
set nocount on;

use [DBLabDW]
go

if object_id('dbo.lookupCalendarMonth') is not null
begin

	set noexec on;

end
go

create table dbo.lookupCalendarMonth
(

      [id] tinyint not null

    , [dateAdded] datetime not null
		constraint [defaultlookupCalendarMonthDateAdded] default getdate()

    , [addedBy]   nvarchar(100) not null
		constraint [defaultlookupCalendarMonthAddedBy] default SYSTEM_USER

    , [dateModified] datetime null
    , [modifiedBy]   nvarchar(100) null

    , [sortOrder] tinyint not null 
		constraint [defaultlookupCalendarMonthSortOrder] default 0

    , [monthInEnglish] nvarchar(100) null
    , [monthInFrench] nvarchar(100) null
    , [monthInSpanish] nvarchar(100) null

)

ALTER TABLE [dbo].[lookupCalendarMonth]
	add constraint PK_LookupCalendarMonth
	    primary key
	(
		[id]
	)

set noexec off;
go

Schema – SQL – Calendar – dimTime

For the sake of consistency, knowing we are unlikely to do better, and since we do not currently have any need for customization, we will stick with the dimTime structural definition from Microsoft AdventureWorksDW database.



SET NOEXEC OFF
go

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

use [DBLabDW]
go

if object_id('dbo.DimTime') is not null
begin

	set noexec on

end
go

CREATE TABLE [dbo].[DimTime]
(
	[TimeKey] [int] IDENTITY(1,1) NOT NULL,
	[FullDateAlternateKey] [datetime] NULL,
	[DayNumberOfWeek] [tinyint] NULL,
	[EnglishDayNameOfWeek] [nvarchar](10) NULL,
	[SpanishDayNameOfWeek] [nvarchar](10) NULL,
	[FrenchDayNameOfWeek] [nvarchar](10) NULL,
	[DayNumberOfMonth] [tinyint] NULL,
	[DayNumberOfYear] [smallint] NULL,
	[WeekNumberOfYear] [tinyint] NULL,
	[EnglishMonthName] [nvarchar](10) NULL,
	[SpanishMonthName] [nvarchar](10) NULL,
	[FrenchMonthName] [nvarchar](10) NULL,
	[MonthNumberOfYear] [tinyint] NULL,
	[CalendarQuarter] [tinyint] NULL,
	[CalendarYear] [char](4) NULL,
	[CalendarSemester] [tinyint] NULL,
	[FiscalQuarter] [tinyint] NULL,
	[FiscalYear] [char](4) NULL,
	[FiscalSemester] [tinyint] NULL,
    CONSTRAINT [PK_DimTime_TimeKey] PRIMARY KEY CLUSTERED 
    ( 
	[TimeKey] ASC
    )

WITH (
          PAD_INDEX = OFF
        , STATISTICS_NORECOMPUTE = OFF
        , IGNORE_DUP_KEY = OFF
        , ALLOW_ROW_LOCKS = ON
        , ALLOW_PAGE_LOCKS = ON
     ) 
     ON [PRIMARY],
 CONSTRAINT [AK_DimTime_FullDateAlternateKey] UNIQUE NONCLUSTERED 
(
	[FullDateAlternateKey] ASC
)
WITH (
           PAD_INDEX = OFF
         , STATISTICS_NORECOMPUTE = OFF
         , IGNORE_DUP_KEY = OFF
         , ALLOW_ROW_LOCKS = ON
         , ALLOW_PAGE_LOCKS = ON
     ) 
     ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[DimTime]
	add constraint FK_DimTime_DayNumberOfWeek
	    foreign key
	(
		[DayNumberOfWeek]
	)
	references dbo.lookupCalendarWeekday
	(
		[weekday]
	)
GO

ALTER TABLE [dbo].[DimTime]
	add constraint FK_DimTime_MonthNumberOfYear
	    foreign key
	(
		[MonthNumberOfYear]
	)
	references dbo.lookupCalendarMonth
	(
		[id]
	)
GO

SET NOEXEC OFF
go

SET ANSI_PADDING ON
GO

Quick Points

Here are a couple of quick points:

  • We are creating a foreign key FK_DimTime_DayNumberOfWeek on the dbo.DimTime table
  • The foreign key joins the resident DayNumberOfWeek column to the dbo.lookupCalendarWeekday table using its unique column (weekday)

 

Populate Date Table

SQL – Populate – Calendar – Weekday

There are a couple of important points noted in the comments below:



use [DBLabDW]
go

/*
	dbo.lookupCalendarWeekday Population
*/

/*
	French Calendar Vocabulary
	http://french.about.com/od/vocabulary/a/calendar.htm

*/

/*
	Spanish - My Calendar
	http://www.bbc.co.uk/schools/primarylanguages/spanish/my_calendar/

*/

BEGIN TRAN;

	/*
		http://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
	*/
	-- Define the CTE expression name and column list.
	WITH cteWeekday
	(
		  [id]
                , [weekday]
		, [weekDayInEnglish]
		, [weekDayInFrench]
		, [weekDayInSpanish]
		, [sortOrder]
	)
	AS
	-- Define the CTE query.
	(
		SELECT 1, 1, 'Sunday', 'dimanche', 'domingo', 1
		UNION
		SELECT 2, 2, 'Monday', 'lundi', 'lunes', 2
		UNION
		SELECT 3, 3, 'Tuesday', 'mardi', 'martes', 3
		UNION
		SELECT 4, 4, 'Wednesday', 'mercredi', 'miércoles', 4
		UNION
		SELECT 5, 5, 'Thursday', 'jeudi', 'jueves', 5
		UNION
		SELECT 6, 6, 'Friday', 'vendredi', 'viernes', 6
		UNION
		SELECT 7, 7, 'Saturday', 'samedi', 'sábado', 7
	)

	/*
		Merge Statement Definition
		http://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
	*/
	merge dbo.lookupCalendarWeekday as tblTarget
	using cteWeekday as tblSource

	ON (tblTarget.[id] = tblSource.[id]) 

		WHEN NOT MATCHED BY TARGET
			then Insert 
			(
				  [id]
                                , [weekday]
				, [weekDayInEnglish]
				, [weekDayInFrench]
				, [weekDayInSpanish]
			)
			values
			(
				  tblSource.[id]
                                , tblSource.weekday
				, tblSource.[weekDayInEnglish]
				, tblSource.[weekDayInFrench]
				, tblSource.[weekDayInSpanish]
			)

		WHEN MATCHED and 
			(

			    (
				isNull(tblTarget.[weekday], -1) 
				!= isNull(tblSource.[weekday], -1)
			    )

                         or
			    (
				isNull(tblTarget.[sortOrder], -1) 
				!= isNull(tblSource.[sortOrder], -1)
			    )

			  or
			   (
				isNull(tblTarget.[weekDayInEnglish], '') 
				!= isNull(tblSource.[weekDayInEnglish], '')
			  )

			  or
			  (
				isNull(tblTarget.[weekDayInFrench], '') 
				!= isNull(tblSource.[weekDayInFrench], '')
			  )

			  or
			 (
				isNull(tblTarget.[weekDayInSpanish], '') 
					!= isNull(tblSource.[weekDayInSpanish], '')
			 )

		)

		then Update set
                           [weekday] = tblSource.[weekday]
			 , [sortOrder] = tblSource.[sortOrder]
			 , [weekDayInEnglish] = tblSource.[weekDayInEnglish]
			 , [weekDayInFrench] = tblSource.[weekDayInFrench]
			 , [weekDayInSpanish] = tblSource.[weekDayInSpanish]
			 , [dateModified] = getdate()
			 , [modifiedBy] = SYSTEM_USER
	;

COMMIT TRAN;

SQL – Populate – Calendar – Month



use [DBLabDW]
go

/*
	dbo.lookupCalendarMonth Population
*/

/*
	French Calendar Vocabulary
	http://french.about.com/od/vocabulary/a/calendar.htm

*/

/*
	Spanish - My Calendar
	http://www.bbc.co.uk/schools/primarylanguages/spanish/my_calendar/

*/

/*
	delete from dbo.lookupCalendarMonth;
*/
BEGIN TRAN;

	/*
		http://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
	*/
	-- Define the CTE expression name and column list.
	WITH ctemonth
	(
		  [id]
		, [monthInEnglish]
		, [monthInFrench]
		, [monthInSpanish]
		, [sortOrder]
	)
	AS
	-- Define the CTE query.
	(
		SELECT 1, 'January', 'janvier', 'enero', 1
		UNION
		SELECT 2, 'February', 'février', 'febrero', 2
		UNION
		SELECT 3, 'March', 'mars', 'marzo', 3
		UNION
		SELECT 4, 'April', 'avril', 'abril', 4
		UNION
		SELECT 5, 'May', 'mai', 'mayo', 5    
		UNION
		SELECT 6, 'June', 'juin', 'junio', 6
		UNION
		SELECT 7, 'July', 'juillet', 'julio', 7
		UNION
		SELECT 8, 'August', 'août', 'agosto', 8
		UNION
		SELECT 9, 'September', 'septembre', 'septiembre', 9
		UNION
		SELECT 10, 'October', 'octobre', 'octubre', 10
		UNION
		SELECT 11, 'November', 'novembre', 'noviembre', 11
		UNION
		SELECT 12, 'December', 'décembre', 'diciembre', 12
	)

	/*
		Merge Statement Definition
		http://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
	*/
	merge dbo.lookupCalendarMonth as tblTarget
	using ctemonth as tblSource

	ON (tblTarget.[id] = tblSource.[id]) 

		WHEN NOT MATCHED BY TARGET
			then Insert 
			(
				  [id]
				, [monthInEnglish]
				, [monthInFrench]
				, [monthInSpanish]
			)
			values
			(
				  tblSource.[id]
				, tblSource.[monthInEnglish]
				, tblSource.[monthInFrench]
				, tblSource.[monthInSpanish]
			)

		WHEN MATCHED and 
			(

				(
					isNull(tblTarget.[sortOrder], -1) 
				      != isNull(tblSource.[sortOrder], -1)
				)

				or
				(
				      isNull(tblTarget.[monthInEnglish], '') 
				     != isNull(tblSource.[monthInEnglish], '')
				)

				or
				(
					isNull(tblTarget.[monthInFrench], '') 
					!= isNull(tblSource.[monthInFrench], '')
				)

				or
				(
				     isNull(tblTarget.[monthInSpanish], '') 
				      != isNull(tblSource.[monthInSpanish], '')
				)

			)

			then Update set
				   [sortOrder] = tblSource.[sortOrder]
				 , [monthInEnglish] = tblSource.[monthInEnglish]
				 , [monthInFrench] = tblSource.[monthInFrench]
				 , [monthInSpanish] = tblSource.[monthInSpanish]
				 , [dateModified] = getdate()
				 , [modifiedBy] = SYSTEM_USER
	;

	select * 
        from dbo.lookupCalendarMonth

COMMIT TRAN;

SQL – Populate – DimTime

Pasted below is the sample code for populating the DimTime table.

As always, I left out the hardest part, which is how to calculate Fiscal Dates when they do not match the Calendar Dates.

Hopefully, I will steal some code for that later on and use it.
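In the meantime, a common approach is to shift the date by the fiscal offset and reuse the calendar-style expressions; the sketch below assumes, purely for illustration, a fiscal year that begins on July 1 and is labeled by its ending year:


-- Sketch only: assumes a fiscal year beginning July 1, labeled by its ending year
declare @dateSample datetime = '2014-10-15';
declare @dateFiscal datetime = dateadd(month, 6, @dateSample);

select
      datepart(quarter, @dateFiscal) as [FiscalQuarter]   -- October falls in fiscal Q2
    , datepart(year,    @dateFiscal) as [FiscalYear]      -- October 2014 falls in fiscal year 2015
    , case
        when datepart(month, @dateFiscal) between 1 and 6 then 1
        else 2
      end                            as [FiscalSemester]  -- October falls in the first fiscal semester
;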

Here is the code:



set nocount on
go

/*
	How to join 2 tables without an ON clause
	http://stackoverflow.com/questions/15508142/how-to-join-2-tables-without-an-on-clause

*/
use [DBLabDW]
go

declare @dateBegin		datetime
declare @dateEnd		datetime
declare @dateCurrent	datetime

set @dateBegin = '1/1/2014'
set @dateEnd = '12/31/2014'

set @dateCurrent = @dateBegin

begin tran

   while (@dateCurrent <= @dateEnd)
   begin

	insert into dbo.DimTime
	(
             [FullDateAlternateKey]
	   , [DayNumberOfWeek]

	   , [EnglishDayNameOfWeek]
	   , [SpanishDayNameOfWeek]
	   , [FrenchDayNameOfWeek]
	   , [DayNumberOfMonth]
	   , [DayNumberOfYear]
	   , [WeekNumberOfYear]

	   , [EnglishMonthName]
	   , [SpanishMonthName]
	   , [FrenchMonthName]

	   , [MonthNumberOfYear]

	    , [CalendarQuarter]
	    , [CalendarYear]
	    , [CalendarSemester]

	    , [FiscalQuarter]
	    , [FiscalYear]
	    , [FiscalSemester]

	)
	select
		  @dateCurrent as [FullDateAlternateKey]
		, datepart(weekday, @dateCurrent) --[DayNumberOfWeek]
		, tblCalendarWeekday.[weekDayInEnglish]
		, tblCalendarWeekday.[weekDayInSpanish]
		, tblCalendarWeekday.[weekDayInFrench]
		, datepart(day, @dateCurrent) --[DayNumberOfMonth]
		, datepart(dayofYear, @dateCurrent) --[DayNumberOfYear]
		, datepart(week, @dateCurrent) --[WeekNumberOfYear]
		, tblCalendarMonth.[monthInEnglish]
		, tblCalendarMonth.[monthInSpanish]
		, tblCalendarMonth.[monthInFrench]
		, datepart(month, @dateCurrent) -- [MonthNumberOfYear]

		, case 

		     when datepart(month, @dateCurrent) between 1 and 3 then 1
		     when datepart(month, @dateCurrent) between 4 and 6 then 2
		     when datepart(month, @dateCurrent) between 7 and 9 then 3
		     when datepart(month, @dateCurrent) between 10 and 12 then 4
		     else -1

		  end as [CalendarQuarter]

		, datepart(year, @dateCurrent)
		     as [CalendarYear]

		, case 
			when datepart(month, @dateCurrent) between 1 and 6 then 1
			when datepart(month, @dateCurrent) between 7 and 12 then 2
			else -1

		  end as [CalendarSemester]

		, case 

			when datepart(month, @dateCurrent) between 1 and 3 then 1
			when datepart(month, @dateCurrent) between 4 and 6 then 2
			when datepart(month, @dateCurrent) between 7 and 9 then 3
		        when datepart(month, @dateCurrent) between 10 and 12 then 4
			else -1

		  end	as [FiscalQuarter]

		, datepart(year, @dateCurrent)
			 as [FiscalYear]

		, case 

			when datepart(month, @dateCurrent) between 1 and 6 then 1
			when datepart(month, @dateCurrent) between 7 and 12 then 2
		        else -1
		  end as [FiscalSemester]

		from  [dbo].[lookupCalendarWeekday] tblCalendarWeekday

			cross join [dbo].[lookupCalendarMonth] tblCalendarMonth

		where tblCalendarWeekday.[weekday] = datepart(weekday, @dateCurrent)

		and   tblCalendarMonth.[id] = datepart(month, @dateCurrent)

		set @dateCurrent = dateadd(day, 1, @dateCurrent)

	end

	select * 
	from   dbo.DimTime

commit tran

go

Quick Points

Here are a couple of quick points:

  • As always, there is a very spirited debate out in the open (on the Internet) surrounding the differences between datepart(week) and datepart(iso_week), and the role of locale/language and the inferred DATEFIRST setting; see the sketch below
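Here is a small sketch of that difference (SQL Server; actual values depend on your session settings): week numbering can shift with the DATEFIRST and language settings, while iso_week always follows the ISO 8601 rule.


-- Compare week numbering under two DATEFIRST settings; iso_week is unaffected
set datefirst 7;   -- Sunday as the first day of the week (U.S. English default)
select datepart(week, '2014-12-29') as [week], datepart(iso_week, '2014-12-29') as [isoWeek];

set datefirst 1;   -- Monday as the first day of the week
select datepart(week, '2014-12-29') as [week], datepart(iso_week, '2014-12-29') as [isoWeek];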

Conclusion

The Time Dimension is foundational to any discussion about Data Warehousing modeling.

In SQL Server Analysis Services (SSAS), one can either explicitly create it or have SSAS create it.

Best practice suggests that one should explicitly create and populate it.

If you follow Best Practice (BP), please keep in mind the countries and locales in which your system will be used.

And, whether you will have a system for each locale or whether you will use a centralized system.

Listening

Listening to my favorite story teller:

Kenny Chesney – The Boys of Fall

http://www.youtube.com/watch?v=AlXDo5WhQXI

References

References – Dimension Modeling

References – Calendar – Internationalization

References – Transact SQL

References – Transact SQL / Example

References – Transact SQL / DateFirst and DatePart 

References – Transact SQL / Compare ISO against ISO_WEEK