Dennis Forbes – The Real Advantage of Column-Oriented Databases

Background

About the same time last year I had a conversation with someone about columnar databases. He was trying to gauge my experience with them.

Unfortunately, though a year has passed, I still don't know anything about them.

 

Smart Phone and Commute

Using public transit bears fruit, so I brought out the smart phone and googled "use cases for Columnar Databases".

Found some good hits. Here is the best one so far.

 

Dennis Forbes – The Real Advantage of Column-Oriented Databases

Sorry Mr. Forbes, I have to post this one in its entirety.

 

Link

COLUMN-ORIENTED DATABASES ARE TRENDING…GET THEM WHILE THEY’RE HOT

Column-oriented databases are making a regular appearance on technology sites as a bit of a silver bullet for database performance issues.

It is often presented and perceived as an evolution of database designs, much as was seen with the emergent NoSQL options (on the topic of NoSQL, some claim that MongoDB is column-oriented. It is in actuality moving in exactly the opposite direction).

In this case many seem to believe that column-oriented databases correct the mistake of row-oriented storage.

So why are row-oriented databases still so common? How different are they in common practice? It seemed worth a quick post on the topic.

Note that the below is focused purely on read/analysis workloads, glossing over the increased complexity and cost of changes to data in a column-oriented database. Column-oriented databases are not a fit for OLTP workloads.

THE COMMON, BUT UNREALISTIC EXAMPLE

Column-oriented storage intuitively makes sense for certain problems, the easiest to understand being range aggregates: isn't it easier to average an array of a billion floats than to iterate a billion records, each of which may comprise many columns, and pull out the related value?

To think of this from a code perspective, imagine dealing with row versus column storage in C. (The ways we would naively solve these problems in code are similar to how they have actually been solved in database systems; it isn't as foreign as it may seem, though of course I am simplifying away notions like b-trees, pages, and extents.)


/* One record per person: 4 + 32 + 1 + 2 + 4 + 2 + 3 = 48 bytes, ignoring padding. */
struct person_row {
        int id;
        char name[32];
        char sex[1];
        short year_of_birth;
        float income;
        char province[2];
        char country[3];
};

struct person_row rows[COUNT];

…populate COUNT rows…

Note that this is fairly close to how a database like SQL Server actually does store row data if you used fixed-width types, albeit with a bit of extra bookkeeping data, null flags and padding. If you used variable-width types things get a little more complex, but the concept is generally similar.

Oracle, as a curious aside, stores data in an effectively delimited form, which is how you end up with oddities like '' == null.

Imagine that you have a billion of these 48-byte records (for simplicity, there is no padding), most of it sitting in paged out virtual memory. For the purposes of this example your server has 8GB of available memory, so only a minority of your dataset can be in RAM at one time.

You are tasked with calculating the average income.

You need to iterate over those records — over 48GB of data, ignoring overhead and bookkeeping — swapping in page after page from glacially slow external storage, “extracting” the income and accumulating it in the average.
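To make that concrete, here is a minimal sketch of that scan, assuming the person_row layout above; every 48-byte record must be paged in just to read its 4-byte income field:

double average_income_rows(const struct person_row *rows, long long n)
{
        double sum = 0.0;
        for (long long i = 0; i < n; i++)
                sum += rows[i].income;  /* the other 44 bytes of each record come along for the ride */
        return n > 0 ? sum / n : 0.0;
}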

Contrast the “column-oriented” solution.


/* The same fields, but each stored as one contiguous array spanning every record. */
struct person_columns {
        int id[COUNT];
        char name[COUNT][32];
        char sex[COUNT][1];
        short year_of_birth[COUNT];
        float income[COUNT];
        char province[COUNT][2];
        char country[COUNT][3];
};

struct person_columns columns;

…populate COUNT column sets…

For the same one billion records example, you would have 4GB of id data, followed by 32GB of name data, and so on. The income would be one linear series of 4GB of floats that you could speedily iterate over if it was the only data you concerned yourself with. Not only would it minimize the IO that needs to happen, data locality maximizes CPU cache coherency and enables trivial vectorization.
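The equivalent sketch for the column layout (same hypothetical naming) touches nothing but the contiguous income array:

double average_income_columns(const struct person_columns *cols, long long n)
{
        double sum = 0.0;
        for (long long i = 0; i < n; i++)
                sum += cols->income[i]; /* one sequential, cache-friendly, easily vectorized pass */
        return n > 0 ? sum / n : 0.0;
}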

Were you performing this query multiple times, the entirety of the income data could be paged in and stay resident in memory, yielding extremely speedy repeat runs, orders of magnitude better than the "row" design.

Such an example is oft used to pitch column-oriented databases. It makes for a pretty compelling comparison, and there is absolutely no question that for the purposes of performing calculations on a range of that single column, it is a dramatically superior layout.

This is the sort of example given in most pitches extolling the virtues of column-oriented databases.

HOW DO YOU REALLY USE DATA?

How realistic of an example is it, though? How often do we ever aggregate the entire population of a single column? Aside from management COUNT operations (which themselves can often be served purely through statistics), it is, in my experience, a very rare case.

Instead we filter and group and join. What if, for instance, we wanted average income by country, province and year_of_birth, where year_of_birth is less than 1998? For the purposes of estimating population counts, in our imaginary dataset 80% of the records have a year_of_birth pre-1998.

Using your C chops, consider how you would do that with both the column- and row-oriented data structures to generate results.

You might create accumulator buckets (sum and count) keyed by [country, province, year of birth]. You then iterate over the entire rowset, checking whether a bucket exists for each combination, creating one if it doesn't, and adding to the sum and incrementing the count for matches that satisfy the filter. You are table scanning linearly through the entire 48GB of data, but because it is end to end it is processed as fast as it can be. It makes the best of a worst-case situation.
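A rough sketch of that approach, assuming the person_row layout above; the bucket lookup is a naive linear search and the bucket capacity is an arbitrary illustrative bound, where a real engine would hash and spill:

#include <string.h>

struct bucket {
        char country[3];
        char province[2];
        short year_of_birth;
        double sum;
        long long count;
};

static struct bucket buckets[200000];   /* assumed upper bound on distinct groups */
static int bucket_count = 0;

static struct bucket *find_or_add(const char *country, const char *province, short year)
{
        for (int i = 0; i < bucket_count; i++) {
                if (buckets[i].year_of_birth == year &&
                    memcmp(buckets[i].country, country, 3) == 0 &&
                    memcmp(buckets[i].province, province, 2) == 0)
                        return &buckets[i];
        }
        struct bucket *b = &buckets[bucket_count++];    /* no overflow check; illustration only */
        memcpy(b->country, country, 3);
        memcpy(b->province, province, 2);
        b->year_of_birth = year;
        b->sum = 0.0;
        b->count = 0;
        return b;
}

void average_income_by_group_rows(const struct person_row *rows, long long n)
{
        for (long long i = 0; i < n; i++) {             /* one linear pass: the "table scan" */
                if (rows[i].year_of_birth >= 1998)
                        continue;                       /* filter: year_of_birth < 1998 */
                struct bucket *b = find_or_add(rows[i].country, rows[i].province,
                                               rows[i].year_of_birth);
                b->sum += rows[i].income;
                b->count++;
        }
        /* averages are buckets[i].sum / buckets[i].count */
}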

Now do the same with the column-oriented data. You iterate over year_of_birth and, where the filter matches (the year is less than 1998), pull the income and perform the bucket operations based on the data pulled from country and province. In this C example you are dealing with data that is separated by gigabytes, which means address jumping that is more expensive in memory, in the processor caches, and when pulling from storage. In the end you will have iterated over less total data (11GB or so), but performance will likely be similar to a whole "table scan" (or, in our case, struct array scan). That, as an aside, is why most RDBMS platforms will ignore indexes and just do table scans if the statistics hint that the result set will exceed some threshold (one much lower than most people expect).
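The same aggregation against the column layout, reusing the hypothetical find_or_add() bucket helper from the row-oriented sketch; every field access now lands in a different multi-gigabyte array:

void average_income_by_group_columns(const struct person_columns *cols, long long n)
{
        for (long long i = 0; i < n; i++) {
                if (cols->year_of_birth[i] >= 1998)     /* the filter column drives the scan */
                        continue;
                struct bucket *b = find_or_add(cols->country[i], cols->province[i],
                                               cols->year_of_birth[i]);
                b->sum += cols->income[i];              /* income sits gigabytes away from year_of_birth */
                b->count++;
        }
}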

If it happened to be that you had 12GB of free memory in your server, and you ran the query multiple times, it would hit a sweet spot of high in-memory performance, so long as you did nothing else that needed to evict those pages.

Of course this is all very contrived. What if our rowset were bizarrely huge? What if we wanted to draw more columns into the groups and filters? What if we wanted to group by age? Each dramatically skews how we would look up data.

Change the quantity of data, the hot columns and data, the cardinality of a column, the stored sort order, the amount of memory in the machine, the type of storage, and everything changes.

As is the case with much of this industry, it really depends. There are no clear-cut answers in the general case; the two types of databases can only be meaningfully compared for a specific purpose.

Of course both of these styles of databases can have indexes (I’ve written about rowset indexes before). Notably we could have a covering index on the rowset for the specific query – year_of_birth, province, country, INCLUDE(income). Such an index would allow a query like the above to be satisfied with only the index, much more efficiently than the separated column layout. Vertica, on the other hand, offers the notion of projections, which are effectively indexes going in the opposite direction: It pre-joins and pre-sorts a number of columns, essentially building a rowset.
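In C terms, a covering index like that is roughly a second, narrow, pre-sorted array holding only the columns the query needs (a hypothetical layout for illustration, not SQL Server's actual index format):

/* Key columns first (and sorted on them), then the INCLUDEd payload:
 * 2 + 2 + 3 + 4 = 11 bytes per entry (ignoring padding) versus 48 for the full row. */
struct income_covering_index_entry {
        short year_of_birth;
        char province[2];
        char country[3];
        float income;
};

Scanning only the pre-1998 portion of such an array touches a fraction of the data that the full row scan does, which is why a properly chosen covering index can satisfy this query more efficiently than the separated column layout.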

For a properly indexed database, both Vertica and SQL Server would be working with indexes that are extremely similar, both simply optimizing in different directions (SQL Server prunes down the columns to allow a much smaller set of data to be loaded and processed. Vertica combines the columns).

SO WHERE IS THE BIG COLUMN-ORIENTED ADVANTAGE?

None of that is to say that column-oriented storage doesn't have a place, as it clearly does.

In data warehouses, for instance, data is populated into dimensions and facts, with the fact components stored in a specific order such that the values used in computations can often be found in sequential ranges. e.g. facts 1027 – 2087 are the daily sales of widgets from a specific store. OLAP cubes are generally column-oriented, but prescribed for a very specific, singular use.
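In C terms (purely illustrative, using the fact range quoted above), a measure over such a pre-ordered fact store is just a sum over a small contiguous slice:

double sum_fact_range(const float *facts, long long first, long long last)
{
        double total = 0.0;
        for (long long i = first; i <= last; i++)       /* e.g. first = 1027, last = 2087 */
                total += facts[i];
        return total;
}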

Financial systems, such as a kdb+ HFT system, are generally column-oriented because each set of data is a specific, linear set of values, such as streaming bids for AAPL.

But those benefits don’t carry over to general database systems and standard use.

There is, however, one area where column-oriented databases almost always demonstrate significantly better performance than row-oriented databases: highly compressible data that would otherwise be much larger than RAM but can be squeezed to fit, typically where many columns have a very low degree of cardinality.

In such a case the actual stored data (on disk and in memory) can be dramatically smaller than the data represented, and this is a tactic that works especially well for column-oriented storage.
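As a minimal sketch of why (not any particular product's storage format), run-length encoding a sorted, low-cardinality column such as country collapses billions of stored values into a handful of runs:

#include <string.h>

struct rle_run {
        char country[3];        /* the repeated value */
        long long length;       /* how many consecutive rows share it */
};

/* Encodes n country codes into runs and returns the number of runs produced.
 * A sorted column with ~200 distinct values compresses to ~200 runs,
 * no matter how many billions of rows it spans. */
size_t rle_encode_country(char (*country)[3], long long n, struct rle_run *out)
{
        size_t runs = 0;
        for (long long i = 0; i < n; i++) {
                if (runs == 0 || memcmp(out[runs - 1].country, country[i], 3) != 0) {
                        memcpy(out[runs].country, country[i], 3);
                        out[runs].length = 1;
                        runs++;
                } else {
                        out[runs - 1].length++;
                }
        }
        return runs;
}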

This is a significant reason why column-oriented databases often lap traditional databases in benchmarks where the working set is conceptually much larger than memory but, because of significant redundancy, can be compressed to fit. And it's worth noting that demonstrative benchmarks often purposefully choose exactly such a contrast (product A needing to hit slow I/O endlessly, while the working set of product B stays entirely resident in memory).

This benchmark on flight data compares some column-oriented databases, and it's notable that in the benchmarks that touched significant parts of the dataset, Infobright easily won because it had compressed the data so significantly that it never needed to page. In the benchmarks that could just work with what was in memory, MonetDB won.

But that specific dataset is a data dump that could be reduced dramatically through simple, basic normalization. Instead of just storing the airport code, for instance, it stores the entire hierarchy of jurisdictions leading down to the airport code, and in this benchmark all of that wholly redundant data was imported for each row.

In any case, some products, such as Vertica mentioned above, can take it further and fuse data that is used together and fits the entropy profile to magnify the impact of compression.

The overall total volume of data actually increases (you still have all of the individual columns, but now have the fused projections as well), but the likely working data, what you repeatedly interact with, will be much more efficient to work with. The same concept applies to rowset databases, where you might add GBs of targeted indexes; the hot data used for the majority of transactions ends up much smaller, while the data on disk grows much larger.

Of course you might be looking at that compression and thinking "Isn't that just normalizing? Why would I ever store all of that redundant data?", and indeed to a degree (although not nearly to the same magnitude) the savings are similar. Had your person/driver-license-number rows simply referenced a city row, itself referencing a county row, and so on, you would yield some of the space savings that you get with RLE compression and a column-oriented database.

But row-oriented data is still far less compressible than column-oriented data in the majority of cases (products like SQL Server do offer row compression, which again, to pound on this theme, is primarily beneficial if your hot data exceeds your available memory).

Again, the “it depends” factor. When I discussed Digg’s database architecture, it was in regards to exactly that sort of choice: Digg exploded a small amount of relational data into an enormous amount of denormalized data. For cases where they have limited IOPS and very limited memory, that may be the right choice. For many other cases it could absolutely work against them.

CONCLUSION

Column-oriented databases are very cool, and they have a role to play in data that is much larger than available RAM, or where naive aggregates or stream processing is paramount. They’re often found for specific purposes in the financial industry, and there are some absolutely fantastic products.

It is less conclusively a fit if your write load is at all significant, or if the usage of the data is general if not arbitrary. Column-oriented databases need to be even more purposefully used than row-oriented if you hope to have efficient operations.

Sorry

Once again, sorry Mr. Forbes: rather than observing proper etiquette and just posting a link, I reproduced your article in full.

Like Tears for Fears, I fall all over good solid engineering.

And, in this case, good legal-style cross-examination as well.

 
