I got an email asking me what my thoughts are on dimensional modeling so I thought it would be better to make a post about this rather than an email. Thanks Allan for the inspiration.
First, I think it would be very wise to read books from Ralph Kimball and Bill Inmon to understand the approach to data modeling for the Enterprise Data Warehouse. Dimensional modeling can be good if not great but it does have drawbacks. First, let’s talk about what it is.
Dimensional modeling is a data model that is broken up into Dimensions and Facts. Dimensions have hierarchies like Country, Region, State, and City while Facts have measures like quantity, price, cost, and score. This approach makes it simple for a BI tool to take a person’s input in a GUI and translate that to SQL that is reasonably sure it will get the correct result. If you have a good dimensional model then you can switch BI Tools rather easily too.
Building the Model
How you build the model is argued by both Inmon and Kimball fans. One approach is to take various disparate data models from source databases and merge this into a common third normal form (3NF) structure. Then you can create dimensional models “easily” based on the various customers’ needs. The other approach is to skip the 3NF model and build sets of dimensional models with conforming dimensions for the entire enterprise.
I have found that the Kimball approach works better for the projects I’ve worked on. He does a great job in his books to stress the importance of having a strong sponsor so that IT doesn’t build a dimensional model they think the business wants. That is great advice. The problem with this is the “loser” of any dispute on how the model should be built won’t use the model. Terms are frequently overloaded in organizations and mean different things to different people. Getting everyone to agree on virtually every term in an organization is not usually feasible. So you end up with a model that only one group uses and it took too long to build because of the interviews and discussions.
This approach build that 3NF structure that takes the disparate data into a single model. So you acquire a new system or company and then that new data has to fit into this model. How long does that take to do? How much rework is there every time there is something that you hadn’t thought of?
A basic concept in project management is the Golden Triangle. It simple refers to the 3 variables in any project which are time, resources, and deliverables. You can’t control all 3 but you can control 2. You have to give some with one of these variables.
Most likely you have a fixed set of resources like 3 developers. You also have a set deliverable; build a dimensional model. The last variable is time and you can’t control this unless you either add more developers or deliver less of the model. This is key in understanding why a dimensional model has a big flaw. It takes too long to build it.
So what does the business users do while they wait for you to build their data model?
Time to Market
Time is everything in organizations. How fast can I get to my data? How fast can my queries execute? Making business users wait for you to build a model eats up time which equates to money. So why don’t you get out of the way?
I think most business users are smart and I think they know SQL. I’ve never been in an organization that doesn’t have SQL savvy business users. So if they are smart and know how to program enough to get their job done, why don’t you just give them the data?
I think the best model is a simple one. Just replicate the transactional systems into a large database like Greenplum and even Pivotal Hadoop now that it has HAWQ. Business users will query the data and then create their own de-normalized tables. You got out of their way and they got their job done much faster than if you created a dimensional model for them. You might build a dimensional model to make it easier to use with some BI tools after they have access to the data.
Most databases crumble on thought of letting users just execute any “bad” query they want. That is because most databases are designed for Online Transaction Processing (OLTP) and not big data. Look at my Tuning Guide for Greenplum to see how easy it is to maintain a rather robust database platform.
Direction of Industry
More and more organizations are embracing this approach. With Pivotal Hadoop or Greenplum database, you load all of the data into one cluster that scales easily, get great insights out of the data, have excellent time to market, have little to no development costs, and have very little administrator costs.
Businesses can’t wait for a long running queries and they certainly can’t wait for you to build a dimensional model.