Building a Data Warehouse Part II: Building a new schema

by Steve Forte

September 22, 2010 Web 0 Comments

In Part I: When to build your data warehouse we looked at when you should build your data warehouse and concluded that you should build it sooner rather than later to take advantage of reporting and view optimization. Today we will look at your options to build your data warehouse schema.

When architecting a data warehouse, you have two basic options: build a flat “reporting” table for each operation you are performing, or build with BI/cubes in mind and implement a “star” or “snowflake” schema. Let’s take a quick look at the first option and then we will take a look at the star and snowflake schemas.

Whenever the business requests a complex report, developers usually slow down the system with a complex SQL statement or operation. For example, pretend in our order entry system (OLTP) the business wants a report that says this: show me the top ten customers in each market including their overall rank. You would usually have to perform a query like this:

Complex joins for unique customer
Rollup the sales
Ranking functions to determine overall rank
Partition functions to segment the rank by country
Standard aggregates to get the sales
Dump all of this to a work table in order to pull out the top 10 (if you don’t do this, you will lose the overall rank)

A typical SQL statement to do steps 1-5 would look like this:

With CTETerritory
As
(
   Select cr.Name as CountryName, CustomerID, 
                Sum(TotalDue) As TotalAmt
   From Sales.SalesOrderHeader soh 
   inner join Sales.SalesTerritory ter
   on soh.TerritoryID=ter.TerritoryID
   inner join Person.CountryRegion cr 
   on cr.CountryRegionCode=ter.CountryRegionCode
   Group By cr.Name, CustomerID
)
Select *, Rank() Over (Order by TotalAmt DESC) as OverallRank,
Rank() Over
     (Partition By CountryName Order By TotalAmt DESC,
            CustomerID DESC) As NationalRank
From CTETerritory

Argh! No wonder developers hate SQL and want to use ORMs! (I challenge the best ORM to make this query!)

Instead you can create a table, lets call it SalesRankByRegion, with the fields: CountryName, CustomerID, TotalSales, OverallRank, and NationalRank, and use the above SQL as part of a synchronization/load script to fill your reporting table on a regular basis. Then your SQL statement for the above query looks like this:

SELECT * FROM SalesRankByRegion
WHERE CustomerNationalRank Between 1 and 10
ORDER BY CountryName, CustomerNationalRank

The results look like:

That is more like it! A simple select statement is easier for the developer to write, the ORM to map, and the system to execute.

The SalesRankByRegion table is a vast improvement over having to query all of the OLTP tables (by my count there are three tables plus the temp table). While this approach has its appeal, very quickly, your reporting tables will start to proliferate.

Your best option is to follow one of the two industry standards for data warehouse tables, a “star” or a “snowflake’ schema. Using a schema like this gives you a few advantages. They are more generic than the SalesRankByRegion, which was a table built for one query/report, giving you the ability to run many different reports off each table. Another advantage is that you will have the ability to build cubes very easily off of a star or snowflake schema as opposed to a bunch of SalesRankByRegion tables.

The design pattern for building true data warehouse tables are to build a “fact” table, or a table that contains detail level (or aggregated) “facts” about something in the real world, like an order or customer for example. Inside of the fact table you will also have “measures” or a numeric value that represents a “fact.” To support your fact table you will have “dimension” tables. Dimensions are a structure that will categorize your data, usually in the form of a hierarchy. A dimension table for example could be “time” with a hierarch of OrderYear, OrderQuarter, OrderMonth, OrderDate, OrderTime.

There are tons of tutorials on the internet that show you how to build a star or snowflake schema and the difference between them, so I will not repeat them here. (You may want to start here.) I’ll give you the high level on a simple star schema here.

Let’s say we have an order entry system, such as Northwind (in the Microsoft SQL Server sample database.) You can have a fact table that revolves around an order. You can then have three (or more) fact tables that focus on: time, product, and salesperson. The time dimension would roll-up the order date by year, quarter, month, and date. The product dimension would roll-up the product by the product and category. (In most systems you would have a much deeper hierarchy for products.) The salesperson dimension would be roll-up of the employee, the employee manager and the department they work in. The key in each of these tables would then be foreign keys in the fact table, along with the measure (or numerical data describing the fact.)

There is an example similar to this in Programming SQL Server 2008, a book where I am a co-author. Here is modified version of that demo:

Dimension tables:

CREATE TABLE [dwh].[DimTime] (
[TimeKey] [int] IDENTITY (1, 1) NOT NULL Primary Key,
[OrderDate] [datetime] NULL ,
[Year] [int] NULL ,
[Quarter] [int] NULL ,
[Month] [int] NULL 
) 

CREATE TABLE [dwh].[DimProduct] (
[ProductID] [int] not null Primary Key,
[ProductName] nvarchar(40) not null,
[UnitPrice] [money] not null,
[CategoryID] [int] not null,
[CategoryName] nvarchar(15) not null
) 

CREATE TABLE [dwh].[DimEmployee] (
EmployeeID int not null Primary Key,
EmployeeName nvarchar(30) not null,
EmployeeTitle nvarchar(30),
ManagerName nvarchar(30)
)

Fact table:

CREATE TABLE [dwh].FactOrder (
[PostalCode] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[ProductID] [int] NOT NULL ,
[EmployeeId] [int] NOT NULL ,
[ShipperId] [int] NOT NULL ,
[Total Sales] [money] NULL ,
[Discount] [float] NULL ,
[Unit Sales] [int] NULL ,
[TimeKey] [int] NOT NULL 
)

We have the basis of a star schema. Now we have to fill those tables and keep them up to date. That is a topic for Part III.

Data, SQL Server

About the Author

Steve Forte

Stephen Forte sits on the board of several start-ups including Triton Works. Stephen is also the Microsoft Regional Director for the NY Metro region and speaks regularly at industry conferences around the world. He has written several books on application and database development including Programming SQL Server 2008 (MS Press).

Comments

Comments are disabled in preview mode.

All articles

Topics

Latest Stories
in Your Inbox

Subscribe to be the first to get our expert-written articles and tutorials for developers!

All fields are required

Country/Territory

Blog

Product Bundles

DevCraft

Web

Mobile

Document Management

Desktop

Reporting & Mocking

Automated Testing

CMS

UI/UX Tools

Debugging

Free Tools