Andy on Enterprise Software

Data warehouse architectures

May 26, 2006

Rick Sherman writes an interesting article about how data warehousing, despite being quite venerable in IT terms, is still poorly understood. He makes a good point, discussing various typical implementation approaches and how they fail to get to the “single version of the truth” dream. Let’s consider for a moment a few architectural choices:

(a) direct access (EII)
(b) data marts only - no pesky warehouse
(c) a single warehouse for the enterprise
(d) a federation of linked warehouses.

The first approach serves only a small subset of reporting needs, and is insufficient to meet most enterprise reporting requirements. Having only single-subject data marts was still surprisingly commonly advocated as late as the mid 1990s (born mainly out of frustration with lengthy or failed data warehouse projects), yet it is pretty clearly not going to scale for a company of any size. The sheer number of combinations of data sources required to build the marts means that the problem of resolving inconsistency has to be tackled every time a mart is built, rather than being dealt with once in the warehouse. So each mart either becomes a major project in itself or (more likely) people just give up and go with some data source, without getting a complete or even accurate picture.
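To put rough numbers on that argument, here is a back-of-the-envelope sketch. The function name and the figures (50 source systems, 20 marts, 10 sources feeding each mart) are purely illustrative assumptions, not taken from Rick Sherman's article or from any real project. It simply counts how many source-to-target reconciliations have to be built under each approach.

```python
def reconciliation_counts(num_sources, num_marts, sources_per_mart):
    """Count source-to-target mappings under two architectures.

    Hypothetical model: with marts only, every mart reconciles each of
    its sources itself; with a warehouse, each source is reconciled once
    and each mart then needs a single feed from the warehouse.
    """
    marts_only = num_marts * sources_per_mart
    with_warehouse = num_sources + num_marts
    return marts_only, with_warehouse


# Illustrative figures only.
marts_only, with_warehouse = reconciliation_counts(
    num_sources=50, num_marts=20, sources_per_mart=10
)
print(marts_only, with_warehouse)
```

Under these assumed figures the marts-only route needs 200 separate reconciliations against 70 with a warehouse in the middle, and the gap widens with every mart added.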

The single giant warehouse certainly has a lot of appeal, as it resolves the semantic differences of source systems just once, allowing dependent data marts to be deployed easily. The trouble is one of practicality: for a large corporation the sheer scale of the task is scary. Large enterprises have hundreds (and usually thousands if they are counting properly) of applications where data is being captured, and these applications are often duplicated by country or major business line. Hence getting hold of all these sources and bringing them into line is going to be a massive challenge. In certain industries (retail, telco, retail banking) the volume of the data itself is also daunting, bringing major technical challenges.

Hence for any large corporation it seems to me that a federated warehouse approach is what you will end up with, whether you like it or not. Few companies will have the energy or resources to deliver the single giant warehouse, and even those few that do will, in reality, have a series of skunkworks data marts and warehouses dotted around the corporation, since such a behemoth warehouse will be a bottleneck: hard to change and inevitably slow to respond to rapidly changing business needs.

The most pragmatic approach would seem to me to be to acknowledge this reality and architect for a federated approach, rather than staying in denial. It is practical to build a warehouse for either a country-level subsidiary (or group of countries) or each business line, let that deal with the needs of that particular country or business line, and then link these to a global warehouse which operates at the summary level. The global warehouse does not need to store every transaction in the enterprise; at that level you need to know what the sales were in Germany yesterday by product, channel and perhaps customer, but not that a particular customer bought a specific item at 14:25 at a store in North Rhine-Westphalia. Detailed information like this is the domain of the country-level warehouse. Because the transaction detail is not needed at the enterprise level, you avoid the problems of technical scale that may otherwise occur, and only deal with the data that makes sense to look at across the enterprise as a whole.
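As a minimal sketch of that split, the following shows a country-level warehouse keeping every transaction while the global warehouse receives only a daily roll-up. The `Transaction` fields, the `summarise_for_global` function and the sample rows are all hypothetical, invented for illustration; a real implementation would sit in the database and ETL layer rather than in application code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Transaction:
    """One row of detail, as held in a country-level warehouse (hypothetical shape)."""
    country: str
    product: str
    channel: str
    customer: str
    sold_at: datetime
    amount: float


def summarise_for_global(transactions):
    """Roll detail up to (country, day, product, channel) sales totals.

    The time of day and the individual customer are dropped here; they
    stay behind in the country-level warehouse.
    """
    totals = defaultdict(float)
    for t in transactions:
        key = (t.country, t.sold_at.date(), t.product, t.channel)
        totals[key] += t.amount
    return dict(totals)


# Made-up sample detail for the German warehouse.
german_warehouse = [
    Transaction("DE", "widget", "store", "C001", datetime(2006, 5, 25, 14, 25), 20.0),
    Transaction("DE", "widget", "store", "C002", datetime(2006, 5, 25, 16, 10), 30.0),
    Transaction("DE", "widget", "web", "C001", datetime(2006, 5, 25, 9, 5), 15.0),
]

# Only the summary rows travel to the global warehouse.
global_warehouse = summarise_for_global(german_warehouse)
for key, total in sorted(global_warehouse.items()):
    print(key, total)
```

Three detailed rows become two summary rows here. The grain chosen for the global level (day, product, channel) is the design decision that matters: it determines both what the enterprise can ask and how much data has to move.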
