Andy on Enterprise Software

Size isn’t everything

November 15, 2005

An October 2005 survey by IT Toolbox shows that, even amongst large companies, the size of the corporate data warehouse is, er, not that big. Out of 156 responses (40% US), only 12% had enterprise data warehouses larger than 4TB, while 18% had ones between this and 1 TB, while the rest had data warehouses less than 1TB. Indeed 25% had data warehouses less than half a terabyte. Admittedly only 20% of customer had just one data warehouse, with 26% having over five warehouses, but these figures may seem odd when you hear about gigantic data warehouses in the trade press. Winter Group publish a carefully checked list of the 10 largest data warehouses in the world, and their 2005 survey shows the winner at Yahoo weighing in at 100TB. The tenth largest, however (at Nielsen), is 17TB, which shows that such mammoths are still a rarity.

Why are IT folks obsessed about this? I can recall speaking at a data warehouse conference a few years ago and speaker after speaker eagerly quoted the size of his data warehouse as some sort of badge of courage: “Well, you should see how big mine is…”. Of course companies that sell hardware and disk storage love such things, but why is there such a big discrepancy between the behemoths in the Winter Group survey and the less than a terabyte brigade? The answer is quite simple: business to business companies don’t have large transaction volumes. If you are a large retailer or a high-street bank, then you may have thousands of branches, each one contributing thousands of individual transactions a day. These add up, and constitute the vast majority of the volume in a data warehouse (perhaps 99% of the volume). The rest of the data is the pesky master data (or reference data, or dimension data - choose your jargon) such as “product”, “customer”, “location”, “brand”, “person”, “time” etc that provides the context of these business transactions. You may have millions of transactions a day as a retailer, but how many different products do you stock? 80,000 for a convenience store chain? 300,000 for a department store? Certainly not tens of millions. Similarly McDonalds has 27,000 retail outlets, not millions. The same for organizational units, employees etc. One exception that can be very large is “customer” but again this is true only for business to consumer enterprises e.g. retailers or Telcos. Companies like Unilever are very large indeed, but primarily sell to other businesses, so the number of direct customers they deal with is measured in the many thousands, but not millions.

So B2B enterprises usually have quite small data warehouses in volume, even though they may have extremely sophisticated and complex master data e.g. elaborate customer segmentation or product or asset classification. One way to measure such complexity is by adding up the types of business entity in the data model e.g. each level of a product hierarchy might count as one “class of business entity” (CBE), “customer” as another. Some very large data warehouses in volume terms often have very simple business models to support, perhaps with 50 CBEs. On the other hand a marketing system for a company like BP may have 400 or more CBEs. This dimension of complexity is actually just as important as raw transaction size when looking at likely data warehouse performance. A data warehouse with 1TB of data but 50 CBEs may be a lot less demanding than one with 200GB of data but 350 CBEs (just think of all those database joins). Oddly, this complexity measure never seems to feature in league tables of data warehouse size, perhaps because it doesn’t sell much disk storage. I feel a new league table coming on. Anyone out there got a model with more than 500 CBEs?

del.icio.us:Size isn't everything  digg:Size isn't everything  reddit:Size isn't everything  Y!:Size isn't everything

Data quality blues

A 2005 research note from Gartner says that more than 50% of data warehouse projects through 2007 will be either outright failures or have limited acceptance. This does not surprise me in the least. There are several aspects in which data warehouse projects are under unusual strain, as well as the normal problems that can beset any significant project. Data warehouses take in data from several separate data sources (ERP, supply chain, CRM etc) and consolidate it. Consequently they are dependent upon both the quality of the data and the stability of those source systems: if any of the underlying source systems has a major structural change (e.g. a new general ledger structure or customer segmentation) then it will affect the warehouse. You might think that data quality was a minor problem these days with all those shiny new ERP and CRM systems, but you’d be wrong. In Kalido projects in the field we constantly encounter major data quality issues, including with data captured in the ERP systems. Why is this?

An inherent problem is that systems typically capture information that is directly needed by the person entering the data, and other things as well, which are useful to someone, but not directly to that person. I remember doing some work in Malaysia and seeing a row of staff entering invoice data into a J.D. Edwards system. I was puzzled to see them carefully typing in a few fields of data, and then just crashing their fingers at random into the keyboard. After a while, they would resume normal typing. After seeing this a few times my curiosity got the better of me and I asked one of then what was going on. The person explained that there were about 40 fields that they were expected to enter, and many of the fields were unnecessary they could not move to the next screen without tabbing through each field in turn unless they entered some gibberish in one of the main fields, at which point the system conveniently took them to the last field. So by typing nonsense data into a field that turned out to be quite relevant (but not to them) they could save lots of keystrokes.

Of course this is an extreme case, but have you ever filled out an on-line survey, got bored or frustrated because they are asking something you don’t have the data for and started just answering any old thing just to get to the end? The point is that people care about data quality when they are going to get something back. You can be sure they enter their address correctly on the prize draw form. But in many IT systems people are asked to enter information that doesn’t affect them, and human nature says that they will be less accurate with this than something which does mean something directly to them. Some data quality issues can be dramatic. In the North Sea one oil company drilled through an existing pipe because it was not there according to the system that recorded the co-ordinates of the undersea pipes: this merely cost a few million dollars to fix, as fortunately the pipe was not in active use that day or the consequences would have been much worse. Another company discovered that it was making no profit margin on a major brand in one market due to a pricing slip in its ERP system that had gone undetected for two years.

The reason data warehouses suffer so much from data quality issues is that they not only encounter the data problems in each of the source systems they deal with, but because they also bring all the information together the data problems often only becomes clear at this point e.g. the problem with the pricing became apparent because the data warehouse showed that they were making zero gross margin, which was not apparent inside the ERP system since the margin calculation was made up of data from several systems combined. It is the data warehouse that shines light on such issues, but often is wrongly blamed when the project is delayed as a result.

Data quality problems are one major issue where there is no magic solution. Data quality tools can help, but this is a people and process issue rather than a technology issue. However another reason data warehouse projects are perceived to fail is that they take a long time, and cost a lot to maintain. Since it takes 16 months to build an average data warehouse (TDWI) survey) it is not surprising that some changes to the business occur in this time. The only way to really address this is to use a packaged data warehouse solution, which takes less time to implement (typically less than 6 months for Kalido). Maintenance costs are another major problem, and here again there are modern design techniques that can be applied to improve this situation. See my blog the data warehouse carousel.

It is only by making use of the most modern design approaches, iterative implementation approaches that show customers early results, and the most productive technologies that data warehouse project success rates will improve. There will always be projects that run into trouble due to poor project management, political issues and lack of customer commitment, but data warehouse projects at least need to stop making life harder for themselves than they need be.

del.icio.us:Data quality blues  digg:Data quality blues  reddit:Data quality blues  Y!:Data quality blues