Data Warehouses/Data Marts

Repositories for Data Mining

White Paper
by
Sean Walton and Alan Cline
Carolla Development

 


Introduction

Over the last couple of decades, companies have logged and captured more data as computers with database capabilities became available. Many have found that these data would be quite useful for expanding or focusing their target markets, if only the information were available for statistical analysis. The increasing popularity of intranets and the Internet itself has given rise to repositories of data and to engines that can search them for correlations, both for internal use and as "sellable" information (e.g., "what kinds of people watch what kinds of television shows at what times of day"). Regrettably, mining these "gold mines" of hidden information has been as speculative as the analogy implies.


What are they?

Both warehouses and marts store information about clients, demographics, interactions and transactions. They are not limited to commercial uses but can be applied in any number of fields, from astronomy to zoology: anything that can be measured and produces volumes of data. For example, a transaction log history would keep information like "Joe X withdrew $50 each weekend at 9am Saturday morning" or "most reliable visual astronomical observations were before sunrise". Of course, as in the last case, some of the conclusions are fairly obvious.
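A finding such as the "Joe X" example can be sketched as a simple count over a transaction log. The snippet below is a minimal illustration in Python, not a description of any particular product: the log layout, the sample rows, and the names LOG, WEEKDAYS and recurring_habits are all assumptions made up for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical transaction log: (client, timestamp, action, amount in dollars).
LOG = [
    ("Joe X", datetime(1998, 5, 2, 9, 0), "withdraw", 50),
    ("Joe X", datetime(1998, 5, 9, 9, 0), "withdraw", 50),
    ("Joe X", datetime(1998, 5, 16, 9, 0), "withdraw", 50),
    ("Ann Y", datetime(1998, 5, 5, 14, 30), "deposit", 200),
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def recurring_habits(log, min_count=3):
    """Return (client, weekday, hour, action, amount) patterns
    that occur at least min_count times in the log."""
    counts = Counter(
        (client, WEEKDAYS[ts.weekday()], ts.hour, action, amount)
        for client, ts, action, amount in log
    )
    return [pattern for pattern, n in counts.items() if n >= min_count]


habits = recurring_habits(LOG)
for client, day, hour, action, amount in habits:
    print(f"{client}: {action} ${amount} on {day} at {hour}:00")
```

The threshold min_count is arbitrary here; a real analysis would need a statistical test rather than a fixed count to separate habits from coincidence.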

Data warehouses combine data from all types of sources and have the following characteristics: subject-oriented, integrated, time-variant (has a time component), and nonvolatile (no data are deleted: everything is stored, timestamped and logged). (Sakaguchi) These collections of data are huge and are often unnaturally homogeneous. The structures needed to make the data homogeneous make storage and maintenance unmanageable. Also, warehouses often have to be custom-built to meet the needs of the user and IT. (Atre, Data Warehousing, 2/9/98)
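The time-variant and nonvolatile characteristics can be illustrated with a small append-only store. This is only a sketch under assumed names (the Warehouse class, its load, history and as_of methods, and the sample customer records are invented for the example); it shows that a changed value is added as a new timestamped row rather than overwriting the old one, so earlier states remain queryable.

```python
from datetime import datetime


class Warehouse:
    """Toy append-only store: rows are never updated or deleted."""

    def __init__(self):
        self._rows = []  # only ever appended to

    def load(self, subject, key, record, loaded_at):
        # Every row carries its subject and a timestamp (time-variant).
        self._rows.append({
            "subject": subject,
            "key": key,
            "loaded_at": loaded_at,
            "data": dict(record),
        })

    def history(self, subject, key):
        """All versions ever loaded for one entity, in load order."""
        return [r for r in self._rows
                if r["subject"] == subject and r["key"] == key]

    def as_of(self, subject, key, when):
        """The most recent version loaded at or before `when`, or None."""
        versions = [r for r in self.history(subject, key)
                    if r["loaded_at"] <= when]
        return max(versions, key=lambda r: r["loaded_at"]) if versions else None


wh = Warehouse()
wh.load("customer", "joe-x", {"city": "Columbus"}, datetime(1997, 1, 1))
wh.load("customer", "joe-x", {"city": "Dayton"}, datetime(1998, 3, 1))

old = wh.as_of("customer", "joe-x", datetime(1997, 6, 1))
new = wh.as_of("customer", "joe-x", datetime(1998, 6, 1))
print(old["data"]["city"], "->", new["data"]["city"])
```

Because nothing is ever removed, the row count only grows, which is one source of the size and maintenance problems described above.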

In response to the unnatural homogeneity and sheer data collection problems of data warehouses, data marts cut the database down by focusing on specific subjects. Focusing on more specific topics helps structure the data in a more intuitive way and makes the information more accessible. The data are still gathered from other sources, including warehouses and other data marts. Lastly, marts are easier to compartmentalize, so off-the-shelf solutions can be sold.
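The relationship between a warehouse and a mart can be sketched as a subject filter plus a narrower set of fields. The example below is a hedged illustration only: the row layout, the field names and the build_mart helper are hypothetical and do not correspond to any vendor's tooling.

```python
# Hypothetical warehouse rows drawn from several source systems.
WAREHOUSE_ROWS = [
    {"subject": "sales", "region": "Midwest", "amount": 120.0, "source": "pos"},
    {"subject": "sales", "region": "South", "amount": 80.0, "source": "web"},
    {"subject": "support", "region": "Midwest", "ticket": 4411, "source": "helpdesk"},
]


def build_mart(warehouse_rows, subject, fields):
    """Copy only one subject's rows, keeping only the named fields."""
    return [
        {field: row[field] for field in fields}
        for row in warehouse_rows
        if row["subject"] == subject
    ]


sales_mart = build_mart(WAREHOUSE_ROWS, "sales", ["region", "amount"])
print(sales_mart)
```

The mart holds fewer rows and fewer columns than its source, which is what makes it easier to structure around the questions of a single group of users.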


What is the current state of the art?

The classical transaction database is not able to do analytical processing, because:

  1. "Transactional databases contain only raw data, and thus, the processing speed will be considerably slower.
  2. Transactional databases are not designed for queries, reports and analyses…
  3. Transactional databases are inconsistent in the way that they represent information." (Sakaguchi)

Data warehouses, by contrast, are specially designed to handle a different type of query: queries based on statistical analysis.
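The difference between the two kinds of query can be shown with a small example. The sketch below uses Python's built-in sqlite3 module purely for convenience; SQLite is not a warehouse engine, and the withdrawals table and its rows are invented for illustration. The first query is a transactional-style lookup of one client's records, and the second is an analytical summary across all clients.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE withdrawals "
    "(client TEXT, weekday TEXT, hour INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO withdrawals VALUES (?, ?, ?, ?)",
    [
        ("Joe X", "Saturday", 9, 50.0),
        ("Joe X", "Saturday", 9, 50.0),
        ("Ann Y", "Saturday", 9, 100.0),
        ("Ann Y", "Tuesday", 14, 20.0),
    ],
)

# Transactional-style query: fetch the raw rows for one client.
joe_rows = conn.execute(
    "SELECT amount FROM withdrawals WHERE client = ?", ("Joe X",)
).fetchall()

# Analytical query: count and average across every client, by time slot.
summary = conn.execute(
    "SELECT weekday, hour, COUNT(*), AVG(amount) "
    "FROM withdrawals GROUP BY weekday, hour ORDER BY weekday, hour"
).fetchall()

for weekday, hour, n, avg_amount in summary:
    print(f"{weekday} {hour}:00  n={n}  avg=${avg_amount:.2f}")
conn.close()
```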

Until recently, most companies were forced to build their own warehouses. Now, several companies sell warehouse and mart databases. However, these tools are very costly (marts cost hundreds of thousands of dollars and warehouses often run into the millions) and are more general than the custom-built ones. (Firestone) On one hand, custom-built marts and warehouses ensure that the structure and queries match the data, but the customization makes them very difficult and expensive to maintain. On the other hand, off-the-shelf marts and warehouses are maintained by a third party but are more general and less useful than the custom ones. In either case, they can easily grow beyond anything manageable.


What are the corporate benefits?

If data warehouses and marts are well understood and used correctly, companies can gather information to carve out a niche in their market in this increasingly competitive economy. Often the object (or client) of the analysis is not even aware of its own behaviors. Having statistical engines that analyze the data and draw useful marketing conclusions may mean the difference between entering the market at the crest or at the lull: using the data to create an opportunity or niche.


Advantages/Issues

There has been substantial discussion and controversy (even name-calling) over the pursuit and definition of data warehouses, data marts and data mining. In all, it is clear that there are not enough knowledgeable professionals with a theoretical grounding who can help guide the IT/IS world. (Sakaguchi, "A Review")

Data Warehouse

Data Mart

Both


What are the design issues?

Start small & simple

Growth exponential

Match concepts

Cull data

Reliability & availability


Conclusion

Until there are better definitions of what data to process, and of how to process the volumes of data available within a company in a meaningful and reliable way, any company considering a data warehouse or data mart should anticipate a growing monster that will require more IT/IS staff than it currently employs and that will be only marginally reliable in reporting "nuggets of market-savvy truth".


References
[Links last verified May 27, 1998]

Copyright © 2000 by Carolla Development, Inc. All rights reserved.


For more information, please contact Carolla Development at
614/431-1944 (voice), 614/431-9084 (fax), or info@carolla.com (Email).