Data Warehouses/Data Marts
Repositories for Data Mining
Introduction
Companies over the last couple decades have done more logging and data
capture with the advent of computers with database capabilities. Many
have found that these data are quite useful to augment or focus market
groups if only the information were available for statistical analyses.
The increasing popularity of intranets and the Internet, itself, has
given rise to repositories of data and engines that can search for
correlates for internal uses and "sellable" information (e.g. "what
kinds of people watch what kind of television shows during what times
of the day"). Regrettably, these "gold mines" of hidden information
has been as speculative as the analogy implies.
What are they?
Both the warehouses and the marts store information about clients,
demographics, interactions and transactions. They are not limited to
commercial gains but can be applied in any number of fields (from
astronomy to zoology—anything that can be measured and has volumes of
data). For example, a transaction log history would keep information
like: "Joe X withdrew $50 each weekend at 9am Saturday morning" or
"most reliable visual astronomical observations were before sunrise".
Of course, like the last case, the conclusions are fairly obvious.
Data warehouses combine data from all types of sources and have
the following characteristics: subject oriented, integrated,
time-variant (has a time component), and nonvolatile (no data are
deleted: everything is stored, timestamped and logged). (Sakaguchi)
These collections of data are very huge and are often unnaturally
homogeneous. The structures needed to make the data homogeneous made
the storage and maintenance very unmanageable. Also, warehouses often
have to be custom-built to meet the needs of the user and IT. (Atre,
Data Warehousing, 2/9/98)
In response to the unnatural homogeneity and sheer data collection
problems of data warehouses, data marts tried to cut down the
database by focusing on topics or specific subjects. Focusing on more
specific topics helped structure the data in a more intuitive way and
made the information more accessible. The collections would still be
gathered from other sources including warehouses and other data marts.
Lastly, marts were easier to compartmentalize so that off-the-shelf
solutions could be sold.
What is the current state of the art?
The classical transaction database is not able to do analytical
processing, because:
- "Transactional databases contain only raw data, and thus, the
processing speed will be considerably slower.
- Transactional databases are not designed for queries, reports and
analyses…
- Transactional databases are inconsistent in the way that they
represent information." (Sakaguchi)
Data warehouses are specially designed to handle different types of
queries—queries based on statistical analysis.
Most companies, until recently, were forced to build their own
warehouses. Now, there are several companies which sell warehouse and
mart databases. However, these tools are very costly (marts range in
price 100s of thousands of dollars and warehouses often exceed
millions) and are more general than the custom-built ones. (Firestone)
On one hand, the advantage of custom-built marts and warehouses ensures
that the structure and queries match the data, but the customness makes
it very difficult and expensive to maintain. On the other hand,
off-the-shelf marts and warehouses are maintained by the third party
but are more general and less useful than the custom ones. In either
case, they can easily grow beyond anything manageable.
What are the corporate benefits?
If well understood and used correctly, companies can gather information
to create a special niche for their market in this increasingly
competitive economy. Many times the object (or client) of the analysis
does not even know its behaviors. Having statistical engines to analyze
and draw useful marketing conclusions may mean the difference between
coming in the market at the crest or at the lull: using the data to
create an opportunity or niche.
Advantages/Issues
There has been substantial discussion and controversy (even
name-calling) in the pursuit and definition of data
warehouses/mart/mining. In all, it's clear that there is not enough
knowledgeable, theoretical professionals who can help guide the IT/IS
world. (Sakaguchi, "A Review")
Data Warehouse
- Historically custom built. The IT professional often, after
reviewing the different commercial tools used to opt for a custom, in
house tool. Now, most would rather buy than build. (Atre, "From
Build") This inserts a randomness into the data, since no tool is
able to process all data meaningfully.
- Tends to be unmanageable. Many claim a false "easiness" to
installation and maintenance. Unfortunately, the resources alone
needed to maintain, clean and feed a warehouse may require a doubled
IT workforce. Additionally, with off-the-shelf tools and once a
vendor is selected, the vendor is entrenched in the needs of the
company, and the company might as well "write a 401K" for the support
provider. ("The Meta Myth")
- Requires consultation for user-clients. Often the process
of creating a meaningful query is the bulk of the effort. With
hundreds of tables and tools, specialists would be required to serve
those who are merely trying to analyze the data.
Data Mart
- Will be easier to maintain than a warehouse until
more, dependent marts are created. So, the idea is to keep the marts
relatively independent. Unfortunately, these marts tend to be very
data hungry (or the data architects are) and will eventually form
natural dependencies.
- Grow increasingly complex to maintain as more marts are
added—may be more expensive than doing the data warehouse itself.
The connections and dependencies between marts will make mart removal
or overhaul very difficult.
- Lure "nontechies" to implement and query without proper
understanding and foresight. This is a serious problem: for proper
growth and nurturing of a mart, proper architecting is critical.
Otherwise the system will grow and make any fixes nearly
impossible.
- Gives a false sense of security: many claim that "it's
fast, cheap and easy... That's true if you're building one or two.
But it's terribly misleading, because most companies end up with lots
more." (PC Week, p69)
- Massive duplication. Marts, by their own nature, have
"massive duplication'... database administration costs for multiple
data marts can be as much as 50 percent higher than…comparable
enterprise data warehouses."
Both
- Grow at an enormous rate, giving little time to react to
schema revisions.
- Eventually will collapse under its own weight unless
checked by some kind of expiration measures.
- Results from queries are usually contradictory. Often (99%
of the time) the results are meaningless or obvious. (Stedman, "Data
Mining") (Moad)
- Require special query processing and analysis by someone
very familiar with the schemas and data layout. (Stedman, "Data
Mining")
- Data associations may be too random. In the process of
collecting and processing the information, illogical associations may
occur. (Stedman, "Data Mining")
What are the design issues?
Start small & simple
- Stay away from multiple, connected data marts.
- Use similar dimensions and structures.
- Require only minor changes to existing marts.
Growth exponential
- Be prepared to respond to nonlinear, increasing data
accumulation.
- When problem becomes clearer, re-architecting will be needed. In
fact some have suggested that the demand for data warehousing and
accuracy means revamping or expanding every four months (Stedman,
Software, 2/16/98)
- Legacy marts should be re-architected a topic or subject
at-a-time.
Match concepts
- Statistical measures the of the same type need to be kept
together.
- No query is adequate for all data.
- Correlates are not an indication of a cause.
- Enforce consistent data model definitions.
- Don't try to create a complete data model for all data marts.
Cull data
- Many times, the data does not match the intent of the query.
- Verify that the data is accurate and consistent.
Reliability & availability
- More and more people are expecting the servers to be available
when they need them—nearly 24x7. "The phone calls you get [from
users] are when the warehouse is down." (Stedman, 2/16/98)
- System redundancy, mirroring and other techniques will be needed
to ensure that the services are always available despite eventual
hardware failure.
- Naturally query response-time is a real concern for many
customers. Hence, PrimeStar, for example, only specifies that the
queries will run within a "reason time." (Stedman, 2/16/98)
Conclusion
Until there are better definitions of what and how to process the
volumes of available data within a company in a meaningful and reliable
way, any company considering implementing a data warehouse or data mart
will have to anticipate a growing monster that will require more IT/IS
staff than they currently employ and will be marginally reliable in
reporting "nuggets of market-savvy truth".
References
[Links last verified May 27, 1998]
- "." PC Week. Dec 22, 1997. Vol 14. No 53. p69.
- Asbrand, Deborah. "Is Datamining Ready for the Masses?"
Datamation. Nov 1997. No 11. pp. 66(5).
- Atre, Shaku. "From Build to Buy." Data Warehousing.
Feb 9, 1998.
- Atre, Shaku. "The Trouble with Training." Computerworld.
Oct 20, 1997. Vol 31. No 42. p73(2).
- Bort, Julie. "Getting Your Data Mart in Shape." InfoWorld.
Mar 16, 1998. Vol 20. No 11. p77(2).
- Dorsey, Paul. "Data Warehouses, Ad Hoc Query Tools and Other Ways
to Destroy Your Company."
http://www.dulcian.com/destroy.htm
- Firestone, Joseph M. "Data Warehouses and Data Marts: A Dynamic
View."
http://www.softwarejobs.com/firestone.html
[Link not responding when last tested]
- Gibbons, Lauren Paul. "Anatomy of a Failure." CIO Magazine.
Nov. 15, 1997.
- Gordon, K.I. "The Why of Data Standards—Do You Really Know Your
Data?"
http://www.island.net/~gordon/whystds.htm
- "Managing and Maintaining the Data Warehouse." Systems Techniques,
Inc.
http://www.systecinc.com/white/roles.htm.
- "The Meta Myth." Database Group.
http://www.standishgroup.com/db.html
- Orr, Ken. "Data Warehousing Technology."
http://www.kenorrinst.com/dwpaper.html.
- Moad, Jeff. "." PC Week Online. Nov 24, 1997.
- Sakaguchi, Toru, Mark N. Frolick. "A Review of the Data
Warehousing Literature."
http://www.people.memphis.edu/~tsakagch/dw-web.htm.
- Stedman, Craig. "Data Mining for Fool's Gold." News.
12/1/1997.
- Stedman, Craig. "Data Warehouse Users Seek Guarantees."
Software. 2/16/1998.
- "Ten Mistakes to Avoid." The Data Warehousing Institute.
http://www.dw-institute.com/papers/10mistks.htm.
- Vizard, Michael. "Distinguishing Data Marts." InfoWorld.
Mar 2, 1998. Vol 2. No 9. P36.
Copyright © 2000 by
Carolla Development, Inc. All rights reserved.
For more information, please contact Carolla Development at
614/431-1944 (voice), 614/431-9084 (fax), or info@carolla.com (Email).