Benchw : A Data Warehouse Benchmarking Toolkit
==============================================

Version 0.8b


Author And History
==================

This toolkit was written by Mark Kirkwood (markir@paradise.net.nz).
The whole thing grew from a few scripts I found useful for trying out tuning
ideas.


Introduction
============

The objective of this toolkit is to be able to compare capabilities of several 
different database managers for data warehouse type activities - data loading, 
index creation and query performance.

This comparison does not test multiple threads or concurrent updates - its all 
about bulk loads and queries.


Restrictions
============

The restrictions mentioned below should apply when comparing results from 
different database managers. This is an attempt to keep the playing field
level, and at the same time be reasonably realistic.

For your own tuning and experimentation, none of these need apply - in fact
you may find it quite informative to try out some of the features forbidden
below, to see how effective they actually are!

This test is designed to be in the spirit of TCP H. This choice was made to 
firstly keep things simple and to realistically model the environment many
ad-hoc query tools actually work in. Therefore none of the more elaborate
warehousing optimizations can be used. This includes :

- table partitioning.
- clustered or pre-joined tables.
- automatic summary tables and query rewriting systems.
- session and statement level optimizer coercions.
- parallel query handling.

However indexes are free, that is any of the default ones may be removed, and
any new ones added. There is one exception (derived from the 2nd point above),
that is not allowed :
 
- pre-joined indexes.
  
The supplied dataset generator and load scripts must be used to produce and
populate the actual data. It is allowable to add extra parameters to the load
commands (e.g. buffer sizes etc - but see below).

The database manager itself may be freely configured (however memory allocation
is restricted - see below).
 
In order to attempt some realism concerning query caching, the following
restrictions apply to memory allocation :

- the database cache maximum size is 10% of the dataset (dumped) size.
- the database sort buffer maximum size is 2% of the dataset size.
- the database load buffer is restricted similarly to the sort buffer.
 
This means that the default settings for these parameters are approximately
100M, 20M and 20M respectively.

Not all database managers have all three of these structures, or else may
break them up into other components, so some leeway is allowed. The moral is
to try to be fair.

                  
The Data Model
==============

The model is an idealized "star" schema. 
There are (by default) three "dimension" tables dim0 -> dim2. 
The first dimension is timelike ( i.e. it has a date column representing some 
time measure), the other two are generic non timelike. 
There is a "fact" table fact0. This represents generic observations of interest.


The "star" relationship is expressed by means of the fact table having a 
foreign key type column for each primary key of the dimension tables. 
These relationships are not formally represented as constraints - as not all 
database managers can implement them.


Database Manager Specifics
==========================

For a list of which ones are currently support see the file INSTALL.
For notes on script changes and optimizations for specific database engines 
see the file DBMNOTES.


Expert Users And Those Who Like To Code
=======================================

Many of the attributes of this toolset can be changed. However this does require
some delving into the code.

The number of relations (i.e tables) is #defined in rel.h and can be changed.
The various filler texts are just #defines in dat.h and are readily alterable.


Licence
=======

This code is covered by a BSD license, see the file COPYING


Credits
=======

This code was mainly written with Kdevelop 2.1 (www.kdevelop.org) -
its a cool product.

