Data Warehouse

The vast majority of requests for information from a data warehouse involve dynamic ad hoc queries. The ability to extract data to answer such queries quickly is a critical issue in the data warehouse environment. Proper indexing is crucial to avoid I/O intensive scans against the large data warehouse tables. To support the dynamic nature of the ad hoc queries, the index has to be scalable. The cost of building indexes using all the important attributes is prohibitive. Hence, the challenge is to find the subset of indexes that would improve the ad hoc queries’ performance automatically. This is our project’s goal.

The major problem in selecting indexes is workload determination since it is unknown until users start using the data warehouse. However it may be possible to estimate it. Knowing what the data warehouse’s workload looks like plays a major role in determining the attributes considered by the index selection algorithm.

To derive an algorithm for the auto-selection of an index, our project is currently conducting the following tasks: 1) identifying different transaction classes, 2) identifying the characteristics of each class, 3) building models from predefined training queries that will classify the user queries, and 4) selecting indexes based on the estimated workloads of the transaction classes.

To identify the transaction classes and their characteristics, users' requirements need to be understood. Reasonable assumptions have to be made to define the characteristics of a transaction class and periodical reviews are needed to improve them. By applying a data mining technique to the training queries of the transaction classes, the class models are generated. These models are then used to generate the likely workload from the query history. The index attributes that would best suit the need of the user queries are selected from the estimated workload by using an algorithm based on a data mining technique.

There are many issues that still need to be addressed. These include determining how long to keep the history, how to measure the classification accuracy of the models, how to detect runaway queries, and how to manage resources for such queries.

For problems or questions regarding this web contact database@cs.ou.edu.