|
Project Summary
Query response time and system throughput are the primary metrics of database
and file access performance. As data volumes grow, efficient access methods and
data storage techniques have become increasingly critical to maintaining
acceptable query response time and system throughput.
Retrieving data from disk is several orders of magnitude slower than retrieving
it from memory. One of the common ways to reduce disk I/Os and therefore
improve query response time is database clustering, which is a process that
partitions the database/file vertically (attribute clustering) and/or
horizontally (record clustering). To take advantage of parallelism to improve
system throughput, clusters can be placed on different nodes in a cluster
machine. A clustering result is optimized for a given set of queries. In
dynamic systems, however, the queries change over time, so the clustering in
place becomes obsolete and the database/file needs to be re-clustered dynamically.
This proposal aims to develop an efficient algorithm for database/file
clustering that dynamically and automatically generates attribute and record
clusters based on closed item sets mined from the attribute and record sets
found in the queries running against the database/files. It then
develops ways to implement the algorithm using the cluster computing paradigm
to reduce query response time and increase system throughput even further through
parallelism and data redundancy. The developed algorithms will be prototyped on
a cluster computer with 486 compute nodes available at the University of
Oklahoma. Performance studies will be conducted using the decision support
system database benchmark (TPC-H) and real data recorded in database and file
formats collected from applications such as meteorology, microbiology and
healthcare.
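As an illustration of the mining step described above, the following is a minimal Python sketch of how closed item sets could be extracted from the attribute sets referenced by a query workload. It is not the proposed algorithm: the function name closed_frequent_itemsets, the min_support threshold, and the sample attribute names are hypothetical, and the brute-force enumeration is only practical for a small number of attributes. An item set is treated as closed when no proper superset has the same support.

```python
from itertools import combinations


def closed_frequent_itemsets(queries, min_support):
    """Return {attribute set: support} for the closed frequent attribute sets.

    queries      -- iterable of attribute collections, one per query (hypothetical log)
    min_support  -- minimum number of queries that must reference the set
    """
    transactions = [frozenset(q) for q in queries]
    items = sorted(set().union(*transactions))

    # Brute-force level-wise enumeration of frequent attribute sets.
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                frequent[candidate] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: no frequent set of this size
            # means no larger set can be frequent either.
            break

    # Keep only closed sets: no proper superset with identical support.
    return {
        itemset: support
        for itemset, support in frequent.items()
        if not any(
            itemset < other and support == other_support
            for other, other_support in frequent.items()
        )
    }


# Hypothetical workload: each entry lists the attributes one query touches.
sample_queries = [
    {"name", "salary"},
    {"name", "salary"},
    {"name", "salary", "dept"},
    {"dept", "location"},
    {"dept", "location"},
]
sample_closed = closed_frequent_itemsets(sample_queries, min_support=2)
```

In this sample, the closed sets are {name, salary}, {dept}, and {dept, location}. Each one is a candidate attribute cluster, since its attributes are accessed together by at least min_support queries. Re-running the mining step on a fresh query log is one simple way to picture how re-clustering could follow a changing workload.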
Intellectual Merits:
While much work has been published on indexing, buffering, clustering and
parallelism as techniques for improving system performance, little has been done
to automate these processes, especially the automatic and dynamic clustering
of data on storage media for high-end computing. This is an important area
because with data proliferation, human attention has become a precious and
expensive resource. It is therefore important to automate this process in order
to minimize the operating cost. The intellectual merits of this proposal lie in
three important contributions: 1) a database/file clustering technique that
uses data mining to automatically and dynamically cluster and
re-cluster a database/file with little intervention from a database/system
administrator, 2) the approach of integrating the proposed clustering technique
and the cluster computing architecture to improve query response time and
system throughput, and 3) comprehensive performance studies by means of
prototyping that use not only a popular database benchmark, but also real
database and file datasets from data- and computation-intensive applications.
This proposal is high-risk and high-payoff and is suitable for EAGER because the
proposed ideas for integrating autonomous database/file clustering with cluster
computing, while novel and potentially transformative, are at an early stage.
Details need to be developed and tested to prove their feasibility on cluster
computers with at least a few hundred nodes. Once feasibility is demonstrated,
a full proposal that addresses both autonomous attribute and record clustering
for high performance computers running on local and wide area networks will be
developed and submitted.
Broader Impacts:
The research results will be beneficial to many applications as they are
expected to improve query response time and system throughput. Collaboration
will be carried out with scientists in the Oklahoma Center for Analysis and
Prediction of Storms (CAPS) and domain experts in other application areas. The
research results, including the proposed prototype and the datasets used for
performance studies, will be published in journals and conference proceedings
and posted on the website of the PI's Database Group
(http://www.cs.ou.edu/~database) for public
use. One graduate student and one undergraduate student will be supported by
the project as research assistants (RAs) to conduct research and build the
prototype for performance evaluations. The RAs will be trained in database and
file management and in high-end computing and applications. The PI will work with
the minority engineering program in the College of Engineering and the ACM-W
Chapter at the University of Oklahoma to identify minority and female students
for RA recruitment. The PI has supervised and graduated ten minority and female
PhD and Master's students.
|