Data Clustering

Data clustering is a technique for improving the performance of an object-oriented database management system (OODBMS). Data that cannot fit in main memory are stored on disk. Without clustering, accessing two related objects usually requires two disk I/Os because the objects are not stored in the same page. This degrades performance because disk access is slow, whereas main memory supports very fast random access. Related objects should therefore be stored close to each other, so that each page loaded from disk returns as much relevant information as possible.
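
The effect of object placement on disk I/O can be illustrated with a small simulation. The sketch below is only an illustration, not the technique described on this page: the function name count_page_loads, the LRU page buffer, the page size of two objects, and the workload of four related object pairs are all assumptions made for the example.

```python
from collections import OrderedDict


def count_page_loads(placement, accesses, buffer_pages):
    """Count disk page loads for an access trace under an LRU page buffer.

    placement:    dict mapping object id -> page number (hypothetical layout)
    accesses:     sequence of object ids in access order
    buffer_pages: number of pages that fit in main memory
    """
    buffer = OrderedDict()  # page numbers, least recently used first
    loads = 0
    for obj in accesses:
        page = placement[obj]
        if page in buffer:
            # Page already in memory: no disk I/O, just refresh its LRU position.
            buffer.move_to_end(page)
        else:
            # Page fault: one disk I/O, evict the oldest page if the buffer is full.
            loads += 1
            buffer[page] = None
            if len(buffer) > buffer_pages:
                buffer.popitem(last=False)
    return loads


# Assumed workload: objects are related in pairs and each pair is read together.
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
accesses = [obj for pair in pairs for obj in pair]

# Clustered layout: both objects of a pair share a page (two objects per page).
clustered = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
# Scattered layout: same page capacity, but related objects sit on different pages.
scattered = {0: 0, 1: 2, 2: 1, 3: 3, 4: 0, 5: 2, 6: 1, 7: 3}

print("clustered:", count_page_loads(clustered, accesses, buffer_pages=1))
print("scattered:", count_page_loads(scattered, accesses, buffer_pages=1))
```

With a one-page buffer, the clustered layout loads one page per pair, while the scattered layout loads a page for every object accessed. A larger buffer narrows the gap in this toy workload, which is why the benefit of clustering is most visible when the data set greatly exceeds main memory.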

We have conducted a comprehensive comparison of existing clustering techniques and developed a new one that offers several innovative features. First, the storage required by our technique is linear in the number of objects. Second, we developed a replication strategy that can be adapted to each class of objects; to increase object locality, objects are duplicated only when they are accessed for reading. Third, our technique is flexible and can be tuned to avoid excessive overhead when the OODBMS is overloaded. Fourth, the clustering process is tuned with a reduced set of statistics, which is easier to set than those required by existing techniques. Simulations have shown that our technique outperforms many static and dynamic clustering techniques.

For problems or questions regarding this website, contact database@cs.ou.edu.