Model-driven Benchmark data generation for digital preservation of webpages / by Clemens Sauerwein
Author: Sauerwein, Clemens
Reviewer: Rauber, Andreas
Extent: IX, 68 pp. : illustrations
Thesis: Vienna, Univ. of Technology, diploma thesis, 2014
Abstract in German
URN: urn:nbn:at:at-ubtuw:1-64614 (persistent identifier)
The work is freely available
Abstract (English)

Digital Preservation (DP) is the process of keeping digital information accessible and usable in an authentic manner over the long term. Preservation activities are meant to guarantee long-term, error-free accessibility of data regardless of technological change. These activities rely on different approaches based on the continuous transformation of data, and several tools exist to carry them out. Digital objects have significant properties that must be preserved during these transformations. To evaluate preservation activities, information about these properties (e.g. structure, size) is necessary. Annotations of digital objects with this information serve as ground truth. A benchmark data set can be formed from real-world data, but the properties then have to be verified manually, because every automatic analysis depends on the correct interpretation of an analysis program (e.g. a characterization tool). Since these programs themselves must be evaluated, there is a profound lack of annotated benchmark data in Digital Preservation, which hinders the evaluation and improvement of digital preservation approaches and tools. This thesis introduces a model-driven benchmark data generation framework for automatically generating benchmark data together with the corresponding ground truth. The system uses the Model Driven Architecture (MDA) as its underlying concept, which makes it possible to use well-known model-driven engineering tools and frameworks. Instead of analyzing existing benchmark data collections from computer science, it generates benchmark data sets according to property distributions of different kinds of documents (e.g. webpages). The framework specifies ground truths for the Platform Independent and Platform Specific Models of the generated benchmark data. These ground truths, together with the benchmark data, are used for evaluation.
The model-driven benchmark data generation framework is evaluated by generating benchmark data for testing preservation action tools for webpages. Webpages are widely used and pose a complex challenge in digital preservation settings. We define a Platform Independent Model and a Platform Specific Model for representing webpages and demonstrate how benchmark data can be created with them.
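The idea of generating benchmark data whose ground truth is known by construction can be illustrated with a minimal Python sketch. It is not the framework described in the thesis: the names `PageModel`, `sample_model`, `to_html`, `ground_truth` and `characterize`, the chosen properties, and the value ranges are all hypothetical, and a simple tag counter stands in for a characterization tool under evaluation.

```python
import random
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class PageModel:
    """Hypothetical platform-independent model: abstract page properties only."""
    paragraphs: int
    links: int
    images: int


def sample_model(rng: random.Random) -> PageModel:
    # Assumed property distributions; the ranges are illustrative only.
    return PageModel(
        paragraphs=rng.randint(1, 5),
        links=rng.randint(0, 4),
        images=rng.randint(0, 3),
    )


def to_html(model: PageModel) -> str:
    """Hypothetical platform-specific rendering of the model as HTML."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><title>Generated page</title></head><body>",
    ]
    parts += [f"<p>Paragraph {i}</p>" for i in range(model.paragraphs)]
    parts += [f'<a href="https://example.org/{i}">Link {i}</a>' for i in range(model.links)]
    parts += [f'<img src="image{i}.png" alt="Image {i}">' for i in range(model.images)]
    parts.append("</body></html>")
    return "\n".join(parts)


def ground_truth(model: PageModel) -> dict:
    # The expected tag counts follow directly from the model,
    # so no manual verification of the generated page is needed.
    return {"p": model.paragraphs, "a": model.links, "img": model.images}


class TagCounter(HTMLParser):
    """Stand-in for a characterization tool that is being evaluated."""

    def __init__(self):
        super().__init__()
        self.counts = {"p": 0, "a": 0, "img": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1


def characterize(html: str) -> dict:
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts
```

In this sketch a tool would be evaluated by comparing `characterize(to_html(model))` with `ground_truth(model)` over many sampled models; any mismatch points to an error in the tool rather than in the annotation.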