File format analysis : monitoring the life cycle of file formats in the internet

Schindler, Stefan

doi:10.34726/hss.2014.22632

DC Field

Value

Language

dc.contributor.advisor

Rauber, Andreas

dc.contributor.author

Schindler, Stefan

dc.date.accessioned

2020-06-29T16:51:19Z

dc.date.issued

2014

dc.date.submitted

2014-06

dc.identifier.citation

Schindler, S. (2014). File format analysis : monitoring the life cycle of file formats in the internet [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2014.22632

dc.identifier.uri

https://doi.org/10.34726/hss.2014.22632

dc.identifier.uri

http://hdl.handle.net/20.500.12708/7504

dc.description

Alternative title as translated by the author

dc.description

Abstract in German. Bibliography: pp. 91-95

dc.description.abstract

The amount of information published on the World Wide Web has risen rapidly in recent years. Today the internet is one of the most widely used media for education, communication, entertainment, and business processes. Because data published on the internet is very easy to modify, and therefore changes at very short intervals, measures must be taken in good time to archive this data for later use (this process is called "web archiving"). Not only is new information constantly being published on the WWW; the file formats in use are also constantly being extended, improved, and replaced by newer ones. This creates the risk that data exists in older formats that newer versions of a program can no longer read and display correctly. This software obsolescence poses a major threat to digital objects. A further piece of information, relevant mainly to developers of programs that produce particular file formats, but also to users who work with or reuse files in those formats, is how long it takes for a file format to disappear or to be superseded by its successor. These requirements give rise to, among others, the following typical questions, which are essential for developers and users alike: (1) To what extent has a newly introduced file format been accepted? (2) How long did it take for a particular version to replace its predecessor? (3) When will a particular version or file format become obsolete? This thesis creates a framework for identifying and analyzing files downloaded from the web that can compute extensive statistics on the development of individual versions of particular file formats over an extended period.
To this end, various tools are used and created, which also make it possible to adapt the framework further in future and to add functionality to it. For example, the thesis additionally analyzes HyperText Markup Language files, which make up a large share of the files on the WWW, in detail and shows their specific properties. Particular (sub-)domains of the WWW can be crawled, the resulting warc files are read, and the individual files are identified with the help of the File Information Tool Set. The resulting identification files are aggregated and analyzed with a statistics program. The framework and the statistics it generates are used to try to give an insight into the evolution of the WWW and to show trends and patterns that can help to support digital preservation and to uncover risks. The thesis compares its findings with earlier studies in this area and gives an overview of future steps and research to be carried out.

dc.description.abstract

In recent years, the amount of information published and shared on the World Wide Web has increased rapidly. The web has become one of the main platforms for education, communication, and entertainment, but also for business processes. Because published web data is ephemeral and easily changed, long-term accessibility is not guaranteed, so action must be taken to archive this data for later use and prevent the loss of information (this process is called "web archiving"). As the World Wide Web grew, many new file formats and versions were developed and found their way into use on the web. This creates the risk that data is available only in older formats that newer versions of a program can no longer read and interpret. This software obsolescence is a major problem for digital objects. Another piece of information, important mainly to developers of programs that create certain file formats but also to users of those formats, is the time it takes for a file format to disappear or to be replaced by its successor. From these requirements, the following questions can be derived, which are typical for developers as well as users: (1) To what extent has a newly introduced file format been accepted? (2) How long did it take for a specific version to replace its predecessor? (3) When will a specific version or a certain file format become obsolete? This thesis creates a framework for the identification and analysis of files loaded from the web, which can also compute extensive statistics on the development of particular versions of certain file formats over a longer period. To this end, different tools are used and created, which also allow future adaptation of the framework and the addition of new functionality. As an example, an additional in-depth analysis of HyperText Markup Language files, which make up a large part of web data, is performed, and specific characteristics of these files are illustrated.
It is possible to crawl certain (sub-)domains of the World Wide Web, to filter the created warc files, and to identify the individual files with the help of the File Information Tool Set. The created identification files are aggregated and analyzed with a statistical computing program. The framework and the generated statistics are used to try to give an insight into the evolution of the World Wide Web and to highlight patterns and trends that can support digital preservation and reveal potential preservation risks. The thesis compares its findings with previous studies in this subject area and gives an overview of further steps and studies to be conducted.

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

file format analysis

dc.subject

web archiving

dc.subject

(x)html analysis

dc.subject

analysis framework

dc.subject

file format lifecycle

dc.title

File format analysis : monitoring the life cycle of file formats in the internet

dc.title.alternative

Fileformatanalyse: Überwachung des Lebenszyklus von Dateiformaten im Internet

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2014.22632

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Stefan Schindler

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Becker, Christoph

tuw.publication.orgunit

E188 - Institut für Softwaretechnik und Interaktive Systeme

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC11705165

dc.description.numberOfPages

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-70521

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.languageiso639-1

item.openaccessfulltext

Open Access

item.openairetype

master thesis

item.grantfulltext

open

crisitem.author.dept

TU Wien

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(1.71 MB)

In Copyright

Page view(s)

211

checked on Nov 21, 2023

Download(s)

122

checked on Nov 21, 2023
