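Hello everyone, I'm Helen. Last time I wrote an article about the open potential of eBooks. This time I would like to briefly discuss why it matters to avoid losing content and structural information when files marked up in XML languages such as DocBook, DITA, and TEI are converted to the XHTML files used by EPUB.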
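Existing XML markup languages can define rich vocabularies. DocBook, TEI (Text Encoding Initiative), and DITA (Darwin Information Typing Architecture) are just three of the more common vocabularies used to mark up electronic documents. Because XML lets you design your own tags, and namespaces let a single document use several vocabularies, XML is highly flexible and widely applicable.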
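Each XML language has its own purpose, vocabulary, and structural characteristics. For example, DocBook is better suited to marking up papers and scientific documents, so it has a complex, deeply nested structure, and its vocabulary includes many information-technology terms. TEI is better suited to literature, linguistics, and the markup and preservation of primary sources, and it offers very fine-grained markup for humanities computing (see Fig. 2). DITA is similar in nature to DocBook (both lean toward marking up scientific literature), but its structure is completely different from the other two: it emphasizes topics and the relationships between topics, so the resulting documents have a web-like or tree-like structure.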
(Fig2. Markup of a historical document in TEI)
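What might be lost when a file is converted? As we just saw, each language has its own distinctive structure, so after conversion that structure has to be forced into a completely different one. Vocabularies also differ between languages, so the original meaning of some terms can be lost. Because XML languages have become so complex, each with its own characteristics, finding a common conversion rule that works for every language is harder still.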
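A photograph can only capture the scene in front of you; it cannot record the story behind that scene. With the development of the various XML languages and the Semantic Web, however, the Internet can offer people more and more services, along with new ways of recording, so that what you get goes far beyond the picture before your eyes. I hope that as technology advances, we can protect information while also encouraging its openness, so that information sparks creativity, and creativity in turn brings more creativity and a better Internet overall.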
WYS is not just WYG: The Importance of Preserving Information in eBook Documentation
Hi! My name is Helen. Last time I wrote an article about the open possibilities of eBooks. This time I would like to look at the XML document markup languages found across the web, such as DocBook, DITA, and TEI, and at why it is important to preserve structural and content-level information when converting these XML documents to the XHTML used by EPUB.
During my summer internship, I heard an interesting metaphor for data loss through format conversion: take a picture of a cow grazing on grassland and remove the cow, and the picture alone no longer tells you which cow to put back in the empty space. Similarly, when a highly structured, content-specific document such as DocBook (an XML vocabulary) is converted into a more presentational document type such as EPUB or PDF, some details are lost, and without that critical data it is impossible to transform the document back to its original form.
(Fig1. Lost data is irretrievable)
This problem has been brought to light by the advancement of Internet technologies. Since it is difficult for the end user to recover what has already been lost, the alternative is to preserve as much structural and content-level information as possible during conversion, so that the least amount of data is lost.
Simple Introduction to XML (Extensible Markup Language)
XML is a language for defining markup vocabularies for use in different application domains. Many XML vocabularies can currently be found across the World Wide Web; DocBook, TEI (Text Encoding Initiative), and DITA (Darwin Information Typing Architecture) are just a few examples. XML has many strengths: developers can design their own tags and specifications, and namespaces allow several vocabularies to be used within the same document.
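To make the namespace point concrete, here is a minimal sketch in Python using only the standard library. The sample document is invented for illustration: it places a TEI `persName` element inside a DocBook `article`, and it is only well-formed XML, not validated against either schema. The helper names `split_tag` and `vocabularies` are my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a DocBook article that borrows one TEI element.
DOC = """\
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:tei="http://www.tei-c.org/ns/1.0">
  <title>Field notes</title>
  <para>Letter written by <tei:persName>Alice Example</tei:persName>.</para>
</article>
"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def vocabularies(root):
    """Group element names by the namespace (vocabulary) they belong to."""
    found = {}
    for element in root.iter():  # document order, root included
        namespace, local = split_tag(element.tag)
        found.setdefault(namespace, []).append(local)
    return found


for namespace, names in vocabularies(ET.fromstring(DOC)).items():
    print(namespace, names)
```

The namespace URI is what tells a program which vocabulary an element comes from, so two vocabularies can share one file without their tag names colliding.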
Every XML language has its own usage, vocabulary set, and structural characteristics. For example, DocBook has a more complicated, deeply nested structure and an emphasis on elements for computer science and technical documents. TEI, on the other hand, has a vocabulary suited to linguistic and literary content markup, as well as markup for manuscripts for digital preservation purposes (see Fig. 2 for a manuscript markup example). DITA is related to DocBook in that its vocabulary is also suited to document markup in the field of computer science. Unlike DocBook, however, DITA has a more map-like structure that emphasizes the relations between topics.
(Fig2. Manuscript markup in TEI)
What is lost when a file is converted from a language like TEI to XHTML? Often it is the structure of the document, which has to be forced into a very different one, and the vocabulary used for content markup, whose original meaning may have no equivalent in the target language. Because each XML vocabulary is complicated in its own way and has its own characteristics, it is difficult to find a single conversion method that fits all languages without loss of information.
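The loss described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a real TEI-to-EPUB converter: the sample sentence is invented, the mapping rule (TEI `p` becomes `p`, everything else becomes `span`) is deliberately crude, the `tei-<name>` class naming convention is hypothetical, and the XHTML namespace is left off the output for brevity.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI fragment with two semantically different inline elements.
TEI_SNIPPET = (
    f'<p xmlns="{TEI_NS}">Sent by <persName>Helen</persName> from '
    '<placeName>Taipei</placeName>.</p>'
)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def convert(element, preserve):
    """Map a TEI element to an XHTML-like one: p -> p, anything else -> span.

    With preserve=True the original TEI element name is kept in a class
    attribute (an assumed convention), so the semantics survive conversion.
    """
    name = local_name(element.tag)
    out = ET.Element("p" if name == "p" else "span")
    if preserve and name != "p":
        out.set("class", f"tei-{name}")
    out.text = element.text
    for child in element:
        new_child = convert(child, preserve)
        new_child.tail = child.tail  # keep the text that follows the child
        out.append(new_child)
    return out


source = ET.fromstring(TEI_SNIPPET)
lossy = ET.tostring(convert(source, preserve=False), encoding="unicode")
kept = ET.tostring(convert(source, preserve=True), encoding="unicode")
print(lossy)
print(kept)
```

In the lossy output both the person and the place end up as a bare `span`, so nothing in the result says which was which and the conversion cannot be reversed. Keeping the original element name somewhere in the output is one simple way to leave a trail back to the source vocabulary.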
The Importance of Information
Why, then, is it important to preserve all this structural, content, and metadata information? We live in an era where most documents are being digitized, so preserving what is original and recording the necessary background information becomes essential. Data is easily lost through complicated conversions and transmissions. By preserving the information relevant to an e-document, we maintain its accessibility and reusability, protect publisher rights, and allow a smoother flow of information at the same time. Open and accurate information inspires people and ignites creativity.
Well marked-up information not only benefits open creativity; it also helps machines understand and interpret our documents. The Semantic Web and the markup languages mentioned above do just this: they help the computer interpret documents, which in turn makes web resources more useful and lets software and search engines work more intelligently.
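As a small, hedged illustration of what "helping the machine interpret" can mean in practice, the Python sketch below pulls the people and places out of a TEI fragment simply by asking for the elements that name them. The sample text and names are invented; only the TEI namespace URI and the `persName` and `placeName` element names come from TEI itself.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical TEI fragment; the sentence is made up for this example.
TEI_TEXT = """\
<text xmlns="http://www.tei-c.org/ns/1.0"><body>
<p><persName>Alice Example</persName> wrote to
<persName>Bob Sample</persName> from <placeName>Taipei</placeName>.</p>
</body></text>
"""

root = ET.fromstring(TEI_TEXT)

# Because the markup says what each span of text *is*, a program can
# answer "who is mentioned?" without any guessing or text analysis.
people = [el.text for el in root.findall(".//tei:persName", NS)]
places = [el.text for el in root.findall(".//tei:placeName", NS)]

print("People:", people)
print("Places:", places)
```

Had the same sentence been flattened to plain XHTML paragraphs, the program would have had to guess which words are names, which is exactly the kind of information a careful conversion should try to keep.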
When we take a picture with a camera, any camera, we only get the scene we see. A photograph cannot record the whole background story. However, with XML and its various markup vocabularies, and with the recent development of Semantic Web tools, the Internet can provide more and more services to the end user and create new ways to record and preserve information. In other words, What You Get is “more than” What You See. My hope is that as technology advances, information can be both protected and widely accessible, igniting creativity and, through that creativity, raising the Internet to a new level.