XIRUSS-T

What Is XIRUSS-T?

The XInclude-based Re-Use Support System, Toy (XIRUSS-T) is an experimental system intended both to demonstrate basic techniques for managing compound documents and to provide a sandbox for exploring various content management and link management problems and their solutions. XIRUSS-T, as an explicitly toy system, is intended only for demonstration and educational purposes. It intentionally avoids implementation characteristics that would make it perform well and scale. This is both to keep the implementation as simple as possible and to ensure that it is not useful for production purposes. The XIRUSS-T system is not intended to compete with any existing or future production-capable content management system. Rather, it is hoped that XIRUSS-T will be a positive and productive influence on production systems through the demonstration of sound content management principles and techniques.

The XIRUSS-T system focuses on two key areas of content management:

The generic management of versioned hyperdocuments as collections of storage objects. That is, the management of interconnected systems of files irrespective of their data types, source formats, and so on.
The XIRUSS-T system can manage files of any type, not just XML, although it has the most built-in knowledge of XML as a fundamental data type. But it could just as easily be adapted to understand Microsoft Word documents, FrameMaker documents, or whatever you have.
Support for modular, flexible, and easily extensible import and export processes. As explained below, most of the pain of content management is in import and export, especially when you accept that integration with editors always requires an export followed by an import and that, for XML, each non-trivial document type will require its own specialized importer and exporter.

Other issues, such as workflow, full-text search, and metadata management, are also important, but the two issues listed above are the fundamental features that a content management system must get right. If it does not, it will either be incapable of satisfying the remaining business requirements or be prohibitively expensive to adapt to specific use cases compared with other solutions. Of course, the XIRUSS-T system would be a solid foundation on which to experiment with approaches to satisfying these other requirements, but they are not the immediate focus of the XIRUSS-T project.

Through our work at Innodata Isogen as systems integrators attempting to design, build, and implement content management systems that manage complex hyperdocuments with sophisticated and challenging versioning requirements, we have realized that few, if any, off-the-shelf systems are fully capable of satisfying these requirements, because they have architectural limitations, have implementation flaws, or are too expensive to adapt and maintain.

Out of our experience we also developed a simple but powerful abstract model for managing versioned hyperdocuments, which we named SnapCM, for Snapshot-based Configuration Management. We have, in the past, implemented SnapCM-based systems, proving that the model is sound and practical. In fact, we developed a production-quality SnapCM-based system, code named Bonnell, that satisfied all our requirements as integrators and met the performance and scalability requirements of the systems we were integrating it with. Unfortunately, historical events unfolded to make the Bonnell system unavailable, due to complications of ownership, choices of implementation languages, and so on. But the SnapCM model survives as a proven, freely available abstract approach to version management that can be applied to any content management system regardless of implementing technology or underlying architecture.

Out of this experience I came to the XIRUSS project with the following:

Long professional experience working with and thinking about complex issues of industrial-scale hyperdocument creation, management, and deployment, starting back in the 1980s with IBM's BookMaster markup system, which included sophisticated hyperlinking features needed to create systems of interlinked online manuals for products and systems of products. This work continued through participation in the development of the HyTime standard in the 1990s and through task- and business-specific design and implementation of SGML- and XML-based link management and content management systems in my role as a professional systems integrator.
A deep and powerful frustration with the existing available tools, all of which failed in some serious way to satisfy the requirements at hand.
The SnapCM abstract versioning model and the knowledge that in both theory and practice it satisfies the most challenging requirements of managing versioned hyperdocuments, through our experience implementing and deploying the Bonnell system.
A knowledge of how much effort is required to implement a SnapCM-based system so that it scales and performs (and therefore how easy it would be to implement one that doesn't scale or provide high performance but that is functionally correct).
A realization that the XInclude specification provides a key tool for the task of complex technical document management, namely the ability to define standards-based use-by-reference links.
The full realization that Innodata Isogen is not and never will be a product company (and that I will never again work for a product company), meaning that we will never build the all-singing, all-dancing content management system I and my customers want and need. We are integrators and want things that we can integrate effectively, affordably, and sustainably.

This means that I will always either be dependent on third parties to develop products that meet my requirements or build systems from scratch as one-off works for hire again and again (which is good money but not really a service to the community over the long term).

Therefore I have developed XIRUSS-T as a way to help people, both users of content management systems and the builders of content management systems, understand the following key facts:

That implementing sophisticated versioned hyperdocument management systems doesn't have to be that hard if you use the right model (SnapCM) and use it to drive your implementation design from the get-go.
That most of the complexity in content management systems is in the import and export processes and that such processes will always be document type specific.
That systems can provide flexible import and export frameworks that minimize the cost of implementing new importers and exporters given an existing importer or exporter that is similar to the one you need.
That use-by-reference is an almost universal requirement for technical documentation management and that XInclude is an attractive standard for implementing use by reference in XML documents.

Thus the XIRUSS-T system has two primary purposes. The first is to demonstrate how easy it is to implement the SnapCM model and how effective it is as a basis for implementing sophisticated versioned hyperdocument management features, primarily as required by documents that use XInclude and similar mechanisms to create link-based compound documents. The second is to demonstrate how a flexible and extensible importer and exporter framework can be designed and implemented.

The XIRUSS-T system has been implemented quickly, ignoring most issues of scalability and performance, keeping the focus entirely on the general data models, framework design, and core algorithms needed to implement SnapCM and the import and export functionality and characteristics I want and need as an integrator. For example, I spent maybe a total of eight hours implementing the core SnapCM classes, which are a literal reflection of the abstract SnapCM model, as defined in the SnapCM paper (http://www.isogen.com/papers/snapCM.pdf). The current implementation leaves out many important things, including implementation of the Sync class, but the code that's there indicates that the SnapCM model is fundamentally simple. Given an implementation of Sync, everything else would be about performance and scalability optimization, not core processing semantics.

I have also ignored all issues of access control and security, again to keep the focus on the core data management functions and to keep the code simple. Access control techniques are well understood and there's no need for me to demonstrate that I know what they are. And it helps ensure that XIRUSS-T is not itself useful for production purposes.

The next section explains in more detail the challenges inherent in import and export.

Import and Export Are the Hard Problems

The management of complex systems of linked documents is inherently challenging and therefore the XIRUSS-T system provides a wide range of functionality. However, the primary focus of the system is on the challenges of document import and export. This is because, given an appropriate versioning model, the management of documents once they are within a repository is relatively easy (ignoring issues of performance and scale).

Almost all the complexity of content management is concentrated at the boundary between the repository and the outside world. That is, the task of importing systems of interconnected documents into the repository in a way that then enables their management as hyperdocuments requires quite sophisticated processing. This processing includes handling standards-defined semantics such as XInclude's use-by-reference, schema associations, and so on, as well as document-type-specific semantics and business-use-specific semantics.

For example, imagine your enterprise has developed its own sophisticated document type to manage the creation and authoring of technical documents. This document type uses XInclude to represent use-by-reference links and uses its own purpose-built element types for representing navigational links (possibly reflecting some ancient legacy markup that predated standards like XLink by 10 or 15 years). In this system, a typical user manual for a single product might be composed of several hundred or thousand files, all linked together, managed through complex workflows, and so on. Such a manual is a "compound document" (to use XIRUSS's terminology) in that it is a single "logical" document composed, through XInclude links, of many individual XML document instances.

You want and need a content management system that will enable the management of these complex documents in sophisticated ways. At a minimum you need to satisfy the following requirements:

Can import an entire compound document as a single action, for example, from a normal file system. The imported result must account for all the compound document's dependencies (schemas, included graphics, etc.).
Can export an entire compound document as a single action, for example, to a normal file system. The exported result must reflect all the compound document's dependencies, either by exporting them as well or by rewriting the references to them to point to their location within the repository (if necessary).
Can query the repository to find all documents that are the roots of compound documents.
For a given single document, query the repository to find where it is used by other documents.
All actions that change the state of the repository should be transaction safe. For example, the import of a compound document should be an atomic action: either all the member documents are imported or none are. The system should not allow a partial import that leaves the repository in an inconsistent or incorrect state.
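
The all-or-nothing requirement in the last item can be illustrated with a minimal sketch. The `ToyRepository` class and its `import_all` method are hypothetical names invented for this example and are not part of XIRUSS-T; the sketch simply stages every change on a copy and commits only if no member document fails.

```python
class ToyRepository:
    """Hypothetical in-memory repository; not the XIRUSS-T API."""

    def __init__(self):
        self.objects = {}  # resource name -> bytes

    def import_all(self, documents):
        """Import every (name, content) pair, or none of them."""
        staged = dict(self.objects)  # stage changes on a copy
        for name, content in documents:
            if content is None:
                # Stand-in for any failure, e.g. an unreadable member file.
                raise ValueError(f"cannot read {name}")
            staged[name] = content
        self.objects = staged  # the single assignment acts as the commit


repo = ToyRepository()
repo.import_all([("root.xml", b"<doc/>")])

try:
    # The second member fails, so the first must not be imported either.
    repo.import_all([("a.xml", b"<a/>"), ("missing.xml", None)])
except ValueError:
    pass

print(sorted(repo.objects))
```

A real system would rely on database or file-system transactions rather than copying the whole store, but the observable behavior is the same: a failed import leaves the repository exactly as it was.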

These are just the cost-of-entry requirements. For a complete solution to a specific business problem you would of course need much more: task-specific workflows, full-text search, metadata management, custom user interfaces for access, integration with editors, integration with output production systems, and so on.

Now think about what it would mean to implement an import process for your compound documents. This importer would have to do the following:

For each document, determine what documents it is dependent on. In this scenario there are two types of dependencies: XInclude use-by-reference links and DTD-specific navigation links. Starting with the root document, this produces the "bounded object set" of documents that make up a single compound document or hyperdocument.
For each document determine its governing schema and whether or not that schema is already in the repository. If it is not, import the schema too (including resolving pointers to schema components referenced by the top-level schema document).
For each document, determine whether it is to be imported as an entirely new resource or if it is a new version of an existing resource.
For each document, rewrite all pointers to other files to reflect their new location as imported, not their location as they currently exist on the file system. For example, an href= on an xi:include element that pointed to a file by a relative path would have to be replaced by a pointer that would be a valid URI to that same resource as it will exist within the repository.
Finally, you have to actually copy the documents to be imported into the repository and capture all the metadata gathered: dependencies among documents, schema associations, etc.
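
Two of the steps above, computing the bounded object set and rewriting pointers, can be sketched as follows. This is a simplified illustration, not the XIRUSS-T importer: it follows only xi:include links (no DTD-specific navigation links or schema references), it reads documents from an in-memory dictionary instead of the file system, and the `repo:/` URI scheme, the sample file names, and all function names are assumptions made for the example.

```python
import posixpath
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"
XI_INCLUDE = f"{{{XI_NS}}}include"
ET.register_namespace("xi", XI_NS)  # keep the conventional xi: prefix on output


def resolve(doc_name, href):
    """Resolve a relative href against the directory of the including document."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(doc_name), href))


def bounded_object_set(root_name, files):
    """Return the names of all documents reachable from root_name via xi:include."""
    seen = set()
    pending = [root_name]
    while pending:
        name = pending.pop()
        if name in seen:
            continue  # already visited; also guards against include cycles
        seen.add(name)
        for el in ET.fromstring(files[name]).iter(XI_INCLUDE):
            href = el.get("href")
            if href:
                pending.append(resolve(name, href))
    return seen


def rewrite_includes(doc_name, files, repo_uri):
    """Return doc_name's XML with each xi:include href replaced by a repository URI."""
    root = ET.fromstring(files[doc_name])
    for el in root.iter(XI_INCLUDE):
        href = el.get("href")
        if href:
            el.set("href", repo_uri(resolve(doc_name, href)))
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data: file-system paths mapped to XML source.
files = {
    "manual/root.xml": (
        '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="ch1.xml"/>'
        '<xi:include href="shared/legal.xml"/>'
        "</book>"
    ),
    "manual/ch1.xml": (
        '<chapter xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="shared/legal.xml"/>'
        "</chapter>"
    ),
    "manual/shared/legal.xml": "<legal/>",
}


def repo_uri(path):
    # Hypothetical URI scheme, invented for this example.
    return "repo:/" + path


members = bounded_object_set("manual/root.xml", files)
rewritten = rewrite_includes("manual/root.xml", files, repo_uri)
print(sorted(members))
print(rewritten)
```

Even this toy version shows why the importer cannot be purely generic: adding the DTD-specific navigation links would require a second, document-type-specific link extractor alongside the XInclude one.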

This is clearly sophisticated processing and the above doesn't even get into issues of metadata capture, access control, workflow integration, and full-text indexing, which all further complicate the import process.

By the same token, export is essentially the reverse of import, except that it may not be necessary to instantiate dependency relationships in the exported documents because they are inherent in the data itself. (That is, the reason we capture the dependencies on import is to facilitate quick answers to the "where used?" question, but that question can always be answered by brute force simply by processing all the documents involved. Doing it at import time simply means we only have to do it once, not every time we ask the question.)
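
The trade-off described in the parenthetical can be shown with a small sketch. The dependency data and function names below are hypothetical; the brute-force function stands in for re-processing every document on each query, while the index is built once, as an importer might do when it records dependencies.

```python
from collections import defaultdict

# Hypothetical dependency data, as an importer might record it:
# each document name maps to the set of documents it points to.
dependencies = {
    "manual/root.xml": {"manual/ch1.xml", "manual/shared/legal.xml"},
    "manual/ch1.xml": {"manual/shared/legal.xml"},
    "manual/shared/legal.xml": set(),
}


def where_used_brute_force(target, dependencies):
    """Scan every document's dependencies each time the question is asked."""
    return {doc for doc, targets in dependencies.items() if target in targets}


def build_where_used_index(dependencies):
    """Invert the dependency map once so later lookups need no scan."""
    index = defaultdict(set)
    for doc, targets in dependencies.items():
        for target in targets:
            index[target].add(doc)
    return index


index = build_where_used_index(dependencies)
users = index["manual/shared/legal.xml"]
print(sorted(users))
```

Both approaches give the same answer; the index only changes when the work is done.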

In this case note that you need both standard processing (XInclude) and DTD-specific processing (your old-school navigation links). This means that while part of the import process might be generic and re-usable, part of it must, by necessity, be DTD-specific. It follows that for each different document type that you want to support in the repository you will need a custom importer. It also follows that no off-the-shelf importer will ever satisfy the import requirements of any given DTD unless that DTD uses standards exclusively for all dependency-defining pointers, which at least today is impossible because the necessary standards simply don't exist (although specifications like DocBook and DITA, and especially DITA with its sophisticated specialization mechanism, provide some hope of a solution in the not-too-distant future).

Note also that editing documents that are in the repository always involves an export followed by a re-import (even so-called hot integrations are still doing this under the covers for those content management systems that are not themselves just distributed editors [see below]).

This is all to make the point that the tasks of import and export are inherently challenging, necessarily document type and business use specific, and critical to the usability and utility of the repository.

XIRUSS-T's Content Management Principles

The XIRUSS-T system reflects the core content management design principles that I have held to for many years. In short, they are:

There is a fundamental distinction between the management of storage objects (files) and the semantic content those storage objects contain (XML, Word, text, graphics, etc.) and content management systems should always make a clear distinction between these domains.
Version and dependency management is applied to storage objects. Therefore the lowest layer in a content management system is the storage management layer, which manages storage objects as versions. A complete storage management system maintains knowledge of the dependencies among storage objects and arbitrary metadata on those storage objects.
Semantic processing (e.g., processing XML elements) is done in a layer above the storage layer, the semantic layer. It is in this layer that data is exposed or processed in terms of its fundamental abstractions, not as sequences of bytes. For example, constructing an XML DOM is done in the semantic layer, using data retrieved from the storage management layer.
Authoring support systems require indirect addressing.
Clean API boundaries among the functional components of a content management system are key to having a flexible, extensible, easy-to-integrate system.
There is no magic to XML. Any content management system that can only manage XML is fundamentally broken and misguided or is really what I call a "distributed editor" in that it is simply providing distributed access to trees of nodes (not documents as storage objects), which may be useful in certain use cases but does not make for a general-purpose content management system.

All of these principles are reflected in the design and implementation of XIRUSS-T in one way or another.
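
As a rough illustration of the first three principles, the sketch below separates a storage layer that versions opaque bytes from a semantic function that interprets those bytes as XML. The class, method, and function names are hypothetical and do not reflect the actual XIRUSS-T classes.

```python
import xml.etree.ElementTree as ET


class StorageLayer:
    """Hypothetical storage layer: versions of opaque byte sequences."""

    def __init__(self):
        self._versions = {}  # resource name -> list of bytes, oldest first

    def put(self, name, data):
        """Store data as a new version of name and return its version number."""
        history = self._versions.setdefault(name, [])
        history.append(bytes(data))
        return len(history)

    def get(self, name, version=None):
        """Return the bytes of the given version (1-based), or the latest."""
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


def root_element_name(storage, name, version=None):
    """Semantic layer: interpret stored bytes as XML and report the root tag."""
    return ET.fromstring(storage.get(name, version)).tag


storage = StorageLayer()
storage.put("logo.png", b"\x89PNG\r\n")  # non-XML data is stored the same way
storage.put("doc.xml", b"<draft/>")
storage.put("doc.xml", b"<final/>")

print(root_element_name(storage, "doc.xml"))     # latest version
print(root_element_name(storage, "doc.xml", 1))  # first version
```

The storage layer never looks inside the bytes it holds, which is why it can version a graphic as readily as an XML document; only the semantic function knows anything about XML.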
