SourceForge.net Logo



What's New

New package released 30 June 2004: xiruss.0.1_b.zip

This release of XIRUSS is just barely functional enough to be demonstrable. I am releasing it primarily to make the built-in DITA support available to the DITA community and in particular the OASIS DITA Technical Committee.

NOTE: This release is not anywhere near complete in terms of the functionality that even a minimally-useful repository would need. In particular it is essentially a read-only repository in that once the repository is started and initialized there is no way to add more documents to it. Therefore this version simply demonstrates the basic principles of compound document import but does not yet act as an interactive content management repository.

Java programmers who want to experiment with importing their own files can do so by modifying the populateRepository() method in the XmlImporterTest class.

While it is only a read-only repository, because it provides a standard HTTP server and supports standard HTTP-based URLs for version access, you can access versions stored in the repository from any tool that supports generic HTTP access. This includes all XSLT engines, most, if not all, XML editors, and so on. For example, because XIRUSS implements an XSLT importer, you can store both XML documents and the XSLTs that operate on them in the repository and then apply those style sheets to those documents using standard command-line syntax for calling your XSLT processor. The only constraint is that the XSLT script must set the URL base in order to correctly resolve relative URLs back into the repository. The sample XSLT file test/com/innodata/resources/xslts/html_test.xsl demonstrates how this can be done.
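As a rough illustration of applying a stylesheet to a document when both are addressed by URL, here is a minimal sketch using the standard javax.xml.transform API. The class and method names are hypothetical and are not part of XIRUSS-T, and no repository URLs are assumed: you pass your own on the command line. Building a StreamSource from a URL string sets that URL as the source's system ID, which gives the processor a base for resolving relative references. This is only one way to supply a base URL and is not necessarily what html_test.xsl does.

```java
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RepositoryTransformSketch {

    /** Applies the stylesheet to the document and returns the serialized result. */
    public static String transform(Source document, Source stylesheet)
            throws TransformerException {
        Transformer transformer =
                TransformerFactory.newInstance().newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(document, new StreamResult(out));
        return out.toString();
    }

    /**
     * Builds a Source from a URL string. The URL becomes the system ID,
     * so relative references are resolved against it.
     */
    public static Source fromUrl(String url) {
        return new StreamSource(url);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: RepositoryTransformSketch <document-url> <stylesheet-url>");
            return;
        }
        // Both arguments are URLs you supply; none are assumed here.
        System.out.println(transform(fromUrl(args[0]), fromUrl(args[1])));
    }
}
```

Because the sketch relies only on generic URL access, the same code works whether the URLs point at the repository's HTTP server or at ordinary files.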


The Xinclude-based Re-Use Support System, Toy (XIRUSS-T) is an experimental system intended to both demonstrate basic techniques in managing compound documents and provide a sandbox for exploring various content management and link management problems and their solutions. XIRUSS-T, as an explicitly toy system, is intended only for demonstration and educational purposes. It intentionally avoids implementation characteristics that would make it perform and scale. This is both to keep the implementation as simple as possible and to ensure that it is not useful for production purposes. The XIRUSS-T system is not intended to compete with any existing or future production-capable content management system. Rather it is hoped that XIRUSS-T will be a positive and productive influence on the production systems through the demonstration of sound content management principles and techniques.

The XIRUSS-T system focuses on two key areas of content management:

  1. The generic management of versioned hyperdocuments as collections of storage objects. That is, the management of interconnected systems of files irrespective of their data types, source formats, and so on.

    The XIRUSS-T system can manage files of any type, not just XML, although it has the most built-in knowledge of XML as a fundamental data type. But it could just as easily be adapted to understand Microsoft Word documents, FrameMaker documents, or whatever you have.

  2. Support for modular, flexible, and easily extensible import and export processes. As explained below, most of the pain of content management is in import and export, especially when you accept that integration with editors always requires an export followed by an import and that, for XML, each non-trivial document type will require its own specialized importer and exporter.

Other issues, such as workflow, full-text search, and metadata management, are also important, but the two issues listed above are the fundamental features that a content management system must get right. If it does not, it will either be incapable of satisfying the remaining business requirements or be so expensive to adapt to specific use cases that it is prohibitive compared with other solutions. Of course, the XIRUSS-T system would be a solid foundation on which to experiment with approaches to satisfying these other requirements, but they are not the immediate focus of the XIRUSS-T project.

Through our work at Innodata Isogen as systems integrators attempting to design, build, and implement content management systems that manage complex hyperdocuments with sophisticated and challenging versioning requirements, we have realized that few, if any, off-the-shelf systems are fully capable of satisfying these requirements, because they have architectural limitations or implementation flaws, or because they are too expensive to adapt and maintain.

Out of our experience we also developed a simple but powerful abstract model for managing versioned hyperdocuments, which we named SnapCM, for Snapshot-based Configuration Management. We have, in the past, implemented SnapCM-based systems, proving that the model is sound and practical. In fact we developed a production-quality SnapCM-based system, code-named Bonnell, that satisfied all our requirements as integrators and met the performance and scalability requirements of the systems we were integrating it with. Unfortunately, historical events unfolded to make the Bonnell system unavailable, due to complications of ownership, choices of implementation languages, and so on. But the SnapCM model survives as a proven, freely available abstract approach to version management that can be applied to any content management system regardless of implementing technology or underlying architecture.

Out of this experience I came to the XIRUSS project with the following:

This means that I will always either be dependent on third parties to develop products that meet my requirements or build systems from scratch as one-off works for hire again and again (which is good money but not really a service to the community over the long term).

Therefore I have developed XIRUSS-T as a way to help people, both users of content management systems and the builders of content management systems, understand the following key facts:

Thus the XIRUSS-T system has two primary purposes. The first is to demonstrate how easy it is to implement the SnapCM model and how effective it is as a basis for implementing sophisticated versioned hyperdocument management features, primarily as required by documents that use XInclude and similar systems to create link-based compound documents. The second is to demonstrate how a flexible and extensible importer and exporter framework can be designed and implemented.

The XIRUSS-T system has been implemented quickly, ignoring most issues of scalability and performance, keeping the focus entirely on the general data models, framework design, and core algorithms needed to implement SnapCM and the import and export functionality and characteristics I want and need as an integrator. For example, I spent maybe a total of eight hours implementing the core SnapCM classes, which are a literal reflection of the abstract SnapCM model, as defined in the SnapCM paper (http://www.isogen.com/papers/snapCM.pdf). The current implementation leaves out many important things, including implementation of the Sync class, but the code that's there indicates that the SnapCM model is fundamentally simple. Given an implementation of Sync, everything else would be about performance and scalability optimization, not core processing semantics.

I have also ignored all issues of access control and security, again to keep the focus on the core data management functions and to keep the code simple. Access control techniques are well understood and there's no need for me to demonstrate that I know what they are. And it helps ensure that XIRUSS-T is not itself useful for production purposes.

The next section explains in more detail the challenges inherent in import and export.

Import and Export Are the Hard Problems

The management of complex systems of linked documents is inherently challenging and therefore the XIRUSS-T system provides a wide range of functionality. However, the primary focus of the system is on the challenges of document import and export. This is because, given an appropriate versioning model, the management of documents once they are within a repository is relatively easy (ignoring issues of performance and scale).

Almost all the complexity of content management is concentrated at the boundary between the repository and the outside world. That is, the task of importing systems of interconnected documents into the repository in a way that then enables their management as hyperdocuments requires quite sophisticated processing. This processing includes handling standards-defined semantics such as XInclude's use-by-reference, schema associations, and so on, as well as document-type-specific semantics and business-use-specific semantics.

For example, imagine your enterprise has developed its own sophisticated document type to manage the creation and authoring of technical documents. This document type uses XInclude to represent use-by-reference links and uses its own purpose-built element types for representing navigational links (possibly reflecting some ancient legacy markup that predated standards like XLink by 10 or 15 years). In this system, a typical user manual for a single product might be composed of several hundred or thousand files, all linked together, managed through complex workflows, and so on. Such a manual is a "compound document" (to use XIRUSS's terminology) in that it is a single "logical" document composed, through XInclude links, of many individual XML document instances.
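To make the idea of a compound document concrete, here is a minimal sketch of how an importer might discover the XInclude links in one document instance. It uses only the standard JAXP DOM API and the XInclude namespace; the class name, method name, and sample file names are hypothetical and are not taken from the XIRUSS-T code. XInclude processing is deliberately left off (the parser default) so that the include elements remain in the tree and can be inspected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XIncludeScanSketch {

    /** Namespace of the W3C XInclude Recommendation. */
    static final String XINCLUDE_NS = "http://www.w3.org/2001/XInclude";

    /** Returns the href of every xi:include element, in document order. */
    public static List<String> includeTargets(InputSource source) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match elements by namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(source);

        NodeList includes = doc.getElementsByTagNameNS(XINCLUDE_NS, "include");
        List<String> targets = new ArrayList<>();
        for (int i = 0; i < includes.getLength(); i++) {
            Element include = (Element) includes.item(i);
            if (include.hasAttribute("href")) {
                targets.add(include.getAttribute("href"));
            }
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical manual that pulls in two chapters by reference.
        String manual = "<manual xmlns:xi='http://www.w3.org/2001/XInclude'>"
                + "<xi:include href='chapters/intro.xml'/>"
                + "<xi:include href='chapters/install.xml'/>"
                + "</manual>";
        System.out.println(includeTargets(new InputSource(new StringReader(manual))));
    }
}
```

A real importer would go on to resolve each href against the including document's location and repeat the scan on each target. The purpose-built navigational links described above would need their own, document-type-specific scan.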

You want and need a content management system that will enable the management of these complex documents in sophisticated ways. At a minimum you need to satisfy the following requirements:

These are just the cost-of-entry requirements. For a complete solution to a specific business problem you would of course need much more: task-specific workflows, full-text search, metadata management, custom user interfaces for access, integration with editors, integration with output production systems, and so on.

Now think about what it would mean to implement an import process for your compound documents. This importer would have to do the following:

This is clearly sophisticated processing and the above doesn't even get into issues of metadata capture, access control, workflow integration, and full-text indexing, which all further complicate the import process.

By the same token, export is essentially the reverse of import, except that it may not be necessary to instantiate dependency relationships in the exported documents because they are inherent in the data itself. (That is, the reason we capture the dependencies on import is to facilitate quick answers to the "where used?" question, but that question can always be answered by brute force simply by processing all the documents involved. Doing it at import time simply means we only have to do it once, not every time we ask the question.)
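The trade-off described above, recording dependencies once at import time rather than re-scanning every document for each "where used?" query, can be sketched with a simple reverse index. This is an illustrative sketch only: the class, its methods, and the file names are hypothetical and do not reflect how XIRUSS-T or SnapCM actually represent dependencies.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class WhereUsedIndexSketch {

    // Maps each target document to the set of documents that point at it.
    private final Map<String, Set<String>> usersByTarget = new HashMap<>();

    /** Called by an importer each time it finds a link from user to target. */
    public void recordDependency(String user, String target) {
        usersByTarget.computeIfAbsent(target, k -> new TreeSet<>()).add(user);
    }

    /** Answers "where used?" with a single lookup instead of a full scan. */
    public Set<String> whereUsed(String target) {
        return Collections.unmodifiableSet(
                usersByTarget.getOrDefault(target, Collections.emptySet()));
    }

    public static void main(String[] args) {
        WhereUsedIndexSketch index = new WhereUsedIndexSketch();
        // Hypothetical documents that both include the same chapter.
        index.recordDependency("manual.xml", "intro.xml");
        index.recordDependency("quickstart.xml", "intro.xml");
        System.out.println(index.whereUsed("intro.xml"));
    }
}
```

The sketch ignores versions entirely. In a versioned repository the index would have to be keyed by version rather than by file name.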

In this case note that you need both standard processing (XInclude) and DTD-specific processing (your old-school navigational links). This means that while part of the import process might be generic and re-usable, part of it must, by necessity, be DTD-specific. This means that for each different document type that you want to support in the repository you will need a custom importer. This also means that no off-the-shelf importer will ever satisfy the import requirements of any given DTD unless that DTD uses standards exclusively for all dependency-defining pointers, which at least today is impossible because the necessary standards simply don't exist (although specifications like DocBook and DITA, and especially DITA with its sophisticated specialization mechanism, provide some hope of a solution in the not-too-distant future).
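One way to picture the split between generic and DTD-specific import processing is a registry that maps each document type to its importer and falls back to a generic importer for types it does not know. The sketch below is purely illustrative: the Importer interface, the "techdoc" document type, and the link-kind names are all hypothetical and are not the XIRUSS-T importer framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ImporterRegistrySketch {

    /** Hypothetical importer contract: which kinds of links it can record. */
    public interface Importer {
        List<String> handledLinkKinds();
    }

    /** Generic, re-usable part: understands only standards-defined links. */
    public static final Importer GENERIC = () -> Collections.singletonList("xinclude");

    /** Builds a document-type-specific importer on top of an existing one. */
    public static Importer extending(Importer base, String... extraKinds) {
        return () -> {
            List<String> kinds = new ArrayList<>(base.handledLinkKinds());
            kinds.addAll(Arrays.asList(extraKinds));
            return kinds;
        };
    }

    private final Map<String, Importer> byDocumentType = new HashMap<>();

    public void register(String documentType, Importer importer) {
        byDocumentType.put(documentType, importer);
    }

    /** Unknown document types get only the generic processing. */
    public Importer importerFor(String documentType) {
        return byDocumentType.getOrDefault(documentType, GENERIC);
    }

    public static void main(String[] args) {
        ImporterRegistrySketch registry = new ImporterRegistrySketch();
        registry.register("techdoc", extending(GENERIC, "legacy-nav-link"));
        System.out.println(registry.importerFor("techdoc").handledLinkKinds());
        System.out.println(registry.importerFor("unknown").handledLinkKinds());
    }
}
```

The design choice worth noting is composition: the custom importer re-uses the generic XInclude handling and adds only what its document type requires, so the cost of each new document type is limited to its non-standard links.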

Note also that editing documents that are in the repository always involves an export followed by a re-import (even so-called hot integrations are still doing this under the covers for those content management systems that are not themselves just distributed editors [see below]).

This is all to make the point that the tasks of import and export are inherently challenging, necessarily document type and business use specific, and critical to the usability and utility of the repository.

XIRUSS-T's Content Management Principles

The XIRUSS-T system reflects the core content management design principles that I have held to for many years. In short, they are:

All of these principles are reflected in the design and implementation of XIRUSS-T in one way or another.

Running the XIRUSS-T Server

The class com.innodata.xiruss.JettyXirussRunner's main() method will populate a repository and then start an HTTP server that you can then use to browse the repository or access versions from any HTTP-capable tool.

I have not yet created an all-in-one jar that you can just run, so to run the class you need to do one of three things:

Having started the server by one of these methods you can then access it from a Web browser using the URL "http://localhost:9090/". This will bring up the initial repository access page. From here you can navigate into the repository as a "user" would or you can use the Repository dump to see all the details of the repository contents. Note that this is a read-only view of the repository: there is currently no way to create new versions through the Web because I haven't yet implemented support for the HTTP PUT or POST methods. This is next up on my to-do list.
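Because the server speaks plain HTTP, a program can read from it the same way a browser does. The following is a minimal sketch that fetches the content behind a URL using only standard Java classes. The class and method names are hypothetical; the only URL taken from this page is the root address above, and the URL layout for individual versions is not assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class RepositoryFetchSketch {

    /** Reads everything behind the URL and decodes it as UTF-8 text. */
    public static String fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // Defaults to the repository root page; pass another URL to override.
        String url = args.length > 0 ? args[0] : "http://localhost:9090/";
        try {
            System.out.println(fetch(url));
        } catch (IOException e) {
            System.out.println("Could not read " + url + ": " + e.getMessage());
        }
    }
}
```

The sketch assumes the content is UTF-8 text, which may not hold for every storage object in a repository that can manage files of any type.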

Minimally-Required Features Yet To Be Implemented

In order for a content management system like XIRUSS-T to be minimally useful it needs to support the following functions:

Other nice-to-have features that would help show how the basic XIRUSS-T features would be integrated with a complete task-specific system include:

Limitations on the Use of XIRUSS-T

The XIRUSS-T system is a demonstration system intended for purely educational and experimental purposes. It is not in any way intended to be used for any production purposes.

XIRUSS-T is licensed under the GNU General Public License (GPL), which means that the XIRUSS-T code, whether compiled or in source form, may only be used with systems that are themselves licensed as GPL open source. In particular, the compiled XIRUSS-T library MAY NOT be used with any non-open-source systems, commercial or otherwise.

While the XIRUSS-T source code is restricted in how it may be used, the ideas embodied in and reflected by the XIRUSS-T source are not. Developers are free to use any of the concepts or approaches, architectural or implementation, illustrated by XIRUSS-T in any way they find productive, as long as the source for those ideas is credited. Likewise, developers are encouraged to use the concepts behind the SnapCM versioning model in their own systems if they so desire, including implementing systems that reflect the abstract SnapCM model. As an abstract idea, SnapCM is not usefully protectable in any way and in any case the ideas have been published. For us as systems integrators, the value of SnapCM is in providing a path to solutions to challenging versioning and link management problems. Therefore, we want people to use this model in their products and tools because we think it is the most effective way to solve these problems and we, as integrators, want solutions to them.

Note that the actual task of implementing SnapCM requires only standard computer science techniques and algorithms and therefore no SnapCM implementation is likely to be itself patentable, at least as far as the implementation of SnapCM's data model and basic semantics is concerned. It is always possible that an implementation might reflect unique and innovative optimization or scalability techniques that are themselves patentable, but those techniques would almost certainly be generally applicable techniques that could be applied to data models and semantics other than SnapCM, meaning that any patents resulting from those techniques would be on those techniques, not on SnapCM in particular. That is, it is almost certainly not possible to implement a patented SnapCM system that limits the ability of others to implement SnapCM.
