Creating DITA topics with a blog or wiki

I've decided to conclude my explorations of alternative DITA writing approaches with a discussion of how existing blogs or wikis can be pressed into service, at least for initial content creation (often the hardest "first mile" in getting knowledge out of the minds of a company's subject matter experts and into a repurposable form). Whereas the expeDITA project demonstrates how to create a blog or wiki with DITA as native source, here I explore the other path--using common collaboration tools to create content that can be exported relatively easily as DITA.

By Don Day
Tags: migration, editing

In my previous post on creating DITA content using the wikitext-like reStructuredText, I demonstrated a command line approach to harvesting content for conversion into DITA. Here, we will be looking at the use of blog APIs via web services to access similar content and convert it directly into DITA format on the requesting server.

There is no universal API yet for all blogs or wikis, so I chose to test with a blogging application known for its simplicity, both in function and in APIs: Posterous.

The principle is straightforward: use the RESTful services documented for the Posterous blog engine to query the access requirements for a particular blogger's site, and then build a query that returns one or more of the posts as XML feeds to the calling server. There, capture the data for each post and transform it as needed into the components for reassembly as DITA concept, task, or reference topics. Finally, write the resulting data structure to a file and log it as needed for subsequent use on that system.
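To make those steps concrete, here is a minimal Python sketch of the capture-and-transform stage. It is not the code I actually used. The feed layout in SAMPLE_FEED is hypothetical (the real Posterous response format is not reproduced here), the function names are invented for illustration, and the body is reduced to plain text with a crude tag-stripping regex rather than a real HTML-to-DITA transform.

```python
import html
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical feed layout, used only for illustration.
SAMPLE_FEED = """<posts>
  <post>
    <id>101</id>
    <title>Creating DITA topics with a blog or wiki</title>
    <body>&lt;p&gt;Blogs can capture &lt;b&gt;first-draft&lt;/b&gt; content.&lt;/p&gt;</body>
  </post>
</posts>"""

DOCTYPE = ('<?xml version="1.0" encoding="UTF-8"?>\n'
           '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">\n')


def post_to_concept(post):
    """Build a DITA concept element from one <post> element."""
    topic = ET.Element("concept", id="post-" + post.findtext("id", default="unknown"))
    ET.SubElement(topic, "title").text = post.findtext("title", default="Untitled")
    body = ET.SubElement(topic, "conbody")
    raw = post.findtext("body", default="")
    # Crude assumption: drop all HTML tags and keep only the text.
    text = html.unescape(re.sub(r"<[^>]+>", "", raw)).strip()
    ET.SubElement(body, "p").text = text
    return topic


def feed_to_topics(feed_xml):
    """Return one concept element per <post> in the feed."""
    root = ET.fromstring(feed_xml)
    return [post_to_concept(post) for post in root.iter("post")]


def serialize(topic):
    """Return the topic as a DITA document string with a DOCTYPE."""
    return DOCTYPE + ET.tostring(topic, encoding="unicode")


def write_topic(topic, out_dir):
    """Write the topic to <out_dir>/<topic id>.dita and return the path."""
    path = Path(out_dir) / (topic.get("id") + ".dita")
    path.write_text(serialize(topic), encoding="utf-8")
    return path
```

Calling feed_to_topics(SAMPLE_FEED) yields one concept element per post, and write_topic saves each under a file name derived from the post id. Choosing between concept, task, and reference would need some signal from the post itself (a tag, for instance), which this sketch does not attempt.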

What can you do with this kind of process? Imagine making a single query that would retrieve all the topics corresponding to a series, such as this one on alternative DITA content creation. As each topic is created, a corresponding map entry is also created. Upon exiting that loop through the results, a process is invoked to build that map as a PDF, at the conclusion of which an email is sent to the requestor with a link to the hosted PDF file. Not only is this method able to produce an aggregated deliverable on demand, but the intermediate DITA can be edited to remove or change the order of links and thus refine a new print job request using the already-cached DITA files, all without having to touch any content directly.
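The map-building half of that loop could look something like the following Python sketch. The build_map function and the file names are hypothetical, and the result is a flat map in post order, so any hierarchy would still need to be added by hand afterwards.

```python
import xml.etree.ElementTree as ET

MAP_DOCTYPE = ('<?xml version="1.0" encoding="UTF-8"?>\n'
               '<!DOCTYPE map PUBLIC "-//OASIS//DTD DITA Map//EN" "map.dtd">\n')


def build_map(title, topics):
    """Return a flat DITA map as a string.

    topics is an iterable of (href, navtitle) pairs in post order.
    """
    ditamap = ET.Element("map")
    ET.SubElement(ditamap, "title").text = title
    for href, navtitle in topics:
        # One topicref per harvested post, in the order received.
        ET.SubElement(ditamap, "topicref", href=href, navtitle=navtitle)
    return MAP_DOCTYPE + ET.tostring(ditamap, encoding="unicode")


if __name__ == "__main__":
    print(build_map("Alternative DITA content creation",
                    [("post-101.dita", "Creating DITA topics with a blog or wiki")]))
```

Because the map is just a list of topicref elements, reordering or deleting entries is enough to refine the next print job without touching the topics. With a current DITA Open Toolkit, a saved map could then be built with a command along the lines of `dita --input=series.ditamap --format=pdf`; that command line is an assumption about a present-day setup, not what this experiment ran.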

Practical observations:

Title and major structure were fairly easy to generate, but since the body content was HTML, the transforms need to adapt to the various combinations of markup that the blog editor allows, simple as it is. Most of the issues I ran into pertained to the different ways that I had coded the content of block-like contexts, such as labeled code samples and images.
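One way to picture an adaptive transform is a lookup table from HTML elements to their nearest DITA equivalents, applied by a forgiving parser. The Python sketch below is an illustration under my own assumptions, not the transform I used: the TAG_MAP choices and the names HtmlToDita and html_to_dita are invented here, unknown tags are simply dropped while their text is kept, and nothing guarantees valid DITA if the incoming HTML is badly nested.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumed HTML-to-DITA element mapping; extend it as new markup shows up.
TAG_MAP = {
    "p": "p", "b": "b", "strong": "b", "i": "i", "em": "i",
    "ul": "ul", "ol": "ol", "li": "li",
    "pre": "codeblock", "code": "codeph", "blockquote": "lq",
}


class HtmlToDita(HTMLParser):
    """Collect DITA markup for the HTML fed to the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Images become empty <image> elements pointing at the same URL.
            self.out.append("<image href=%s/>" % quoteattr(attrs.get("src") or ""))
        elif tag == "a":
            self.out.append('<xref href=%s scope="external" format="html">'
                            % quoteattr(attrs.get("href") or ""))
        elif tag in TAG_MAP:
            self.out.append("<%s>" % TAG_MAP[tag])
        # Any other tag is dropped; its text still passes through.

    def handle_endtag(self, tag):
        if tag == "a":
            self.out.append("</xref>")
        elif tag in TAG_MAP:
            self.out.append("</%s>" % TAG_MAP[tag])

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_dita(fragment):
    """Convert an HTML fragment to a string of DITA body markup."""
    parser = HtmlToDita()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

The block-like cases that caused me the most trouble, labeled code samples and images, are exactly where a table like this runs out: a label next to a pre element has no single HTML convention, so each variation needs its own rule.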

Could the process support round-tripping, as in using a blog like Posterous as a review system for content authored in a more robust information development setting? With my practical hat on, I'd have to say, No, not really--the best role for this capability is to let non-ID contributors such as Subject Matter Experts in a company write freely about whatever is on their minds, and then to harvest the content as DITA periodically for a one-time migration into the more robust content management path. Without DITA or some other structured XML as the core source format on the collaboration engine, I doubt that round-tripping could scale if content complexity requirements grew to the cross-enterprise level.

That is not to say it cannot be done. Lisa Dyer and Anne Gentle documented a process developed by Lombardi (prior to that company's acquisition by IBM) that supports a limited update cycle for harvested DITA (see DITA and wiki hybrids for an insightful review and a link to a white paper on the process).

I'd include the code, but this effort was more of a hack that I prefer not to support; however, I'll provide it on request. What I CAN give you is a PDF generated by DITA Open Toolkit from a query to the Posterous blog that mirrors this WordPress blog. The only editing was to image URLs and to the hierarchy of the otherwise flat DITA map that is generated from the sequential nature of blog posts.

Download PDF: Sample DITA request from blog