Thursday, January 20, 2011

A no-risk proposition

Open source used to evoke the scrunching of executive eyebrows and a patronizing "it's still young and risky" comment. Things have certainly changed: the open source "movement" has proven its maturity and is fulfilling its potential.

So why are gov't executives still rolling their eyes and tensing up at the mere mention of "open source"? Government is, in so many cases, the proverbial "late adopter". Well, better late than never. Government development and open source are a natural fit. The transparency promise (a promise of virtually every government administration) presupposes that software developed in government IT shops is a community resource ... like any other government asset, it is public property.

This is the bottom line of what we are trying to achieve here with our Sharegov initiative - to align the themes firmly embedded in the philosophy of public service (openness, low cost to the public, objectivity, public participation) with our software development practice. Long overdue. And substantially less risky than permanently committing to some expensive software product only because a charming sales engineer from a big company swept us off our feet during a PowerPoint presentation.

We dare to suggest a little formula that will help government developers start sharing more and form a dynamic, energetic community: a little bit of extra, focused effort a couple of times a week! Just like exercise - start with regularity and maintain discipline - working out will pay off in overall health, stamina, nimbleness... in our case, for our government organizations and our e-government applications and projects. We realize no one will give us extra time and money to kick off a gov't app development "co-op", so we will give our own time and effort to get things going. We encourage you to do so too - by commenting on our blogs, taking a look at the applications we're making available, and by sharing your issues, projects, and your own applications. Please don't let inertia and organizational fatigue take over. We're fighting them every day ourselves.


Started looking at the problem of storing OWL instances in an RDBMS:
  1. If an OWL instance is to be persisted in a relational database, everything about it should be persisted there. It should be completely recoverable from the RDBMS.
  2. It would be impossible to create a classic ER schema to manage OWL instances, since many properties can be very dynamic. So there should be a generic schema able to store any OWL individual.
  3. On the other hand, we'd like to use some of the SQL capabilities for querying large data sets, data warehousing, and business analytics, so an actual RDBMS schema would be useful in practice.
  4. Hence, an OWL individual is potentially stored in two portions: a generic "set of assertions" portion that is disjoint from a fixed "entity with attributes" SQL table portion.
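The split described in point 4 can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the property names and the notion of a `FIXED_COLUMNS` set are hypothetical, standing in for whatever metadata decides which properties map to columns of a fixed entity table.

```python
# Hypothetical sketch: partition one individual's assertions into the
# portion mapped to fixed SQL columns and the generic triple-store remainder.
# FIXED_COLUMNS and the sample property names are illustrative assumptions.
FIXED_COLUMNS = {"hasStatus", "hasCreatedDate"}

def split_assertions(assertions):
    """Split (property, value) pairs into a fixed-column dict and a
    generic assertion list; together they fully describe the individual."""
    fixed, generic = {}, []
    for prop, value in assertions:
        if prop in FIXED_COLUMNS:
            fixed[prop] = value            # row in the entity table
        else:
            generic.append((prop, value))  # row in the generic schema
    return fixed, generic

sample = [
    ("hasStatus", "OPEN"),
    ("hasCreatedDate", "2011-01-20"),
    ("hasComplaintText", "Garbage not collected on Main St."),
]
fixed, generic = split_assertions(sample)
```

Note that the two portions are disjoint, so reassembling the individual is just the union of the entity row and the generic assertions.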
Some resources I found that have to do with the problem of OWL <-> RDBMS:
Relation.OWL extracts OWL from DB schemas:
Mapping between Relational Databases and OWL:
Mapping between Relational Database Schema and OWL Ontology for Deep Annotation:
Maintaining Mappings between Conceptual Models and Relational Schemas:
Resource space model, OWL and database: Mapping and integration :
There is also the converse problem of dealing with legacy database schemas, and some of the resources above talk about that. There are OntoBase and some plugins for older Protege versions that transform an RDBMS schema into an ontology. That will be useful if we need to write code within our semantic dev framework that talks to legacy databases (which is likely to be the case).
In meta terms, however, the relational (i.e. entity-relationship) meta model needs to be described in the ontology with concepts such as "Database_Schema", "Table", "Column", "Database_Type", etc. I didn't come across anything already done in that area; it seems trivial, so we may just do it from scratch.
In any event, we've stipulated in the initial architecture that we'd like the ability to keep older versions of everything. That doesn't necessarily mean that everything should be automatically versioned; the decision should be made depending on metadata about the particular individual.
So the simplest thing to do is to start with a generic RDBMS schema for storing arbitrary OWL individuals and think about optimizations from there. That schema should already support versioning. The next step would be mapping (a portion of) the information about OWL individuals to RDBMS entities (tables) and specifying the abstract rules that govern what goes where. Those rules should work for all four CRUD operations.
Business Object Ontologies
The schema for storing individuals doesn't need to store the full ontology. It doesn't need class or property information, as those remain in the metadata database. The information about each individual is encapsulated in an ontology of its own, a mini-ontology so to speak, that contains only the relevant assertions pertaining to it. We establish a naming schema for those "mini ontologies" that has the following form:
'bo' is short for "business object"
main-type-iri=The IRI of the unique class this individual is explicitly declared to be an instance of.
object-id=The unique integer identifier of this business object within the scope of its main type.
For example, the ontology describing an individual of the service type Garbage_Complaint that has id 67 will be:
The IRI of the individual itself within that ontology is formed by appending the string '#bo' to the ontology IRI. In other words, the individual has a local name 'bo' within its own ontology:
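The construction can be sketched as follows. The base prefix `http://example.org/bo` and the use of the type's short name in place of its full IRI are assumptions for illustration; the post does not give the literal IRI template.

```python
# Hypothetical sketch of the business-object ("bo") naming scheme.
# The base prefix is an illustrative assumption, not the real one.
def bo_ontology_iri(main_type: str, object_id: int,
                    base: str = "http://example.org/bo") -> str:
    """Form the mini-ontology IRI from the main type and the unique
    integer id of the business object within that type's scope."""
    return f"{base}/{main_type}/{object_id}"

def bo_individual_iri(ontology_iri: str) -> str:
    """The individual has local name 'bo' within its own ontology."""
    return ontology_iri + "#bo"
```

So for the Garbage_Complaint example with id 67, the individual's IRI under this assumed base would end in `Garbage_Complaint/67#bo`.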
The RDBMS schema
  • Needs a dedicated table mapping IRIs <-> numerical IDs (NUMBER(19)) - IRI_TABLE. This will make the other tables smaller and more compact, and since most of the IRIs deal with meta-data, caching could be used to avoid consulting IRI_TABLE. An important advantage of this table is renaming: the IRI of an object can be changed without having to modify all its references everywhere.
  • Seems better to separate object properties and data properties into two different tables, since object properties have a fixed form/size: they are just triplets of integer identifiers, while data properties have to deal with data as big as blobs or arbitrarily long strings.
  • The object property table, OWL_OBJECT_PROPERTIES, has the whole row (triplet) as its primary key. But that's infeasible for the OWL_DATA_PROPERTY table because of potentially big data values, so each data property assertion will have its own unique ID.
  • Naturally, we can use various indexes to improve lookup when those tables get large.
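The bullets above can be condensed into a small schema sketch. The table names follow the text; the exact column names, types, and the single index shown are assumptions (the post only says "various indexes"), and SQLite stands in for whatever RDBMS is actually used.

```python
# Minimal sqlite3 sketch of the schema outlined above; column names,
# types, and the index are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IRI_TABLE (
    ID  INTEGER PRIMARY KEY,       -- NUMBER(19) in the text
    IRI TEXT UNIQUE NOT NULL
);
CREATE TABLE OWL_OBJECT_PROPERTIES (
    SUBJECT   INTEGER NOT NULL REFERENCES IRI_TABLE(ID),
    PREDICATE INTEGER NOT NULL REFERENCES IRI_TABLE(ID),
    OBJECT    INTEGER NOT NULL REFERENCES IRI_TABLE(ID),
    PRIMARY KEY (SUBJECT, PREDICATE, OBJECT)   -- whole triplet is the key
);
CREATE TABLE OWL_DATA_PROPERTY (
    ID        INTEGER PRIMARY KEY, -- own unique id: values may be blobs
    SUBJECT   INTEGER NOT NULL REFERENCES IRI_TABLE(ID),
    PREDICATE INTEGER NOT NULL REFERENCES IRI_TABLE(ID),
    VALUE     BLOB
);
CREATE INDEX IDX_OBJ_SUBJECT ON OWL_OBJECT_PROPERTIES(SUBJECT);
""")

def iri_id(iri: str) -> int:
    """Look up an IRI's numeric id, inserting it on first use. Renaming
    an object then means updating this one row, not every reference."""
    conn.execute("INSERT OR IGNORE INTO IRI_TABLE(IRI) VALUES (?)", (iri,))
    return conn.execute("SELECT ID FROM IRI_TABLE WHERE IRI = ?",
                        (iri,)).fetchone()[0]
```

An object property assertion is then just three calls to `iri_id` and one insert of the resulting integer triplet.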
There are two strategies for maintaining versions of data: real time and logical time. By logical time I mean incrementing version numbers for each piece of data. That logical clock can be global, or it can be a separate sequence for each type of data. However, since one would clearly want a snapshot of several related objects at a particular point in time, the clock must be global. Using real time as the clock has the advantage of allowing queries by specific time ranges. On the other hand, real time can be problematic with unsynchronized computers.
One way to maintain versions is to augment each table with columns carrying version information:
[attr1, attr2, ..., version_number]
where the version number could be a timestamp or some sort of logical clock. To get the latest version, one has to do a 'max' on the version column. To get a snapshot of a set of related objects "as of version N", one needs to get, for each object, the max version <= N. Again, a lot of aggregation, and the queries are not very simple. An alternative schema is to store pairs of timestamps:
[attr1, attr2, ..., from, to]
where 'from' and 'to' indicate the interval of the clock during which that particular version is valid. A null value in the 'to' column (or the maximal possible number) indicates that this is the latest version. Thus the latest version is found more directly, and it's also easier to retrieve an object "as of" a particular time (time >= from and time < to).
The second schema takes up more space, but querying is faster, so we will use it.
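Here is a small demonstration of the [from, to] interval schema with a global logical clock. The table and column names (`FROM_T`/`TO_T`, since `FROM` and `TO` are SQL keywords) and the single-table layout are illustrative assumptions; SQLite again stands in for the real database.

```python
# Sketch of interval-based versioning: each write closes the current
# open interval and opens a new one. Names are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE VERSIONED_ASSERTION (
    SUBJECT TEXT, ATTR TEXT, VALUE TEXT,
    FROM_T  INTEGER NOT NULL,   -- clock tick when this version starts
    TO_T    INTEGER             -- NULL means 'latest version'
)""")

def set_value(subject, attr, value, now):
    """Close the current version, if any, and open a new interval."""
    conn.execute("""UPDATE VERSIONED_ASSERTION SET TO_T = ?
                    WHERE SUBJECT = ? AND ATTR = ? AND TO_T IS NULL""",
                 (now, subject, attr))
    conn.execute("INSERT INTO VERSIONED_ASSERTION VALUES (?,?,?,?,NULL)",
                 (subject, attr, value, now))

def value_as_of(subject, attr, t):
    """Retrieve the value valid at tick t: FROM_T <= t < TO_T.
    No 'max' aggregation needed - the interval pins down one row."""
    row = conn.execute("""SELECT VALUE FROM VERSIONED_ASSERTION
                          WHERE SUBJECT = ? AND ATTR = ? AND FROM_T <= ?
                            AND (TO_T IS NULL OR TO_T > ?)""",
                       (subject, attr, t, t)).fetchone()
    return row[0] if row else None

set_value("bo67", "hasStatus", "OPEN", 1)
set_value("bo67", "hasStatus", "CLOSED", 5)
```

The "as of" lookup is a plain range predicate rather than a per-object `max`, which is exactly the querying advantage noted above.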