Thursday, January 20, 2011

OWL to RDBMs

Started looking at the problem of storing OWL instances to RDBMs:
  1. If an OWL instance is to be persisted in a relational database, everything about it should be persisted there. It should be completely recoverable from the RDBMs.
  2. It would be impossible to create a classic ER schema to manage OWL instances since a lot of properties can be very dynamic. So there should be a generic schema to store any OWL individual.
  3. On the other hand, we'd like to use some of the SQL capabilities for querying large data sets, data warehousing and business analytics, so an actual RDBMs schema would be useful in practice.
  4. Hence, an OWL individual is potentially stored in two portions: a generic "set of assertions" portion that is disjoint from a "fixed entity with attributes SQL table" portion.
Some resources I found that have to do with the problem of OWL <-> RDBMS:
RELATIONAL DATABASE SCHEMA TO ONTOLOGY MAPPING APPROACHES:http://www.unbsj.ca/sase/csas/data/awoss2-2010/AWOSS_Yassaman%20Zand-Moghaddam.pdf
Relation.OWL extracts OWL from DB schemas: http://sourceforge.net/projects/relational-owl/
Mapping between Relational Databases and OwL: http://www.lu.lv/materiali/apgads/raksti/756_pp_99-117.pdf
Mapping between Relational Database Schema and OWL Ontology for Deep Annotation: http://portal.acm.org/citation.cfm?id=1249215&CFID=4471127&CFTOKEN=65330774
Maintaining Mappings between Conceptual Models and Relational Schemas: http://www.ischool.drexel.edu/faculty/yan/publications/JDM-21(3)-36-68-2010.pdf
Resource space model, OWL and database: Mapping and integration : http://www.knowledgegrid.net/~H.zhuge/data/ACM-TOIT-Zhuge-final.pdf
There is also the converse problem of dealing with legacy database schemas and some of the resources above talk about that. There is OntoBase and some plugins for older Protege that would transform an RDBMs schema into an ontology. That will be useful if we need to write code within our semantic dev framework that talks to legacy databases (which is likely to be the case).
In meta terms however, the relational (i.e. entity relationship) meta model needs to be described in the ontology with concepts such as "Database_Schema", "Table", "Column", "Database_Type" type etc. I didn't come across to anything already done in that area, it seems trivial so we may just do it from scratch.
In any event, we've stipulated in the initial architecture that we'd like the ability to keep older versions of everything. That doesn't necessarily mean that everything should be automatically versioned. The decision should be made depending on meta data about the particular individual.
So the simplest thing to do is to start with a generic RDBMs schema for storing arbitrary OWL individuals and think about optimizations from there. That schema should already support versioning. The next step would be mapping of (portion of the information about) OWL individuals to RDBMs entities (tables) and specifying the abstract rules that govern what goes where. Those rules should work for all 4 CRUD operations.
Business Object Ontologies
The schema for storing individuals doesn't need to also store the full ontology. It doesn't need class or property information as those remain in the metadata database. The information about each individual is encapsulated in an ontology of its own, a mini-ontology so to speak, the contains only the relevant assertions pertaining to it. We establish a naming schema for those "mini ontologies" that has the following form:
{base-iri}/bo/{main-type-iri}/{object-id}
where
base-iri=http://www.miamidade.gov
'bo' is short for "business object"
main-type-iri=The IRI of the unique class this individual is explicitly declared to be an instance of.
object-id=The unique integer identifier of this business object within the scope of its main type.
For example, the ontology describing an individual of a service type Garbage_Complaint and that has an id 67 will be:
http://www.miamidade.gov/bo/Garbage_Complaint/67
The IRI of the individual itself within that ontology is formed by appending the string '#bo' to the ontology IRI. In other words, the individual has a local name 'bo' within its own ontology:
http://www.miamidade.gov/bo/Garbage_Complaint/67#bo
The RDBMs schema
  • Needs a dedicated table mapping IRI <-> numerical IDs (NUMBER(19)) - IRI_TABLE. This will make the other tables smaller, more compact, and since most of the IRIs deal with meta-data, caching could be used to avoid consulting IRI_TABLE. An important advantage of this table is renaming: the IRI of an object can be changed with having to modify all its references everywhere.
  • Seems better to separate the object properties and data properties in two different tables, since object properties have a fixed form/size: they are just triplets of integer identifiers, while data properties have to deal with data as big as blobs, or arbitrary long strings.
  • The object property table, OWL_OBJECT_PROPERTIES has the whole row (triplet) as a primary key. But the that's unfeasiable for the OWL_DATA_PROPERTY table because potentially big data values, so each data property assertion will have its own unique ID.
  • Naturally, we can use various indexes to improve lookup when those tables get large.
Versioning
There are two strategies to maintain versions of data: real time and logical time. By logical time I mean incrementing version numbers for each piece of data. That logical clock can be global or it can be a separate sequence for each type of data. However, clearly, one would want a snapshot of several related objects at a particular point in time so the clock must be global. Using real time as the clock has the advantage of allowing queries by specific time ranges. On the other hand, real time can be problematic with unsynchronized computers.
One way to maintain versions is to augment each table with columns carrying version information:
[attr1, attr2, ..., version_number]
where version number could be a timestamp or some sort of a logical clock. To get the latest version, one would have to do a 'max' on the version column. To get a snapshot of a set of related objects "as of version N", one needs to get, for each object, the max version <= N. Again, a lot of aggregation, and the queries are not very simple. An alternative schema is to store pairs of timestamps:
[attr1, attr2, ..., from, to]
where the from and to indicate an interval of the time clock where that particular version is valid. A null value of the 'to' column (or the maximal possible number) indicates that this is the latest version. Thus the latest version is found more directly and it's also easier to retrieve object "as of" a particular time (time >= from and time < to).
The second version takes up more space, but querying is faster, so we will use that.

4 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Materialized Views and RDF Queries: http://rdfvs.saclay.inria.fr/RDFViewS-TR.pdf

    ReplyDelete
  3. Added classes to the project that deal with default persistence to a relational store along with a factory class that allows one to build an iri as described in http://sharegov.blogspot.com/2011/01/owl-to-rdbms.html. The schema file is at the root of the project folder. There is a test class RelationalStoreTest.java that shows how to CRUD data properties, and object properties. As discussed last week I am hoping we can avoid an owl2rdb mapping and instead go with materialized views to handle reporting needs. However the issue of supporting legacy DBs still remains so a mapping strategy might be unavoidable

    ReplyDelete
  4. rctophobby.com
    They leave no room for error because it only takes one erroneous move to utterly destroy a remote control helicopter. Therefore, in order to effectively pull off some maneuvers, one has to train

    ReplyDelete