Tuesday, August 23, 2011

From Rules to Workflows

Some business object types have workflows associated with them. A business object type is identified with an OWL class.  A workflow defines the process through which a business object of that type goes from its creation up to its deletion (or archiving). This can be an enterprise-wide business process involving many other systems and human actors, or a simple one-time interaction with an end user.
In our framework, a business object is always modeled as an OWL individual and its state is always modeled as OWL properties. Furthermore, an object has a lifetime with a beginning and an end. The process governing that lifetime is specified through a set of SWRL rules, which are dynamically converted into a business process that gets executed on the business object. This blog entry outlines the algorithm that translates sets of rules into a process workflow. Details will be given in future blog entries.
Assumptions
Here are the assumptions made about the set of rules defining a business object workflow:
  1. The rules are defined in a single ontology with an IRI that follows the naming pattern http://www.miamidade.gov/swrl/<OWL Class> where <OWL Class> is the OWL classname of the business object.
  2. At least one rule must have a goal atom in its head. Goal atoms are specific to the business object type.
  3. Unless specifically stated, all variables are local to the rules in which they appear.
  4. A global variable named "bo" is reserved and refers to the business object on which the set of rules operates.  
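For illustration, here is a hypothetical pair of rules for some permit-like business object, written in human-readable SWRL syntax (the class and property names are made up; Approved(bo) plays the role of the goal atom):

hasStatus(bo, "Submitted") ^ hasDocument(bo, ?d) ^ reviewedBy(?d, ?m) -> hasStatus(bo, "Reviewed")
hasStatus(bo, "Reviewed") ^ approvedBy(bo, ?mgr) -> Approved(bo)

Note how "bo" is the one global variable shared by both rules, while ?d, ?m and ?mgr are local to the rules in which they appear.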
Sketch
If no rule has a goal atom in its conclusion, then no workflow is created at all.
  1. First, the rules are evaluated iteratively against the current state of the business object. When evaluating a rule, the atoms in its body (i.e. its premises) are evaluated and, if they are all true, the atoms in its head are asserted in the current BO ontology. If any of those newly asserted atoms is actually an end goal, then no workflow is constructed because there's nothing to do. If at least one of the atoms in a rule's body is known to be false, the rule is subsequently ignored. If there are unknown atoms in the rule's body (but no false ones), the whole rule is deemed "unknown" and will be included in the construction of the workflow. When an iteration produces no new atom assertions from any rule, the evaluation process is complete (a sketch of this loop follows the list).
  2. At the end of the evaluation process, we are left with a set of "unresolved" rules, that is, rules with unknown atoms in their bodies. Each rule is annotated with extra information and wrapped in a class called AppliedRule that contains the values of instantiated variables and the dependencies between the atoms within the rule. Such dependencies are inferred through some heuristics and assumptions described below.
  3. At this stage, we have several unresolved rules remaining, some of which contain a goal atom in their head (if none of them does, then the workflow is empty). Starting from the rules containing goal atoms in their head (i.e. their conclusion) and going in a backward-chaining fashion, we enumerate all possible logical paths that satisfy the rules and reach their conclusions. Each unknown atom X of the currently examined rule body is added to the deduction path, and then the unknown atoms from rules where X appears in the conclusion are added recursively. During the enumeration, each such atom is converted into a WorkflowPathElement instance, a helper class that holds an atom together with its dependencies on other atoms in the eventually constructed workflow. The result is a dependency graph between the unknown SWRL atoms, where the edges are SWRL variables that get propagated from atom to atom. A logical dependency between an atom in a rule's body and an atom in that rule's head is represented by a predefined SWRL variable named "implies". This dependency graph is important in that it defines which tasks in the workflow must be evaluated before which others.
  4. The next stage converts each logical path found in the previous stage into a sequence of workflow tasks to be executed in order to reach the goal. The sequence is constructed starting from the WorkflowPathElements that have no dependencies, then recursively adding the elements that depend on them as subsequent tasks. In addition, tasks asserting all possible conclusions are added as soon as possible: every time a task is added to the current sequence, a search is made for all rules whose premises would be satisfied were all tasks up to this point to succeed, and the conclusions of those rules are added as AssertAtomTasks. In this way, every logical path deducing an end goal is converted into a sequence of steps where every step appears after the steps it depends on (for the value of a variable or logically) and where all possible conclusions from unresolved rules are asserted as soon as the premises of those rules become true.
  5. At this point, we have a set of linear task sequences. Each sequence can be executed step by step to reach an end goal for the business object, but each intermediary step has the potential of failing. When a step fails, we want to branch to another sequence that may succeed, but we don't want to repeat the execution of steps that have already succeeded and we don't want to branch to a sequence that we know will fail. The construction of the workflow from this set of sequences relies on their being ordered appropriately. Each task in a sequence is assigned a cost, and independent tasks within a sequence are thus naturally ordered by their cost. A dependent task is assigned a cost equal to the sum of the costs of its dependencies, so in the end the order of all tasks in a workflow path is determined entirely by this cost number. The task paths themselves are ordered by cost as well, where the cost of a task path is simply the sum of the costs of all its tasks.
  6. Given this setup, the workflow is constructed as follows: start with the least costly sequence. Its first task is the starting point of the workflow. Each task, except atom assertion tasks, yields a true/false result, so a boolean branching node is added after each task that branches to the next task in the current path on "true" and to the "first possible task" in subsequent paths on "false". This "first possible task" is obtained by examining each of the more costly paths in turn and looking for a task that is not already on the current path and whose preceding tasks are all on the current path (see the branching sketch below).
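To make step 1 concrete, here is a minimal Java sketch of the evaluation loop. The Rule, Atom and BOOntology types are hypothetical stand-ins for the actual framework classes, and variable instantiation is glossed over:

import java.util.*;

interface Atom {}
interface Rule { List<Atom> body(); List<Atom> head(); }
interface BOOntology {
    boolean isTrue(Atom a);   // atom known to hold in the current BO ontology
    boolean isFalse(Atom a);  // atom known not to hold
    void assertAtom(Atom a);  // add the atom to the BO ontology
}

class RuleEvaluator {
    // Evaluate rules to a fixpoint; return the "unknown" rules that feed workflow construction.
    static Set<Rule> evaluate(Collection<Rule> rules, BOOntology bo) {
        Set<Rule> pending = new LinkedHashSet<>(rules);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Iterator<Rule> it = pending.iterator(); it.hasNext(); ) {
                Rule r = it.next();
                if (r.body().stream().allMatch(bo::isTrue)) {
                    r.head().forEach(bo::assertAtom); // all premises true: fire the rule
                    it.remove();                      // (a goal atom asserted here would short-circuit workflow construction)
                    changed = true;
                }
                else if (r.body().stream().anyMatch(bo::isFalse))
                    it.remove(); // a premise is known false: the rule is ignored from now on
                // otherwise the rule has unknown atoms and stays "unresolved"
            }
        }
        return pending;
    }
}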
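And here is a sketch of the "first possible task" selection from step 6, with Task as a hypothetical placeholder type:

import java.util.List;

class Branching {
    // On failure, find the first task in a costlier path that is not already on the
    // current path and whose preceding tasks are all on the current path.
    static <Task> Task firstPossibleTask(List<Task> currentPath, List<List<Task>> costlierPaths) {
        for (List<Task> path : costlierPaths)
            for (int i = 0; i < path.size(); i++) {
                Task candidate = path.get(i);
                if (!currentPath.contains(candidate) && currentPath.containsAll(path.subList(0, i)))
                    return candidate;
            }
        return null; // nowhere left to branch: the workflow ends in failure
    }
}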
The workflow thus constructed looks like a decision tree where branching is done based on the truth of OWL axioms, and whenever the truth of an OWL axiom is not known, there's a node to "find out": prompt the user, call some external software, etc. In a subsequent blog entry I will give more details about how variable dependencies are managed, how costs are assigned to tasks so that goal paths can be ordered, and what assumptions and heuristics are being used.
This is work in progress. To handle more real-life workflow scenarios, our next step is to represent asynchronous processes and events. The simplest strategy would be to put the whole workflow in a state of limbo whenever there's unknown information and no progress can be made. When in such a state, the operations service accepts changes to the BO ontology and resumes the workflow with the newly found information. In general, it is possible to execute the workflow from the beginning at any time, because tasks are performed only for missing information and will be skipped on a second execution if the information is already there (e.g. the OWL properties have already been stated). So, for example, when a business object is "edited" with some arbitrary properties changing, the workflow can simply be replayed from the beginning instead of trying to figure out which decisions to backtrack and which to keep.
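This replay idea amounts to making every task idempotent. A minimal sketch, with a hypothetical task interface:

import java.util.List;

interface ReplayableTask {
    boolean informationPresent(); // has this question already been answered in the BO ontology?
    boolean execute();            // perform the task, returning true on success
}

class Replay {
    static void replay(List<ReplayableTask> path) {
        for (ReplayableTask t : path) {
            if (t.informationPresent())
                continue;         // already known from a previous run: skip
            if (!t.execute())
                return;           // failed: branch to another path (omitted here)
        }
    }
}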
Boris

Wednesday, August 10, 2011

Externalizing Data Property Values

Externalizing data property values into their own table will improve performance in two ways. First, we will avoid duplicate literal values, reducing the number of records. Second, by having typed representations of the data (i.e., VALUE_AS_DATE, VALUE_AS_DOUBLE) we can query by type without having to do an explicit cast. The initial table should look like so:

create table CIRM_OWL_DATA_VALUE
(
    ID number(19,0) not null, -- the sequence; the VALUE column in CIRM_OWL_DATA_PROPERTY will now be a FK constraint on this column
    VALUE_HASH varchar2(28), -- lookup column
    VALUE_AS_VARCHAR varchar2(4000 char), -- a string representation of the value
    VALUE_AS_CLOB clob, -- storage column for large string values
    VALUE_AS_DATE timestamp, -- a typed representation of the value as a date
    VALUE_AS_DOUBLE double precision, -- a typed representation of the value as a double
    VALUE_AS_INTEGER number(19,0), -- a typed representation of the value as an integer
    primary key (ID)
)


The RelationalStore class will need to be modified accordingly to reflect the changes in schema. We will need to check whether a given VALUE already exists, which can be done by first checking the VALUE_HASH column. Using Java's hashCode() is not recommended, as the table can grow quite large and hash collisions become a concern. A better approach is to use a cryptographic hash function, e.g. MD5 or SHA-1, to hash the string value and store its Base64 encoding in the column. SHA-1 should suffice: the Base64 encoding of a 20-byte SHA-1 digest is a 28-character string, hence the 28-character column length.
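Here is a minimal sketch of such a hash in Java, using the JDK's MessageDigest and Base64 (the class and method names are made up):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class ValueHasher {
    // Compute the 28-character, Base64-encoded SHA-1 digest of a literal value.
    static String valueHash(String lexicalValue) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(lexicalValue.getBytes(StandardCharsets.UTF_8)); // 20 bytes
            return Base64.getEncoder().encodeToString(digest); // ceil(20/3)*4 = 28 characters
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}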

Also, the VALUE_HASH column could be included in the CIRM_OWL_DATA_PROPERTY table along with the ID, for rapid equivalence queries (avoiding a join).
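With that denormalization in place, an equivalence lookup could go straight against the data property table. A hedged JDBC sketch (the class and method names are illustrative):

import java.sql.*;
import java.util.*;

class EquivalenceLookup {
    // Find all subjects holding a given predicate/value pair, matching on the
    // denormalized VALUE_HASH column instead of joining CIRM_OWL_DATA_VALUE.
    static List<Long> findSubjects(Connection conn, long predicateId, String valueHash) throws SQLException {
        List<Long> subjects = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT SUBJECT FROM CIRM_OWL_DATA_PROPERTY WHERE PREDICATE = ? AND VALUE_HASH = ?")) {
            ps.setLong(1, predicateId);
            ps.setString(2, valueHash);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    subjects.add(rs.getLong("SUBJECT"));
            }
        }
        return subjects;
    }
}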

Tuesday, August 9, 2011

Querying the Operations Database

The triple store query translation results in SQL that mirrors what is described in this document:

http://www.w3.org/2002/05/24-RDF-SQL/

The noMappingTranslate method works with equivalence-based queries. Here is an example of a query with a 'nested' type:

{
    "sortBy":"hasDateLastModified",
    "hasServiceRequestStatus":"*",
    "atAddress":{
        "Street_Name":"1ST",
        "type":"Street_Address"
    },
    "currentPage":1,
    "sortDirection":"desc",
    "type":"Garbage_Missed_Complaint",
    "hasDateCreated":"06/14/2011",
    "itemsPerPage":20
}

The query translator translates this to:

SELECT CIRM_CLASSIFICATION.SUBJECT
FROM CIRM_CLASSIFICATION
JOIN CIRM_OWL_OBJECT_PROPERTY ON CIRM_OWL_OBJECT_PROPERTY.SUBJECT = CIRM_CLASSIFICATION.SUBJECT
JOIN CIRM_OWL_DATA_PROPERTY ON CIRM_OWL_DATA_PROPERTY.SUBJECT = CIRM_CLASSIFICATION.SUBJECT
WHERE (CIRM_CLASSIFICATION.OWLCLASS = ?)
OR (CIRM_OWL_OBJECT_PROPERTY.PREDICATE = ?)
OR (CIRM_OWL_OBJECT_PROPERTY.PREDICATE = ? AND CIRM_OWL_OBJECT_PROPERTY.OBJECT IN
    (SELECT CIRM_CLASSIFICATION.SUBJECT
     FROM CIRM_CLASSIFICATION
     JOIN CIRM_OWL_DATA_PROPERTY ON CIRM_OWL_DATA_PROPERTY.SUBJECT = CIRM_CLASSIFICATION.SUBJECT
     WHERE (CIRM_CLASSIFICATION.OWLCLASS = ?)
     OR (CIRM_OWL_DATA_PROPERTY.PREDICATE = ? AND TO_CHAR(CIRM_OWL_DATA_PROPERTY.VALUE) = '1ST')))
OR (CIRM_OWL_DATA_PROPERTY.PREDICATE = ? AND TO_CHAR(CIRM_OWL_DATA_PROPERTY.VALUE) = '06/14/2011')

Parameters:

1 = 181()
2 = 191()
3 = 194()
4 = 182()
5 = 184()
6 = 185()

Here is a list of functions and operators that the query translator supports:

Function/Operator   Translation                  Sample JSON property expression
greaterThan         SQL GREATER THAN             "hasDateCreated":"greaterThan(\"2011-06-15T19:18:06.552Z\")"
lessThan            SQL LESS THAN                "hasDateCreated":"lessThan(\"2011-06-15T19:18:06.552Z\")"
like                SQL LIKE with '%' appended   "hasName": "like(\"Zues\")"
between             SQL BETWEEN                  "hasDateCreated":"between(\"2011-06-15T19:18:06.552Z\",\"2011-06-15T19:18:06.552Z\")"
*contains
*in
*startsWith
*notLike
=                   SQL EQUALS                   "hasName": "= \"Zues\""
>=                  SQL GREATER THAN OR EQUAL    "hasCount": ">= 1"
<=                  SQL LESS THAN OR EQUAL       "hasCount": "<= 1"
>                   SQL GREATER THAN             "hasCount": "> 1"
<                   SQL LESS THAN                "hasCount": "< 1"

Note: literal values are translated to an equals operation on the SQL side.

* Expression parsing is in place, but translation is incomplete.