Wednesday, August 10, 2011

Externalizing Data Property Values

Externalize data property values into it's own table. It will improve performance by two means. The first, we will avoid duplicates literal values reducing the amount of records. The second is by having typed representations of the data i.e., VALUE_AS_DATE, VALUE_AS_NUMBER we can query on by type without having to do an explicit cast. The initial table should look like so:

create table CIRM_OWL_DATA_VALUE
(
ID number(19,0) not null, -- the sequence, the VALUE column in CIRM_OWL_DATA_PROPERTY will now be a fk constraint on this column.
VALUE_HASH varchar2(28) , -- lookup column.
VALUE_AS_VARCHAR varchar2(4000 char), --a string representation of the value
VALUE_AS_CLOB clob , -- storage column for large string values
VALUE_AS_DATE timestamp, -- a typed representation of the value as date
VALUE_AS_DOUBLE double precision, -- a typed representation of the value as double
VALUE_AS_INTEGER number(19,0), -- a typed representation of the value as an integer.
primary key (ID)
)


The RelationalStore class will need to be modified accordingly to reflect the changes in schema. We would need to check if an existing VALUE exists, this can be done by first checking the VALUE_HASH column. Use of the java hashCode() is not recommended as table records can get quite large and hash collision is a concern. Perhaps a better approach would be to utilize a cryptographic hash function ie, MD5, SHA-1 to hash the string value and store it's Base64 encoding in the column. SHA-1 hashing should suffice. A base64 encode of a SHA-1 digest would result in a 28 byte hash hence the 28 byte column length.

Also, the VALUE_HASH column could be included in the DATA_PROPERTY TABLE along with the id for rapid equivalence queries (avoid a join).

No comments:

Post a Comment