How much abstraction is healthy for a schema/data model? - Part 1

When engaged in the art of Data modelling everybody faces the challenge to find the right level of abstraction. I find that challenge quite intriguing. Nobody would create an attribute "Baker", "Lawyer", "Farmer" in a CRM system today, but one "profession" that can hold any of these professions as value. A higher level of abstraction would be to have attribute value pairs. So instead of "Profession" - "Baker" it would be "Attribute: Name=Profession, Value=Baker". Such constructs have the advantage to be very flexible, without changing the schema all sorts of different attributes can be captured. However they make validation more difficult: are all mandatory attributes present, are only allowed attributes present and do all attributes have values in the prescribed range?
Very often the data models are designed around the planned storage in an RDBMS. This conveniently overlooks that data modelling knows more approaches than just a physical data model and ER-diagrams. Tabular data in real life are confined to accountants' ledgers, while most of the rest are objects, documents and subjects (people, legal entities, automated systems - data actors so to speak) with attributes, components (sub-entries) and linear or hierarchical relations. Also exchanging data requires complete and integer data, which lends itself rather to the document than the table approach (putting on my flame proof underwear now).
In a RDBMS the attribute table would be a child table to the (people) master table with the special challenge how to find an unique key that survives an export/import operation.
This is the reason why a XML Schema seems to be the reasonable starting point to model the master data model for your application. Thus a worthwhile skill is to master XML Schema (It also helps to have a good schema editor. I'm - for many years- using oXygen XML).
This won't stop you to still use a RDBMS to persist (and normalize) data, but the ER-Schema wouldn't take centre stage anymore. Of course modern or fancy or venerable databases can deal with the document tree nature of XML quite well. I fancy DB/2's PureXML capabilities quite a bit. But back to to XML Schema (similar considerations apply to JSON which is lacking a schema language yet - work is ongoing).
Since XML knows elements (the stuff in opening and closing brackets, that can be nested into each other) and attributes (which are name/value pairs living inside the brackets) there are many variations to model a logical data entry. A few rules (fully documented in the official XML specifications) need to be kept in mind:

Element names can't contain fancy characters or spaces
Elements can, but don't need to have content
Elements can have other elements, text or CDATA (for fancy content) as children
Elements can, but don't need to have attributes
Elements can't start with "xml"
Attribute names can't contain fancy characters or spaces
Attribute values can't contain fancy characters. If there they need to be encoded
Attributes should only exist once in an element
Attributes must have a value (it can be empty), but can't have children

There are different "design camps" out there. Some dislike the use of attributes at all, others try to minimize the use of elements, but as usual the practicality lies in the middle. So we have 2 dimensions to look at: Element/Attribute use in a tree structure and secondly the level of abstraction. Lets look at some samples (and their query expressions):

<customer>
<id>1234 </id>
<name>John Doe </name>
<status>active </status>
<remarks>John loves roses and usually speaks to Jenny </remarks>
</customer>

The XPath query to get to John would be //customer[id='1234'] The double slash says "any customer anywhere in the tree hierarchy". This is very flexible and understandable very slow. A better way would be to define the exact level in the XML tree. Presuming John is an entry in the <customerlist> then the expression would be: /customerlist/customer[id='1234']. Very often in XML we use multiple schemata in one file which get a abbreviation (one or more letters in the namespace definition). Presuming our namespace identifier is "cust" then the query changes to /cust:customerlist/cust:customer[cust:id='1234']. Note: in the query the namespace has to be repeated for every element. when you just write /elementname, it is the same as /default:elementname. The example was written without any attributes, containing only nested elements. We could write the example without inner elements, just attributes:

The notation is denser and might on first look appeal to table thinking RDBMS architects. The query looks almost the same (note the @ sign for attributes): /cust:customerlist/cust:customer[@id='1234']. The " all you can attribute" approach has its limits since you can't easily repeat data or include richer content. A more typical approach is to use attributes that identify / qualify an element. Of course there is no hard rule for that. I probably would write the entry like this:

<customer id="1234" status="active">
<name lang="en">John Doe </name>
<name lang="zn">李四 </name>
<remarks>John loves roses and usually speaks to Jenny </remarks>
</customer>

Here I take advantage of the ability to repeat elements and have elements with their own repeating attributes. If and how often an element can repeat can be specified in the schema definition. The query is the same as in the attribute only example: /cust:customerlist/cust:customer[@id='1234']. It gets really interesting for all three example when you want to retrieve the customer name:
/cust:customerlist/cust:customer[cust:id='1234']/cust:name
/cust:customerlist/cust:customer[@id='1234']/@name
/cust:customerlist/cust:customer[@id='1234']/cust:name[@lang='en']
All examples so far used a "customer" element. But one could argue, that a customer is a specialization of a person (a natural person or a incorporation) or "subject", so the tag could be different. With "things" the Taxonomy could be much deeper. You could say an invoice is a customer-document, a business-document, a document, an object. Part 2 will discuss abstraction levels and de-normalization in XML a little more. A future post will shed some light on common XML Schemata.

Posted by Stephan H Wissel on 10 March 2012 | Comments (0) | categories: Software

wissel.net

How much abstraction is healthy for a schema/data model? - Part 1

Comments

No comments yet, be the first to comment