Construction principles for the information professional: (8) Separating data and metadata in storage and processing

Data is hot. Everyone wants to work data-driven, and AI only works on big data. In short, data is the new gold! But what exactly is data? In the previous blog, I mentioned that there are different categories of data; that blog dealt specifically with master data and transactional data. In this blog, I dwell on metadata and the need to distinguish it from the data itself and process it separately. Otherwise, the system landscape turns into spaghetti and searching for specific data becomes the proverbial needle in the haystack.


What is Metadata?

Metadata is information about data or about its processing. If the subject the data describes is itself data, a data carrier (a document, media, etc.) or an information-processing process, then the data about that subject is metadata.

For a document, this could be, for example, the author, word count, creation date, storage location, file format, and so on. But it can be much more and depends on the metadata you want to capture. Data about a change to a record is also metadata: when the change was made and by whom (logging). So are the number of errors in a file, the number of objects in a collection, or the amount of time it takes to fill out a form.
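To make this concrete, here is a minimal sketch (in Python, with hypothetical names and values) of document metadata captured as a structured record of its own, next to the content it describes:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class DocumentMetadata:
    # Descriptive metadata about a document, not the document content itself
    author: str
    word_count: int
    created_at: datetime
    storage_location: str   # where the content actually lives
    file_format: str

# The content is one thing ...
content = "Dear reader, ..."

# ... the metadata about it is another, and can be stored and queried separately
meta = DocumentMetadata(
    author="J. Jansen",
    word_count=len(content.split()),
    created_at=datetime(2023, 5, 1, 9, 30),
    storage_location="archive/letters/2023/letter-0001.docx",
    file_format="docx",
)
```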

Finally, data models are also a form of metadata: semantic models, ontologies, concepts, and so on.

A meta-information system

From an information perspective, we think of data as a picture of something in reality (an event, transaction or situation). Metadata, then, is a picture of that information and of its processing. Viewed this way, an information system captures (part of) reality, or actions in reality, and we can speak of a meta-information system if it captures the processing done by the information system.

Observations of the meta-information system include, for example, detected errors, usage measurements, the number of transactions, and so on. Based on these, actions can then be taken in reality, such as corrections.

Why separate?

Metadata has its own conceptual framework, its own quality requirements and often different dynamics from the information system itself. It therefore makes sense to maintain (design, capture, process and modify) metadata independently of the data in the information system.

In addition, maintaining data and metadata in a single store leads to intertwining and invisible links. This greatly reduces the maintainability of the entire system (see construction principle 2 on decoupling).

Finally, metadata is very important for indexing and searching data. It forms the table of contents of the data collection and across collections.

The separation manifests itself for the information scientist at several levels. At the planning level, the various systems that provide processing on the one hand and control of processing on the other can be identified. At the system level, a distinction can be made between different storage locations and processing software. At the object level, a clear distinction emerges between the content of the information (e.g., the letter or the file) and the metadata about it (the information about the letter or the file).

In practice

In practice, the rule of separation is sometimes applied consistently, and sometimes it is not. It is widely recognized as wise to store log files separately from the processing system, not only due to the potential sensitivity of the information but also because the maintenance and release cycles of logging functionality differ from those of the system being logged.

Many organizations separate data in transaction systems (the registration) from data used for business intelligence processing. BI processes data and metadata, which in turn generates new data (analysis data, see the previous blog). This separation helps to decouple and maintain the systems, as well as enabling them to be updated independently.

Another example can be found in document management systems (DMS). A DMS creates a separation between the storage of documents and the storage and processing of their metadata. Here, maintainability and searchability are the key reasons: a new DMS can often be integrated with existing document storage. The metadata collection then forms an index that uses a key to reference the content data.
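As an illustration (a simplified sketch with hypothetical keys and fields, not the design of any particular DMS), the metadata index and the document store can be two separate structures, linked only by a key:

```python
# Content store: documents addressed by a key (files, an object store, ...)
document_store = {
    "doc-001": b"%PDF-1.7 ... raw bytes of the letter ...",
    "doc-002": b"%PDF-1.7 ... raw bytes of the contract ...",
}

# Metadata index: searchable attributes, referencing content only by key
metadata_index = {
    "doc-001": {"title": "Letter to supplier", "author": "J. Jansen", "year": 2023},
    "doc-002": {"title": "Maintenance contract", "author": "M. de Vries", "year": 2021},
}

def find_documents(**criteria):
    """Search the index only; fetch content afterwards, and only if needed."""
    return [
        key for key, meta in metadata_index.items()
        if all(meta.get(attr) == value for attr, value in criteria.items())
    ]

hits = find_documents(author="J. Jansen")
content = document_store[hits[0]]  # a new DMS could reuse this same content store
```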

In the current setup of the Basic Registration of Persons (BRP), data and metadata still get intermixed at times. For example, the BRP also records when a change occurred, and even transactions (marriage, divorce, address change, voting rights) are processed in the same storage as a person's master data. While it is not uncommon to store who created a record and when it was last modified along with the data itself, this has made the BRP a system that is difficult to maintain and certainly difficult to renew! The change history is a separate piece of (meta)data; by keeping it in a separate index, you apply the separation principle properly and can answer many questions about that history independently of the stored data itself. The registration of the master data then becomes decoupled from the registration of the process of querying, modifying and maintaining that data.
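A simplified sketch of what that separation could look like (hypothetical structures, not the actual BRP design): the master data of a person and the change history live in separate stores, and questions about the change process are answered from the history alone:

```python
from datetime import date

# Master data: one record per person, keyed by an identifier
persons = {
    "p-123": {"name": "A. de Boer", "address": "Dorpsstraat 1, Ons Dorp"},
}

# Change history: a separate (meta)data store, referencing persons by key
change_history = [
    {"person": "p-123", "on": date(2021, 3, 14), "by": "municipality X", "what": "address change"},
    {"person": "p-123", "on": date(2022, 9, 2), "by": "municipality Y", "what": "marriage"},
]

def changes_for(person_key):
    """Answer questions about the change process without touching the master record."""
    return [entry for entry in change_history if entry["person"] == person_key]

print(changes_for("p-123"))          # process questions, answered from the history
print(persons["p-123"]["address"])   # content questions, answered from the master data
```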

These are still fairly conventional applications, but what about data lakes and AI applications? Here, too, it turns out that without metadata it becomes difficult to find relevant data! AI relies on many indexes (metadata) that point to the sources where the data itself is stored. And to assign value to that data, it is essential to know how up to date it is, how often it is used, who uploaded it, in what language it is written, and so on: all metadata that makes searching much faster.
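A minimal sketch of that idea (hypothetical catalog fields and source paths, not a specific product): the search runs over a small metadata catalog, and only the sources that qualify would then be read from storage:

```python
from datetime import date

# Metadata catalog: small, structured and fast to search
catalog = [
    {"source": "s3://lake/reports/2024/q1.parquet", "language": "nl",
     "uploaded_by": "finance", "last_updated": date(2024, 4, 2), "usage_count": 57},
    {"source": "s3://lake/raw/sensor-dump.csv", "language": "en",
     "uploaded_by": "iot-team", "last_updated": date(2019, 1, 8), "usage_count": 2},
]

def relevant_sources(language, min_year):
    """Decide which sources are worth reading at all, using metadata only."""
    return [
        entry["source"] for entry in catalog
        if entry["language"] == language and entry["last_updated"].year >= min_year
    ]

# Only now would the data itself be fetched, and only from the selected sources
print(relevant_sources(language="nl", min_year=2023))
```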

If that metadata were intertwined with the sources themselves, ChatGPT would still be science fiction... because then you would have to search the entire haystack instead of the indexes. In data lake analysis, a data analyst's work always involves imposing structure on the data lake (after the fact).

The information scientist must consider the separation of data and metadata when designing a registry. What are the master data, what are the transaction data, and what are the metadata? What analyses does one want to perform on the registry? What indexing is desired, and what search queries? With metadata, the actual data becomes maintainable, searchable, accessible, and interoperable.¹

  1. Also known as the FAIR principles (https://www.go-fair.org/fair-principles/). It is beyond the scope of this blog to delve into this further.

Read the other information science principles here:

  1. Meaningless identity designation, read here.
  2. Decoupling points for complexity reduction and flexibility, maximizing independence of components, read here.
  3. Language consistency, read here.
  4. Clear distribution of responsibilities and functional separation for administration, read here.
  5. Delegating decision-making authority as low as possible, read here.
  6. Detaching authorization from identification/authentication, read here.
  7. Single registration of master data, read here.
  8. Separating data and metadata in storage and processing, read here.
  9. Applying standard patterns without deviations, read here.
  10. Separating application function from data storage, read here.
