Construction principles for the information professional: (8) Separating data and metadata in storage and processing
Data is hot. Everyone wants to work data-driven, and AI only works on big data. In short, data is the new gold! But what exactly is data? In the previous blog, I noted that there are different categories of data, and that blog dealt specifically with master data and transactional data. In this blog, I dwell on metadata and the need to distinguish it from the data itself and to process it separately. Otherwise, spaghetti forms in the system landscape and finding specific data becomes the proverbial needle in the haystack.
What is Metadata?
Metadata is information about data or about its processing. If the subject the data describes is itself data, a data carrier (a document, media, etc.) or an information-processing process, then the data about it is metadata.
For a document, metadata could be, for example, the author, word count, creation date, storage location, file format, and so on. But it can be much more, depending on what you want to capture. Data about a change to a record is also metadata: when the change was made and by whom (logging). So are the number of errors in a file, the number of objects in a collection, or the amount of time it takes to fill out a form.
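As a minimal sketch, document metadata can be modeled as a record that lives apart from the document content itself. The field names and values below are illustrative, not a standard:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical metadata record: it describes a document without containing it.
@dataclass
class DocumentMetadata:
    author: str
    word_count: int
    created: date
    storage_location: str  # points to where the content itself lives
    file_format: str

meta = DocumentMetadata(
    author="J. Jansen",
    word_count=1250,
    created=date(2023, 5, 1),
    storage_location="s3://archive/letters/1234.pdf",
    file_format="pdf",
)

# The metadata can be queried without ever opening the document itself.
print(meta.author, meta.file_format)
```

Note that the record only references the content (via `storage_location`); the document and its description can be stored and maintained independently.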
Finally, data models are also a form of metadata: semantic models, ontologies, concepts, and so on.
A meta-information system
Informationally, we think of data as a picture of something in reality (an event, transaction or situation). Metadata, then, is a picture of that information or its processing. Viewed this way, an information system captures (part of) reality, or actions in it, and we can speak of a meta-information system when it captures the processing done by the information system.
Observations of the meta-information system include observed errors, usage measurements, transaction counts, and so on. Based on these, actions such as corrections can then be performed in reality.
Why separate?
Metadata has its own conceptual framework, its own quality requirements and often different dynamics from the information system itself. It therefore makes sense to maintain (design, capture, process and modify) metadata independently of the data in the information system.
In addition, maintaining data and metadata in a single store leads to intertwining and invisible links. This greatly reduces the maintainability of the entire system (see construction principle 2 on decoupling).
Finally, metadata is very important for indexing and searching data. It forms the table of contents of the data collection and across collections.
The separation manifests itself for the information scientist at several levels. At the planning level, the various systems that provide processing on the one hand and control of processing on the other can be identified. At the system level, a distinction can be made between different storage locations and processing software. At the object level, a clear distinction emerges between the content of the information (e.g., the letter or file) and the metadata about it (the information about the letter or file).
In practice
In practice, the rule of separation is sometimes applied consistently, and sometimes it is not. It is widely recognized as wise to store log files separately from the processing system, not only due to the potential sensitivity of the information but also because the maintenance and release cycles of logging functionality differ from those of the system being logged.
Many organizations separate data in transaction systems (the registration) from data used for business intelligence processing. BI processes data and metadata, which in turn generates new data (analysis data, see the previous blog). This separation helps to decouple and maintain the systems, as well as enabling them to be updated independently.
Another example can be found in document management systems (DMS). DMSs create a separation between the storage of documents and the storage and processing of their metadata. Here, maintainability and searchability are key reasons for this, as a new DMS can often be integrated with existing document storage. The metadata collection then forms an index that uses a key to reference the content data.
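That DMS-style split can be sketched minimally as follows: a small metadata index references a separate content store by key, so searches run against the index and only matching documents are fetched. All names and data are illustrative:

```python
# Content store: holds the documents themselves (illustrative stand-in for
# a file share or object store).
content_store = {
    "doc-001": b"%PDF-1.4 ... full letter content ...",
    "doc-002": b"%PDF-1.4 ... full report content ...",
}

# Metadata index: small, searchable, and referencing content by key.
metadata_index = [
    {"key": "doc-001", "title": "Letter to council", "author": "A. de Vries", "year": 2021},
    {"key": "doc-002", "title": "Annual report", "author": "B. Bakker", "year": 2022},
]

def find_documents(author: str) -> list[bytes]:
    """Search the (small) metadata index, then fetch content by key."""
    keys = [m["key"] for m in metadata_index if m["author"] == author]
    return [content_store[k] for k in keys]

docs = find_documents("A. de Vries")
```

Because the coupling runs through the key alone, a new DMS (a new index) can be placed on top of the existing content store without migrating the documents.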
In the current setup of the Basic Registration of Persons (BRP), data and metadata are sometimes intermixed. For example, the BRP tracks when a change occurred, and even transactions (marriage, divorce, address change, voting rights) are processed in the same storage as a person’s master data. While it is not uncommon to store who created a record and when it was last modified along with the data itself, this has made the BRP a system that is difficult to maintain and even harder to renew! The change history is a separate (meta)data element; by keeping it in a separate index, you apply the separation principle correctly and can answer many questions about that data independently of the stored data itself. The registration of the master data then becomes decoupled from the registration of the process that queries, modifies and maintains that data.
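Applied purely, the change history would live in its own (meta)store. A minimal sketch, with illustrative names, of a master record and a separate change log:

```python
from datetime import datetime

# Master data store (illustrative): only the current facts about a person.
persons = {"p-42": {"name": "J. Jansen", "address": "Dorpsstraat 1"}}

# Separate (meta)store: who changed what, and when.
change_log = []

def update_address(person_id: str, new_address: str, changed_by: str) -> None:
    """Apply the change to the master record; record the history elsewhere."""
    old = persons[person_id]["address"]
    persons[person_id]["address"] = new_address
    change_log.append({
        "person_id": person_id,
        "field": "address",
        "old": old,
        "new": new_address,
        "by": changed_by,
        "at": datetime.now(),
    })

update_address("p-42", "Kerkplein 3", changed_by="clerk-07")

# Questions about the change process ("who changed this, and when?") are
# answered from the log alone, without touching the master record.
```

The master record stays small and current, while the history can be queried, retained, or archived on its own schedule.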
These are still fairly conventional applications, but what about Data Lakes and AI applications? There, too, it turns out that without metadata it becomes difficult to find relevant data. AI relies on many indexes (metadata) that point to the sources where the data itself is stored. And to assign value to that data, it is essential to know how up-to-date it is, how often it is used, who uploaded it, in what language it is written, and so on. All metadata that makes searching much faster.
If that metadata were intertwined with the sources themselves, ChatGPT would still be science fiction... because then you would have to search the entire haystack instead of the indexes. In Data Lake analysis, a data analyst's work is always to impose structure on the Data Lake (after the fact).
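Searching the index instead of the haystack can be sketched as follows: candidate sources are narrowed down using metadata alone (language, freshness), and only the survivors would need to be read in full. All names and values are assumptions for illustration:

```python
from datetime import date

# Illustrative source index: metadata about sources, not the sources themselves.
source_index = [
    {"source": "reports/2020.txt", "language": "nl", "last_updated": date(2020, 1, 1), "uses": 3},
    {"source": "reports/2024.txt", "language": "en", "last_updated": date(2024, 6, 1), "uses": 120},
    {"source": "notes/draft.txt",  "language": "en", "last_updated": date(2019, 2, 1), "uses": 1},
]

def candidate_sources(language: str, min_year: int) -> list[str]:
    """Narrow the haystack using the index; only the survivors need reading."""
    return [
        s["source"] for s in source_index
        if s["language"] == language and s["last_updated"].year >= min_year
    ]

hits = candidate_sources("en", 2023)
```

The expensive step (reading content) is deferred until the cheap step (filtering metadata) has done its work, which is exactly why the index must exist separately from the sources.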
The information scientist must consider the separation of data and metadata when designing a registry. What are the master data, what are the transaction data, and what are the metadata? What analyses does one want to perform on the registry? What indexing is desired, and what search queries? With metadata, the actual data becomes maintainable, searchable, accessible, and interoperable¹.
¹ Also known as the FAIR principles (https://www.go-fair.org/fair-principles/). It is beyond the scope of this blog to delve into this further.
Read the other information science principles here:
- Meaningless identity designation, read here.
- Decoupling points for complexity reduction and flexibility, maximizing independence of components, read here.
- Language consistency, read here.
- Clear distribution of responsibilities and functional separation for administration, read here.
- Delegating decision-making authority as low as possible, read here.
- Detaching authorization from identification/authentication, read here.
- Single registration of master data, read here.
- Separating data and metadata in storage and processing, read here.
- Applying standard patterns without deviations, read here.
- Separating application function from data storage, read here.