Data Fabric building blocks
Data Fabric’s building blocks represent groupings of different components and characteristics. They are high-level blocks that describe a package of capabilities that address specific business needs. The building blocks are Data Governance and its knowledge layer, Data Integration, and Self-Service. Figure 1.3 illustrates the key architecture building blocks in a Data Fabric design.
Figure 1.3 – Data Fabric building blocks
Data Fabric’s building blocks have foundational principles that must be enforced in its design. Let’s introduce what they are in the following subsection.
Data Fabric principles
Data Fabric’s foundational principles ensure that the data architecture is on the right path to deliver high-value, high-quality data management that keeps data secure and protected. The following list introduces the principles that need to be incorporated as part of a Data Fabric design. In Chapter 6, Designing a Data Fabric Architecture, we’ll discuss each principle in more depth:
- Data are Assets that can evolve into Data Products (TOGAF & Data Mesh): Represents a transition where assets have active product management across their life cycle from creation to end of life, a specific value proposition, and enablement for high-scale data sharing.
- Data is shared (TOGAF): Empower high-quality data sharing
- Data is accessible (TOGAF): Ease of access to data
- Data product owner (TOGAF and Data Mesh): The Data Product owner manages the life cycle of a Data Product, and is accountable for the quality, business value, and success of data
- Common vocabulary and data definitions (TOGAF): Business language and definitions associated with data
- Data security (TOGAF): Data needs to have the right level of Data Privacy, Data Protection and Data Security.
- Interoperable (TOGAF): Defined data standards that achieve data interoperability
- Always be architecting (Google): Continuously evolve a data architecture to keep up with business and technology changes
- Design for automation (Google): Automate repeatable tasks to accelerate time to value
These principles have been referenced directly from, or inspired by, different noteworthy sources: TOGAF (https://pubs.opengroup.org/togaf-standard/adm-techniques/chap02.html#tag_02_06_02), Google (https://cloud.google.com/blog/products/application-development/5-principles-for-cloud-native-architecture-what-it-is-and-how-to-master-it), and Data Mesh, created by Zhamak Dehghani. They capture the essence of what is necessary for a modern Data Fabric architecture. I have slightly modified a couple of the principles to better align with today’s data trends and needs.
Let’s briefly discuss the four Vs in big data management, which are important dimensions that need to be considered in the design of Data Fabric.
The four Vs
In data management, the four Vs – Volume, Variety, Velocity, and Veracity (https://www.forbes.com/sites/forbestechcouncil/2022/08/23/understanding-the-4-vs-of-big-data/?sh=2187093b5f0a) – represent dimensions of data that need to be addressed as part of Data Fabric architecture. Different levels of focus are needed across each building block. Let’s briefly introduce each dimension:
- Volume: The size of data impacts Data Integration and Self-Service approaches. It requires a special focus on performance and capacity. Social media and IoT data have led to the creation of enormous volumes of data in today’s data era, and data volumes now grow with no practical upper bound. Classifying data to enable its prioritization is necessary. Not all data requires the same level of Data Governance rigor and focus. For example, operational customer data requires high rigor when compared to an individual’s social media status.
- Variety: Data has distinct data formats, such as structured, semi-structured, and unstructured data. Data variety dictates technical approaches that can be taken in its integration, governance, and sharing. Typically, structured data is a lot easier to manage compared to unstructured data.
- Velocity: The speed at which data is collected and processed, such as batch or real time, is a factor in how Data Governance can be applied and in the choice of enabling Data Integration technologies. For example, real-time data will require a streaming tool. Data Governance aspects, such as Data Quality and Data Privacy, require a different approach for streaming data than for batch processing because in-flight data is often incomplete.
- Veracity: Data Governance centered around Data Quality and data provenance plays a major role in supporting this dimension. Veracity measures the degree to which data can be trusted and relied upon to make business decisions.
Now that we have a summary of the four Vs, let’s review the building blocks (Data Governance, Data Integration, and Self-Service) of a Data Fabric design.
Important note
I have intentionally not focused on the tools or dived into the implementation details of Data Fabric architecture. My intention is to first introduce the concepts, groups of capabilities, and objectives of Data Fabric architecture at a bird’s-eye-view level. In later chapters, we’ll dive into the specific capabilities of each building block and present a Data Fabric reference architecture example.
Data Governance
Data Governance aims to define standards and policies that achieve data trust via protected, secure, and high-quality or fit-for-purpose data. Data Fabric enables efficient Data Integration by leveraging automation while making data interoperable. To support a mature Data Integration approach, the right level of Data Governance is required. It’s of no use to have a design approach that beautifully integrates dispersed data only to find out that the cohesive data is of poor quality and massively violates data compliance stipulations. Costs saved in superior Data Integration would be short-lived if organizations are then slapped with millions of dollars in Data Privacy violation fines.
Data Governance isn’t new; it has been around for decades. However, in the past, it was viewed as a burden by technologists and as a major obstacle in the deployment of successful software solutions. The impression has been that a governance authority sets rules and boundaries restricting progress with unnecessary rigor. What is drastically different today is the shift in perception. Data Governance has evolved to make it easy to access high-quality, trusted data via automation and more mature technologies despite the need to enforce security and regulation requirements. Another factor is the recognition of the impact of not having the right level of governance in a world where data managed in the right way can lead to major data monetization.
Data Governance today is recognized as a critical success factor in data initiatives. Data is predicted to grow at an exponentially fast rate. According to IDC, from 2021 to 2026, it will grow at a CAGR of 21.2%, potentially reaching 221,178 exabytes by 2026 (https://www.idc.com/getdoc.jsp?containerId=US49018922). This has pushed Data Governance to be front and center in achieving successful data management. To keep up with all the challenges introduced by big data, Data Governance has gone through a modernization evolution. In the following sections, we will dive into the pillars that represent Data Governance in Data Fabric architecture.
The Data Governance pillars established in the past are very relevant today with a modern twist. All Data Governance pillars are critical. However, Data Privacy, Protection and Security as well as Data Quality are prioritized due to data sovereignty requirements that need to address state-, country-, and union-based laws and regulations (such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA)), as well as the driving need for high-quality data to achieve data monetization. Data Lineage, Master Data Management, and Metadata Management are fundamental pillars that have dependencies on one another. The priority of these pillars will vary by use case and organizational objectives. All pillars serve a vital purpose and provide the necessary guardrails to manage, govern, and protect data.
Let’s have a look at each Data Governance pillar.
Data Privacy, Protection, and Security
Whether data is used knowingly or unknowingly for unethical purposes, it needs to be protected by defining the necessary data policies and enforcement to automatically shield it from unauthorized access. Business controls need to be applied in an automated manner during Data Integration activities such as data ingestion, data movement, and data access. Manual or ad hoc enforcement is not feasible, especially when it comes to a large data volume. There need to be defined intelligent data policies that mask or encrypt data throughout its life cycle. There are several considerations that need to be taken regarding data access, such as the duration of access, data retention policies, security, and data privacy.
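To make this concrete, here is a minimal sketch, in Python, of policy enforcement applied automatically at access time. The column classifications, the authorization check, and the hash-based masking are illustrative assumptions rather than a reference to any particular governance tool:

```python
# A minimal sketch of automated policy enforcement at access time.
# Classifications, the policy rule, and the masking strategy are illustrative assumptions.
import hashlib

# Hypothetical attribute classifications produced by upstream governance tooling
COLUMN_CLASSIFICATIONS = {
    "customer_id": "internal",
    "email": "pii",
    "ssn": "pii",
    "purchase_total": "internal",
}

def mask(value: str) -> str:
    """Irreversibly mask a sensitive value with a truncated hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def enforce_policy(record: dict, requester_is_authorized: bool) -> dict:
    """Apply masking to PII columns unless the requester is authorized."""
    if requester_is_authorized:
        return record
    return {
        col: mask(str(val)) if COLUMN_CLASSIFICATIONS.get(col) == "pii" else val
        for col, val in record.items()
    }

row = {"customer_id": 42, "email": "ana@example.com", "ssn": "123-45-6789", "purchase_total": 99.5}
print(enforce_policy(row, requester_is_authorized=False))
```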
Data Quality
Data Quality was important 20 years ago and is even more critical today. This pillar establishes trust in using data to make business decisions. Today’s world is measured by the speed of data monetization. How quickly can you access data – whether that’s making predictions that generate sales via a machine learning model or identifying customer purchasing patterns via business intelligence reports? Data trust is a fundamental part of achieving data monetization – data needs to be accurate, timely, complete, unique, and consistent. A focus on Data Quality avoids data becoming stale and untrustworthy. Previous approaches only focused on Data Quality at rest, such as within a particular data repository that is more passive in nature. While this is still required today, it needs to be combined with applying Data Quality to in-flight data. Data Quality needs to keep up with the fluidity of data and take a proactive approach. This means Data Quality needs to be embedded into operational processes where it can be continuously monitored. This is called data observability.
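As a minimal sketch of what embedding Data Quality into an operational process can look like, the following Python snippet profiles a batch in flight and raises an alert when a completeness threshold is missed. The metrics, threshold, and alerting mechanism are illustrative assumptions:

```python
# A minimal sketch of an in-flight Data Quality check embedded in a pipeline step,
# illustrating the data observability idea. Metrics and thresholds are illustrative assumptions.
from datetime import datetime, timezone

def profile_batch(rows: list[dict]) -> dict:
    """Compute simple quality metrics for a batch of records."""
    total = len(rows)
    missing_email = sum(1 for r in rows if not r.get("email"))
    duplicate_ids = total - len({r["customer_id"] for r in rows})
    return {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "row_count": total,
        "completeness_email": 1 - missing_email / total if total else 0.0,
        "duplicate_ids": duplicate_ids,
    }

def monitor(metrics: dict, completeness_threshold: float = 0.95) -> None:
    """Raise an alert instead of silently loading low-quality data downstream."""
    if metrics["completeness_email"] < completeness_threshold or metrics["duplicate_ids"] > 0:
        print(f"ALERT: quality below threshold -> {metrics}")  # could publish an event instead
    else:
        print(f"OK: {metrics}")

batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 2, "email": "b@example.com"},
]
monitor(profile_batch(batch))
```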
Data Lineage
Data Lineage establishes confidence in the use of data. It accomplishes this by providing an audit trail starting with the data source, target destinations, and changes to data along the way, such as transformations. Data Lineage creates an understanding of the evolution of data throughout its life cycle. Data provenance is especially important for demonstrating compliance with regulatory policies. Data Lineage must keep track of the life cycle of data by capturing its history from start to finish across all data sources, processes, and systems with details of each activity at a granular attribute level.
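A minimal sketch of capturing one lineage event at attribute-level granularity is shown below; the record shape and field names are illustrative assumptions, not a lineage standard:

```python
# A minimal sketch of attribute-level lineage capture for one transformation step.
# The record shape and field names are illustrative assumptions.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    source: str                 # where the data came from
    target: str                 # where it was written
    transformation: str         # what happened to it along the way
    attributes: list[str]       # attribute-level granularity
    run_id: str
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

event = LineageEvent(
    source="crm.customers",
    target="warehouse.dim_customer",
    transformation="trim + lowercase email, drop test accounts",
    attributes=["customer_id", "email", "country"],
    run_id="pipeline-2024-06-01-001",
)
print(asdict(event))  # in practice this would be published to a lineage/metadata store
```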
Master Data Management
As far back as I remember, it has always been a challenge to reconcile, integrate, and manage master data. Master Data Management creates a 360-degree trusted view of scattered master data. Master data, such as that of customers, products, and employees, is consolidated with the intention of generating insights leading to business growth. Master Data Management requires a mature level of data modeling that evaluates several data identifiers, business rules, and data designs across different systems. Metaphorically, you can envision a keyring with an enormous number of keys. Each key represents a unique master data identity, such as a customer. The keyring represents the necessary data harmonization that is realized via Master Data Management across all systems that manage that customer. This is a huge undertaking that requires a mature level of Data Governance and includes Data Quality analysis and data cleansing, such as data deduplication, data standards, and business rules validation.
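The keyring metaphor can be sketched in a few lines of Python. The matching rule (a normalized email address) is an illustrative assumption; real Master Data Management combines many identifiers, business rules, and survivorship logic:

```python
# A minimal sketch of master data matching: resolving records from two systems
# into one golden customer identity. The single match key is an illustrative assumption.
def normalize(record: dict) -> str:
    """Build a simple match key; production systems combine many identifiers."""
    return record["email"].strip().lower()

crm = [{"id": "CRM-1", "name": "Ana Diaz", "email": "Ana.Diaz@Example.com"}]
billing = [{"id": "BIL-9", "name": "A. Diaz", "email": "ana.diaz@example.com "}]

golden: dict[str, dict] = {}
for system, records in (("crm", crm), ("billing", billing)):
    for rec in records:
        key = normalize(rec)
        entry = golden.setdefault(key, {"match_key": key, "source_ids": [], "name": rec["name"]})
        entry["source_ids"].append(f"{system}:{rec['id']}")

print(list(golden.values()))
# -> one consolidated identity pointing back to both source records
```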
Metadata Management
The high-quality collection, integration, storage, and standardization of metadata comprise Metadata Management. Metadata, like data, needs to be accurate, complete, and reliable. Data must be mapped with business semantics to make it easy to discover, understand, and use. Metadata needs to be collected by following established interoperability standards across all tools, processes, and system touchpoints to build knowledge surrounding the data. There are four types of metadata that must be actively collected, as illustrated in the sketch after this list:
- Technical metadata: Data structure based on where data resides. This includes details such as source, schema, file, table, and attributes.
- Business metadata: Provides business context to the data. This includes the capturing of business terms, synonyms, descriptions, data domains, and rules surrounding the business language and data use in an organization.
- Operational metadata: Metadata produced in a transactional setting, such as during runtime execution. This includes data processing-based activities, such as pipeline execution, with details such as start and end time, data scope, owner, job status, logs, and error messages.
- Social metadata: Social-based activities surrounding data. This metadata type has become more prominent as it focuses on putting yourself in a customer’s shoes and understanding what they care about. What are their interests and social patterns? An example includes statistics that track the number of times data was accessed, by how many users, and for what reason.
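To make the four types concrete, here is a minimal sketch of how they might be gathered around a single dataset; the structure and field names are illustrative assumptions only:

```python
# A minimal sketch of the four metadata types collected around one dataset.
# The structure and field names are illustrative assumptions.
dataset_metadata = {
    "technical": {            # where and how the data physically resides
        "source": "postgres://sales-db",
        "schema": "public",
        "table": "orders",
        "attributes": ["order_id", "customer_id", "order_total", "created_at"],
    },
    "business": {             # business context and vocabulary
        "term": "Customer Order",
        "description": "A confirmed purchase placed by a customer.",
        "domain": "Sales",
        "synonyms": ["purchase", "transaction"],
    },
    "operational": {          # produced at runtime by data processing
        "pipeline": "orders_daily_load",
        "last_run_status": "succeeded",
        "started_at": "2024-06-01T02:00:00Z",
        "ended_at": "2024-06-01T02:14:32Z",
    },
    "social": {               # usage and collaboration signals
        "access_count_30d": 412,
        "distinct_users_30d": 37,
        "top_use_case": "weekly revenue dashboard",
    },
}
print(dataset_metadata["business"]["term"])
```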
Metadata Management is the underpinning of Data Fabric’s knowledge layer. Let’s discuss this layer in the next section.
Knowledge layer
The knowledge layer manages semantics, knowledge, relationships, data, and different types of metadata. It is one of the differentiating qualities of Data Fabric design when compared to other data architecture approaches. Metadata is managed and collected across the entire data ecosystem. A multitude of data relationships is managed across a variety of data and metadata types (technical, business, operational, and social) with the objective of deriving knowledge from an accurate, complete metadata view. The underlying technology is typically a knowledge graph.
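As a minimal sketch of the idea, the knowledge layer can be pictured as a set of subject-predicate-object triples linking datasets, business terms, owners, and policies. The entities and relationships below are illustrative assumptions; a real implementation would typically use a knowledge graph database and formal ontologies:

```python
# A minimal sketch of the knowledge layer as subject-predicate-object triples.
# Entities and relationships are illustrative assumptions.
triples = [
    ("warehouse.dim_customer", "is_instance_of", "Dataset"),
    ("warehouse.dim_customer", "implements_term", "Customer"),
    ("Customer", "belongs_to_domain", "Sales"),
    ("warehouse.dim_customer", "owned_by", "sales-data-team"),
    ("warehouse.dim_customer.email", "classified_as", "PII"),
    ("PII", "governed_by", "GDPR-masking-policy"),
]

def related(entity: str) -> list[tuple[str, str, str]]:
    """Traverse one hop from an entity; graph databases generalize this to full paths."""
    return [t for t in triples if entity in (t[0], t[2])]

for triple in related("warehouse.dim_customer"):
    print(triple)
```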
The right balance of automation and business domain input is necessary in the knowledge layer. At the end of the day, technology can automate repetitive tasks, classify sensitive data, map business terms, and infer relationships, but it cannot generate tribal knowledge. Business context needs to be captured and modeled effectively in the form of ontologies, taxonomies, and other business data models that are then reflected as part of the knowledge layer. Intelligent agents proactively monitor and analyze technical, business, operational, and social metadata to derive insights and take action with the goal of driving operational improvements. This is where active metadata plays a major role. According to Gartner, “Active metadata is the continuous analysis of all available users, data management, systems/infrastructure and data governance experience reports to determine the alignment and exception cases between data as designed versus actual experience” (https://www.gartner.com/document/4004082?ref=XMLDELV).
Let’s take a look at what active metadata is.
Moving from passive to active metadata
To understand active metadata, you must first understand what defines passive metadata and why it’s insufficient on its own to handle today’s big data demands.
Passive metadata
Passive metadata is the basic entry point of Metadata Management. Data catalogs are the de facto tools used to capture metadata about data. Basic metadata collection includes technical and business metadata, such as business descriptions. Passive metadata primarily relies on a human in the loop, such as a data steward, to manually manage and enter metadata via a tool such as a data catalog. While a data catalog is a great tool, it requires the right processes, automation, and surrounding ecosystem to scale and address business needs. Primarily relying on a human in the loop for Metadata Management creates a major dependency on the availability of data stewards in executing metadata curation activities. These efforts are labor intensive and can take months or years to complete. Even if initially successful, metadata will quickly get out of sync and become stale. This will diminish the value and trust in the metadata.
Another point here is that metadata is continuously generated by different tools and typically sits in silos without any context or connection to the rest of the data ecosystem; this, too, is passive metadata. Tribal knowledge that comes from a human in the loop will always be critical. However, it needs to be combined with automation along with other advanced technologies, such as machine learning, AI, and graph databases.
Active metadata
Now that you understand passive metadata, let’s dive into active metadata. It’s all about the word active. It’s focused on acting on the findings from passive metadata. Active metadata contains insights about passive metadata, such as Data Quality scores, inferred relationships, and usage information. Active metadata processes and analyzes passive metadata (technical, business, operational, and social) to identify trends and patterns. It uses technologies such as machine learning and AI as part of a recommendation engine to suggest improvements and more efficient ways of executing data management. It can also help with discovering new relationships by leveraging graph databases. Examples of active metadata in action include a failed data pipeline producing incomplete data, where downstream consumption is automatically disabled; a high-quality dataset being suggested in place of the current low-quality one; or an alert raised when a data producer does not meet its Service Level Agreements. Alerts that generate action by both data producers and consumers represent active metadata.
Event-based model
The missing link in the active Metadata Management story is the ability to manage metadata live and always have the latest version of it. To drive automation, an event-based model that manages metadata and triggers instantaneous metadata updates is necessary. This allows metadata collection, integration, and storage to be completed in near real time, with the results distributed to and processed by subscribed tools and services. This is required in order to achieve active metadata in a Data Fabric design.
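A minimal sketch of the event-based idea follows, using an in-process publish/subscribe dispatcher. The event names and subscriber behavior are illustrative assumptions; a real design would use a message broker rather than in-process callbacks:

```python
# A minimal sketch of an event-based model for metadata: producers publish
# metadata-change events and subscribed services react in near real time.
# Event names and subscriber behavior are illustrative assumptions.
from collections import defaultdict
from typing import Callable

subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(event_type: str, handler: Callable[[dict], None]) -> None:
    subscribers[event_type].append(handler)

def publish(event_type: str, payload: dict) -> None:
    for handler in subscribers[event_type]:
        handler(payload)

# A catalog service keeps its metadata fresh...
subscribe("schema.changed", lambda e: print(f"catalog: refresh entry for {e['table']}"))
# ...while a governance service re-evaluates policies on the same event.
subscribe("schema.changed", lambda e: print(f"governance: rescan {e['table']} for sensitive columns"))

# A pipeline detects a schema change at runtime and publishes it immediately.
publish("schema.changed", {"table": "warehouse.dim_customer", "added_columns": ["loyalty_tier"]})
```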
Let’s review the next Data Fabric building block, Data Integration.
Data Integration
Data interoperability permits data to be shared and used across different data systems. As data is structured in different ways and formats across different systems, it’s necessary to establish Data Integration standards to enable the ease of data exchange and data sharing. A semantic understanding needs to be coupled with Data Integration in order to enable data discovery, understanding, and use. This helps achieve data interoperability. Another aspect of Data Integration in Data Fabric design is leveraging DataOps best practices and principles as part of the development cycle. This will be discussed in more depth in Chapter 4, Introducing DataOps.
Data ingestion
Data ingestion is the process of consuming data from a source and storing it in a target system. Diverse data ingestion approaches and source interfaces need to be supported in a Data Fabric design that aligns with the business strategy. A traditional approach is batch processing, where data is grouped in large volumes and then processed later based on a schedule. Batch processing usually takes place during offline hours and does not impact system performance, although it can be done ad hoc. It also offers the opportunity to correct data integrity issues before they escalate. Another approach is real-time processing, such as in streaming technology. As transactions or events occur, data is immediately processed in response to an event. Real-time data is processed as smaller data chunks containing the latest data. However, data processing requires intricate management to correlate, group, and map data with the right level of context, state, and completeness. It also requires applying a different Data Governance approach when compared to batch processing. These are all factors that a Data Fabric design considers.
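The contrast between the two ingestion approaches can be sketched as follows; the in-memory source and sink and the micro-batching behavior are illustrative assumptions standing in for a scheduler and a streaming platform:

```python
# A minimal sketch contrasting batch and (simulated) real-time ingestion.
# Source, sink, and micro-batching behavior are illustrative assumptions.
from typing import Iterable, Iterator

def ingest_batch(rows: list[dict], sink: list[dict]) -> None:
    """Batch: collect a large set of records and load them on a schedule."""
    sink.extend(rows)  # single bulk write, typically during off-peak hours

def ingest_stream(events: Iterable[dict], sink: list[dict], chunk_size: int = 2) -> Iterator[int]:
    """Streaming: process small chunks as events arrive, yielding after each write."""
    buffer: list[dict] = []
    for event in events:
        buffer.append(event)
        if len(buffer) >= chunk_size:
            sink.extend(buffer)
            yield len(buffer)
            buffer = []
    if buffer:
        sink.extend(buffer)
        yield len(buffer)

target: list[dict] = []
ingest_batch([{"id": i} for i in range(1000)], target)
for written in ingest_stream(({"id": i} for i in range(5)), target):
    print(f"wrote chunk of {written} events")
```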
Data transformation
Data transformation takes a source data model that is different from a target data model and reformats the source data model to fit the target model. Typically, extract, transform, and load (ETL) is the traditional approach to achieve this. This is largely applied in relational database management systems (RDBMSs) that require data to be loaded based on the specifications of a target data model, as opposed to NoSQL systems, where the extract, load, and transform (ELT) approach is executed. In ELT, such as with a data lake, there isn’t a prerequisite of modeling data to fit a target model. Due to the inherent nature of NoSQL databases, the structure is more flexible. Another data transformation approach is programmatic, where concepts such as managing data as code apply. When it comes to data transformation, a variety of data formats, such as structured, semi-structured, and unstructured, needs to be processed and handled. Data cleansing is an example of when data transformation will be necessary.
What is data as code?
Data as code represents a set of best practices and principles as part of a DataOps framework. It focuses on managing data similarly to how code is managed, applying concepts such as version control, continuous integration, continuous delivery, and continuous deployment. Policy as code is another variation of data as code.
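Returning to the ETL and ELT approaches described earlier, here is a minimal sketch contrasting them over the same source rows. The transformation (normalizing country codes) and the in-memory stand-ins for a warehouse and a data lake are illustrative assumptions:

```python
# A minimal sketch contrasting ETL and ELT over the same source rows.
# The transformation and in-memory "systems" are illustrative assumptions.
source_rows = [
    {"customer_id": 1, "country": "united states"},
    {"customer_id": 2, "country": "DE"},
]

def to_country_code(value: str) -> str:
    return {"united states": "US"}.get(value.lower(), value.upper())

# ETL: transform before loading, so the target only ever holds conforming data.
warehouse = [
    {"customer_id": r["customer_id"], "country_code": to_country_code(r["country"])}
    for r in source_rows
]

# ELT: load raw data first, transform later inside the target system.
data_lake_raw = list(source_rows)                      # extract + load as-is
data_lake_curated = [                                  # transform on demand
    {"customer_id": r["customer_id"], "country_code": to_country_code(r["country"])}
    for r in data_lake_raw
]

print(warehouse == data_lake_curated)  # same outcome, different ordering of steps
```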
Data Integration can be achieved through physical Data Integration, where one or more datasets are moved from a source to a target system using ETL/ELT or programmatic approaches. It can also take place virtually via data virtualization technology that creates a distributed query that unifies data, building a logical view across diverse source systems. This approach does not create a data copy. In a Data Fabric design, all these integration and processing approaches should be supported and integrated to work cohesively with the Data Governance building block, the knowledge layer, and the Self-Service building block.
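As a minimal sketch of the data virtualization idea, the following builds a logical customer view that pulls from two "systems" at query time without copying data into a new store. The sources and join logic are illustrative assumptions; a real virtualization engine would optimize and push down distributed queries:

```python
# A minimal sketch of data virtualization: a logical view answered at query time
# from multiple sources, with no data copy. Sources and join logic are illustrative assumptions.
crm_system = {1: {"name": "Ana Diaz"}, 2: {"name": "Ben Ito"}}
billing_system = {1: {"balance": 120.0}, 2: {"balance": 0.0}}

def customer_360_view(customer_id: int) -> dict:
    """Unify records from both systems on demand; nothing is materialized."""
    return {
        "customer_id": customer_id,
        **crm_system.get(customer_id, {}),
        **billing_system.get(customer_id, {}),
    }

print(customer_360_view(1))  # {'customer_id': 1, 'name': 'Ana Diaz', 'balance': 120.0}
```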
Self-Service
Self-Service data sharing is a key objective in an organization’s digital transformation journey. It is the reason all the management, rigor, governance, security, and controls are in place: to share and deliver trusted data that can be used to create profit. Having quick access to the right high-quality data when it is needed is golden. Data Products are business-ready, reusable data that has been operationalized to achieve high-scale data sharing. The goal of a Data Product is to derive business value from data. In Chapter 3, Choosing between Data Fabric and Data Mesh, and Chapter 7, Designing Data Governance, we touch more on Data Products. Self-Service data moves away from a model that relies on a central IT team to deliver data to one that democratizes data for a diverse set of use cases with high quality and governance. Making data easily accessible to technical users, such as data engineers, data scientists, and software developers, as well as business roles, such as business analysts, is the end goal.
A Data Fabric design enables technical and business users to quickly search, find, and access the data they need in a Self-Service manner. On the other hand, it also enables data producers to manage data with the right level of protection and security before they open the doors to the data kingdom. The Self-Service building block works proactively and symmetrically with the Data Integration and Data Governance building blocks. This is where the true power of a Data Fabric design is exhibited.