We examine the challenges organizations face in managing the complexity of the modern data stack and look toward a 6th Data Platform, one assembled from modular, multi-vendor components. dbt Labs has positioned itself as a leader in turning data into business-level models, a central component of metadata management.
Research by Rob Strechay and George Gilbert, with a video conversation with Drew Banin, co-founder of dbt Labs
Our position
We believe that no single data platform will conquer them all, and that organizations will continue to separate the compute or execution layer from the data storage layer in their composable data platforms. In our research, we see organizations looking for ways to address their major challenges: the cost of compute and storage, the number of systems to manage, understanding what data they have, and governing the use of that data. This research note digs into the last two issues: knowing what data you have and who can use it.
Full video interview with Drew Banin, co-founder of dbt Labs
The Market
We see customers taking notice of metadata and governance platform solutions offered by their existing vendors, such as Informatica. The company reported a 37% year-over-year increase in Cloud Subscription Annual Recurring Revenue (ARR), to $617 million, in its fourth-quarter and full-year 2023 results. We also see growth among emerging platforms like Starburst, with Trino and its emerging governance platform, and dbt Labs, which is becoming the authoring and execution platform for turning data into business-level models. Our partner ETR reports that both are growing in citation-weight net sentiment.
Source-conformed to Business-conformed Data
In this chapter of the 6th Data Platform, we examine the intricacies of extending dbt’s pipeline or Directed Acyclic Graph (DAG), expanding the metadata that dbt and partners can generate from that DAG, and the nuances of federated data topologies.
To understand how data is transformed within dbt’s platform, we need to follow its journey from raw source data to refined data products. This transformation is a complex process that involves multiple steps, such as filtering, joining, and aggregating data sets. Using SQL and Python models, dbt builds a detailed map of these transformations, the DAG. This map is not just a technical blueprint but also a living document that tracks the lineage of data. Because it is version-controlled, stakeholders can collaborate on it and modify it over time.
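As a rough illustration of what a transformation DAG encodes, the sketch below orders a handful of models so that each one is built only after the models it depends on. It is not dbt's implementation: the model names are hypothetical, and the ordering uses Python's standard-library `graphlib` rather than anything from dbt itself.

```python
from graphlib import TopologicalSorter

# Hypothetical models (not from the source). Each key depends on the
# models listed in its set, mirroring how one transformation reads
# from the outputs of earlier ones.
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "orders_enriched": {"stg_orders", "stg_customers"},
    "revenue_by_customer": {"orders_enriched"},
}


def build_order(deps):
    """Return one valid build order: every model appears after its parents."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
print(order)
```

Any order the function returns places the raw sources first and `revenue_by_customer` last. That is the property a pipeline tool relies on when deciding what to run and when.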
What is a Data Mesh
The dbt Mesh paradigm emerged as a response to the scholarly definition of the data mesh concept. Originally designed for composability and to represent the data mesh, dbt Mesh has expanded to address the governance challenges of building composable transformations in decentralized data environments. With the ability to define data contracts within dbt Mesh, diverse teams can efficiently handle data stewardship. This paradigm draws inspiration from established software engineering workflows, such as the use of contracts with APIs. By bringing these workflows into the data sphere, dbt Mesh helps teams manage complexity and scale collaboration.
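The idea of a data contract, borrowed from API contracts, can be sketched in a few lines. The example below is only an analogy in plain Python, not dbt's contract syntax or enforcement mechanism. The column names, the `CONTRACT` mapping, and the `contract_violations` helper are all hypothetical.

```python
# Hypothetical contract: column name -> expected Python type.
# A producing team publishes this; consuming teams rely on it.
CONTRACT = {"customer_id": int, "email": str, "lifetime_value": float}


def contract_violations(rows, contract):
    """Return human-readable violations; an empty list means the rows conform."""
    violations = []
    for i, row in enumerate(rows):
        # Columns the contract promises but the row does not provide.
        for col in sorted(contract.keys() - row.keys()):
            violations.append(f"row {i}: missing column '{col}'")
        # Columns that are present but carry the wrong type.
        for col, expected in contract.items():
            if col in row and not isinstance(row[col], expected):
                actual = type(row[col]).__name__
                violations.append(
                    f"row {i}: column '{col}' expected {expected.__name__}, got {actual}"
                )
    return violations


good = [{"customer_id": 1, "email": "a@example.com", "lifetime_value": 10.0}]
bad = [{"customer_id": "1", "email": "a@example.com"}]

print(contract_violations(good, CONTRACT))
print(contract_violations(bad, CONTRACT))
```

A check like this runs on the producer's side before data is published, so a breaking change is caught by the team that made it rather than by the teams downstream.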
The Challenges and Solutions
The problem of scaling across data silos is not limited to data platforms. The same scalability challenges arise in dbt and other mesh architectures, especially when it comes to supporting multiple compute or execution engines. Although dbt can’t do this today, the company is considering it. This may be necessary for customers who serve pipeline processing or BI dashboards using separate execution engines. dbt serves as a logical layer, not an execution engine like Databricks, Snowflake, MongoDB, Oracle, or BigQuery. Compatibility with different underlying data platforms is a key feature of dbt. Despite challenges such as compliance and data sovereignty, dbt’s logical orientation makes it well-suited to provide governed, cross-platform data solutions.
Many organizations have to deal with multiple data silos, which can make metadata difficult to manage. To address this issue, dbt has introduced the Discovery API, which offers comprehensive metadata insights. This new feature enables users and partner ISVs to view the lineage of their entire dbt project, which can help them better understand data quality and recency. Moreover, column-level lineage is now available, which can greatly enhance impact and root cause analyses.
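To make the link between lineage and these two analyses concrete, here is a minimal sketch over a hand-written lineage graph. It does not call the Discovery API. The `LINEAGE` mapping, the node names, and both helper functions are hypothetical. Impact analysis walks downstream from a changed node, and root cause analysis walks upstream from a broken one.

```python
from collections import deque

# Hypothetical lineage edges: parent -> list of direct children.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["orders_enriched"],
    "stg_customers": ["orders_enriched"],
    "orders_enriched": ["revenue_by_customer", "churn_features"],
}


def downstream(graph, start):
    """Impact analysis: everything that could be affected if `start` changes."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def upstream(graph, start):
    """Root cause analysis: everything `start` was derived from."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    # Walking the reversed graph downstream is walking the original upstream.
    return downstream(reverse, start)


print(sorted(downstream(LINEAGE, "stg_orders")))
print(sorted(upstream(LINEAGE, "revenue_by_customer")))
```

The same traversal works at column granularity if the nodes are columns rather than models, which is why column-level lineage sharpens both analyses.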
dbt Labs’ strategic acquisition of Transform in February 2023 added a new dimension of capabilities: a semantic layer that promises to streamline the operationalization of metrics at enterprise scale. The goal is for the semantic layer to become the de facto standard for metric definitions, enabling data analysts and business stakeholders to coalesce around a single, immutable source of truth.
What about Governance
Governance can be a multifaceted and sometimes nebulous concept. It can be broken down into three pillars: visibility, change management, and data quality. Effective governance hinges on comprehensive visibility into the data’s lifecycle within an organization. This transparency enables a deep understanding of how data is transformed and applied, paving the way for informed governance practices.
SQL versus Python
The question of whether semantics will grow in SQL or will require Python for more expressivity was top of mind. Part of this debate centers on how organizations will ensure consistency and reliability when integrating Python models into dbt, given the diverse nature of Python code. While Python’s expressive power was acknowledged as a perceived advantage, SQL was noted to be the dominant language of analytics. We believe SQL’s capabilities will continue to expand, driven by the evolution of data platforms, blurring the lines around functionality traditionally exclusive to Python.
Our Perspective
Our ongoing analysis of the evolving landscape of data platforms highlights the shift toward modular, multi-vendor systems exemplified in the 6th Data Platform. Central to this transformation is the metadata management and governance “control plane”, where dbt Labs is positioned to be a pivotal player in turning data into actionable business models and advancing metadata management. This underscores the growing importance of metadata and governance solutions, the challenges in scalability and data management, and the strategic integration of technologies like dbt Mesh and the Discovery API. The acquisition of Transform by dbt Labs, which enhances metric operationalization, and the ongoing debate over the roles of SQL and Python in analytics reflect the dynamic nature of this field. This evolution points to a future where data platforms are more adaptable, integrated, and focused on efficient data governance and enhanced metadata visibility, driving informed decision-making and innovation in data analytics.