Effective Data Governance Can Only Exist Within Data Orchestration
Data Governance tools and solutions operate “ex-post” and cannot prevent erroneous or unauthorised changes to data
Introduction
There’s an increasing focus on Data Governance and the concept of Control and Security in the data world. In Tomasz Tunguz’s latest post, we see that amongst all the data-adjacent software categories, security and “data as a service” businesses are growing at 29% and 23% respectively, far faster than other categories of software. This is interesting because it evidences clear demand for security and control in public markets, among generally larger companies with bigger customers, which should be indicative of the demand for such services in smaller companies within tech and beyond.
The natural question is:
How can data practitioners, data engineers, and data architects implement appropriate security measures when building both data teams and data companies?
In this article, I’ll argue it will be by doing something not done before: introducing governance and monitoring into the orchestration layer. There are various tools out there that claim to solve governance; however, they are in many regards ex-post rather than preventative. I believe something as important as security and governance deserves both a preventative and a reactive element, and in this article I’ll show you how.
Definitions
Before launching into Data Governance, we should define what it is and what it’s for. I define Data Governance as:
The practice of ensuring data is stored, used, accessed and shared correctly within legal and operational guidelines
Google define it as:
Data governance is a principled approach to managing data during its life cycle, from acquisition to use to disposal.
Personally, I prefer my definition. It should be evident that applying the proper storage, use, access and sharing protocols to data has numerous benefits, such as efficacy (the right data is used), risk management (in the case of PII, its distribution is monitored and controlled) and efficiency (no work is duplicated).
With these definitions in place, let’s analyse data governance tools against them.
The current state of play — data hindsight
For the most part, there are no dedicated “Data Governance Tools”. There are, however, Data Catalogs, such as Collibra, Atlan, and Alation. Architecturally, these tools are structured in a fairly straightforward way: users give catalogs credentials and permission to fetch metadata. The catalogs collate that metadata, then clean and harmonise it into a sensible data model. This allows catalogs to present data pipeline metadata via aesthetically-pleasing UIs and to offer features like data lineage. Data Catalogs are traditionally expensive tools designed for enterprises; however, there are lots of new “lightweight” catalogs accessible to smaller companies, like Secoda and CastorDoc.
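To make that architecture concrete, here is a minimal sketch of the harvesting step a catalog might perform, assuming a Snowflake source. The connection details and the harmonised ColumnMetadata model are illustrative placeholders, not any vendor’s actual implementation.

```python
# A minimal sketch of a catalog's metadata harvester, assuming Snowflake.
# The "harmonised" ColumnMetadata model is an illustrative placeholder.
from dataclasses import dataclass

import snowflake.connector  # pip install snowflake-connector-python


@dataclass
class ColumnMetadata:
    """One row of the catalog's harmonised data model."""
    database: str
    schema: str
    table: str
    column: str
    data_type: str


def harvest_columns(user: str, password: str, account: str, database: str) -> list[ColumnMetadata]:
    """Fetch column-level metadata only; the catalog never reads row data."""
    conn = snowflake.connector.connect(
        user=user, password=password, account=account, database=database
    )
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT table_catalog, table_schema, table_name, column_name, data_type "
            "FROM information_schema.columns"
        )
        return [ColumnMetadata(*row) for row in cur.fetchall()]
    finally:
        conn.close()
```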
These platforms often talk about “Data Discoverability”, which is, in many ways, linked to Governance for the reasons mentioned earlier. However, Catalogs also have features such as PII identification, which can inform the user of where PII is likely to reside. Catalogs also include an element of reporting on role-based access control (“RBAC”), allowing administrators to see who has access to what data. Finally, lineage helps users see what data is being used for what purpose.
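In its simplest form, PII identification can be a set of heuristics over column names. The sketch below is a deliberately naive, name-based illustration; real catalogs typically combine heuristics like these with profiling of the data itself.

```python
import re

# Column-name patterns that commonly indicate PII. This is the simplest
# possible approach; it is illustrative, not any catalog's actual logic.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "national_id": re.compile(r"ssn|passport|national[-_]?id", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
}


def flag_likely_pii(column_names: list[str]) -> dict[str, str]:
    """Map each column that matches a pattern to the kind of PII suspected."""
    flagged = {}
    for column in column_names:
        for kind, pattern in PII_PATTERNS.items():
            if pattern.search(column):
                flagged[column] = kind
                break
    return flagged


print(flag_likely_pii(["customer_email", "order_total", "first_name"]))
# {'customer_email': 'email', 'first_name': 'name'}
```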
From my perspective, this is great. However, it is reactive and therefore constitutes data hindsight. In an organisation that is very large (and therefore slow to monitor) or that moves very quickly, unless a SOC-Analyst-type person sits watching the data catalog, by the time a breach in Governance surfaces it could well be too late. In respect of Data Governance, then, this is very much acting with the benefit of hindsight, or “Data Hindsight”.
Modern Data Governance — Orchestration and Governance combined
Fundamentally, data teams are shifting from manual to programmatic control of their data pipelines. There are plenty of engineers using Terraform in their data pipelines, and discussions of how helpful it is abound.
This is brilliant for Governance because it makes explicit what builds pipelines, what builds data assets, and what dependencies exist.
Data Build Tool (dbt) illustrates this quite well. If you look at its setup docs for, say, Snowflake, the commands mean that the programmatic dbt user essentially becomes the owner of the tables and views maintained in Snowflake. This is important because it means that data in Snowflake has no individual human owner in respect of creating, updating and deleting that data.
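For a concrete, hedged picture of that ownership model (the role, user, and schema names below are illustrative, and the canonical commands live in dbt’s own docs), the bootstrap might be run like this via Snowflake’s Python connector:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Illustrative names only; dbt's setup docs use their own conventions.
# Authentication for the service user (e.g. key-pair auth) is omitted here.
SETUP_STATEMENTS = [
    "CREATE ROLE IF NOT EXISTS TRANSFORMER",
    "CREATE USER IF NOT EXISTS DBT_USER DEFAULT_ROLE = TRANSFORMER",
    "GRANT ROLE TRANSFORMER TO USER DBT_USER",
    # Ownership of the schema (and hence the tables/views dbt builds in it)
    # sits with the programmatic role, not with any individual human.
    "GRANT OWNERSHIP ON SCHEMA ANALYTICS.DBT TO ROLE TRANSFORMER COPY CURRENT GRANTS",
]


def bootstrap_dbt_user(user: str, password: str, account: str) -> None:
    """Run the one-off setup as an administrator."""
    conn = snowflake.connector.connect(
        user=user, password=password, account=account, role="SECURITYADMIN"
    )
    try:
        cur = conn.cursor()
        for statement in SETUP_STATEMENTS:
            cur.execute(statement)
    finally:
        conn.close()
```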
There is, of course, the question of who has access to that data. However, in a world where Infrastructure-as-Code is gaining popularity and programmatic access and orchestration solutions are the de facto way to trigger the materialisation of data assets, “Who has access?” is not the important question.
The important question is: “Who controls that which has access?”
In this simple case of someone using Snowflake and dbt, there could feasibly be two Snowflake users that interact with the data: the DBT_USER and perhaps a POWER_BI user. The question then becomes: who has the power to utilise the DBT_USER and the POWER_BI user?
The answer lies in Data Orchestration, or your Workflow Orchestration Tool, if you prefer. If you’re using a single control plane for building, triggering and monitoring, this is where the truly important role-based access control comes into play.
By moving RBAC into the orchestration tool, Data Architects can suddenly control exactly who has access to which resources in a company’s data stack (with respect to non-read permissions). This is powerful because it means that changes to data (which necessarily require non-read permissions like write or delete) must be A) granted in the Orchestration layer and B) actioned by the Orchestration layer. Furthermore, an Orchestration layer that supports monitoring of pipelines has an audit log that can inform administrators of who was granted access to which resources, what operations ran, and when. This enables preventative rather than reactive data governance.
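As a hedged sketch of this check-then-act pattern (with a hypothetical in-memory grant store and executor, not any real orchestrator’s API): the run is refused before anything happens unless the requesting person has been granted the programmatic user, and every decision lands in the audit log.

```python
import json
from datetime import datetime, timezone

# Hypothetical grant store: which people may act through which programmatic
# users. A real orchestrator would persist this behind its own RBAC admin UI.
GRANTS = {
    "alice": {"DBT_USER", "POWER_BI"},
    "bob": {"POWER_BI"},
}

AUDIT_LOG: list[str] = []  # append-only record of every access decision


def run_pipeline(pipeline: str, as_user: str) -> None:
    """Hypothetical executor that runs the pipeline as a programmatic user."""
    print(f"running {pipeline} as {as_user}")


def trigger_pipeline(principal: str, snowflake_user: str, pipeline: str) -> None:
    allowed = snowflake_user in GRANTS.get(principal, set())
    AUDIT_LOG.append(json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "principal": principal,
        "snowflake_user": snowflake_user,
        "pipeline": pipeline,
        "allowed": allowed,
    }))
    if not allowed:
        # Preventative: the change never happens, rather than being flagged later.
        raise PermissionError(f"{principal} may not act as {snowflake_user}")
    run_pipeline(pipeline, as_user=snowflake_user)


trigger_pipeline("alice", "DBT_USER", "daily_transform")  # allowed, runs
try:
    trigger_pipeline("bob", "DBT_USER", "daily_transform")
except PermissionError as err:
    print(err)  # denied before anything ran, and both decisions are logged
```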
Conclusion
In this article, we defined Data Governance and looked at how Data Governance functionality exists in tools like Data Catalogs. Evaluating Data Governance-specific tools was not in scope, and doing so would certainly paint the category in a more advanced light. Even so, today’s tools are reactive. The Data Governance tools of tomorrow will also be preventative, and must live in the orchestration layer.
A final point worth calling out is that monitoring for read-access in tools such as dashboards is necessarily reactive. In these “self-serve” scenarios, there is nothing to orchestrate — it’s hard to stop someone downloading some data they’re not supposed to see once it’s in the BI layer.
This, too, could be handled by an orchestrator through dynamically provisioning, or checking, current roles within tools. That metadata would also exist in the orchestration layer, providing a log that lets administrators either prevent or react to potential Data Governance breaches.
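A hedged sketch of that reconciliation, where fetch_bi_tool_roles is a hypothetical stand-in for a BI tool’s user-management API: a scheduled orchestrator task diffs the roles actually present downstream against the roles declared in the orchestration layer and reports any drift.

```python
# Desired state, declared in the orchestration layer.
DECLARED_ROLES = {
    "alice": "editor",
    "bob": "viewer",
}


def fetch_bi_tool_roles() -> dict[str, str]:
    """Hypothetical stand-in for calling a BI tool's user/role API."""
    return {"alice": "editor", "bob": "admin", "carol": "viewer"}


def reconcile_roles() -> list[str]:
    """Return human-readable drift findings for the audit log."""
    actual = fetch_bi_tool_roles()
    findings = []
    for user, role in actual.items():
        declared = DECLARED_ROLES.get(user)
        if declared is None:
            findings.append(f"{user} exists in the BI tool but is not declared")
        elif declared != role:
            findings.append(f"{user} is '{role}' but should be '{declared}'")
    return findings


for finding in reconcile_roles():
    print(finding)
# bob is 'admin' but should be 'viewer'
# carol exists in the BI tool but is not declared
```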
To date, I know of no Orchestration tools that do this (not even my company, Orchestra). The open-source projects are miles away from this, being more focussed on efficiently executing Python code. Data Governance is, however, a rapidly growing category, so it will be interesting to see what emerges to fill this void, and when the data community realises Data Governance can be preventative as well as reactive. 🚀