Why you need a Data Catalog to build Data Products
Catalogs are not just for enterprises; they help start-ups drive business value too
About me
Hello 👋 I’m Hugo Lu — I started my career working in M&A in London before moving to JUUL and falling into data engineering. After a brief stint back in finance, I headed up the Data function at London-based Fintech Codat. I’m now CEO at Orchestra, which is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently.
Introduction
At Orchestra, something I get asked a lot is how Data Engineers and Analytics Engineers can use our platform to accelerate the adoption of data products.
It’s always a difficult one to field because the term “Data Product” gets thrown around a lot, and the only consensus seems to be that there is no clear definition.
Is a Data Product a data set that is sold to third-party customers? Is a Data Product simply a software company that specialises in gathering and cleaning data? Is building a dashboard for your CEO technically a data product?
A question I prefer to answer is “How can businesses improve adoption of data?” or “How can businesses get more value from data?”. This is much easier to define and has some real, tangible takeaways. We’re going to do just that by taking a look at a type of tool that’s been very popular in recent years: the Data Catalog.
What are Data Catalogs?
A Data Catalog is an inventory of data assets used by an organisation.
From a technical perspective, Data Catalogs do the following things (NB: this is an oversimplification, but going into every feature of a Catalog isn’t what this article is about):
Maintain integrations and credentials with your data tools, such as Data Warehouses, BI tools, and productivity tools such as Jira
Present the metadata gathered from those tools using an asset-centric data model
Surface this information in a helpful way, so that non-technical or business users can get value from data
Offer some functionality to “push” data to assets, such as tagging them, assigning them owners, possibly updating them, or using them to send people alerts
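To make the four capabilities above a little more concrete, here is a minimal sketch of what an asset-centric model with "push" actions might look like. It is purely illustrative: `DataAsset`, `Catalog`, and every field and method name are hypothetical and do not correspond to any particular Catalog product's API, and real Catalogs populate this metadata through integrations rather than by hand.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One catalog entry. All field names are illustrative assumptions."""
    name: str
    source: str                      # e.g. "warehouse" or "bi_tool"
    description: str = ""
    owners: set[str] = field(default_factory=set)
    tags: set[str] = field(default_factory=set)
    depends_on: list[str] = field(default_factory=list)


class Catalog:
    """A toy in-memory inventory of data assets, keyed by asset name."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        self._assets[asset.name] = asset

    # "Push" actions: enrich an asset with metadata.
    def tag(self, name: str, tag: str) -> None:
        self._assets[name].tags.add(tag)

    def assign_owner(self, name: str, owner: str) -> None:
        self._assets[name].owners.add(owner)

    # "Surface" actions: simple views a business user might want.
    def search(self, term: str) -> list[str]:
        """Names of assets whose name, description, or tags mention term."""
        term = term.lower()
        return sorted(
            a.name
            for a in self._assets.values()
            if term in a.name.lower()
            or term in a.description.lower()
            or any(term in t.lower() for t in a.tags)
        )

    def unowned(self) -> list[str]:
        """Names of assets that nobody has claimed yet."""
        return sorted(a.name for a in self._assets.values() if not a.owners)


catalog = Catalog()
catalog.register(DataAsset("orders", "warehouse", "One row per customer order"))
catalog.register(
    DataAsset("revenue_dashboard", "bi_tool", "Weekly revenue", depends_on=["orders"])
)
catalog.tag("orders", "finance")
catalog.assign_owner("orders", "data-team@example.com")

print(catalog.search("finance"))
print(catalog.unowned())
```

The point of the sketch is the shape of the model: everything hangs off the asset, so questions like "what exists?" and "who owns it?" become simple lookups rather than SQL queries.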
The interfaces provided are designed to give data and operations managers the ability to easily assess the entirety of a data estate using a dashboard.
Pretty simple, right?
How these drive adoption
Catalogs are proven to drive adoption of data products and I believe this is due to the manner in which they are delivered.
If you’ve ever tried to ship self-service analytics before, you’ll know that teaching business users SQL and onboarding them to your data warehouse is not for the faint-hearted.
It is difficult enough to get people to use Dashboards, let alone build them and query the raw data. Indeed, there is a whole host of tools, like semantic layers, designed to automatically and efficiently present data in a helpful way.
Catalogs are like opinionated BI tools. Rather than present a layer of arbitrary diagrams on top of arbitrary data, Catalogs schema-force your metadata into a dashboard that’s proven and tested to be something users enjoy using.
It’s this crucial aspect that makes Catalogs the new BI, in my opinion.
Having an actual software platform / a portal is something Business Users are used to. It’s much less daunting to be able to view a dataset, its description, its owners, its SLA, its dependencies (and so on and so forth) rather than to simply be presented with the raw data and told to “go drive insights”.
It’s extremely powerful, therefore, in terms of driving adoption. The Catalog is the gateway drug due to its simplicity. It is the one place business users and technical users “go to meet” to understand what data exists, where, and how to get value from it.
Areas for improvement
There is no reason Data Catalogs couldn’t create a BI layer within them. In fact, it feels like the logical thing to do.
As data quality and “doing more with less” sweep data engineering and analytics engineering, one feels this pattern will come to the BI layer too. If 68% of data goes unused, then the problem is surely felt most strongly in the BI layer. Imagine doing all that awesome data engineering, just to have a 32% conversion on your work. That is mind-boggling inefficiency.
Catalogs offer a way for data practitioners to finally collaborate with business users effectively. Sure, the core features are probably only things enterprises need. Smaller businesses with a few dozen or even a hundred or so data assets do not need to shell out thousands of dollars on Collibra. However, the process of using a Catalog is surely beneficial, not to mention the power to onboard business users that would be unlocked if BI could be embedded within them 🚀
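The arithmetic behind that "conversion" figure is simple enough to spell out. The sketch below assumes, purely for illustration, that engineering effort is spread evenly across all data, so the share of unused data equals the share of wasted effort; the function names and the 100-hour figure are hypothetical.

```python
def used_share_pct(unused_pct: int) -> int:
    """Share of data that actually gets used, as a whole percentage."""
    if not 0 <= unused_pct <= 100:
        raise ValueError("unused_pct must be between 0 and 100")
    return 100 - unused_pct


def wasted_effort(hours_invested: float, unused_pct: int) -> float:
    """Hours spent on data nobody uses, assuming effort is spread evenly."""
    return hours_invested * unused_pct / 100


print(used_share_pct(68))       # the "conversion" on the work
print(wasted_effort(100, 68))   # hours lost out of a hypothetical 100
```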
If you enjoyed this, feel free to reach out to me on LinkedIn. I would love to discuss Catalogs with you.