Foreword
One of the biggest pains we see facing data engineers and analytics engineers is that they spend too much time maintaining boilerplate infrastructure and still have no visibility into pipeline failures.
It means you’re constantly fighting fires and don’t have time to focus on building. What’s worse is that the business doesn’t trust the data.
At Orchestra we’re building a Unified Control Plane for Data Ops. A Data Status page if you like, with some incredible features to give data teams their time back so they can focus on building. You can try it now, free, here.
Introduction
I recently read a fantastic article called the “Data Death Cycle”.
In it, the authors describe five common “traps” data teams fall into. These are:
The Tech Trap
The Doing Trap
The Project Trap
The Silo Trap
The Performance-First Trap
In this article, I’ll focus on the Silo Trap, which is all too common among Data Teams that struggle with a stream of ad hoc requests and poor uptake of Data Products by the Business.
The Silo Trap in Data and AI
The authors describe the Silo Trap as follows:
Data & AI initiatives are rarely the domain of a single team. Delivering real value requires collaboration across Engineering, Product, Sales, Marketing, Security, IT, and, crucially, the end-users. Yet, in a misguided attempt to streamline processes or manage complexity, these initiatives often fall into the “silo trap,” where work is done in isolation.
This is an extreme example of the issues of centralisation and decentralisation.
When a centralised data team is responsible for everything, it can stay lean. When requirements and discovery move to the business, a disconnect emerges between the people who know what is needed and the people who build it.
This is common in organisations that have the following structure:
Core Data Platform Team / Data Engineering
Analytics Engineering
Business Users
A Vicious Cycle
This particular variant of the trap plays out as follows.
In this case, a user tries to self-serve. They consult dashboards, perhaps a catalog, and realise they cannot.
They make a data request. This is annoying for the Data Team — they do not want to be treated like a service desk.
The request is for a new column. The Analytics team realise this requires a change to the code further upstream. There is some back and forth with platform and software: SWEs have forgotten to put this new column on the table.
Some time passes before the Data Team finally get back to the user. By then, the user has changed their mind about what they want, the delivered result does not fit the requirement, or the requirements themselves have changed.
The result? A loss of trust in the Data Team. Frustration from the business at a lack of insights. No Data-Enabled Decision Making.
The evolution — empowering users with SQL
Giving Data Analysts the power to transform data on their own without reliance on a data platform team has been game-changing.
It was what first allowed me to become “dangerous”: instead of relying on Excel to analyse stuff, a bit of SQL and BigQuery meant I could suddenly run analytics on millions of data points without being shackled by spreadsheets.
Empowering Business Users, or sometimes embedded Analysts, is very powerful when it comes to solving the problem above. The flow becomes:
In this case, there is less back and forth with the platform team because the User (in this case an embedded analyst or “Power User”) can do the modelling themselves.
However, there is still some back and forth with the platform team.
This reduces the length of the cycle and alleviates the severity of the problem, but doesn’t solve it completely.
The future — the power of the platform team, governed
Ideally, we would like to see the embedded analyst or user be able to completely self-serve.
However, this would require them to have a very high degree of technical proficiency. It is unrealistic to expect that a single person could have the skillsets of a platform engineer, analytics engineer and data analyst, as well as business context / know-how for the team they are in.
In the example above, the Analyst still has a dependency on the platform team because:
1. The platform team needs to add a new column to a data ingestion job
2. The platform team may need to make a change to the orchestration repo to run a new pipeline
3. The platform team needs to maintain the infrastructure that runs (1) and (2)
4. The platform team needs to set up alerting in (3) and also update the catalog in the orchestration pipeline
If there were a way to give Embedded Analysts the power of (1–4), or to reduce the need for managed infrastructure, complicated orchestration, custom alerting, catalog updating and so on, then the flow would look like this:
This is effectively a somewhat decentralised model where there is no need to fall into the “Silo Trap” because business users with a need for data are capable of self-serving.
Key to this is that the Central Data Team (who still exist) retain governance and visibility of everything embedded analysts do.
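To make “self-serve, but governed” a little more concrete, here is a minimal sketch in Python. It is not Orchestra’s API or any real library: `PipelineSpec`, `GovernancePolicy`, `review` and `submit` are hypothetical names invented for illustration. The assumption is that an embedded analyst declares what they need (the new column, the schedule, the alerting, the catalog entry) and the central team’s policy is checked automatically, with every submission recorded so the central team keeps visibility.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PipelineSpec:
    """Hypothetical declaration an embedded analyst submits instead of a ticket."""
    name: str
    owner: str
    source_table: str
    new_columns: List[str]
    schedule: str  # cron expression, e.g. "0 6 * * *"
    alert_channel: Optional[str] = None
    catalog_description: str = ""


@dataclass
class GovernancePolicy:
    """Hypothetical rules owned by the central data team."""
    allowed_sources: Set[str]
    require_alerting: bool = True
    require_catalog_entry: bool = True


def review(spec: PipelineSpec, policy: GovernancePolicy) -> List[str]:
    """Return policy violations; an empty list means the spec may run."""
    violations = []
    if spec.source_table not in policy.allowed_sources:
        violations.append(f"source '{spec.source_table}' is not approved")
    if policy.require_alerting and not spec.alert_channel:
        violations.append("no alert channel configured")
    if policy.require_catalog_entry and not spec.catalog_description.strip():
        violations.append("missing catalog description")
    return violations


# Central audit log: every submission is recorded, approved or not.
audit_log: List[Dict] = []


def submit(spec: PipelineSpec, policy: GovernancePolicy) -> str:
    """Check a spec against the policy and record the outcome."""
    violations = review(spec, policy)
    status = "rejected" if violations else "approved"
    audit_log.append(
        {"pipeline": spec.name, "owner": spec.owner,
         "status": status, "violations": violations}
    )
    return status


if __name__ == "__main__":
    policy = GovernancePolicy(allowed_sources={"crm.accounts"})
    spec = PipelineSpec(
        name="accounts_with_region",
        owner="embedded.analyst",
        source_table="crm.accounts",
        new_columns=["region"],
        schedule="0 6 * * *",
        alert_channel="#sales-data-alerts",
        catalog_description="Accounts enriched with sales region.",
    )
    print(submit(spec, policy), audit_log)
```

The design choice worth noting is that the analyst never waits on a person for steps (1–4); they wait on a rule check. The central team’s effort moves from handling requests to maintaining the policy and reading the audit log.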
Conclusion
The Silo Trap is a common pitfall for scaling Data Teams.
In the push to decentralise, feedback loops lengthen, which can lead to a loss of trust and a long time to insight. This results in decisions being made without data. The Data Team come to be viewed as a cost centre and a service desk.
One possible answer to this problem relates to empowering embedded analysts or business users. This leads to yet another problem, however, which is a loss of governance.
Any approach that empowers business users to build their own data pipelines and self-serve needs to allow a still-existent data team to retain both visibility and governance.
I recently gave a talk about this (Data Mesh) at Big Data London 2024! Follow me on LinkedIn to see the presentation! 🚀
Find out more about Orchestra
Orchestra is a unified control plane for Data and AI Operations.
We help Data Teams spend less time maintaining infrastructure, make them proactive instead of reactive, and ultimately win trust in data and AI from the Business.
We do this by consolidating Orchestration with monitoring, data quality testing, and data discovery. You don’t need a separate observability, lineage, or catalog tool with Orchestra.