How to stop failing at data
By Thibaut Gourdel
Innovate or die. It’s one of the few universal rules of business, and it’s one of the main reasons we continue to invest so heavily in data. Only through data can we get the key insights we need to innovate faster, smarter, better and keep ahead of the market.
And yet, the vast majority of data initiatives are doomed to fail. Nearly nine out of 10 data science projects never make it to production. Those that do are often so slow, clunky, and unreliable that they aren’t worth the initial investment. The problem is that data teams and technologies are siloed, with a wide and seemingly uncrossable chasm between the people who explore the data and the production teams that implement the initiatives.
These roadblocks erode confidence in the value of data, in the principles of collaboration, and in the very structures that support our businesses. We need to bridge the gap and make it easier for everyone to use and access data in a way that encourages collaboration and creates a culture of data literacy across the entire organization.
First, we have to make a few changes.
Doomed by Design
The question isn’t why we fail; it’s how we ever thought we could succeed in the first place. Consider a typical scenario: a data scientist builds a model to power real-time customer recommendations and validates it by running a Python script on a laptop. Every test is flawless.
Things fall apart, however, when data engineers try to implement the same functionality in a production pipeline built on Spark and Scala. It turns out the algorithm isn’t fast enough, robust enough, or secure enough to handle the entire customer dataset under real-world conditions. Production brings edge cases, regulatory constraints, resource limits, and other factors that complicate the analysis. What worked beautifully on a subset of the data fails completely at scale.
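To make that gap concrete, here is a minimal sketch; the dataset, column names, and top-N recommendation logic are all hypothetical. The laptop prototype fits comfortably in pandas, while the production equivalent must be rewritten against Spark’s distributed API, shown in the trailing comments:

```python
# A hypothetical laptop prototype: rank each customer's most-purchased
# products entirely in memory with pandas.
import pandas as pd

def top_products_per_customer(purchases: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return each customer's top-n products by purchase count.
    Fine for a sample; everything must fit on one machine."""
    counts = (
        purchases.groupby(["customer_id", "product_id"])
        .size()
        .reset_index(name="purchase_count")
    )
    counts["rank"] = counts.groupby("customer_id")["purchase_count"].rank(
        method="first", ascending=False
    )
    return counts[counts["rank"] <= n].drop(columns="rank")

# Flawless on a laptop-sized sample...
sample = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "product_id": ["a", "a", "b", "b", "c"],
})
print(top_products_per_customer(sample))

# ...but the production rewrite targets a distributed engine, where the
# same logic looks structurally different and raises new concerns such
# as partitioning, data skew, and late-arriving records:
#
#   from pyspark.sql import functions as F, Window
#   w = Window.partitionBy("customer_id").orderBy(F.desc("purchase_count"))
#   top_n = (purchases.groupBy("customer_id", "product_id").count()
#            .withColumnRenamed("count", "purchase_count")
#            .withColumn("rank", F.row_number().over(w))
#            .filter("rank <= 3"))
```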
Data projects are doomed when the people who plan and the people who execute don’t have the same tools, the same access, or even the same goals. Data scientists are really good at asking the right questions and running exploratory models, but they don’t know how to scale. Meanwhile, data engineers are experts at making data pipelines that scale, but they don’t know how to find the insights.
We’ve been using tools that require such a high level of specialist expertise that it’s impossible to get everyone on the same page. Because data scientists only ever touch small subsets of the data, there’s no way for them to extrapolate their models to function at scale. They don’t have access to production-grade data technology, so they have no way of understanding the constraints of building complex pipelines.
Meanwhile, data engineers are handed algorithms to implement with only the barest context about the business problem being solved and little understanding of how or why the data scientists settled on this solution. There may be some back and forth, but there’s rarely enough common ground to build a foundation.
Unleash the Power of Data
Establishing that common ground and building a foundation for innovation requires us to move away from existing models built on siloed teams and technology. Instead, we need to build a continuous, holistic culture of data literacy that spans the entire organization.
Here are four steps to help you start your data transformation:
1. Access to data
Limited access to small subsets of data makes it impossible for business users and data scientists to scope solutions that work at scale. This is often because technology vendors force companies into consumption-based pricing models that ration their access to their own data. Data must be freely available to the people who need it, without restrictions on volume, data sources, or users.
2. A level playing field
The people who plan data initiatives often don’t understand the constraints of production. This is because they typically don’t have access to the specialist tools that data engineers use to build pipelines. And even if they did, they wouldn’t know how to use them. Look for a consistent set of user-friendly, self-service data platforms to nurture a common language for how data works.
3. Commentary and context
Contrary to the popular saying, data does not speak for itself. The people who use the data every day need to enrich that data with context and commentary. This will help other users in the organization understand what data they can trust and how to use it best. In addition to rating and commenting, users should have visibility into data’s provenance (where it came from) and lineage (how it has been used).
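As an illustration, here is a minimal sketch of the kind of metadata a catalog might attach to a dataset so other users know what to trust. The CatalogEntry class and its fields are hypothetical, not any particular product’s API:

```python
# A hypothetical catalog record carrying commentary, provenance, and lineage.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    provenance: str                                    # where the data came from
    lineage: list[str] = field(default_factory=list)   # transformations applied so far
    comments: list[str] = field(default_factory=list)  # context from daily users
    rating: float | None = None                        # crowd-sourced trust signal

orders = CatalogEntry(
    name="orders_cleaned",
    provenance="exported nightly from the billing system",
    lineage=["raw_orders", "deduplicated", "currency normalized to USD"],
    comments=["Amounts before 2021 exclude tax; confirm with the finance team"],
    rating=4.5,
)
```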
4. Persistent data governance
Unregulated data poses a real risk to companies and their customers. We must balance the need for innovation with caution and make sure we are being responsible with sensitive and proprietary data. Establishing the roles, rules, and permissions that ensure accountability will require data governance at every stage of the data lifecycle.
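For example, here is a minimal sketch of role-based, column-level rules of the kind a governance layer might enforce at each stage of the lifecycle; the roles, columns, and masking helper are all hypothetical:

```python
# Hypothetical column-level grants per role, plus columns that a role
# may only see in redacted form.
POLICIES = {
    "analyst":  {"customer_id", "region", "order_total", "email"},
    "engineer": {"customer_id", "region", "order_total", "email"},
}
MASKED_FOR = {"analyst": {"email"}}  # roles that only see a redacted form

def mask_email(value: str) -> str:
    """Redact the local part of an email address."""
    user, _, domain = value.partition("@")
    return f"{user[:1]}***@{domain}"

def read_row(role: str, row: dict) -> dict:
    """Return only the columns a role is granted, masking where required."""
    allowed = POLICIES.get(role, set())
    masked = MASKED_FOR.get(role, set())
    out = {}
    for col, val in row.items():
        if col not in allowed:
            continue  # column is outside this role's grant
        out[col] = mask_email(val) if col in masked else val
    return out

row = {"customer_id": 42, "region": "EU", "order_total": 99.0, "email": "ada@example.com"}
print(read_row("analyst", row))   # email arrives masked: 'a***@example.com'
print(read_row("engineer", row))  # engineers see the raw address
```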
The democratization of data doesn’t mean the end of specialization and expertise — far from it. Rather, by putting data in the hands of more people and encouraging the consumption and analysis of that data, we will create an even greater appetite for data initiatives and the problems they can solve.
By unifying and standardizing the ways we use and access data, we can bridge the gaps between planning and execution and finally unleash the power of data to deliver positive business outcomes.
Reprinted with permission from Datanami.