Once upon a time in data land … more and more organizations had more and more data

April 23, 2020 by Rufus Pollock

Once upon a time, more and more organizations such as enterprises, governments, and NGOs had more and more data from more and more sources, both internal and external. They knew (or felt) that it must be valuable … if only they could work out how to unlock that value.

Photo by @beyondzarathustra (CC BY-SA)

However, many were struggling …

However, many were struggling with the various steps of their data journey – from finding the relevant data, to integrating it reliably, to turning it into insight … and then automating all of that!

Some were overwhelmed by all the different data

Some were overwhelmed by the variety and volume of the data available – internally and externally – and even struggled to know what data they had. They asked themselves questions like:

  • What data do we even have?
  • What data could we get?
  • What kind of things can we do with data?
  • Where should we start?

Some struggled to connect their data with concrete business needs

Some struggled to figure out a link between all the data and concrete business needs. They asked themselves questions like:

  • Do we even have relevant data (to answer the questions we have)?
  • Can we use it cost-effectively?
  • Where do we start?

Some started doing work manually but it was slow, error-prone, costly and difficult to scale

Some started doing work manually, sharing files by email or Dropbox, and extracting and analysing data by hand in spreadsheets and databases. But they found it was slow, error-prone, costly and difficult to scale. They asked themselves questions like:

  • Why am I only getting performance numbers two months late?
  • Why is it taking Joe 3 days a month just to wrangle the Excel files?
  • How do we begin automating this? (and can we do it incrementally?)
  • How do we train our staff?
  • What tools are available and which should we use?

Some started building semi-automated pipelines with their own teams but it was expensive, hard to debug and left them dependent on ad-hoc solutions and key staff

Some started building semi-automated pipelines and hiring data scientists and engineers. This was an improvement, but it was expensive, hard to audit and debug, and often left them dependent on ad-hoc solutions and key staff. They asked themselves questions like:

  • Where does the number in that dashboard come from?[1] How do we track and audit our data pipelines so that we can catch errors early and quickly fix bugs?
  • What happens when Joe leaves, given he’s the only one who really knows how that system works? How do we introduce common processes and frameworks to organize our data team’s work so that we can easily add team members and reduce key-person risk?
  • What tools can we reuse so we don’t need to build our own?
  • Is the data team’s work aligned with the business strategy?

Some decided to buy a solution from a vendor but it was hard and expensive to fit it to their rapidly evolving data and needs (and they were now locked in)

Some decided to buy expensive solutions from proprietary vendors, whether for BI, ETL, or data governance. But the results were often disappointing: data is diverse and messy, and one-size-fits-all was a poor fit for their rapidly evolving data systems and business needs. Plus, they were now locked in to one more vendor. They asked questions like:

  • What happens if our needs change next year? Can we make changes: is it covered by our contract, is it even possible?
  • Is the solution working for our business, or are we shaping our business processes and goals to fit what we can get from the solution?
  • What happens if a great new tool is released and we want to use it too? What happens when the vendor gets bought, pivots or goes out of business?
  • What do all these acronyms mean? Which of these 3 similar Apache projects is actually better (for us)?

And then one day … they discovered Datopian

And then one day … they discovered Datopian, who had thought about these problems for years and had a whole open source framework and tooling ready to go, plus lots of expert data engineers to help them implement and adapt it …

And they worked with Datopian happily ever after and generated millions of dollars in value

And they worked with Datopian day after day, unlocking the potential of their data and generating millions of dollars in value from efficiency savings and new insights … and their business thrived … and they lived happily ever after (and their CDO got a special commendation from the board!).


This is a revised and extended version of Once Upon a Time in Data Land, first published in November 2018.

  1. “That number in the dashboard looks far too high, where does it come from … a few hours later: uh-oh, debugging is really hard because the underlying pipeline is 30 undocumented SQL statements spread across 5 different systems” ↩︎