Service providers: Datopian
Client: Birch Infrastructure
Services: custom data scrapers, custom data management systems, data strategy, data engineering
Period: March 2021 – present
As data becomes the world’s most valuable commodity, with global output measured in zettabytes (1 zettabyte can store roughly 10 billion 4K movies of 100 gigabytes each!), there is undoubtedly a high demand for solving big data problems. The kinds of data issues we encounter in the wild vary tremendously from one industry to the next, but one constant remains: managing data efficiently is an increasingly difficult endeavour, and good solutions are being sought to deal with such astronomical amounts in a way that allows companies to make sense of their data and extract value from it.
Accessing, retrieving, storing, processing, searching, sharing and visualizing big data are only a handful of the situations data engineers contend with on a daily basis. On some complex projects, these technically-minded folks have to deal with uncertainty where the domain, the needs and the data are unknown and the scope keeps growing. They may struggle to cross-reference hundreds of different data sources (welcome variety, one of the big Vs of big data!), and searching through those data sources and their associated metadata is by no stretch of the imagination a simple challenge. Birch Infrastructure, an innovative US-based firm catering to forward-looking businesses wishing to become better consumers of renewable energy, is one of the firms tackling this modern puzzle, one big piece at a time.
Taking a stance for tomorrow’s generation, Birch is hard at work to help us live in a more sustainable world. In a nutshell, they operate as a virtual utility to bring together renewable energy and data centers. Specifically, they focus on:
- Siting and developing data center sites;
- Assessing and allocating the utility (chiefly energy) risks of those sites;
- Managing commodity market risk with adaptive data forecasting and modeling.
Birch has deep knowledge of energy markets, renewable energy development and fiber deployment; combined, these give them a unique approach to selecting ideal data center and renewable power locations. Just like every other visionary company working on complex issues with worldwide reach, they surround themselves with teams of talented people to accomplish their ambitious goals. When the time came to solve their data management needs, the decision process was straightforward: Datopian fit in the picture.
Meeting the requirements
A data-driven business such as Birch thrives on deriving meaning from information, which becomes a crucial asset enabling delivery of high value to customers. The process shares common ground with diamond extraction from the Earth: data needs to be scraped from its source; it then goes through a technical pipeline where it is scrubbed to remove its impurities, cleaned, weighted to assess its worth, packaged in a useful format and transported to a team of experts who then polish the final result. However, this remains exceedingly challenging as there is a need to:
- inventory and evaluate existing data sources;
- suggest the best ways to access and manage data;
- follow the best practices of data engineering;
- enrich data from “off-the-shelf” solutions;
- build custom and interactive interfaces;
- deploy efficient and reliable systems.
From the very beginning of this journey, we need to ask ourselves a plethora of questions:
- How long will it take to acquire all that data?
- When do computational speed, storage and cost become an issue?
- How much can we take advantage of existing tools and APIs?
- Is there a trade-off to make with volume (yet another big V of big data)?
Ultimately, technical expertise has to be leveraged to generate more business leads through solutions that make the best possible use of big data and provide benefits in the long run.
Summary of strategies and goals
First and foremost, one key objective is to facilitate access to the relevant data.
For Birch, being able to quickly find specific information will result in large savings as they will focus on producing business insights instead of dealing with the complexities of raw data, which may not even be easily or conveniently obtainable.
On the technical side, we need to factor security, data availability, pipeline automation and so on into the equation. Considering data relevancy, we need to ensure that human-in-the-loop features are built into the system so that improvements can be made when dealing with highly unstructured data, since an estimated 80 to 90% of the data in any organization is unstructured.
In keeping with that theme, an initial plan of action can be summarized as follows:
- Data exploration and understanding;
- Deployment and metadata indexing;
- Smart entity mapping, including human-in-the-loop interactions;
- Semantic search and augmentation of entity mapping.
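To make the human-in-the-loop idea concrete, here is a minimal sketch of how smart entity mapping could route decisions between automatic acceptance, human review and rejection. The thresholds and entity names are illustrative assumptions, not actual values from this project:

```python
from difflib import SequenceMatcher

# Illustrative thresholds (assumptions, not production values).
AUTO_ACCEPT = 0.90   # above this, map the entity automatically
NEEDS_REVIEW = 0.70  # between thresholds, queue it for a human

def similarity(a: str, b: str) -> float:
    """Normalized string similarity between two entity names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def map_entity(candidate: str, known_entities: list[str]):
    """Return (best_match, decision) for a scraped entity name."""
    best = max(known_entities, key=lambda k: similarity(candidate, k))
    score = similarity(candidate, best)
    if score >= AUTO_ACCEPT:
        return best, "auto-mapped"
    if score >= NEEDS_REVIEW:
        return best, "human-review"  # lands in the human-in-the-loop queue
    return None, "unmapped"
```

In a real system the "human-review" branch would surface the candidate pair in a UI; an embedding-based similarity would likely replace the simple character-level ratio shown here.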
Some of the strategies we might use encompass a data API (Application Programming Interface), an ETL (Extract, Transform, Load) data pipeline, OCR (Optical Character Recognition) and possibly statistical forecasting, amongst others. Regardless of the definitive approaches chosen in the end, our core strategy will revolve around building an MVP (Minimum Viable Product) to clearly define the end goals, provide features incrementally and avoid feature creep while guaranteeing that the solutions we build are fully aligned with Birch’s purpose.
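For readers unfamiliar with the ETL pattern mentioned above, a toy end-to-end sketch looks like this; the CSV data and field names are invented for illustration, and the “load” step serializes to JSON rather than writing to a real warehouse:

```python
import csv
import io
import json

# Toy extract input: in a real pipeline this would come from a scraper or API.
RAW_CSV = """site,state,capacity_mw
Alpha,TX,120
Beta,IA,
Gamma,TX,95
"""

def extract(raw: str) -> list[dict]:
    """Parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[dict]:
    """Drop rows with missing capacity and normalize types."""
    return [
        {"site": r["site"], "state": r["state"],
         "capacity_mw": float(r["capacity_mw"])}
        for r in rows
        if r["capacity_mw"]
    ]

def load(rows: list[dict]) -> str:
    """Stand-in load step: serialize instead of writing to a warehouse."""
    return json.dumps(rows)

records = load(transform(extract(RAW_CSV)))
```

An MVP pipeline of exactly this shape can later be decomposed into scheduled tasks in an orchestrator such as Prefect.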
Technologies involved: a whirlwind tour
Given this project’s scope and aspirations, we ought to depend on many different technologies and services if we are to succeed. On the server side, battle-tested software will be on our side, including Elasticsearch, Microsoft SQL Server, Node.js, BigQuery and the Python programming language. As we enter this modern incarnation of data science, where the web keeps evolving at a swift pace, we will also arm ourselves with contemporary offerings such as Jupyter Notebooks for data exploration, FastAPI for rapid prototyping of RESTful APIs, Prefect for scheduling and managing data pipelines and dbt to automate data transformations.
Other supporting tools include Docker, cloud test environments and platforms like GitHub and GitLab. These facilitate tasks such as build automation, test suite execution, code quality control with CI/CD (continuous integration and continuous delivery) and automatic documentation updates. When contemplating search, the central aspect of this work, we can count on bleeding-edge technology like Jina.ai to create an intelligent semantic search engine. We can use a combination of Elasticsearch and Kibana dashboards to navigate categorical data, explore large amounts of data or perform aggregations. Deep Neural Networks (DNNs), machine learning and natural language processing (NLP) are only a sprinkle of the techniques we are seeking in niche software to take data’s inherent yet hidden potential to the next level.
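As a small illustration of the Elasticsearch side, the sketch below builds a request body that filters documents and aggregates counts, using the standard query DSL (a `bool` filter plus a `terms` aggregation). The index and field names are hypothetical, not taken from the actual deployment:

```python
# Hypothetical field names ("state", "utility"); the body shape itself
# follows the standard Elasticsearch query DSL.
def build_site_query(state: str, max_hits: int = 10) -> dict:
    """Build a search body: filter sites by state, count docs per utility."""
    return {
        "size": max_hits,
        "query": {
            "bool": {
                "filter": [{"term": {"state.keyword": state}}],
            }
        },
        "aggs": {
            "by_utility": {
                "terms": {"field": "utility.keyword", "size": 20}
            }
        },
    }

body = build_site_query("TX")
```

Such a body would be sent with the official Python client, e.g. `es.search(index="sites", body=body)`, and the `by_utility` buckets could feed a Kibana-style categorical view.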
At a high level, efficiency was gained by limiting the number of channels used to communicate and manage existing data. It becomes easier to capture, clarify, organize, reflect upon and engage with previous knowledge once it is funneled into fewer tools, in true Getting Things Done fashion. Applying that same strategy to data, we ended up with a “core dataset” where we keep building relationships between different entities after having undertaken the arduous tasks of automating the scraping, processing and dumping of data. Initially, distinct data sources had to be consulted one at a time, in different formats, with no obvious path to link data from one dataset to another. This is no longer the case now that we have a unifying interface relying on Single Sign-On (SSO), which helps comply with the CIA triad of confidentiality, integrity and availability, keeping security in check. At this stage, with some understanding of specialized technical tooling such as BigQuery, Elasticsearch and custom-built tools, it is now possible to find cross-referenced data that was previously laborious to extract, shrinking the timescale required to get tangible results from weeks or even months down to hours or even minutes.
With a data pipeline now in place, ingestion, management, transformation and display of data occur with much less friction. New data sources can be added into the mix and parsed meaningfully in relation to the existing information. Furthermore, this need for continuous addition of sources is taken into account by giving engineers at Birch features to make data requests, which are integrated within the whole architecture through custom-built functionality. This gives us the ability to monitor requests closely, make sure they are dealt with in a timely manner and update different parts of the system accordingly. The result is a workflow that substantially simplifies the manipulation of data while giving stakeholders the confidence that critical input will not slip through the cracks. Each user interface we have designed has built-in support for requesting changes and allows editing content on the fly, giving Birch’s knowledgeable staff the opportunity to fill in the gaps by feeding their expertise into a perpetually self-updating machinery.
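One way such a data-request feature can be wired into a platform like GitHub is by turning each request into a trackable issue. The sketch below prepares (but does not send) a request against the GitHub REST API’s “create an issue” endpoint; the repository name, requester and token are placeholders, not details from the actual system:

```python
import json
import urllib.request

# Hypothetical repo and label; the endpoint and payload shape follow the
# GitHub REST API endpoint for creating an issue.
def build_data_request(source_url: str, requester: str) -> urllib.request.Request:
    """Prepare a GitHub issue representing a new data-source request."""
    payload = {
        "title": f"Data request: {source_url}",
        "body": f"Requested by {requester}. Please add this source to the pipeline.",
        "labels": ["data-request"],
    }
    return urllib.request.Request(
        "https://api.github.com/repos/example-org/data-requests/issues",
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": "Bearer <token>",  # placeholder credential
        },
        method="POST",
    )

req = build_data_request("https://example.com/energy.csv", "analyst@birch")
```

Issues created this way can then drive assignment, status updates and follow-ups automatically, which matches the monitoring workflow described above.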
Right from the beginning, providing working, useful solutions has been a topmost priority: although there is much left to achieve in this ever-expanding venture, an iterative approach took us much farther than where we were only a couple of months ago. Dashboards with interactive maps and features served as a very good entry point to understand the data, observe formerly unknown correlations and apply that newfound acumen as we built more tools during the exploratory phase. Even though abstract findings like these may be hard to quantify, there is no doubt that using technology well has served its purpose and led us to groundbreaking discoveries: expecting to reach similar conclusions without carefully engineered software, by going the manual labour route, would be unthinkable.
The road ahead is brimming with possibilities: there will be an ongoing need to process new data sources to expand the knowledge base; data requests will keep coming in for which platforms like GitHub can be put to use to automate task assignments to data engineers, status updates, follow-ups and so on; a general-purpose knowledge graph will be built in the form of an ontology to encapsulate the nature of entities and to visually represent emerging connections between them; the search in the custom UIs will be improved with semantic meaning and multi-faceted search will be enabled by organizing data in brand-new arrangements; NLP will be taken a step further with Named-Entity Recognition and entity clustering; web applications will boast augmented search for entity maps…
As we uncover yet unmet demands, the project will be steered with proper guidance from the leadership, just like a plane slightly adjusts its course during a flight until it arrives at its destination.
Have a similar project and need help? Contact us!
We create, maintain, and deploy data management and data engineering technologies for government, enterprise, and the non-profit sector using CKAN, Frictionless Data, DataHub.io and other open-source software that we have built ourselves.