
Advanced Data Engineering for Big Data Challenges

Key facts
Service providers:

Datopian

Client:

Not disclosed

Services:
Data Engineering; Data Strategy
Period:
March 2023 - July 2023

Brief summary of the project:

Datopian successfully engineered an advanced ETL pipeline, tackling the Client's complex big data challenges and enabling data-driven decision-making.

Problem

The Client engaged Datopian's data engineering expertise to tackle the big data hurdles inherent in a large-scale ETL process—the heart of their data-centric product. With thousands of JSON files, each averaging around 100GB, even minor inefficiencies could cascade into severe operational setbacks.

Need

The Client needed an ultra-efficient, automated ETL infrastructure capable of flawlessly managing hundreds of terabytes of data each month. The framework had to meet rigorous standards for operational efficiency, data quality, and integrity, and be primed for rapid querying.

Solution

Datopian architected an advanced ETL solution leveraging Airflow's Directed Acyclic Graphs (DAGs), meticulously designed for each critical phase, from seamless data retrieval to automated workflow orchestration. By implementing data streaming and chunking to overcome memory constraints, Datopian delivered an ETL pipeline that sets a high benchmark for speed, data hygiene, and structural integrity.

Main technologies & tools used
Airflow
Python
PostgreSQL
AWS
Redshift

Context

With open data increasingly becoming the standard rather than the exception, many innovations are in the works to provide data to the public in easily accessible ways. In some cases, policies have required previously private data to be made open and available to the public, which is precisely the circumstance that brought the Client and Datopian together.

While an NDA prevents us from diving into the specifics of the data at this time (even though the data itself is open), the Client is embarking on an ambitious journey to create a cutting-edge portal designed to democratize access to this data. Their mission? To provide this intricate and often opaque information in a fully transparent, easily searchable, and comparable format.

A small problem

However, the task at hand proved far more complex than initially estimated. To draw a simple analogy: the Client envisioned us disassembling and reorganizing a van or a small school bus, but a deeper analysis made it clear that we were dealing with a train, and a behemoth at that, with 4,000 cars loaded with intricate data points. The challenge was monumental, yet ripe with opportunities for technological ingenuity and societal impact.

The situation

The Client approached Datopian for our data engineering expertise. They came in the early stages of development, seeking a solution for managing their big data ETL (Extract, Transform, Load) process. Data is at the heart of their product, so they needed a solution that could handle its complexity and sheer volume: thousands of JSON files with an average size of around 100GB (many are even larger). Any inefficiencies or mismanagement would have serious ramifications.

The criteria

  • Monthly Throughput: Automated ETL capable of handling hundreds of terabytes of data monthly
  • Operational Efficiency: Full ETL process designed to take less than a week
  • Data Purity: Resulting data should only contain what’s necessary
  • Data Integrity and Uniqueness: No duplicated data
  • Query-Ready Architecture: Stored within a fast and easily queryable database schema

The solution

Datopian began by analyzing the problem. Of the many obstacles, the most challenging were Airflow's limitations around memory-intensive processes within MWAA (e.g., running out of memory while working with multiple Python lists containing millions of values), request timeouts, handling failed Airflow tasks, and designing a database schema that could provide fast and efficient lookups. Each of the following sections is implemented in its own Airflow DAG (Directed Acyclic Graph).

Retrieving Source URLs

The initial step is retrieving a list of URLs that point to the data files (.gz compressed JSON files). This is pretty straightforward. Using the source API, a list of URLs can be extracted and saved as a CSV to an Amazon S3 bucket.
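
A minimal sketch of this step, assuming a hypothetical source API endpoint, bucket, and key (the real names are NDA-covered), might look like this:

```python
import csv
import io

import boto3
import requests

# Placeholder endpoint and bucket names, not the Client's actual values.
SOURCE_API = "https://api.example.org/v1/files"
BUCKET = "client-etl-raw"


def save_source_urls() -> None:
    """Fetch the list of .gz file URLs from the source API and store it as a CSV on S3."""
    response = requests.get(SOURCE_API, timeout=60)
    response.raise_for_status()
    urls = [item["url"] for item in response.json()]

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["url"])
    writer.writerows([u] for u in urls)

    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key="source/urls.csv",
        Body=buffer.getvalue().encode("utf-8"),
    )
```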

Retrieve Files and Upload to S3

The second step is retrieving the compressed JSON files and uploading them to an S3 bucket. Due to the size of the files, they’re uploaded in parts using multipart_upload, a function available in the Python package boto3. The process is:

  • Stream the file using Python requests with stream=True
  • Iterate over the stream chunks using iter_content
  • Upload each chunk to S3 with upload_part

Streaming and chunking the data allows for more concurrent tasks in the environment and avoids out-of-memory issues. Streaming means the data is processed as it downloads instead of waiting for the download to complete, and chunks of the streamed data are uploaded as they arrive.
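
A simplified sketch of that streaming multipart upload, using requests and boto3 (the bucket, key, and part size here are illustrative, not the production values):

```python
import boto3
import requests

s3 = boto3.client("s3")
CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB parts; S3 requires parts of at least 5 MB except the last


def stream_to_s3(url: str, bucket: str, key: str) -> None:
    """Stream a remote .gz file straight into S3 as a multipart upload."""
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        with requests.get(url, stream=True, timeout=300) as response:
            response.raise_for_status()
            # iter_content yields the data as it arrives, so the whole file is never held in memory.
            for number, chunk in enumerate(response.iter_content(chunk_size=CHUNK_SIZE), start=1):
                part = s3.upload_part(
                    Bucket=bucket,
                    Key=key,
                    PartNumber=number,
                    UploadId=upload["UploadId"],
                    Body=chunk,
                )
                parts.append({"PartNumber": number, "ETag": part["ETag"]})
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=key,
            UploadId=upload["UploadId"],
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so S3 does not keep billing for orphaned parts of a failed task.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"])
        raise
```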

Decompressing

With the compressed files downloaded, they can be unzipped. Just as was done with downloading, the best method here is streaming and chunking. The basic flow is as follows:

  • Iterate over the compressed files
  • Create a temporary directory for each file
  • Open the compressed file using gzip.GzipFile with mode="rb" (this accepts a stream of the compressed file directly from S3 and unzips it on the fly)
  • Split the data into separate parts (these end up in different files on S3 for easier parsing later)
  • Chunk the incoming file in sections of 500MB (each part from the split can have many chunks)
  • Create a temporary .jsonl in the temporary directory for each chunk
  • Upload the chunk to S3
  • Once all chunks are uploaded, remove the temporary directory and files

Streaming and chunking are similarly critical while unzipping the compressed files; the same benefits are provided here. The data is processed as the unzipping occurs, not after.
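
A condensed sketch of this streaming decompression, assuming hypothetical bucket and key names and an approximate 500MB target per decompressed part:

```python
import gzip
import tempfile
from pathlib import Path

import boto3

s3 = boto3.client("s3")
CHUNK_SIZE = 500 * 1024 * 1024  # roughly 500 MB of decompressed JSON lines per part file


def decompress_to_parts(bucket: str, key: str) -> None:
    """Stream-decompress a .gz object from S3 and re-upload it as ~500 MB .jsonl part files."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # StreamingBody, read as needed
    with tempfile.TemporaryDirectory() as tmp_dir:
        with gzip.GzipFile(fileobj=body, mode="rb") as stream:
            part, buffered, size = 0, [], 0
            for line in stream:  # whole JSON lines, so a chunk boundary never splits a record
                buffered.append(line)
                size += len(line)
                if size >= CHUNK_SIZE:
                    part += 1
                    _flush(buffered, tmp_dir, bucket, key, part)
                    buffered, size = [], 0
            if buffered:
                _flush(buffered, tmp_dir, bucket, key, part + 1)


def _flush(lines: list, tmp_dir: str, bucket: str, key: str, part: int) -> None:
    """Write one chunk to a temporary .jsonl file, upload it to S3, then delete the local copy."""
    tmp_file = Path(tmp_dir) / f"part_{part:04d}.jsonl"
    tmp_file.write_bytes(b"".join(lines))
    s3.upload_file(str(tmp_file), bucket, f"decompressed/{key}/part_{part:04d}.jsonl")
    tmp_file.unlink()
```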

Transforming

Transforming the data is the most complex aspect of this ETL process. The previous steps break down the data into more manageable parts, but the valuable bits have yet to be extracted. The following is a high-level view of how the data flows through the transform stage:

  • Retrieve the lists of unique hashes of previously consumed data from the PostgreSQL database (these are generated later in the transform process—there won’t be any unique hashes on the first run)
  • Retrieve the data parts from S3
  • Filter the part through logic to compare hashes (skip if it exists—if not, add the new hash to the list), extract only the necessary data, and then push the results to their designated S3 location (throughout the process, the different types of data are appended to temporary Python lists)

Streaming isn’t used in this step, but a form of chunking is. The data is parsed, transformed, and stacked into separate sections; when a section is complete, its result is pushed to its designated S3 location while the rest of the process continues.
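
A rough sketch of the hash-based filter at the core of this stage; the hashing scheme and field names below are placeholders, since the real record schema is NDA-covered:

```python
import hashlib
import json


def transform_part(lines: list[str], known_hashes: set[str]) -> tuple[list[dict], list[str]]:
    """Filter one decompressed .jsonl part: skip previously consumed records, keep only needed fields.

    `known_hashes` is the set of hashes previously loaded from the PostgreSQL database;
    on the very first run it is empty.
    """
    new_rows, new_hashes = [], []
    for line in lines:
        record = json.loads(line)
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest in known_hashes:
            continue  # already consumed in a previous run
        known_hashes.add(digest)
        new_hashes.append(digest)
        new_rows.append({
            "id": record["id"],        # placeholder field names for the extracted data
            "value": record["value"],
        })
    return new_rows, new_hashes
```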

Uploading

With all the data parsed, transformed, and separated in S3, it can finally be uploaded to the PostgreSQL database tables. This step is simple since the database schema is set, and the data is already broken down into specified parts.
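
As an illustration only, a bulk load of one transformed part into PostgreSQL could look like the following, assuming the part has been pulled down locally as a CSV (the connection string and table name are placeholders, and the actual load path may differ, for example Redshift's COPY from S3):

```python
import psycopg2

# Placeholder connection string; the real credentials live in Airflow connections.
DSN = "postgresql://etl_user:password@db-host:5432/etl"


def load_csv_part(local_path: str, table: str) -> None:
    """Bulk-load a transformed CSV part into its PostgreSQL table using COPY."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur, open(local_path) as f:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```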

Removing duplicates

Due to the system resources required to hold all the data in memory to check for duplicates, deduplication occurs at the end of the ETL process. Since the data is already in the database, duplicates are removed by simply using SQL DISTINCT.
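
One way to express this deduplication, with placeholder table names and no claim that this is the exact SQL used, is to rebuild each table from its DISTINCT rows:

```python
import psycopg2

DSN = "postgresql://etl_user:password@db-host:5432/etl"  # placeholder connection string

DEDUP_SQL = """
    CREATE TABLE records_dedup AS
        SELECT DISTINCT * FROM records;   -- keep exactly one copy of each row
    DROP TABLE records;
    ALTER TABLE records_dedup RENAME TO records;
"""


def remove_duplicates() -> None:
    """Rebuild the (placeholder) records table from its DISTINCT rows."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(DEDUP_SQL)
```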

Airflow DAG automation

The final piece is automating the flow. An ETL “controller” DAG was created to handle this. It tells Airflow which DAGs should run and when (e.g., DAG B must wait for the completion of DAG A, DAG C can run any time, etc.).
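
A minimal sketch of such a controller DAG, using Airflow's TriggerDagRunOperator with placeholder DAG ids and an assumed monthly schedule:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# The stage DAG ids below are placeholders for the real per-stage DAGs.
with DAG(
    dag_id="etl_controller",
    start_date=datetime(2023, 3, 1),
    schedule_interval="0 0 1 * *",  # the 1st of each month, when new source data is released
    catchup=False,
) as dag:
    retrieve_urls = TriggerDagRunOperator(
        task_id="trigger_retrieve_urls",
        trigger_dag_id="retrieve_source_urls",
        wait_for_completion=True,
    )
    download = TriggerDagRunOperator(
        task_id="trigger_download",
        trigger_dag_id="retrieve_files_to_s3",
        wait_for_completion=True,
    )
    decompress = TriggerDagRunOperator(
        task_id="trigger_decompress",
        trigger_dag_id="decompress_files",
        wait_for_completion=True,
    )
    transform = TriggerDagRunOperator(
        task_id="trigger_transform",
        trigger_dag_id="transform_parts",
        wait_for_completion=True,
    )
    load = TriggerDagRunOperator(
        task_id="trigger_load",
        trigger_dag_id="load_to_database",
        wait_for_completion=True,
    )

    # Each stage waits for the previous one to finish before its DAG is triggered.
    retrieve_urls >> download >> decompress >> transform >> load
```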

ETL Flow Diagram

Data Storage

Storing, retrieving, and querying such large quantities of data required significant analysis, testing, and iteration. The frontend portal was still in progress while the backend neared completion, so the Client decided to use both Redshift and RDS PostgreSQL, easily swappable via Airflow config variables. This allowed for final performance comparisons between the two options.
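
The swap between the two warehouses can be expressed with an Airflow Variable that selects a connection id; a sketch, with placeholder variable and connection names (both Redshift and RDS speak the PostgreSQL wire protocol, so the same hook can serve either):

```python
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook


def get_warehouse_hook() -> PostgresHook:
    """Return a hook for whichever warehouse the Airflow Variable currently points at."""
    target = Variable.get("warehouse_target", default_var="redshift")  # placeholder variable name
    conn_id = "redshift_default" if target == "redshift" else "rds_postgres"  # placeholder conn ids
    return PostgresHook(postgres_conn_id=conn_id)
```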

Until the database schema was finalized, a more traditional, transaction-focused solution like RDS PostgreSQL was the safer choice. The final schema uses highly efficient normalized tables and relationships, which makes Redshift's fast, query-centric capabilities a better fit.

Final testing was done with both Redshift and RDS PostgreSQL before handing over ownership to the Client. Redshift performed better overall, especially where indexes are concerned (for example, when dropping duplicates in PostgreSQL, indexes need to be removed and then re-created, which adds a lot of processing time).

The outcome

Data Integrity & Accessibility

The soul of the project—the data—is its most valuable asset. Not just any data, but data that is:

  • Clean and well-structured
  • Easily queryable
  • Free of redundancies

We’ve set a high benchmark for "high-quality, easily queryable data," making it exceedingly straightforward for end-users to locate and utilize the crucial information they require.

ETL Process Efficiency

In addition to being spotlessly clean and easily accessible, the data is managed through an ETL process that is highly efficient and completes promptly. Set to run on the 1st of each month, coinciding with the release of the newest data, this state-of-the-art ETL process concludes its extensive tasks within a matter of days, embodying the term "fast and efficient ETL processes."

Enabling Informed Business Decisions: The Ultimate ROI

The upshot of our work is an end-to-end big data pipeline tailored specifically to empower our Client's data-driven decision-making. By ensuring the effective management and transformation of extensive data quantities, we provide an unmatched solution that elevates data accessibility and real-time business analytics. This enables both end-users and business leaders to promptly access pivotal information, thereby catalyzing smarter, more informed business decisions.

What’s next?

As of now, Datopian's engagement in this project has reached its initial conclusion. However, given that the project is still in its early stages, the potential for further innovation and refinement remains vast. Our Client now possesses the tools for immediate analysis and future scaling, leaving the door wide open for unlimited possibilities in next-gen big data solutions.

We are the CKAN experts.

Datopian is a co-creator, co-steward, and one of the main developers of CKAN. We design, develop, and scale CKAN solutions for everyone from governments to the Fortune 500. We also monitor client use cases for data to ensure that CKAN is responding to genuine challenges faced by real organizations.
