Data Engineering, Data Strategy
March 2023 - July 2023
Brief summary of the project
Datopian successfully engineered an advanced ETL pipeline, tackling the Client's complex big data challenges and enabling data-driven decision-making.
The Client engaged Datopian's data engineering expertise to tackle the big data hurdles inherent in a large-scale ETL process—the heart of their data-centric product. With thousands of JSON files, each averaging around 100GB, even minor inefficiencies could cascade into severe operational setbacks.
The Client needed an ultra-efficient, automated ETL infrastructure capable of flawlessly managing hundreds of terabytes of data each month. The framework had to meet rigorous standards for operational efficiency, data quality, and integrity, and be primed for rapid querying.
Datopian architected an advanced ETL solution leveraging Airflow's Directed Acyclic Graphs (DAGs), designed meticulously for each critical phase—from seamless data retrieval to automated workflow. Implementing data streaming and chunking technologies to combat memory constraints, the result was an ETL pipeline that set a new industry benchmark in speed, data hygiene, and structural integrity.
Main technologies & tools used
With open data increasingly becoming the standard instead of the exception, there are many innovations in the works to provide data to the public in easily accessible ways. In some cases, policies have enforced previously privately held data to become open and available to the public, which is precisely the circumstance that brought the Client and Datopian together.
While an NDA prevents us from diving into the specifics of the data at this time (even though the data itself is open), the Client is embarking on an ambitious journey to create a cutting-edge portal designed to democratize access to this data. Their mission? To provide this intricate and often opaque information in a fully transparent, easily searchable, and comparable format.
However, the task at hand proved far more complex than initially estimated. To draw a simple analogy: While the Client envisioned us disassembling and reorganizing a van or a small school bus, upon deep dive analysis, it became clear that we were dealing with a train—a behemoth one at that—with 4,000 cars loaded with intricate data points. The challenge was monumental, yet ripe with opportunities for technological ingenuity and societal impact.
The Client approached Datopian for our data engineering expertise. They came in the early stages of development, seeking a solution for managing their big data ETL (Extract, Transform, Load) process. Data is at the heart of their product, so they needed a solution that could handle this data complexity and sheer volume—it’s composed of thousands of JSON files with an average size of around 100GB (though many are even larger). Any inefficiencies or mismanagement would have serious ramifications.
- Monthly Throughput: Automated ETL capable of handling hundreds of terabytes of data monthly
- Operational Efficiency: Full ETL process designed to take less than a week
- Data Purity: Resulting data should only contain what’s necessary
- Data Integrity and Uniqueness: No duplicated data
- Query-Ready Architecture: Stored within a fast and easily queryable database schema
Datopian began by analyzing the problem. Of the many obstacles, the most challenging were Airflow’s limitations with high memory-consuming processes (within MWAA—e.g., running out of memory while working with multiple Python lists containing millions of values), request timeouts, handling failed Airflow tasks, and designing a database schema that can provide fast and efficient lookups. Each of the following sections is implemented in its own Airflow DAG (Directed Acyclic Graph).
The initial step is retrieving a list of URLs that point to the data files (.gz compressed JSON files). This is pretty straightforward. Using the source API, a list of URLs can be extracted and saved as a CSV to an Amazon S3 bucket.
The second step is downloading the compressed JSON files to an S3 bucket. Due to the size of the files, they’re uploaded in parts using multipart_upload, a function available in the Python package boto3. The process is:
- Stream the file using Python requests with stream=True
- Iterate over the stream chunks using iter_content
- Upload each chunk to S3 with upload_part
Streaming and chunking the data allows for more concurrent tasks in the environment and avoids out-of-memory issues. Streaming means the data is processed as it’s downloading instead of waiting for the download to complete. Chunks of this streaming data are then uploaded as it flows in.
With the compressed files downloaded, they can be unzipped. Just as was done with downloading, the best method here is streaming and chunking. The basic flow is as follows:
- Iterate over the compressed files
- Create a temporary directory for each file
- Open the compressed file using
mode="rb"(this accepts a stream of the compressed files directly from S3 while simultaneously unzipping it)
- Split the data into separate parts (these end up in different files on S3 for easier parsing later)
- Chunk the incoming file in sections of 500MB (each part from the split can have many chunks)
- Create a temporary
.jsonlin the temporary directory for each chunk
- Upload the chunk to S3
- Once all chunks are uploaded, remove the temporary directory and files
Streaming and chunking are similarly critical while unzipping the compressed files; the same benefits are provided here. The data is processed as the unzipping occurs, not after.
Transforming the data is the most complex aspect of this ETL process. The previous steps break down the data into more manageable parts, but the valuable bits have yet to be extracted. The following is a high-level view of how the data flows through the transform stage:
- Retrieve the lists of unique hashes of previously consumed data from the PostgreSQL database (these are generated later in the transform process—there won’t be any unique hashes on the first run)
- Retrieve the data parts from S3
- Filter the part through logic to compare hashes (skip if it exists—if not, add the new hash to the list), extract only the necessary data, and then push the results to their designated S3 location (throughout the process, the different types of data are appended to temporary Python lists)
Streaming isn’t used in this step, but a form of chunking is. As the data is parsed, transformed, and stacked into separate sections, when a section is complete, the result is pushed to its home on an S3 bucket while the process continues.
With all the data parsed, transformed, and separated in S3, it can finally be uploaded to the PostgreSQL database tables. This step is simple since the database schema is set, and the data is already broken down into specified parts.
Due to the system resources required to hold all the data in memory to check for duplicates, deduplication occurs at the end of the ETL process. Since the data is already in the database, duplicates are removed by simply using SQL DISTINCT.
The final piece is automating the flow. An ETL “controller” DAG was created to handle this. It tells Airflow which DAGs should run and when (e.g., DAG B must wait for the completion of DAG A, DAG C can run any time, etc.).
The soul of the project—the data—is its most valuable asset. Not just any data, but data that is:
- Clean and well-structured
- Easily queryable
- Free of redundancies
We’ve set a high benchmark for "high-quality, easily queryable data," making it exceedingly straightforward for end-users to locate and utilize the crucial information they require.
In addition to being spotlessly clean and easily accessible, the data is managed through an ETL process that is highly efficient and completes promptly. Set to run on the 1st of each month—coinciding with the release of the newest data—this state-of-the-art ETL process concludes its extensive tasks within a matter of days, embodying the term "fast and efficient ETL processes.
The upshot of our work is an end-to-end big data pipeline tailored specifically to empower our Client's data-driven decision-making. By ensuring the effective management and transformation of extensive data quantities, we provide an unmatched solution that elevates data accessibility and real-time business analytics. This enables both end-users and business leaders to promptly access pivotal information, thereby catalyzing smarter, more informed business decisions.
As of now, Datopian's engagement in this project has reached its initial conclusion. However, given that the project is still in its early stages, the potential for further innovation and refinement remains vast. Our Client now possesses the tools for immediate analysis and future scaling, leaving the door wide open for unlimited possibilities in next-gen big data solutions.