
So you want to join a Covid-19 Data Hackathon

Audrey Lobo-Pulo
Gift Egwuenu
5 mins read

The call for action to respond to the coronavirus COVID-19 pandemic has led many public groups and private organisations on a quest to find new approaches or solutions in dealing with this urgent problem.


Photo by Anuar Ustayev CC BY-SA

However, amongst the many challenges faced, the interpretation of data-driven insights depends on the quality and provenance of the data used. In this article, Datopian takes a closer look at some of the data and issues we came across in our own Open Data 2020 COVID-19 hackathon.

As the global community continues to face the coronavirus COVID-19 pandemic, momentum is gathering around hackathons as a way to quickly contribute towards addressing the crisis. Already, there have been at least 54 reported global hackathon events since mid-March - with Forbes reporting that high-profile events such as the #BuildforCOVID19 Global Online Hackathon and The Global Hack attracted more than 18,000 and 12,000 participants respectively.

At Datopian, we’ve been working with some COVID-19 datasets - and like any data-driven journey, ours had its fair share of surprises along the way. Here are some of our stories and links, which we hope might help you in your next hackathon!

Data begins at the cleaning room…

Every data hacker knows that a hackathon begins in the cleaning room - and this one was no different! We spent a fair bit of time cleaning up and tidying the COVID-19 datasets, so that they are ready to use at your next hackathon!

The Johns Hopkins University Center for Systems Science and Engineering (CSSE) has done an amazing job of collecting and publishing near real-time data in such a short amount of time. Their GitHub repository had around 20,000 stars (making it the most popular data repo with COVID-19 data) and is still growing! But with over 1000 reported and open issues on the main dataset, there were still a few things we could contribute towards the data cleaning process.

So we began with the Johns Hopkins datasets - the main aggregated dataset - and cleaned and normalized the data. Some of this involved tidying and wrangling dates (some formats that are common in the US and UK aren’t international standards!) and consolidating several files into normalized time series. We also added some standard metadata, such as column descriptions, and packaged the data as a Data Package, a specification from Frictionless Data that helps format and describe a collection of data. You can even download it in alternative formats (e.g. JSON) from our DataHub: DataHub.io provides a user-friendly interface to showcase our datasets.
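To give a feel for this kind of wrangling, here is a minimal sketch in plain Python. It is not our actual pipeline: the sample rows are made up, and the column layout (one column per date, written US-style as m/d/yy) is an assumption about how the upstream time series files are shaped. The helper names `to_iso_date` and `unpivot` are hypothetical.

```python
import csv
import io
from datetime import datetime

# Made-up sample in a wide layout: one column per date, US-style m/d/yy.
WIDE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Thailand,15.0,101.0,2,3
Hubei,China,30.97,112.27,444,444
"""

# Columns that identify a place rather than hold a daily value (assumed layout).
ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def to_iso_date(us_date):
    """Convert a US-style m/d/yy string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(us_date, "%m/%d/%y").date().isoformat()


def unpivot(wide_text, value_name):
    """Turn one-column-per-date rows into one row per (place, date)."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        for column, value in record.items():
            if column in ID_COLUMNS:
                continue  # skip the identifying columns, keep only date columns
            rows.append({
                "Date": to_iso_date(column),
                "Country/Region": record["Country/Region"],
                "Province/State": record["Province/State"],
                value_name: int(value),
            })
    return rows


long_rows = unpivot(WIDE_CSV, "Confirmed")
for row in long_rows:
    print(row)
```

Writing dates as ISO 8601 removes the day/month ambiguity, and the long layout (one row per place and date) makes it much easier to join the confirmed, deaths and recovered files into a single time series.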


Screenshot from our datahub.io site

Our dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:

  • confirmed tested cases of Coronavirus infection;
  • the number of people who have reportedly died while sick with Coronavirus; and
  • the number of people who have reportedly recovered from it.

Our data is disaggregated by country and region/state, with additional aggregated files by country and worldwide. What you might find useful about our datasets is our commitment to using open source scripts, which allows you to audit or cross-check the data for yourself - or contribute towards improving it!
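As a rough illustration of how the aggregated files relate to the disaggregated one, the sketch below sums region-level rows up to country level and then to a worldwide total. The rows and numbers are invented for the example, and `total_by` is a hypothetical helper rather than a function from our scripts.

```python
from collections import defaultdict

# Invented long-format rows: one row per date and country/region/state.
rows = [
    {"Date": "2020-03-01", "Country/Region": "Australia", "Province/State": "Victoria", "Confirmed": 7},
    {"Date": "2020-03-01", "Country/Region": "Australia", "Province/State": "Queensland", "Confirmed": 9},
    {"Date": "2020-03-01", "Country/Region": "Italy", "Province/State": "", "Confirmed": 100},
    {"Date": "2020-03-02", "Country/Region": "Italy", "Province/State": "", "Confirmed": 120},
]


def total_by(rows, key_fields, value_field):
    """Sum value_field over all rows that share the same key_fields."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key] += row[value_field]
    return dict(totals)


# Country level: collapse the regions/states within each country.
by_country = total_by(rows, ("Date", "Country/Region"), "Confirmed")

# Worldwide: collapse everything within each date.
worldwide = total_by(rows, ("Date",), "Confirmed")

print(by_country)
print(worldwide)
```

Because the aggregates are derived from the detailed rows by a simple, repeatable rule, anyone can re-run the calculation and cross-check the published totals.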

Interestingly, this has been our most popular dataset to date - with over 650 stars and developers building their own applications (including dashboards) based on our dataset!

Following the Data back to the Source

At Datopian, we love to build reliable and auditable data pipelines! On our journey to finding some insights in the COVID-19 datasets, we came across data sources with varying degrees of data quality and reliability - and re-discovered the old adage that a dataset is only as good as its data sources!

Here’s a list of the data sources used to create the main aggregated dataset:

Fig 1: Schematic of the data sources by Rufus Pollock


The schematic above shows the sources from which we collect our data using the dataflows library.

Dashboard Insights and the journey onward

The need to communicate important global issues and developments in near real-time with data has never been greater. The current COVID-19 pandemic has already had far-reaching implications for almost all sectors of our economy, and has forced governments to put in place a range of social restrictions.

How citizens cooperate with these measures and monitor their effectiveness depends heavily on transparency and feedback about the progress made. This is particularly true when the situation is changing exponentially and it’s difficult to get a sense of how quickly things are evolving. So data visualization has an important role to play here.

At Datopian, we created this dynamic dashboard, built using React, which shows the cumulative number of confirmed cases, new cases per day and deaths. One of the key features of this visualization is that it allows you to select the country of your choice to check on the latest COVID-19 indicators. We also added a sortable table that shows a summary for each country.
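The "new cases per day" indicator can be derived from the cumulative series by differencing consecutive days. The sketch below shows the idea in Python with made-up numbers; it is an illustration of the arithmetic, not the dashboard's own code, and `daily_new_cases` is a hypothetical name.

```python
def daily_new_cases(cumulative):
    """Difference a cumulative series to get the new cases reported each day."""
    new_cases = []
    previous = 0  # assumes the series starts from zero known cases
    for total in cumulative:
        new_cases.append(total - previous)
        previous = total
    return new_cases


# Made-up cumulative confirmed counts for four consecutive days.
cumulative_confirmed = [2, 5, 5, 11]
print(daily_new_cases(cumulative_confirmed))  # [2, 3, 0, 6]
```

One thing to watch for at a hackathon: if a source revises its cumulative figures downwards, this calculation produces a negative value for that day, so it is worth deciding up front how to display or flag such corrections.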


Fig 2: Screenshot of our dashboard’s descriptive statistics

While the COVID-19 pandemic continues to sweep across the globe, data hackers continue to give generously of their time and effort in support of humanity. There’s so much we can still do to contribute towards our open data ecosystem, and to learn from these experiences.

If you want to join and add to our efforts, check out our GitHub:


© Datopian (CC Attribution-Sharealike (by-sa))
