CKAN-based Data Hubs for Earth observation communities

Tue, Jan 14, 2020

Paul Walsh

What is CKAN? What can it do for data publishers and data users?

In this blog post, we introduce CKAN and show how it is being used for Earth observation data, both in Europe and elsewhere, and how you can use it too.

We'll cover

  • a brief introduction to CKAN
  • the applicability of CKAN to Earth observation data
  • why CKAN is a great choice for building data hubs in general, and an Earth observation data hub in particular
  • how CKAN differentiates itself from the alternatives

CKAN started in 2007 at the Open Knowledge Foundation, a nonprofit dedicated to fair and equitable societies through equal access to knowledge and data, and CKAN was designed around this mission. Back then, governments around the world were only starting to release data publicly under open licenses as part of various government transparency initiatives. Much of CKAN's growth came from those early use cases, with particular attention paid to data publication and data access.

CKAN has since evolved into a much more powerful data management system. Its many features around raw publication and access have grown to a great extent and are part of its core functionality.

About CKAN – Data publication and data access

CKAN is a metadata-driven catalog. Metadata is the information about each dataset: from the name of the dataset and the time it was last modified, through information on the quality of the data and provenance information describing what transformations may have happened to the data before it reached the catalog, to other types of information such as taxonomic information or semantic markup for a dataset.

Metadata is consequently indispensable for data management operations.

CKAN can store the data itself, and it can likewise store only the metadata for data that is hosted on other platforms around the web. It's precisely this metadata that provides a wealth of data exploration and discovery functionality.
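To make the idea of a metadata record concrete, here is a minimal sketch in Python of what a dataset entry might look like. The field names follow the dataset dictionaries that CKAN's API exchanges, but the dataset itself, the host names and the helper `externally_hosted` are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical dataset record; every value here is illustrative only.
dataset = {
    "name": "sea-surface-temperature-2019",  # URL-safe identifier
    "title": "Sea Surface Temperature 2019",
    "notes": "Daily gridded sea surface temperature (example description).",
    "license_id": "cc-by",
    "tags": [{"name": "ocean"}, {"name": "temperature"}],
    # Free-form key/value pairs, e.g. for provenance or quality notes.
    "extras": [{"key": "provenance", "value": "reprocessed from swath data"}],
    "resources": [
        {
            # A file uploaded to the (hypothetical) catalog itself.
            "name": "Monthly means (CSV)",
            "format": "CSV",
            "url": "https://catalog.example.org/files/sst-monthly.csv",
        },
        {
            # A file hosted elsewhere; the catalog only keeps the link.
            "name": "Full archive (NetCDF)",
            "format": "NetCDF",
            "url": "https://data.example.net/sst/2019.nc",
        },
    ],
}


def externally_hosted(record, catalog_host):
    """Return the resources whose files live outside the catalog's own host."""
    return [
        resource
        for resource in record["resources"]
        if urlparse(resource["url"]).hostname != catalog_host
    ]


remote = externally_hosted(dataset, "catalog.example.org")
print([resource["name"] for resource in remote])
```

Both kinds of resource are described by the same metadata, which is why search and discovery work the same way regardless of where the file is stored.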

In short, all of the features that CKAN offers are built on this core spine of a metadata catalog. The user interface and the API offer rich features for filtering data: free-text search over the metadata fields, along with other methods for filtering and accessing data, so that new datasets within the catalog can be discovered.
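As a rough illustration of search over the API, the sketch below builds a request URL for CKAN's `package_search` action and reads dataset titles out of a response. The catalog address and the helper names `build_search_url` and `dataset_titles` are assumptions made for this example; the code only constructs the URL and does not contact any server.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, filters=None, rows=10):
    """Build a package_search URL for a free-text query plus field filters."""
    params = {"q": query, "rows": rows}
    if filters:
        # Each filter is written as a required Solr term: +field:"value"
        params["fq"] = " ".join(
            f'+{field}:"{value}"' for field, value in filters.items()
        )
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{urlencode(params)}"


def dataset_titles(response):
    """Extract titles from a decoded package_search JSON response."""
    if not response.get("success"):
        raise ValueError("the API reported a failed request")
    return [dataset["title"] for dataset in response["result"]["results"]]


# Hypothetical catalog address, used only to show the resulting URL shape.
print(build_search_url(
    "https://catalog.example.org",
    "sea surface temperature",
    filters={"tags": "ocean"},
    rows=5,
))
```

Fetching the URL with any HTTP client returns JSON, and `dataset_titles` shows where the matching datasets sit in that structure.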

CKAN also offers rich APIs that enable web developers to build applications on top of the data within a catalog. Those APIs can also be used to import data into the catalog and to apply other types of transformations to the data that sits in a given catalog.

One example is Energinet, a free and open data portal Datopian has worked on, where anyone can get data about the Danish energy system.

CKAN also offers rich tooling for data licensing, access controls on the data in the catalog, and user authentication and permissions.
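Write operations through the API, such as the imports mentioned above, are subject to these permissions. The following sketch prepares an authenticated `package_create` request using only Python's standard library. The catalog address, the token value and the helper name `build_create_request` are placeholders, and depending on how a given catalog is configured, further fields (for example an owning organization) may be required.

```python
import json
from urllib.request import Request


def build_create_request(base_url, api_token, dataset):
    """Prepare (but do not send) a POST request that creates a dataset."""
    url = f"{base_url.rstrip('/')}/api/3/action/package_create"
    return Request(
        url,
        data=json.dumps(dataset).encode("utf-8"),
        headers={
            # The API key or token identifies the user; the catalog then
            # decides whether that user may create datasets.
            "Authorization": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; a real token comes from the user's account on the catalog.
request = build_create_request(
    "https://catalog.example.org",
    "my-secret-token",
    {"name": "ocean-colour-2019", "title": "Ocean Colour 2019", "license_id": "cc-by"},
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it. A request without a valid token, or from a user without the right permissions, would be rejected by the catalog.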

CKAN for public data

Given that CKAN has been a solution in the space of public data release for over 10 years, it has been tested and trusted by a variety of data platforms: wherever you see a public data platform online, there's a very good chance that CKAN is behind it. Some of the largest governments in the world run CKAN for public data, from the US to Australia, the UK and many more.

CKAN for internal data management needs

Additionally, many organizations, corporations and nonprofits use CKAN internally as a data management system. There are thousands of CKAN instances in production, all supporting these core use cases around publication and access, and additional ones, whether for public data release or for internal data management needs.

The relationship of CKAN to Earth observation data

It turns out that the use cases around Earth observation data are very similar to the use cases that CKAN has been designed around.

As with many types of public data, we want

  • to have really good metadata management,
  • powerful tools for exploration and discovery of data using that metadata,
  • the ability to preview and download data sets,
  • the ability to harvest data from external sources and to potentially enrich the metadata for given datasets that exist in the catalog,
  • the ability to apply taxonomies and other types of semantic information to data,
  • APIs for application development,
  • and importantly, especially with Earth observation data where datasets can be rather large, raw file access so that data can be processed offline.

CKAN has a great range of existing extensions that are suitable for Earth observation data. Many of them support spatial querying and geospatial harvesting; others support adding vector data, ArcGIS map services and feature layers. There is also support for DCAT, a vocabulary for the semantic markup of data catalogs that is also used with Earth observation data.
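As a sketch of what spatial querying can look like, the example below builds a search URL restricted to a bounding box. It assumes the catalog runs the ckanext-spatial extension, which adds an `ext_bbox` parameter to `package_search`; the catalog address, the helper name `build_bbox_search_url` and the coordinates are illustrative. Nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_bbox_search_url(base_url, min_lon, min_lat, max_lon, max_lat, query="*:*"):
    """Build a package_search URL limited to a longitude/latitude bounding box.

    This simple check rejects boxes that cross the antimeridian.
    """
    if not (-180 <= min_lon <= max_lon <= 180):
        raise ValueError("longitudes must be ordered and within [-180, 180]")
    if not (-90 <= min_lat <= max_lat <= 90):
        raise ValueError("latitudes must be ordered and within [-90, 90]")
    params = {
        "q": query,
        # Order assumed here: min longitude, min latitude, max longitude, max latitude
        "ext_bbox": f"{min_lon},{min_lat},{max_lon},{max_lat}",
    }
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{urlencode(params)}"


# Rough box around Denmark, used purely as an example.
print(build_bbox_search_url("https://catalog.example.org", 8.0, 54.5, 15.2, 57.8))
```

On a catalog without the spatial extension, the extra parameter would not restrict the results, so it is worth checking which extensions a given instance has enabled.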

There are several existing platforms that are running CKAN for Earth observation data:

  • AmeriGEOSS datahub
  • The National Oceanic and Atmospheric Administration (NOAA)
  • NextGEOSS
  • a range of other CKAN installations around the world that have Earth observation data in them with other types of data, like Energinet, Natural History Museum, and more

So why is CKAN a good choice for a data hub, beyond the fact that it meets these core use cases for data management?

What else does CKAN have to offer?

First of all, CKAN is open source software. There are no licensing fees, there is no vendor lock-in, and there are many options for support, ranging from developers within the community through to paid options with commercial vendors. Another important point is that CKAN has more than a decade's worth of people committing code to it or to extensions around it, solving data management problems. Consequently, if a solution doesn't already exist in an extension or in the core, there are experienced people in the community who can be asked for best practices or advice on solutions to particular data management problems. Moreover, CKAN has an open architecture and is deeply extensible: almost every part of it can be customized to suit particular use cases or data workflows, which also helps in integrating CKAN into existing systems. It is very easy to install and maintain on commodity hardware and common cloud providers.

Most importantly of all, CKAN has a large open community: anyone can contribute code and ideas. Being open source, it is supported and nourished by a broad community of active, informed and devoted stakeholders and a range of experts with extensive experience in data and data management technology. There is also a large ecosystem of extensions and complementary libraries in the spaces of data science, data engineering and others.

There are very active community channels, with a large number of developers with different sectoral focuses who collaborate on data management issues in CKAN, from government data to energy data and all sorts of other sectors.

Together with LinkDigital, we at Datopian have been appointed as co-stewards, striving to move CKAN forward, to strengthen both the platform and the community, and to see CKAN grow and prosper in the years to come.

In summary, CKAN is an extensible metadata and data management platform with a large ecosystem of extensions and best practices, with great support for use cases around Earth observation data and a vibrant open source community led by open data experts. These are the reasons why CKAN is a great choice for an Earth observation data hub.

Datopian offers a range of CKAN services for open data portals and internal data management systems. We offer managed hosting, custom development, and consulting services for CKAN and open data.

For NextGEOSS, together with other partners in the project, we lead the work package that covers the development of the NextGEOSS Datahub, which integrates references to, amongst many others, all types of Earth observation data from a plethora of data providers.

Communities around Earth observation data, reach out to us! We'd be interested to hear about your experiences with CKAN.