Datopian & CivicActions
U.S. Department of Education
CKAN Development, CKAN Features, CKAN Consultancy
September 2018 - present
Work we've done:
Department of Education ODP (Open Data Platform)
Brief summary of the project
Datopian (in partnership with CivicActions) have delivered a government-compliant Open Data Platform that revolutionizes data accessibility, navigation, and user engagement, fulfilling the United States Department of Education's need for a centralized, user-friendly data repository.
The problem faced by the United States Department of Education was the lack of a centralized location to organize, search, edit, and share their massive and ever-growing open data collection. As a result, it was difficult for inter-organizational Data Stewards, as well as the general public, to easily navigate and utilize the data.
The Department of Education needed a solution that would allow for the easy organization, searchability, navigation, and usage of their data by a wide range of audiences, including researchers, teachers, government employees, and the general public. The project required a solution that would fulfill a broad set of criteria, including centralizing access to public data, providing a modern and intuitive interface for navigating data resources, allowing an easy search of public datasets by metadata, facilitating user engagement, and enabling analytics that allow the Department to answer questions, among others.
Datopian, in collaboration with CivicActions, developed a centralized Open Data Platform (ODP) that meets government standards and offers a feature-rich, user-friendly location for organizing, searching, editing, and sharing data, which can be accessed by users at any level. The platform collects metadata using web scrapers and ingests data from education-related organizations. Custom solutions were developed for managing users, approving data, and creating and editing dataset relationships. The ODP ensures compliance with the Government mandate on open data and supports standardized Project Open Data Schema, making data exchange simple and flexible.
Main technologies & tools used
The United States Department of Education creates and helps execute education policy, and administers and coordinates federal assistance to education. Fulfilling these responsibilities includes gathering, organizing, and utilizing large quantities of data. Since much of this data is open, it's used not only internally by the Department of Education, but also by other organizations, researchers, and the general public.
At the inception of this project, the Department of Education had a clear goal: build a centralized location (the Open Data Platform—or ODP) for their massive—and ever-growing—open data collection, to allow easier organization, searchability, navigation, and usage of their data by interorganizational Data Stewards, along with the general public. The journey of this (still ongoing) project has been a collaboration between the Department of Education, Datopian, and CivicActions.
This project was inspired by portals such as data.gov and healthdata.gov. The following is a list of objectives provided by the Department of Education from day one:
- Centralize access to ED’s public data, allowing end users to download ED datasets in bulk, read related documentation, and query data elements via APIs (where feasible) from a singular location.
- Provide end users with a modern, intuitive, aesthetically-pleasing interface for navigating data resources.
- Allow end users to easily search public datasets by defined metadata.
- Facilitate user engagement, including by allowing end users to provide feedback on data resources.
- Allow end users and data stewards to see the popularity of a given public dataset (e.g., counts of dataset downloads).
- Give data stewards a user-friendly interface for hosting and sharing public datasets, describing/tagging them, tracking what they post, and following the activity of other data publishers in the Department.
- Allow offices with many currently existing public datasets to easily import their data resources into the catalog.
- Enable the Department to require that certain metadata be associated with a data resource before it can be published.
- Federate the compilation of an agency public dataset listing that can be fed into data.gov per their requirements (i.e., enable the platform to automatically generate a data.gov-compatible listing based on the data resources referenced and uploaded to it, thereby allowing offices to contribute directly to the listing).
- Make its contents readily available to other systems; i.e., it should be possible to exchange information with other inventories, including data.gov and the ED Data Inventory, the Department’s collection repository.
- Enable analytics that allow the Department to answer questions.
- Enable authorization workflows that manage the movement of datasets from creation to review to approval and publication, in part by allowing offices to determine which data stewards can create, edit, and/or publish datasets.
- Ease administration of the platform by automating notifications in support of related workflows, such as the approval of datasets for publication and responding to public inquiries.
Tackling the objectives provided by the Department of Education began (and continues) with an emphasis on strong communication and feedback loops within two-week Agile sprints (originally the Scrum framework, but we later adopted Scrumban—a hybrid of Scrum and Kanban—which has been working really well).
Datopian's mature knowledge and expertise with all things CKAN and data engineering (along with everything in between) pair exceedingly well with CivicActions' proven history of proficiency and success in assisting government agencies with creating—or migrating to—software with modern standards, design principles, and usability.
The ODP is used for data found throughout the hierarchy of organizations related to the Department of Education. Because of this, one of the early (and critical) tasks was to create a collection of web scrapers (using Python's Scrapy). These scrapers collect the metadata required for CKAN to ingest data from education-related organizations and are responsible for populating the Open Data Platform with much of its data.
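To illustrate the kind of metadata a scraper gathers from a source page, here is a minimal sketch. The project's scrapers are built with Scrapy; this example deliberately uses only Python's standard library so it runs without extra dependencies, and it is not the project's actual code. The names `DatasetPageParser`, `scrape_metadata`, and `DATA_EXTENSIONS`, as well as the choice of file extensions, are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed list of file extensions that indicate a downloadable data resource.
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip", ".json")


class DatasetPageParser(HTMLParser):
    """Collects the page title and any links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.resources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(DATA_EXTENSIONS):
                # Resolve relative links against the page URL.
                self.resources.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def scrape_metadata(html, base_url):
    """Return a small metadata record for one already-downloaded page."""
    parser = DatasetPageParser(base_url)
    parser.feed(html)
    return {
        "title": parser.title,
        "source_url": base_url,
        "resources": parser.resources,
    }
```

A record like the one returned by `scrape_metadata` is the sort of input an ingestion step could then turn into a CKAN dataset with its resources.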
To ensure that all of the metadata is retrieved and retained on CKAN during ingestion, a custom schema was developed. This allows a seamless transfer of data from the various sources to the ODP. And, since the DCAT-US schema fields are retained, data found on the ODP is easily ingestible by others via the platform's data.json file.
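As a rough sketch of why retaining DCAT-US fields makes a data.json listing straightforward, the following maps a CKAN-style dataset dictionary to a DCAT-US (Project Open Data) entry. The functions `to_dcat_us` and `build_catalog` are hypothetical, the custom fields `access_level` and `publisher` are assumed names, and only a subset of DCAT-US fields is shown; the ODP's actual schema and export code may differ.

```python
def to_dcat_us(pkg):
    """Map a CKAN-style dataset dict to a (partial) DCAT-US dataset entry."""
    return {
        "@type": "dcat:Dataset",
        "title": pkg["title"],
        "description": pkg.get("notes", ""),
        "keyword": [tag["name"] for tag in pkg.get("tags", [])],
        "modified": pkg.get("metadata_modified", ""),
        "identifier": pkg["id"],
        # "access_level" and "publisher" are assumed custom schema fields.
        "accessLevel": pkg.get("access_level", "public"),
        "publisher": {
            "@type": "org:Organization",
            "name": pkg.get("publisher", ""),
        },
        "distribution": [
            {
                "@type": "dcat:Distribution",
                "downloadURL": res["url"],
                "mediaType": res.get("mimetype", ""),
            }
            for res in pkg.get("resources", [])
        ],
    }


def build_catalog(packages):
    """Wrap the mapped datasets in a data.json-style catalog object."""
    return {
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": [to_dcat_us(pkg) for pkg in packages],
    }
```

Serializing the result of `build_catalog` with `json.dumps` would give a document in the general shape that data.gov harvests.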
The admins of the ODP have various responsibilities. Near the top of that list is verifying that resource files, resource links, and Data Explorers are suitable for the portal, so an approval workflow was developed. This provides a review process, allowing admins to approve or reject (with comments, if needed) Data Explorers (individually) and resources (individually or in bulk).
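The review process described above can be pictured as a small state machine. The sketch below is illustrative only: the `Resource` class, the status names, and the `review` function are assumptions, not the ODP's implementation. It shows how a single call could approve or reject one item or a whole batch, with an optional comment.

```python
from dataclasses import dataclass, field

# Assumed status values for the illustration.
PENDING, APPROVED, REJECTED = "pending", "approved", "rejected"


@dataclass
class Resource:
    name: str
    status: str = PENDING
    comments: list = field(default_factory=list)


def review(resources, decision, comment=None):
    """Apply an admin decision to every pending resource in the batch.

    Resources that were already reviewed are left untouched.
    """
    if decision not in (APPROVED, REJECTED):
        raise ValueError(f"unknown decision: {decision!r}")
    for resource in resources:
        if resource.status != PENDING:
            continue
        resource.status = decision
        if comment:
            resource.comments.append(comment)
    return resources
```

Passing a one-element list covers the individual case, and passing several resources covers bulk review.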
The admins of the ODP are also responsible for overall user management and platform sanitization. CKAN does not cover many of these requirements out of the box, so a custom solution had to be built for deleting users. When users create objects (for example, datasets or any of the group types), they become the owner of that object. Since CKAN doesn't include functionality to change object owners, a solution was developed to resolve this. It allows admins to change the owner of datasets (Data Profiles) and group types (Collections, Sources, and Data Explorers) directly on the object's page in the UI.
Organizations don't have an owner in the same sense, but they do have roles. The default CKAN flow for replacing an organization admin is to navigate to the roles page, remove the current user, add the new user, then assign the admin role. Similar to the solution for object owners, there's now a UI option to swap or change organization admins.
When dealing with datasets (especially in large quantities), the ability to create, edit, and view the relationships between them can be very useful. Though CKAN has basic support for dataset relationships, by default it's only accessible via APIs or in the backend with custom code. To resolve this, the ability to create and remove relationships between datasets was developed and implemented, available on each dataset edit page, providing an easy process for (only) admins to set parent/child, dependency, and derivation relationships between datasets.
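For context, this is roughly what creating a relationship through CKAN's action API looks like without the UI. The sketch builds (but does not send) a request to CKAN's `package_relationship_create` action, which takes `subject`, `object`, `type`, and an optional `comment`. The helper name `relationship_request`, the site URL, and the token value are placeholders for illustration.

```python
import json
from urllib.request import Request

# Relationship types accepted by CKAN's package relationship actions.
RELATIONSHIP_TYPES = {
    "child_of", "parent_of",
    "depends_on", "dependency_of",
    "derives_from", "has_derivation",
    "links_to", "linked_from",
}


def relationship_request(site_url, api_token, subject, obj, rel_type, comment=""):
    """Build a POST request that would create a dataset relationship."""
    if rel_type not in RELATIONSHIP_TYPES:
        raise ValueError(f"unsupported relationship type: {rel_type!r}")
    payload = {
        "subject": subject,   # id or name of the dataset the relation starts from
        "object": obj,        # id or name of the related dataset
        "type": rel_type,
        "comment": comment,
    }
    return Request(
        url=site_url.rstrip("/") + "/api/3/action/package_relationship_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request (for example with `urllib.request.urlopen`) requires a real CKAN instance and a token with sufficient permissions, which is why the sketch stops at building it.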
Due to the complexity of some dataset relationship trees, navigable, collapsible hierarchies were implemented. Each dataset page displays its full tree, expanded by default to the location of the currently viewed dataset, and every dataset title in the tree is a hyperlink for easy navigation. The ability to persistently reorder the datasets in the hierarchy was added later on.
A similar, though simpler, version of this was implemented for organizations, available on the main organization list page.
This covers many of the noteworthy and valuable solutions so far, but it is by no means exhaustive. Numerous smaller UI, UX, and backend improvements and changes have been made to continually improve all aspects of the ODP.
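The "expanded to the current dataset" behavior can be sketched with a simple parent lookup. This is only an illustration under assumed data structures: `parent_of` is a hypothetical mapping from each dataset to its parent (`None` for a root), and the function names are invented for the example.

```python
def build_children(parent_of):
    """Invert a child -> parent mapping into parent -> [children]."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)
    return children


def expanded_path(parent_of, current):
    """Return the chain of datasets from the root down to `current`."""
    path, seen = [], set()
    node = current
    while node is not None and node not in seen:  # guard against cycles
        seen.add(node)
        path.append(node)
        node = parent_of.get(node)
    return list(reversed(path))


def render(children, expanded, parent=None, depth=0):
    """Render the tree as indented lines, opening only expanded nodes."""
    lines = []
    for node in children.get(parent, []):
        lines.append("  " * depth + node)
        if node in expanded:
            lines.extend(render(children, expanded, node, depth + 1))
    return lines
```

With `expanded` set to the path of the dataset being viewed, branches that do not lead to it stay collapsed while its ancestors are opened.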
The project has come a long way since the beginning. The result is that the Department of Education now has a feature-rich central location to organize, search, edit, and share data. Users of the portal, at any level (public, research, teachers, government employees, etc.), will find a plethora of features to help them with whatever their data journey holds.
- The ODP provides a robust platform for the Department of Education to collate, curate, and share its open data with a wide spectrum of audiences in a flexible manner that suits their internal business processes.
- The ODP ensures the Department of Education not only complies with the Government mandate on open data, but also supports the standardized Project Open Data Schema, thereby enabling easy data exchange between the ODP and other compliant systems.
On the roadmap for the near future, there are a couple of items in sight:
The first destination is to migrate CKAN and its extensions from Python 2 to Python 3. Preliminary work on this is already in progress, and I know all parties involved are excited for the migration!
The second, which will likely immediately follow the migration, is to develop a better workflow for work-in-progress and incomplete datasets (i.e., those missing any required metadata). The aim is to automate the verification of metadata validity and completeness. Once a dataset's metadata meets certain criteria, the dataset will become publicly visible and accessible in the UI and data.json.
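As a sketch of what such an automated completeness check might look like, the following gates public visibility on the presence of required fields. The field list follows the always-required dataset fields of the Project Open Data (DCAT-US) schema; the function names are hypothetical, and the criteria the ODP eventually adopts may well be different.

```python
# Dataset-level fields that DCAT-US marks as always required.
REQUIRED_FIELDS = (
    "title",
    "description",
    "keyword",
    "modified",
    "publisher",
    "contactPoint",
    "identifier",
    "accessLevel",
)


def missing_fields(dataset):
    """List required fields that are absent or empty in the dataset."""
    return [name for name in REQUIRED_FIELDS if not dataset.get(name)]


def is_publishable(dataset):
    """A dataset is publishable once no required field is missing."""
    return not missing_fields(dataset)


def public_listing(datasets):
    """Keep only the datasets that may appear in the UI and data.json."""
    return [d for d in datasets if is_publishable(d)]
```

Returning the list of missing fields, rather than just a yes or no, would also let the UI tell a data steward exactly what remains to be filled in.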
We create, maintain, and deploy data management and data engineering technologies for government, enterprise, and the non-profit sector using CKAN, Frictionless Data, DataHub.io and other open-source software that we have built ourselves.