Service providers: Datopian & CivicActions (DevSecOps services)
Client: U.S. Department of Education
Services: Data engineering; data scraping; CKAN development; CKAN consultancy; CKAN extensions
Period: September 2018 – present
Work we’ve done: Department of Education ODP (Open Data Platform)
Summary of what we did for them:
– Additional metadata fields for datasets, to meet DCAT-US (Project Open Data) Schema compliance.
– Ability to create or delete dataset relationships within the UI.
– Navigable and complete dataset relationship hierarchies displayed on each dataset page. Each relationship is displayed in its own per-level collapsible tree. The relationship types are parent/child, dependencies, and derivations.
– A URL path (data.json) that displays an up-to-date JSON file listing all DCAT-US Schema compliant metadata for all publicly available datasets.
– Additional group types (Collections, Sources, and Data Explorers) for organizing Data Profiles in various ways that aren’t supported in the default CKAN groups.
– An organization hyperlink tree based on the hierarchy of organizations (and their sub-organizations) beneath the Department of Education.
– Ability to persistently re-order datasets in each level of the parent/child trees.
– Additional harvest types for external metadata/dataset ingestion.
– Web scrapers for multiple Department of Education sub-organizations, required for generating harvest-compatible JSON files to ingest metadata/datasets.
– Allow admins to change the owner of dataset, organization, and group object types.
– Automated creation of Sources (a custom group type), using scraped OMB number metadata (approved numbers only).
– UI/UX improvements, guided by the U.S. Web Design System (USWDS) design principles.
– An approval workflow allowing admins to review resources (file uploads and external URLs) and Data Explorers (a custom group type). Approval and rejection can be done individually or in bulk. The workflow also includes an activity list for reviewing approval requests and their resulting approval or rejection.
– Multiple CKAN minor release upgrades.
– WIP: Migrate from Python 2 to Python 3.
Main technologies used
– Cypress testing
The United States Department of Education creates and helps execute education policy, and administers and coordinates federal assistance to education. Fulfilling these responsibilities includes gathering, organizing, and using large quantities of data. Since much of this data is open, it is used not only internally by the Department of Education, but also by other organizations, researchers, and the general public.
At the inception of this project, the Department of Education had a clear goal: build a centralized location (the Open Data Platform—or ODP) for their massive—and ever-growing—open data collection, to allow easier organization, searchability, navigation, and usage of their data by interorganizational Data Stewards, along with the general public. The journey of this (still ongoing) project has been a collaboration between the Department of Education, Datopian, and CivicActions.
This project was inspired by portals such as data.gov and healthdata.gov. The following is a list of objectives provided by the Department of Education from day one:
- Centralize access to ED’s public data, allowing end users to download ED datasets in bulk, read related documentation, and query data elements via APIs (where feasible) from a singular location.
- Provide end users with a modern, intuitive, aesthetically-pleasing interface for navigating data resources.
- Allow end users to easily search public datasets by defined metadata.
- Facilitate user engagement, including by allowing end users to provide feedback on data resources.
- Allow end users and data stewards to see the popularity of a given public dataset (e.g., counts of dataset downloads).
- Give data stewards a user-friendly interface for hosting and sharing public datasets, describing/tagging them, tracking what they post, and following the activity of other data publishers in the Department.
- Allow offices with many currently existing public datasets to easily import their data resources into the catalog.
- Enable the Department to require that certain metadata be associated with a data resource before it can be published.
- Federate the compilation of an agency public dataset listing that can be fed into data.gov per their requirements (i.e., enable the platform to automatically generate a data.gov-compatible listing based on the data resources referenced and uploaded to it, thereby allowing offices to contribute directly to the listing).
- Make its contents readily available to other systems; i.e., it should be possible to exchange information with other inventories, including data.gov and the ED Data Inventory, the Department’s collection repository.
- Enable analytics that allow the Department to answer questions.
- Enable authorization workflows that manage the movement of datasets from creation to review to approval and publication, in part by allowing offices to determine which data stewards can create, edit, and/or publish datasets.
- Ease administration of the platform by automating notifications in support of related workflows, such as the approval of datasets for publication and responding to public inquiries.
Tackling the objectives provided by the Department of Education began (and continues) with an emphasis on strong communication and feedback loops within two-week Agile sprints (originally the Scrum framework, but we later adopted Scrumban—a hybrid of Scrum and Kanban—which has been working really well).
Datopian’s mature knowledge and expertise with all things CKAN and data engineering (along with everything in between) pair exceedingly well with CivicActions’ proven history of proficiency and success in assisting government agencies with creating—or migrating to—software with modern standards, design principles, and usability.
The ODP is used for data found throughout the hierarchy of organizations related to the Department of Education. Because of this, one of the early (and critical) tasks was to create a collection of web scrapers using Scrapy, a Python scraping framework. These scrapers collect the metadata required for CKAN to ingest data from education-related organizations and are responsible for populating the Open Data Platform with much of its data.
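As a rough illustration of what such a scraper produces, the sketch below pulls a page title and links to data files out of an HTML page and emits a small JSON-ready record. It deliberately uses only Python's standard library instead of Scrapy so that it runs anywhere; the class name, the list of file extensions, and the choice of output keys (borrowed from DCAT-US field names) are assumptions for illustration, not the project's actual spiders or harvest format.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list of file types treated as downloadable resources.
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip", ".pdf")


class DatasetPageParser(HTMLParser):
    """Collects the page title and absolute URLs of linked data files."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.title = ""
        self.resources = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(DATA_EXTENSIONS):
                # Resolve relative links against the page URL.
                self.resources.append(urljoin(self.page_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def scrape(page_url, html):
    """Return a JSON-ready metadata record for one scraped page."""
    parser = DatasetPageParser(page_url)
    parser.feed(html)
    return {
        "title": parser.title,
        "landingPage": page_url,
        "distribution": [{"downloadURL": url} for url in parser.resources],
    }
```

A real scraper would also need per-site selectors, pagination, and politeness settings, which is where a framework such as Scrapy earns its keep.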
To ensure that all of the metadata is retrieved and retained on CKAN during ingestion, a custom schema was developed. This allows a seamless transfer of data from the various sources to the ODP. Since the DCAT-US schema fields are retained, data found on the ODP is also easily ingestible by others via the data.json endpoint.
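To make the data.json idea concrete, here is a minimal sketch of how CKAN-style dataset records might be mapped to a DCAT-US v1.1 catalog. The input keys follow CKAN's standard package dictionary; the function names are hypothetical, only a subset of DCAT-US fields is shown, and the ODP's actual custom schema is not reproduced here.

```python
import json


def to_dcat_us(pkg):
    """Map one CKAN-style package dict to a DCAT-US v1.1 dataset entry."""
    return {
        "@type": "dcat:Dataset",
        "identifier": pkg["id"],
        "title": pkg["title"],
        "description": pkg.get("notes") or "",
        "keyword": [tag["name"] for tag in pkg.get("tags", [])],
        "modified": pkg["metadata_modified"],
        "publisher": {
            "@type": "org:Organization",
            "name": pkg["organization"]["title"],
        },
        "contactPoint": {
            "@type": "vcard:Contact",
            "fn": pkg.get("maintainer") or "",
            "hasEmail": "mailto:" + (pkg.get("maintainer_email") or ""),
        },
        # Only public datasets are listed, so the access level is fixed here.
        "accessLevel": "public",
    }


def build_data_json(packages):
    """Build the catalog document, skipping private datasets."""
    public = [p for p in packages if not p.get("private", False)]
    return {
        "@type": "dcat:Catalog",
        "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
        "dataset": [to_dcat_us(p) for p in public],
    }


def render_data_json(packages):
    """Serialize the catalog as the JSON text served at the data.json path."""
    return json.dumps(build_data_json(packages), indent=2)
```

Generating the file from the live catalog on each request (or on a short cache) is what keeps the listing up to date without manual edits.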
The admins of the ODP have various responsibilities. Near the top of that list is verifying that resource files, resource links, and Data Explorers are suitable for the portal, so an approval workflow was developed. This provides a review process, allowing admins to approve or reject (with comments, if needed) Data Explorers (individually) and resources (individually or in bulk).
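The review process described above can be sketched as a small state machine. This is an in-memory illustration only: the class names, status strings, and activity tuple layout are assumptions, and the real workflow lives inside CKAN extensions rather than in standalone classes like these.

```python
from dataclasses import dataclass, field

PENDING, APPROVED, REJECTED = "pending", "approved", "rejected"


@dataclass
class ReviewItem:
    item_id: str
    kind: str  # "resource" or "data_explorer"
    status: str = PENDING
    comment: str = ""


@dataclass
class ReviewQueue:
    items: dict = field(default_factory=dict)
    activity: list = field(default_factory=list)  # (item_id, event, comment)

    def submit(self, item):
        """Register an item for review and log the request."""
        self.items[item.item_id] = item
        self.activity.append((item.item_id, "submitted", ""))

    def decide(self, item_id, approve, comment=""):
        """Approve or reject a single pending item."""
        item = self.items[item_id]
        if item.status != PENDING:
            raise ValueError(f"{item_id} has already been reviewed")
        item.status = APPROVED if approve else REJECTED
        item.comment = comment
        self.activity.append((item_id, item.status, comment))

    def decide_bulk(self, item_ids, approve, comment=""):
        """Apply one decision to several resources at once."""
        selected = [self.items[i] for i in item_ids]
        # Validate everything first so a bad entry cannot cause a partial update.
        if any(item.kind != "resource" for item in selected):
            raise ValueError("bulk review is limited to resources")
        if any(item.status != PENDING for item in selected):
            raise ValueError("bulk review requires pending items")
        for item_id in item_ids:
            self.decide(item_id, approve, comment)
```

Keeping every submission and decision in one append-only activity list is what makes the review history easy to display afterwards.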
The admins of the ODP are also responsible for overall user management and platform sanitization. CKAN does not cover many of these requirements out of the box, so a custom solution had to be built for deleting users. When users create objects (for example, datasets or any of the group types), they become the owner of those objects. Since CKAN doesn’t include functionality to change an object’s owner, a solution was developed for this as well. It allows admins to change the owner of datasets (Data Profiles) and group types (Collections, Sources, and Data Explorers) directly on the object’s page in the UI.
Organizations don’t have an owner in the same sense, but they do have roles. The default flow in CKAN for replacing an organization admin is to navigate to the roles page, remove the current user, add the new user, then assign the admin role. Similar to the solution for object owners, there’s now a UI option to swap organization admins in a single step.
When dealing with datasets (especially in large quantities), the ability to create, edit, and view the relationships between them can be very useful. Though CKAN has basic support for dataset relationships, by default it’s only accessible via APIs or in the backend with custom code. To resolve this, the ability to create and remove relationships between datasets was added to each dataset edit page, giving admins (and only admins) an easy way to set parent/child, dependency, and derivation relationships between datasets.
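For context, this is roughly what the API-only route looks like without the UI. The sketch builds (but does not send) a request to CKAN's `package_relationship_create` action; the helper name, the short relationship labels, and the example URL and token are assumptions, while the action name and the `subject`/`object`/`type` parameters follow CKAN's action API as far as I am aware.

```python
import json
from urllib.request import Request

# Short labels (hypothetical) mapped to CKAN relationship type names.
RELATIONSHIP_TYPES = {
    "parent": "parent_of",
    "child": "child_of",
    "dependency": "depends_on",
    "derivation": "derives_from",
}


def relationship_request(base_url, api_token, subject, obj, kind):
    """Build a POST request declaring that `subject` <kind> `obj`."""
    payload = {
        "subject": subject,
        "object": obj,
        "type": RELATIONSHIP_TYPES[kind],
    }
    return Request(
        base_url.rstrip("/") + "/api/3/action/package_relationship_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_token},
        method="POST",
    )


# Example: mark "loan-data-2020" as a child of "loan-data".
# Passing the request to urllib.request.urlopen would actually send it.
example = relationship_request(
    "https://ckan.example.org", "my-api-token", "loan-data-2020", "loan-data", "child"
)
```

Having to assemble calls like this by hand for every pair of datasets is the friction the edit-page controls remove.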
Due to the complexity of some dataset relationship trees, navigable, collapsible hierarchies were implemented. Each dataset page displays its full tree, expanded by default to the location of the currently viewed dataset, and every dataset title in the tree is a hyperlink for easy navigation. The ability to persistently reorder the datasets in the hierarchy was added later on.
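The expand-to-current behaviour can be sketched in a few lines. This is an illustrative model only: it assumes the parent/child relationships form an acyclic tree, stores the persisted ordering as a numeric position on each edge, and renders plain text instead of the collapsible HTML used on the site. All function names are hypothetical.

```python
from collections import defaultdict


def build_children(edges):
    """edges: iterable of (parent, child, position); returns ordered child lists."""
    children = defaultdict(list)
    for parent, child, position in edges:
        children[parent].append((position, child))
    return {p: [c for _, c in sorted(kids)] for p, kids in children.items()}


def path_to(children, root, target):
    """Return the list of nodes from root down to target, or [] if absent."""
    if root == target:
        return [root]
    for child in children.get(root, []):
        sub_path = path_to(children, child, target)
        if sub_path:
            return [root] + sub_path
    return []


def render(children, root, current):
    """Render the tree with only the branch leading to `current` expanded."""
    expanded = set(path_to(children, root, current))
    lines = []

    def walk(node, depth):
        kids = children.get(node, [])
        if not kids:
            marker = "-"  # leaf
        elif node in expanded:
            marker = "v"  # expanded branch
        else:
            marker = ">"  # collapsed branch
        lines.append("  " * depth + marker + " " + node)
        if node in expanded:
            for kid in kids:
                walk(kid, depth + 1)

    walk(root, 0)
    return lines
```

Because the sort key is the stored position rather than the title, reordering a level only requires updating those position values.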
A similar, though simpler, version of this was implemented for organizations and is available on the main organization list page.
This covers many of the noteworthy and valuable solutions so far, but it is by no means exhaustive. Numerous smaller UI, UX, and backend improvements have been made to continually improve all aspects of the ODP.
The project has come a long way since the beginning. The result is that the Department of Education now has a feature-rich central location to organize, search, edit, and share data. Users of the portal, at any level (public, research, teachers, government employees, etc.), will find a plethora of features to help them with whatever their data journey holds.
- The ODP provides a robust platform for the Department of Education to collate, curate, and share its open data with a wide spectrum of audiences in a flexible manner that suits their internal business processes.
- The ODP ensures the Department of Education not only complies with the Government mandate on open data, but also supports the standardized Project Open Data Schema, thereby enabling easy data exchange between the ODP and other compliant systems.
On the roadmap for the near future, there are a couple of items in sight:
The first destination is to migrate CKAN and its extensions from Python 2 to Python 3. Preliminary work on this is already in progress, and I know all parties involved are excited for the migration!
The second, which will likely immediately follow the migration, is to develop a better workflow for work-in-progress and incomplete datasets (i.e., those missing any required metadata). The aim is to automate the verification of metadata validity and completeness. Once a dataset’s metadata meets certain criteria, the dataset will become publicly visible and accessible in the UI and data.json.
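A completeness check of this kind might look like the sketch below. The criteria are not specified above, so the required field list here is an assumption, loosely based on fields that DCAT-US marks as required; the function names are hypothetical as well.

```python
# Assumed required fields (hypothetical; the real criteria are still to be defined).
REQUIRED_FIELDS = (
    "title",
    "description",
    "keyword",
    "modified",
    "publisher",
    "contactPoint",
    "identifier",
    "accessLevel",
)


def missing_fields(metadata):
    """Return required fields that are absent or empty in the metadata dict."""
    empty_values = (None, "", [], {})
    return [name for name in REQUIRED_FIELDS if metadata.get(name) in empty_values]


def is_publishable(metadata):
    """A dataset becomes publicly visible only when nothing is missing."""
    return not missing_fields(metadata)
```

Returning the list of missing fields, rather than a bare yes/no, gives data stewards something actionable while a dataset is still a work in progress.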
We create, maintain, and deploy data management and data engineering technologies for government, enterprise, and the non-profit sector using CKAN, Frictionless Data, DataHub.io and other open-source software that we have built ourselves.