September Update!

We are now in the 11th week of our Official Inquiries project. This month, we introduced a simple search feature to the website, which allows visitors to search the text of the inquiries we have published so far.

Meanwhile, we conducted a review of text extraction tools. The better our text extraction, the less work an inquiry needs afterwards, so it was important to establish whether there was a clear favourite among the free and open source options available. As it happens, there was! Apache PDFBox produced cleaner output than the other tools on all of our example inquiries.

Also, to test our production pipeline, we chose a simple inquiry to process to a web-ready state. The Senate’s report into Enron’s collapse made for an interesting case: although it was a full-text PDF, the text was unclear and quite old, so there was a question over whether our text extraction tools would struggle with it.

As it turned out, there were a few difficulties! Much of this is fixable with a few simple find-and-replace actions, but a lot will need to be cleaned up manually, which will take valuable man-hours. Processing this inquiry really highlighted the importance of giving people with similar interests a chance to contribute and be part of the project. Our next big challenge will be to look at how we can promote Official Inquiries and make it easy for people to contribute, so that they can help us make these vital sources of information on political and commercial matters easily accessible to the public.
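
To give a flavour of the find-and-replace cleanup mentioned above, here is a minimal Python sketch. It is only an illustration: the function name `clean_extracted_text` and the rules in `CLEANUP_RULES` are our own assumptions, covering artifacts that commonly show up in text extracted from PDFs (words hyphenated across line breaks, ligature characters, stray whitespace). They are not the actual fixes applied to the Enron report.

```python
import re

# Hypothetical cleanup rules: common PDF extraction artifacts chosen for
# illustration, not the specific problems found in any particular report.
CLEANUP_RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across line breaks
    (re.compile("\ufb01"), "fi"),           # expand the single-character "fi" ligature
    (re.compile("\ufb02"), "fl"),           # expand the single-character "fl" ligature
    (re.compile(r"[ \t]+"), " "),           # collapse runs of spaces and tabs
    (re.compile(r"\n{3,}"), "\n\n"),        # squeeze three or more newlines down to two
]


def clean_extracted_text(text: str) -> str:
    """Apply each find-and-replace rule in order and trim the result."""
    for pattern, replacement in CLEANUP_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    sample = "The com-\npany's  \ufb01nances"
    print(clean_extracted_text(sample))
```

Rules like these are blunt instruments. The hyphen rule, for instance, would also join a genuinely hyphenated compound that happens to break across lines, which is part of why a manual pass is still needed afterwards.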
