October 3, 2016 by Datopian
We are now in the 11th week of our Official Inquiries project. This month, we introduced a simple search feature that lets visitors search the text of the inquiries currently on the website.
Meanwhile, we conducted a review of text extraction tools. The better our text extraction, the less work is needed on an inquiry afterwards, so it was important to establish whether there was a clear favourite among the free and open source options available. As it happens, there was! Apache PDFBox produced cleaner output than the other tools across all of our example inquiries.
Also, to test our production pipeline, we chose a simple inquiry to process to a web-ready state. The Senate's report into Enron's collapse made for an interesting case: although it was a full-text PDF, the document was quite old and its text unclear, so there was a question over whether our text extraction tools would struggle with it.
As you can see, there were a few difficulties! Much of this is fixable with a few simple find-and-replace actions, but a lot will need to be cleaned up manually, which will take valuable man-hours. Processing this inquiry really highlighted how important it is to give people with similar interests a chance to contribute and be part of the project. Our next big challenge will be to look at how we can promote Official Inquiries and make it easy for people to contribute, so that they can help us make these vital sources of information on political and commercial matters easily accessible to the public.
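For anyone curious what the "simple find-and-replace actions" mentioned above might look like in practice, here is a minimal sketch in Python. It is only an illustration: the function name `clean_extracted_text` and the specific rules are our assumptions about typical PDF extraction noise, not the project's actual cleanup list.

```python
import re

# Hypothetical cleanup rules for text extracted from a PDF.
# These patterns are illustrative assumptions, not the project's real list.
RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across a line break
    (re.compile(r"[ \t]+\n"), "\n"),        # strip trailing spaces at line ends
    (re.compile(r"[ \t]{2,}"), " "),        # collapse runs of spaces or tabs
    (re.compile(r"\n{3,}"), "\n\n"),        # collapse runs of blank lines
]

# Typographic ligatures that often survive extraction as single characters.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def clean_extracted_text(text):
    """Apply each find-and-replace rule in order and return the cleaned text."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


if __name__ == "__main__":
    sample = "The col-\nlapse  of the  \ufb01rm   \n\n\n\nwas sudden."
    print(clean_extracted_text(sample))
```

Rules like these only catch mechanical noise. The hyphen rule, for instance, would also merge a genuinely hyphenated compound that happens to fall at a line break, which is one reason manual review is still needed.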