Wednesday, March 24, 2010

Hacking PDF with Python

I recently did a project that involved extracting table data from some PDFs created by the UN. Here is an example. My first suggestion to my client was that he just contact the UN and see if he could get the data in some other format, such as CSV or Excel. Much to my surprise, the UN had no interest in helping a college professor do research that would no doubt benefit humanity. So we were stuck with PDFs.

I had created PDFs with Python before, but I had never extracted info from a PDF with Python. A quick Google search turned up the Python package pdfminer (when everything went to hell later, subsequent searches confirmed it is the only game in town for pure-Python solutions). pdfminer outputs the results as text, HTML or XML. Great - certainly one of those formats would do the trick. So I told my client, and we agreed that I should give it a shot.
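For anyone who wants to try this, a minimal sketch of the conversion step looks something like the following. It uses today's pdfminer.six packaging; in 2010 the equivalent was the pdf2txt.py script bundled with pdfminer, and the file name here is just a placeholder.

```python
from pdfminer.high_level import extract_text

# Plain-text dump of the whole document. For HTML or XML output, the bundled
# command-line tool does the same job, e.g.: pdf2txt.py -t xml -o report.xml report.pdf
text = extract_text("un_report.pdf")   # "un_report.pdf" is a made-up example file
print(text[:500])
```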

My first approach was to convert to HTML. I envisioned tags such as table, tr, td, etc. Then I encountered the horror of PDF. As you may recall, PDF has been around for a long time - long before the notion of semantic markup. As the name implies, the goal of "portable document format" is to make documents look EXACTLY the same on all platforms. Hence, PDF is more of a vector graphics format than an information format. For example, if you have the letter "a" somewhere in a word that is part of a sentence, PDF will specify the font, font size and coordinates of the letter, but it may not record that the letter is part of a word, which is part of a sentence. If you are a graphic designer, that is great, because it allows you to do all sorts of fine manipulations, such as kerning. But if you are trying to extract info from the document, it is a major headache. Also, PDF has no concept of a table. To PDF, a table is just a bunch of lines and letters.

So what does pdfminer give you when you request the output as HTML? A bunch of tags that attempt to define the position of the text - I cannot remember the details because I switched to XML almost immediately. XML was not much better, but the structure was slightly easier to process.
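To give a flavour of what you are working with: the XML output (and the layout objects behind it) carries a font, a size and a bounding box for every single character, and nothing that says "this is a table cell". A minimal sketch with the modern pdfminer.six layout API, again with a placeholder file name:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# Walk the layout tree: every character knows its font, size and coordinates,
# but there is no notion of words in sentences, let alone rows in tables.
for page_layout in extract_pages("un_report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:
                for obj in line:
                    if isinstance(obj, LTChar):
                        print(obj.get_text(), obj.fontname, round(obj.size, 1), obj.bbox)
```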

At this point, I contacted the client, told him that things were much more complicated than I had first thought, and asked him how he wanted to proceed. He suggested we try a commercially available program that some of his colleagues had had some luck with. This seemed like a good idea to me. Thousands of people before me must have faced this problem. There must be hundreds of millions of PDFs on this planet by now. Why reinvent the wheel?

The program we chose was PDF Converter 6 by Nuance. It was about $50 and had a nice batch mode, so it was easy to convert all 125 PDF files into Excel. Even better, it had a mode that recognizes tables and creates an Excel file in which each sheet is a table. All the other clutter is filtered out! The biggest problem was that the Excel sheets did not have the table titles.

At this point, the solution I attempted was to convert all the PDFs to Excel using PDF Converter, then convert the PDFs again to text files using pdfminer. I would use regular expressions to extract all the table titles and then apply them to the Excel sheets.
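The title-extraction step was meant to be roughly the following. The pattern is a hypothetical stand-in - it assumes titles look like "Table 12. Some description", which the real documents only sometimes honoured.

```python
import re
from pdfminer.high_level import extract_text

# Hypothetical title pattern; the actual documents were far less regular.
TITLE_RE = re.compile(r"Table\s+\d+[.:]\s+[^\n]+")

text = extract_text("un_report.pdf")   # placeholder file name
titles = [m.group(0).strip() for m in TITLE_RE.finditer(text)]
print(len(titles), "titles found")
```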

This did not work for two reasons. First, the text in the file created by pdfminer was not in any particular order. To make matters worse, words and sentences were split up. It was a mess, with no info to put things back together again.

Second, PDF Converter is not very good at recognizing tables. Making things worse, it did not consistently fail on any one particular table type, so it was impossible to know which tables it had missed. Thus, for a given PDF, I might get 8 sheets in the Excel file but find 12 table titles, and it was not easy to figure out which title went with which sheet.

...urgh...months have passed...

I have not kept up with the gory details in this post. Sorry. I wish I had, but it got too depressing. This project really ate my lunch. My client was very understanding. I probably could have bailed, but my (foolish) pride would not let me.

What I remember is that there were two sources of grief. First, there was the PDF conversion software. Second, the files I was trying to process did not have consistent formatting. Also, the files often used spaces in numbers: one million was 1 ooo ooo. This seriously confused the PDF converter.
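Cleaning those numbers up afterwards is at least straightforward. A minimal sketch, assuming the letter o's stand for zeros and the digit groups come in threes (both assumptions on my part):

```python
import re

# Thousands separated by spaces, with zeros that sometimes come through as "o":
# e.g. "1 ooo ooo" for one million.
SPACED_NUMBER = re.compile(r"\b\d{1,3}(?:\s+[0oO]{3})+\b")

def repair(match):
    token = re.sub(r"\s+", "", match.group())            # drop the separators
    return token.replace("o", "0").replace("O", "0")     # letter o's -> zeros

def normalize_numbers(text):
    return SPACED_NUMBER.sub(repair, text)

print(normalize_numbers("Population in 2005: 1 ooo ooo"))  # -> Population in 2005: 1000000
```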

In the end, I used a combination of python code and manual intervention. In retrospect, hiring a typist would have taken less time and been cheaper.

2 comments:

  1. I had a similar problem and the pdf-reader library for Ruby helped me.

    You can iterate over each line from top to bottom.

    I know this comment is a bit late, but it might help anyone else trying to solve the same problem.

  2. I found this post through a Google search. Let me add my experience from a similar project in 2014; it may be helpful for others who need this kind of help.

    I found that there is a nice library called pdftableextract. This library can read tabular data from a PDF. pdftableextract needs the pandas library as a prerequisite.

    For single-page tables pdftableextract works fine; for tables that span multiple pages it had some problems. So I used the library to get the raw data, which looked like (row_no, col_no, start_Padding, page_number, td_data).

    So I made a dictionary for the full PDF, rebuilt each table from the dictionary into a list, and used pandas to create the Excel file from that list (see the sketch below).

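A minimal sketch of the approach described in the second comment, assuming the extraction step has already produced cell tuples shaped the way the commenter says; the sample cells and the output file name are made up for illustration:

```python
import pandas as pd

# Made-up sample cells in the shape described above:
# (row_no, col_no, start_padding, page_number, cell_text)
cells = [
    (0, 0, 0, 1, "Country"), (0, 1, 0, 1, "Population"),
    (1, 0, 0, 1, "Freedonia"), (1, 1, 0, 1, "1000000"),
    (0, 0, 0, 2, "Country"), (0, 1, 0, 2, "GDP"),
    (1, 0, 0, 2, "Freedonia"), (1, 1, 0, 2, "42"),
]

# One dictionary per page mapping (row, col) -> cell text.
tables = {}
for row, col, _pad, page, text in cells:
    tables.setdefault(page, {})[(row, col)] = text

# Rebuild each page's grid as a list of rows and write one sheet per page.
with pd.ExcelWriter("tables.xlsx") as writer:
    for page, grid in sorted(tables.items()):
        n_rows = max(r for r, _ in grid) + 1
        n_cols = max(c for _, c in grid) + 1
        data = [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
        pd.DataFrame(data).to_excel(writer, sheet_name=f"page_{page}", index=False, header=False)
```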