I have thousands of PDF files that I need to extract data from. This is an example PDF, and I want to extract this information from it:

[screenshot]

I am open to Node.js, Python, or any other effective method; I have little knowledge of Python and Node.js. I attempted it in Python with this code:

import PyPDF2

try:
    pdfFileObj = open('test.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageNumber = pdfReader.numPages
    page = pdfReader.getPage(0)
    print(pageNumber)
    pagecontent = page.extractText()
    print(pagecontent)
except Exception as e:
    print(e)

but I got stuck on how to find the procurement history. What is the best way to extract the procurement history from the PDF?


Best Answer


pdfplumber is the best option. [Reference]

Installation

pip install pdfplumber

Extract all the text

import pdfplumber

path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
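Since the procurement history in your PDF is a table, pdfplumber's table extraction will probably get you further than raw text. A minimal sketch, assuming the table sits on a known page and pdfplumber's default table detection picks it up (the file name and page index below are placeholders):

import pdfplumber

path = 'path_to_pdf.pdf'
with pdfplumber.open(path) as pdf:
    # Placeholder page index; use whichever page holds the table
    page = pdf.pages[3]
    # extract_tables() returns a list of tables, each table a list
    # of rows, and each row a list of cell strings (or None)
    for table in page.extract_tables():
        for row in table:
            print(row)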

I did something similar to scrape my grades a long time ago. The easiest (not pretty) solution I found was to convert the PDF to HTML, then parse the HTML.

To do so I used pdf2txt/pdf2html from the pdf-tools package (https://pypi.org/project/pdf-tools/) and lxml's html module. I also used codecs, but I don't remember exactly why.

A quick and dirty summary:

from lxml import html
import codecs
import os

# First convert the pdf to text/html
# You can skip this step if you already did it
os.system("pdf2txt -o file.html file.pdf")

# Open the file and read it
file = codecs.open("file.html", "r", "utf-8")
data = file.read()

# We know we're dealing with html, let's load it
html_file = html.fromstring(data)

# As it's an html object, we can use xpath to get the data we need
# In the following I get the text from <div><span>MY TEXT</span></div>
extracted_data = html_file.xpath('//div//span/text()')

# It returns an array of elements, let's process it
for elm in extracted_data:
    # Do things
    pass

file.close()

Just check the output of pdf2txt or pdf2html; then, using XPath, you should be able to extract your information easily.

I hope it helps!

EDIT: commented the code

EDIT2: The following code prints your data:

# Assuming you're only giving the page 4 of your document
# os.system("pdf2html test-page4.pdf > test-page4.html")
from lxml import html
import codecs
import os

file = codecs.open("test-page4.html", "r", "utf-8")
data = file.read()
html_file = html.fromstring(data)

# I updated xpath to your need
extracted_data = html_file.xpath('//div//p//span/text()')

for elm in extracted_data:
    line_elements = elm.split()
    # Just observed that what you need starts with a number
    if len(line_elements) > 0 and line_elements[0].isdigit():
        print(line_elements)

file.close()

OK, I help with the development of this commercial product from opait.com. I took your input PDF and zoned a few areas in it like this:

[screenshot]

And also the table you have:

[screenshot]

And in about 2 minutes I can extract this from this one document and from 1000 documents like it. Note that this image is the log view, which exports that data as CSV. All the blue "links" are the actual extracted data, and they link back into the PDF so you can see where each value came from. The output could also be XML, JSON, or other formats. Everything in that screen capture is available in CSV: one file for the main properties and one for each table, linked by a record ID (in case a single PDF contained 1000 of these documents).
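If you go the CSV route, stitching the per-table files back onto the main properties is short with pandas. A hypothetical sketch; the file names and the record_id column are made up for illustration:

import pandas as pd

# Hypothetical file and column names, for illustration only
main = pd.read_csv('main_properties.csv')
history = pd.read_csv('procurement_history.csv')

# One output row per history line, carrying the document-level
# fields alongside via the shared record ID
merged = history.merge(main, on='record_id', how='left')
print(merged.head())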

Again, I help with the development of this product, but what you're asking for can be done. I extracted your entire table, plus all the other fields that would be important.

[screenshot]

PDFTron, the company I work for, has a fully automated PDF-to-HTML output solution.

You can try it out online here: https://www.pdftron.com/pdf-tools/pdf-table-extraction

Here is a screenshot of the HTML output for the file you provided. The output contains both HTML tables and re-flowable text content in between.

[screenshot]

The output is standard XML/HTML, so you can easily parse and manipulate the HTML tables.
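As one way to consume that output: pandas can lift every table in the HTML straight into DataFrames. A sketch, assuming you saved the converted page as output.html (the file name is a placeholder):

import pandas as pd

# read_html returns one DataFrame per <table> element in the file
tables = pd.read_html('output.html')
for i, table in enumerate(tables):
    print(f'--- table {i} ---')
    print(table)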

I work for the company that makes PDFTables. The PDFTables API would help you solve this problem and convert all the PDFs at once. It's a simple web-based API, so it can be called from any programming language. You'll need to create an account at PDFTables.com, then use a script from one of the example languages here: https://pdftables.com/pdf-to-excel-api. Here's an example using Python:

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        c.xlsx(os.path.join(file_path, file), file + '.xlsx')

The script looks for all files within a folder that have the extension '.pdf', then converts each one to XLSX format. You can change the format to '.csv', '.html' or '.xml'. The first 75 pages are free.
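For example, switching the script above to CSV output should only change the conversion call; this sketch assumes the client exposes a csv method alongside xlsx, matching the format list above:

import pdftables_api
import os

c = pdftables_api.Client('MY-API-KEY')
file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"

for file in os.listdir(file_path):
    if file.endswith(".pdf"):
        # Assumed csv counterpart to the xlsx call shown earlier
        c.csv(os.path.join(file_path, file), file + '.csv')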

That's four lines of script in IntelliGet:

{ start = IsSubstring("CAGE Contract Number", Line(-2));
  end = IsEqual(0, Length(Line(1)));
  { start = 1;
    output = Line(0);
  }
}
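For anyone not using IntelliGet, the same block logic (start capturing two lines after the "CAGE Contract Number" header, stop at the first blank line) is easy to mirror in plain Python. A sketch, assuming the PDF text has already been dumped to test.txt (a placeholder name):

# Mirror of the IntelliGet block: begin capturing when the line
# two back contains "CAGE Contract Number", stop at a blank line
capturing = False
with open('test.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

for i, line in enumerate(lines):
    if not capturing and i >= 2 and 'CAGE Contract Number' in lines[i - 2]:
        capturing = True
    if capturing:
        if not line.strip():
            break
        print(line)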