Python: Extract Data From PDF

Intro

True story, the town I live in recently had some issues around the town budget. As I started to get involved I wanted to look at the data provided by our Board of Finance. To my misfortune, I found myself trying to analyze data in a PDF. Trying to do any sort of analysis on it was time consuming and frustrating. To make the data useful I needed to extract data from PDF and into something more flexible, in this case a simple CSV will do. Here is a sample of what we are working with:

Tooling

For this project we are going to use Pandas and Tabula. I also want to note I am using Python 3.12.2 for this project. To get Tabula up and running you will need to have Java 8+ installed as well. From there its fairly straight forward.

pip install pandas
pip install tabula-py

Setup

First we are going to create a function that uses Tabula to extract data from our source PDF. I was pleasantly surprised by how simple this is to implement. Behold the ease of use:

def extract_tables_from_pages(pdf_path,page):
	# Extract table from specified page
	tables = tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)

  # Check if result is empty
	if not tables:
	  return None

	# Concatenate all tables from the specified pages into one DataFrame
	df = pd.concat(tables, ignore_index=True)

	return df

Now that we have our function to extract data lets handle the rest of the groundwork:

Create an empty file with our headers
Set the page range for the PDF that our data is in
Initialize and empty data frame

# Create file with headers
headers = ['Description', 'FY24Budget', 'FY25Budget', 'Change', 'ChangePercent']
df = pd.DataFrame(columns=headers)
csv_file_path = 'budget_data.csv'
df.to_csv(csv_file_path, index=False)

# Set page range
start_page = 93
end_page = 98

# Initialize an empty DataFrame
df = pd.DataFrame()

With that out of the way lets put it to us:

Call the function we created earlier, load the data into a data frame and add the proper column headers to it
The PDF sample has a bunch of summary rows on the last page of data so we check if its the last page and remove those rows if it is
Since this PDF has a bunch of white space in it Tabula returns several columns that are meaningless, we need to drop all of those
The last modification we need to make is removing whitespace from columns with data we need
In the end we write the data frame to a CSV and we are all set

pdf_path = 'C:\\Board_of_Finance_Budget.pdf'

for page in range(start_page, end_page + 1):
  # Extract tables from the specified range of pages
  df = extract_tables_from_pages(pdf_path, page)
  new_row = pd.DataFrame([df.columns], columns=df.columns)
  df = pd.concat([new_row, df]).reset_index(drop=True)
  
  # Remove last 6 rows if last page
  if page == end_page:
      df = df.iloc[:-6]
  
  # Drop 2nd column
  df.drop(df.columns[1], axis=1, inplace=True)
  # Drop 3rd column
  df.drop(df.columns[2], axis=1, inplace=True)
  # Drop 4th column
  df.drop(df.columns[3], axis=1, inplace=True)
  # Remove space characters in the 2nd column
  df.iloc[:, 1] = df.iloc[:, 1].astype(str).str.replace(' ', '')
  # Remove space characters in the 3rd column
  df.iloc[:, 2] = df.iloc[:, 2].astype(str).str.replace(' ', '')
  
  if df is not None:
      # Print the DataFrame
      print(df)
      
      # Save the DataFrame to CSV
      df.to_csv(csv_file_path, mode='a', index=False, header=False)
      print(f"DataFrame saved to {csv_file_path}")
  else:
      print(f'No table found on page {page}')

Wrapping Up

With a small amount of effort we were able to extract several pages of detail data from a PDF and write it out to a CSV. I am very excited about the flexibility Tabula has brought to the table. For your needs you may be able to get away with less transformation than I had to do. I found in my research if a table in a PDF has lines outlining its columns the extraction is much easier. Let me know in the comments how you plan to use Tabula to extract data.

Intro#

Tooling#

Setup#

Wrapping Up#

Intro

Tooling

Setup

Wrapping Up