Question 1

What kinds of PDFs does this extract best?

Accepted Answer

PDFs where the data was originally tabular — invoices, reports generated by accounting or BI software, timesheets, schedules, product lists, bank statements, exam scores, research data tables. Anything exported from a spreadsheet, database, or reporting tool. The more the PDF was 'printed from software' and the less it was 'typed by hand or scanned,' the better the extraction.

Question 2

What does 'text-native PDF' mean and why does it matter?

Accepted Answer

A text-native PDF stores each text character with its position on the page — the text is selectable, searchable, and readable by software. A scanned (image-only) PDF stores each page as a picture, with no underlying text. The heuristic in this tool reads character coordinates from the PDF, clusters them into rows by y-position and columns by x-position, and writes cells to an Excel sheet. If there are no coordinates — because the PDF is an image — there is nothing to cluster.
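To make the row-clustering step concrete, here is a minimal Python sketch. It is not the tool's actual code: the `TextItem` type, the `group_into_rows` name, and the 3-point `y_tolerance` are assumptions made for this example, and it presumes the text coordinates have already been read from the PDF by some other library.

```python
from typing import NamedTuple


class TextItem(NamedTuple):
    """Hypothetical record for one piece of text read from a PDF page."""
    text: str
    x: float  # left edge, in PDF points
    y: float  # baseline, in PDF points (PDF y grows upward)


def group_into_rows(items, y_tolerance=3.0):
    """Cluster text items into rows by y-position.

    An item joins the current row when its y is within y_tolerance of the
    row's first item. The tolerance value is an assumption, not a figure
    taken from the tool.
    """
    rows = []
    # Largest y first, so rows come out top-to-bottom.
    for item in sorted(items, key=lambda it: -it.y):
        if rows and abs(rows[-1][0].y - item.y) <= y_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Within each row, order items left-to-right.
    return [sorted(row, key=lambda it: it.x) for row in rows]


items = [
    TextItem("Total", 50, 700),
    TextItem("12.50", 300, 700.5),
    TextItem("Date", 50, 720),
    TextItem("Amount", 300, 720),
]
rows = group_into_rows(items)
row_texts = [[it.text for it in row] for row in rows]
```

With an image-only PDF the list of items is empty, so `group_into_rows([])` returns no rows at all, which is the "nothing to cluster" case.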

Question 3

Does the tool do OCR for scanned PDFs?

Accepted Answer

No. This tool does not run optical character recognition. Scanned PDFs need to be processed through OCR first to generate a text layer, then fed here. Adobe Acrobat Pro and several dedicated OCR services handle that step. Free options include Tesseract, which can output a searchable PDF, and Google Drive's Open with Google Docs option, which converts the scan into editable text that you can export as a PDF again.

Question 4

How does the column detection decide where to split?

Accepted Answer

The algorithm reads the x-coordinate of every text item on the page, sorts them, and groups x-values that are within about 8 PDF points of each other. Each group becomes a column. Within a row, text items separated by a horizontal gap of less than 4 points are joined into one cell (normal letter-spacing); wider gaps start a new cell. It is not perfect: overlapping columns, very tight layouts, or uneven line spacing can confuse it. For PDFs with a clear visual column structure, though, it usually works on the first pass.
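The two thresholds above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, "within about 8 points of each other" is read as "within 8 points of the previous x-value in the sorted list", and joined fragments are concatenated without a space.

```python
def cluster_columns(x_values, tolerance=8.0):
    """Group sorted x-coordinates into columns.

    A value joins the current group when it is within `tolerance` points
    of the previous value (assumed reading of the 8-point rule).
    Returns the smallest x of each group as the column start.
    """
    groups = []
    for x in sorted(x_values):
        if groups and x - groups[-1][-1] <= tolerance:
            groups[-1].append(x)
        else:
            groups.append([x])
    return [group[0] for group in groups]


def join_row_cells(row_items, max_gap=4.0):
    """Join text fragments in one row into cells.

    row_items holds (text, x_start, x_end) tuples. Fragments separated by
    a horizontal gap under `max_gap` points are merged into one cell;
    a wider gap starts a new cell.
    """
    cells = []
    for text, x0, x1 in sorted(row_items, key=lambda it: it[1]):
        if cells and x0 - cells[-1][2] < max_gap:
            prev_text, prev_x0, _ = cells[-1]
            cells[-1] = (prev_text + text, prev_x0, x1)
        else:
            cells.append((text, x0, x1))
    return cells


def column_index(x, column_starts):
    """Pick the column whose start is nearest to x (assumed assignment rule)."""
    return min(range(len(column_starts)), key=lambda i: abs(column_starts[i] - x))


column_starts = cluster_columns([50, 52, 300, 303, 55, 120])
cells = join_row_cells([("Inv", 50, 65), ("oice", 66, 85), ("12.50", 300, 325)])
```

Here `column_starts` comes out as three columns starting at 50, 120, and 300, and the fragments "Inv" and "oice" are joined into one cell because the gap between them is only 1 point. The chained comparison also shows one way tight layouts go wrong: a run of x-values each 7 points apart would collapse into a single column.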

Question 5

What do I do if the detection gets something wrong?

Accepted Answer

Download the .xlsx and fix it locally in Excel or Google Sheets. Typical cleanups: (1) one transaction row is split across two sheet rows, so merge them; (2) the header row lands among the data rows, so cut and paste it to the top; (3) a column with mixed-alignment numbers gets split, so copy the numeric content into one column. For most tabular PDFs, the cleanup takes a minute or two. If the layout is so complex that cleanup is slower than retyping, consider a dedicated paid tool like Adobe Acrobat Pro's Export to Excel, which does layout-aware detection.
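If the same split-row cleanup comes up repeatedly, it can be scripted once the sheet is loaded as a list of rows. The sketch below is one possible heuristic, not a feature of this tool: it assumes that a row with an empty first cell is a continuation of the row above, and the `merge_continuation_rows` name and the sample data are made up for illustration.

```python
def merge_continuation_rows(rows):
    """Merge rows whose first cell is empty into the previous row.

    Assumption: an empty first cell marks wrapped text that belongs to
    the row above. Cell contents are joined with a single space.
    """
    merged = []
    for row in rows:
        if merged and row and not row[0].strip():
            prev = merged[-1]
            width = max(len(prev), len(row))
            # Pad both rows to the same width before combining cell by cell.
            prev = prev + [""] * (width - len(prev))
            row = list(row) + [""] * (width - len(row))
            merged[-1] = [
                " ".join(part for part in (a, b) if part.strip())
                for a, b in zip(prev, row)
            ]
        else:
            merged.append(list(row))
    return merged


sheet = [
    ["2024-01-05", "ACME Supplies", "120.00"],
    ["", "Ltd, ref 4471", ""],
    ["2024-01-06", "Coffee", "3.50"],
]
cleaned = merge_continuation_rows(sheet)
```

The heuristic fails when a genuine data row has an empty first column, so it is worth checking the result against the PDF before relying on it.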

Scanned paper PDFs	Image-only pages have no text layer. Run OCR first (Acrobat Pro, Tesseract, Google Drive's Open with Docs) to get a text layer, then come back.
Cells with heavy wrapping	If a cell's text wraps onto 3-4 lines, each line may end up as its own row. Extraction works best when each cell fits on one line; otherwise, expect to merge rows manually afterward.
Nested / merged-cell headers	Visually merged headers above a group of columns come through as a row of their own, not as merged cells. Easy fix in Excel.
Handwritten forms	No coordinates, no extraction. Need OCR + possibly form recognition. Out of scope.
Charts and graphs	We read text. A chart's bars are not text, so axis labels and legends may come through but the data series won't.
Multi-column magazine layouts	Articles with editorial columns confuse the column detector. For prose PDFs, use a PDF-to-Word converter instead.

How to Extract Data from PDF to Excel

How the extraction actually works

When this tool is a great fit

Invoices and billing exports

Bank & brokerage statements

Reports from BI / analytics tools

Timesheets, schedules, rosters

Product catalogs, price lists, SKUs

Scientific tables and research data

When it's not the right tool

Alternatives when the free browser tool isn't enough

Adobe Acrobat Pro — Export to Excel

Tabula (open source)

Excel 365 — Data from PDF

Python with Camelot / pdfplumber

Vastiko (this tool)

Google Sheets IMPORTHTML
