How to Extract Data from PDF to Excel

Turn any text-native PDF with tabular data (invoices, reports, timesheets, schedules, product lists) into an .xlsx workbook. The tool reads the coordinates of every text item in the PDF, groups the items into rows and columns, and writes one Excel sheet per page. It is best suited to PDFs exported from spreadsheet, accounting, or reporting software. Scanned PDFs need OCR first. Everything runs locally in your browser.

Free, no account, no upload — runs in your browser.

How the extraction actually works

No AI, no server, no guessing about what fields "mean." Just geometry — where each character sits on the page and what neighbors to group it with.

  1. Read text with coordinates. The tool uses PDF.js (pdfjs) to walk the PDF and collect every text item with its x, y, width, and height on the page.
  2. Cluster items into rows. Items whose y-coordinates are within about 60% of the median text height are assigned to the same row. This handles normal line-spacing variation without merging real rows.
  3. Cluster items into columns. The starting x-coordinates of all items are sorted, and items whose starts fall within about 8 points of each other are assigned to the same column. The number of columns therefore follows from the page's actual layout rather than a preset count.
  4. Merge adjacent items into cells. Within a row, items separated by less than 4 points are treated as the same cell (normal word spacing); wider gaps start a new cell.
  5. Emit .xlsx. Each PDF page becomes one Excel sheet. Trailing empty columns are trimmed. The file is built with the SheetJS library and downloaded — no upload step.
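
Steps 2 and 4 above can be sketched in plain JavaScript. This is a minimal illustration, not the tool's actual source. The item shape { str, x, y, width, height }, the function names clusterRows and mergeCells, and the choice to anchor each row at its first item's y are assumptions. The 0.6 factor and the 4-point gap come from the steps above, and the sample items are made up.

```javascript
// Hypothetical item shape: { str, x, y, width, height } in PDF points.
// Assumes PDF user space, where y grows upward (top of page = larger y).

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Step 2: group items whose y-coordinates differ from the row's first item
// by at most 60% of the median text height.
function clusterRows(items, factor = 0.6) {
  if (items.length === 0) return [];
  const tolerance = factor * median(items.map((it) => it.height));
  const sorted = [...items].sort((a, b) => b.y - a.y); // top to bottom
  const rows = [];
  let rowY = null;
  for (const item of sorted) {
    if (rowY === null || Math.abs(item.y - rowY) > tolerance) {
      rows.push([]); // start a new row
      rowY = item.y; // anchor the row at its first item's y
    }
    rows[rows.length - 1].push(item);
  }
  return rows;
}

// Step 4: within one row, join items separated by less than maxGap points.
function mergeCells(row, maxGap = 4) {
  const sorted = [...row].sort((a, b) => a.x - b.x);
  const cells = [];
  for (const item of sorted) {
    const last = cells[cells.length - 1];
    if (last && item.x - last.right < maxGap) {
      last.text += " " + item.str; // normal word spacing: same cell
      last.right = item.x + item.width;
    } else {
      cells.push({ text: item.str, x: item.x, right: item.x + item.width });
    }
  }
  return cells;
}

// Made-up sample items for two table rows.
const items = [
  { str: "Widget", x: 50, y: 700, width: 36, height: 10 },
  { str: "A", x: 88, y: 700, width: 6, height: 10 },
  { str: "12.50", x: 200, y: 701, width: 28, height: 10 },
  { str: "Gadget", x: 50, y: 685, width: 38, height: 10 },
  { str: "7.00", x: 200, y: 685, width: 22, height: 10 },
];
const grid = clusterRows(items).map((row) => mergeCells(row).map((c) => c.text));
// grid is [["Widget A", "12.50"], ["Gadget", "7.00"]]
```

"Widget" and "A" sit 2 points apart, so they merge into one cell, while the 1-point y offset on "12.50" stays inside the 6-point row tolerance.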

The whole process runs in your browser in JavaScript. A 10-page PDF with a few hundred rows extracts in under 2 seconds on a typical laptop. Larger PDFs stay client-side too — they just take proportionally longer.

When this tool is a great fit

Invoices and billing exports

Line items, quantities, unit prices — invoices from QuickBooks, Xero, FreshBooks, SAP, and most ERPs are exported as text-native PDFs with clear column structure.

Bank & brokerage statements

Transaction logs, holdings, dividend reports — downloaded statements from major banks and brokerages are almost always text-native.

Reports from BI / analytics tools

Tableau, Power BI, Looker, Metabase — when they export to PDF, the data table underneath is retained as text. Our extraction treats it like any other table.

Timesheets, schedules, rosters

Shift plans, employee rosters, call schedules — anything printed from an HR or scheduling system. Rows = shifts / entries, columns = person / time / location.

Product catalogs, price lists, SKUs

Supplier price lists and wholesale catalogs exported from ERP or catalog software. Columns like SKU / Description / Price / Stock extract cleanly.

Scientific tables and research data

Tables embedded in journal PDFs, lab reports, clinical summaries. Works when the PDF was composed from a text source (LaTeX, Word) rather than a scan.

When it's not the right tool

Being honest about limitations saves you the round trip.

Scanned paper PDFs: Image-only pages have no text layer. Run OCR first (Acrobat Pro, Tesseract, or Google Drive's "Open with Google Docs") to add one, then come back.
Cells with heavy wrapping: If a cell's text wraps onto 3-4 lines, each line may end up as its own row. The tool works better when each cell fits on one line; otherwise expect to merge rows manually afterward.
Nested / merged-cell headers: Visually merged headers above a group of columns come through as a row of their own, not as merged cells. This is an easy fix in Excel.
Handwritten forms: Handwriting has no text layer, so there are no coordinates and nothing to extract. These need OCR and possibly form recognition, which is out of scope.
Charts and graphs: The tool reads text. A chart's bars are not text, so axis labels and legends may come through but the data series won't.
Multi-column magazine layouts: Articles set in editorial columns confuse the column detector. For prose PDFs, use a PDF-to-Word converter instead.

Alternatives when the free browser tool isn't enough

Adobe Acrobat Pro — Export to Excel

$19.99/mo. Layout-aware table detection, OCR built in, handles complex headers. Industry standard. Uploads to Adobe Document Cloud by default.

Tabula (open source)

Free desktop tool (Java). Runs locally, gives you a visual selector to draw boxes around tables — more control than heuristic detection. Good for mixed-layout PDFs.

Excel 365 — Data from PDF

Built into Excel with a Microsoft 365 subscription ($6.99/mo+). Works on text-native PDFs, keeps everything local. Good option if you already pay for 365.

Python with Camelot / pdfplumber

Free but requires Python and code. Precise control over extraction rules. Worth it if you extract hundreds of similar PDFs on a schedule.

Vastiko (this tool)

Free, zero install, browser-only, no upload. Best for one-off extractions and PDFs with clean column structure. Heuristic — not as tolerant of messy layouts as paid layout-aware tools.

Google Sheets IMPORTHTML

Free, but IMPORTHTML only reads tables and lists from a web page URL, so the PDF would first need to become HTML. Impractical for most PDF workflows; it works when the source is a web page rather than a real PDF.

Frequently Asked Questions

What kinds of PDFs does this extract best?

PDFs where the data was originally tabular — invoices, reports from accounting or BI software, timesheets, schedules, product lists, bank statements, exam scores, research data tables. Anything exported from a spreadsheet, database, or reporting tool. The more the PDF was "printed from software" and the less it was "typed by hand or scanned," the better the extraction.

What does "text-native PDF" mean and why does it matter?

A text-native PDF stores each text character with its position on the page — the text is selectable and searchable. A scanned (image-only) PDF stores each page as a picture, with no underlying text. The heuristic reads character coordinates, clusters them into rows by y-position and columns by x-position, and writes cells to Excel. If there are no coordinates — because the PDF is an image — there is nothing to cluster.
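
One way to tell the two kinds of PDF apart is to count the text items found on each page. The sketch below assumes the items have already been collected into one array per page, for example from the items array that PDF.js returns from page.getTextContent(). The function name classifyPages and the rule "no non-blank text item means image-only" are illustrative assumptions, not the tool's documented behavior.

```javascript
// pagesItems: one array of text items per page; each item has a `str` field.
// Returns, per page, how many non-blank text items were found.
function classifyPages(pagesItems) {
  return pagesItems.map((pageItems, i) => {
    const textItems = pageItems.filter(
      (it) => typeof it.str === "string" && it.str.trim() !== ""
    );
    return {
      page: i + 1,
      textItems: textItems.length,
      hasTextLayer: textItems.length > 0,
    };
  });
}

// Made-up input: page 1 has text, page 2 is empty (as a scanned page would be).
const report = classifyPages([
  [{ str: "Invoice" }, { str: "  " }, { str: "Total" }],
  [],
]);
const needsOcr = report.filter((p) => !p.hasTextLayer).map((p) => p.page);
// needsOcr is [2]
```

Pages listed in needsOcr have nothing to cluster and would need OCR before extraction.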

Does the tool do OCR for scanned PDFs?

No. This tool does not run OCR. Scanned PDFs need to be processed through OCR first to generate a text layer, then fed here. Adobe Acrobat Pro, Tesseract, and "Open with Google Docs" all add a text layer to scanned PDFs.

How does the column detection decide where to split?

The starting x-coordinates of all text items are sorted, and x-values within about 8 points of each other become one column. Within a row, items separated by less than 4 points join into one cell (normal word spacing); wider gaps start a new cell. It's not perfect, since overlapping or very tight layouts can confuse it, but for PDFs with a clear visual column structure it usually works on the first pass.
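
The column rule can be sketched as follows. This is one possible reading of "within about 8 points of each other": each sorted x-value is compared with the value just before it, and a larger jump opens a new column. The function names detectColumns and columnIndex, and the nearest-edge assignment, are assumptions for illustration; only the 8-point tolerance comes from the answer above.

```javascript
// Groups sorted start-x values: a value stays in the current column when it is
// within `tolerance` points of the previous value, otherwise it opens a new one.
// Returns the left edge (smallest x) of each detected column.
function detectColumns(xStarts, tolerance = 8) {
  const sorted = [...xStarts].sort((a, b) => a - b);
  const columns = [];
  let previous = null;
  for (const x of sorted) {
    if (previous === null || x - previous > tolerance) {
      columns.push(x); // left edge of a new column
    }
    previous = x;
  }
  return columns;
}

// Maps a cell's start x to the index of the nearest column edge.
function columnIndex(x, columns) {
  let best = 0;
  for (let i = 1; i < columns.length; i++) {
    if (Math.abs(x - columns[i]) < Math.abs(x - columns[best])) best = i;
  }
  return best;
}

// Made-up start positions from several rows of a three-column table.
const columns = detectColumns([50, 52, 200, 203, 198, 340, 51]);
// columns is [50, 198, 340]
const idx = columnIndex(203, columns);
// idx is 1
```

Comparing each value with its immediate predecessor keeps the code short, but a long run of closely spaced starts can chain into one wide column, which matches the note above that very tight layouts can confuse detection.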

What do I do if detection gets something wrong?

Download the .xlsx and fix it locally in Excel or Sheets. Typical cleanups: a transaction row split across two sheet rows (merge them), a header row placed among the data rows (cut and paste it to the top), numbers split into two columns (combine them). For most tabular PDFs the cleanup takes a minute or two. If cleanup is slower than retyping, a dedicated paid tool like Acrobat Pro does layout-aware detection and saves time on complex layouts.
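
When split rows follow a pattern, the first cleanup can be scripted. The sketch below assumes the sheet has been read into an array of rows (each an array of strings) and that a continuation row is one whose key column, such as the date, is empty. The function name mergeContinuationRows and that rule are assumptions for illustration, and the sample rows are made up.

```javascript
// Appends the cells of each continuation row to the row above it.
// A continuation row is one whose key column is empty (an assumed rule).
function mergeContinuationRows(rows, keyColumn = 0) {
  const merged = [];
  for (const row of rows) {
    const key = (row[keyColumn] ?? "").trim();
    const previous = merged[merged.length - 1];
    if (key === "" && previous) {
      row.forEach((cell, i) => {
        const text = (cell ?? "").trim();
        if (text !== "") {
          previous[i] = previous[i] ? previous[i] + " " + text : text;
        }
      });
    } else {
      merged.push([...row]); // copy so the input rows are not mutated
    }
  }
  return merged;
}

const cleaned = mergeContinuationRows([
  ["2024-03-01", "Card payment", "-42.10"],
  ["", "ACME Hardware", ""],
  ["2024-03-02", "Salary", "2500.00"],
]);
// cleaned is [
//   ["2024-03-01", "Card payment ACME Hardware", "-42.10"],
//   ["2024-03-02", "Salary", "2500.00"],
// ]
```

The rule only fits tables where every real row has a value in the key column, so check a few rows by eye before trusting it on a whole statement.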
