Turn any text-native PDF with tabular data — invoices, reports, timesheets, schedules, product lists — into an .xlsx workbook. The tool reads coordinates for every character in the PDF, groups them into rows and columns, and writes one Excel sheet per page. Best suited for PDFs that were exported from spreadsheet, accounting, or reporting software. Scanned PDFs need OCR first. Everything runs locally in your browser.
No AI, no server, no guessing about what fields "mean." Just geometry: where each character sits on the page and which neighbors it belongs with.
The whole process runs in your browser in JavaScript. A 10-page PDF with a few hundred rows extracts in under 2 seconds on a typical laptop. Larger PDFs stay client-side too — they just take proportionally longer.
Line items, quantities, unit prices — invoices from QuickBooks, Xero, FreshBooks, SAP, and most ERPs are exported as text-native PDFs with clear column structure.
Transaction logs, holdings, dividend reports — downloaded statements from major banks and brokerages are almost always text-native.
Tableau, Power BI, Looker, Metabase — when they export to PDF, the data table underneath is retained as text. Our extraction treats it like any other table.
Shift plans, employee rosters, call schedules — anything printed from an HR or scheduling system. Rows = shifts / entries, columns = person / time / location.
Supplier price lists and wholesale catalogs exported from ERP or catalog software. Columns like SKU / Description / Price / Stock extract cleanly.
Tables embedded in journal PDFs, lab reports, clinical summaries. Works when the PDF was composed from a text source (LaTeX, Word) rather than a scan.
Being honest about limitations saves you the round trip.
| PDF type | What to expect |
|---|---|
| Scanned paper PDFs | Image-only pages have no text layer. Run OCR first (Acrobat Pro, Tesseract, Google Drive's Open with Docs) to get a text layer, then come back. |
| Cells with heavy wrapping | If a cell's text wraps onto 3-4 lines, each line may end up as its own row. Extraction works best when each cell fits on one line; otherwise, expect to merge rows manually afterward. |
| Nested / merged-cell headers | Visually merged headers above a group of columns come through as a row of their own, not as merged cells. Easy fix in Excel. |
| Handwritten forms | No coordinates, no extraction. Need OCR + possibly form recognition. Out of scope. |
| Charts and graphs | We read text. A chart's bars are not text, so axis labels and legends may come through but the data series won't. |
| Multi-column magazine layouts | Articles with editorial columns confuse the column detector. For prose PDFs, use a PDF-to-Word converter instead. |
$19.99/mo. Layout-aware table detection, OCR built in, handles complex headers. Industry standard. Uploads to Adobe Document Cloud by default.
Free desktop tool (Java). Runs locally, gives you a visual selector to draw boxes around tables — more control than heuristic detection. Good for mixed-layout PDFs.
Built into Excel with a Microsoft 365 subscription ($6.99/mo+). Works on text-native PDFs, keeps everything local. Good option if you already pay for 365.
Free but requires Python and code. Precise control over extraction rules. Worth it if you extract hundreds of similar PDFs on a schedule.
Free, zero install, browser-only, no upload. Best for one-off extractions and PDFs with clean column structure. Heuristic — not as tolerant of messy layouts as paid layout-aware tools.
Free, but the PDF first needs to become HTML. Impractical for most PDF workflows. Works if the source is a web page rather than a real PDF.
Drop the PDF. Each page becomes an Excel sheet.
Extract data to XLSX
Your PDF stays on your device.
PDFs where the data was originally tabular — invoices, reports from accounting or BI software, timesheets, schedules, product lists, bank statements, exam scores, research data tables. Anything exported from a spreadsheet, database, or reporting tool. The more the PDF was "printed from software" and the less it was "typed by hand or scanned," the better the extraction.
A text-native PDF stores each text character with its position on the page, so the text is selectable and searchable. A scanned (image-only) PDF stores each page as a picture, with no underlying text. This tool reads character coordinates, clusters them into rows by y-position and columns by x-position, and writes the resulting cells to Excel. If there are no coordinates, because the PDF is an image, there is nothing to cluster.
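The row half of that clustering can be sketched in a few lines. This is a minimal Python illustration, not the tool's actual JavaScript: the `TextItem` shape, the `group_rows` name, and the 2-point y tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float  # left edge, in PDF points
    y: float  # baseline, in PDF points


def group_rows(items, y_tol=2.0):
    """Group items whose baselines lie within y_tol points of a row's first item.

    y_tol is an assumed tolerance, not a value taken from the tool.
    """
    rows = []
    # PDF y grows upward, so sort descending to read top to bottom.
    for item in sorted(items, key=lambda it: -it.y):
        if rows and abs(rows[-1][0].y - item.y) <= y_tol:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Order each row left to right.
    return [sorted(row, key=lambda it: it.x) for row in rows]


items = [
    TextItem("Widget", 50, 700.0), TextItem("4", 300, 700.4),
    TextItem("Gadget", 50, 686.0), TextItem("2", 300, 685.8),
]
for row in group_rows(items):
    print([it.text for it in row])
```

An image-only page yields an empty item list, so `group_rows([])` returns no rows at all, which is the "nothing to cluster" case.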
No. This tool does not run OCR. Scanned PDFs need to be processed through OCR first to generate a text layer, then fed here. Adobe Acrobat Pro, Tesseract, and "Open with Google Docs" all add a text layer to scanned PDFs.
The starting x-coordinates of all text items are sorted, and x-values within about 8 points of each other are treated as one column. Within a row, items separated by less than 4 points are joined into one cell (normal word spacing); wider gaps start a new cell. It's not perfect (overlapping or very tight layouts can confuse it), but for PDFs with a clear visual column structure it usually works on the first pass.
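As a rough illustration of those two thresholds, here is a hedged Python sketch (the tool itself runs JavaScript, and its real code may differ). The 8-point and 4-point values come from the description above. Everything else is an assumption for the example: the `TextItem` shape, the function names, joining cells before clustering their start positions, and comparing each x-value to the first x-value in its cluster.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float      # left edge, in points
    width: float  # rendered width, in points


def join_cells(row, max_gap=4.0):
    """Join items in one row into (start_x, text) cells; gaps under max_gap are word spacing."""
    cells = []
    prev_end = None
    for item in sorted(row, key=lambda it: it.x):
        if prev_end is not None and item.x - prev_end < max_gap:
            start, text = cells[-1]
            cells[-1] = (start, text + " " + item.text)
        else:
            cells.append((item.x, item.text))
        prev_end = item.x + item.width
    return cells


def column_anchors(xs, tol=8.0):
    """Collapse sorted x-values within tol points of a cluster's first value into one anchor."""
    anchors = []
    for x in sorted(xs):
        if not anchors or x - anchors[-1] > tol:
            anchors.append(x)
    return anchors


def to_grid_row(cells, anchors):
    """Place each cell in the column whose anchor is nearest its start x."""
    out = [""] * len(anchors)
    for start, text in cells:
        col = min(range(len(anchors)), key=lambda i: abs(anchors[i] - start))
        out[col] = (out[col] + " " + text).strip()
    return out


rows = [
    [TextItem("Blue", 50, 22), TextItem("widget", 75, 34),
     TextItem("4", 300, 6), TextItem("9.50", 402, 20)],
    [TextItem("Gadget", 52, 40), TextItem("12", 296, 12),
     TextItem("3.25", 400, 20)],
]
cell_rows = [join_cells(row) for row in rows]
anchors = column_anchors([start for cells in cell_rows for start, _ in cells])
grid = [to_grid_row(cells, anchors) for cells in cell_rows]
for line in grid:
    print(line)
```

In the sample data, "Blue" ends at x = 72 and "widget" starts at x = 75, so the 3-point gap keeps them in one cell, while the small left-edge differences between the two rows (50 vs 52, 300 vs 296) fall inside the 8-point tolerance and share a column.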
Download the .xlsx and fix it locally in Excel or Sheets. Typical cleanups: a transaction row split across two sheet rows (merge), a header row placed among the data rows (cut and paste to top), numbers split into two columns (combine). For most tabular PDFs the cleanup takes a minute or two. If cleanup is slower than retyping, a dedicated paid tool like Acrobat Pro does layout-aware detection and saves time on complex layouts.
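If the same split-row cleanup comes up repeatedly, it can also be scripted. The sketch below is a hypothetical Python helper, not part of this tool. It assumes the sheet has already been read into a rectangular list of string rows, and that a continuation row can be recognized by an empty key column (here the first one).

```python
def merge_continuations(rows, key_col=0):
    """Fold rows with an empty key column into the row above them.

    Assumes every row has the same number of cells.
    """
    merged = []
    for row in rows:
        if merged and not row[key_col].strip():
            prev = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    prev[i] = (prev[i] + " " + cell).strip()
        else:
            merged.append(list(row))  # copy, so the input is left untouched
    return merged


rows = [
    ["Date", "Description", "Amount"],
    ["2024-01-05", "Payment to ACME", "120.00"],
    ["", "Holdings Ltd", ""],
    ["2024-01-06", "Refund", "-15.00"],
]
for row in merge_continuations(rows):
    print(row)
```

The empty-key-column rule is only a guess about what a wrapped row looks like; tables whose first column is legitimately blank on some rows would need a different test.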