PDF to Text

Extract text from any PDF,
in your browser.

Pull a clean text layer out of any PDF — paragraphs preserved, multi-page, UTF-8. Extraction runs inside your browser, so the file never leaves your device.

Drop the PDF you want to extract text from

We read the embedded text layer with pdf.js — no OCR, no server.

Verify yourself: open DevTools → Network tab → drop a file. Watch zero uploads happen.

Multi-page · UTF-8 output

Scanned PDFs need OCR: this is text-layer only

Free

No Sign-Up

No Upload

UTF-8 Output

HOW IT WORKS

Three steps. Your PDF never leaves this tab.

Drop your PDF

Pick the file you want to extract text from. It loads into your browser's memory, not a server.

We read the text layer

pdf.js walks every page, sorts items by Y-coordinate, and reconstructs paragraph breaks where they belong.

Copy or download .txt

Get clean UTF-8 plain text. Copy to clipboard or save as a .txt file — your call.
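For the curious, here is a rough sketch of what step two might look like in code. It is not this tool's actual source. The interfaces below are declared locally and only mirror the parts of a pdf.js document the sketch relies on (numPages, getPage, getTextContent, and items carrying str and transform); the function name collectTextItems is made up for illustration.

```typescript
// Minimal structural types mirroring the parts of pdf.js used here.
// Declared locally (an assumption) so the sketch stays self-contained.
interface TextItemLike {
  str: string;         // the text of one fragment
  transform: number[]; // 6-element matrix; index 4 is x, index 5 is y
}
interface PageLike {
  getTextContent(): Promise<{ items: unknown[] }>;
}
interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>; // page numbers start at 1
}

// Type guard: skip items that carry no text (for example marked-content markers).
function isTextItem(item: unknown): item is TextItemLike {
  return (
    typeof item === "object" &&
    item !== null &&
    typeof (item as TextItemLike).str === "string" &&
    Array.isArray((item as TextItemLike).transform)
  );
}

// Walk every page in order and collect its text items, one array per page.
async function collectTextItems(doc: DocLike): Promise<TextItemLike[][]> {
  const pages: TextItemLike[][] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    pages.push(content.items.filter(isTextItem));
  }
  return pages;
}
```

In a browser, the doc argument would typically come from pdf.js itself, for example from pdfjsLib.getDocument({ data }).promise, where data is the ArrayBuffer read from the dropped file. Nothing in that path needs a network request.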

When you want the text and not the PDF

The reasons to extract the text from a PDF are nearly always about getting the words somewhere a PDF can't go. You want to paste a quote into an email without keeping the PDF as an attachment. You're feeding a long document to an AI assistant that takes plain text. You want to grep a 200-page report for one phrase and the PDF viewer's search is hiding what you need. You're translating a contract and your translation tool wants the source as a flat file. You're moving a manuscript from a finished PDF back into a writing app. In all of these the wrapper is what's in the way — the words are fine, just stuck inside a layout that's hard to recombine.

The output here is exactly that: plain text, one big .txt file, in the same order the PDF reads. No formatting, no fonts, no images, no tables-as-tables. The job is to liberate the words.

What "extract" actually does

A PDF holds two kinds of "text". Real text (characters drawn with fonts and marked in the PDF as letters) sits in a text layer. The tool reads that layer directly. The other kind is text that lives only as pixels: anything scanned, photographed, or screenshotted before being put into the PDF. Those characters are images of letters, not letters, and no extractor can see them as text without OCR. There is no OCR step here.

One quick check tells you which kind of PDF you have. Open it in any reader, then click and drag over a paragraph: if the text highlights cleanly, there's a real text layer and extraction will work. If your cursor draws a rectangle and nothing highlights, the page is an image and you'll need to run it through OCR first (in another tool) before extraction has anything to read.
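The same check can be done in code once the fragments of each page have been read. The sketch below is a hypothetical helper, not part of the tool: hasTextLayer and its minChars parameter are names chosen for illustration, and the input is assumed to be one array of text fragments per page (empty for image-only pages).

```typescript
// Hypothetical helper: does the extracted content contain a usable text layer?
// `pages` holds the text fragments found on each page.
function hasTextLayer(pages: string[][], minChars: number = 1): boolean {
  let chars = 0;
  for (const fragments of pages) {
    for (const fragment of fragments) {
      // Count only non-whitespace characters, so pages of blank fragments do not pass.
      chars += fragment.replace(/\s/g, "").length;
    }
  }
  return chars >= minChars;
}
```

A scanned PDF usually yields no fragments at all, so the function returns false and the user can be pointed at an OCR tool instead of receiving an empty .txt.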

How line breaks and paragraphs come out

PDFs don't store paragraphs. Internally, a page is a bag of text fragments with positions on the page — there's no metadata saying "this is the end of a paragraph." Sensible plain text needs the breaks somewhere, so the tool infers them from vertical spacing: small gaps between lines become a single line break; larger gaps (the kind designers use to separate paragraphs) become a blank line. It gets the common cases right — body paragraphs, headings, lists. It can't know when a designer used unusual spacing to mean something other than what it usually means, so unusual layouts may need light cleanup after.
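To make the idea concrete, here is a minimal sketch of gap-based inference. It is an illustration under stated assumptions, not the tool's real algorithm: the Fragment shape, the function name, and all three thresholds (lineHeight, paragraphFactor, sameLineTolerance) are invented for the example. It also assumes PDF-style coordinates, where y grows upward from the bottom of the page.

```typescript
// One positioned piece of text on a page (hypothetical shape for this sketch).
interface Fragment {
  str: string;
  x: number;
  y: number; // baseline position; larger y is higher on the page
}

function fragmentsToText(
  fragments: Fragment[],
  lineHeight: number = 12,        // assumed typical distance between baselines
  paragraphFactor: number = 1.5,  // assumed: gaps above 1.5 line heights start a paragraph
  sameLineTolerance: number = 2,  // assumed: baselines within 2 units share a line
): string {
  // Top of the page first.
  const byY = [...fragments].sort((a, b) => b.y - a.y);

  // Group fragments whose baselines are close enough into lines.
  const lines: { y: number; parts: Fragment[] }[] = [];
  for (const f of byY) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - f.y) <= sameLineTolerance) {
      last.parts.push(f);
    } else {
      lines.push({ y: f.y, parts: [f] });
    }
  }

  let out = "";
  let prevY: number | null = null;
  for (const line of lines) {
    // Within a line, read left to right.
    const text = line.parts
      .sort((a, b) => a.x - b.x)
      .map((p) => p.str.trim())
      .filter((s) => s.length > 0)
      .join(" ");
    if (prevY !== null) {
      // Small gap: single line break. Large gap: blank line.
      out += prevY - line.y > lineHeight * paragraphFactor ? "\n\n" : "\n";
    }
    out += text;
    prevY = line.y;
  }
  return out;
}
```

Fixed thresholds are the weak point of a sketch like this. A real implementation would more likely derive the line height from the font sizes or from the most common gap on the page.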

Pages are separated by an extra blank line in the output, so a page boundary shows up as a double blank line. If you'd rather not have page breaks at all, find-and-replace the double blank line with a single one in your editor.
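If you would rather script that find-and-replace, something like the following should do it. The sketch assumes a page boundary appears as two consecutive blank lines (three or more newline characters in a row); the function name is made up for illustration.

```typescript
// Collapse any run of two or more blank lines into a single blank line.
function collapsePageBreaks(text: string): string {
  return text
    .replace(/\r\n/g, "\n")       // normalise Windows line endings first
    .replace(/\n{3,}/g, "\n\n");  // 3+ newlines in a row become one blank line
}
```

Paragraph breaks (a single blank line) are left alone, so only the page boundaries disappear.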

What doesn't survive the extraction

Bold, italic, fonts, colour, alignment. Plain text is plain. If you need formatting preserved, see pdf-to-word.
Tables. Cells become plain text in reading order — usually row-by-row, often with awkward spacing. Tables that need to stay tables belong in pdf-to-excel.
Images and diagrams. Anything that wasn't text in the PDF doesn't appear in the text. To rip the images out separately, see pdf-to-jpg.
Headers and footers. If the original repeated "Confidential — page X of Y" on every page, the extracted text will too. A find-and-replace removes them in seconds.
Hyphenated words across line breaks. A word split with a hyphen across two lines comes out as some-\nthing rather than something. If this matters for downstream search or spellcheck, a regex like -\n → empty string fixes it.
Multi-column layouts may interleave. A two-column research paper can come out with sentences from the left and right columns alternating, because fragments are ordered by vertical position across the whole page. Single-column documents (most reports, contracts, books) are unaffected. If a column-mixed result is unusable, the cleanest path is to re-extract with a tool that respects column order.
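Two of the cleanups above, rejoining hyphenated words and removing repeated headers or footers, are easy to script. The helpers below are a sketch with invented names. The hyphen rule is deliberately conservative (it only joins when a lowercase letter sits on each side of the break), which is an assumption of this example rather than something the tool does.

```typescript
// Rejoin words split by a hyphen at a line end: "some-\nthing" becomes "something".
// Built with the RegExp constructor so Unicode lowercase letters (\p{Ll}) are covered.
const HYPHEN_BREAK = new RegExp("(\\p{Ll})-\\n(\\p{Ll})", "gu");

function dehyphenate(text: string): string {
  return text.replace(HYPHEN_BREAK, "$1$2");
}

// Drop every line that matches a header or footer pattern.
function stripMatchingLines(text: string, pattern: RegExp): string {
  // Rebuild without the g and y flags so test() carries no state between lines.
  const re = new RegExp(pattern.source, pattern.flags.replace(/[gy]/g, ""));
  return text
    .split("\n")
    .filter((line) => !re.test(line))
    .join("\n");
}
```

For the footer mentioned above, a pattern such as /^Confidential [\u2014-] page \d+ of \d+$/ would match one footer per page. One caveat on the hyphen rule: a genuinely hyphenated compound that happens to break at the hyphen, such as well-known, will also be joined, so check the result if exact wording matters.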

A few practical notes

If the PDF is password-protected, open it through unlock-pdf first. Encrypted PDFs can't be opened for text extraction.
For very large PDFs (close to the 100 MB limit, or thousands of pages), extraction still happens in your browser. On a desktop this is rarely a problem; on a phone with a 500-page PDF, browser memory is the cap. If you hit it, switch to a desktop.
The output is UTF-8. Cyrillic, Greek, Arabic, Chinese, and accented Latin all survive cleanly as long as the PDF stored them as real text. PDFs that drew non-Latin characters as embedded subset glyphs without a proper Unicode mapping produce garbage on extraction. That's a problem in the PDF itself, not the extractor. The fix is on the source side: re-export with proper Unicode encoding.
Filename pattern. A file called contract.pdf downloads as contract.txt. The PDF on disk stays put.
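The filename rule and the UTF-8 output can be sketched in a few lines. Both function names are hypothetical, and the sketch assumes the download name is simply the PDF's name with its extension swapped.

```typescript
// Derive the download name: "contract.pdf" becomes "contract.txt".
function txtNameFor(pdfName: string): string {
  const base = pdfName.replace(/\.pdf$/i, ""); // drop a trailing .pdf, any letter case
  return `${base}.txt`;
}

// Encode the extracted text as UTF-8 bytes, ready to be saved as a file.
function toUtf8(text: string): Uint8Array {
  return new TextEncoder().encode(text);
}
```

TextEncoder always produces UTF-8, which is why Cyrillic or Chinese text needs no special handling at this stage: a Cyrillic letter becomes two bytes, a common Chinese character three.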

What happens to your file

The extraction runs in your browser. Open DevTools and watch the Network tab during the operation — there are no outbound requests carrying the file content. The PDF stays on your disk; the .txt is a new download alongside it.

FAQ

Frequently asked

How does extraction work?

We use Mozilla's pdf.js to read the embedded text layer of your PDF page by page. Text items are grouped into lines by their Y-coordinate, and paragraph breaks are inferred from the vertical gaps between lines. No server, no upload.

Does it work on scanned PDFs?

No. Scans are images of text, not text — extracting them requires OCR, which this tool doesn't run. If your PDF was made by scanning paper, you'll need an OCR tool first.

Is my file uploaded anywhere?

Never. Extraction runs entirely in your browser with pdf.js, and you can verify it in DevTools → Network. The file stays on your device.

What about password-protected PDFs?

Unlock the PDF first using our Unlock PDF tool, then extract. Encrypted content streams can't be parsed without the password.

What's the file size limit?

Up to 100 MB. Anything larger may exhaust browser memory — try splitting it with the Split PDF tool first.
