Processing and AI

Paperarchive automatically processes every document you upload using AI-powered analysis.

How processing works

Paperarchive validates the file type and size.
OCR extracts text from PDFs and images.
Paperarchive extracts sender, category, tags, dates, and summary.
Similarity matching connects related documents.

Low-confidence results create action hints on the document so you can review and confirm them.
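
To illustrate how low-confidence results could turn into action hints, here is a minimal sketch. It assumes each extracted field carries a confidence score between 0.0 and 1.0 and that anything below a cutoff is flagged for review. The names ExtractedField, action_hints, and REVIEW_THRESHOLD, and the value 0.8, are hypothetical and not part of Paperarchive.

```python
from dataclasses import dataclass

# Hypothetical cutoff; Paperarchive's actual threshold is not documented here.
REVIEW_THRESHOLD = 0.8


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def action_hints(fields, threshold=REVIEW_THRESHOLD):
    """Return the names of fields whose confidence is below the threshold."""
    return [f.name for f in fields if f.confidence < threshold]


fields = [
    ExtractedField("sender", "Acme GmbH", 0.97),
    ExtractedField("category", "invoice", 0.62),
    ExtractedField("date", "2024-03-01", 0.91),
]

# Only the category falls below the cutoff, so only it is flagged for review.
print(action_hints(fields))
```

In this sketch, fields at or above the cutoff are accepted as-is, which keeps the review list short.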

Duplicate detection

Paperarchive automatically detects duplicate uploads using SHA-256 content hashing. When you upload a file, a hash is computed on your device and verified on the server. If the same content already exists in your library, the upload is skipped and you are notified.
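
The hashing step can be sketched as follows, using Python's standard hashlib module. This is an illustration of content hashing in general, not Paperarchive's actual client or server code. The names content_hash and Library are hypothetical, and an in-memory set stands in for the server-side library.

```python
import hashlib
import io


def content_hash(stream, chunk_size=65536):
    """Compute the SHA-256 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


class Library:
    """Hypothetical stand-in for a document library keyed by content hash."""

    def __init__(self):
        self._hashes = set()

    def upload(self, stream):
        """Store the document and return True, or return False if it is a duplicate."""
        file_hash = content_hash(stream)
        if file_hash in self._hashes:
            return False  # same content already exists, so skip the upload
        self._hashes.add(file_hash)
        return True


library = Library()
first = library.upload(io.BytesIO(b"Invoice 2024-001"))
second = library.upload(io.BytesIO(b"Invoice 2024-001"))
print(first, second)
```

Because the hash depends only on the file's bytes, renaming a file does not change the result, while changing even a single byte produces a different hash.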

For near-duplicates (re-scans, reformatted versions), Paperarchive uses embedding similarity to flag documents that are very similar to existing ones. These are marked for your review so you can decide whether to keep them.
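
As a rough illustration of embedding similarity, the sketch below compares documents by the cosine similarity of their embedding vectors. It assumes embeddings are plain lists of floats and that cosine similarity is the metric. The source does not specify the metric or the cutoff, so both, along with the names cosine_similarity, is_near_duplicate, and NEAR_DUPLICATE_THRESHOLD, are assumptions.

```python
import math

# Hypothetical cutoff; the real threshold is not documented here.
NEAR_DUPLICATE_THRESHOLD = 0.95


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def is_near_duplicate(new_embedding, existing_embeddings,
                      threshold=NEAR_DUPLICATE_THRESHOLD):
    """Return True if any existing embedding is at least `threshold` similar."""
    return any(
        cosine_similarity(new_embedding, existing) >= threshold
        for existing in existing_embeddings
    )


existing = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rescan = [0.99, 0.05, 0.0]     # nearly the same direction as the first vector
unrelated = [0.0, 0.0, 1.0]    # orthogonal to both existing vectors

print(is_near_duplicate(rescan, existing))
print(is_near_duplicate(unrelated, existing))
```

Unlike an exact hash match, a similarity score is a judgment call, which is why documents flagged this way are marked for review rather than skipped automatically.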

Duplicate detection works across all upload methods: the app, the API, and email forwarding.

Processing language

Pick the processing language that best matches your documents. This improves OCR and extraction accuracy. The display language is a separate setting and only affects the app UI.

What Paperarchive detects

Sender: the company or person the document is from.
Category: the type of document (invoice, contract, etc.).
Tags: relevant labels extracted from the content.
Date: the document date (not the upload date).
Filename: a clean, descriptive name based on the content and your naming preferences.

You can always edit any of these after processing.