48 Commits

Author SHA1 Message Date
Trenton H 2729b0d3dc refactor: consolidate pdftotext utility and archive-decision logic
- Add extract_pdf_text() and PDF_TEXT_MIN_LENGTH to paperless/parsers/utils.py,
  eliminating duplicate pdftotext call sites in consumer.py and tesseract.py
- Rename _should_produce_archive → should_produce_archive (now public, imported
  by both consumer.py and tasks.py)
- update_document_content_maybe_archive_file now calls should_produce_archive,
  honouring ARCHIVE_FILE_GENERATION the same as the consumer pipeline
- Fallback OCR path sets archive_path when produce_archive=True; update
  test_with_form_redo_produces_no_archive to use produce_archive=False

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 07:51:24 -07:00
Jan Kleine 0bc032a67d Development: improve test portability (#12187)
* Fix: improve test portability

* Make settings always consistent

* Make a few more tests deterministic wrt settings

* Don't pollute settings for this one

* Fix timezone issue with mail parser

* Update test_parser.py

* Uh, I guess OCR gives variants for this

---------

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-02-27 23:24:11 +00:00
Trenton H 8db1c4e08b Breaking: Remove pyzbar as a barcode reader (#12065) 2026-02-13 08:14:00 -08:00
shamoon 00ef0837d2 Fix: re-run ASN check after barcode detection (#11681) 2026-02-02 23:23:37 +00:00
Sebastian Steinbeißer 3b5ffbf9fa Chore(mypy): Annotate None returns for typing improvements (#11213) 2026-02-02 08:44:12 -08:00
Christoph Schober d16d3fb618 Feature: support split documents based on tag barcodes (#11645) 2026-01-29 08:05:33 -08:00
DerRockWolf 4ff09c4cf4 Enhancement: support workflow path matching of barcode-split documents (#10723) 2025-09-24 21:03:03 +00:00
shamoon 6a5be992c0 Enhancement: add barcode frontend config (#9742) 2025-05-11 19:44:06 +00:00
shamoon 9b84dc06b6 Enhancement: support retain barcode split pages (#7912) 2024-10-13 20:51:39 -07:00
Trenton H b720aa3cd1 Chore: Convert the consumer to a plugin (#6361) 2024-04-18 02:59:14 +00:00
Trenton H 8d664fad56 Fixes the interaction when both splitting and ASN are enabled (#5779) 2024-02-15 17:33:26 +00:00
Trenton H 21f96f0679 Fix: Splitting on ASN barcodes even if not enabled (#5740)
* Fixes the barcodes always splitting on ASNs, even if splitting was disabled
2024-02-12 12:58:37 -08:00
pkrahmer fb82aa0ee1 Feature: Allow tagging by putting barcode stickers on documents (#5580) 2024-02-05 17:38:19 +00:00
Trenton H 2da5e46386 Refactor file consumption task to allow beginnings of a plugin system (#5367) 2024-01-13 16:11:14 +00:00
Trenton H 122e4141b0 Fix: Document metadata is lost during barcode splitting (#4982)
* Fixes barcode splitting dropping metadata that might be needed for the round 2
2023-12-15 09:17:25 -08:00
Bastian Machek 931f5f9c27 Feature: support barcode upscaling for better detection of small barcodes (#3655) 2023-06-27 10:18:47 -07:00
Trenton H 45d8c945e2 Small improvements to coverage 2023-06-06 13:18:13 -07:00
Trenton H 07e07fc7e8 Updates handling of barcodes to encapsulate logic, moving it out of tasks and into barcodes 2023-05-22 06:52:31 -07:00
Trenton H 3bcbd05252 Fixes ruff not running isort against the codebase 2023-04-26 09:35:27 -07:00
Trenton Holmes 1b4020b3d7 Fixes barcode tests not running 2023-04-01 17:38:18 -07:00
Trenton H ce41ac9158 Configures ruff as the one stop linter and resolves warnings it raised 2023-04-01 17:03:52 -07:00
Trenton H 3c2bbf244d Creates a data model for the document consumption, allowing stronger typing of arguments and setting of some information about the file only once 2023-04-01 11:05:34 -07:00
Trenton H 0778c2808b Instead of using PIL directly to convert TIFF to PDF, use the existing library of img2pdf 2023-03-20 13:48:05 -07:00
Marvin Gaube 567a1bb770 fix: skip tiff tests for zxing 2023-03-20 20:59:59 +01:00
Marvin Gaube e89c0f15dd feature: Add support for zxing as barcode scanning lib 2023-03-19 13:48:35 +01:00
Trenton H 41bcfcaffe Changes out the settings and a decent amount of test code to be pathlib compatible 2023-03-06 09:16:07 -08:00
Trenton Holmes 0df91c31f1 Creates a mix-in for asserting file system states 2023-02-20 10:25:21 -08:00
Fabian Ohler 658d372cd2 Feature: split documents on ASN barcode (#2554)
* also split documents when an ASN barcode is found

* linter

* fix test case parameters

* avoid pre-python-3.9 features

* simplify dict-creation in tests

* simplify dict-creation in tests for empty dicts

* Add test cases for the splitting by ASN barcode feature

* deleted supporting files for test case construction
2023-02-01 01:13:30 -08:00
Trenton H 9784ea4a60 Minor tweak to password test to ensure the right lines were hit 2023-01-27 12:24:47 -08:00
Trenton H 4fce5aba63 Moves ASN barcode testing into a dedicated class 2023-01-27 12:24:47 -08:00
Trenton H 2ab77fbaf7 Removes pikepdf based scanning, fixes up unit testing (+ commenting) 2023-01-27 12:24:47 -08:00
shamoon c7690c05f5 Merge pull request #2498 from paperless-ngx/fix-2496
Fix: limit asn integer size
2023-01-24 10:37:04 -08:00
Trenton H 4195d5746f Rescales images from PDFs so zbar can better find them 2023-01-24 10:30:53 -08:00
Trenton H 8b90b51b1a Adjust the barcode to ASN range check and add test case to cover the check 2023-01-24 10:30:32 -08:00
Trenton H 299a69a2de Adds given/when/then commenting and adds an end to end test to verify the read ASN is provided to the consumer 2023-01-24 09:43:52 -08:00
Trenton H 7bc077ac08 Use dataclasses to group data about barcodes in documents 2023-01-24 09:43:52 -08:00
Peter Kappelt c2880bcf9a Extended tests for ASN barcode parsing 2023-01-24 09:43:52 -08:00
Peter Kappelt d8d111f093 update existing tests to use modified barcode api 2023-01-24 09:43:52 -08:00
Trenton H 6ff28c92a4 Resolves minor flake8 warnings in the test suite 2023-01-05 08:39:48 -08:00
Trenton H 10f6195bac Always use pikepdf, then pdf2image if needed to check for barcodes instead of requiring/allowing configuration 2022-11-09 13:01:39 -08:00
Trenton H d52fbbb040 More smoothly handle the case of a password protected PDF for barcodes 2022-10-24 13:16:14 -07:00
Trenton H f8ce6285df Allows using pdf2image instead of pikepdf if desired 2022-10-24 09:58:34 -07:00
Trenton Holmes 4cc2976614 Adds specific handling for CCITT Group 4, which pikepdf decodes, but not correctly 2022-10-11 13:51:14 -07:00
Trenton H caf4b54bc7 In case pikepdf fails to convert an image to a PIL image, fall back to converting pages to PIL images 2022-10-11 13:51:13 -07:00
Trenton H 8025df5fe3 Catch the new error raised by redis when it can't find the broker and stub out the call for testing 2022-10-10 14:21:42 -07:00
Trenton Holmes 7aa0e5650b Updates how barcodes are detected, using pikepdf images, instead of converting each page to an image 2022-09-16 09:08:16 -07:00
Trenton Holmes 9ae847039b Fixes the separation of files by barcode, during the case where 2 barcodes appear back to back 2022-09-14 14:00:37 -07:00
Trenton Holmes ec045e81f2 Moves the barcode related functionality out of tasks and into its own location. Splits up the testing based on that 2022-07-02 16:19:22 +02:00