Commit Graph

40 Commits

Author SHA1 Message Date
Brian Martin b6ae129ad1 Sample Config and Bug Fix
Update sample config to reflect new setting variable.
Change consumer to handle density setting as str instead of int.
2016-05-13 23:23:58 -04:00
Brian Martin 52c5aafb3f Convert Density
Add settings variable for the convert density setting.
If no variable is set, default to 300.
2016-05-13 22:47:40 -04:00
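The two density commits above can be illustrated with a minimal sketch. It assumes the setting is read from an environment variable named `PAPERLESS_CONVERT_DENSITY` (a hypothetical name, not confirmed by the log) and that the value is passed to ImageMagick's `convert -density` as a string. The helper names are illustrative only.

```python
import os

# Fallback used when no density is configured, as the commit message states.
DEFAULT_DENSITY = 300


def get_convert_density(environ=None):
    """Return the density as a string, since it ends up in an argument list."""
    environ = os.environ if environ is None else environ
    # PAPERLESS_CONVERT_DENSITY is an assumed variable name for this sketch.
    return str(environ.get("PAPERLESS_CONVERT_DENSITY") or DEFAULT_DENSITY)


def build_convert_command(source, target, environ=None):
    """Build an argument list for ImageMagick's `convert` (hypothetical helper)."""
    return ["convert", "-density", get_convert_density(environ), source, target]
```

Keeping the value as `str` avoids a type error when the list is handed to `subprocess`, which expects string arguments.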
Daniel Quinn e96c7448bc Fix for #107 2016-04-11 23:28:12 +01:00
Daniel Quinn 90939be6af @Pitkley made a good suggestion in #98 2016-04-10 17:39:49 +01:00
Daniel Quinn 64b72d4337 Added test for duplicates 2016-04-03 18:44:00 +01:00
Daniel Quinn bbe691f342 Merge pull request #101 from danielquinn/issue/89
Closes #89.
2016-03-28 14:25:56 +01:00
Daniel Quinn b4e648e1e3 Test All The Things 2016-03-28 14:16:26 +01:00
Daniel Quinn b92e007e15 Removed log components and introduced signals for tags & correspondents 2016-03-28 11:11:15 +01:00
Daniel Quinn 49b56425e8 Merge branch 'master' into issue/81 2016-03-25 20:56:30 +00:00
Daniel Quinn b387be6f25 I didn't mean to explicitly set -limit 2016-03-25 20:33:00 +00:00
Daniel Quinn 9991f5a6b2 Introducing optional env vars for ImageMagick 2016-03-25 20:31:15 +00:00
Daniel Quinn 0aa0513004 Modifications for support for dates 2016-03-24 19:18:33 +00:00
Daniel Quinn 1170139127 Added a consume-start and consume-finish signal 2016-03-14 21:20:44 +00:00
Tikitu de Jager 95217e8e21 Use FileInfo directly instead of via indirection 2016-03-07 21:08:07 +02:00
Tikitu de Jager 1f75af0137 Extract filename parsing into testable class 2016-03-07 21:05:04 +02:00
Pit Kleyersburg fb36a49c26 Add unpaper as another pre-processing step 2016-03-06 15:30:37 +01:00
Daniel Quinn 495ed1c36c Added thumbnail generation to the consumer 2016-03-05 12:09:06 +00:00
Daniel Quinn 5d4587ef8b Accounted for .sender in a few places 2016-03-04 09:14:50 +00:00
Daniel Quinn 070463b85a s/Sender/Correspondent & reworked the (im|ex)porter 2016-03-03 20:52:42 +00:00
Daniel Quinn fad466477b More verbose error logging 2016-03-03 18:18:48 +00:00
Daniel Quinn 631aa99d92 No need to pass verbosity around anymore 2016-02-28 00:39:40 +00:00
Daniel Quinn 2fe9b0cbc1 New logging appears to work 2016-02-27 20:18:50 +00:00
Daniel Quinn 1aecb1e63a Compensate for case and format of jpg vs. jpeg 2016-02-23 20:15:13 +00:00
Daniel Quinn 3a7923e32d Moved pyocr.get_available_tools() into a method 2016-02-21 02:24:05 +00:00
Daniel Quinn 422ae9303a pep8 2016-02-21 00:14:50 +00:00
Daniel Quinn 51b19f4c19 Issue #57 2016-02-20 22:30:01 +00:00
Pit Kleyersburg c45f951ca0 Ignore error if orientation detection fails
Fixes an additional issue that came up in #48.
2016-02-19 09:52:32 +01:00
Pit Kleyersburg c34d57a872 Detect image orientation if the OCR supports it
Fixes issue #47.
2016-02-18 09:37:13 +01:00
Daniel Quinn 1e7ece81ee Fixes #45 2016-02-17 23:07:54 +00:00
Daniel Quinn 6f95b05287 Support appropriate sorting for long documents 2016-02-17 00:10:05 +00:00
Pit Kleyersburg 46f8f492f5 Safely and non-randomly create scratch directory
Creating the scratch files in `_get_grayscale` using a random integer is
inherently unsafe and can cause collisions. It should also be
unnecessary, given that the files will be cleaned up after the OCR run.

Since we don't know if OCR runs might be parallel in the future, this
commit implements thread-safe and deterministic directory-creation.

Additionally, it fixes the call to `_cleanup` by `consume`. In the
current implementation, `_cleanup` will not be called if the last
consumed document failed with an `OCRError`; this commit fixes that.
2016-02-16 12:15:57 +01:00
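The idea in the commit above can be sketched as follows. This is not the commit's actual code: it swaps in `tempfile.mkdtemp`, which creates a uniquely named directory atomically, and uses `try`/`finally` so cleanup runs even when OCR fails. `consume`, `ocr`, and the locally defined `OCRError` are stand-ins for illustration.

```python
import shutil
import tempfile


class OCRError(Exception):
    """Stand-in for the project's OCR failure exception."""


def consume(documents, ocr, scratch_root=None):
    """Run `ocr(doc, tempdir)` per document, always removing the scratch dir."""
    results = []
    for doc in documents:
        # mkdtemp never reuses an existing path, so concurrent runs cannot collide.
        tempdir = tempfile.mkdtemp(prefix="paperless-", dir=scratch_root)
        try:
            results.append(ocr(doc, tempdir))
        except OCRError:
            # A failed document is recorded as None; processing continues.
            results.append(None)
        finally:
            # Runs on success and on failure, including for the last document.
            shutil.rmtree(tempdir, ignore_errors=True)
    return results
```

Placing the cleanup in `finally` is what guarantees it also runs when the final document raises `OCRError`.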
Daniel Quinn a0f4f6c5f2 Fixed merge conflict and did some pep8 2016-02-14 17:13:48 +00:00
Pit Kleyersburg aeab9a0e81 Detect language only on one page of PDF
Currently, detecting the language means processing the entire document.
If the detected language differs from the default, the entire document
is processed again for the new language.

This PR analyzes the middle page for its language and either processes
the remaining pages with the default language if it didn't differ, or
processes all pages for the new guessed language.

The number of processed pages comes down from a worst case of `2n` to a
worst case of `n+1`.
2016-02-14 17:55:13 +01:00
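A minimal sketch of the middle-page strategy described above, assuming two caller-supplied callables: `ocr_page(page, language)` returning text, and `guess_language(text)` returning a language code or `None`. Both names, and the `"eng"` default, are hypothetical and chosen for illustration.

```python
def ocr_with_language_guess(pages, ocr_page, guess_language, default_language="eng"):
    """OCR all pages, detecting the language from the middle page only."""
    if not pages:
        return default_language, []

    middle = len(pages) // 2
    # One OCR pass over the middle page in the default language.
    middle_text = ocr_page(pages[middle], default_language)
    guessed = guess_language(middle_text) or default_language

    if guessed == default_language:
        # Reuse the middle page's text: n OCR calls in total.
        texts = [
            middle_text if index == middle else ocr_page(page, default_language)
            for index, page in enumerate(pages)
        ]
    else:
        # Redo every page in the guessed language: n + 1 OCR calls in total.
        texts = [ocr_page(page, guessed) for page in pages]
    return guessed, texts
```

For a document of `n` pages this makes `n` OCR calls when the guess matches the default and `n + 1` when it does not, matching the bound stated in the commit message.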
Daniel Quinn 7843ea5037 Added and implemented a rudimentary logger 2016-02-14 16:09:52 +00:00
Pit Kleyersburg 20b2408dbb Ensure OCR_THREADS is integer, add documentation 2016-02-14 16:37:38 +01:00
Pit Kleyersburg f5beda9c56 Enable parallel OCR processing
At the moment, every page in a PDF will be processed one by one using
tesseract. Since the processing of a single page is independent from every
other page, one can make use of multi-core machines.

This PR introduces a multiprocessing pool to process multiple pages
simultaneously. The number of threads to use can be specified in the
environment variable `PAPERLESS_OCR_THREADS`. This will default to the
number of cores/hyperthreads Python detects for your system.
2016-02-14 15:57:42 +01:00
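The approach in the commit above might look roughly like this. The sketch reads `PAPERLESS_OCR_THREADS` as an integer and falls back to `None`, which makes `multiprocessing.Pool` use the CPU count Python detects. `ocr_page` is a placeholder for the real tesseract call, and the function names are assumptions.

```python
import os
from multiprocessing import Pool


def get_ocr_threads(environ=None):
    """Return the configured worker count as an int, or None if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PAPERLESS_OCR_THREADS")
    # None tells Pool to fall back to os.cpu_count().
    return int(raw) if raw else None


def ocr_page(page):
    """Placeholder for the per-page OCR step; here it just echoes the input."""
    return "text of {}".format(page)


def ocr_pages(pages, threads=None):
    """OCR pages in parallel; results come back in the original page order."""
    with Pool(processes=threads) as pool:
        return pool.map(ocr_page, pages)
```

`Pool.map` preserves input order, so the per-page results can be joined directly. On platforms that spawn worker processes, a call to `ocr_pages` belongs under an `if __name__ == "__main__":` guard.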
Daniel Quinn a846b3f7b8 Adding some more debugging 2016-02-13 00:57:05 +00:00
Daniel Quinn 2421f559be Simpler regex 2016-02-12 08:27:09 +00:00
Daniel Quinn a022fcb8f1 Fixed the auto-naming regexes 2016-02-11 22:05:55 +00:00
Daniel Quinn 48761911b3 Image imports and consumption by mail work 2016-02-06 17:05:36 +00:00