Compare commits

12 Commits

Author | SHA1 | Message | Date
Trenton H | 951ea085f7 | Merge branch 'dev' into fix/potential-redos | 2026-04-02 15:07:21 -07:00
Trenton H | 376af81b9c | Fix: Resolve another TC assuming an object has been created somewhere (#12503) | 2026-04-02 14:58:28 -07:00
Trenton H | 6c8622b6b3 | This is clear enough and testing it seems useless | 2026-04-02 14:55:19 -07:00
Trenton H | 3e9558bf5e | Test a valid hmac but garbage data | 2026-04-02 14:55:19 -07:00
Trenton H | a9c1ce463c | Quickly test the error handling and good handling cases for regex | 2026-04-02 14:55:19 -07:00
Trenton H | 94a7b5c6a3 | Merge branch 'dev' into fix/potential-redos | 2026-04-02 14:42:49 -07:00
Trenton H | d917c7070f | We need to set a dummy value to build, oops | 2026-04-02 14:10:01 -07:00
Trenton H | fe2d924505 | Require a SECRET_KEY, don't ever fallback to an older one | 2026-04-02 14:00:04 -07:00
Trenton H | c1c423c7b2 | Patches things more directly | 2026-04-02 13:49:17 -07:00
Trenton H | 79784ac407 | Signs the classifier so we have additional protections against tampering + pickle | 2026-04-02 13:27:02 -07:00
Trenton H | 0c2fe1272b | Adds scoped rate limiting to the token API | 2026-04-02 13:26:09 -07:00
Trenton H | d6542a691e | Fixes potential sources for ReDOS | 2026-04-02 13:04:14 -07:00
43 changed files with 545 additions and 1126 deletions

@@ -237,8 +237,8 @@ RUN set -eux \
&& echo "Adjusting all permissions" \
&& chown --from root:root --changes --recursive paperless:paperless /usr/src/paperless \
&& echo "Collecting static files" \
&& s6-setuidgid paperless python3 manage.py collectstatic --clear --no-input --link \
&& s6-setuidgid paperless python3 manage.py compilemessages \
&& PAPERLESS_SECRET_KEY=build-time-dummy s6-setuidgid paperless python3 manage.py collectstatic --clear --no-input --link \
&& PAPERLESS_SECRET_KEY=build-time-dummy s6-setuidgid paperless python3 manage.py compilemessages \
&& /usr/local/bin/deduplicate.py --verbose /usr/src/paperless/static/
VOLUME ["/usr/src/paperless/data", \

@@ -17,9 +17,9 @@
# (if doing so please consider security measures such as reverse proxy)
#PAPERLESS_URL=https://paperless.example.com
# Adjust this key if you plan to make paperless available publicly. It should
# be a very long sequence of random characters. You don't need to remember it.
#PAPERLESS_SECRET_KEY=change-me
# Required. A unique secret key for session tokens and signing.
# Generate with: python3 -c "import secrets; print(secrets.token_urlsafe(64))"
PAPERLESS_SECRET_KEY=change-me
# Use this variable to set a timezone for the Paperless Docker containers. Defaults to UTC.
#PAPERLESS_TIME_ZONE=America/Los_Angeles

@@ -62,14 +62,10 @@ The REST api provides five different forms of authentication.
## Searching for documents
Full text searching is available on the `/api/documents/` endpoint. The
following query parameters cause the API to return Tantivy-backed search
Full text searching is available on the `/api/documents/` endpoint. Two
specific query parameters cause the API to return full text search
results:
- `/api/documents/?text=your%20search%20query`: Search title and content
using simple substring-style search.
- `/api/documents/?title_search=your%20search%20query`: Search title only
using simple substring-style search.
- `/api/documents/?query=your%20search%20query`: Search for a document
using a full text query. For details on the syntax, see [Basic Usage - Searching](usage.md#basic-usage_searching).
- `/api/documents/?more_like_id=1234`: Search for documents similar to
@@ -443,5 +439,3 @@ Initial API version.
- The `all` parameter of list endpoints is now deprecated and will be removed in a future version.
- The bulk edit objects endpoint now supports `all` and `filters` parameters to avoid having to send
large lists of object IDs for operations affecting many objects.
- The legacy `title_content` document search parameter is deprecated and will be removed in a future version.
Clients should use `text` for simple title-and-content search and `title_search` for title-only search.
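Reviewer sketch (not part of the diff): the `query` and `more_like_id` parameters documented in the search section above appear on both sides of this change, so they are the safest ones to illustrate. The helper name `build_search_url` and the host are placeholders, not anything from the Paperless codebase; the sketch only shows how the percent-encoded URLs in the docs can be produced.

```python
from urllib.parse import quote, urlencode


def build_search_url(base_url: str, **params: object) -> str:
    """Build a /api/documents/ URL; hypothetical helper for illustration."""
    # quote_via=quote encodes spaces as %20 (the docs' style) instead of '+'.
    query = urlencode(params, quote_via=quote)
    return f"{base_url.rstrip('/')}/api/documents/?{query}"


full_text = build_search_url("http://localhost:8000", query="your search query")
similar = build_search_url("http://localhost:8000", more_like_id=1234)
```

The resulting strings can be passed to any HTTP client together with one of the authentication methods the API documentation lists.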

@@ -402,6 +402,12 @@ Defaults to `/usr/share/nltk_data`
: This is where paperless will store the classification model.
!!! warning
The classification model uses Python's pickle serialization format.
Ensure this file is only writable by the paperless user, as a
maliciously crafted model file could execute arbitrary code when loaded.
Defaults to `PAPERLESS_DATA_DIR/classification_model.pickle`.
## Logging
@@ -422,14 +428,20 @@ Defaults to `/usr/share/nltk_data`
#### [`PAPERLESS_SECRET_KEY=<key>`](#PAPERLESS_SECRET_KEY) {#PAPERLESS_SECRET_KEY}
: Paperless uses this to make session tokens. If you expose paperless
on the internet, you need to change this, since the default secret
is well known.
: **Required.** Paperless uses this to make session tokens and sign
sensitive data. Paperless will refuse to start if this is not set.
Use any sequence of characters. The more, the better. You don't
need to remember this. Just face-roll your keyboard.
need to remember this. You can generate a suitable key with:
Default is listed in the file `src/paperless/settings.py`.
python3 -c "import secrets; print(secrets.token_urlsafe(64))"
!!! warning
This setting has no default value. You **must** set it before
starting Paperless. Existing installations that relied on the
previous default value should set `PAPERLESS_SECRET_KEY` to
that value to avoid invalidating existing sessions and tokens.
#### [`PAPERLESS_URL=<url>`](#PAPERLESS_URL) {#PAPERLESS_URL}
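Reviewer sketch (not part of the diff): the settings change that enforces the required key is not among the hunks shown here. Assuming the key is read from the `PAPERLESS_SECRET_KEY` environment variable, the fail-fast behaviour described in the hunk above could look roughly like this. `load_secret_key`, `generate_secret_key` and the use of `RuntimeError` are illustrative choices, not names from the codebase.

```python
import os
import secrets
from collections.abc import Mapping


def load_secret_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured key or fail loudly; hypothetical helper."""
    key = env.get("PAPERLESS_SECRET_KEY", "").strip()
    if not key:
        # No fallback to a well-known default: refuse to continue instead.
        raise RuntimeError(
            "PAPERLESS_SECRET_KEY is not set. Generate one with: "
            'python3 -c "import secrets; print(secrets.token_urlsafe(64))"',
        )
    return key


def generate_secret_key() -> str:
    """Same generator the documentation suggests."""
    return secrets.token_urlsafe(64)
```

Failing at startup rather than silently using a shared default matters more once the key also signs the classifier file, since a publicly known key would make that signature meaningless.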
@@ -770,6 +782,14 @@ If both the [PAPERLESS_ACCOUNT_DEFAULT_GROUPS](#PAPERLESS_ACCOUNT_DEFAULT_GROUPS
Defaults to 1209600 (2 weeks)
#### [`PAPERLESS_TOKEN_THROTTLE_RATE=<rate>`](#PAPERLESS_TOKEN_THROTTLE_RATE) {#PAPERLESS_TOKEN_THROTTLE_RATE}
: Rate limit for the API token authentication endpoint (`/api/token/`), used to mitigate brute-force login attempts.
Uses Django REST Framework's [throttle rate format](https://www.django-rest-framework.org/api-guide/throttling/#setting-the-throttling-policy),
e.g. `5/min`, `100/hour`, `1000/day`.
Defaults to `5/min`
## OCR settings {#ocr}
Paperless uses [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest/)
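Reviewer sketch (not part of the diff): the `PAPERLESS_TOKEN_THROTTLE_RATE` values quoted above follow a `<count>/<period>` shape. To my understanding, Django REST Framework only inspects the first letter of the period, so `5/min` and `5/m` should behave the same. The parser below is a standalone approximation for illustration, not DRF's own code, and `parse_throttle_rate` is a made-up name.

```python
# Seconds per period, keyed by the first letter of the period name.
PERIOD_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_throttle_rate(rate: str) -> tuple[int, int]:
    """Split '<count>/<period>' into (allowed requests, window in seconds)."""
    count, period = rate.split("/", 1)
    return int(count), PERIOD_SECONDS[period.strip()[0].lower()]


default_rate = parse_throttle_rate("5/min")  # the documented default
```

Read this way, the documented default allows five token requests per 60-second window.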

@@ -23,7 +23,8 @@
# Security and hosting
#PAPERLESS_SECRET_KEY=change-me
# Required. Generate with: python3 -c "import secrets; print(secrets.token_urlsafe(64))"
PAPERLESS_SECRET_KEY=change-me
#PAPERLESS_URL=https://example.com
#PAPERLESS_CSRF_TRUSTED_ORIGINS=https://example.com # can be set using PAPERLESS_URL
#PAPERLESS_ALLOWED_HOSTS=example.com,www.example.com # can be set using PAPERLESS_URL

@@ -315,9 +315,12 @@ markers = [
]
[tool.pytest_env]
PAPERLESS_SECRET_KEY = "test-secret-key-do-not-use-in-production"
PAPERLESS_DISABLE_DBHANDLER = "true"
PAPERLESS_CACHE_BACKEND = "django.core.cache.backends.locmem.LocMemCache"
PAPERLESS_CHANNELS_BACKEND = "channels.layers.InMemoryChannelLayer"
# I don't think anything hits this, but just in case, basically infinite
PAPERLESS_TOKEN_THROTTLE_RATE = "1000/min"
[tool.coverage.report]
exclude_also = [

@@ -49,11 +49,11 @@ test('text filtering', async ({ page }) => {
await page.getByRole('main').getByRole('combobox').click()
await page.getByRole('main').getByRole('combobox').fill('test')
await expect(page.locator('pngx-document-list')).toHaveText(/32 documents/)
await expect(page).toHaveURL(/text=test/)
await expect(page).toHaveURL(/title_content=test/)
await page.getByRole('button', { name: 'Title & content' }).click()
await page.getByRole('button', { name: 'Title', exact: true }).click()
await expect(page.locator('pngx-document-list')).toHaveText(/9 documents/)
await expect(page).toHaveURL(/title_search=test/)
await expect(page).toHaveURL(/title__icontains=test/)
await page.getByRole('button', { name: 'Title', exact: true }).click()
await page.getByRole('button', { name: 'Advanced search' }).click()
await expect(page).toHaveURL(/query=test/)

@@ -3545,7 +3545,7 @@
"time": 1.091,
"request": {
"method": "GET",
"url": "http://localhost:8000/api/documents/?page=1&page_size=50&ordering=-created&truncate_content=true&include_selection_data=true&text=test",
"url": "http://localhost:8000/api/documents/?page=1&page_size=50&ordering=-created&truncate_content=true&include_selection_data=true&title_content=test",
"httpVersion": "HTTP/1.1",
"cookies": [],
"headers": [
@@ -3579,7 +3579,7 @@
"value": "true"
},
{
"name": "text",
"name": "title_content",
"value": "test"
}
],
@@ -4303,7 +4303,7 @@
"time": 0.603,
"request": {
"method": "GET",
"url": "http://localhost:8000/api/documents/?page=1&page_size=50&ordering=-created&truncate_content=true&include_selection_data=true&title_search=test",
"url": "http://localhost:8000/api/documents/?page=1&page_size=50&ordering=-created&truncate_content=true&include_selection_data=true&title__icontains=test",
"httpVersion": "HTTP/1.1",
"cookies": [],
"headers": [
@@ -4337,7 +4337,7 @@
"value": "true"
},
{
"name": "title_search",
"name": "title__icontains",
"value": "test"
}
],

@@ -24,7 +24,7 @@ import {
FILTER_HAS_DOCUMENT_TYPE_ANY,
FILTER_HAS_STORAGE_PATH_ANY,
FILTER_HAS_TAGS_ALL,
FILTER_SIMPLE_TEXT,
FILTER_TITLE_CONTENT,
} from 'src/app/data/filter-rule-type'
import { GlobalSearchType, SETTINGS_KEYS } from 'src/app/data/ui-settings'
import { DocumentListViewService } from 'src/app/services/document-list-view.service'
@@ -545,7 +545,7 @@ describe('GlobalSearchComponent', () => {
component.query = 'test'
component.runFullSearch()
expect(qfSpy).toHaveBeenCalledWith([
{ rule_type: FILTER_SIMPLE_TEXT, value: 'test' },
{ rule_type: FILTER_TITLE_CONTENT, value: 'test' },
])
settingsService.set(

@@ -25,7 +25,7 @@ import {
FILTER_HAS_DOCUMENT_TYPE_ANY,
FILTER_HAS_STORAGE_PATH_ANY,
FILTER_HAS_TAGS_ALL,
FILTER_SIMPLE_TEXT,
FILTER_TITLE_CONTENT,
} from 'src/app/data/filter-rule-type'
import { ObjectWithId } from 'src/app/data/object-with-id'
import { GlobalSearchType, SETTINGS_KEYS } from 'src/app/data/ui-settings'
@@ -410,7 +410,7 @@ export class GlobalSearchComponent implements OnInit {
public runFullSearch() {
const ruleType = this.useAdvancedForFullSearch
? FILTER_FULLTEXT_QUERY
: FILTER_SIMPLE_TEXT
: FILTER_TITLE_CONTENT
this.documentService.searchQuery = this.useAdvancedForFullSearch
? this.query
: ''

@@ -4,7 +4,7 @@ import { ComponentFixture, TestBed } from '@angular/core/testing'
import { By } from '@angular/platform-browser'
import { NgbAccordionButton, NgbActiveModal } from '@ng-bootstrap/ng-bootstrap'
import { of, throwError } from 'rxjs'
import { FILTER_SIMPLE_TITLE } from 'src/app/data/filter-rule-type'
import { FILTER_TITLE } from 'src/app/data/filter-rule-type'
import { DocumentService } from 'src/app/services/rest/document.service'
import { StoragePathService } from 'src/app/services/rest/storage-path.service'
import { SettingsService } from 'src/app/services/settings.service'
@@ -105,7 +105,7 @@ describe('StoragePathEditDialogComponent', () => {
null,
'created',
true,
[{ rule_type: FILTER_SIMPLE_TITLE, value: 'bar' }],
[{ rule_type: FILTER_TITLE, value: 'bar' }],
{ truncate_content: true }
)
listSpy.mockReturnValueOnce(

@@ -23,7 +23,7 @@ import {
} from 'rxjs'
import { EditDialogComponent } from 'src/app/components/common/edit-dialog/edit-dialog.component'
import { Document } from 'src/app/data/document'
import { FILTER_SIMPLE_TITLE } from 'src/app/data/filter-rule-type'
import { FILTER_TITLE } from 'src/app/data/filter-rule-type'
import { DEFAULT_MATCHING_ALGORITHM } from 'src/app/data/matching-model'
import { StoragePath } from 'src/app/data/storage-path'
import { IfOwnerDirective } from 'src/app/directives/if-owner.directive'
@@ -146,7 +146,7 @@ export class StoragePathEditDialogComponent
null,
'created',
true,
[{ rule_type: FILTER_SIMPLE_TITLE, value: title }],
[{ rule_type: FILTER_TITLE, value: title }],
{ truncate_content: true }
)
.pipe(

@@ -3,7 +3,7 @@ import { provideHttpClientTesting } from '@angular/common/http/testing'
import { ComponentFixture, TestBed } from '@angular/core/testing'
import { NG_VALUE_ACCESSOR } from '@angular/forms'
import { of, throwError } from 'rxjs'
import { FILTER_SIMPLE_TITLE } from 'src/app/data/filter-rule-type'
import { FILTER_TITLE } from 'src/app/data/filter-rule-type'
import { DocumentService } from 'src/app/services/rest/document.service'
import { DocumentLinkComponent } from './document-link.component'
@@ -99,7 +99,7 @@ describe('DocumentLinkComponent', () => {
null,
'created',
true,
[{ rule_type: FILTER_SIMPLE_TITLE, value: 'bar' }],
[{ rule_type: FILTER_TITLE, value: 'bar' }],
{ truncate_content: true }
)
listSpy.mockReturnValueOnce(throwError(() => new Error()))

@@ -28,7 +28,7 @@ import {
tap,
} from 'rxjs'
import { Document } from 'src/app/data/document'
import { FILTER_SIMPLE_TITLE } from 'src/app/data/filter-rule-type'
import { FILTER_TITLE } from 'src/app/data/filter-rule-type'
import { CustomDatePipe } from 'src/app/pipes/custom-date.pipe'
import { DocumentService } from 'src/app/services/rest/document.service'
import { AbstractInputComponent } from '../abstract-input'
@@ -121,7 +121,7 @@ export class DocumentLinkComponent
null,
'created',
true,
[{ rule_type: FILTER_SIMPLE_TITLE, value: title }],
[{ rule_type: FILTER_TITLE, value: title }],
{ truncate_content: true }
)
.pipe(

@@ -428,7 +428,7 @@ describe('BulkEditorComponent', () => {
req.flush(true)
expect(req.request.body).toEqual({
all: true,
filters: { title_search: 'apple' },
filters: { title__icontains: 'apple' },
method: 'modify_tags',
parameters: { add_tags: [101], remove_tags: [] },
})

@@ -67,8 +67,6 @@ import {
FILTER_OWNER_DOES_NOT_INCLUDE,
FILTER_OWNER_ISNULL,
FILTER_SHARED_BY_USER,
FILTER_SIMPLE_TEXT,
FILTER_SIMPLE_TITLE,
FILTER_STORAGE_PATH,
FILTER_TITLE,
FILTER_TITLE_CONTENT,
@@ -314,7 +312,7 @@ describe('FilterEditorComponent', () => {
expect(component.textFilter).toEqual(null)
component.filterRules = [
{
rule_type: FILTER_SIMPLE_TEXT,
rule_type: FILTER_TITLE_CONTENT,
value: 'foo',
},
]
@@ -322,18 +320,6 @@ describe('FilterEditorComponent', () => {
expect(component.textFilterTarget).toEqual('title-content') // TEXT_FILTER_TARGET_TITLE_CONTENT
}))
it('should ingest legacy text filter rules for doc title + content', fakeAsync(() => {
expect(component.textFilter).toEqual(null)
component.filterRules = [
{
rule_type: FILTER_TITLE_CONTENT,
value: 'legacy foo',
},
]
expect(component.textFilter).toEqual('legacy foo')
expect(component.textFilterTarget).toEqual('title-content') // TEXT_FILTER_TARGET_TITLE_CONTENT
}))
it('should ingest text filter rules for doc asn', fakeAsync(() => {
expect(component.textFilter).toEqual(null)
component.filterRules = [
@@ -1131,7 +1117,7 @@ describe('FilterEditorComponent', () => {
expect(component.textFilter).toEqual('foo')
expect(component.filterRules).toEqual([
{
rule_type: FILTER_SIMPLE_TEXT,
rule_type: FILTER_TITLE_CONTENT,
value: 'foo',
},
])
@@ -1150,7 +1136,7 @@ describe('FilterEditorComponent', () => {
expect(component.textFilterTarget).toEqual('title')
expect(component.filterRules).toEqual([
{
rule_type: FILTER_SIMPLE_TITLE,
rule_type: FILTER_TITLE,
value: 'foo',
},
])
@@ -1264,12 +1250,30 @@ describe('FilterEditorComponent', () => {
])
}))
it('should convert user input to correct filter rules on custom fields query', fakeAsync(() => {
component.textFilterInput.nativeElement.value = 'foo'
component.textFilterInput.nativeElement.dispatchEvent(new Event('input'))
const textFieldTargetDropdown = fixture.debugElement.queryAll(
By.directive(NgbDropdownItem)
)[3]
textFieldTargetDropdown.triggerEventHandler('click') // TEXT_FILTER_TARGET_CUSTOM_FIELDS
fixture.detectChanges()
tick(400)
expect(component.textFilterTarget).toEqual('custom-fields')
expect(component.filterRules).toEqual([
{
rule_type: FILTER_CUSTOM_FIELDS_TEXT,
value: 'foo',
},
])
}))
it('should convert user input to correct filter rules on mime type', fakeAsync(() => {
component.textFilterInput.nativeElement.value = 'pdf'
component.textFilterInput.nativeElement.dispatchEvent(new Event('input'))
const textFieldTargetDropdown = fixture.debugElement.queryAll(
By.directive(NgbDropdownItem)
)[3]
)[4]
textFieldTargetDropdown.triggerEventHandler('click') // TEXT_FILTER_TARGET_MIME_TYPE
fixture.detectChanges()
tick(400)
@@ -1287,8 +1291,8 @@ describe('FilterEditorComponent', () => {
component.textFilterInput.nativeElement.dispatchEvent(new Event('input'))
const textFieldTargetDropdown = fixture.debugElement.queryAll(
By.directive(NgbDropdownItem)
)[4]
textFieldTargetDropdown.triggerEventHandler('click') // TEXT_FILTER_TARGET_FULLTEXT_QUERY
)[5]
textFieldTargetDropdown.triggerEventHandler('click') // TEXT_FILTER_TARGET_ASN
fixture.detectChanges()
tick(400)
expect(component.textFilterTarget).toEqual('fulltext-query')
@@ -1692,56 +1696,12 @@ describe('FilterEditorComponent', () => {
])
}))
it('should convert legacy title filters into full text query when adding a created relative date', fakeAsync(() => {
component.filterRules = [
{
rule_type: FILTER_TITLE,
value: 'foo',
},
]
const dateCreatedDropdown = fixture.debugElement.queryAll(
By.directive(DatesDropdownComponent)
)[0]
component.dateCreatedRelativeDate = RelativeDate.WITHIN_1_WEEK
dateCreatedDropdown.triggerEventHandler('datesSet')
fixture.detectChanges()
tick(400)
expect(component.filterRules).toEqual([
{
rule_type: FILTER_FULLTEXT_QUERY,
value: 'foo,created:[-1 week to now]',
},
])
}))
it('should convert simple title filters into full text query when adding a created relative date', fakeAsync(() => {
component.filterRules = [
{
rule_type: FILTER_SIMPLE_TITLE,
value: 'foo',
},
]
const dateCreatedDropdown = fixture.debugElement.queryAll(
By.directive(DatesDropdownComponent)
)[0]
component.dateCreatedRelativeDate = RelativeDate.WITHIN_1_WEEK
dateCreatedDropdown.triggerEventHandler('datesSet')
fixture.detectChanges()
tick(400)
expect(component.filterRules).toEqual([
{
rule_type: FILTER_FULLTEXT_QUERY,
value: 'foo,created:[-1 week to now]',
},
])
}))
it('should leave relative dates not in quick list intact', fakeAsync(() => {
component.textFilterInput.nativeElement.value = 'created:[-2 week to now]'
component.textFilterInput.nativeElement.dispatchEvent(new Event('input'))
const textFieldTargetDropdown = fixture.debugElement.queryAll(
By.directive(NgbDropdownItem)
)[4]
)[5]
textFieldTargetDropdown.triggerEventHandler('click')
fixture.detectChanges()
tick(400)
@@ -2071,30 +2031,12 @@ describe('FilterEditorComponent', () => {
component.filterRules = [
{
rule_type: FILTER_SIMPLE_TITLE,
rule_type: FILTER_TITLE,
value: 'foo',
},
]
expect(component.generateFilterName()).toEqual('Title: foo')
component.filterRules = [
{
rule_type: FILTER_TITLE_CONTENT,
value: 'legacy foo',
},
]
expect(component.generateFilterName()).toEqual(
'Title & content: legacy foo'
)
component.filterRules = [
{
rule_type: FILTER_SIMPLE_TEXT,
value: 'foo',
},
]
expect(component.generateFilterName()).toEqual('Title & content: foo')
component.filterRules = [
{
rule_type: FILTER_ASN,
@@ -2214,36 +2156,6 @@ describe('FilterEditorComponent', () => {
})
})
it('should hide deprecated custom fields target from default text filter targets', () => {
expect(component.textFilterTargets).not.toContainEqual({
id: 'custom-fields',
name: $localize`Custom fields (Deprecated)`,
})
})
it('should keep deprecated custom fields target available for legacy filters', fakeAsync(() => {
component.filterRules = [
{
rule_type: FILTER_CUSTOM_FIELDS_TEXT,
value: 'foo',
},
]
fixture.detectChanges()
tick()
expect(component.textFilterTarget).toEqual('custom-fields')
expect(component.textFilterTargets).toContainEqual({
id: 'custom-fields',
name: $localize`Custom fields (Deprecated)`,
})
expect(component.filterRules).toEqual([
{
rule_type: FILTER_CUSTOM_FIELDS_TEXT,
value: 'foo',
},
])
}))
it('should call autocomplete endpoint on input', fakeAsync(() => {
component.textFilterTarget = 'fulltext-query' // TEXT_FILTER_TARGET_FULLTEXT_QUERY
const autocompleteSpy = jest.spyOn(searchService, 'autocomplete')

@@ -71,8 +71,6 @@ import {
FILTER_OWNER_DOES_NOT_INCLUDE,
FILTER_OWNER_ISNULL,
FILTER_SHARED_BY_USER,
FILTER_SIMPLE_TEXT,
FILTER_SIMPLE_TITLE,
FILTER_STORAGE_PATH,
FILTER_TITLE,
FILTER_TITLE_CONTENT,
@@ -197,6 +195,10 @@ const DEFAULT_TEXT_FILTER_TARGET_OPTIONS = [
name: $localize`Title & content`,
},
{ id: TEXT_FILTER_TARGET_ASN, name: $localize`ASN` },
{
id: TEXT_FILTER_TARGET_CUSTOM_FIELDS,
name: $localize`Custom fields`,
},
{ id: TEXT_FILTER_TARGET_MIME_TYPE, name: $localize`File type` },
{
id: TEXT_FILTER_TARGET_FULLTEXT_QUERY,
@@ -204,12 +206,6 @@ const DEFAULT_TEXT_FILTER_TARGET_OPTIONS = [
},
]
const DEPRECATED_CUSTOM_FIELDS_TEXT_FILTER_TARGET_OPTION = {
// Kept only so legacy saved views can render and be edited away from, remove me eventually
id: TEXT_FILTER_TARGET_CUSTOM_FIELDS,
name: $localize`Custom fields (Deprecated)`,
}
const TEXT_FILTER_TARGET_MORELIKE_OPTION = {
id: TEXT_FILTER_TARGET_FULLTEXT_MORELIKE,
name: $localize`More like`,
@@ -322,13 +318,8 @@ export class FilterEditorComponent
return $localize`Custom fields query`
case FILTER_TITLE:
case FILTER_SIMPLE_TITLE:
return $localize`Title: ${rule.value}`
case FILTER_TITLE_CONTENT:
case FILTER_SIMPLE_TEXT:
return $localize`Title & content: ${rule.value}`
case FILTER_ASN:
return $localize`ASN: ${rule.value}`
@@ -362,16 +353,12 @@ export class FilterEditorComponent
_moreLikeDoc: Document
get textFilterTargets() {
let targets = DEFAULT_TEXT_FILTER_TARGET_OPTIONS
if (this.textFilterTarget == TEXT_FILTER_TARGET_FULLTEXT_MORELIKE) {
targets = targets.concat([TEXT_FILTER_TARGET_MORELIKE_OPTION])
}
if (this.textFilterTarget == TEXT_FILTER_TARGET_CUSTOM_FIELDS) {
targets = targets.concat([
DEPRECATED_CUSTOM_FIELDS_TEXT_FILTER_TARGET_OPTION,
return DEFAULT_TEXT_FILTER_TARGET_OPTIONS.concat([
TEXT_FILTER_TARGET_MORELIKE_OPTION,
])
}
return targets
return DEFAULT_TEXT_FILTER_TARGET_OPTIONS
}
textFilterTarget = TEXT_FILTER_TARGET_TITLE_CONTENT
@@ -450,12 +437,10 @@ export class FilterEditorComponent
value.forEach((rule) => {
switch (rule.rule_type) {
case FILTER_TITLE:
case FILTER_SIMPLE_TITLE:
this._textFilter = rule.value
this.textFilterTarget = TEXT_FILTER_TARGET_TITLE
break
case FILTER_TITLE_CONTENT:
case FILTER_SIMPLE_TEXT:
this._textFilter = rule.value
this.textFilterTarget = TEXT_FILTER_TARGET_TITLE_CONTENT
break
@@ -777,15 +762,12 @@ export class FilterEditorComponent
this.textFilterTarget == TEXT_FILTER_TARGET_TITLE_CONTENT
) {
filterRules.push({
rule_type: FILTER_SIMPLE_TEXT,
rule_type: FILTER_TITLE_CONTENT,
value: this._textFilter.trim(),
})
}
if (this._textFilter && this.textFilterTarget == TEXT_FILTER_TARGET_TITLE) {
filterRules.push({
rule_type: FILTER_SIMPLE_TITLE,
value: this._textFilter,
})
filterRules.push({ rule_type: FILTER_TITLE, value: this._textFilter })
}
if (this.textFilterTarget == TEXT_FILTER_TARGET_ASN) {
if (
@@ -1027,10 +1009,7 @@ export class FilterEditorComponent
) {
existingRule = filterRules.find(
(fr) =>
fr.rule_type == FILTER_TITLE_CONTENT ||
fr.rule_type == FILTER_SIMPLE_TEXT ||
fr.rule_type == FILTER_TITLE ||
fr.rule_type == FILTER_SIMPLE_TITLE
fr.rule_type == FILTER_TITLE_CONTENT || fr.rule_type == FILTER_TITLE
)
existingRule.rule_type = FILTER_FULLTEXT_QUERY
}

@@ -3,7 +3,7 @@ import { DataType } from './datatype'
export const NEGATIVE_NULL_FILTER_VALUE = -1
// These correspond to src/documents/models.py and changes here require a DB migration (and vice versa)
export const FILTER_TITLE = 0 // Deprecated in favor of Tantivy-backed `title_search`. Keep for now for existing saved views
export const FILTER_TITLE = 0
export const FILTER_CONTENT = 1
export const FILTER_ASN = 2
@@ -46,9 +46,7 @@ export const FILTER_ADDED_FROM = 46
export const FILTER_MODIFIED_BEFORE = 15
export const FILTER_MODIFIED_AFTER = 16
export const FILTER_TITLE_CONTENT = 19 // Deprecated in favor of Tantivy-backed `text` filtervar. Keep for now for existing saved views
export const FILTER_SIMPLE_TITLE = 48
export const FILTER_SIMPLE_TEXT = 49
export const FILTER_TITLE_CONTENT = 19
export const FILTER_FULLTEXT_QUERY = 20
export const FILTER_FULLTEXT_MORELIKE = 21
@@ -58,7 +56,7 @@ export const FILTER_OWNER_ISNULL = 34
export const FILTER_OWNER_DOES_NOT_INCLUDE = 35
export const FILTER_SHARED_BY_USER = 37
export const FILTER_CUSTOM_FIELDS_TEXT = 36 // Deprecated. UI no longer includes CF text-search mode. Keep for now for existing saved views
export const FILTER_CUSTOM_FIELDS_TEXT = 36
export const FILTER_HAS_CUSTOM_FIELDS_ALL = 38
export const FILTER_HAS_CUSTOM_FIELDS_ANY = 39
export const FILTER_DOES_NOT_HAVE_CUSTOM_FIELDS = 40
@@ -68,9 +66,6 @@ export const FILTER_CUSTOM_FIELDS_QUERY = 42
export const FILTER_MIME_TYPE = 47
export const SIMPLE_TEXT_PARAMETER = 'text'
export const SIMPLE_TITLE_PARAMETER = 'title_search'
export const FILTER_RULE_TYPES: FilterRuleType[] = [
{
id: FILTER_TITLE,
@@ -79,13 +74,6 @@ export const FILTER_RULE_TYPES: FilterRuleType[] = [
multi: false,
default: '',
},
{
id: FILTER_SIMPLE_TITLE,
filtervar: SIMPLE_TITLE_PARAMETER,
datatype: 'string',
multi: false,
default: '',
},
{
id: FILTER_CONTENT,
filtervar: 'content__icontains',
@@ -291,12 +279,6 @@ export const FILTER_RULE_TYPES: FilterRuleType[] = [
datatype: 'string',
multi: false,
},
{
id: FILTER_SIMPLE_TEXT,
filtervar: SIMPLE_TEXT_PARAMETER,
datatype: 'string',
multi: false,
},
{
id: FILTER_FULLTEXT_QUERY,
filtervar: 'query',

@@ -10,7 +10,7 @@ import {
DOCUMENT_SORT_FIELDS,
DOCUMENT_SORT_FIELDS_FULLTEXT,
} from 'src/app/data/document'
import { FILTER_SIMPLE_TITLE } from 'src/app/data/filter-rule-type'
import { FILTER_TITLE } from 'src/app/data/filter-rule-type'
import { SETTINGS_KEYS } from 'src/app/data/ui-settings'
import { environment } from 'src/environments/environment'
import { PermissionsService } from '../permissions.service'
@@ -138,13 +138,13 @@ describe(`DocumentService`, () => {
subscription = service
.listAllFilteredIds([
{
rule_type: FILTER_SIMPLE_TITLE,
rule_type: FILTER_TITLE,
value: 'apple',
},
])
.subscribe()
const req = httpTestingController.expectOne(
`${environment.apiBaseUrl}${endpoint}/?page=1&page_size=100000&fields=id&title_search=apple`
`${environment.apiBaseUrl}${endpoint}/?page=1&page_size=100000&fields=id&title__icontains=apple`
)
expect(req.request.method).toEqual('GET')
})

@@ -8,10 +8,6 @@ import {
FILTER_HAS_CUSTOM_FIELDS_ALL,
FILTER_HAS_CUSTOM_FIELDS_ANY,
FILTER_HAS_TAGS_ALL,
FILTER_SIMPLE_TEXT,
FILTER_SIMPLE_TITLE,
FILTER_TITLE,
FILTER_TITLE_CONTENT,
NEGATIVE_NULL_FILTER_VALUE,
} from '../data/filter-rule-type'
import {
@@ -132,26 +128,6 @@ describe('QueryParams Utils', () => {
is_tagged: 0,
})
params = queryParamsFromFilterRules([
{
rule_type: FILTER_TITLE_CONTENT,
value: 'bank statement',
},
])
expect(params).toEqual({
text: 'bank statement',
})
params = queryParamsFromFilterRules([
{
rule_type: FILTER_TITLE,
value: 'invoice',
},
])
expect(params).toEqual({
title_search: 'invoice',
})
params = queryParamsFromFilterRules([
{
rule_type: FILTER_HAS_TAGS_ALL,
@@ -172,30 +148,6 @@ describe('QueryParams Utils', () => {
it('should convert filter rules to query params', () => {
let rules = filterRulesFromQueryParams(
convertToParamMap({
text: 'bank statement',
})
)
expect(rules).toEqual([
{
rule_type: FILTER_SIMPLE_TEXT,
value: 'bank statement',
},
])
rules = filterRulesFromQueryParams(
convertToParamMap({
title_search: 'invoice',
})
)
expect(rules).toEqual([
{
rule_type: FILTER_SIMPLE_TITLE,
value: 'invoice',
},
])
rules = filterRulesFromQueryParams(
convertToParamMap({
tags__id__all,
})

@@ -9,14 +9,8 @@ import {
FILTER_HAS_CUSTOM_FIELDS_ALL,
FILTER_HAS_CUSTOM_FIELDS_ANY,
FILTER_RULE_TYPES,
FILTER_SIMPLE_TEXT,
FILTER_SIMPLE_TITLE,
FILTER_TITLE,
FILTER_TITLE_CONTENT,
FilterRuleType,
NEGATIVE_NULL_FILTER_VALUE,
SIMPLE_TEXT_PARAMETER,
SIMPLE_TITLE_PARAMETER,
} from '../data/filter-rule-type'
import { ListViewState } from '../services/document-list-view.service'
@@ -103,8 +97,6 @@ export function transformLegacyFilterRules(
export function filterRulesFromQueryParams(
queryParams: ParamMap
): FilterRule[] {
let filterRulesFromQueryParams: FilterRule[] = []
const allFilterRuleQueryParams: string[] = FILTER_RULE_TYPES.map(
(rt) => rt.filtervar
)
@@ -112,6 +104,7 @@ export function filterRulesFromQueryParams(
.filter((rt) => rt !== undefined)
// transform query params to filter rules
let filterRulesFromQueryParams: FilterRule[] = []
allFilterRuleQueryParams
.filter((frqp) => queryParams.has(frqp))
.forEach((filterQueryParamName) => {
@@ -153,17 +146,7 @@ export function queryParamsFromFilterRules(filterRules: FilterRule[]): Params {
let params = {}
for (let rule of filterRules) {
let ruleType = FILTER_RULE_TYPES.find((t) => t.id == rule.rule_type)
if (
rule.rule_type === FILTER_TITLE_CONTENT ||
rule.rule_type === FILTER_SIMPLE_TEXT
) {
params[SIMPLE_TEXT_PARAMETER] = rule.value
} else if (
rule.rule_type === FILTER_TITLE ||
rule.rule_type === FILTER_SIMPLE_TITLE
) {
params[SIMPLE_TITLE_PARAMETER] = rule.value
} else if (ruleType.isnull_filtervar && rule.value == null) {
if (ruleType.isnull_filtervar && rule.value == null) {
params[ruleType.isnull_filtervar] = 1
} else if (
ruleType.isnull_filtervar &&

@@ -7,6 +7,7 @@ from dataclasses import dataclass
from pathlib import Path
from typing import TYPE_CHECKING
import regex as regex_mod
from django.conf import settings
from pdf2image import convert_from_path
from pikepdf import Page
@@ -22,6 +23,8 @@ from documents.plugins.base import ConsumeTaskPlugin
from documents.plugins.base import StopConsumeTaskError
from documents.plugins.helpers import ProgressManager
from documents.plugins.helpers import ProgressStatusOptions
from documents.regex import safe_regex_match
from documents.regex import safe_regex_sub
from documents.utils import copy_basic_file_stats
from documents.utils import copy_file_with_basic_stats
from documents.utils import maybe_override_pixel_limit
@@ -68,8 +71,8 @@ class Barcode:
Note: This does NOT exclude ASN or separator barcodes - they can also be used
as tags if they match a tag mapping pattern (e.g., {"ASN12.*": "JOHN"}).
"""
for regex in self.settings.barcode_tag_mapping:
if re.match(regex, self.value, flags=re.IGNORECASE):
for pattern in self.settings.barcode_tag_mapping:
if safe_regex_match(pattern, self.value, flags=regex_mod.IGNORECASE):
return True
return False
@@ -392,11 +395,16 @@ class BarcodePlugin(ConsumeTaskPlugin):
for raw in tag_texts.split(","):
try:
tag_str: str | None = None
for regex in self.settings.barcode_tag_mapping:
if re.match(regex, raw, flags=re.IGNORECASE):
sub = self.settings.barcode_tag_mapping[regex]
for pattern in self.settings.barcode_tag_mapping:
if safe_regex_match(pattern, raw, flags=regex_mod.IGNORECASE):
sub = self.settings.barcode_tag_mapping[pattern]
tag_str = (
re.sub(regex, sub, raw, flags=re.IGNORECASE)
safe_regex_sub(
pattern,
sub,
raw,
flags=regex_mod.IGNORECASE,
)
if sub
else raw
)
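Reviewer sketch (not part of the diff): the `documents.regex` helpers imported above are not shown in this compare view, so their real signatures and behaviour are unknown here. One plausible shape, assuming they lean on the `timeout` argument of the third-party `regex` module and treat invalid or runaway patterns as "no match", is below. The name `safe_match_sketch` and the one-second default are invented, and the fallback to the standard `re` module exists only so the sketch runs without `regex` installed (in which case no timeout is enforced).

```python
try:
    import regex as regex_mod  # third-party; match() accepts timeout=

    HAS_TIMEOUT = True
except ImportError:
    import re as regex_mod  # stdlib fallback: same API, but no timeout

    HAS_TIMEOUT = False


def safe_match_sketch(pattern: str, text: str, *, flags: int = 0, timeout: float = 1.0):
    """Match user-supplied patterns without letting them hang or crash."""
    kwargs = {"timeout": timeout} if HAS_TIMEOUT else {}
    try:
        return regex_mod.match(pattern, text, flags=flags, **kwargs)
    except (regex_mod.error, TimeoutError):
        # Invalid pattern, or matching exceeded the time budget.
        return None


hit = safe_match_sketch("ASN12.*", "asn1234", flags=regex_mod.IGNORECASE)
miss = safe_match_sketch("(", "anything")  # unbalanced parenthesis
```

Returning `None` on failure lets call sites such as the barcode tag mapping loop keep their existing truthiness checks.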

@@ -1,5 +1,6 @@
from __future__ import annotations
import hmac
import logging
import pickle
import re
@@ -75,7 +76,7 @@ def load_classifier(*, raise_exception: bool = False) -> DocumentClassifier | No
"Unrecoverable error while loading document "
"classification model, deleting model file.",
)
Path(settings.MODEL_FILE).unlink
Path(settings.MODEL_FILE).unlink()
classifier = None
if raise_exception:
raise e
@@ -97,7 +98,10 @@ class DocumentClassifier:
# v7 - Updated scikit-learn package version
# v8 - Added storage path classifier
# v9 - Changed from hashing to time/ids for re-train check
FORMAT_VERSION = 9
# v10 - HMAC-signed model file
FORMAT_VERSION = 10
HMAC_SIZE = 32 # SHA-256 digest length
def __init__(self) -> None:
# last time a document changed and therefore training might be required
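Reviewer sketch (not part of the diff): the v10 format introduced above stores a 32-byte HMAC-SHA256 digest ahead of the pickled payload, and `load()` in the next hunk checks it before unpickling. The write side is not visible in this view, so the `sign` function below is an assumption inferred from how `load()` slices the file. Both function names are illustrative.

```python
import hashlib
import hmac
import pickle

HMAC_SIZE = 32  # SHA-256 digest length, matching the constant in the diff


def sign(key: bytes, payload: bytes) -> bytes:
    """Prepend an HMAC-SHA256 digest to the payload (assumed file layout)."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def verify(key: bytes, raw: bytes) -> bytes:
    """Return the payload only if its signature checks out."""
    if len(raw) <= HMAC_SIZE:
        raise ValueError("file too short to contain a signature")
    signature, payload = raw[:HMAC_SIZE], raw[HMAC_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch")
    return payload


blob = sign(b"secret", pickle.dumps((10, "model")))
restored = pickle.loads(verify(b"secret", blob))  # unpickle only after verifying
```

The order is the point: `pickle.loads` runs only on bytes that were signed with the secret key, so a tampered file is rejected before any deserialization happens.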
@@ -128,67 +132,89 @@ class DocumentClassifier:
pickle.dumps(self.data_vectorizer),
).hexdigest()
@staticmethod
def _compute_hmac(data: bytes) -> bytes:
return hmac.new(
settings.SECRET_KEY.encode(),
data,
sha256,
).digest()
def load(self) -> None:
from sklearn.exceptions import InconsistentVersionWarning
raw = Path(settings.MODEL_FILE).read_bytes()
if len(raw) <= self.HMAC_SIZE:
raise ClassifierModelCorruptError
signature = raw[: self.HMAC_SIZE]
data = raw[self.HMAC_SIZE :]
if not hmac.compare_digest(signature, self._compute_hmac(data)):
raise ClassifierModelCorruptError
# Catch warnings for processing
with warnings.catch_warnings(record=True) as w:
with Path(settings.MODEL_FILE).open("rb") as f:
schema_version = pickle.load(f)
try:
(
schema_version,
self.last_doc_change_time,
self.last_auto_type_hash,
self.data_vectorizer,
self.tags_binarizer,
self.tags_classifier,
self.correspondent_classifier,
self.document_type_classifier,
self.storage_path_classifier,
) = pickle.loads(data)
except Exception as err:
raise ClassifierModelCorruptError from err
if schema_version != self.FORMAT_VERSION:
raise IncompatibleClassifierVersionError(
"Cannot load classifier, incompatible versions.",
)
else:
try:
self.last_doc_change_time = pickle.load(f)
self.last_auto_type_hash = pickle.load(f)
self.data_vectorizer = pickle.load(f)
self._update_data_vectorizer_hash()
self.tags_binarizer = pickle.load(f)
self.tags_classifier = pickle.load(f)
self.correspondent_classifier = pickle.load(f)
self.document_type_classifier = pickle.load(f)
self.storage_path_classifier = pickle.load(f)
except Exception as err:
raise ClassifierModelCorruptError from err
# Check for the warning about unpickling from differing versions
# and consider it incompatible
sk_learn_warning_url = (
"https://scikit-learn.org/stable/"
"model_persistence.html"
"#security-maintainability-limitations"
if schema_version != self.FORMAT_VERSION:
raise IncompatibleClassifierVersionError(
"Cannot load classifier, incompatible versions.",
)
for warning in w:
# The warning is inconsistent, the MLPClassifier is a specific warning, others have not updated yet
if issubclass(warning.category, InconsistentVersionWarning) or (
issubclass(warning.category, UserWarning)
and sk_learn_warning_url in str(warning.message)
):
raise IncompatibleClassifierVersionError("sklearn version update")
self._update_data_vectorizer_hash()
# Check for the warning about unpickling from differing versions
# and consider it incompatible
sk_learn_warning_url = (
"https://scikit-learn.org/stable/"
"model_persistence.html"
"#security-maintainability-limitations"
)
for warning in w:
# The warning is inconsistent, the MLPClassifier is a specific warning, others have not updated yet
if issubclass(warning.category, InconsistentVersionWarning) or (
issubclass(warning.category, UserWarning)
and sk_learn_warning_url in str(warning.message)
):
raise IncompatibleClassifierVersionError("sklearn version update")
def save(self) -> None:
target_file: Path = settings.MODEL_FILE
target_file_temp: Path = target_file.with_suffix(".pickle.part")
data = pickle.dumps(
(
self.FORMAT_VERSION,
self.last_doc_change_time,
self.last_auto_type_hash,
self.data_vectorizer,
self.tags_binarizer,
self.tags_classifier,
self.correspondent_classifier,
self.document_type_classifier,
self.storage_path_classifier,
),
)
signature = self._compute_hmac(data)
with target_file_temp.open("wb") as f:
pickle.dump(self.FORMAT_VERSION, f)
pickle.dump(self.last_doc_change_time, f)
pickle.dump(self.last_auto_type_hash, f)
pickle.dump(self.data_vectorizer, f)
pickle.dump(self.tags_binarizer, f)
pickle.dump(self.tags_classifier, f)
pickle.dump(self.correspondent_classifier, f)
pickle.dump(self.document_type_classifier, f)
pickle.dump(self.storage_path_classifier, f)
f.write(signature + data)
target_file_temp.rename(target_file)
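
The signed file layout used by `save()` and `load()` above (a 32-byte HMAC-SHA256 digest followed by the pickled payload) can be sketched on its own. This is a simplified illustration under stated assumptions: `sign_payload`, `load_payload`, the example key, and the use of `ValueError` in place of `ClassifierModelCorruptError` are all hypothetical stand-ins, not the project's code.

```python
import hashlib
import hmac
import pickle

HMAC_SIZE = 32  # SHA-256 digest length, as in the diff


def sign_payload(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC-SHA256 signature over the pickled bytes."""
    data = pickle.dumps(obj)
    signature = hmac.new(key, data, hashlib.sha256).digest()
    return signature + data


def load_payload(raw: bytes, key: bytes):
    """Verify the signature first; only unpickle data that passes the check."""
    if len(raw) <= HMAC_SIZE:
        raise ValueError("file too short to contain a signature and payload")
    signature, data = raw[:HMAC_SIZE], raw[HMAC_SIZE:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    # compare_digest runs in constant time to avoid leaking the signature.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch")
    return pickle.loads(data)


key = b"example-secret"  # hypothetical key; the diff derives it from SECRET_KEY
blob = sign_payload((10, "model-state"), key)
restored = load_payload(blob, key)

# Flip one bit in the payload to simulate tampering.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
```

The point of the ordering is that `pickle.loads` never sees bytes that fail verification, so a modified model file is rejected before any unpickling side effects can run. It does not help if the key itself is known to the attacker.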

View File

@@ -3,7 +3,6 @@ from __future__ import annotations
import functools
import inspect
import json
import logging
import operator
from contextlib import contextmanager
from typing import TYPE_CHECKING
@@ -78,8 +77,6 @@ DATETIME_KWARGS = [
CUSTOM_FIELD_QUERY_MAX_DEPTH = 10
CUSTOM_FIELD_QUERY_MAX_ATOMS = 20
logger = logging.getLogger("paperless.api")
class CorrespondentFilterSet(FilterSet):
class Meta:
@@ -165,13 +162,9 @@ class InboxFilter(Filter):
@extend_schema_field(serializers.CharField)
class TitleContentFilter(Filter):
# Deprecated but retained for existing saved views. UI uses Tantivy-backed `text` / `title_search` params.
def filter(self, qs: Any, value: Any) -> Any:
value = value.strip() if isinstance(value, str) else value
if value:
logger.warning(
"Deprecated document filter parameter 'title_content' used; use `text` instead.",
)
try:
return qs.filter(
Q(title__icontains=value) | Q(effective_content__icontains=value),
@@ -250,9 +243,6 @@ class CustomFieldsFilter(Filter):
def filter(self, qs, value):
value = value.strip() if isinstance(value, str) else value
if value:
logger.warning(
"Deprecated document filter parameter 'custom_fields__icontains' used; use `custom_field_query` or advanced Tantivy field syntax instead.",
)
fields_with_matching_selects = CustomField.objects.filter(
extra_data__icontains=value,
)
@@ -757,7 +747,6 @@ class DocumentFilterSet(FilterSet):
is_in_inbox = InboxFilter()
# Deprecated, but keep for now for existing saved views
title_content = TitleContentFilter()
content__istartswith = EffectiveContentFilter(lookup_expr="istartswith")
@@ -767,7 +756,6 @@ class DocumentFilterSet(FilterSet):
owner__id__none = ObjectFilter(field_name="owner", exclude=True)
# Deprecated, UI no longer includes CF text-search mode, but keep for now for existing saved views
custom_fields__icontains = CustomFieldsFilter()
custom_fields__id__all = ObjectFilter(field_name="custom_fields__field")

View File

@@ -1,92 +0,0 @@
# Generated by Django 5.2.12 on 2026-04-01 18:20
from django.db import migrations
from django.db import models
OLD_TITLE_RULE = 0
OLD_TITLE_CONTENT_RULE = 19
NEW_SIMPLE_TITLE_RULE = 48
NEW_SIMPLE_TEXT_RULE = 49
# See documents/models.py SavedViewFilterRule
def migrate_saved_view_rules_forward(apps, schema_editor):
SavedViewFilterRule = apps.get_model("documents", "SavedViewFilterRule")
SavedViewFilterRule.objects.filter(rule_type=OLD_TITLE_RULE).update(
rule_type=NEW_SIMPLE_TITLE_RULE,
)
SavedViewFilterRule.objects.filter(rule_type=OLD_TITLE_CONTENT_RULE).update(
rule_type=NEW_SIMPLE_TEXT_RULE,
)
class Migration(migrations.Migration):
dependencies = [
("documents", "0017_migrate_fulltext_query_field_prefixes"),
]
operations = [
migrations.AlterField(
model_name="savedviewfilterrule",
name="rule_type",
field=models.PositiveSmallIntegerField(
choices=[
(0, "title contains"),
(1, "content contains"),
(2, "ASN is"),
(3, "correspondent is"),
(4, "document type is"),
(5, "is in inbox"),
(6, "has tag"),
(7, "has any tag"),
(8, "created before"),
(9, "created after"),
(10, "created year is"),
(11, "created month is"),
(12, "created day is"),
(13, "added before"),
(14, "added after"),
(15, "modified before"),
(16, "modified after"),
(17, "does not have tag"),
(18, "does not have ASN"),
(19, "title or content contains"),
(20, "fulltext query"),
(21, "more like this"),
(22, "has tags in"),
(23, "ASN greater than"),
(24, "ASN less than"),
(25, "storage path is"),
(26, "has correspondent in"),
(27, "does not have correspondent in"),
(28, "has document type in"),
(29, "does not have document type in"),
(30, "has storage path in"),
(31, "does not have storage path in"),
(32, "owner is"),
(33, "has owner in"),
(34, "does not have owner"),
(35, "does not have owner in"),
(36, "has custom field value"),
(37, "is shared by me"),
(38, "has custom fields"),
(39, "has custom field in"),
(40, "does not have custom field in"),
(41, "does not have custom field"),
(42, "custom fields query"),
(43, "created to"),
(44, "created from"),
(45, "added to"),
(46, "added from"),
(47, "mime type is"),
(48, "simple title search"),
(49, "simple text search"),
],
verbose_name="rule type",
),
),
migrations.RunPython(
migrate_saved_view_rules_forward,
migrations.RunPython.noop,
),
]

View File

@@ -623,8 +623,6 @@ class SavedViewFilterRule(models.Model):
(45, _("added to")),
(46, _("added from")),
(47, _("mime type is")),
(48, _("simple title search")),
(49, _("simple text search")),
]
saved_view = models.ForeignKey(

View File

@@ -1,9 +1,11 @@
import datetime
import re
from collections.abc import Iterator
from re import Match
import regex
from regex import Match
from documents.plugins.date_parsing.base import DateParserPluginBase
from documents.regex import safe_regex_finditer
class RegexDateParserPlugin(DateParserPluginBase):
@@ -14,7 +16,7 @@ class RegexDateParserPlugin(DateParserPluginBase):
passed to its constructor.
"""
DATE_REGEX = re.compile(
DATE_REGEX = regex.compile(
r"(\b|(?!=([_-])))(\d{1,2})[\.\/-](\d{1,2})[\.\/-](\d{4}|\d{2})(\b|(?=([_-])))|"
r"(\b|(?!=([_-])))(\d{4}|\d{2})[\.\/-](\d{1,2})[\.\/-](\d{1,2})(\b|(?=([_-])))|"
r"(\b|(?!=([_-])))(\d{1,2}[\. ]+[a-zéûäëčžúřěáíóńźçŞğü]{3,9} \d{4}|[a-zéûäëčžúřěáíóńźçŞğü]{3,9} \d{1,2}, \d{4})(\b|(?=([_-])))|"
@@ -22,7 +24,7 @@ class RegexDateParserPlugin(DateParserPluginBase):
r"(\b|(?!=([_-])))([^\W\d_]{3,9} \d{4})(\b|(?=([_-])))|"
r"(\b|(?!=([_-])))(\d{1,2}[^ 0-9]{2}[\. ]+[^ ]{3,9}[ \.\/-]\d{4})(\b|(?=([_-])))|"
r"(\b|(?!=([_-])))(\b\d{1,2}[ \.\/-][a-zéûäëčžúřěáíóńźçŞğü]{3}[ \.\/-]\d{4})(\b|(?=([_-])))",
re.IGNORECASE,
regex.IGNORECASE,
)
def _process_match(
@@ -45,7 +47,7 @@ class RegexDateParserPlugin(DateParserPluginBase):
"""
Finds all regex matches in content and yields valid dates.
"""
for m in re.finditer(self.DATE_REGEX, content):
for m in safe_regex_finditer(self.DATE_REGEX, content):
date = self._process_match(m, date_order)
if date is not None:
yield date

View File

@@ -48,3 +48,73 @@ def safe_regex_search(pattern: str, text: str, *, flags: int = 0):
textwrap.shorten(pattern, width=80, placeholder=""),
)
return None
def safe_regex_match(pattern: str, text: str, *, flags: int = 0):
"""
Run a regex match with a timeout. Returns a match object or None.
Validation errors and timeouts are logged and treated as no match.
"""
try:
validate_regex_pattern(pattern)
compiled = regex.compile(pattern, flags=flags)
except (regex.error, ValueError) as exc:
logger.error(
"Error while processing regular expression %s: %s",
textwrap.shorten(pattern, width=80, placeholder=""),
exc,
)
return None
try:
return compiled.match(text, timeout=REGEX_TIMEOUT_SECONDS)
except TimeoutError:
logger.warning(
"Regular expression matching timed out for pattern %s",
textwrap.shorten(pattern, width=80, placeholder=""),
)
return None
def safe_regex_sub(pattern: str, repl: str, text: str, *, flags: int = 0) -> str | None:
"""
Run a regex substitution with a timeout. Returns the substituted string,
or None on error/timeout.
"""
try:
validate_regex_pattern(pattern)
compiled = regex.compile(pattern, flags=flags)
except (regex.error, ValueError) as exc:
logger.error(
"Error while processing regular expression %s: %s",
textwrap.shorten(pattern, width=80, placeholder=""),
exc,
)
return None
try:
return compiled.sub(repl, text, timeout=REGEX_TIMEOUT_SECONDS)
except TimeoutError:
logger.warning(
"Regular expression substitution timed out for pattern %s",
textwrap.shorten(pattern, width=80, placeholder=""),
)
return None
def safe_regex_finditer(compiled_pattern: regex.Pattern, text: str):
"""
Run regex finditer with a timeout. Yields match objects.
Stops iteration on timeout.
"""
try:
yield from compiled_pattern.finditer(text, timeout=REGEX_TIMEOUT_SECONDS)
except TimeoutError:
logger.warning(
"Regular expression finditer timed out for pattern %s",
textwrap.shorten(compiled_pattern.pattern, width=80, placeholder=""),
)
return
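
One consequence of wrapping `finditer` in a generator, as `safe_regex_finditer` does above, is that matches yielded before a timeout are still delivered to the caller; only the remainder is dropped. The sketch below shows that control flow with the standard library alone. `stop_on_timeout` and `flaky_source` are hypothetical names, and the timeout is simulated by raising `TimeoutError` by hand rather than by the `regex` module.

```python
from collections.abc import Iterable, Iterator


def stop_on_timeout(items: Iterable[str]) -> Iterator[str]:
    """Yield items until the source raises TimeoutError, then stop quietly."""
    try:
        yield from items
    except TimeoutError:
        # Same shape as the helper in the diff: end iteration instead of raising.
        return


def flaky_source() -> Iterator[str]:
    """Simulated match stream that times out after two results."""
    yield "2024-01-02"
    yield "03.04.2025"
    raise TimeoutError("simulated timeout")


collected = list(stop_on_timeout(flaky_source()))
print(collected)
```

Callers therefore cannot tell a timed-out scan from a complete one except through the log message, which seems an acceptable trade-off for date parsing where partial results are still useful.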

View File

@@ -2,11 +2,11 @@ from __future__ import annotations
import logging
import threading
import unicodedata
from collections import Counter
from dataclasses import dataclass
from datetime import UTC
from datetime import datetime
from enum import StrEnum
from typing import TYPE_CHECKING
from typing import Self
from typing import TypedDict
@@ -19,10 +19,7 @@ from django.conf import settings
from django.utils.timezone import get_current_timezone
from guardian.shortcuts import get_users_with_perms
from documents.search._normalize import ascii_fold
from documents.search._query import build_permission_filter
from documents.search._query import parse_simple_text_query
from documents.search._query import parse_simple_title_query
from documents.search._query import parse_user_query
from documents.search._schema import _write_sentinels
from documents.search._schema import build_schema
@@ -48,10 +45,14 @@ _AUTOCOMPLETE_REGEX_TIMEOUT = 1.0 # seconds; guards against ReDoS on untrusted
T = TypeVar("T")
class SearchMode(StrEnum):
QUERY = "query"
TEXT = "text"
TITLE = "title"
def _ascii_fold(s: str) -> str:
"""
Normalize unicode to ASCII equivalent characters for search consistency.
Converts accented characters (e.g., "café") to their ASCII base forms ("cafe")
to enable cross-language searching without requiring exact diacritic matching.
"""
return unicodedata.normalize("NFD", s).encode("ascii", "ignore").decode()
def _extract_autocomplete_words(text_sources: list[str]) -> set[str]:
@@ -73,7 +74,7 @@ def _extract_autocomplete_words(text_sources: list[str]) -> set[str]:
)
continue
for token in tokens:
normalized = ascii_fold(token.lower())
normalized = _ascii_fold(token.lower())
if normalized:
words.add(normalized)
return words
@@ -293,10 +294,8 @@ class TantivyBackend:
doc.add_text("checksum", document.checksum)
doc.add_text("title", document.title)
doc.add_text("title_sort", document.title)
doc.add_text("simple_title", document.title)
doc.add_text("content", content)
doc.add_text("bigram_content", content)
doc.add_text("simple_content", content)
# Original filename - only add if not None/empty
if document.original_filename:
@@ -434,7 +433,6 @@ class TantivyBackend:
sort_field: str | None,
*,
sort_reverse: bool,
search_mode: SearchMode = SearchMode.QUERY,
) -> SearchResults:
"""
Execute a search query against the document index.
@@ -443,32 +441,20 @@ class TantivyBackend:
permission filtering before executing against Tantivy. Supports both
relevance-based and field-based sorting.
QUERY search mode supports natural date keywords, field filters, etc.
TITLE search mode treats the query as plain text to search for in title only
TEXT search mode treats the query as plain text to search for in title and content
Args:
query: User's search query
query: User's search query (supports natural date keywords, field filters)
user: User for permission filtering (None for superuser/no filtering)
page: Page number (1-indexed) for pagination
page_size: Number of results per page
sort_field: Field to sort by (None for relevance ranking)
sort_reverse: Whether to reverse the sort order
search_mode: "query" for advanced Tantivy syntax, "text" for
plain-text search over title and content only, "title" for
plain-text search over title only
Returns:
SearchResults with hits, total count, and processed query
"""
self._ensure_open()
tz = get_current_timezone()
if search_mode is SearchMode.TEXT:
user_query = parse_simple_text_query(self._index, query)
elif search_mode is SearchMode.TITLE:
user_query = parse_simple_title_query(self._index, query)
else:
user_query = parse_user_query(self._index, query, tz)
user_query = parse_user_query(self._index, query, tz)
# Apply permission filter if user is not None (not superuser)
if user is not None:
@@ -608,7 +594,7 @@ class TantivyBackend:
List of word suggestions ordered by frequency, then alphabetically
"""
self._ensure_open()
normalized_term = ascii_fold(term.lower())
normalized_term = _ascii_fold(term.lower())
searcher = self._index.searcher()

View File

@@ -1,8 +0,0 @@
from __future__ import annotations
import unicodedata
def ascii_fold(text: str) -> str:
"""Normalize unicode text to ASCII equivalents for search consistency."""
return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode()
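
The folding one-liner above is easy to try in isolation. The sketch below repeats it under the same name and runs it over a few sample words chosen for illustration. It also shows a limitation worth keeping in mind: characters with no canonical decomposition, such as "ß" or CJK text, are dropped entirely rather than transliterated.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Decompose accented characters (NFD), then drop everything non-ASCII."""
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode()


# Sample inputs, invented for illustration.
samples = ["café", "Ünïcode", "straße", "東京"]
folded = {word: ascii_fold(word) for word in samples}

for word, result in folded.items():
    print(repr(word), "->", repr(result))
```

An empty result is why the autocomplete code in this change checks `if normalized:` before adding a word.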

View File

@@ -12,8 +12,6 @@ import tantivy
from dateutil.relativedelta import relativedelta
from django.conf import settings
from documents.search._normalize import ascii_fold
if TYPE_CHECKING:
from datetime import tzinfo
@@ -53,7 +51,6 @@ _WHOOSH_REL_RANGE_RE = regex.compile(
)
# Whoosh-style 8-digit date: field:YYYYMMDD — field-aware so timezone can be applied correctly
_DATE8_RE = regex.compile(r"(?P<field>\w+):(?P<date8>\d{8})\b")
_SIMPLE_QUERY_TOKEN_RE = regex.compile(r"\S+")
def _fmt(dt: datetime) -> str:
@@ -439,37 +436,7 @@ DEFAULT_SEARCH_FIELDS = [
"document_type",
"tag",
]
SIMPLE_SEARCH_FIELDS = ["simple_title", "simple_content"]
TITLE_SEARCH_FIELDS = ["simple_title"]
_FIELD_BOOSTS = {"title": 2.0}
_SIMPLE_FIELD_BOOSTS = {"simple_title": 2.0}
def _build_simple_field_query(
index: tantivy.Index,
field: str,
tokens: list[str],
) -> tantivy.Query:
patterns = []
for idx, token in enumerate(tokens):
escaped = regex.escape(token)
# For multi-token substring search, only the first token can begin mid-word.
# Later tokens follow a whitespace boundary in the original query, so anchor
# them to the start of the next indexed token to reduce false positives like
# matching "Z-Berichte 16" for the query "Z-Berichte 6".
if idx == 0:
patterns.append(f".*{escaped}.*")
else:
patterns.append(f"{escaped}.*")
if len(patterns) == 1:
query = tantivy.Query.regex_query(index.schema, field, patterns[0])
else:
query = tantivy.Query.regex_phrase_query(index.schema, field, patterns)
boost = _SIMPLE_FIELD_BOOSTS.get(field, 1.0)
if boost > 1.0:
return tantivy.Query.boost_query(query, boost)
return query
def parse_user_query(
@@ -528,52 +495,3 @@ def parse_user_query(
)
return exact
def parse_simple_query(
index: tantivy.Index,
raw_query: str,
fields: list[str],
) -> tantivy.Query:
"""
Parse a plain-text query using Tantivy over a restricted field set.
Query string is escaped and normalized to be treated as "simple" text query.
"""
tokens = [
ascii_fold(token.lower())
for token in _SIMPLE_QUERY_TOKEN_RE.findall(raw_query, timeout=_REGEX_TIMEOUT)
]
tokens = [token for token in tokens if token]
if not tokens:
return tantivy.Query.empty_query()
field_queries = [
(tantivy.Occur.Should, _build_simple_field_query(index, field, tokens))
for field in fields
]
if len(field_queries) == 1:
return field_queries[0][1]
return tantivy.Query.boolean_query(field_queries)
def parse_simple_text_query(
index: tantivy.Index,
raw_query: str,
) -> tantivy.Query:
"""
Parse a plain-text query over title/content for simple search inputs.
"""
return parse_simple_query(index, raw_query, SIMPLE_SEARCH_FIELDS)
def parse_simple_title_query(
index: tantivy.Index,
raw_query: str,
) -> tantivy.Query:
"""
Parse a plain-text query over the title field only.
"""
return parse_simple_query(index, raw_query, TITLE_SEARCH_FIELDS)

View File

@@ -53,18 +53,6 @@ def build_schema() -> tantivy.Schema:
# CJK support - not stored, indexed only
sb.add_text_field("bigram_content", stored=False, tokenizer_name="bigram_analyzer")
# Simple substring search support for title/content - not stored, indexed only
sb.add_text_field(
"simple_title",
stored=False,
tokenizer_name="simple_search_analyzer",
)
sb.add_text_field(
"simple_content",
stored=False,
tokenizer_name="simple_search_analyzer",
)
# Autocomplete prefix scan - stored, not indexed
sb.add_text_field("autocomplete_word", stored=True, tokenizer_name="raw")

View File

@@ -70,7 +70,6 @@ def register_tokenizers(index: tantivy.Index, language: str | None) -> None:
index.register_tokenizer("paperless_text", _paperless_text(language))
index.register_tokenizer("simple_analyzer", _simple_analyzer())
index.register_tokenizer("bigram_analyzer", _bigram_analyzer())
index.register_tokenizer("simple_search_analyzer", _simple_search_analyzer())
# Fast-field tokenizer required for fast=True text fields in the schema
index.register_fast_field_tokenizer("simple_analyzer", _simple_analyzer())
@@ -115,15 +114,3 @@ def _bigram_analyzer() -> tantivy.TextAnalyzer:
.filter(tantivy.Filter.lowercase())
.build()
)
def _simple_search_analyzer() -> tantivy.TextAnalyzer:
"""Tokenizer for simple substring search fields: non-whitespace chunks -> lowercase -> ascii_fold."""
return (
tantivy.TextAnalyzerBuilder(
tantivy.Tokenizer.regex(r"\S+"),
)
.filter(tantivy.Filter.lowercase())
.filter(tantivy.Filter.ascii_fold())
.build()
)

View File

@@ -5,7 +5,6 @@ from documents.models import CustomField
from documents.models import CustomFieldInstance
from documents.models import Document
from documents.models import Note
from documents.search._backend import SearchMode
from documents.search._backend import TantivyBackend
from documents.search._backend import get_backend
from documents.search._backend import reset_backend
@@ -47,258 +46,6 @@ class TestWriteBatch:
class TestSearch:
"""Test search functionality."""
def test_text_mode_limits_default_search_to_title_and_content(
self,
backend: TantivyBackend,
):
"""Simple text mode must not match metadata-only fields."""
doc = Document.objects.create(
title="Invoice document",
content="monthly statement",
checksum="TXT1",
pk=9,
)
backend.add_or_update(doc)
metadata_only = backend.search(
"document_type:invoice",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert metadata_only.total == 0
content_match = backend.search(
"monthly",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert content_match.total == 1
def test_title_mode_limits_default_search_to_title_only(
self,
backend: TantivyBackend,
):
"""Title mode must not match content-only terms."""
doc = Document.objects.create(
title="Invoice document",
content="monthly statement",
checksum="TXT2",
pk=10,
)
backend.add_or_update(doc)
content_only = backend.search(
"monthly",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TITLE,
)
assert content_only.total == 0
title_match = backend.search(
"invoice",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TITLE,
)
assert title_match.total == 1
def test_text_mode_matches_partial_term_substrings(
self,
backend: TantivyBackend,
):
"""Simple text mode should support substring matching within tokens."""
doc = Document.objects.create(
title="Account access",
content="password reset instructions",
checksum="TXT3",
pk=11,
)
backend.add_or_update(doc)
prefix_match = backend.search(
"pass",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert prefix_match.total == 1
infix_match = backend.search(
"sswo",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert infix_match.total == 1
phrase_match = backend.search(
"sswo re",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert phrase_match.total == 1
def test_text_mode_does_not_match_on_partial_term_overlap(
self,
backend: TantivyBackend,
):
"""Simple text mode should not match documents that merely share partial fragments."""
doc = Document.objects.create(
title="Adobe Acrobat PDF Files",
content="Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
checksum="TXT7",
pk=13,
)
backend.add_or_update(doc)
non_match = backend.search(
"raptor",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert non_match.total == 0
def test_text_mode_anchors_later_query_tokens_to_token_starts(
self,
backend: TantivyBackend,
):
"""Multi-token simple search should not match later tokens in the middle of a word."""
exact_doc = Document.objects.create(
title="Z-Berichte 6",
content="monthly report",
checksum="TXT9",
pk=15,
)
prefix_doc = Document.objects.create(
title="Z-Berichte 60",
content="monthly report",
checksum="TXT10",
pk=16,
)
false_positive = Document.objects.create(
title="Z-Berichte 16",
content="monthly report",
checksum="TXT11",
pk=17,
)
backend.add_or_update(exact_doc)
backend.add_or_update(prefix_doc)
backend.add_or_update(false_positive)
results = backend.search(
"Z-Berichte 6",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
result_ids = {hit["id"] for hit in results.hits}
assert exact_doc.id in result_ids
assert prefix_doc.id in result_ids
assert false_positive.id not in result_ids
def test_text_mode_ignores_queries_without_searchable_tokens(
self,
backend: TantivyBackend,
):
"""Simple text mode should safely return no hits for symbol-only strings."""
doc = Document.objects.create(
title="Guide",
content="This is a guide.",
checksum="TXT8",
pk=14,
)
backend.add_or_update(doc)
no_tokens = backend.search(
"!!!",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
assert no_tokens.total == 0
def test_title_mode_matches_partial_term_substrings(
self,
backend: TantivyBackend,
):
"""Title mode should support substring matching within title tokens."""
doc = Document.objects.create(
title="Password guide",
content="reset instructions",
checksum="TXT4",
pk=12,
)
backend.add_or_update(doc)
prefix_match = backend.search(
"pass",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TITLE,
)
assert prefix_match.total == 1
infix_match = backend.search(
"sswo",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TITLE,
)
assert infix_match.total == 1
phrase_match = backend.search(
"sswo gu",
user=None,
page=1,
page_size=10,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TITLE,
)
assert phrase_match.total == 1
def test_scores_normalised_top_hit_is_one(self, backend: TantivyBackend):
"""Search scores must be normalized so top hit has score 1.0 for UI consistency."""
for i, title in enumerate(["bank invoice", "bank statement", "bank receipt"]):

View File

@@ -8,7 +8,6 @@ import tantivy
from documents.search._tokenizer import _bigram_analyzer
from documents.search._tokenizer import _paperless_text
from documents.search._tokenizer import _simple_search_analyzer
from documents.search._tokenizer import register_tokenizers
if TYPE_CHECKING:
@@ -42,20 +41,6 @@ class TestTokenizers:
idx.register_tokenizer("bigram_analyzer", _bigram_analyzer())
return idx
@pytest.fixture
def simple_search_index(self) -> tantivy.Index:
"""Index with simple-search field for Latin substring tests."""
sb = tantivy.SchemaBuilder()
sb.add_text_field(
"simple_content",
stored=False,
tokenizer_name="simple_search_analyzer",
)
schema = sb.build()
idx = tantivy.Index(schema, path=None)
idx.register_tokenizer("simple_search_analyzer", _simple_search_analyzer())
return idx
def test_ascii_fold_finds_accented_content(
self,
content_index: tantivy.Index,
@@ -81,24 +66,6 @@ class TestTokenizers:
q = bigram_index.parse_query("東京", ["bigram_content"])
assert bigram_index.searcher().search(q, limit=5).count == 1
def test_simple_search_analyzer_supports_regex_substrings(
self,
simple_search_index: tantivy.Index,
) -> None:
"""Whitespace-preserving simple search analyzer supports substring regex matching."""
writer = simple_search_index.writer()
doc = tantivy.Document()
doc.add_text("simple_content", "tag:invoice password-reset")
writer.add_document(doc)
writer.commit()
simple_search_index.reload()
q = tantivy.Query.regex_query(
simple_search_index.schema,
"simple_content",
".*sswo.*",
)
assert simple_search_index.searcher().search(q, limit=5).count == 1
def test_unsupported_language_logs_warning(self, caplog: LogCaptureFixture) -> None:
"""Unsupported language codes should log a warning and disable stemming gracefully."""
sb = tantivy.SchemaBuilder()

View File

@@ -91,135 +91,6 @@ class TestDocumentSearchApi(DirectoriesMixin, APITestCase):
self.assertEqual(response.data["count"], 0)
self.assertEqual(len(results), 0)
def test_simple_text_search(self) -> None:
tagged = Tag.objects.create(name="invoice")
matching_doc = Document.objects.create(
title="Quarterly summary",
content="Monthly bank report",
checksum="T1",
pk=11,
)
matching_doc.tags.add(tagged)
metadata_only_doc = Document.objects.create(
title="Completely unrelated",
content="No matching terms here",
checksum="T2",
pk=12,
)
metadata_only_doc.tags.add(tagged)
backend = get_backend()
backend.add_or_update(matching_doc)
backend.add_or_update(metadata_only_doc)
response = self.client.get("/api/documents/?text=monthly")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], matching_doc.id)
response = self.client.get("/api/documents/?text=tag:invoice")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 0)
def test_simple_text_search_matches_substrings(self) -> None:
matching_doc = Document.objects.create(
title="Quarterly summary",
content="Password reset instructions",
checksum="T5",
pk=15,
)
backend = get_backend()
backend.add_or_update(matching_doc)
response = self.client.get("/api/documents/?text=pass")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], matching_doc.id)
response = self.client.get("/api/documents/?text=sswo")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], matching_doc.id)
response = self.client.get("/api/documents/?text=sswo re")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], matching_doc.id)
def test_simple_text_search_does_not_match_on_partial_term_overlap(self) -> None:
non_matching_doc = Document.objects.create(
title="Adobe Acrobat PDF Files",
content="Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
checksum="T7",
pk=17,
)
backend = get_backend()
backend.add_or_update(non_matching_doc)
response = self.client.get("/api/documents/?text=raptor")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 0)
def test_simple_title_search(self) -> None:
title_match = Document.objects.create(
title="Quarterly summary",
content="No matching content here",
checksum="T3",
pk=13,
)
content_only = Document.objects.create(
title="Completely unrelated",
content="Quarterly summary appears only in content",
checksum="T4",
pk=14,
)
backend = get_backend()
backend.add_or_update(title_match)
backend.add_or_update(content_only)
response = self.client.get("/api/documents/?title_search=quarterly")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], title_match.id)
def test_simple_title_search_matches_substrings(self) -> None:
title_match = Document.objects.create(
title="Password handbook",
content="No matching content here",
checksum="T6",
pk=16,
)
backend = get_backend()
backend.add_or_update(title_match)
response = self.client.get("/api/documents/?title_search=pass")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], title_match.id)
response = self.client.get("/api/documents/?title_search=sswo")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], title_match.id)
response = self.client.get("/api/documents/?title_search=sswo hand")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(response.data["count"], 1)
self.assertEqual(response.data["results"][0]["id"], title_match.id)
def test_search_rejects_multiple_search_modes(self) -> None:
response = self.client.get("/api/documents/?text=bank&query=bank")
self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST)
self.assertEqual(
response.data["detail"],
"Specify only one of text, title_search, query, or more_like_id.",
)
def test_search_returns_all_for_api_version_9(self) -> None:
d1 = Document.objects.create(
title="invoice",
@@ -1622,31 +1493,6 @@ class TestDocumentSearchApi(DirectoriesMixin, APITestCase):
self.assertEqual(results["custom_fields"][0]["id"], custom_field1.id)
self.assertEqual(results["workflows"][0]["id"], workflow1.id)
def test_global_search_db_only_limits_documents_to_title_matches(self) -> None:
title_match = Document.objects.create(
title="bank statement",
content="no additional terms",
checksum="GS1",
pk=21,
)
content_only = Document.objects.create(
title="not a title match",
content="bank appears only in content",
checksum="GS2",
pk=22,
)
backend = get_backend()
backend.add_or_update(title_match)
backend.add_or_update(content_only)
self.client.force_authenticate(self.user)
response = self.client.get("/api/search/?query=bank&db_only=true")
self.assertEqual(response.status_code, status.HTTP_200_OK)
self.assertEqual(len(response.data["documents"]), 1)
self.assertEqual(response.data["documents"][0]["id"], title_match.id)
def test_global_search_filters_owned_mail_objects(self) -> None:
user1 = User.objects.create_user("mail-search-user")
user2 = User.objects.create_user("other-mail-search-user")


@@ -1,5 +1,5 @@
import re
import shutil
import warnings
from pathlib import Path
from unittest import mock
@@ -366,8 +366,7 @@ class TestClassifier(DirectoriesMixin, TestCase):
self.assertCountEqual(new_classifier.predict_tags(self.doc2.content), [45, 12])
@mock.patch("documents.classifier.pickle.load")
def test_load_corrupt_file(self, patched_pickle_load: mock.MagicMock) -> None:
def test_load_corrupt_file(self) -> None:
"""
GIVEN:
- Corrupted classifier pickle file
@@ -378,36 +377,116 @@ class TestClassifier(DirectoriesMixin, TestCase):
"""
self.generate_train_and_save()
# First load is the schema version, allow it
patched_pickle_load.side_effect = [DocumentClassifier.FORMAT_VERSION, OSError()]
# Write garbage data (valid HMAC length but invalid content)
Path(settings.MODEL_FILE).write_bytes(b"\x00" * 64)
with self.assertRaises(ClassifierModelCorruptError):
self.classifier.load()
patched_pickle_load.assert_called()
patched_pickle_load.reset_mock()
patched_pickle_load.side_effect = [
DocumentClassifier.FORMAT_VERSION,
ClassifierModelCorruptError(),
]
self.assertIsNone(load_classifier())
patched_pickle_load.assert_called()
def test_load_corrupt_pickle_valid_hmac(self) -> None:
"""
GIVEN:
- A classifier file with valid HMAC but unparsable pickle data
WHEN:
- An attempt is made to load the classifier
THEN:
- The ClassifierModelCorruptError is raised
"""
garbage_data = b"this is not valid pickle data"
signature = DocumentClassifier._compute_hmac(garbage_data)
Path(settings.MODEL_FILE).write_bytes(signature + garbage_data)
with self.assertRaises(ClassifierModelCorruptError):
self.classifier.load()
def test_load_tampered_file(self) -> None:
"""
GIVEN:
- A classifier model file whose data has been modified
WHEN:
- An attempt is made to load the classifier
THEN:
- The ClassifierModelCorruptError is raised due to HMAC mismatch
"""
self.generate_train_and_save()
raw = Path(settings.MODEL_FILE).read_bytes()
# Flip a byte in the data portion (after the 32-byte HMAC)
tampered = raw[:32] + bytes([raw[32] ^ 0xFF]) + raw[33:]
Path(settings.MODEL_FILE).write_bytes(tampered)
with self.assertRaises(ClassifierModelCorruptError):
self.classifier.load()
def test_load_wrong_secret_key(self) -> None:
"""
GIVEN:
- A classifier model file signed with a different SECRET_KEY
WHEN:
- An attempt is made to load the classifier
THEN:
- The ClassifierModelCorruptError is raised due to HMAC mismatch
"""
self.generate_train_and_save()
with override_settings(SECRET_KEY="different-secret-key"):
with self.assertRaises(ClassifierModelCorruptError):
self.classifier.load()
def test_load_truncated_file(self) -> None:
"""
GIVEN:
- A classifier model file that is too short to contain an HMAC
WHEN:
- An attempt is made to load the classifier
THEN:
- The ClassifierModelCorruptError is raised
"""
Path(settings.MODEL_FILE).write_bytes(b"\x00" * 16)
with self.assertRaises(ClassifierModelCorruptError):
self.classifier.load()
def test_load_new_scikit_learn_version(self) -> None:
"""
GIVEN:
- classifier pickle file created with a different scikit-learn version
- classifier pickle file triggers an InconsistentVersionWarning
WHEN:
- An attempt is made to load the classifier
THEN:
- The classifier reports the warning was captured and processed
- IncompatibleClassifierVersionError is raised
"""
# TODO: This no longer tested the warning, because the schema changed.
# As implemented, it would require installing an old version,
# rebuilding the file, and committing it, which is not developer friendly.
# Need to rethink how to pass the load through to a file with a single
# old model.
from sklearn.exceptions import InconsistentVersionWarning
self.generate_train_and_save()
fake_warning = warnings.WarningMessage(
message=InconsistentVersionWarning(
estimator_name="MLPClassifier",
current_sklearn_version="1.0",
original_sklearn_version="0.9",
),
category=InconsistentVersionWarning,
filename="",
lineno=0,
)
real_catch_warnings = warnings.catch_warnings
class PatchedCatchWarnings(real_catch_warnings):
def __enter__(self):
w = super().__enter__()
w.append(fake_warning)
return w
with mock.patch(
"documents.classifier.warnings.catch_warnings",
PatchedCatchWarnings,
):
with self.assertRaises(IncompatibleClassifierVersionError):
self.classifier.load()
def test_one_correspondent_predict(self) -> None:
c1 = Correspondent.objects.create(
@@ -685,17 +764,6 @@ class TestClassifier(DirectoriesMixin, TestCase):
self.assertIsNone(load_classifier())
self.assertTrue(Path(settings.MODEL_FILE).exists())
def test_load_old_classifier_version(self) -> None:
shutil.copy(
Path(__file__).parent / "data" / "v1.17.4.model.pickle",
self.dirs.scratch_dir,
)
with override_settings(
MODEL_FILE=self.dirs.scratch_dir / "v1.17.4.model.pickle",
):
classifier = load_classifier()
self.assertIsNone(classifier)
@mock.patch("documents.classifier.DocumentClassifier.load")
def test_load_classifier_raise_exception(self, mock_load) -> None:
Path(settings.MODEL_FILE).touch()
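
The classifier tests above exercise a file layout of a 32-byte signature followed by pickled data. A minimal sketch of that sign-then-verify idea follows, assuming HMAC-SHA256 (consistent with the 32-byte prefix, but not confirmed by this diff). `dump_signed`, `load_signed`, and `ModelCorruptError` are hypothetical names for illustration, not the project's API.

```python
import hashlib
import hmac
import pickle

HMAC_SIZE = 32  # SHA-256 digest length in bytes (assumed algorithm)


class ModelCorruptError(Exception):
    """Hypothetical stand-in for ClassifierModelCorruptError."""


def compute_hmac(data: bytes, secret_key: str) -> bytes:
    # Key the digest with the secret so the file cannot be re-signed without it.
    return hmac.new(secret_key.encode("utf-8"), data, hashlib.sha256).digest()


def dump_signed(obj: object, secret_key: str) -> bytes:
    # Layout: signature first, then the pickled payload.
    payload = pickle.dumps(obj)
    return compute_hmac(payload, secret_key) + payload


def load_signed(blob: bytes, secret_key: str) -> object:
    if len(blob) < HMAC_SIZE:
        raise ModelCorruptError("file too short to contain a signature")
    signature, payload = blob[:HMAC_SIZE], blob[HMAC_SIZE:]
    # Constant-time comparison; verify BEFORE unpickling anything.
    if not hmac.compare_digest(signature, compute_hmac(payload, secret_key)):
        raise ModelCorruptError("signature mismatch")
    try:
        return pickle.loads(payload)
    except Exception as exc:
        # A valid signature over unparsable data is still a corrupt model.
        raise ModelCorruptError("unparsable payload") from exc
```

Verifying the signature before calling `pickle.loads` is the point of the design: unpickling untrusted bytes can execute arbitrary code, so the payload is only parsed once it is known to have been written by a holder of the key.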


@@ -0,0 +1,128 @@
import pytest
import regex
from pytest_mock import MockerFixture
from documents.regex import safe_regex_finditer
from documents.regex import safe_regex_match
from documents.regex import safe_regex_search
from documents.regex import safe_regex_sub
from documents.regex import validate_regex_pattern
class TestValidateRegexPattern:
def test_valid_pattern(self):
validate_regex_pattern(r"\d+")
def test_invalid_pattern_raises(self):
with pytest.raises(ValueError):
validate_regex_pattern(r"[invalid")
class TestSafeRegexSearchAndMatch:
"""Tests for safe_regex_search and safe_regex_match (same contract)."""
@pytest.mark.parametrize(
("func", "pattern", "text", "expected_group"),
[
pytest.param(
safe_regex_search,
r"\d+",
"abc123def",
"123",
id="search-match-found",
),
pytest.param(
safe_regex_match,
r"\d+",
"123abc",
"123",
id="match-match-found",
),
],
)
def test_match_found(self, func, pattern, text, expected_group):
result = func(pattern, text)
assert result is not None
assert result.group() == expected_group
@pytest.mark.parametrize(
("func", "pattern", "text"),
[
pytest.param(safe_regex_search, r"\d+", "abcdef", id="search-no-match"),
pytest.param(safe_regex_match, r"\d+", "abc123", id="match-no-match"),
],
)
def test_no_match(self, func, pattern, text):
assert func(pattern, text) is None
@pytest.mark.parametrize(
"func",
[
pytest.param(safe_regex_search, id="search"),
pytest.param(safe_regex_match, id="match"),
],
)
def test_invalid_pattern_returns_none(self, func):
assert func(r"[invalid", "test") is None
@pytest.mark.parametrize(
"func",
[
pytest.param(safe_regex_search, id="search"),
pytest.param(safe_regex_match, id="match"),
],
)
def test_flags_respected(self, func):
assert func(r"abc", "ABC", flags=regex.IGNORECASE) is not None
@pytest.mark.parametrize(
("func", "method_name"),
[
pytest.param(safe_regex_search, "search", id="search"),
pytest.param(safe_regex_match, "match", id="match"),
],
)
def test_timeout_returns_none(self, func, method_name, mocker: MockerFixture):
mock_compile = mocker.patch("documents.regex.regex.compile")
getattr(mock_compile.return_value, method_name).side_effect = TimeoutError
assert func(r"\d+", "test") is None
class TestSafeRegexSub:
@pytest.mark.parametrize(
("pattern", "repl", "text", "expected"),
[
pytest.param(r"\d+", "NUM", "abc123def456", "abcNUMdefNUM", id="basic-sub"),
pytest.param(r"\d+", "NUM", "abcdef", "abcdef", id="no-match"),
pytest.param(r"abc", "X", "ABC", "X", id="flags"),
],
)
def test_substitution(self, pattern, repl, text, expected):
flags = regex.IGNORECASE if pattern == r"abc" else 0
result = safe_regex_sub(pattern, repl, text, flags=flags)
assert result == expected
def test_invalid_pattern_returns_none(self):
assert safe_regex_sub(r"[invalid", "x", "test") is None
def test_timeout_returns_none(self, mocker: MockerFixture):
mock_compile = mocker.patch("documents.regex.regex.compile")
mock_compile.return_value.sub.side_effect = TimeoutError
assert safe_regex_sub(r"\d+", "X", "test") is None
class TestSafeRegexFinditer:
def test_yields_matches(self):
pattern = regex.compile(r"\d+")
matches = list(safe_regex_finditer(pattern, "a1b22c333"))
assert [m.group() for m in matches] == ["1", "22", "333"]
def test_no_matches(self):
pattern = regex.compile(r"\d+")
assert list(safe_regex_finditer(pattern, "abcdef")) == []
def test_timeout_stops_iteration(self, mocker: MockerFixture):
mock_pattern = mocker.MagicMock()
mock_pattern.finditer.side_effect = TimeoutError
mock_pattern.pattern = r"\d+"
assert list(safe_regex_finditer(mock_pattern, "test")) == []
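
The tests above pin down a contract: an invalid pattern or a `TimeoutError` from the engine yields `None` rather than an exception. A minimal sketch of that contract follows, written against the standard library `re` module so it runs anywhere. Note the assumptions: stdlib `re` has no timeout support, so the timeout branch is only reachable with an engine that raises `TimeoutError`, and `safe_search_sketch` and its `compile_fn` parameter are hypothetical, not the functions in `documents.regex`.

```python
import re
from typing import Any, Callable, Optional


def safe_search_sketch(
    pattern: str,
    text: str,
    *,
    flags: int = 0,
    compile_fn: Callable[[str, int], Any] = re.compile,
) -> Optional[Any]:
    """Search text for pattern, returning None instead of raising."""
    try:
        compiled = compile_fn(pattern, flags)
    except re.error:
        # Invalid user-supplied pattern: treat as "no match".
        return None
    try:
        return compiled.search(text)
    except TimeoutError:
        # An engine with a time budget gave up: also treat as "no match".
        return None
```

Returning `None` in both failure cases lets callers handle a bad or pathological pattern exactly like a pattern that simply did not match.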


@@ -31,6 +31,11 @@ from paperless.models import ApplicationConfiguration
class TestViews(DirectoriesMixin, TestCase):
@classmethod
def setUpTestData(cls) -> None:
super().setUpTestData()
ApplicationConfiguration.objects.get_or_create()
def setUp(self) -> None:
self.user = User.objects.create_user("testuser")
super().setUp()


@@ -1995,23 +1995,11 @@ class ChatStreamingView(GenericAPIView):
list=extend_schema(
description="Document views including search",
parameters=[
OpenApiParameter(
name="text",
type=OpenApiTypes.STR,
location=OpenApiParameter.QUERY,
description="Simple Tantivy-backed text search query string",
),
OpenApiParameter(
name="title_search",
type=OpenApiTypes.STR,
location=OpenApiParameter.QUERY,
description="Simple Tantivy-backed title-only search query string",
),
OpenApiParameter(
name="query",
type=OpenApiTypes.STR,
location=OpenApiParameter.QUERY,
description="Advanced Tantivy search query string",
description="Advanced search query string",
),
OpenApiParameter(
name="full_perms",
@@ -2037,22 +2025,17 @@ class ChatStreamingView(GenericAPIView):
),
)
class UnifiedSearchViewSet(DocumentViewSet):
SEARCH_PARAM_NAMES = ("text", "title_search", "query", "more_like_id")
def get_serializer_class(self):
if self._is_search_request():
return SearchResultSerializer
else:
return DocumentSerializer
def _get_active_search_params(self, request: Request | None = None) -> list[str]:
request = request or self.request
return [
param for param in self.SEARCH_PARAM_NAMES if param in request.query_params
]
def _is_search_request(self):
return bool(self._get_active_search_params())
return (
"query" in self.request.query_params
or "more_like_id" in self.request.query_params
)
def list(self, request, *args, **kwargs):
if not self._is_search_request():
@@ -2060,7 +2043,6 @@ class UnifiedSearchViewSet(DocumentViewSet):
from documents.search import TantivyRelevanceList
from documents.search import get_backend
from documents.search._backend import SearchMode
try:
backend = get_backend()
@@ -2068,31 +2050,9 @@ class UnifiedSearchViewSet(DocumentViewSet):
filtered_qs = self.filter_queryset(self.get_queryset())
user = None if request.user.is_superuser else request.user
active_search_params = self._get_active_search_params(request)
if len(active_search_params) > 1:
raise ValidationError(
{
"detail": _(
"Specify only one of text, title_search, query, or more_like_id.",
),
},
)
if (
"text" in request.query_params
or "title_search" in request.query_params
or "query" in request.query_params
):
if "text" in request.query_params:
search_mode = SearchMode.TEXT
query_str = request.query_params["text"]
elif "title_search" in request.query_params:
search_mode = SearchMode.TITLE
query_str = request.query_params["title_search"]
else:
search_mode = SearchMode.QUERY
query_str = request.query_params["query"]
if "query" in request.query_params:
query_str = request.query_params["query"]
results = backend.search(
query_str,
user=user,
@@ -2100,7 +2060,6 @@ class UnifiedSearchViewSet(DocumentViewSet):
page_size=10000,
sort_field=None,
sort_reverse=False,
search_mode=search_mode,
)
else:
# more_like_id — validate permission on the seed document first
@@ -2173,8 +2132,6 @@ class UnifiedSearchViewSet(DocumentViewSet):
if str(e.detail) == str(invalid_more_like_id_message):
return HttpResponseForbidden(invalid_more_like_id_message)
return HttpResponseForbidden(_("Insufficient permissions."))
except ValidationError:
raise
except Exception as e:
logger.warning(f"An error occurred listing search results: {e!s}")
return HttpResponseBadRequest(
@@ -3046,9 +3003,6 @@ class GlobalSearchView(PassUserMixin):
serializer_class = SearchResultSerializer
def get(self, request, *args, **kwargs):
from documents.search import get_backend
from documents.search._backend import SearchMode
query = request.query_params.get("query", None)
if query is None:
return HttpResponseBadRequest("Query required")
@@ -3065,25 +3019,25 @@ class GlobalSearchView(PassUserMixin):
"view_document",
Document,
)
if db_only:
docs = all_docs.filter(title__icontains=query)[:OBJECT_LIMIT]
else:
user = None if request.user.is_superuser else request.user
# First search by title
docs = all_docs.filter(title__icontains=query)
if not db_only and len(docs) < OBJECT_LIMIT:
# If we don't have enough results, search by content.
# Over-fetch from Tantivy (no permission filter) and rely on
# the ORM all_docs queryset for authoritative permission gating.
from documents.search import get_backend
fts_results = get_backend().search(
query,
user=user,
user=None,
page=1,
page_size=1000,
sort_field=None,
sort_reverse=False,
search_mode=SearchMode.TEXT,
)
docs_by_id = all_docs.in_bulk([hit["id"] for hit in fts_results.hits])
docs = [
docs_by_id[hit["id"]]
for hit in fts_results.hits
if hit["id"] in docs_by_id
][:OBJECT_LIMIT]
fts_ids = {h["id"] for h in fts_results.hits}
docs = docs | all_docs.filter(id__in=fts_ids)
docs = docs[:OBJECT_LIMIT]
saved_views = (
get_objects_for_user_owner_aware(
request.user,


@@ -11,6 +11,7 @@ from typing import Final
from urllib.parse import urlparse
from compression_middleware.middleware import CompressionMiddleware
from django.core.exceptions import ImproperlyConfigured
from django.utils.translation import gettext_lazy as _
from dotenv import load_dotenv
@@ -161,6 +162,9 @@ REST_FRAMEWORK = {
"ALLOWED_VERSIONS": ["9", "10"],
# DRF Spectacular default schema
"DEFAULT_SCHEMA_CLASS": "drf_spectacular.openapi.AutoSchema",
"DEFAULT_THROTTLE_RATES": {
"login": os.getenv("PAPERLESS_TOKEN_THROTTLE_RATE", "5/min"),
},
}
if DEBUG:
@@ -460,13 +464,13 @@ SECURE_PROXY_SSL_HEADER = (
else None
)
# The secret key has a default that should be fine so long as you're hosting
# Paperless on a closed network. However, if you're putting this anywhere
# public, you should change the key to something unique and verbose.
SECRET_KEY = os.getenv(
"PAPERLESS_SECRET_KEY",
"e11fl1oa-*ytql8p)(06fbj4ukrlo+n7k&q5+$1md7i+mge=ee",
)
SECRET_KEY = os.getenv("PAPERLESS_SECRET_KEY", "")
if not SECRET_KEY: # pragma: no cover
raise ImproperlyConfigured(
"PAPERLESS_SECRET_KEY is not set. "
"A unique, secret key is required for secure operation. "
'Generate one with: python3 -c "import secrets; print(secrets.token_urlsafe(64))"',
)
AUTH_PASSWORD_VALIDATORS = [
{


@@ -34,6 +34,7 @@ from rest_framework.pagination import PageNumberPagination
from rest_framework.permissions import DjangoModelPermissions
from rest_framework.permissions import IsAuthenticated
from rest_framework.response import Response
from rest_framework.throttling import ScopedRateThrottle
from rest_framework.viewsets import ModelViewSet
from documents.permissions import PaperlessObjectPermissions
@@ -51,6 +52,8 @@ from paperless_ai.indexing import vector_store_file_exists
class PaperlessObtainAuthTokenView(ObtainAuthToken):
serializer_class = PaperlessAuthTokenSerializer
throttle_classes = [ScopedRateThrottle]
throttle_scope = "login"
class StandardPagination(PageNumberPagination):
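
The `"login"` throttle scope maps to a rate string such as `"5/min"`. A rough sketch of how such a string can be interpreted and enforced follows, in plain Python rather than Django REST Framework itself. `parse_rate` and `allow_request` are hypothetical helpers, and the assumption that only the first letter of the period matters reflects my understanding of DRF's rate format rather than anything shown in this diff.

```python
def parse_rate(rate: str) -> tuple[int, int]:
    """Split 'N/period' into (allowed requests, window length in seconds)."""
    num, period = rate.split("/")
    # Assumption: only the first letter of the period is significant.
    seconds = {"s": 1, "m": 60, "h": 3600, "d": 86400}[period[0]]
    return int(num), seconds


def allow_request(history: list[float], now: float, rate: str) -> bool:
    """Sliding-window check; history holds timestamps of accepted requests."""
    limit, window = parse_rate(rate)
    # Forget requests that have aged out of the window.
    history[:] = [t for t in history if t > now - window]
    if len(history) >= limit:
        return False
    history.append(now)
    return True
```

With the default of `"5/min"`, a sixth token request inside the same minute would be rejected under this model, and requests are accepted again once older ones fall outside the 60-second window.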