[GH-ISSUE #498] HTTP 500 - database is locked #380

Closed
opened 2026-05-07 00:23:39 +02:00 by BreizhHardware · 5 comments

Originally created by @binwiederhier on GitHub (Nov 14, 2022).
Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/498

I've seen a few spikes of HTTP 500s lately, all related to SQLite's "database is locked" error:

Nov 14 13:57:21 ntfy.sh ntfy[2496652]: INFO 88.99.x.x HTTP POST /upABC?up=1 Connection closed with HTTP 500 (ntfy error 50001): database is locked
Nov 14 13:57:27 ntfy.sh ntfy[2496652]: INFO 132.226.x.x HTTP POST /upDEF?up=1 Connection closed with HTTP 500 (ntfy error 50001): database is locked
Nov 14 13:57:27 ntfy.sh ntfy[2496652]: INFO 116.203.x.x HTTP POST /upXYZ?up=1 Connection closed with HTTP 500 (ntfy error 50001): database is locked
Nov 14 13:57:27 ntfy.sh ntfy[2496652]: INFO 114.132.x.x HTTP POST /abcdefg Connection closed with HTTP 500 (ntfy error 50001): database is locked

They happen every few hours or so, and in "bulk" (many at once). In the grand scheme of things it's not a lot, but it's certainly cause for concern, because traffic isn't going to decrease.


@binwiederhier commented on GitHub (Nov 15, 2022):

I did some investigating. I'll copy+paste my findings from Discord.

--

I've found this comment to be helpful https://github.com/mattn/go-sqlite3/issues/274#issuecomment-211759641

The problem is that messages are written to the database synchronously while publishing, and if there are dozens or hundreds of publishers, there are as many concurrent writers.

There's also a pruning mechanism which runs regularly.

By default, SQLite's busy timeout is 5s, after which it will return "database is locked".
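
For context, with the mattn/go-sqlite3 driver that ntfy uses, the timeout can be raised via a DSN parameter, and a commonly suggested workaround for lock contention is to serialize access through a single connection. A rough sketch, not ntfy's actual code:

package main

import (
    "database/sql"
    "log"

    _ "github.com/mattn/go-sqlite3" // the SQLite driver ntfy uses
)

func main() {
    // _busy_timeout is a go-sqlite3 DSN parameter, in milliseconds.
    db, err := sql.Open("sqlite3", "cache.db?_busy_timeout=10000")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Commonly suggested workaround: funnel everything through one
    // connection so writers queue in Go instead of colliding in SQLite.
    db.SetMaxOpenConns(1)

    // The timeout can also be set per connection with a pragma.
    if _, err := db.Exec("PRAGMA busy_timeout = 10000"); err != nil {
        log.Fatal(err)
    }
}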

I think one way to solve this (other than increasing the busy timeout) is to asynchronously queue messages for persistence, and then have a single message-writer goroutine that batch-writes all queued messages every few milliseconds.
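
A minimal sketch of that single-writer idea (the message struct, queue size, flush interval, and schema below are illustrative assumptions, not ntfy's actual code; the real implementation is tracked in #502):

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/mattn/go-sqlite3"
)

// Hypothetical message shape; ntfy's real struct has more fields.
type message struct {
    id    string
    time  int64
    topic string
}

// Publishers enqueue here instead of writing to SQLite themselves.
var messageQueue = make(chan *message, 10000)

// batchWriter is the single writer goroutine: it collects queued messages
// and flushes them in one transaction every 50ms, so N concurrent
// publishers produce one write transaction instead of N.
func batchWriter(db *sql.DB) {
    ticker := time.NewTicker(50 * time.Millisecond)
    var batch []*message
    for {
        select {
        case m := <-messageQueue:
            batch = append(batch, m)
        case <-ticker.C:
            if len(batch) == 0 {
                continue
            }
            tx, err := db.Begin()
            if err != nil {
                log.Printf("batch write: %v", err) // publishers never see this error
                continue
            }
            for _, m := range batch {
                // Column list is illustrative, not ntfy's real schema.
                if _, err := tx.Exec("INSERT INTO messages (id, time, topic) VALUES (?, ?, ?)", m.id, m.time, m.topic); err != nil {
                    log.Printf("batch write: %v", err)
                }
            }
            if err := tx.Commit(); err != nil {
                log.Printf("batch write: %v", err)
            }
            batch = batch[:0]
        }
    }
}

func main() {
    db, err := sql.Open("sqlite3", "cache.db")
    if err != nil {
        log.Fatal(err)
    }
    if _, err := db.Exec("CREATE TABLE IF NOT EXISTS messages (id TEXT, time INT, topic TEXT)"); err != nil {
        log.Fatal(err)
    }
    go batchWriter(db)
    messageQueue <- &message{id: "abc123", time: time.Now().Unix(), topic: "mytopic"}
    time.Sleep(200 * time.Millisecond) // give the writer one flush cycle
}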

The disadvantage is obviously that publishers won't get errors back when persisting fails. But the advantage is that the overall number of transactions is reduced.

There's also an issue with persisting on shutdown, which I still have to do.

I'm also not entirely sure if this violates the expected contract of the API, i.e. people expect to immediately be able to retrieve the message from the cache after it is published.

From the logs, it looks like pretty much all batches of "database is locked" errors are followed by an "INFO Stats: 85383 messages published,..." line, which aligns with my guess. It's the pruning again that locks the database for too long and leads to timeouts when publishing.

Turned on DEBUG logging, and caught one:

Nov 15 02:00:10 ntfy.sh ntfy[2501811]: DEBUG Manager: Starting
Nov 15 02:00:10 ntfy.sh ntfy[2501811]: DEBUG Manager: Deleted 0 stale visitor(s)
Nov 15 02:00:10 ntfy.sh ntfy[2501811]: DEBUG Manager: Deleting expired attachments: [KqiiVKqjBis3 e62AFNB8W89Q]
Nov 15 02:00:10 ntfy.sh ntfy[2501811]: DEBUG Manager: Pruning messages older than 2022-11-14 14:00:10
Nov 15 02:00:15 ntfy.sh ntfy[2501811]: DEBUG Manager: Finished

The whole run took about five seconds (02:00:10 to 02:00:15), right at the default busy timeout. The query is quite inefficient and doesn't use indices. Maybe the fix is as simple as adding an index.

pruneMessagesQuery = DELETE FROM messages WHERE time < ? AND published = 1

In my super unscientific test, adding an index on time made it super fast.
Methodology:

  • Copy cache.db
  • Run this:
 sqlite3 cache-backup4.db
-- Loading resources from /root/.sqliterc
SQLite version 3.31.1 2020-01-27 19:55:54
Enter ".help" for usage hints.
sqlite> DELETE FROM messages WHERE time < 23 AND published = 1;
sqlite> 

This takes 2-4 seconds when run on a freshly copied, uncached file.

Once the file is cached by the OS, it's faster.

After running CREATE INDEX IF NOT EXISTS idx_time ON messages (time);

It's always fast. Even on a freshly copied, uncached file.

Though my database knowledge tells me that only exact comparisons (=) work with indices, and not > and <. Maybe SQLite indices work differently.
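
(In fact, SQLite indices are B-trees, so range comparisons like < and > can use them too. An EXPLAIN QUERY PLAN before and after creating the index shows the difference; the output below is illustrative for a recent SQLite:)

sqlite> EXPLAIN QUERY PLAN DELETE FROM messages WHERE time < 23 AND published = 1;
QUERY PLAN
`--SCAN TABLE messages
sqlite> CREATE INDEX IF NOT EXISTS idx_time ON messages (time);
sqlite> EXPLAIN QUERY PLAN DELETE FROM messages WHERE time < 23 AND published = 1;
QUERY PLAN
`--SEARCH TABLE messages USING INDEX idx_time (time<?)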

At 10:10pm EST, Nov 14, I ran CREATE INDEX IF NOT EXISTS idx_time ON messages (time); on the prod DB.


@binwiederhier commented on GitHub (Nov 15, 2022):

Short summary:

  • Message cache pruning (https://github.com/binwiederhier/ntfy/blob/de2ca33700442c134724a768b94f710555feeaa5/server/server.go#L1145) can take >5s sometimes
  • The SQLite busy timeout (https://www.sqlite.org/pragma.html#pragma_busy_timeout) is 5s by default
  • After that, transactions will return "database is locked"
  • The query responsible is pruneMessagesQuery (https://github.com/binwiederhier/ntfy/blob/de2ca33700442c134724a768b94f710555feeaa5/server/message_cache.go#L54), combined with concurrent synchronous writes (https://github.com/binwiederhier/ntfy/blob/de2ca33700442c134724a768b94f710555feeaa5/server/server.go#L494)

Actions:

  • [x] YOLO-created index in prod (~ Nov 15 03:10 UTC) to make pruning quicker: CREATE INDEX IF NOT EXISTS idx_time ON messages (time);
  • [x] Upgrade EC2 instance to t3.medium
  • [ ] ~If this works (no locks for 2 days or so), make code changes to do this in code~
  • [x] Increase busy_timeout to 10s via startup queries, pragma busy_timeout=10000;
  • [x] Add vacuum command to startup queries
  • [x] (Long term) make adding messages asynchronous and batched via queuing/channels, see #502

@binwiederhier commented on GitHub (Nov 15, 2022):

Well. 😢

  • the "database is locked" situations happened a few more times
  • ntfy was OOM-killed once, possibly because the database was locked for too long and messages stacked up

@binwiederhier commented on GitHub (Nov 16, 2022):

The situation on the server got worse, so I had to upgrade the EC2 instance to t3.medium (which will be a little more expensive, but fine).

I also started implementing the batching (#502), but I won't deploy for now. One change at a time.

The busy_timeout change can be deployed via the cache-startup-queries config option using a pragma. Maybe the vacuum can be done that way too; a sketch follows.
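
Something like this in /etc/ntfy/server.yml would do it (a sketch, assuming the cache-startup-queries option accepts a multi-line list of SQL statements run against the message cache at startup):

# /etc/ntfy/server.yml (sketch)
cache-startup-queries: |
    pragma busy_timeout = 10000;
    vacuum;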


@binwiederhier commented on GitHub (Nov 18, 2022):

This issue has been addressed. I've had it deployed for 24h and haven't seen a single "database is locked" event.
