[GH-ISSUE #498] HTTP 500 - database is locked #380
Originally created by @binwiederhier on GitHub (Nov 14, 2022).
Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/498
I've seen a few spikes of HTTP 500s lately, all related to SQLite's "database is locked" error.

They happen every few hours or so, and in "bulk" (many at once). In the grand scheme of things it's not a lot, but it's certainly cause for concern, because traffic is only going to increase.
@binwiederhier commented on GitHub (Nov 15, 2022):
I did some investigating. I'll copy+paste my findings from Discord.
--
I've found this comment to be helpful https://github.com/mattn/go-sqlite3/issues/274#issuecomment-211759641
The problem is that messages are written to the database synchronously while publishing, and if there are dozens or hundreds of publishers, there are as many concurrent writers.
There's also a pruning mechanism which runs regularly.
By default, SQLite's busy timeout is 5s, after which it returns "database is locked".
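For reference, with the mattn/go-sqlite3 driver linked above, the timeout can be raised through the `_busy_timeout` DSN parameter. A minimal sketch (the `cache.db` path is illustrative):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// _busy_timeout tells the driver to wait up to 10s for a lock before
	// giving up with "database is locked" (instead of the 5s default).
	db, err := sql.Open("sqlite3", "file:cache.db?_busy_timeout=10000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := db.Ping(); err != nil { // sql.Open is lazy; Ping actually opens the file
		log.Fatal(err)
	}
}
```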
I think one way to solve this (other than increasing the busy timeout) is to asynchronously queue messages for persistence, and then have a single writer goroutine batch-write all queued messages after a few milliseconds (see the sketch below).
The disadvantage is obviously that publishers won't get errors back when persisting fails. But the advantage is that the overall number of transactions is reduced.
There's also an issue with persisting on shutdown, which I still have to deal with.
I'm also not entirely sure if this violates the expected contract of the API, i.e. people expect to immediately be able to retrieve the message from the cache after it is published.
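A minimal sketch of that batching idea, assuming a made-up `messages` schema (this is not ntfy's actual implementation, and #502 may differ): publishers enqueue on a channel and return immediately, while a single goroutine flushes everything queued in one transaction every few milliseconds.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

// message is a stand-in for ntfy's real message struct.
type message struct {
	time  int64
	topic string
	body  string
}

// batchWriter turns N queued messages into a single write transaction,
// instead of N concurrent ones competing for the SQLite write lock.
func batchWriter(db *sql.DB, queue <-chan *message, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	var batch []*message
	for {
		select {
		case m := <-queue:
			batch = append(batch, m)
		case <-ticker.C:
			if len(batch) == 0 {
				continue
			}
			if err := writeBatch(db, batch); err != nil {
				// The trade-off described above: the publisher already got
				// its HTTP response, so failures can only be logged here.
				log.Printf("batch write failed: %v", err)
			}
			batch = batch[:0]
		}
	}
}

func writeBatch(db *sql.DB, batch []*message) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds
	stmt, err := tx.Prepare(`INSERT INTO messages (time, topic, message) VALUES (?, ?, ?)`)
	if err != nil {
		return err
	}
	defer stmt.Close()
	for _, m := range batch {
		if _, err := stmt.Exec(m.time, m.topic, m.body); err != nil {
			return err
		}
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("sqlite3", "file:cache.db?_busy_timeout=10000")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Illustrative schema only, not ntfy's real one.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS messages (time INT, topic TEXT, message TEXT, published INT DEFAULT 1)`); err != nil {
		log.Fatal(err)
	}
	queue := make(chan *message, 1024)
	go batchWriter(db, queue, 10*time.Millisecond)
	queue <- &message{time: time.Now().Unix(), topic: "mytopic", body: "hi"} // a "publisher"
	time.Sleep(50 * time.Millisecond)                                        // let the writer flush
}
```

Note that a publisher that enqueues and returns immediately is exactly why persistence errors become invisible to it, and why a shutdown path would have to drain the queue before closing the database.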
From the logs, it looks like pretty much all batches of `database locked` errors have an `INFO Stats: 85383 messages published, ...` section following, which aligns with my guess. It's the pruning again that locks the database for too long, and that leads to timeouts when publishing.

Turned on DEBUG logging, and caught one:
The query is quite inefficient and doesn't use indices. Maybe it's as simple as using an index.
```go
pruneMessagesQuery = `DELETE FROM messages WHERE time < ? AND published = 1`
```

In my super unscientific test, adding an index on `time` made it super fast.
Methodology:

Running the prune query takes 2-4 seconds on a freshly copied, uncached file. Once the file is cached by the OS, it's faster. After running

```sql
CREATE INDEX IF NOT EXISTS idx_time ON messages (time);
```

it's always fast, even on a freshly copied, uncached file.
Though my database knowledge tells me that only exact comparisons (=) work with indices, and not > and <. Maybe SQLite indices work differently.
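For what it's worth, SQLite indices are ordinary B-trees, so range comparisons like `time < ?` can use them just like equality lookups. A quick way to check what the planner does, sketched here in Go against a hypothetical local copy of the cache (`cache-copy.db`):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "cache-copy.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The plan doesn't depend on the bound value, so any placeholder works.
	rows, err := db.Query(
		`EXPLAIN QUERY PLAN DELETE FROM messages WHERE time < ? AND published = 1`, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id, parent, notused int
		var detail string
		if err := rows.Scan(&id, &parent, &notused, &detail); err != nil {
			log.Fatal(err)
		}
		// Without idx_time the plan is a full "SCAN" of messages; with it,
		// something like "SEARCH messages USING INDEX idx_time (time<?)".
		fmt.Println(detail)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```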
At 10:10pm EST, Nov 14, I ran

```sql
CREATE INDEX IF NOT EXISTS idx_time ON messages (time);
```

on the prod DB.

@binwiederhier commented on GitHub (Nov 15, 2022):
Short summary:
Publishers are hitting `database locked` errors because pruning holds the write lock for too long.

Actions:

- Created an index on the prod DB (Nov 15 03:10 UTC) to make pruning quicker: `CREATE INDEX IF NOT EXISTS idx_time ON messages (time);`
- If this works (no locks for 2 days or so), make code changes to do this in code
- Increase the busy timeout: `pragma busy_timeout=10000;`
- Add the `vacuum` command to startup queries

@binwiederhier commented on GitHub (Nov 15, 2022):
Well. 😢
The `database is locked` situations continued happening a few times.

@binwiederhier commented on GitHub (Nov 16, 2022):
The situation on the server got worse, so I had to upgrade the EC2 instance to `t3.medium` (which will be a little more expensive, but fine).

I also started implementing the batching (#502), but I won't deploy it for now. One change at a time.
The `busy_timeout` change can be deployed via the `cache-startup-queries` config option via a `pragma`. Maybe even the `vacuum` can be done like that too.
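A sketch of what that might look like in `server.yml`, using only the statements mentioned in this thread (whether `vacuum` is accepted there is exactly the open question above):

```yaml
# Sketch: statements run against the message cache at startup.
# The pragma value and vacuum come from the actions list above.
cache-startup-queries: |
  pragma busy_timeout = 10000;
  vacuum;
```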
@binwiederhier commented on GitHub (Nov 18, 2022):

This issue has been addressed. I've had it deployed for 24h and haven't seen a single `database is locked` event.