mirror of
https://github.com/binwiederhier/ntfy.git
synced 2026-05-09 08:26:00 +02:00
[GH-ISSUE #1634] Stuck in CONNECTING state with passed nextRetryTime on Android 16 (SDK 36) #1140
Originally created by @marlonfan on GitHub (Mar 3, 2026).
Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/1634
Device & Environment
ntfy version: 1.22.2 (F-Droid)
OS: Android 16 (SDK 36)
Server Setup: Dual Nginx Proxy (Stream Proxy L4 -> HTTP Proxy L7)
Connection Protocol: WebSocket (ws)
Description
I've encountered a persistent "reconnecting" issue where the app remains stuck in the CONNECTING state for hours. Despite network changes (Wi-Fi/4G/VPN toggles), the state machine does not recover automatically. The only way to restore connection is to "Force Stop" the app and restart it.
Looking at the logs, there is a clear logic discrepancy: the nextRetryTime is stuck in the past, and the connection job seems to be a "zombie" process.
Evidence (Log Snippets)
Observations
While the WebSocket connection is stuck, the background PollWorker continues to function and successfully retrieves notifications via HTTP polling. This confirms the server and network are healthy.
It appears that NtfySubscriberService or the underlying OkHttpClient is not triggering a timeout or a clean failure when a connection attempt is hanging in the "half-open" state on Android 16.
This is critical for infrastructure stability as it leads to delayed or missed notifications without any user-facing error except the "reconnecting" status.
Possible Fix
Suggesting a strict connectTimeout or a watchdog mechanism in the NtfySubscriberService to force-kill and reset the connection if it stays in CONNECTING state significantly longer than the nextRetryTime.
@binwiederhier commented on GitHub (Mar 3, 2026):
Thank you for this detailed investigation and bug report. I'll add this to the top of the list.
@martkaczmarek commented on GitHub (Mar 3, 2026):
I also have a problem of this sort in the Android app:
ntfy 1.23.0 (play)
Started a few days ago; I get constant "Listening for incoming notifications" notification. The messages still go through, but it looks like it's constantly reconnecting. Worked fine for like a month now.
Log from server side:
Server: docker, version: ntfy 2.16.0 (4b474a8), runtime go1.24.11, built at 2026-01-19T23:34:14Z
@binwiederhier commented on GitHub (Mar 4, 2026):
@marlonfan What makes you think that the OkHttp client is stuck? Can you provide more logs? You can tap "Record logs" in the app and then copy/upload anonymized logs and share them here.
It would be very interesting/good to have a full log showing the lifecycle of the WsConnection (including the gid=.. fields and such).
Claude believes it found an issue, but I want to be sure that it's your issue. Could you provide more logs?
@martkaczmarek You are talking about something unrelated. The reconnections didn't change. The socket will always break and reconnect, because it's a mobile phone and connections fluctuate and you roam around. We added a "reconnecting ..." change to the foreground service notification that people hate. I am getting rid of that in the next release.
@binwiederhier commented on GitHub (Mar 4, 2026):
Also: @marlonfan, you are on an old version. There was a slight change to the WebSocket behavior in 1.23.0 that may have an impact.
@martkaczmarek commented on GitHub (Mar 4, 2026):
Right, I figured in the meantime that it's not really related. Sorry for confusing the issue here.
@marlonfan commented on GitHub (Mar 5, 2026):
@binwiederhier Thank you very much for your reply. I have now switched to using HTTP JSON for two days and it feels good. I have already upgraded to version 1.23.0, and I will try switching to WebSocket in the next few days to see if there are any changes. I previously saved a log, but I'm not sure if it can be helpful to you.
hBEuxQ2ZqA.txt
@binwiederhier commented on GitHub (Mar 8, 2026):
@marlonfan This log was very helpful. I had a conversation with Claude, which suggests that it's related to doze mode, which given the timing (shortly after midnight) seems plausible.
I'll look at this further. I don't want to just implement a "fix" without fully understanding the cause.
Here's the transcript of the Claude session, which is very insightful:
Session Transcript: ntfy Issue #1634 Analysis
User
please look at https://github.com/binwiederhier/ntfy/issues/1634 -- in particular the log posted in https://github.com/user-attachments/files/25767352/hBEuxQ2ZqA.txt
Claude
(Fetched the GitHub issue and downloaded the 1075-line log file. Extracted WebSocket connection lifecycle events.)
Here's my analysis of the log:
Bug: WebSocket reconnect alarm lost during Android Doze
Timeline from the log
(Timeline table lost in extraction. Surviving cells: setExact (three times); "Match? --> true, does nothing"; "state=CONNECTING, nextRetryTime=1772469143634"; "stuck forever".)
The connection stays stuck in CONNECTING with nextRetryTime=1772469143634 (00:32:23) for the remaining 9+ hours of the log. Messages only arrive via the hourly poll worker.
Root Cause
WsConnection.kt:115 uses AlarmManager.setExact() which does NOT fire during Android Doze mode. Only setExactAndAllowWhileIdle() can wake the device during Doze. The phone entered deep Doze around midnight, so the 3rd reconnect alarm was deferred indefinitely.
Why the service doesn't recover
SubscriberService.reallyRefreshConnections() (line 211) compares ConnectionId objects, which only contain subscription parameters (topics, credentials, protocol). Since those haven't changed, it sees Match? --> true and takes no action, even though the underlying WebSocket is dead and stuck.
The connection alert (maybeShowConnectionAlert) also doesn't trigger because the EOFException is mapped to error=null (line 193 of WsConnection; "broken connections" are deliberately hidden from UI), so details.hasError() returns false.
Suggested Fixes
Use setExactAndAllowWhileIdle() instead of setExact() in WsConnection.kt:115,126. This is the primary fix to survive Doze mode.
Add a staleness check in the periodic ServiceStartWorker refresh: if a connection has been in CONNECTING state past its nextRetryTime for more than N minutes, force a reconnect by calling start() on it. This would be a safety net for any case where the alarm doesn't fire.
Want me to implement these fixes?
User
what timezone is this person in. you can see from the logs: 1772469138636 2026-03-03 00:32:18.636
Claude
The epoch 1772469138636 corresponds to 2026-03-03 00:32:18 UTC+8 (China Standard Time). You can verify: the formatted timestamp in the log shows 00:32:18, and that matches UTC+8.
User
what is 1772469143634
Claude
That's the nextRetryTime, the scheduled alarm time (5 seconds after the disconnect):
1772469138636 → 00:32:18.636
1772469143634 → 00:32:23.634
Difference is 4998ms (~5 seconds), matching the first entry in RETRY_SECONDS. The alarm was supposed to fire at that time but never did.
User
does setExactAndAllowWhileIdle need more permissions? what are the downsides?
Claude
setExactAndAllowWhileIdle() uses the same SCHEDULE_EXACT_ALARM permission that's already required for setExact(): no additional permissions needed.
The main downside is rate limiting: Android throttles these alarms to roughly one per 9 minutes per app during Doze. But that's irrelevant here since ntfy only fires one reconnect alarm at a time, and the retry intervals (5, 10, 15, 20, 30, 45, 60, 120s) are spaced far enough apart.
There's really no downside for this use case. It's a strict upgrade.
User
1 per 9 minutes doesnt seem enough. what if the first retry fails during doze? the next would be scheduled 10 seconds into the future, not 9min into the future
Claude
Good point. If the first retry at 5s fails (e.g. no network during Doze), the next alarm at 10s would get throttled to ~9 minutes anyway. And the one after that (15s) would also get throttled. So in practice the retry intervals during Doze would become something like 9min, 9min, 9min... regardless of the configured backoff.
But honestly that's still fine — and much better than the current behavior where it never reconnects at all. The connection is stuck forever right now. Getting a retry every ~9 minutes during Doze is a massive improvement.
And once the device exits Doze (user picks up phone, screen on), the throttling stops and retries happen at normal intervals. The backoff list already goes up to 120s anyway, so the worst case during Doze is the retry gets delayed from 120s to ~9min — not a big deal.
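The arithmetic in this answer can be sketched roughly as follows. This is an illustration only, not ntfy's code: the backoff values are the ones quoted earlier in this transcript, the 9-minute floor is the approximate Doze limit mentioned above, and the names effectiveDelaySeconds, retrySeconds, and the inDoze flag are hypothetical. It also simplifies by assuming an earlier allow-while-idle alarm has just fired, so the throttle applies to every retry.

```kotlin
// Backoff values as quoted in this transcript (seconds).
val retrySeconds = listOf(5, 10, 15, 20, 30, 45, 60, 120)

// Approximate minimum spacing of allow-while-idle alarms during Doze (about 9 minutes).
const val DOZE_MIN_INTERVAL_SECONDS = 9 * 60

// Hypothetical helper: the delay one would expect in practice for a given retry attempt.
// Attempts past the end of the list reuse the last (largest) backoff value.
fun effectiveDelaySeconds(attempt: Int, inDoze: Boolean): Int {
    val configured = retrySeconds[attempt.coerceIn(0, retrySeconds.lastIndex)]
    // During Doze the alarm cannot fire sooner than the throttle interval allows.
    return if (inDoze) maxOf(configured, DOZE_MIN_INTERVAL_SECONDS) else configured
}
```

Under these assumptions every configured delay collapses to the same 540-second floor during Doze, which is the "9min, 9min, 9min" pattern described above, while the configured values apply unchanged when the device is awake.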
The alternative would be to use setAlarmClock() which has no throttling at all, but that shows an alarm icon in the status bar, which would be inappropriate here.
@marlonfan commented on GitHub (Mar 8, 2026):
@binwiederhier Thank you very much for your work. I also suspect this is related to the OS's scheduling mechanism. The customized OS in China has very aggressive power-saving strategies, which likely caused this issue. If you need further cooperation, feel free to contact me anytime. I look forward to seeing the problem resolved. Currently, I am using the HTTP JSON mode and found that the battery consumption is similar to the WebSocket approach. This gives me a backup plan. Anyway, best of luck~
By the way, just to ask, is the HTTP JSON implemented using long polling?
@binwiederhier commented on GitHub (Mar 8, 2026):
JSON stream is just a long running HTTP response similar to SSE. One may call that long polling, yes.
https://docs.ntfy.sh/subscribe/api/#subscribe-as-json-stream
@binwiederhier commented on GitHub (Mar 8, 2026):
I do think setExactAndAllowWhileIdle is a good approach, but I still think there should be a mechanism to figure out that a WsConnection is in a "dead and will not awake" state, e.g. by observing the nextRetryTime and force-refreshing the connections.
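A minimal sketch of what such a check might look like, written as plain Kotlin so it does not depend on Android APIs. The ConnectionState enum, the ConnectionSnapshot class, the isStuck function, and the grace period are all hypothetical names chosen for illustration; none of them are taken from the ntfy codebase. Only the nextRetryTime value used in the note below comes from the log discussed in this thread.

```kotlin
// Hypothetical model of a connection's observable state; not ntfy's actual types.
enum class ConnectionState { NOT_APPLICABLE, CONNECTING, CONNECTED }

data class ConnectionSnapshot(
    val state: ConnectionState,
    val nextRetryTimeMillis: Long // 0 means no retry is currently scheduled
)

// A connection is treated as stuck if it is still CONNECTING and its scheduled
// retry time lies further in the past than the grace period allows.
fun isStuck(snapshot: ConnectionSnapshot, nowMillis: Long, graceMillis: Long): Boolean {
    if (snapshot.state != ConnectionState.CONNECTING) return false
    if (snapshot.nextRetryTimeMillis <= 0L) return false
    return nowMillis - snapshot.nextRetryTimeMillis > graceMillis
}
```

With nextRetryTime=1772469143634 from the log and a grace period of two minutes (an arbitrary choice), a periodic refresh running an hour later would flag the connection as stuck. The refresh could then restart it instead of skipping it because the ConnectionId still matches.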