[GH-ISSUE #1634] Stuck in CONNECTING state with passed nextRetryTime on Android 16 (SDK 36) #1140

Open
opened 2026-05-07 00:30:31 +02:00 by BreizhHardware · 10 comments

Originally created by @marlonfan on GitHub (Mar 3, 2026).
Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/1634

Device & Environment
ntfy version: 1.22.2 (F-Droid)

OS: Android 16 (SDK 36)

Server Setup: Dual Nginx Proxy (Stream Proxy L4 -> HTTP Proxy L7)

Connection Protocol: WebSocket (ws)

Description
I've encountered a persistent "reconnecting" issue where the app remains stuck in the CONNECTING state for hours. Despite network changes (Wi-Fi/4G/VPN toggles), the state machine does not recover automatically. The only way to restore connection is to "Force Stop" the app and restart it.

Looking at the logs, there is a clear logic discrepancy: the nextRetryTime is stuck in the past, and the connection job seems to be a "zombie" process.

Evidence (Log Snippets)

// 1. Connection getting stuck with expired retry time
... D NtfyMainActivity subscription: Subscription(..., state=CONNECTING, nextRetryTime=1772507100000)
// Note: The current log timestamp 1772507189623 is already past the nextRetryTime.

// 2. Typical Errors encountered during handshake
... E NtfyApiService Connection to https://ntfy.example.com failed (1): stream was reset: CANCEL
... E NtfySubscriberConn [https://ntfy.example.com] Connection failed: StandaloneCoroutine was cancelled
... E NtfyWsConnection ...: Connection failed (response is null): null
Exception: java.io.EOFException

// 3. Polling works while WebSocket is dead
... D NtfyPollWorker Polling for new notifications
... D NtfyPollWorker Finished polling for new notifications (Received 2 notifications)

Observations
While the WebSocket connection is stuck, the background PollWorker continues to function and successfully retrieves notifications via HTTP polling. This confirms the server and network are healthy.

It appears that NtfySubscriberService or the underlying OkHttpClient is not triggering a timeout or a clean failure when a connection attempt is hanging in the "half-open" state on Android 16.

This is critical for infrastructure stability as it leads to delayed or missed notifications without any user-facing error except the "reconnecting" status.

Possible Fix
I suggest a strict connectTimeout or a watchdog mechanism in NtfySubscriberService to force-kill and reset the connection if it stays in the CONNECTING state significantly longer than its nextRetryTime.
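For illustration, a stricter client configuration along these lines could be sketched with OkHttp (which the app uses under the hood); the timeout values and builder setup here are assumptions for the sketch, not the app's actual settings:

```kotlin
import java.util.concurrent.TimeUnit
import okhttp3.OkHttpClient

// Illustrative only: fail a hanging handshake quickly instead of letting it
// linger half-open, and ping established WebSockets so dead peers surface
// as failures rather than silence.
val client = OkHttpClient.Builder()
    .connectTimeout(15, TimeUnit.SECONDS)   // bound TCP/TLS/handshake setup
    .readTimeout(0, TimeUnit.MILLISECONDS)  // the subscribe stream is long-lived
    .pingInterval(60, TimeUnit.SECONDS)     // WebSocket keepalive probe
    .build()
```

With pingInterval set, OkHttp fails the WebSocket if a pong does not arrive in time, which turns a silent half-open connection into an ordinary, retryable error.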


@binwiederhier commented on GitHub (Mar 3, 2026):

Thank you for this detailed investigation and bug report. I'll add this to the top of the list.


@martkaczmarek commented on GitHub (Mar 3, 2026):

I also have some problem of this sort on Android app:
ntfy 1.23.0 (play)
Started a few days ago; I get constant "Listening for incoming notifications" notification. The messages still go through, but it looks like it's constantly reconnecting. Worked fine for like a month now.
Log from server side:

ntfy | 2026-03-03T10:39:46.581224956Z 2026/03/03 11:39:46 INFO WebSocket error: read tcp 172.23.0.2:80->10.0.0.1:29035: read: connection reset by peer (error=read tcp 172.23.0.2:80->10.0.0.1:29035: read: connection reset by peer, http_method=GET, http_path=*****************, tag=websocket, user_id=u_ziwn78b8jW, user_name=*****************, visitor_auth_limiter_limit=0.016666666666666666, visitor_auth_limiter_tokens=30, visitor_id=ip:*****************, visitor_ip=*****************, visitor_messages=0, visitor_messages_limit=17280, visitor_messages_remaining=17280, visitor_request_limiter_limit=0.2, visitor_request_limiter_tokens=60, visitor_seen=2026-03-03T11:39:06.9+01:00)

Server: docker, version:
ntfy 2.16.0 (4b474a8), runtime go1.24.11, built at 2026-01-19T23:34:14Z


@binwiederhier commented on GitHub (Mar 4, 2026):

@marlonfan What makes you think that the OkHTTP client is stuck? Can you provide more logs? You can tap "Record logs" in the app and then copy/upload anonymized logs and share them here.

It would be very interesting/good to have a full log showing the lifecycle of the WsConnection (including the gid=.. fields and such).

Claude believes it found an issue, but I want to be sure that it's your issue. Could you provide more logs?

@martkaczmarek You are talking about something unrelated. The reconnections didn't change. The socket will always break and reconnect, because it's a mobile phone and connections fluctuate and you roam around. We added a "reconnecting ..." change to the foreground service notification that people hate. I am getting rid of that in the next release.


@binwiederhier commented on GitHub (Mar 4, 2026):

Also: @marlonfan, you are on an old version. There was a slight change to the websocket behavior in 1.23.0 that may have an impact.


@martkaczmarek commented on GitHub (Mar 4, 2026):

> @martkaczmarek You are talking about something unrelated. The reconnections didn't change. The socket will always break and reconnect, because it's a mobile phone and connections fluctuate and you roam around. We added a "reconnecting ..." change to the foreground service notification that people hate. I am getting rid of that in the next release.

Right, I figured in the meantime that it's not really related. Sorry for confusing the issue here.


@marlonfan commented on GitHub (Mar 5, 2026):

@binwiederhier Thank you very much for your reply. I have now switched to using HTTP JSON for two days and it feels good. I have already upgraded to version 1.23.0, and I will try switching to WebSocket in the next few days to see if there are any changes. I previously saved a log, but I'm not sure if it can be helpful to you.

hBEuxQ2ZqA.txt: https://github.com/user-attachments/files/25767352/hBEuxQ2ZqA.txt


@binwiederhier commented on GitHub (Mar 8, 2026):

@marlonfan This log was very helpful. I had a conversation with Claude, which suggests that it's related to doze mode, which given the timing (shortly after midnight) seems plausible.

I'll look at this further. I don't want to just implement a "fix" without fully understanding the cause.

Here's the transcript of the Claude session, which is very insightful:


Session Transcript: ntfy Issue #1634 Analysis

User

please look at https://github.com/binwiederhier/ntfy/issues/1634 -- in particular the log posted in https://github.com/user-attachments/files/25767352/hBEuxQ2ZqA.txt

Claude

(Fetched the GitHub issue and downloaded the 1075-line log file. Extracted WebSocket connection lifecycle events.)

Here's my analysis of the log:

Bug: WebSocket reconnect alarm lost during Android Doze

Timeline from the log

| Time | Event | Result |
|------|-------|--------|
| 22:57:48 | Last message received on lid=25 | Working normally |
| 23:02:50 | 1st EOFException (lid=25) | scheduleReconnect(5s) via setExact |
| 23:02:55 | Alarm fires (5s later) | Reconnects as lid=26 ✓ |
| 23:30:13 | 2nd EOFException (lid=26) | scheduleReconnect(5s) via setExact |
| 23:31:55 | Alarm fires (102s late — Doze delay) | Reconnects as lid=27 ✓ |
| 00:32:18 | 3rd EOFException (lid=27) | scheduleReconnect(5s) via setExact |
| 00:32:23 | Alarm should fire... | NEVER FIRES ✗ |
| 01:07:57 | ServiceStartWorker runs | Sees Match? --> true, does nothing |
| 06:17–09:17+ | PollWorker delivers notifications | state=CONNECTING, nextRetryTime=1772469143634 stuck forever |

The connection stays stuck in CONNECTING with nextRetryTime=1772469143634 (00:32:23) for the remaining 9+ hours of the log. Messages only arrive via the hourly poll worker.

Root Cause

WsConnection.kt:115 uses AlarmManager.setExact() which does NOT fire during Android Doze mode. Only setExactAndAllowWhileIdle() can wake the device during Doze. The phone entered deep Doze around midnight, so the 3rd reconnect alarm was deferred indefinitely.

Why the service doesn't recover

SubscriberService.reallyRefreshConnections() (line 211) compares ConnectionId objects — which only contain subscription parameters (topics, credentials, protocol). Since those haven't changed, it sees Match? --> true and takes no action, even though the underlying WebSocket is dead and stuck.

The connection alert (maybeShowConnectionAlert) also doesn't trigger because the EOFException is mapped to error=null (line 193 of WsConnection — "broken connections" are deliberately hidden from UI), so details.hasError() returns false.

Suggested Fixes

  1. Use setExactAndAllowWhileIdle() instead of setExact() in WsConnection.kt:115,126 — this is the primary fix to survive Doze mode

  2. Add a staleness check in the periodic ServiceStartWorker refresh — if a connection has been in CONNECTING state past its nextRetryTime for more than N minutes, force a reconnect by calling start() on it. This would be a safety net for any case where the alarm doesn't fire.

Want me to implement these fixes?
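A minimal sketch of fix (1), assuming the standard AlarmManager API; the surrounding function is hypothetical, not ntfy's actual WsConnection code:

```kotlin
import android.app.AlarmManager
import android.app.PendingIntent

// Hypothetical wrapper showing the one-call change. setExact() alarms are
// deferred while the device is in Doze; setExactAndAllowWhileIdle() may fire
// (rate-limited) even while idle, using the same SCHEDULE_EXACT_ALARM permission.
fun scheduleReconnect(alarmManager: AlarmManager, triggerAtMillis: Long, operation: PendingIntent) {
    alarmManager.setExactAndAllowWhileIdle(AlarmManager.RTC_WAKEUP, triggerAtMillis, operation)
}
```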


User

what timezone is this person in. you can see from the logs: 1772469138636 2026-03-03 00:32:18.636

Claude

The epoch 1772469138636 corresponds to 2026-03-03 00:32:18 UTC+8 (China Standard Time). You can verify: the formatted timestamp in the log shows 00:32:18, and that matches UTC+8.
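The conversion can be checked with java.time on any JVM (no Android APIs needed):

```kotlin
import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter

// Render the epoch from the log in China Standard Time (UTC+8)
val zoned = Instant.ofEpochMilli(1772469138636L).atZone(ZoneId.of("Asia/Shanghai"))
println(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").format(zoned))
// prints 2026-03-03 00:32:18.636, matching the log line exactly
```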


User

what is 1772469143634

Claude

That's the nextRetryTime — the scheduled alarm time (5 seconds after the disconnect):

  • Disconnect: 1772469138636 → 00:32:18.636
  • nextRetry: 1772469143634 → 00:32:23.634

Difference is 4998ms (~5 seconds), matching the first entry in RETRY_SECONDS. The alarm was supposed to fire at that time but never did.


User

does setExactAndAllowWhileIdle need more permissions? what are the downsides?

Claude

setExactAndAllowWhileIdle() uses the same SCHEDULE_EXACT_ALARM permission that's already required for setExact() — no additional permissions needed.

The main downside is rate limiting: Android throttles these alarms to roughly one per 9 minutes per app during Doze. But that's irrelevant here since ntfy only fires one reconnect alarm at a time, and the retry intervals (5, 10, 15, 20, 30, 45, 60, 120s) are spaced far enough apart.

There's really no downside for this use case. It's a strict upgrade.


User

1 per 9 minutes doesnt seem enough. what if the first retry fails during doze? the next would be scheduled 10 seconds into the future, not 9min into the future

Claude

Good point. If the first retry at 5s fails (e.g. no network during Doze), the next alarm at 10s would get throttled to ~9 minutes anyway. And the one after that (15s) would also get throttled. So in practice the retry intervals during Doze would become something like 9min, 9min, 9min... regardless of the configured backoff.

But honestly that's still fine — and much better than the current behavior where it never reconnects at all. The connection is stuck forever right now. Getting a retry every ~9 minutes during Doze is a massive improvement.

And once the device exits Doze (user picks up phone, screen on), the throttling stops and retries happen at normal intervals. The backoff list already goes up to 120s anyway, so the worst case during Doze is the retry gets delayed from 120s to ~9min — not a big deal.

The alternative would be to use setAlarmClock() which has no throttling at all, but that shows an alarm icon in the status bar, which would be inappropriate here.


@marlonfan commented on GitHub (Mar 8, 2026):

@binwiederhier Thank you very much for your work. I also suspect this is related to the OS's scheduling mechanism. The customized OS in China has very aggressive power-saving strategies, which likely caused this issue. If you need further cooperation, feel free to contact me anytime. I look forward to seeing the problem resolved. Currently, I am using the HTTP JSON mode and found that the battery consumption is similar to the WebSocket approach. This gives me a backup plan. Anyway, best of luck~

By the way, just to ask, is the HTTP JSON implemented using long polling?


@binwiederhier commented on GitHub (Mar 8, 2026):

JSON stream is just a long-running HTTP response, similar to SSE. One may call that long polling, yes.

https://docs.ntfy.sh/subscribe/api/#subscribe-as-json-stream
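As a rough sketch, consuming that endpoint from Kotlin with OkHttp might look like this (server and topic names are placeholders):

```kotlin
import java.util.concurrent.TimeUnit
import okhttp3.OkHttpClient
import okhttp3.Request

fun main() {
    // No read timeout: the response body is an open-ended stream of JSON lines
    val client = OkHttpClient.Builder()
        .readTimeout(0, TimeUnit.MILLISECONDS)
        .build()
    val request = Request.Builder()
        .url("https://ntfy.sh/mytopic/json") // placeholder topic
        .build()
    client.newCall(request).execute().use { response ->
        val source = response.body!!.source()
        while (!source.exhausted()) {
            // One JSON object per line: open, keepalive, and message events
            println(source.readUtf8Line())
        }
    }
}
```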


@binwiederhier commented on GitHub (Mar 8, 2026):

I do think setExactAndAllowWhileIdle is a good approach, but I still think there should be a mechanism to figure out that a WsConnection is in a "dead and will not awake" state, e.g. by observing the nextRetryTime and force-refreshing the connections.
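A hedged sketch of such a staleness check; names like ConnectionState and restart() are illustrative, not ntfy's actual internals:

```kotlin
// Hypothetical watchdog predicate for the periodic refresh: a connection that
// has sat in CONNECTING well past its own nextRetryTime will never be woken
// by its alarm, so force a restart.
enum class ConnectionState { CONNECTING, CONNECTED }

const val STALE_GRACE_MILLIS = 10 * 60 * 1000L // grace period; value is arbitrary

fun isStuck(state: ConnectionState, nextRetryTime: Long, now: Long): Boolean =
    state == ConnectionState.CONNECTING &&
        nextRetryTime > 0 &&
        now > nextRetryTime + STALE_GRACE_MILLIS

// In the refresh worker (sketch):
// if (isStuck(conn.state, conn.nextRetryTime, System.currentTimeMillis())) conn.restart()
```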
