[GH-ISSUE #338] Some publish requests on ntfy.sh take up to 15 seconds #264

Closed
opened 2026-05-07 00:22:26 +02:00 by BreizhHardware · 3 comments

Originally created by @binwiederhier on GitHub (Jun 21, 2022).
Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/338

So I've been noticing that every now and then some requests against ntfy.sh had been taking 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around I found that the updateStatsAndPrune() code is likely to blame, because it locks the server mutex for a very long time (or so it appears).

Here's what I saw:

curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.00s user 0.00s system 12% cpu 0.055 total
Tue Jun 21 15:04:57 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 23% cpu 0.062 total
Tue Jun 21 15:05:02 EDT 2022
curl -sd hi ntfy.sh/mytopic123 > /dev/null  0.01s user 0.01s system 0% cpu 11.509 total
                                                                           ^^^^^^

This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out.

I briefly turned on trace logging in ntfy and saw this:

Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05
Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed)

This corresponds to this block of code: github.com/binwiederhier/ntfy@4e29216b5f/server/server.go (L1114-L1150), which locks the global server mutex here: github.com/binwiederhier/ntfy@4e29216b5f/server/server.go (L1083)

Note the timestamps, 18:51:05 + 18:51:20 -- That's 15 seconds to run this code, meaning that all POST/PUT requests have to wait on the lock this entire time.

This is likely relatively easy to fix, and looking at the code it is obviously pretty inefficient.

Originally created by @binwiederhier on GitHub (Jun 21, 2022). Original GitHub issue: https://github.com/binwiederhier/ntfy/issues/338 So I've been noticing that every now and then some requests against ntfy.sh had been taking 11-15s (as opposed to <1s). At first I thought it was a problem with the Linux kernel tuning variables (somaxconn, nofile, ...). Then I thought it was nginx. After randomly poking around I found that the `updateStatsAndPrune()` code is likely to blame, because it locks the server mutex for a very long time (or so it appears). Here's what I saw: ``` curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.00s user 0.00s system 12% cpu 0.055 total Tue Jun 21 15:04:57 EDT 2022 curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.01s user 0.01s system 23% cpu 0.062 total Tue Jun 21 15:05:02 EDT 2022 curl -sd hi ntfy.sh/mytopic123 > /dev/null 0.01s user 0.01s system 0% cpu 11.509 total ^^^^^^ ``` This happened even when doing it against localhost:11080 (= not through nginx), meaning DNS and nginx could be ruled out. I briefly turned on trace logging in ntfy and saw this: ``` Jun 21 18:51:05 ntfy.sh ntfy[4169942]: DEBUG Manager: Pruning messages older than 2022-06-21 06:51:05 Jun 21 18:51:20 ntfy.sh ntfy[4169942]: INFO Stats: 525254 messages published, 60415 in cache, 1478 topic(s) active, 1267 subscriber(s), 6045 visitor(s), 1733 mails received (4 successful, 1729 failed), 12 mails sent (12 successful, 0 failed) ``` This corresponds to this block of code: https://github.com/binwiederhier/ntfy/blob/4e29216b5f7fc170f82fe78896dad41a16916eb4/server/server.go#L1114-L1150, which locks the global server mutex here: https://github.com/binwiederhier/ntfy/blob/4e29216b5f7fc170f82fe78896dad41a16916eb4/server/server.go#L1083 Note the timestamps, 18:51:05 + 18:51:20 -- That's 15 seconds to run this code, meaning that all POST/PUT requests have to wait on the lock this entire time. This is likely relatively easy to fix, and looking at the code it is obviously pretty inefficient.
BreizhHardware 2026-05-07 00:22:26 +02:00
Author
Owner

@binwiederhier commented on GitHub (Jun 22, 2022):

In trying to fix this, I have encountered a horrible data race that I have not been able to figure out in quite some time. It appears to be happening when the Go HTTP code reads from the socket when closing the request, and causes a data race with the ResponseWriter.

Here's what the data race stack looks like (see https://github.com/binwiederhier/ntfy/runs/6994933562?check_suite_focus=true):

go test -v -race -run 'TestClient_Publish_Subscribe' ./client
=== RUN   TestClient_Publish_Subscribe
2022/06/22 13:33:35 INFO Listening on :21493[http], ntfy , log level is INFO
2022/06/22 13:33:36 INFO 127.0.0.1:21493/mytopic Connection exited
==================
WARNING: DATA RACE
Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

Previous write at 0x00c0002ea0a8 by goroutine 44:
  bufio.(*Writer).Flush()
      /usr/local/go/src/bufio/bufio.go:640 +0x319
  net/http.(*chunkWriter).flush()
      /usr/local/go/src/net/http/server.go:397 +0x92
  net/http.(*response).Flush()
      /usr/local/go/src/net/http/server.go:1682 +0x78
  heckel.io/ntfy/server.(*Server).handleSubscribeHTTP.func1()
      /home/pheckel/Code/ntfy/server/server.go:815 +0x174
  heckel.io/ntfy/server.(*topic).Publish.func1()
      /home/pheckel/Code/ntfy/server/topic.go:53 +0x3fb

Goroutine 31 (running) created at:
  net/http.(*Server).Serve()
      /usr/local/go/src/net/http/server.go:3071 +0x80c
  net/http.(*Server).ListenAndServe()
      /usr/local/go/src/net/http/server.go:2968 +0xbc
  heckel.io/ntfy/server.(*Server).Run.func1()
      /home/pheckel/Code/ntfy/server/server.go:191 +0x52

Goroutine 44 (finished) created at:
  heckel.io/ntfy/server.(*topic).Publish()
      /home/pheckel/Code/ntfy/server/topic.go:46 +0x116
  heckel.io/ntfy/server.(*Server).handlePublishWithoutResponse()
      /home/pheckel/Code/ntfy/server/server.go:468 +0xa64
  heckel.io/ntfy/server.(*Server).handlePublish()
      /home/pheckel/Code/ntfy/server/server.go:495 +0x77
  heckel.io/ntfy/server.(*Server).handlePublish-fm()
      <autogenerated>:1 +0x6d
  heckel.io/ntfy/server.(*Server).withAuth.func1()
      /home/pheckel/Code/ntfy/server/server.go:1361 +0x273
  heckel.io/ntfy/server.(*Server).limitRequests.func1()
      /home/pheckel/Code/ntfy/server/server.go:1268 +0x18a
  heckel.io/ntfy/server.(*Server).handleInternal()
      /home/pheckel/Code/ntfy/server/server.go:308 +0x25d3
  heckel.io/ntfy/server.(*Server).handle()
      /home/pheckel/Code/ntfy/server/server.go:253 +0x1c4
  heckel.io/ntfy/server.(*Server).handle-fm()
      <autogenerated>:1 +0x57
  net/http.HandlerFunc.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2084 +0x4d
  net/http.(*ServeMux).ServeHTTP()
      /usr/local/go/src/net/http/server.go:2462 +0xc5
  net/http.serverHandler.ServeHTTP()
      /usr/local/go/src/net/http/server.go:2916 +0x896
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1966 +0xbaa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58
==================
    testing.go:1312: race detected during execution of test
--- FAIL: TestClient_Publish_Subscribe (1.48s)
=== CONT  
    testing.go:1312: race detected during execution of test
FAIL
FAIL	heckel.io/ntfy/client	1.532s
FAIL

This stack in particular indicates that something inside the Go stdlib is reading from the underlying connection/socket

Read at 0x00c0002ea0a8 by goroutine 31:
  bufio.(*Writer).Available()
      /usr/local/go/src/bufio/bufio.go:645 +0xcd
  bufio.(*Writer).WriteString()
      /usr/local/go/src/bufio/bufio.go:735 +0x18
  net/http.(*chunkWriter).close()
      /usr/local/go/src/net/http/server.go:407 +0xbd
  net/http.(*response).finishRequest()
      /usr/local/go/src/net/http/server.go:1630 +0xb2
  net/http.(*conn).serve()
      /usr/local/go/src/net/http/server.go:1972 +0xbfa
  net/http.(*Server).Serve.func3()
      /usr/local/go/src/net/http/server.go:3071 +0x58

The w ResponseWriter is protected with wlock sync.Mutex, but that is obviously not used by the standard library, which is why the race occurs. I have not been able to figure out a proper fix, other than to call wlock.TryLock() (or Lock()) before the HTTP handler function exits. That is a hack, but it's harmless, because the lock is freed and irrelevant anyway after the function exits.

<!-- gh-comment-id:1163425889 --> @binwiederhier commented on GitHub (Jun 22, 2022): In trying to fix this, I have encountered a horrible data race that I have not been able to figure out in quite some time. It appears to be happening when the Go HTTP code reads from the socket when closing the request, and causes a data race with the `ResponseWriter`. Here's what the data race stack looks like (see https://github.com/binwiederhier/ntfy/runs/6994933562?check_suite_focus=true): ``` go test -v -race -run 'TestClient_Publish_Subscribe' ./client === RUN TestClient_Publish_Subscribe 2022/06/22 13:33:35 INFO Listening on :21493[http], ntfy , log level is INFO 2022/06/22 13:33:36 INFO 127.0.0.1:21493/mytopic Connection exited ================== WARNING: DATA RACE Read at 0x00c0002ea0a8 by goroutine 31: bufio.(*Writer).Available() /usr/local/go/src/bufio/bufio.go:645 +0xcd bufio.(*Writer).WriteString() /usr/local/go/src/bufio/bufio.go:735 +0x18 net/http.(*chunkWriter).close() /usr/local/go/src/net/http/server.go:407 +0xbd net/http.(*response).finishRequest() /usr/local/go/src/net/http/server.go:1630 +0xb2 net/http.(*conn).serve() /usr/local/go/src/net/http/server.go:1972 +0xbfa net/http.(*Server).Serve.func3() /usr/local/go/src/net/http/server.go:3071 +0x58 Previous write at 0x00c0002ea0a8 by goroutine 44: bufio.(*Writer).Flush() /usr/local/go/src/bufio/bufio.go:640 +0x319 net/http.(*chunkWriter).flush() /usr/local/go/src/net/http/server.go:397 +0x92 net/http.(*response).Flush() /usr/local/go/src/net/http/server.go:1682 +0x78 heckel.io/ntfy/server.(*Server).handleSubscribeHTTP.func1() /home/pheckel/Code/ntfy/server/server.go:815 +0x174 heckel.io/ntfy/server.(*topic).Publish.func1() /home/pheckel/Code/ntfy/server/topic.go:53 +0x3fb Goroutine 31 (running) created at: net/http.(*Server).Serve() /usr/local/go/src/net/http/server.go:3071 +0x80c net/http.(*Server).ListenAndServe() /usr/local/go/src/net/http/server.go:2968 +0xbc heckel.io/ntfy/server.(*Server).Run.func1() /home/pheckel/Code/ntfy/server/server.go:191 +0x52 Goroutine 44 (finished) created at: heckel.io/ntfy/server.(*topic).Publish() /home/pheckel/Code/ntfy/server/topic.go:46 +0x116 heckel.io/ntfy/server.(*Server).handlePublishWithoutResponse() /home/pheckel/Code/ntfy/server/server.go:468 +0xa64 heckel.io/ntfy/server.(*Server).handlePublish() /home/pheckel/Code/ntfy/server/server.go:495 +0x77 heckel.io/ntfy/server.(*Server).handlePublish-fm() <autogenerated>:1 +0x6d heckel.io/ntfy/server.(*Server).withAuth.func1() /home/pheckel/Code/ntfy/server/server.go:1361 +0x273 heckel.io/ntfy/server.(*Server).limitRequests.func1() /home/pheckel/Code/ntfy/server/server.go:1268 +0x18a heckel.io/ntfy/server.(*Server).handleInternal() /home/pheckel/Code/ntfy/server/server.go:308 +0x25d3 heckel.io/ntfy/server.(*Server).handle() /home/pheckel/Code/ntfy/server/server.go:253 +0x1c4 heckel.io/ntfy/server.(*Server).handle-fm() <autogenerated>:1 +0x57 net/http.HandlerFunc.ServeHTTP() /usr/local/go/src/net/http/server.go:2084 +0x4d net/http.(*ServeMux).ServeHTTP() /usr/local/go/src/net/http/server.go:2462 +0xc5 net/http.serverHandler.ServeHTTP() /usr/local/go/src/net/http/server.go:2916 +0x896 net/http.(*conn).serve() /usr/local/go/src/net/http/server.go:1966 +0xbaa net/http.(*Server).Serve.func3() /usr/local/go/src/net/http/server.go:3071 +0x58 ================== testing.go:1312: race detected during execution of test --- FAIL: TestClient_Publish_Subscribe (1.48s) === CONT testing.go:1312: race detected during execution of test FAIL FAIL heckel.io/ntfy/client 1.532s FAIL ``` This stack in particular indicates that something inside the Go stdlib is reading from the underlying connection/socket ``` Read at 0x00c0002ea0a8 by goroutine 31: bufio.(*Writer).Available() /usr/local/go/src/bufio/bufio.go:645 +0xcd bufio.(*Writer).WriteString() /usr/local/go/src/bufio/bufio.go:735 +0x18 net/http.(*chunkWriter).close() /usr/local/go/src/net/http/server.go:407 +0xbd net/http.(*response).finishRequest() /usr/local/go/src/net/http/server.go:1630 +0xb2 net/http.(*conn).serve() /usr/local/go/src/net/http/server.go:1972 +0xbfa net/http.(*Server).Serve.func3() /usr/local/go/src/net/http/server.go:3071 +0x58 ``` The `w ResponseWriter` is protected with `wlock sync.Mutex`, but that is obviously not used by the standard library, which is why the race occurs. I have not been able to figure out a proper fix, other than to call `wlock.TryLock()` (or `Lock()`) before the HTTP handler function exits. That is a hack, but it's harmless, because the lock is freed and irrelevant anyway after the function exits.
Author
Owner

@binwiederhier commented on GitHub (Jun 23, 2022):

Hopefully fixed, will be released on the next server release

<!-- gh-comment-id:1163792922 --> @binwiederhier commented on GitHub (Jun 23, 2022): Hopefully fixed, will be released on the next server release
Author
Owner

@binwiederhier commented on GitHub (Jun 23, 2022):

Wrote this test to see if it fixed it and how much performance has improved.

Before (note: Publishing message; took 948ms, ):

2022/06/22 20:29:06 Adding 50000 messages
2022/06/22 20:29:07 Done: Adding 50000 messages; took 1.068s
2022/06/22 20:29:07 Updating stats
2022/06/22 20:29:07 Publishing message
2022/06/22 20:29:08 INFO Stats: 0 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 0 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:29:08 Done: Updating stats; took 984ms
    server_test.go:1393: 
        	Error Trace:	server_test.go:1393
        	Error:      	Should be true
        	Test:       	TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:29:08 Done: Publishing message; took 948ms
2022/06/22 20:29:08 Done: Waiting for all locks
--- FAIL: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (2.09s)

After (note: Publishing message; took 3ms ):

go test -v  -run 'TestServer_PublishWhileUpdatingStats' ./server
=== RUN   TestServer_PublishWhileUpdatingStatsWithLotsOfMessages
2022/06/22 20:30:34 Adding 50000 messages
2022/06/22 20:30:35 Done: Adding 50000 messages; took 843ms
2022/06/22 20:30:35 Updating stats
2022/06/22 20:30:35 Publishing message
2022/06/22 20:30:35 Done: Publishing message; took 3ms
2022/06/22 20:30:35 INFO Stats: 1 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 1 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed)
2022/06/22 20:30:35 Done: Updating stats; took 70ms
2022/06/22 20:30:35 Done: Waiting for all locks
--- PASS: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (0.93s)
PASS
ok  	heckel.io/ntfy/server	0.945s
<!-- gh-comment-id:1163798846 --> @binwiederhier commented on GitHub (Jun 23, 2022): Wrote this test to see if it fixed it and how much performance has improved. **Before** (note: `Publishing message; took 948ms`, :exclamation:): ``` 2022/06/22 20:29:06 Adding 50000 messages 2022/06/22 20:29:07 Done: Adding 50000 messages; took 1.068s 2022/06/22 20:29:07 Updating stats 2022/06/22 20:29:07 Publishing message 2022/06/22 20:29:08 INFO Stats: 0 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 0 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed) 2022/06/22 20:29:08 Done: Updating stats; took 984ms server_test.go:1393: Error Trace: server_test.go:1393 Error: Should be true Test: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages 2022/06/22 20:29:08 Done: Publishing message; took 948ms 2022/06/22 20:29:08 Done: Waiting for all locks --- FAIL: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (2.09s) ``` **After** (note: `Publishing message; took 3ms` :exclamation:): ``` go test -v -run 'TestServer_PublishWhileUpdatingStats' ./server === RUN TestServer_PublishWhileUpdatingStatsWithLotsOfMessages 2022/06/22 20:30:34 Adding 50000 messages 2022/06/22 20:30:35 Done: Adding 50000 messages; took 843ms 2022/06/22 20:30:35 Updating stats 2022/06/22 20:30:35 Publishing message 2022/06/22 20:30:35 Done: Publishing message; took 3ms 2022/06/22 20:30:35 INFO Stats: 1 messages published, 50000 in cache, 50000 topic(s) active, 0 subscriber(s), 1 visitor(s), 0 mails received (0 successful, 0 failed), 0 mails sent (0 successful, 0 failed) 2022/06/22 20:30:35 Done: Updating stats; took 70ms 2022/06/22 20:30:35 Done: Waiting for all locks --- PASS: TestServer_PublishWhileUpdatingStatsWithLotsOfMessages (0.93s) PASS ok heckel.io/ntfy/server 0.945s ```
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/ntfy#264
No description provided.