At some point after a webmail session has concluded, this typically ends up spamming the console:
[2024-06-22 08:12:28.038] DEBUG[65783]: mod_webmail.c:3085 client_flush_pending_output: Flushing output ''
[2024-06-22 08:12:28.038] DEBUG[65783]: mod_webmail.c:3085 client_flush_pending_output: Flushing output ''
... ad infinitum.
Backtrace while this is happening:
dynamic = 0
fulldynamic = 0
bytes = 130
log_stdout = <optimized out>
need_reset = 0
skip_logfile = 0
__func__ = "__bbs_log"
#8 0x00007f58283f1a93 in client_flush_pending_output (client=<optimized out>) at mod_webmail.c:3085
line = 0x7f582003caf0 ""
__func__ = "client_flush_pending_output"
res = <optimized out>
__func__ = "idle_start"
#9 idle_start (client=0x7f5820007610, ws=<optimized out>) at mod_webmail.c:3118
res = <optimized out>
__func__ = "idle_start"
#10 0x00007f582b49d844 in ws_handler (node=node@entry=0x7f57e80d5720, http=http@entry=0x7f57df7f3730, proxied=proxied@entry=1) at net_ws.c:1047
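For illustration only (none of these names are from mod_webmail.c): one way to end up spinning like that is a drain loop that is handed an empty string instead of NULL once nothing is pending, so it never terminates and never flushes anything useful. A self-contained toy showing the loop shape and the guard that stops it:

#include <stdio.h>

/* Toy stand-in for "give me the next pending line to flush". Once the real
 * data is drained, it hands back an empty string instead of NULL, which is
 * the pattern that makes a drain loop spin forever. */
static const char *next_pending_line(void)
{
	static const char *lines[] = { "* 90609 EXISTS", "* 0 RECENT" };
	static size_t i = 0;

	return i < sizeof(lines) / sizeof(lines[0]) ? lines[i++] : "";
}

int main(void)
{
	const char *line;

	while ((line = next_pending_line()) != NULL) {
		if (!*line) {
			break; /* Without this guard, we would print "Flushing output ''" forever */
		}
		printf("Flushing output '%s'\n", line);
	}
	return 0;
}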
Possibly related, sometimes idle_start can also trigger a crash:
Thread 1 (Thread 0x7fca6dffb6c0 (LWP 103850)):
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
tid = <optimized out>
ret = 0
pd = <optimized out>
old_mask = {__val = {7142744331360876602}}
ret = <optimized out>
#1 0x00007fcab5c93e8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x00007fcab5c44fb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
ret = <optimized out>
#3 0x00007fcab5c2f472 in __GI_abort () at ./stdlib/abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {4321147497007208943, 12798070773811002880, 11952406549267936580, 7632343802151250176, 14024600428130476874, 16319518678535185408, 6825180314748858317, 7632343802151250176, 140508610266257, 0, 3164090317, 140507405580880, 0, 1, 140508610284947, 0}}, sa_flags = -1244626555, sa_restorer = 0x0}
#4 0x0000558137fc1667 in __bbs_assert_fatal (condition_str=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at backtrace.c:311
#5 0x00007fcab1fffb34 in __bbs_assert (function=0x7fcab200e270 <__func__.46> "idle_start", line=3107, file=0x7fcab200c015 "mod_webmail.c", condition_str=0x7fcab200cf60 "client->imap->imap_selection_info != NULL", condition=<optimized out>) at ../include/bbs.h:440
res = <optimized out>
__func__ = "idle_start"
#6 idle_start (client=client@entry=0x7fca94001eb0, ws=0x7fca6dff7cc0) at mod_webmail.c:3107
res = <optimized out>
__func__ = "idle_start"
#7 0x00007fcab200a0d5 in on_poll_activity (ws=0x7fca6dff7cc0, data=0x7fca94001eb0) at mod_webmail.c:3281
client = 0x7fca94001eb0
idledata = <optimized out>
res = 0
__func__ = "on_poll_activity"
#8 0x00007fcab5055844 in ws_handler (node=node@entry=0x7fca840489e0, http=http@entry=0x7fca6dff8730, proxied=proxied@entry=1) at net_ws.c:1047
cres = <optimized out>
now = <optimized out>
elapsed = <optimized out>
this_poll_start = 1721916373
elapsed_sec = 17
numfds = 2
ws = {client = 0x7fca94001dd0, node = 0x7fca840489e0, http = 0x7fca6dff8730, lock = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, info = {initialized = 1, staticinit = 0, destroyed = 0, owners = 0, lastlocked = 1721916363, lwp = 103850, lineno = 205, filename = "net_ws.c", '\000' <repeats 15 times>}}, data = 0x7fca94001eb0, pollfd = 49, pollms = 1740000, varlist = {first = 0x0, last = 0x0, lock = {lock = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 0, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, intlock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, info = {initialized = 1, staticinit = 0, destroyed = 0, owners = 0, lastlocked = 0, lwp = 103850, lineno = 867, filename = "net_ws.c", '\000' <repeats 15 times>}}}, cookievals = {first = 0x0, last = 0x0, lock = {lock = {__data = {__readers = 0, __writers = 0, __wrphase_futex = 0, __writers_futex = 0, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, intlock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}, info = {initialized = 1, staticinit = 0, destroyed = 0, owners = 0, lastlocked = 0, lwp = 103850, lineno = 868, filename = "net_ws.c", '\000' <repeats 15 times>}}}, proxied = 1, sessionchecked = 0, cookieschecked = 0}
route = 0x55813848b6d0
client = 0x7fca94001dd0
res = <optimized out>
want_ping = 0
ping_data = '\000' <repeats 14 times>
pfds = {{fd = 48, events = 1, revents = 0}, {fd = 49, events = 1, revents = 1}}
lastping = 1721916356
max_ms = 23000
pollms = 23000
app_ms_elapsed = 17000
__func__ = "ws_handler"
#9 0x00007fcab5055ef6 in ws_direct_handler (node=node@entry=0x7fca840489e0, secure=<optimized out>) at net_ws.c:1167
9/15/2024 12:41 PM — InterLinked
e.g.
[2024-09-15 12:16:21.560] DEBUG[374053]: imap_client_status.c:259 status_size_fetch_incremental: Messages have been expunged (MESSAGES 90610 -> 90609, UIDNEXT 103048 -> 103049)
[2024-09-15 12:16:21.560] DEBUG[374053]: imap_client_status.c:413 append_size_item: Fetching sizes of all messages in INBOX
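Those two lines are the incremental size computation noticing that an expunge happened and falling back to re-fetching the size of every message, roughly the following decision (every name here is an illustrative stand-in, not the actual imap_client_status.c code):

#include <stdint.h>

/* Sketch only: the decision that the two DEBUG lines above reflect. Returns 1
 * if every message's size has to be re-fetched, 0 if only the newly appended
 * range needs to be fetched. */
static int need_full_size_refetch(uint32_t oldmessages, uint32_t curmessages,
	uint32_t olduidnext, uint32_t curuidnext)
{
	uint32_t appended = curuidnext - olduidnext; /* messages added since the cached STATUS */

	if (oldmessages + appended != curmessages) {
		/* The count doesn't add up, so something was expunged (above:
		 * MESSAGES 90610 -> 90609 while UIDNEXT 103048 -> 103049). The cached
		 * per-message sizes can no longer be trusted, so fall back to fetching
		 * the size of every message, which is what makes a ~90,000 message
		 * INBOX take so long here. */
		return 1;
	}
	return 0; /* Only new messages were added; fetch sizes for the appended UID range only */
}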
9/15/2024 12:45 PM — InterLinked
Autoresolved by code commit on GitHub:
mod_webmail: Properly handle high-latency responses.
https://github.com/InterLinked1/lbbs/commit/e842676637e351f27e06708a6d5471fddedf9aff
9/15/2024 12:12 PM — InterLinked
The assertion appears to be triggered by losing synchronization with the IMAP server. Typically, messages have been expunged, which triggers a FETCH 1:* on the IMAP server for a remote mailbox, and that in turn causes the LIST command to time out in libetpan. Since we weren't checking whether the LIST succeeded, we proceeded anyway and then choked when the end of the LIST response arrived later. On top of that, we were aware that we weren't idling, but attempted to restart idling anyway, which is also wrong. Together, these led to the assertion above.
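As a rough sketch of the two missing checks (the mailimap_list() call is libetpan's actual API, but the struct fields, helpers, and control flow here are illustrative stand-ins, not necessarily what the linked commit does):

/* Sketch only: client->imap, client->idling and the surrounding control flow
 * are illustrative, not the actual mod_webmail.c code. */
clist *imap_list = NULL;
res = mailimap_list(client->imap, "", "*", &imap_list);
if (res != MAILIMAP_NO_ERROR) {
	/* Previously we pressed on even when LIST failed (e.g. timed out), and
	 * then choked when the tail of the LIST response arrived much later.
	 * Bail out instead of pretending it worked. */
	return -1;
}
/* ... */
if (!client->idling) {
	/* We already know we are not idling, so there is nothing to restart.
	 * Blindly calling idle_start() here is what trips the
	 * client->imap->imap_selection_info != NULL assertion. */
	return 0;
}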
An easy way to reproduce this is to add a small artificial delay (e.g. 300 microseconds per message) in status_size_fetch_all in imap_client_status.c. This will cause the LIST command to time out on a large mailbox (e.g. almost 30 seconds with a mailbox of 90,000 messages).
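For concreteness, the tweak is just a per-message sleep; the loop below is only a stand-in for the real per-message loop in status_size_fetch_all, and the usleep(300) is the only actual change:

/* Reproduction sketch: the loop is a stand-in for the real per-message loop
 * in status_size_fetch_all() in imap_client_status.c; usleep() needs <unistd.h>.
 * 300 us * ~90,000 messages is about 27 seconds, comfortably past the LIST
 * timeout, which matches the "almost 30 seconds" noted above. */
for (seqno = 1; seqno <= num_messages; seqno++) {
	usleep(300); /* artificial per-message delay, for reproduction only */
	/* ... existing per-message size handling ... */
}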