coredump in RelayTransaction() #89

Closed
opened 2 years ago by duke · 6 comments
duke commented 2 years ago
Owner

Last message printed to STDOUT is:

RelayTransaction: Relaying to 2 of 5 peers

Coredump:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  RelayTransaction (tx=..., ss=...) at net.cpp:2128
2128	        if(!pnode->fRelayTxes)
[Current thread is 1 (Thread 0x7f75b96f2700 (LWP 9805))]
(gdb) bt
#0  RelayTransaction (tx=..., ss=...) at net.cpp:2128
#1  0x0000556f33ebd1ce in RelayTransaction (tx=...) at net.cpp:2094
#2  0x0000556f33f58e20 in sendrawtransaction (params=..., fHelp=fHelp@entry=false, mypk=...) at rpc/rawtransaction.cpp:1276
#3  0x0000556f343a203b in ShieldToAddress::operator() (this=0x7f75b96f19a0, zaddr=...) at ./pubkey.h:88
#4  0x0000556f343a2620 in boost::apply_visitor<ShieldToAddress, boost::variant<libzcash::InvalidEncoding, libzcash::SaplingPaymentAddress>&> (visitable=..., visitor=...)
    at .../hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/variant/detail/apply_visitor_unary.hpp:66
#5  AsyncRPCOperation_shieldcoinbase::main_impl (this=0x7f756813df70) at wallet/asyncrpcoperation_shieldcoinbase.cpp:196
#6  0x0000556f33d369f5 in AsyncRPCOperation_shieldcoinbase::main (this=0x7f756813df70) at wallet/asyncrpcoperation_shieldcoinbase.cpp:113
#7  0x0000556f33fe1d27 in AsyncRPCQueue::run (this=0x556f36898910, workerId=<optimized out>) at asyncrpcqueue.cpp:83
#8  0x00007f75c5513de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f75c5423609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007f75c51d2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Main problem seems to be that pnode is the null pointer:

(gdb) print pnode
$7 = (CNode *) 0x0

which means pnode->fRelayTxes dereferences the null pointer and crashes.

This happened on the latest commit of the dev branch:

$ git rev-parse HEAD
361998b5d2578d08fee4aa1348b17930595d7908

I have made hundreds of ztx's with this exact same code, so this seems to be a rare edge case, possibly a race condition.

Last message printed to STDOUT is: ``` RelayTransaction: Relaying to 2 of 5 peers ``` Coredump: ``` Program terminated with signal SIGSEGV, Segmentation fault. #0 RelayTransaction (tx=..., ss=...) at net.cpp:2128 2128 if(!pnode->fRelayTxes) [Current thread is 1 (Thread 0x7f75b96f2700 (LWP 9805))] (gdb) bt #0 RelayTransaction (tx=..., ss=...) at net.cpp:2128 #1 0x0000556f33ebd1ce in RelayTransaction (tx=...) at net.cpp:2094 #2 0x0000556f33f58e20 in sendrawtransaction (params=..., fHelp=fHelp@entry=false, mypk=...) at rpc/rawtransaction.cpp:1276 #3 0x0000556f343a203b in ShieldToAddress::operator() (this=0x7f75b96f19a0, zaddr=...) at ./pubkey.h:88 #4 0x0000556f343a2620 in boost::apply_visitor<ShieldToAddress, boost::variant<libzcash::InvalidEncoding, libzcash::SaplingPaymentAddress>&> (visitable=..., visitor=...) at .../hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/variant/detail/apply_visitor_unary.hpp:66 #5 AsyncRPCOperation_shieldcoinbase::main_impl (this=0x7f756813df70) at wallet/asyncrpcoperation_shieldcoinbase.cpp:196 #6 0x0000556f33d369f5 in AsyncRPCOperation_shieldcoinbase::main (this=0x7f756813df70) at wallet/asyncrpcoperation_shieldcoinbase.cpp:113 #7 0x0000556f33fe1d27 in AsyncRPCQueue::run (this=0x556f36898910, workerId=<optimized out>) at asyncrpcqueue.cpp:83 #8 0x00007f75c5513de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6 #9 0x00007f75c5423609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #10 0x00007f75c51d2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 ``` Main problem seems to be that `pnode` is the null pointer: ``` (gdb) print pnode $7 = (CNode *) 0x0 ``` which means `pnode->fRelayTxes` dereferences the null pointer and crashes. This happened on the latest commit of the `dev` branch: ``` $ git rev-parse HEAD 361998b5d2578d08fee4aa1348b17930595d7908 ``` I have made hundreds of ztx's with this exact same code, so this seems to be a rare edge case, possibly a race condition.
Poster
Owner

From /var/log/syslog

Dec 28 06:53:02 host kernel: [2549939.713136] hushd[9805]: segfault at 302 ip 0000556f33eb7be4 sp 00007f75b96f0660 error 4 in hushd[556f33c8e000+c00000]

This tells us that it was not killed by the Out Of Memory Killer (OOM Killer), which was one possibility. Also, it does not seem to be from a laptop going to sleep and waking up, which was another option. This seems to be a genuine bug in the code, which does not check if pnode is a valid pointer.

From /var/log/syslog ``` Dec 28 06:53:02 host kernel: [2549939.713136] hushd[9805]: segfault at 302 ip 0000556f33eb7be4 sp 00007f75b96f0660 error 4 in hushd[556f33c8e000+c00000] ``` This tells us that it was not killed by the Out Of Memory Killer (OOM Killer), which was one possibility. Also, it does not seem to be from a laptop going to sleep and waking up, which was another option. This seems to be a genuine bug in the code, which does not check if `pnode` is a valid pointer.
Poster
Owner

wallet.dat is 145MB, this bug may be more likely for large wallets, I have not seen it on a small wallet. This is not the first time the bug has happened, but it's the first time I captured a full coredump of it.

wallet.dat is 145MB, this bug may be more likely for large wallets, I have not seen it on a small wallet. This is not the first time the bug has happened, but it's the first time I captured a full coredump of it.
Poster
Owner

Digging deeper, we can see that the line which coredumps has not changed in 9 years! But we broke some assumptions that the code makes. It takes out a lock on vNodes but now we have another data structure vRelayNodes which does not have a lock. This is likely the issue.

69d9c6492 (Matt Corallo   2012-08-13 05:26:30 +0200 2113)     LOCK(cs_vNodes);
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2114) 
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2115)     auto vRelayNodes = vNodes;
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2116) 
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2117)     // We always round down, except when we have only 1 connection
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2118)     auto newSize = (vNodes.size() / 2) == 0 ? 1 : (vNodes.size() / 2);
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2119) 
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2120)     random_shuffle( vRelayNodes.begin(), vRelayNodes.end(), GetRandInt );
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2121) 
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2122)     vRelayNodes.resize(newSize);
d8d9f655a7 (Duke Leto      2021-08-26 00:00:38 -0400 2123)     fprintf(stderr, "%s: Relaying to %lu of %lu peers\n", __func__, newSize, vNodes.size() );
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2124) 
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2125)     // Only relay to randomly chosen 50% of peers
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2126)     BOOST_FOREACH(CNode* pnode, vRelayNodes)
269d9c6492 (Matt Corallo   2012-08-13 05:26:30 +0200 2127)     {
4c8fc1a588 (Matt Corallo   2012-08-20 21:10:25 -0400 2128)         if(!pnode->fRelayTxes)
4c8fc1a588 (Matt Corallo   2012-08-20 21:10:25 -0400 2129)             continue;
269d9c6492 (Matt Corallo   2012-08-13 05:26:30 +0200 2130)         LOCK(pnode->cs_filter);
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2131)         if (pnode->pfilter) {
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2132)             if (pnode->pfilter->IsRelevantAndUpdate(tx)) {
269d9c6492 (Matt Corallo   2012-08-13 05:26:30 +0200 2133)                 pnode->PushInventory(inv);
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2134)             }
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2135)         } else {
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2136)             pnode->PushInventory(inv);
315ed51192 (Duke Leto      2021-04-09 14:49:24 -0400 2137)         }
269d9c6492 (Matt Corallo   2012-08-13 05:26:30 +0200 2138)     }
Digging deeper, we can see that the line which coredumps has not changed in 9 years! But we broke some assumptions that the code makes. It takes out a lock on `vNodes` but now we have another data structure `vRelayNodes` which does not have a lock. This is likely the issue. ``` 69d9c6492 (Matt Corallo 2012-08-13 05:26:30 +0200 2113) LOCK(cs_vNodes); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2114) 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2115) auto vRelayNodes = vNodes; 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2116) 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2117) // We always round down, except when we have only 1 connection 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2118) auto newSize = (vNodes.size() / 2) == 0 ? 1 : (vNodes.size() / 2); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2119) 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2120) random_shuffle( vRelayNodes.begin(), vRelayNodes.end(), GetRandInt ); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2121) 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2122) vRelayNodes.resize(newSize); d8d9f655a7 (Duke Leto 2021-08-26 00:00:38 -0400 2123) fprintf(stderr, "%s: Relaying to %lu of %lu peers\n", __func__, newSize, vNodes.size() ); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2124) 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2125) // Only relay to randomly chosen 50% of peers 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2126) BOOST_FOREACH(CNode* pnode, vRelayNodes) 269d9c6492 (Matt Corallo 2012-08-13 05:26:30 +0200 2127) { 4c8fc1a588 (Matt Corallo 2012-08-20 21:10:25 -0400 2128) if(!pnode->fRelayTxes) 4c8fc1a588 (Matt Corallo 2012-08-20 21:10:25 -0400 2129) continue; 269d9c6492 (Matt Corallo 2012-08-13 05:26:30 +0200 2130) LOCK(pnode->cs_filter); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2131) if (pnode->pfilter) { 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2132) if (pnode->pfilter->IsRelevantAndUpdate(tx)) { 269d9c6492 (Matt Corallo 2012-08-13 05:26:30 +0200 2133) pnode->PushInventory(inv); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2134) } 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2135) } else { 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2136) pnode->PushInventory(inv); 315ed51192 (Duke Leto 2021-04-09 14:49:24 -0400 2137) } 269d9c6492 (Matt Corallo 2012-08-13 05:26:30 +0200 2138) } ```
Poster
Owner

I am working on a fix to this. I have code that avoids the coredump but does not fully fix the problem yet.

I am working on a fix to this. I have code that avoids the coredump but does not fully fix the problem yet.
duke referenced this issue from a commit 2 years ago
Poster
Owner

Working on this on the relaytx branch. Currently the coredump is avoided, but also the tx seems to be not relayed correctly and potentially makes other nodes ban us. The wallet may be in an inconsistent state since it thinks some txes were relayed but they were not.

Working on this on the `relaytx` branch. Currently the coredump is avoided, but also the tx seems to be not relayed correctly and potentially makes other nodes ban us. The wallet may be in an inconsistent state since it thinks some txes were relayed but they were not.
Poster
Owner

The coredump has been fixed on the zsweep branch by checking to see if pnode == NULL but the core bug of vRelayNodes not having a lock is still present.

The coredump has been fixed on the `zsweep` branch by checking to see if `pnode == NULL` but the core bug of `vRelayNodes` not having a lock is still present.
duke closed this issue 2 years ago
Sign in to join this conversation.
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.