Crash when shielding funds on randomx testchain #324

Closed
opened 7 months ago by duke · 14 comments
duke commented 7 months ago
Owner

I have done this testing on the `dev` and `antispam` branches and the same thing happens, so this seems to be an existing bug, not caused by the `antispam` branch.

To reproduce: `./src/hushd -ac_algo=randomx -ac_name=ANTISPAM -ac_private=1 -ac_blocktime=180 -ac_reward=500000000 -ac_supply=55555 -gen=1 -genproclimit=1 -testnode=1`

The bug happens when running `z_shieldcoinbase` to shield mining funds. To test the antispam branch I would shield only a single UTXO per tx, so I could quickly create at least ten zutxos and build a "large zin" ztx:

./src/hush-cli -ac_name=ANTISPAM z_shieldcoinbase "*" $(./src/hush-cli -ac_name=ANTISPAM z_getnewaddress) 0 1

Creating between 10 and 20 such transactions, each a t=>z tx with fee=0 shielding a single UTXO, crashes the node quickly, within a few minutes. @onryo ran into the same issue, and it seems to only happen with randomx, not equihash. I can confirm that doing the above testing with equihash works fine and does not crash.

StartNode: Loading addresses for peers.dat at 1696690419805
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
RelayTransaction: No nodes to relay to!
./test_antispam: line 5: 60719 Killed                  ./src/hushd -ac_algo=randomx -ac_name=ANTISPAM -ac_private=1 -ac_blocktime=180 -ac_reward=500000000 -ac_supply=55555 -gen=1 -genproclimit=1 -testnode=1 $@

The one relevant line from debug.log seems to be:

2023-10-07 14:36:55 RandomXMiner: allocating randomx dataset failed!

Using `-ac_randomx=debug` does not seem to add anything useful.

Poster
Owner

There is no coredump: the output says "Killed", not "Segmentation fault", and in /var/log/syslog I find `hush-randomx invoked oom-killer`, which means there is a memory leak and the mining process is being killed by the OOM killer.

Collaborator

I'll have to check if it's the same bug/error, but I know I can get a consistent crash if I stop mining, shield or send funds, and then try to start mining again. It might also consistently crash when trying to shield while mining. This is with the latest release and not related to the antispam code.

Collaborator

Seems to be the same or a similar bug, and I also get `hush-randomx invoked oom-killer`. I have no funds to shield, but this is reproducible by starting mining, stopping mining, sending funds, and then starting mining again. It will also happen if you try to send while mining.

Poster
Owner

@fekt my current guess is that different randomx threads are attempting to use the same data structure, and something is being used uninitialized or after it's been destroyed. RandomX mining has a lot more "setup" than Equihash mining, such as setting up the RandomX VM and everything that goes with it, so it makes sense this only happens with randomx and not equihash. The message `allocating randomx dataset failed` is only a symptom; it tells us we are about to run out of memory and bring the ire of the OOM killer, it's not the core problem.

Poster
Owner

There are different ways that the `RandomXMiner` function in src/miner.cpp can "finish": it can reach the end of the function and return void, or it can throw an exception, either when mining is interrupted normally or when an error happens. I think the randomx memory is not freed correctly when mining stops, i.e. when you see `HushRandomXMiner terminated` in debug.log, because that cleanup code is only run when the function returns void.

Poster
Owner

@fekt the latest commit on `dev` tries to free memory more correctly, but I still see the problem on my testchain with -testnode=1. Have you seen this type of bug on dragonx mainnet, or only on testchains with -testnode=1?

Collaborator

@duke I've only tested and seen this on dragonx mainnet. It seems to crash consistently when trying to send/shield while mining, or if you stop mining, send/shield, and start mining again. I can test the dev branch later.

Seems memory isn't freed up. It's a fairly shitty box I'm testing on, so I wasn't sure if it was just me. I think Sebuh mentioned it a while back when shielding, but I'm not sure if anyone else has the problem. I usually stop mining before shielding/sending to avoid the shielding failing. Shielding/sending will work when mining is stopped, but the node will crash as soon as you start mining again. It's possible the memory gets freed up if you wait longer before starting mining again.

Poster
Owner

@fekt ok, thanks for the details. My current suspicion is that `randomx_destroy_vm` is not called when mining is stopped by the user; it's only called internally when "solver canceled" is logged to debug.log.

Poster
Owner

@fekt there is now a `memleak` branch where the memleak seems to be fixed (only tested on a testchain with -testnode=1), but it causes one invalid block to be mined at each height. Details are in the commit message.

Collaborator

@duke so far it looks to be fixed while testing `memleak` on dragonx mainnet, but I need to find a block to test shielding. No crash when starting/stopping mining and sending multiple times. Sending worked fine while mining too.

That invalid block issue is pre-existing, I think, unless it's happening more frequently with these changes. I've seen it before, and others have reported it but assumed another miner won the block.

Collaborator

Found some blocks, and shielding works while mining as well. I do see the invalid block in stdout occasionally, but I have nothing to really judge whether it makes mining less efficient than it was. This is the stdout I'm seeing:

RandomXMiner: nonce & 0xffff == 0xffff, break
RandomXMiner: nonce & 0xffff == 0xffff, break
TestBlockValidity: failure C checkPOW=1
RandomXMiner: Invalid randomx block mined, try again
000008b25b25faeb79f5bb1a09661551208bda596036b5ae984b0a4118a5c43f
RandomXMiner: gotinvalid=1
RandomXMiner: gotinvalid=1
RandomXMiner: nonce & 0xffff == 0xffff, break
RandomXMiner: nonce & 0xffff == 0xffff, break
RandomXMiner: nonce & 0xffff == 0xffff, break
RandomXMiner: nonce & 0xffff == 0xffff, break
TestBlockValidity: failure C checkPOW=1
RandomXMiner: Invalid randomx block mined, try again
00004dcbbe8f728b276af926249c8325599451446013efad3bc72e872b50e33e
RandomXMiner: gotinvalid=1
RandomXMiner: gotinvalid=1
RandomXMiner: nonce & 0xffff == 0xffff, break
Poster
Owner

@fekt from what I can tell, we only see these invalid blocks in situations where the old code would have crashed. If I mine a testchain with the new code, I don't see them until I stop and restart mining. The specific thing that makes the block invalid is

2023-10-11 16:10:41 ERROR: ContextualCheckBlock: block height mismatch in coinbase

so it seems the block template the miner uses gets out of sync with the latest block height of the network. This can definitely happen in the normal course of mining on a network with other miners: it will happen if another miner on the network finds a block, but your mining node doesn't learn about it until after your node thinks it has found a block. The coinbase height in your block will then be off by one.

I think this happens more than it should, but this bug was covered up by the node crashing, so we never saw it before. It does seem a net win for miners to have this code versus the old code, which crashes: the downtime of a mining node is much worse than a slightly lower effective hashrate (accounting for the fact that you sometimes mine an invalid block).

When mining on a testchain with testnode=1, I see one invalid block mined per block height, because there is only one node mining. In a real scenario, many nodes would be mining and this issue may be spread out across all mining nodes, instead of happening to only one. We should probably test this code on a testchain with 2 mining nodes and see what happens.

So I am inclined to merge this to dev for further testing by others. What are your thoughts @fekt ?

Collaborator

@duke I think it's fine to merge. I'm still finding blocks, potentially more than usual, but I have no real baseline to compare against, along with different difficulty. Maybe sebuh or someone with a high-power rig can test to see if they have a major decrease in blocks?

Poster
Owner

@fekt ok, the `memleak` branch is merged to dev. Let's create a new issue for any new details related to this.

duke closed this issue 7 months ago