Crash when shielding funds on randomx testchain
#324
Closed
opened 7 months ago by duke · 14 comments
I have done this testing on the `dev` and `antispam` branches and the same thing happens, so this seems to be an existing bug, not caused by the `antispam` branch.
To reproduce:
./src/hushd -ac_algo=randomx -ac_name=ANTISPAM -ac_private=1 -ac_blocktime=180 -ac_reward=500000000 -ac_supply=55555 -gen=1 -genproclimit=1 -testnode=1
The bug happens when running `z_shieldcoinbase` to shield mining funds. To test the antispam branch I would shield only a single utxo per tx so I could quickly create at least ten zutxos to create a "large zin" ztx:
Creating between 10 and 20 t=>z transactions, each with fee=0 and shielding a single UTXO, will crash the node quickly, within a few minutes. @onryo ran into the same issues and it seems to only happen with randomx, not equihash. I can confirm that doing the above testing with equihash works fine and does not crash.
The one relevant line from debug.log seems to be:
Using `-ac_randomx=debug` does not seem to add anything useful.
There is no coredump, you can see it says "Killed" not "Segmentation fault" and in /var/log/syslog I find `hush-randomx invoked oom-killer`, which means there is a memory leak and the mining process is killed by the OOM killer.
I'll have to check if it's the same bug/error, but I know I can get a consistent crash if I stop mining, shield or send funds, and then try to start mining again. It might consistently crash when trying to shield when mining and this is with latest release and not related to antispam code.
Seems to be same/similar bug and I get `hush-randomx invoked oom-killer`. I have no funds to shield, but this is reproducible by starting mining, stopping mining, sending funds, and then starting mining again. It will also happen if you try to send while mining.
@fekt my current guess is that different randomx threads are attempting to use the same data structure and something is being used uninitialized or after it's been destroyed. RandomX mining has a lot more "setup" than Equihash mining, such as setting up the RandomX VM and everything that goes with it, so it makes sense this only happens with randomx and not equihash. The message `allocating randomx dataset failed` is only a symptom: it tells us we are about to run out of memory and bring down the ire of the OOM killer, it's not the core problem.
There are different ways that the `RandomXMiner` function in src/miner.cpp can "finish": it can reach the end of the function and return void, or it can throw an exception when mining is interrupted normally or when an error happens. I think that randomx memory is not freed correctly when mining stops, i.e. when you see `HushRandomXMiner terminated` in debug.log, because the code that frees it is only run when the function returns void.
@fekt latest commit on dev tries to free memory more correctly but I still see the problem on my testchain with testnode=1. Have you seen this type of bug on dragonx mainnet or only on testchains with testnode=1?
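To illustrate the two exit paths described above (normal return versus exception), here is a minimal sketch of tying cleanup to scope so it runs on both paths. Everything in it is a hypothetical stand-in: `FakeVm`, `fake_create_vm`, `fake_destroy_vm` and `miner_sketch` are invented for illustration and are not the RandomX API or the actual code in src/miner.cpp, and this is not necessarily how the hushd code handles it.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for a RandomX VM; NOT the real RandomX API.
struct FakeVm {};

// Counts VMs that were created but not yet destroyed.
static int g_live_vms = 0;

FakeVm* fake_create_vm() { ++g_live_vms; return new FakeVm; }
void fake_destroy_vm(FakeVm* vm) { --g_live_vms; delete vm; }

// Deleter so that unique_ptr calls the destroy function when it goes out of scope.
struct VmDeleter {
    void operator()(FakeVm* vm) const { fake_destroy_vm(vm); }
};
using VmPtr = std::unique_ptr<FakeVm, VmDeleter>;

// Simplified miner body: either returns normally or throws when interrupted.
void miner_sketch(bool interrupted) {
    VmPtr vm(fake_create_vm());
    if (interrupted) {
        // Stack unwinding still runs VmDeleter, so the VM is released here too.
        throw std::runtime_error("mining interrupted");
    }
    // Normal path: vm is released when the function returns.
}

// Runs both exit paths and reports how many VMs are still alive afterwards.
int live_vms_after_both_exit_paths() {
    miner_sketch(false);
    try {
        miner_sketch(true);
    } catch (const std::runtime_error&) {
        // Interruption is expected in this sketch.
    }
    return g_live_vms;
}
```

If cleanup were instead written as a plain call at the end of `miner_sketch`, the interrupted path would skip it and the counter would stay above zero, which is the kind of leak suspected here.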
@duke I've only tested and seen this on dragonx mainnet. It seems to crash consistently when trying to send/shield when mining or if you stop mining, send/shield, and start mining again. I can test dev branch later.
Seems memory isn't freed up. It's a fairly shitty box I am testing on so I wasn't sure if it was just me. I think Sebuh mentioned it awhile back when shielding but not sure if anyone else has the problem. I usually always stop mining before shielding/sending to avoid the shielding failing. Shielding/sending will work when stopping mining, but the node will crash as soon as you start mining again. It's possible the memory gets freed up if waiting longer before starting mining again.
@fekt ok, thanks for the details. My current suspicion is that `randomx_destroy_vm` is not called when mining is stopped by the user, it's only called internally when "solver canceled" is logged to debug.log.
@fekt there is now a `memleak` branch where the memleak seems to be fixed (only tested on a testchain with -testnode=1) but it causes 1 invalid block to be mined for each height. Details are in the commit message.
@duke so far it looks to be fixed while testing `memleak` on dragonx mainnet but i need to find a block to test shielding. no crash when starting/stopping mining and sending multiple times. sent fine while mining too.
that invalid block issue is pre-existing i think unless it's happening more frequently with these changes. i've seen it before and others reported it but assumed another miner won block.
Found some blocks and shielding works while mining as well. I do see the invalid block in stdout occasionally, but I have nothing to really judge if it makes mining less efficient than it was. This is stdout I'm seeing:
@fekt from what I can tell, we only see these invalid blocks when the old code would crash. If I mine a testchain with the new code, I don't see them until I try to stop and restart mining. The specific thing that makes the block invalid is
so it's something like the block template the miner uses gets out of sync with the latest blockheight of the network. This can definitely happen in the normal course of mining on a network with other miners: it will happen if some other miner on the network finds a block, but your mining node doesn't learn about it until after your node thinks it finds a block. The coinbase height in your block will be off by 1.
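The stale-template scenario above can be sketched as a simple height check. This is only an illustration under the assumption that a block built on a tip at height H must carry coinbase height H + 1; the `BlockTemplate` struct and `template_is_current` function are hypothetical and are not hushd's real types or validation code.

```cpp
#include <cstdint>

// Hypothetical, simplified template; NOT hushd's real block template structure.
struct BlockTemplate {
    int32_t coinbase_height;  // height written into the coinbase when the template was built
};

// Assumption: a block extending a tip at height H must carry coinbase height H + 1.
// If another miner's block arrives after the template was built, the tip moves
// to H + 1 and the template's coinbase height is now off by 1.
bool template_is_current(const BlockTemplate& tmpl, int32_t chain_tip_height) {
    return tmpl.coinbase_height == chain_tip_height + 1;
}
```

For example, a template built while the tip was at height 100 has coinbase height 101; it passes the check at tip 100 and fails it once the tip has advanced to 101.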
I think this happens more than it should, but this bug was covered up by the node crashing, so we never saw it before. It does seem a net win for miners to have this code versus the old code which crashes. The downtime of a mining node is much worse than a potentially smaller hashrate (by taking into account you sometimes mine an invalid block).
When mining on a testchain with testnode=1, I see one invalid block mined per block height, because there is only one node mining. In a real scenario, many nodes would be mining and this issue may be spread out across all mining nodes, instead of happening to only one. We should probably test this code on a testchain with 2 mining nodes and see what happens.
So I am inclined to merge this to dev for further testing by others. What are your thoughts @fekt ?
@duke i think it's fine to merge. i'm still finding blocks, potentially more than usual but i have no real baseline to compare against along with different difficulty. maybe sebuh or someone with a high power rig can test to see if they have a major decrease in blocks?
@fekt ok, memleak branch is merged to dev. Let's create a new issue for any new details related to this