Coredump on reduce_memory branch #292

Closed
opened 1 year ago by duke · 10 comments
duke commented 1 year ago
Owner

hushd had recently crashed for me when doing a long rescan with rescanheight and keepnotewitnesscache=1 parameters, using the dev branch, and then I used the same exact command on the reduce_memory branch and got the assertion from this new code added in the branch in src/chain.h :

uint256 GetBlockHash() const
{
        assert(phashBlock);
        return *phashBlock;
}     

When I uncomment the assert and add

fprintf(stderr, "GetBlockHash: phashBlock NULL at height %d\n", this->GetHeight());

I get


Sapling spend: 47958396 bytes, output: 3592860 bytes
AppInitServers: start
AppInit2: Loading block index...
GetBlockHash: phashBlock NULL at height 1119143

with a coredump backtrace of

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055a1b46aaca6 in CBlockIndex::GetBlockHash (this=0x7fff64a4bda0) at chain.h:516
516	        return *phashBlock;
[Current thread is 1 (Thread 0x7fedcd7b2c80 (LWP 739242))]
(gdb) bt
#0  0x000055a1b46aaca6 in CBlockIndex::GetBlockHash (this=0x7fff64a4bda0) at chain.h:516
#1  CBlockTreeDB::LoadBlockIndexGuts (this=<optimized out>) at txdb.cpp:701
#2  0x000055a1b44f3d1a in LoadBlockIndexDB () at main.cpp:5765
#3  0x000055a1b44f53ac in LoadBlockIndex () at main.cpp:6269
#4  0x000055a1b447b82f in AppInit2 (threadGroup=..., scheduler=...) at init.cpp:1897
#5  0x000055a1b44192c4 in AppInit (argc=<optimized out>, argv=<optimized out>) at bitcoind.cpp:238
#6  0x000055a1b440171f in main (argc=6, argv=0x7fff64a4ed98) at bitcoind.cpp:268
hushd had recently crashed for me when doing a long rescan with rescanheight and keepnotewitnesscache=1 parameters, using the dev branch, and then I used the same exact command on the reduce_memory branch and got the assertion from this new code added in the branch in src/chain.h : ``` uint256 GetBlockHash() const { assert(phashBlock); return *phashBlock; } ``` When I uncomment the assert and add ``` fprintf(stderr, "GetBlockHash: phashBlock NULL at height %d\n", this->GetHeight()); ``` I get ``` Sapling spend: 47958396 bytes, output: 3592860 bytes AppInitServers: start AppInit2: Loading block index... GetBlockHash: phashBlock NULL at height 1119143 ``` with a coredump backtrace of ``` Program terminated with signal SIGSEGV, Segmentation fault. #0 0x000055a1b46aaca6 in CBlockIndex::GetBlockHash (this=0x7fff64a4bda0) at chain.h:516 516 return *phashBlock; [Current thread is 1 (Thread 0x7fedcd7b2c80 (LWP 739242))] (gdb) bt #0 0x000055a1b46aaca6 in CBlockIndex::GetBlockHash (this=0x7fff64a4bda0) at chain.h:516 #1 CBlockTreeDB::LoadBlockIndexGuts (this=<optimized out>) at txdb.cpp:701 #2 0x000055a1b44f3d1a in LoadBlockIndexDB () at main.cpp:5765 #3 0x000055a1b44f53ac in LoadBlockIndex () at main.cpp:6269 #4 0x000055a1b447b82f in AppInit2 (threadGroup=..., scheduler=...) at init.cpp:1897 #5 0x000055a1b44192c4 in AppInit (argc=<optimized out>, argv=<optimized out>) at bitcoind.cpp:238 #6 0x000055a1b440171f in main (argc=6, argv=0x7fff64a4ed98) at bitcoind.cpp:268 ```
duke added the
bug
label 1 year ago
duke commented 1 year ago
Poster
Owner

The reason I consider this a bug is that if I use the dev branch instead of reduce_memory, with the exact same hushd command, it can continue to rescan the node, even after the crash. So this is shitty upstream Zcash code, which adds an assertion that crashes the node and makes it very hard for the user to continue. In this situation, a user likely has to reindex the entire chain history or do a fresh sync. Adding assert() is very user hostile and we should do better than Zcash, who obviously did not test this code in a variety of situations. It's likely that many Zcash users will not be able to easily recover after a crash and will be forced to do a fresh sync or reindex of all time 💩

The reason I consider this a bug is that if I use the `dev` branch instead of `reduce_memory`, with the exact same hushd command, it can continue to rescan the node, even after the crash. So this is shitty upstream Zcash code, which adds an assertion that crashes the node and makes it very hard for the user to continue. In this situation, a user likely has to reindex the entire chain history or do a fresh sync. Adding `assert()` is very user hostile and we should do better than Zcash, who obviously did not test this code in a variety of situations. It's likely that many Zcash users will not be able to easily recover after a crash and will be forced to do a fresh sync or reindex of all time :poop:
fekt commented 1 year ago
Collaborator

@duke I am getting the same coredump after building and trying to run. I was running dev branch and stopped it before trying to run reduce_memory branch. No crashed node or rescan and the dev branch starts back up fine.

This happens directly on startup of reduce_memory branch.

AppInit2: Loading block index...
GetBlockHash: phashBlock NULL at height 1119143
hushd: chain.h:517: uint256 CBlockIndex::GetBlockHash() const: Assertion `phashBlock' failed.
Aborted (core dumped)
@duke I am getting the same coredump after building and trying to run. I was running `dev` branch and stopped it before trying to run `reduce_memory` branch. No crashed node or rescan and the `dev` branch starts back up fine. This happens directly on startup of `reduce_memory` branch. ``` AppInit2: Loading block index... GetBlockHash: phashBlock NULL at height 1119143 hushd: chain.h:517: uint256 CBlockIndex::GetBlockHash() const: Assertion `phashBlock' failed. Aborted (core dumped) ```
duke commented 1 year ago
Poster
Owner

@fekt thanks for testing. I guess we need to see if the code works correctly on a fresh sync, that might give us a clue.

@fekt thanks for testing. I guess we need to see if the code works correctly on a fresh sync, that might give us a clue.
duke commented 1 year ago
Poster
Owner

The latest code on this branch attempts to recalculate the blockhash if phashBlock is null. It asks the block index on disk for the data but does not yet store the data in phashBlock. Needs testing.

The latest code on this branch attempts to recalculate the blockhash if `phashBlock` is null. It asks the block index on disk for the data but does not yet store the data in `phashBlock`. Needs testing.
duke commented 1 year ago
Poster
Owner

I am testing the latest code on a fresh sync, and there is no crash. So it seems that this bug only happens when hushd partially syncs with a different branch and then tries to continue syncing with the reduce_memory code.

I am testing the latest code on a fresh sync, and there is no crash. So it seems that this bug only happens when hushd partially syncs with a different branch and then tries to continue syncing with the `reduce_memory` code.
duke commented 1 year ago
Poster
Owner

The latest code gets further in the startup process, but eventually crashes. To test, I synced a small number of blocks with the duke branch, then stopped the node and then continued the sync with the reduce_memory branch .

STDERR looks like :

AppInitServers: start
AppInit2: Loading block index...
GetBlockHash: phashBlock NULL at height 427881, returning blockhash=16ac79b935b6920c5c9252df171166ba7179fa5f98cf0702b0e2c49f9136e21d instead
GetBlockHash: phashBlock NULL at height 502787, returning blockhash=0b18e01b5cca6ac47deb04de8feacef7fd0805123b561a8e2307d9c7540d81ea instead

followed by about 650K lines of "phashBlock NULL" lines and then finally

GetBlockHash: phashBlock NULL at height 291401, returning blockhash=e8e446f496e5cd67530d7ea7390953c0b1d22a5c77d0c13b7b7457105e6c573b instead
load blockindexDB chained 1682259458
load blockindexDB 1682259458
.............about to prune block index
finished loading blocks HUSH3
processing /home/xxx/.hush/HUSH3/hushstate 879KB, validated.-1
inds.0x55a6afb3a580 validate /home/xxx/.hush/HUSH3/hushstate.ind fsize.277732 datalen.900751 n.69433 lastfpos.0
took 0 seconds to process /home/xxx/.hush/HUSH3/hushstate 879KB
zindex=disabled in block index
Error: Error: A fatal internal error occurred, see debug.log for details
StartShutdown: fRequestShudown=true
AppInit2: error in hd data
: Unable to rewind the database to a pre-upgrade state. You will need to redownload the blockchain.

error in HDD data, might just need to update to latest, if that doesnt work, then you need to resync
Shutdown: stopping HUSH HTTP/REST/RPC
Shutdown: stopping node

debug.log has a few more details :

2023-04-23 14:17:37 LoadBlockIndexDB: loaded guts
2023-04-23 14:17:38 LoadBlockIndexDB: last block file = 1
2023-04-23 14:17:38 LoadBlockIndexDB: last block file info: CBlockFileInfo(blocks=29592, size=89251732, heights=39261...68852, time=2019-06-20...2019-08-11)
2023-04-23 14:17:38 Checking all blk files are present...
2023-04-23 14:17:38 LoadBlockIndexDB: transaction index enabled
2023-04-23 14:17:38 LoadBlockIndexDB: address index disabled
2023-04-23 14:17:38 LoadBlockIndexDB: shielded index disabled
2023-04-23 14:17:38 LoadBlockIndexDB: timestamp index disabled
2023-04-23 14:17:38 LoadBlockIndexDB: spent index disabled
2023-04-23 14:17:38 LoadBlockIndexDB: hashBestChain=00000002b016b4dfba34d925cc2a61f7c342d0fb0e1ed2a2ca7560a2107d6725 height=0 date=1970-01-01 00:00:00 progress=0.500000
2023-04-23 14:17:38 init message: Rewinding blocks if needed...
2023-04-23 14:17:39 operator(): Failed to read index entry*** Files to write to block index database
2023-04-23 14:17:39 Error: Error: A fatal internal error occurred, see debug.log for details
2023-04-23 14:17:39 : Unable to rewind the database to a pre-upgrade state. You will need to redownload the blockchain.

error in HDD data, might just need to update to latest, if that doesnt work, then you need to resync
2023-04-23 14:17:39 Aborted block database rebuild. Exiting.
2023-04-23 14:17:39 Shutdown: In progress...
2023-04-23 14:17:39 StopRPC: waiting for async rpc workers to stop
2023-04-23 14:17:39 scheduler thread interrupt
2023-04-23 14:17:39 StopNode()
The latest code gets further in the startup process, but eventually crashes. To test, I synced a small number of blocks with the `duke` branch, then stopped the node and then continued the sync with the `reduce_memory` branch . STDERR looks like : ``` AppInitServers: start AppInit2: Loading block index... GetBlockHash: phashBlock NULL at height 427881, returning blockhash=16ac79b935b6920c5c9252df171166ba7179fa5f98cf0702b0e2c49f9136e21d instead GetBlockHash: phashBlock NULL at height 502787, returning blockhash=0b18e01b5cca6ac47deb04de8feacef7fd0805123b561a8e2307d9c7540d81ea instead ``` followed by about 650K lines of "phashBlock NULL" lines and then finally ``` GetBlockHash: phashBlock NULL at height 291401, returning blockhash=e8e446f496e5cd67530d7ea7390953c0b1d22a5c77d0c13b7b7457105e6c573b instead load blockindexDB chained 1682259458 load blockindexDB 1682259458 .............about to prune block index finished loading blocks HUSH3 processing /home/xxx/.hush/HUSH3/hushstate 879KB, validated.-1 inds.0x55a6afb3a580 validate /home/xxx/.hush/HUSH3/hushstate.ind fsize.277732 datalen.900751 n.69433 lastfpos.0 took 0 seconds to process /home/xxx/.hush/HUSH3/hushstate 879KB zindex=disabled in block index Error: Error: A fatal internal error occurred, see debug.log for details StartShutdown: fRequestShudown=true AppInit2: error in hd data : Unable to rewind the database to a pre-upgrade state. You will need to redownload the blockchain. error in HDD data, might just need to update to latest, if that doesnt work, then you need to resync Shutdown: stopping HUSH HTTP/REST/RPC Shutdown: stopping node ``` debug.log has a few more details : ``` 2023-04-23 14:17:37 LoadBlockIndexDB: loaded guts 2023-04-23 14:17:38 LoadBlockIndexDB: last block file = 1 2023-04-23 14:17:38 LoadBlockIndexDB: last block file info: CBlockFileInfo(blocks=29592, size=89251732, heights=39261...68852, time=2019-06-20...2019-08-11) 2023-04-23 14:17:38 Checking all blk files are present... 2023-04-23 14:17:38 LoadBlockIndexDB: transaction index enabled 2023-04-23 14:17:38 LoadBlockIndexDB: address index disabled 2023-04-23 14:17:38 LoadBlockIndexDB: shielded index disabled 2023-04-23 14:17:38 LoadBlockIndexDB: timestamp index disabled 2023-04-23 14:17:38 LoadBlockIndexDB: spent index disabled 2023-04-23 14:17:38 LoadBlockIndexDB: hashBestChain=00000002b016b4dfba34d925cc2a61f7c342d0fb0e1ed2a2ca7560a2107d6725 height=0 date=1970-01-01 00:00:00 progress=0.500000 2023-04-23 14:17:38 init message: Rewinding blocks if needed... 2023-04-23 14:17:39 operator(): Failed to read index entry*** Files to write to block index database 2023-04-23 14:17:39 Error: Error: A fatal internal error occurred, see debug.log for details 2023-04-23 14:17:39 : Unable to rewind the database to a pre-upgrade state. You will need to redownload the blockchain. error in HDD data, might just need to update to latest, if that doesnt work, then you need to resync 2023-04-23 14:17:39 Aborted block database rebuild. Exiting. 2023-04-23 14:17:39 Shutdown: In progress... 2023-04-23 14:17:39 StopRPC: waiting for async rpc workers to stop 2023-04-23 14:17:39 scheduler thread interrupt 2023-04-23 14:17:39 StopNode() ```
duke commented 1 year ago
Poster
Owner

It seems that on line 1943 of src/init.cpp the call to RewindBlockIndex() is failing, which gives the Unable to rewind the database... error. Inside that function is a call to pblocktree->WriteBatchSync() which seems to fail on line 3698 of src/main.cpp . That is what calls AbortNode with "Files to write to block index database" . WriteBatchSync is a thin wrapper around CDBWrapper::WriteBatch which is a thinner wrapper around leveldb::DB->Write , which can be seen in src/dbwrapper.cpp . So the origin of this bug is LevelDB is not happy (which is the database format of the block index), which is most likely caused by some data not being present which is what causes phashBlock to be null.

It seems that on line 1943 of src/init.cpp the call to `RewindBlockIndex()` is failing, which gives the `Unable to rewind the database...` error. Inside that function is a call to `pblocktree->WriteBatchSync()` which seems to fail on line 3698 of src/main.cpp . That is what calls `AbortNode` with "Files to write to block index database" . `WriteBatchSync` is a thin wrapper around `CDBWrapper::WriteBatch` which is a thinner wrapper around `leveldb::DB->Write` , which can be seen in src/dbwrapper.cpp . So the origin of this bug is LevelDB is not happy (which is the database format of the block index), which is most likely caused by some data not being present which is what causes `phashBlock` to be null.
Poster
Owner

There was some important code missing in this branch and when I added it, we now get a different kind of error, which at least does not corrupt the block index.

ConnectBlock(): hashPrevBlock != view.GetBestBlock() 0000000000000000000000000000000000000000000000000000000000000000 != 00000000511faa4661e6a9e2c648a23efff04dbeb42e3651283d31fc04d13d0f
hushd: main.cpp:2535: void InvalidChainFound(CBlockIndex*): Assertion `tip' failed.
There was some important code missing in this branch and when I added it, we now get a different kind of error, which at least does not corrupt the block index. ``` ConnectBlock(): hashPrevBlock != view.GetBestBlock() 0000000000000000000000000000000000000000000000000000000000000000 != 00000000511faa4661e6a9e2c648a23efff04dbeb42e3651283d31fc04d13d0f hushd: main.cpp:2535: void InvalidChainFound(CBlockIndex*): Assertion `tip' failed. ```
Poster
Owner

There is new code on this branch, currently undergoing testing

There is new code on this branch, currently undergoing testing
duke self-assigned this 7 months ago
Poster
Owner

this no longer happens with the new code, closing

this no longer happens with the new code, closing
duke closed this issue 7 months ago
Sign in to join this conversation.
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.