Assertion `mapInfo.count(nId) == 1' failed #266

Closed
opened 1 year ago by duke · 14 comments
duke commented 1 year ago
Owner

Seems like a rare edge case, possibly from corrupt data on disk :

hushd: addrman.cpp:537: CAddrInfo CAddrMan::Select_(bool): Assertion `mapInfo.count(nId) == 1' failed.

In any case, this kind of error should not coredump hushd.

Seems like a rare edge case, possibly from corrupt data on disk : ``` hushd: addrman.cpp:537: CAddrInfo CAddrMan::Select_(bool): Assertion `mapInfo.count(nId) == 1' failed. ``` In any case, this kind of error should not coredump hushd.
duke self-assigned this 1 year ago
duke commented 1 year ago
Poster
Owner

This happened on Debian 11. Also possibly related:

ERROR: Read: Deserialize or I/O error - Unsupported CAddress disk format version: iostream error
This happened on Debian 11. Also possibly related: ``` ERROR: Read: Deserialize or I/O error - Unsupported CAddress disk format version: iostream error ```
duke added the
bug
label 1 year ago
fekt commented 1 year ago
Collaborator

I am also seeing this assertion a few times now on latest dev branch running stratum. Never seen it before. Also Debian 11. Maybe from BIP155 changes.

I am also seeing this assertion a few times now on latest dev branch running stratum. Never seen it before. Also Debian 11. Maybe from BIP155 changes.
duke commented 1 year ago
Poster
Owner

@fekt this error has to do with peers.dat file on disk being potentially corrupted from a previous crash. If you are running into it more than once, I suggest saving the peers.dat to a new file so we can inspect/debug with it, and let hushd make a brand new peers.dat . This may incur a few extra minutes of downtime when hushd starts up when it populates a new peers.dat file.

@fekt this error has to do with peers.dat file on disk being potentially corrupted from a previous crash. If you are running into it more than once, I suggest saving the peers.dat to a new file so we can inspect/debug with it, and let hushd make a brand new peers.dat . This may incur a few extra minutes of downtime when hushd starts up when it populates a new peers.dat file.
duke commented 1 year ago
Poster
Owner

To explain more, we inherited this code from BTC and what it appears to do is enforce that a given peer id only appears once in peers.dat, i.e. the same peer unique id should not appear more than once in the file. This is usually the case, but it seems like it's possible for it to happen, possibly if hushd crashes.

Like I say, assert()'s in full node code are a bad design practice that both BTC/ZEC subscribe to, instead, we should make the code do the best it can to continue. In this case, we should probably delete duplicate id's until there is only one. Deleting a peer from peers.dat has almost no downside (it will just be re-added later if it was supposed to be there) but crashing the node because peers.dat is slightly inconsistent is horrific behavior.

To explain more, we inherited this code from BTC and what it appears to do is enforce that a given peer id only appears once in peers.dat, i.e. the same peer unique id should not appear more than once in the file. This is usually the case, but it seems like it's possible for it to happen, possibly if hushd crashes. Like I say, `assert()`'s in full node code are a bad design practice that both BTC/ZEC subscribe to, instead, we should make the code do the best it can to continue. In this case, we should probably delete duplicate id's until there is only one. Deleting a peer from peers.dat has almost no downside (it will just be re-added later if it was supposed to be there) but crashing the node because peers.dat is slightly inconsistent is horrific behavior.
fekt commented 1 year ago
Collaborator

I guess I mentioned this originally on TG a couple months ago and didn't remember.

Node last coredumped yesterday and peers.dat not modified since 4/17. I tried dumping and reading with this script, but it gives an invalid header error: https://github.com/asood123/bitpeers

I tried deleting peers.dat, restarted node and let it create a new one, and then stopped node. Tried reading that peers.dat and get an invalid header as well. Could be that python script is too old. Not sure.

I attached the peers.dat from last coredump.

I guess I mentioned this originally on TG a couple months ago and didn't remember. Node last coredumped yesterday and peers.dat not modified since 4/17. I tried dumping and reading with this script, but it gives an invalid header error: https://github.com/asood123/bitpeers I tried deleting peers.dat, restarted node and let it create a new one, and then stopped node. Tried reading that peers.dat and get an invalid header as well. Could be that python script is too old. Not sure. I attached the peers.dat from last coredump.
duke commented 1 year ago
Poster
Owner

@fekt that script looks interesting and yeah, it seems to be from before BTC merged BIP155. You can tell because the invalid header error asserts that version=1 https://github.com/asood123/bitpeers/blob/master/bitpeers.py#L237 which is the old pre-BIP155 peers.dat format version. Still, the script looks useful and can likely be updated to support the latest version with a few lines of changes.

The next step is to improve the mapInfo.count assertion and make it actually report which nId has duplicates, and how many. The code should delete duplicate nId entries instead of doing an assert()

@fekt that script looks interesting and yeah, it seems to be from before BTC merged BIP155. You can tell because the invalid header error asserts that `version=1` https://github.com/asood123/bitpeers/blob/master/bitpeers.py#L237 which is the old pre-BIP155 peers.dat format version. Still, the script looks useful and can likely be updated to support the latest version with a few lines of changes. The next step is to improve the mapInfo.count assertion and make it actually report which `nId` has duplicates, and how many. The code should delete duplicate `nId` entries instead of doing an `assert()`
duke referenced this issue from a commit 1 year ago
duke commented 1 year ago
Poster
Owner

@fekt I realized the issue is not duplicates, since mapInfo is a std::map which only allows a single value for a key. The issue is the code is not finding any key with value nId . I changed the code to log the issue to STDERR with no assertion and to try the next iteration of the loop, since it is randomly choosing values anyway. I will test it on the duke branch and if it doesn't make things worse I will migrate it to the dev branch.

@fekt I realized the issue is not duplicates, since `mapInfo` is a `std::map` which only allows a single value for a key. The issue is the code is not finding any key with value `nId` . I changed the code to log the issue to STDERR with no assertion and to try the next iteration of the loop, since it is randomly choosing values anyway. I will test it on the `duke` branch and if it doesn't make things worse I will migrate it to the `dev` branch.
fekt commented 1 year ago
Collaborator

@duke sounds good. i have it running on a box now and will update if anything worthy.

@duke sounds good. i have it running on a box now and will update if anything worthy.
fekt commented 1 year ago
Collaborator

This is still running fine and hasn't crashed. I just installed duke branch on lite2.hushpool.is because that kept crashing with same assert.

This is still running fine and hasn't crashed. I just installed `duke` branch on lite2.hushpool.is because that kept crashing with same assert.
fekt commented 1 year ago
Collaborator

lite2.hushpool.is crashed with this now

addrman.cpp:153: void CAddrMan::SwapRandom(unsigned int, unsigned int): Assertion 'it_1 != mapInfo.end()' failed.

lite2.hushpool.is crashed with this now `addrman.cpp:153: void CAddrMan::SwapRandom(unsigned int, unsigned int): Assertion 'it_1 != mapInfo.end()' failed.`
duke commented 1 year ago
Poster
Owner

@fekt thanks for reporting. I only removed some assertions because I never saw any of the others happen, but now that those are removed, it's possible that other assertions are now being triggered.

@fekt thanks for reporting. I only removed some assertions because I never saw any of the others happen, but now that those are removed, it's possible that other assertions are now being triggered.
Poster
Owner

If/when we close #297, this issue should be closed as well

If/when we close #297, this issue should be closed as well
Collaborator

@duke should this issue be closed as issue #297 was closed?

@duke should this issue be closed as issue https://git.hush.is/hush/hush3/issues/297 was closed?
Poster
Owner

seems to be fixed, closing

seems to be fixed, closing
duke closed this issue 10 months ago
Sign in to join this conversation.
No Milestone
No project
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.