lightwalletd should not crash if hushd is down #53

Closed
opened 1 year ago by duke · 11 comments
duke commented 1 year ago
Owner

Yep

  • Increase max retryCount in FirstRPC
  • Add retryCount code to BlockIngestor call to getbestblockhash RPC
  • Make all RPCs retry
  • Test dev branch
Yep * [x] Increase max retryCount in `FirstRPC` * [x] Add retryCount code to `BlockIngestor` call to getbestblockhash RPC * [x] Make all RPCs retry * [ ] Test dev branch
duke commented 1 year ago
Poster
Owner

Next step for this issue is to find the code that actually crashes lightwalletd when hushd is down and then we will know what code needs to change. Instead of crashing, a better behavior would be to start a polling timer that looks to see if hushd is back up every 60 seconds.

Next step for this issue is to find the code that actually crashes lightwalletd when hushd is down and then we will know what code needs to change. Instead of crashing, a better behavior would be to start a polling timer that looks to see if hushd is back up every 60 seconds.
fekt commented 1 year ago
Collaborator

I haven't cehcked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).

I haven't cehcked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).
fekt commented 1 year ago
Collaborator

Seems to usually be this BlockIngestor and call for getbestblockhash that causes lightwalletd to exit if hushd is not running:
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L442

It may depend on the call at the time too as there are multiple .Fatal. I've seen getblockchaininfo calls sometimes fall multiple times before connecting, but still under limit of 10 before it exits. This is what I was referring to seeing in logs with retries, but that might be from node still starting up.
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266

Seems to usually be this `BlockIngestor` and call for `getbestblockhash` that causes lightwalletd to exit if hushd is not running: https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L442 It may depend on the call at the time too as there are multiple .Fatal. I've seen `getblockchaininfo` calls sometimes fall multiple times before connecting, but still under limit of 10 before it exits. This is what I was referring to seeing in logs with retries, but that might be from node still starting up. https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266
duke commented 1 year ago
Poster
Owner

@fekt thanks for the info. It looks like we will need to change a few places in the code. The FirstRPC function seems like an easy change, we can increase retryCount to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. The BlockIngester code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similar retryCount code to that. There may be other places that also need to be updated, but these two are a good place to start.

@fekt thanks for the info. It looks like we will need to change a few places in the code. The `FirstRPC` function seems like an easy change, we can increase `retryCount` to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. The `BlockIngester` code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similar `retryCount` code to that. There may be other places that also need to be updated, but these two are a good place to start.
duke commented 1 year ago
Poster
Owner

I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.

I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.
duke commented 1 year ago
Poster
Owner

The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run hush-cli stop and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again and it will log "RPC successful" . If 50 retries happen, lightwalletd will exit with a fatal error, as before.

The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run `hush-cli stop` and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again and it will log "RPC successful" . If 50 retries happen, lightwalletd will exit with a fatal error, as before.
fekt commented 1 year ago
Collaborator

I have dev branch on lite2.hushpool.is and will try to do some testing later or over the weekend.

I have dev branch on lite2.hushpool.is and will try to do some testing later or over the weekend.
Collaborator

Seems good.

I stopped hushd and lightwalletd kept running for over 10 retries of getbestblockhash. Started hushd and everything kept running and started syncing where it left off.

Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of getblockchaininfo. Started hushd and everything kept running and started syncing where it left off.

Seems good. I stopped hushd and lightwalletd kept running for over 10 retries of `getbestblockhash`. Started hushd and everything kept running and started syncing where it left off. Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of `getblockchaininfo`. Started hushd and everything kept running and started syncing where it left off.
Poster
Owner

@fekt awesome, thanks for testing. Closing this.

@fekt awesome, thanks for testing. Closing this.
duke closed this issue 12 months ago
Collaborator

Should a new 0.1.3 release be made to release this fix?

Should a new 0.1.3 release be made to release this fix?
Collaborator

Nvm, just saw #54

Nvm, just saw https://git.hush.is/hush/lightwalletd/issues/54
Sign in to join this conversation.
No Label
No Milestone
No project
No Assignees
3 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.