Add retryCount code to BlockIngestor call to getbestblockhash RPC
Make all RPCs retry
Test dev branch
Yep
* [x] Increase max retryCount in `FirstRPC`
* [x] Add retryCount code to `BlockIngestor` call to getbestblockhash RPC
* [x] Make all RPCs retry
* [ ] Test dev branch
Next step for this issue is to find the code that actually crashes lightwalletd when hushd is down and then we will know what code needs to change. Instead of crashing, a better behavior would be to start a polling timer that looks to see if hushd is back up every 60 seconds.
Next step for this issue is to find the code that actually crashes lightwalletd when hushd is down and then we will know what code needs to change. Instead of crashing, a better behavior would be to start a polling timer that looks to see if hushd is back up every 60 seconds.
I haven't cehcked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).
I haven't cehcked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).
It may depend on the call at the time too as there are multiple .Fatal. I've seen getblockchaininfo calls sometimes fall multiple times before connecting, but still under limit of 10 before it exits. This is what I was referring to seeing in logs with retries, but that might be from node still starting up. https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266
Seems to usually be this `BlockIngestor` and call for `getbestblockhash` that causes lightwalletd to exit if hushd is not running:
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L442
It may depend on the call at the time too as there are multiple .Fatal. I've seen `getblockchaininfo` calls sometimes fall multiple times before connecting, but still under limit of 10 before it exits. This is what I was referring to seeing in logs with retries, but that might be from node still starting up.
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266
@fekt thanks for the info. It looks like we will need to change a few places in the code. The FirstRPC function seems like an easy change, we can increase retryCount to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. The BlockIngester code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similar retryCount code to that. There may be other places that also need to be updated, but these two are a good place to start.
@fekt thanks for the info. It looks like we will need to change a few places in the code. The `FirstRPC` function seems like an easy change, we can increase `retryCount` to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. The `BlockIngester` code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similar `retryCount` code to that. There may be other places that also need to be updated, but these two are a good place to start.
I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.
I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.
The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run hush-cli stop and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again and it will log "RPC successful" . If 50 retries happen, lightwalletd will exit with a fatal error, as before.
The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run `hush-cli stop` and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again and it will log "RPC successful" . If 50 retries happen, lightwalletd will exit with a fatal error, as before.
I stopped hushd and lightwalletd kept running for over 10 retries of getbestblockhash. Started hushd and everything kept running and started syncing where it left off.
Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of getblockchaininfo. Started hushd and everything kept running and started syncing where it left off.
Seems good.
I stopped hushd and lightwalletd kept running for over 10 retries of `getbestblockhash`. Started hushd and everything kept running and started syncing where it left off.
Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of `getblockchaininfo`. Started hushd and everything kept running and started syncing where it left off.
Yep
FirstRPC
BlockIngestor
call to getbestblockhash RPCNext step for this issue is to find the code that actually crashes lightwalletd when hushd is down and then we will know what code needs to change. Instead of crashing, a better behavior would be to start a polling timer that looks to see if hushd is back up every 60 seconds.
I haven't cehcked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).
Seems to usually be this
BlockIngestor
and call forgetbestblockhash
that causes lightwalletd to exit if hushd is not running:https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L442
It may depend on the call at the time too as there are multiple .Fatal. I've seen
getblockchaininfo
calls sometimes fall multiple times before connecting, but still under limit of 10 before it exits. This is what I was referring to seeing in logs with retries, but that might be from node still starting up.https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266
@fekt thanks for the info. It looks like we will need to change a few places in the code. The
FirstRPC
function seems like an easy change, we can increaseretryCount
to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. TheBlockIngester
code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similarretryCount
code to that. There may be other places that also need to be updated, but these two are a good place to start.I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.
The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run
hush-cli stop
and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again and it will log "RPC successful" . If 50 retries happen, lightwalletd will exit with a fatal error, as before.I have dev branch on lite2.hushpool.is and will try to do some testing later or over the weekend.
Seems good.
I stopped hushd and lightwalletd kept running for over 10 retries of
getbestblockhash
. Started hushd and everything kept running and started syncing where it left off.Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of
getblockchaininfo
. Started hushd and everything kept running and started syncing where it left off.@fekt awesome, thanks for testing. Closing this.
Should a new 0.1.3 release be made to release this fix?
Nvm, just saw #54