lightwalletd should not crash if hushd is down #53

Yep

Increase max retryCount in FirstRPC
Add retryCount code to BlockIngestor call to getbestblockhash RPC
Make all RPCs retry
Test dev branch

The next step for this issue is to find the code that actually crashes lightwalletd when hushd is down; then we will know what needs to change. Instead of crashing, a better behavior would be to start a polling timer that checks every 60 seconds whether hushd is back up.

I haven't checked the code yet, but from the last log I looked at, it seems to have code to retry connecting and aborts after x failures (somewhere between 3 and 5).

It usually seems to be this BlockIngestor call to getbestblockhash that causes lightwalletd to exit if hushd is not running:
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L442

It may also depend on which call is being made at the time, as there are multiple .Fatal calls. I've seen getblockchaininfo calls sometimes fail multiple times before connecting, but still under the limit of 10 before it exits. This is what I was referring to when I mentioned seeing retries in the logs, but that might be from the node still starting up.
https://git.hush.is/hush/lightwalletd/src/branch/master/common/common.go#L266

@fekt thanks for the info. It looks like we will need to change a few places in the code. The FirstRPC function seems like an easy change: we can increase retryCount to a much larger number. That change will only affect lightwalletd looking for hushd when it first starts. The BlockIngestor code is likely what usually crashes lightwalletd when hushd crashes after lightwalletd has started up. We can add similar retryCount code to that. There may be other places that also need to be updated, but these two are a good place to start.

I added a checklist to the issue description. We may want to write a wrapper function that calls an RPC with retries, and then change all instances of calling RPCs to that function, instead of copy+pasting retry code all over the place.

The latest code on the dev branch wraps all calls to RPC methods in a function that will retry up to 50 times (6875s or ~1.9hrs). One way to test this new code is to run lightwalletd on the dev branch, run hush-cli stop and see what lightwalletd does. It should go into a loop of retrying the RPC and logging each try, instead of crashing. Restarting hushd should make lightwalletd eventually start working again, and it will log "RPC successful". If all 50 retries fail, lightwalletd will exit with a fatal error, as before.

I have dev branch on lite2.hushpool.is and will try to do some testing later or over the weekend.

Seems good.

I stopped hushd and lightwalletd kept running for over 10 retries of getbestblockhash. Started hushd and everything kept running and started syncing where it left off.

Started lightwalletd while hushd not running and lightwalletd kept running for over 10 retries of getblockchaininfo. Started hushd and everything kept running and started syncing where it left off.

@fekt awesome, thanks for testing. Closing this.

Should a new 0.1.3 release be made to release this fix?

Nvm, just saw #54