Save block data to prevent sync from block 0 when a fresh node crashes #370

Open
opened 5 months ago by onryo · 10 comments
onryo commented 5 months ago
Collaborator

A node may crash during a fresh sync, and when it is restarted it syncs from block 0 again. It would be useful to write changes every X blocks so that a node which crashes does not have to start over from block 0.

onryo added the feature label 5 months ago
Owner

I believe we need to add lines 1403-1425 of zcash's `ChainTipAdded()` function to the end of our `ChainTip()` function: https://github.com/zcash/zcash/blob/master/src/wallet/wallet.cpp#L1403

That code calls `SetBestChain()` every `WITNESS_WRITE_UPDATES` blocks, which they set to 10000. `SetBestChain()` stores the "bestblock" key in wallet.dat, which records the height the wallet knows about.
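For anyone following along, here is a minimal, self-contained sketch of the counting idea described above. It is not the zcash or hush3 code: `PeriodicBestBlockWriter`, `OnBlockConnected` and the `persist` callback are hypothetical names used only for illustration, and the real `SetBestChain()` takes a block locator rather than a bare height.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper, NOT taken from the hush3 or zcash sources.
// It counts blocks added to the chain tip and invokes a caller-supplied
// "persist" callback once every `interval` blocks.
class PeriodicBestBlockWriter {
public:
    PeriodicBestBlockWriter(uint32_t interval, std::function<void(int64_t)> persist)
        : interval_(interval), persist_(std::move(persist)) {}

    // Call once per block added to the tip.
    // Returns true if this call persisted the height.
    bool OnBlockConnected(int64_t height) {
        if (++updatesSinceWrite_ < interval_) {
            return false;  // not enough new blocks yet, skip the write
        }
        updatesSinceWrite_ = 0;  // start counting towards the next write
        persist_(height);        // stand-in for writing "bestblock" to the wallet
        return true;
    }

private:
    uint32_t interval_;
    uint32_t updatesSinceWrite_ = 0;
    std::function<void(int64_t)> persist_;
};
```

With an interval of 10000, feeding heights 1 through 25000 into `OnBlockConnected` would trigger the callback at heights 10000 and 20000 only, so a crash after that point loses at most the blocks counted since the last write.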

duke self-assigned this 5 months ago
duke referenced this issue from a commit 5 months ago
Owner

@onryo ok, we have some untested code for this on the `setbestchain` branch. It needs a lot of testing, since this is a pretty low-level change. Testing could include letting the OOM Killer kill it and/or using `kill -9`.

Some testing with mining and calling `getblocktemplate` should also be done on this branch, as these changes affect how `ChainTip()` works and could affect mining.

Poster
Collaborator

> @onryo ok, we have some potential untested code on the `setbestchain` branch. It needs a lot of testing, since this is a pretty low-level change. Testing could include letting the OOM Killer kill it and/or kill -9.

My fresh node crashed at block 728560. I restarted it and now it says `Activating best chain...` every time I call `getblockcount`. In stdout I see that it started syncing from block 0:

[HUSH3].1050 (HUSH3) matched.1 i.1 j.1 notarized.0 1 opretlen.146 len.3 offset.68 opoffset.3
Poster
Collaborator

> @onryo ok, we have some potential untested code on the `setbestchain` branch. It needs a lot of testing, since this is a pretty low-level change. Testing could include letting the OOM Killer kill it and/or kill -9.

My fresh node crashed at block 728560. I restarted it and now it says `Activating best chain...` every time I call `getblockcount`. In stdout I see that it started syncing from block 0:

[HUSH3].1050 (HUSH3) matched.1 i.1 j.1 notarized.0 1 opretlen.146 len.3 offset.68 opoffset.3

I stopped the node and it core dumped:

hushd: /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/pthread/recursive_mutex.hpp:108: void boost::recursive_mutex::lock(): Assertion `!posix::pthread_mutex_lock(&m)' failed.
Aborted (core dumped)
StartShutdown: fRequestShudown=true
[New Thread 0x7ffee17ff640 (LWP 143300)]
Shutdown: stopping HUSH HTTP/REST/RPC
[Thread 0x7fffdb477640 (LWP 143228) exited]
[Thread 0x7ffee17ff640 (LWP 143300) exited]
[Thread 0x7fffd746f640 (LWP 143236) exited]
[Thread 0x7fffd7c70640 (LWP 143235) exited]
[Thread 0x7fffd8471640 (LWP 143234) exited]
[Thread 0x7fffd8c72640 (LWP 143233) exited]
[Thread 0x7fffd9473640 (LWP 143232) exited]
[Thread 0x7fffd9c74640 (LWP 143231) exited]
[Thread 0x7fffda475640 (LWP 143230) exited]
[Thread 0x7fffdac76640 (LWP 143229) exited]
[Thread 0x7fffdbc78640 (LWP 143227) exited]
[Thread 0x7ffff4efa640 (LWP 143226) exited]
[Thread 0x7ffff56fb640 (LWP 143225) exited]
[Thread 0x7ffff5efc640 (LWP 143224) exited]
[Thread 0x7ffff66fd640 (LWP 143223) exited]
Shutdown: stopping node
[Thread 0x7ffff6efe640 (LWP 143222) exited]
hushd: /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/pthread/recursive_mutex.hpp:108: void boost::recursive_mutex::lock(): Assertion `!posix::pthread_mutex_lock(&m)' failed.

Thread 18 "hush-txnotify" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffef4dff640 (LWP 143249)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140733006739008) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140733006739008) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140733006739008) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140733006739008, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7842476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff78287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff782871b in __assert_fail_base (fmt=0x7ffff79dd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555562af3a8 "!posix::pthread_mutex_lock(&m)", 
    file=0x5555562af740 "/home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/pthread/recursive_mutex.hpp", line=108, function=<optimized out>) at ./assert/assert.c:92
#6  0x00007ffff7839e96 in __GI___assert_fail (assertion=0x5555562af3a8 "!posix::pthread_mutex_lock(&m)", 
    file=0x5555562af740 "/home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/pthread/recursive_mutex.hpp", line=108, 
    function=0x5555562af380 "void boost::recursive_mutex::lock()") at ./assert/assert.c:101
#7  0x0000555555c5c848 in boost::recursive_mutex::lock (this=<optimized out>) at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/pthread/recursive_mutex.hpp:108
#8  AnnotatedMixin<boost::recursive_mutex>::lock (this=<optimized out>) at /home/onryo/hush3/src/sync.h:77
#9  boost::unique_lock<AnnotatedMixin<boost::recursive_mutex> >::lock (this=<synthetic pointer>) at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/lock_types.hpp:346
#10 CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::Enter (nLine=<optimized out>, pszFile=0x55555636c989 "util.cpp", pszName=<synthetic pointer>, this=<synthetic pointer>)
    at /home/onryo/hush3/src/sync.h:133
#11 CMutexLock<AnnotatedMixin<boost::recursive_mutex> >::CMutexLock (mutexIn=..., pszName=<synthetic pointer>, pszFile=0x55555636c989 "util.cpp", fTry=false, nLine=<optimized out>, 
    this=<synthetic pointer>) at /home/onryo/hush3/src/sync.h:154
#12 GetDataDir (fNetSpecific=fNetSpecific@entry=true) at util.cpp:675
#13 0x000055555578ffbc in GetBlockPosFilename (pos=..., prefix=<optimized out>) at /usr/include/c++/11/ext/new_allocator.h:89
#14 0x000055555579d76d in OpenDiskFile (pos=..., prefix=0x5555562b0f88 "blk", fReadOnly=<optimized out>) at main.cpp:5716
#15 0x00005555557c5e5f in OpenBlockFile (fReadOnly=true, pos=...) at main.cpp:5741
#16 ReadBlockFromDisk (height=<optimized out>, checkPOW=true, pos=..., block=...) at main.cpp:2309
#17 ReadBlockFromDisk (block=..., pindex=0x55555db56f30, checkPOW=checkPOW@entry=true) at main.cpp:2344
#18 0x000055555598c363 in ThreadNotifyWallets (pindexLastTip=0x5555658e60f0) at validationinterface.cpp:207
#19 0x000055555574296d in boost::function0<void>::operator() (this=0x7ffef4dfecd0) at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/function/function_template.hpp:763
#20 TraceThread<boost::function<void ()> >(char const*, boost::function<void ()>) (name=<optimized out>, func=...) at /home/onryo/hush3/src/util.h:269
#21 0x00005555557340a0 in boost::_bi::list2<boost::_bi::value<char const*>, boost::_bi::value<boost::function<void ()> > >::operator()<void (*)(char const*, boost::function<void ()>), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(char const*, boost::function<void ()>), boost::_bi::list0&, int) (a=<synthetic pointer>..., 
    f=@0x7fffcc097f88: 0x555555742900 <TraceThread<boost::function<void ()> >(char const*, boost::function<void ()>)>, this=0x7fffcc097f90)
    at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/bind/bind.hpp:319
#22 boost::_bi::bind_t<void, void (*)(char const*, boost::function<void ()>), boost::_bi::list2<boost::_bi::value<char const*>, boost::_bi::value<boost::function<void ()> > > >::operator()() (
    this=0x7fffcc097f88) at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/bind/bind.hpp:1294
#23 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(char const*, boost::function<void ()>), boost::_bi::list2<boost::_bi::value<char const*>, boost::_bi::value<boost::function<void ()> > > > >::run() (this=0x7fffcc097e50) at /home/onryo/hush3/depends/x86_64-unknown-linux-gnu/share/../include/boost/thread/detail/thread.hpp:120
#24 0x0000555555e67c1b in thread_proxy ()
#25 0x00007ffff7894ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#26 0x00007ffff7926660 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Owner

@onryo ok, I pushed another commit to the branch, please test again.

Poster
Collaborator

> @onryo ok, I pushed another commit to the branch, please test again.

No more coredump.

Owner

@onryo glad to hear there is no more coredump when calling `stop`, that is progress. Please also test what happens when it runs out of memory and/or is killed with `kill -9`.

When calling `stop`, the existing behavior is to save the bestchain height to the wallet, so you should still see that happen with the new code. The new behavior we are looking for is: does it save a recent bestchain height when killed by the kernel or the user?

It only saves the bestchain once every 10K blocks added, so you should see a saved height within 10K blocks of the height at which it was killed.
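To make the "within 10K blocks" expectation concrete, here is a small worked sketch. It assumes, purely for illustration, that writes land on exact multiples of the interval (counting starts at height 0 and every block is counted). The branch may behave differently, for example if the counter restarts when the node restarts. `LastSavedHeight` and `BlocksSinceLastSave` are hypothetical helper names, not functions in hush3.

```cpp
#include <cstdint>

// ASSUMPTION for illustration only: the bestchain is written at exact
// multiples of `interval`, so the last saved height is the crash height
// rounded down to the nearest multiple.
constexpr int64_t LastSavedHeight(int64_t crashHeight, int64_t interval) {
    return (crashHeight / interval) * interval;
}

// Number of blocks between the last saved height and the crash height.
// Under the assumption above this is always less than `interval`.
constexpr int64_t BlocksSinceLastSave(int64_t crashHeight, int64_t interval) {
    return crashHeight - LastSavedHeight(crashHeight, interval);
}
```

Under that assumption, a node killed at block 728560 with an interval of 10000 would have last saved height 720000, which is 8560 blocks behind the crash height.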

Owner

Forgot to mention: it's possible that wallet.dat gets corrupted when the OOM killer or `kill -9` kills hushd, so make sure you are not testing with an important wallet and/or that you have a recent backup. I think testing these changes on an empty wallet that hushd creates on startup is fine. A wallet with transactions is just going to make the testing slower.

Poster
Collaborator

> When calling `stop` the existing behavior is to save the bestchain height to the wallet, so you should still see that happen with the new code. The new behavior we are looking for is: does it save a recent bestchain height when killed by the kernel or user?

I can only confirm that there is no more coredump; syncing still starts from 0.

Owner

@onryo do you see a line like `wrote bestchain to wallet at height` in your debug.log?
