NFS server suddenly hangs and all jobs fail (now even under lighter load, ~2.5 KIOPS)

NFS server logs:

03/07/2019 20:33:47 : 960 :nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
03/07/2019 20:33:47 : 961 :nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
03/07/2019 20:33:47 : 963 :nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
03/07/2019 20:35:17 : 110 :nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
04/07/2019 04:42:18 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f87cb69e800 fd 108 recv errno 104 (will set dead)
04/07/2019 04:42:18 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f87cb69e400 fd 99 recv errno 104 (will set dead)
04/07/2019 07:01:57 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8873515c00 fd 111 recv errno 104 (will set dead)
04/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f8727157000 fd 111 recv errno 104 (will set dead)
04/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8998038000 fd 113 recv errno 104 (will set dead)
04/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f86dc7eb800 fd 112 recv errno 104 (will set dead)
04/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86fbe5dc00 fd 114 recv errno 104 (will set dead)
04/07/2019 07:03:27 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f88a1d1f800 fd 111 recv errno 104 (will set dead)
04/07/2019 07:09:13 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8998160c00 fd 111 recv errno 104 (will set dead)
04/07/2019 07:09:19 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86a6750c00 fd 111 recv errno 104 (will set dead)
04/07/2019 07:09:59 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f88701ce400 fd 111 recv errno 104 (will set dead)
04/07/2019 07:10:05 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f873456c000 fd 111 recv errno 104 (will set dead)
04/07/2019 07:13:30 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f88a09e9800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:01:57 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8689a9e800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f87c02ca800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f878b2ac400 fd 113 recv errno 104 (will set dead)
05/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f86f9669800 fd 112 recv errno 104 (will set dead)
05/07/2019 07:02:17 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86db265800 fd 114 recv errno 104 (will set dead)
05/07/2019 07:03:28 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86aed57800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:09:29 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86a51b3800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:09:35 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86a2686800 fd 111 recv errno 104 (will set dead)
05/07/2019 07:10:12 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8789e5dc00 fd 111 recv errno 104 (will set dead)
05/07/2019 07:10:18 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f86dba98000 fd 111 recv errno 104 (will set dead)
05/07/2019 07:13:41 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f8785baf800 fd 111 recv errno 104 (will set dead)
Jul 05 08:17:12.800 rook-edgefs-nfs-caltech-nfs-7649f6bdf5-c9tqz libccowfsio[194] error   [634]      tc_pool.c:184  : Recreating tenant context 0x7f8e3c2a6c00, on delay response: 17, consensus: 647619
Jul 05 11:59:49.728 rook-edgefs-nfs-caltech-nfs-7649f6bdf5-c9tqz libccowfsio[194] error   [692]      tc_pool.c:184  : Recreating tenant context 0x7f8e3c2aa800, on delay response: 0, consensus: 644531
Jul 05 20:07:35.141 rook-edgefs-nfs-caltech-nfs-7649f6bdf5-c9tqz libccowfsio[194] error   [497]      tc_pool.c:181  : Detected struggling tenant context 0x7f8e3c2ab400, on delay response: 378244, consensus: 0
Jul 05 20:18:05.151 rook-edgefs-nfs-caltech-nfs-7649f6bdf5-c9tqz libccowfsio[194] error   [497]      tc_pool.c:181  : Detected struggling tenant context 0x7f8e3c2ab400, on delay response: 628240, consensus: 0
05/07/2019 20:20:33 : 0 :rpc :TIRPC :EVENT :svc_ioq_flushv() writev failed (104)

This kills the multi-day jobs that use this storage.
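
For triage, it may help to see when the connection resets cluster. Every TIRPC line in the log above reports errno 104, which is ECONNRESET on Linux (the peer reset the TCP connection). Below is a minimal sketch that tallies those lines per date and hour. The regular expression and the function name `count_resets_by_hour` are my own assumptions, written against the log format pasted above; they are not part of the NFS server or EdgeFS tooling.

```python
import re
from collections import Counter

# Assumed pattern for the TIRPC lines as pasted above, e.g.
# "04/07/2019 07:01:57 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x... fd 111 recv errno 104 (will set dead)"
RESET_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<hour>\d{2}):\d{2}:\d{2} .*:TIRPC :EVENT :"
    r"(?P<func>svc_vc_\w+): \S+ fd (?P<fd>\d+) recv errno (?P<errno>\d+)"
)

def count_resets_by_hour(lines, errno=104):
    """Count 'recv errno <errno>' TIRPC events per (date, hour)."""
    counts = Counter()
    for line in lines:
        m = RESET_LINE.match(line)
        if m and int(m.group("errno")) == errno:
            counts[(m.group("date"), m.group("hour"))] += 1
    return counts

# Sample lines copied verbatim from the log above.
sample = [
    "04/07/2019 04:42:18 : 0 :rpc :TIRPC :EVENT :svc_vc_wait: 0x7f87cb69e800 fd 108 recv errno 104 (will set dead)",
    "04/07/2019 07:01:57 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8873515c00 fd 111 recv errno 104 (will set dead)",
    "05/07/2019 07:01:57 : 0 :rpc :TIRPC :EVENT :svc_vc_recv: 0x7f8689a9e800 fd 111 recv errno 104 (will set dead)",
    "05/07/2019 20:20:33 : 0 :rpc :TIRPC :EVENT :svc_ioq_flushv() writev failed (104)",
]

for (date, hour), n in sorted(count_resets_by_hour(sample).items()):
    print(f"{date} {hour}:00  {n} reset(s)")
```

Run against the full log, this should show the resets bunching in the 07:00 hour on both 04/07 and 05/07, starting at 07:01:57 each day. That pattern might point at a scheduled client-side job rather than random load. The final `svc_ioq_flushv() writev failed (104)` line is deliberately not counted, since it is a write failure and not a `recv` event.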

Edited by Dima Mishin