What Is Engineering? A Craftsman's Spirit, a Systematic Process - A Look at How PostgreSQL Fixed Its fsync Bug


PostgreSQL recently shipped a minor release that fixes a long-standing fsync bug. An fsync bug in such a mature database is extremely rare, so I spent some time piecing together the whole story. The problem, as originally reported, is as follows:

Pg wrote some blocks, which went to OS dirty buffers for writeback.
Writeback failed due to an underlying storage error. The block I/O layer
and XFS marked the writeback page as failed (AS_EIO), but had no way to
tell the app about the failure. When Pg called fsync() on the FD during the
next checkpoint, fsync() returned EIO because of the flagged page, to tell
Pg that a previous async write failed. Pg treated the checkpoint as failed
and didn’t advance the redo start position in the control file.

All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry
succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.

The write never made it to disk, but we completed the checkpoint, and
merrily carried on our way. Whoops, data loss.
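To make the failure mode concrete, here is a minimal sketch of the write-then-fsync-then-retry pattern described above (my own illustration in C, not PostgreSQL code; the file name is made up). On a kernel that clears the error state once it has been reported, the retried fsync() can return success even though the data never reached disk:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.blk", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[8192] = {0};
    if (write(fd, buf, sizeof(buf)) < 0)      /* lands in the OS dirty page cache */
        perror("write");

    if (fsync(fd) < 0)                        /* write-back failed: EIO surfaces here */
    {
        fprintf(stderr, "first fsync failed: %s\n", strerror(errno));

        /*
         * The "obvious" recovery is to retry.  On kernels that drop the
         * error flag after reporting it once, this second call returns 0
         * even though the dirty page was never written: the error is lost.
         */
        if (fsync(fd) == 0)
            fprintf(stderr, "retry reported success, but the data may be gone\n");
    }

    close(fd);
    return 0;
}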

From a user's point of view this behavior of fsync is genuinely surprising, so much so that the PG developers described it as brain damage. After reading the description, though, what I really wanted to know was how the problem had been located in the first place. My hunch was the logs: the reporter's clue was that the problem showed up during a checkpoint, and combining that with the bug's patch, specifically in src/backend/access/transam/slru.c:

 		case SLRU_FSYNC_FAILED:
-			ereport(ERROR,
+			ereport(data_sync_elevel(ERROR),
 					(errcode_for_file_access(),
 					 errmsg("could not access status of transaction %u", xid),
 					 errdetail("Could not fsync file \"%s\": %m.",

As you can see, a log line was already being emitted whenever fsync failed, so it is safe to say the problem was located through the logs.

The complete call chain is:

src/backend/access/transam/clog.c#CheckPointCLOG 

>>src/backend/access/transam/slru.c#SimpleLruFlush

>>>>src/backend/access/transam/slru.c#SlruReportIOError 

To sum up briefly:

  1. fsync's error-handling behavior neither matches common expectations nor is documented anywhere.
  2. Postgres handled the failure the way common sense suggests: log it, then retry, both perfectly standard responses. Yet it was exactly the retry that caused the problem.

fsync's behavior really is at fault here, but cases like this are everywhere, and plenty of bugs come into the world exactly this way. On the other side, the way the problem was resolved shows what engineering looks like in practice:

First, the reason the problem could be located at all is probably PostgreSQL's good coding habits: when something fails, it writes a log line. This is where defensive programming matters. Don't skip error handling because it feels tedious; dutifully log every failure, because otherwise a problem like this fsync one would be nearly impossible to pin down.
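As a trivial sketch of that habit (my own illustration, not Postgres code; the path is made up): check fsync()'s return value and record the file and errno instead of swallowing them. That single log line is what later makes an incident like this diagnosable at all:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Flush a file and leave a trail in the log if the kernel reports a failure. */
static int flush_file(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "ERROR: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        return -1;
    }
    return 0;
}

int main(void)
{
    const char *path = "data.blk";
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    int rc = flush_file(fd, path);
    close(fd);
    return rc == 0 ? 0 : 1;
}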

Second, the rigor of the PostgreSQL community as a whole: from the clear write-up sent to the mailing list, to the discussion among a group of people, to the final patch. The commit message of the patch is textbook material in its own right, covering what the problem is, how it is fixed, what issues remain, and how they will be dealt with:

On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure.  In that case the only remaining copy of the
data is in the WAL.  A subsequent fsync() could appear to succeed,
but not have flushed the data.  That means that a future checkpoint
could apparently complete successfully but have lost data.

Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure.  Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.

Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses.  If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail.  The GUC defaults
to off, meaning that we panic.

Back-patch to all supported releases.

There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error.  A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.

That is a good 250+ words, and even the formatting is meticulous: every sentence-ending period is followed by two spaces, and every other punctuation mark by one.
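In code, the fix itself is small: error reports for failed data-file fsyncs are routed through a helper that escalates them to PANIC unless the new data_sync_retry GUC is turned on. Below is a sketch of the idea, reconstructed from the one-line diff and the commit message above rather than copied from the source tree; the severity constants are placeholders standing in for PostgreSQL's elog levels:

#include <stdbool.h>
#include <stdio.h>

/* Placeholder severity levels for this sketch; the real ones live in elog.h. */
#define ERROR 20
#define PANIC 22

/* The GUC introduced by the patch; defaults to off, i.e. never trust a retried fsync(). */
static bool data_sync_retry = false;

/*
 * Escalate a data-file fsync failure to PANIC unless the administrator has
 * promised that this OS keeps dirty pages around after a write-back error.
 * PANIC forces crash recovery from the WAL, so a later checkpoint can never
 * "succeed" on top of data that was silently dropped.
 */
static int data_sync_elevel(int elevel)
{
    return data_sync_retry ? elevel : PANIC;
}

int main(void)
{
    /* With the default setting, an fsync failure gets reported at PANIC level. */
    printf("report level: %s\n",
           data_sync_elevel(ERROR) == PANIC ? "PANIC" : "ERROR");
    return 0;
}

On an operating system known to keep dirty buffers around after a write-back failure, the old behavior can be restored with data_sync_retry = on in postgresql.conf; with the default of off, the first failed fsync() takes the server through a PANIC into WAL crash recovery instead of risking silent data loss.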

Programmers go by plenty of names. The tongue-in-cheek ones include code monkey, code farmer, and brick mover; the formal title is software engineer. Engineer, engineering: so what is engineering? A craftsman's spirit, and a systematic process.

Related links:

  1. The original problem report
  2. The patch
  3. The code can be browsed via sourcegraph

