(1/3) A letter from smartd: "SMART error detected"

I received an alert from one of my servers. It reported a SMART error, so here is a write-up of how I investigated it.

Subject: <SMART error (OfflineUncorrectableSector) detected on host: svr01>
----------------------------------------------
This message was generated by the smartd daemon running on:

host name: svr01
DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], 4 Offline uncorrectable sectors

Device info:
WDC WD2500AAKX-193CA0, S/N:WD-WMAYVXXXXXXX, WWN:0-0000ee-000000a00, FW:15.01H15, 250 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
----------------------------------------------
I worked through the problem using the following article as a reference.
http://see-take.blogspot.jp/2010/01/hddsmart.html
Thank you for the valuable article.

Investigation

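The environment is a pair of software RAID1 servers built on Debian Jessie, and one hard disk in one of those servers is reporting the SMART error.
I start by examining /dev/sdb, the device named in the alert mail.
The command for this is "smartctl -t long /dev/sdb", which runs an extended self-test.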
----------------------------------------------
$ sudo smartctl -t long /dev/sdb
smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 45 minutes for test to complete.
Test will complete after Thu Apr 14 11:45:19 2016

Use smartctl -X to abort test.
----------------------------------------------
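As the message "Test will complete after Thu Apr 14 11:45:19 2016" indicates, the drive estimates 45 minutes, so I waited about an hour for the test to finish.
Once the time has passed, check the result.
The command to display it is "smartctl -A -l selftest /dev/sdb".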
----------------------------------------------
$ sudo smartctl -A -l selftest /dev/sdb
smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2533
  3 Spin_Up_Time            0x0027   141   141   021    Pre-fail  Always       -       3925
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       993
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11641
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       535
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       27
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       965
194 Temperature_Celsius     0x0022   094   087   000    Old_age   Always       -       49
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Self-test log structure revision number 1
Num Test_Description    Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline    Completed: read failure       90%     11640         337464416
# 2 Short offline       Completed: read failure       90%      6222         352580271
# 3 Short offline       Completed without error       00%      3963         -
----------------------------------------------
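The line to watch is "197 Current_Pending_Sector": its raw value is 4, the same count as the "4 Offline uncorrectable sectors" in the alert mail. The attribute the alert itself refers to is presumably "198 Offline_Uncorrectable", which also reads 4. "5 Reallocated_Sector_Ct" is still 0 and "199 UDMA_CRC_Error_Count" shows only 1, so no other errors seem to be occurring (for now).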
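To avoid reading the table by eye every time, the check above can be sketched in Python. This is only a minimal sketch: the set of attribute IDs in `WATCHED_IDS` is my own choice for illustration, the sample rows are copied from the output above, and the regular expression assumes the raw value is a plain integer at the end of the line.

```python
import re

# Attribute IDs watched in this sketch (an assumption, not an official list).
WATCHED_IDS = {5, 196, 197, 198}

# Rows copied from the "smartctl -A" output shown above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       4
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
"""

# ID, attribute name, hex flag, then anything, then an integer raw value at end of line.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+.*\s(\d+)\s*$")


def nonzero_watched(text):
    """Return {attribute_name: raw_value} for watched attributes with a nonzero raw value."""
    found = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and anything that is not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2), int(m.group(3))
        if attr_id in WATCHED_IDS and raw != 0:
            found[name] = raw
    return found


print(nonzero_watched(SAMPLE))
```

With the sample rows, only the two attributes discussed above are reported; UDMA_CRC_Error_Count is ignored because it is not in the watched set.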
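The two numbers listed under "LBA_of_first_error" in the lower table, 337464416 and 352580271, are apparently the LBAs (logical block addresses) of the bad blocks, that is, where each failed self-test hit its first read error. These are the locations where the problem seems to be.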
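Pulling those LBAs out of the self-test log can be sketched as follows. The rows are copied from the output above; the function name `failing_lbas` and the regular expression are my own, and the pattern assumes the column layout shown here.

```python
import re

# Rows copied from the "smartctl -l selftest" output shown above.
SELFTEST_LOG = """\
# 1 Extended offline    Completed: read failure       90%     11640         337464416
# 2 Short offline       Completed: read failure       90%      6222         352580271
# 3 Short offline       Completed without error       00%      3963         -
"""

# "# N", description plus status, remaining percent, lifetime hours, LBA or "-".
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\d+|-)\s*$")


def failing_lbas(text):
    """Return [(test_number, lba)] for self-tests that recorded a first-error LBA."""
    results = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m and m.group(5) != "-":
            results.append((int(m.group(1)), int(m.group(5))))
    return results


for number, lba in failing_lbas(SELFTEST_LOG):
    print(f"self-test #{number}: first error at LBA {lba}")
```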
Next, check which partition these addresses fall in.
Adding the "-u" option to fdisk lists the partitions with their boundaries given as sector numbers.
So the command is "fdisk -lu /dev/sdb" ("-l" lists the partition table).
----------------------------------------------
$ sudo fdisk -lu /dev/sdb

Disk /dev/sdb: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc55026ba

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sdb1 *         2048 480077823 480075776 228.9G 83 Linux
/dev/sdb2       480077824 488397167   8319344     4G 5 Extended
/dev/sdb5       480079872 488396799   8316928     4G 83 Linux
----------------------------------------------
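Comparing the two LBAs against the Start and End columns by hand is easy to get wrong, so here is a small sketch of the lookup. The partition boundaries are copied from the fdisk output above; the `Partition` tuple and the function name are hypothetical helpers for illustration.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    start: int  # first sector (inclusive)
    end: int    # last sector (inclusive)


# Start/End sectors copied from the "fdisk -lu /dev/sdb" output above.
PARTITIONS = [
    Partition("/dev/sdb1", 2048, 480077823),
    Partition("/dev/sdb2", 480077824, 488397167),  # extended container
    Partition("/dev/sdb5", 480079872, 488396799),  # logical partition inside sdb2
]


def containing_partitions(lba, partitions=PARTITIONS):
    """Return the names of all partitions whose sector range includes the LBA."""
    return [p.name for p in partitions if p.start <= lba <= p.end]


for lba in (337464416, 352580271):
    print(lba, containing_partitions(lba))
```

Both LBAs fall inside /dev/sdb1, whose starting sector is 2048.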
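The next step is to look up the file system block size, which is needed to work out where the addresses fall. The article I referred to appears to assume a single disk, so running its command as-is fails in a RAID1 environment.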
----------------------------------------------
$ sudo tune2fs -l /dev/sdb | grep Block
tune2fs: Bad magic number in super-block while trying to open /dev/sdb
Couldn't find valid filesystem superblock.
----------------------------------------------
In a RAID1 environment the file system is tied to an md device such as /dev/md0, so the command becomes "tune2fs -l /dev/md0 | grep Block".
----------------------------------------------
$ sudo tune2fs -l /dev/md0 | grep Block
Block count:              59976704
Block size:               4096
Blocks per group:         32768
----------------------------------------------
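If the value is needed in a script rather than on screen, the "key: value" lines can be split as in the sketch below. The sample text is the grep output above; `parse_fields` is a hypothetical helper, not part of e2fsprogs.

```python
# Lines copied from the "tune2fs -l /dev/md0 | grep Block" output above.
TUNE2FS_OUTPUT = """\
Block count:              59976704
Block size:               4096
Blocks per group:         32768
"""


def parse_fields(text):
    """Split "key: value" lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            fields[key.strip()] = value.strip()
    return fields


fields = parse_fields(TUNE2FS_OUTPUT)
block_size = int(fields["Block size"])
fs_bytes = int(fields["Block count"]) * block_size
print("block size:", block_size, "file system bytes:", fs_bytes)
```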
That gives all the numbers needed. Next comes the calculation.

Calculation

Calculate where the bad sectors are located in the file system.

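Subtract the partition's starting sector number, as shown by fdisk -lu, from the LBA of the bad sector. Multiply the result by 512, the sector size in bytes reported by fdisk ("Units: sectors of 1 * 512 = 512 bytes"), and divide by the file system block size of 4096.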

  (337464416 - 2048) * 512 / 4096 = 42182796
  (352580271 - 2048) * 512 / 4096 = 44072277.875, rounded down to 44072277

LBA 337464416 maps to block number 42182796.
LBA 352580271 maps to block number 44072277.
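The same arithmetic as a small function, in case it has to be repeated for more LBAs. This is a sketch under assumptions: the function name is my own, and `data_offset_sectors` is a hypothetical parameter for any gap between the start of the partition and the start of the file system (for example an md data offset, if the array uses one). It is left at 0 here, which reproduces the numbers above.

```python
def lba_to_fs_block(lba, part_start, sector_size=512, block_size=4096,
                    data_offset_sectors=0):
    """Convert a disk LBA to a file system block number, rounding down.

    data_offset_sectors is an assumption of this sketch: sectors between the
    partition start and the file system start. 0 matches the calculation above.
    """
    sectors_into_fs = lba - part_start - data_offset_sectors
    if sectors_into_fs < 0:
        raise ValueError("LBA lies before the start of the file system")
    return sectors_into_fs * sector_size // block_size


for lba in (337464416, 352580271):
    print(lba, "->", lba_to_fs_block(lba, part_start=2048))
```

Integer division is used on purpose: a 4096-byte block covers eight 512-byte sectors, so any sector inside a block maps to that block's number.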

Look up the resulting block numbers with the "debugfs" command to see which inodes, if any, own them.
----------------------------------------------
$ sudo debugfs
debugfs 1.42.12 (29-Aug-2014)
debugfs: open /dev/sdb1
/dev/sdb1: Bad magic number in super-block while opening filesystem
debugfs: open /dev/md0
debugfs: icheck 42182796
Block   Inode number
42182796        <block not found>
debugfs: icheck 44072277
Block   Inode number
44072277        <block not found>
debugfs: ncheck 42182796
Inode   Pathname
debugfs: ncheck 44072277
Inode   Pathname
debugfs: quit
----------------------------------------------
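Both lookups return "Inode number = <block not found>", so it seems nothing has been written to those blocks yet. I was relieved that no file had been lost. Just to be sure, I also ran "ncheck 42182796" and "ncheck 44072277", and the Inode and Pathname columns came back empty. (Note that ncheck takes inode numbers rather than block numbers, so it is the icheck result that shows the blocks are not owned by any file.)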
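For a longer list of blocks, the icheck output can be turned into a mapping as sketched below. The sample text is copied from the debugfs session above, and `parse_icheck` is a hypothetical helper; it assumes each result line is a block number followed by either an inode number or "<block not found>".

```python
# Lines copied from the two icheck results in the debugfs session above.
ICHECK_OUTPUT = """\
Block   Inode number
42182796        <block not found>
Block   Inode number
44072277        <block not found>
"""


def parse_icheck(text):
    """Map block number -> owning inode number, or None when no inode owns it."""
    owners = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the "Block   Inode number" header and blank lines
        owner = parts[1].strip()
        owners[int(parts[0])] = int(owner) if owner.isdigit() else None
    return owners


owners = parse_icheck(ICHECK_OUTPUT)
unowned = [block for block, inode in owners.items() if inode is None]
print("blocks with no owning inode:", unowned)
```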