A letter from SMART: "HDD temperature is rising"

I received another warning mail from SMART.
I assumed it was a SMART error, but it turned out to be something a little different. Here is what it said.

Error details

The message seems to be a warning that the server is running hot and that I should keep an eye on it.
----------------------------------------------
Subject: SMART error (Usage) detected on host:
----------------------------------------------
This message was generated by the smartd daemon running on:

host name: svr02
DNS domain: [Empty]

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], Failed SMART usage Attribute: 190 Airflow_Temperature_Cel.

Device info:
WDC WD400JD-00, S/N:WD-XXXXXX000000, FW:10.01E01, 40.0 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
----------------------------------------------

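At first I was dismayed, assuming bad sectors had turned up again like last time. On top of that the device was /dev/sda, so I was disappointed that, RAID1 or not, the primary disk had been hit.

But on a closer look...

"190 Airflow_Temperature_Cel."

The hard disk is not broken at all. The temperature simply went up.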
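
To make this kind of triage quicker next time, here is a minimal Python sketch that pulls the device and attribute out of the smartd warning line and says whether it is a temperature attribute or a sector-related one. The function name `classify_warning` and the grouping of attribute IDs are my own assumptions for illustration, not anything smartd provides.

```python
import re

# Grouping of attribute IDs is an assumption made for this example.
TEMPERATURE_ATTRS = {190, 194}
SECTOR_ATTRS = {5, 196, 197, 198}

# Matches the warning line format seen in the mail above.
PATTERN = re.compile(
    r"Device: (?P<device>\S+).*Failed SMART usage Attribute: "
    r"(?P<id>\d+) (?P<name>[\w-]+)"
)


def classify_warning(line):
    """Return (device, attribute id, name, category), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    attr_id = int(m.group("id"))
    if attr_id in TEMPERATURE_ATTRS:
        category = "temperature"
    elif attr_id in SECTOR_ATTRS:
        category = "sector"
    else:
        category = "other"
    return m.group("device"), attr_id, m.group("name"), category


line = ("Device: /dev/sda [SAT], Failed SMART usage Attribute: "
        "190 Airflow_Temperature_Cel.")
print(classify_warning(line))
```

For the line in this mail the category comes out as "temperature", which is the point: no need to panic about sectors.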

Checking to be sure

Just in case, I ran "smartctl -t long /dev/sda" and then, once the test had finished, "smartctl -A -l selftest /dev/sda" to check the state.
----------------------------------------------
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   166   164   021    Pre-fail  Always       -       2691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       787
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       16060
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       787
190 Airflow_Temperature_Cel 0x0022   048   027   045    Old_age   Always   In_the_past 52
194 Temperature_Celsius     0x0022   091   070   000    Old_age   Always       -       52
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     16060         -
# 2  Short offline       Completed without error       00%     15490         -
----------------------------------------------
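Not broken, just hot.
That said, the recorded temperature is 52°C. The other server I looked into last time was at 49°C, so it seems this 3°C difference was enough to trip the SMART alert threshold. In the table, attribute 190 shows VALUE 048 against THRESH 045, with WORST 027 and WHEN_FAILED In_the_past, so the threshold was crossed at some earlier point rather than at the moment of this reading.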
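
The way I read the attribute table can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the two sample rows are copied from the output above, and the rule used (an attribute counts as failed when its normalized value is at or below the threshold) is my understanding of how smartctl fills in WHEN_FAILED.

```python
def parse_attribute_row(row):
    """Split one `smartctl -A` attribute row into a dict of its columns."""
    f = row.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),      # normalized current value
        "worst": int(f[4]),      # lowest normalized value seen so far
        "thresh": int(f[5]),     # failure threshold
        "when_failed": f[8],
        "raw": " ".join(f[9:]),  # raw value may contain several tokens
    }


def status(attr):
    """Compare the normalized values with the threshold."""
    if attr["value"] <= attr["thresh"]:
        return "failing now"
    if attr["worst"] <= attr["thresh"]:
        return "failed in the past"
    return "ok"


ROW_190 = "190 Airflow_Temperature_Cel 0x0022 048 027 045 Old_age Always In_the_past 52"
ROW_194 = "194 Temperature_Celsius 0x0022 091 070 000 Old_age Always - 52"

for row in (ROW_190, ROW_194):
    attr = parse_attribute_row(row)
    print(attr["id"], attr["name"], attr["raw"], status(attr))
```

Attribute 190 comes out as "failed in the past" and 194 as "ok", which matches the WHEN_FAILED column: the drive is under the limit now but has been over it before.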

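This one... calls for watching it for a while. If the temperature climbs too far I will need to do something about it, but if it is only temporary I will leave things as they are. I cannot think of an effective countermeasure.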
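
For the "watch it for a while" part, something like the following Python sketch could be run from cron to read the current temperature. It is only a sketch under assumptions: the helper names are hypothetical, the 55°C limit is an arbitrary value picked for illustration, and it assumes the raw value of attribute 194 is a plain number of degrees Celsius as in the output above. Running smartctl needs root.

```python
import subprocess


def temperature_from_output(text, attr_id=194):
    """Return the raw temperature for attr_id from `smartctl -A` text, or None."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


def read_temperature(device="/dev/sda"):
    """Run `smartctl -A` on the device and return its temperature in Celsius."""
    # smartctl uses its exit status as a bit mask, so a non-zero
    # status is not treated as an error here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return temperature_from_output(result.stdout)


def needs_attention(temp_c, limit_c=55):
    """True when a reading is at or above the limit (55 is an assumed value)."""
    return temp_c is not None and temp_c >= limit_c
```

Calling `needs_attention(read_temperature("/dev/sda"))` once an hour and logging the result would be enough to tell whether the heat is a passing thing or a trend.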