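Linux doesn't have many convenient graphical monitoring tools; most monitoring is done through SNMP or other helper tools. I mostly run single-disk setups, so the disk's health deserves an occasional check, and honestly I had never paid any attention to hard disk temperature on Linux, which is pretty bad... Now that I know a handy tool exists, I should of course try it out: look after the disks a bit more, and they should last a bit longer!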

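"Smartmontools" can be installed on Linux with yum. Just as I was about to install it, I found that the package was already on my system; never having used it was a bit of a waste. Besides the documentation, what files does it include?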

$ rpm -ql smartmontools
/etc/rc.d/init.d/smartd
/etc/smartd.conf
/etc/sysconfig/smartmontools
/usr/sbin/smartctl
/usr/sbin/smartd

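So this tool is the "smartd" service that has been sitting on my system unused ^^a. I had never turned it on, because I only enable the services that are strictly necessary and leave everything else off.
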
$ smartctl -a /dev/sdb3
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000333AS
Serial Number:    9TE10MBY
Firmware Version: CC1H
User Capacity:    1,000,203,804,160 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Feb  8 14:25:49 2011 CST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled (SMART is not enabled on this HDD)

SMART Disabled. Use option -s with argument 'on' to enable it. (SMART has not been turned on yet)
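
The check above can be scripted. The following is a minimal sketch, assuming the "SMART support is:" line keeps the wording shown in this output; `smart_status` is a hypothetical helper name, not part of smartmontools, and `/dev/sdb` is simply the disk used in this post.

```shell
# Report whether SMART is enabled, based on `smartctl -i` text read from stdin.
# smart_status is a hypothetical helper, not something smartmontools ships.
smart_status() {
    if grep -q '^SMART support is:[[:space:]]*Enabled'; then
        echo enabled
    else
        echo disabled
    fi
}

# Typical use (needs root and a real disk):
#   smartctl -i /dev/sdb | smart_status
#   smartctl -s on /dev/sdb      # turn SMART on if it reports "disabled"
```

Reading from stdin keeps the helper independent of any particular device, so it can also be tried against saved output.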

Once SMART is enabled on the drive, a whole pile of disk information shows up:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          ( 625) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 212) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       238786274
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9620914
  9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16940
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       34
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       643
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   044   044   000    Old_age   Always       -       56
190 Airflow_Temperature_Cel 0x0022   059   041   045    Old_age   Always   In_the_past 41 (7 68 41 41)
194 Temperature_Celsius     0x0022   041   059   000    Old_age   Always       -       41 (0 25 0 0)
195 Hardware_ECC_Recovered  0x001a   034   029   000    Old_age   Always       -       238786274
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       87746181860942
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       702344905
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       353203500

SMART Error Log Version: 1
ATA Error Count: 635 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 635 occurred at disk power-on lifetime: 3659 hours (152 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00  32d+14:50:29.967  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:29.934  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:27.106  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:27.102  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:24.294  READ DMA EXT

Error 634 occurred at disk power-on lifetime: 3659 hours (152 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00  32d+14:50:27.106  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:27.102  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:24.294  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:24.262  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:21.449  READ DMA EXT

Error 633 occurred at disk power-on lifetime: 3659 hours (152 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00  32d+14:50:24.294  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:24.262  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:21.449  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:21.446  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:18.708  READ DMA EXT

Error 632 occurred at disk power-on lifetime: 3659 hours (152 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00  32d+14:50:21.449  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:21.446  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:18.708  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:18.607  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:15.863  READ DMA EXT

Error 631 occurred at disk power-on lifetime: 3659 hours (152 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00  32d+14:50:18.708  READ DMA EXT
  ec 00 00 7a 7f 3d a0 00  32d+14:50:18.607  IDENTIFY DEVICE
  25 00 08 ff ff ff ef 00  32d+14:50:15.863  READ DMA EXT
  35 00 08 ff ff ff ef 00  32d+14:50:15.861  WRITE DMA EXT
  35 00 10 ff ff ff ef 00  32d+14:50:15.860  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


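The amount of information is dizzying; without a specific item to monitor, I really wouldn't know where to start. Also, my Linux runs on ESXi, where the system disk exposes no SMART information; only a disk that I mapped myself as a physical disk to a vmdk shows SMART data. The disks that ESXi manages probably have to be handled from the ESXi side. What I mainly want to watch is temperature, which is attribute 194 in the output above.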
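
Pulling attribute 194 out of that table can be sketched as below. This assumes the column layout shown above, where the raw value is the tenth whitespace-separated field; `disk_temp` is a hypothetical helper name, and other drives may report temperature under a different attribute ID.

```shell
# Print the raw value of attribute 194 (Temperature_Celsius) from
# `smartctl -A` text read on stdin. disk_temp is a hypothetical helper.
# Assumes the layout above: ID in field 1, RAW_VALUE starting in field 10.
disk_temp() {
    awk '$1 == 194 { print $10; exit }'
}

# Typical use (needs root and a real disk):
#   smartctl -A /dev/sdb | disk_temp
```

Only the first number of the raw field is printed; the values in parentheses that some drives append are ignored.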

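Running commands by hand or writing a script to watch the temperature works, but since smartd is a service, it is worth looking at its configuration file. Sure enough, it has monitoring options: it can send mail proactively when a problem occurs, and it can write records to syslog, which makes later lookups much easier. Reading through /etc/smartd.conf, I found that quite a lot can be monitored, and the default examples are written in the file itself, so if you have more advanced needs, browsing it may give you the answer. For now I have changed only a few basic settings.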

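# Global setting: recipient of warning mail for every scanned device.
# Note: smartd ignores all lines after the first DEVICESCAN line,
# so treat the entries below as alternatives, not as one combined file.
DEVICESCAN -H -m EMAIL_ADDRESS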

# Ignore attributes 194, 231 and 9 when tracking attribute changes (-I),
# so their constant updates do not flood syslog
DEVICESCAN -I 194 -I 231 -I 9

# Monitor all attributes except normalized Temperature (usually 194),
# but track Temperature changes >= 4 Celsius, report Temperatures
# >= 45 Celsius and changes in Raw value of Reallocated_Sector_Ct (5).
# Send mail on SMART failures or when Temperature is >= 55 Celsius.
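# In short: -W 4,45,55 logs temperature changes of 4 degrees or more, logs
# when the temperature reaches 45, and sends a warning at 55. -R 5 tracks
# changes in the raw value of attribute 5 (reallocated sectors); it is not a
# threshold of 5.
/dev/sdb -a -I 194 -W 4,45,55 -R 5 -m EMAIL_ADDRESS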

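# Quiet check: notify only when the SMART health status fails
# (-C 0 and -U 0 switch off the pending and uncorrectable sector checks)
/dev/sdb -H -C 0 -U 0 -m EMAIL_ADDRESS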

# More configuration directives:
-d TYPE Set the device type: ata, scsi, marvell, removable, 3ware,N, hpt,L/M/N
-T TYPE set the tolerance to one of: normal, permissive
-o VAL  Enable/disable automatic offline tests (on/off)
-S VAL  Enable/disable attribute autosave (on/off)
-n MODE No check. MODE is one of: never, sleep, standby, idle
-H      Monitor SMART Health Status, report if failed
-l TYPE Monitor SMART log.  Type is one of: error, selftest
-f      Monitor for failure of any 'Usage' Attributes
-m ADD  Send warning email to ADD for -H, -l error, -l selftest, and -f
-M TYPE Modify email warning behavior (see man page)
-s REGE Start self-test when type/date matches regular expression (see man page)
-p      Report changes in 'Prefailure' Normalized Attributes
-u      Report changes in 'Usage' Normalized Attributes
-t      Equivalent to -p and -u Directives
-r ID   Also report Raw values of Attribute ID with -p, -u or -t
-R ID   Track changes in Attribute ID Raw value with -p, -u or -t
-i ID   Ignore Attribute ID for -f Directive
-I ID   Ignore Attribute ID for -p, -u or -t Directive
-C ID   Report if Current Pending Sector count non-zero
-U ID   Report if Offline Uncorrectable count non-zero
-W D,I,C Monitor Temperature D)ifference, I)nformal limit, C)ritical limit
-v N,ST Modifies labeling of Attribute N (see man page)
-a      Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
-F TYPE Use firmware bug workaround. Type is one of: none, samsung
-P TYPE Drive-specific presets: use, ignore, show, showall

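That is about all I have configured; mostly I rely on the service to do the monitoring. If you want an even simpler way to monitor disk temperature, you can set up the "hddtemp" service, although it supports only a limited set of disks. SNMP doesn't seem to expose temperature information either, so for now simple alerting is what I'm going with!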

http://smartmontools.sourceforge.net/

