Once upon a time, out on the interwebs, I came across a tip to run smartctl with the -x option.
Naturally, like any homo sapiens, I read the man page first:
-x, --xall Prints all SMART and non-SMART information about the device. For ATA devices this is equivalent to ´-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest -l selective -l directory -l scttemp -l scterc -l devstat -l sataphy´. and for SCSI, this is equivalent to ´-H -i -A -l error -l selftest -l background -l sasphy´.
Seeing nothing sketchy there, I ran this little command:
# smartctl -x -a -d cciss,0 /dev/cciss/c0d0
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.13.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/cciss/c0d0 [cciss_disk_00] [SCSI]: Device open changed type from 'sat,auto' to 'cciss'
Vendor:               SEAGATE
Product:              ST91000640SS
Revision:             0001
User Capacity:        1,000,204,886,016 bytes [1.00 TB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c50025fd7283
Serial number:        9XG02CLM00009126234W
Device type:          disk
Transport protocol:   SAS
Local Time is:        Tue Jan 31 15:29:39 2017 UTC
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature:     22 C
Drive Trip Temperature:        68 C
Manufactured in week of year 20
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  36
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  36
Elements in grown defect list: 3
Vendor (Seagate) cache information
  Blocks sent to initiator = 791069177
  Blocks received from initiator = 8147385
  Blocks read from cache and sent to initiator = 6510918
  Number of read and write commands whose size <= segment size = 1294551
  Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 37972.70
  number of minutes until next internal SMART test = 12
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    8169902        0         0   8169902          0       2604.051           0
write:         0        0         0         0          0          4.359           0
Non-medium error count:        1
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 12198 seconds [203.3 minutes]
Segmentation fault (core dumped)
And... the console froze, the connection to the server dropped, and it stopped answering pings. Thank Hank the server was not part of the production cluster, and it came back up on its own a couple of minutes later.
It is worth noting that smartctl -a -d cciss,0 /dev/cciss/c0d0 (the same thing, but without -x) had run on that same box several times a couple of minutes earlier without any problems. OS: CentOS 6.8 x86_64; RAID controller: HP Smart Array E200i.
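If you ever want to find out which part of -x is the one that bites, one option is to split its expansion into single-option commands and try them one at a time on a box you can afford to lose. Below is a rough sketch of that idea. It only prints the commands and does not run anything. The expansions are copied from the man page excerpt quoted above (smartctl 5.43); the helper names are made up for this post, and other smartctl versions may expand -x differently.

```python
import shlex

# Expansions of -x as quoted from the smartctl 5.43 man page above.
# Other versions may differ; check your own man page.
XALL = {
    "ata": "-H -i -g all -c -A -f brief -l xerror,error -l xselftest,selftest "
           "-l selective -l directory -l scttemp -l scterc -l devstat -l sataphy",
    "scsi": "-H -i -A -l error -l selftest -l background -l sasphy",
}

# Options in those expansions that are followed by an argument.
TAKES_ARG = {"-g", "-f", "-l"}


def split_options(expansion):
    """Split an option string into groups: one option plus its argument, if any."""
    tokens = shlex.split(expansion)
    groups = []
    i = 0
    while i < len(tokens):
        width = 2 if tokens[i] in TAKES_ARG else 1
        groups.append(tokens[i:i + width])
        i += width
    return groups


def single_option_commands(kind, device, dev_type=None):
    """Build one smartctl argv per option group (hypothetical helper)."""
    base = ["smartctl"]
    if dev_type:
        base += ["-d", dev_type]
    return [base + group + [device] for group in split_options(XALL[kind])]


if __name__ == "__main__":
    # Print only; running these is up to you and may hang the machine.
    for cmd in single_option_commands("scsi", "/dev/cciss/c0d0", "cciss,0"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Running the commands one by one, rather than all at once via -x, at least tells you which log page request was the last one sent before the box went away.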
Moral: be careful with smartctl. You have been warned.
By the way, this looks like a firmware bug.
The notes for this firmware update
http://h20564.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=3924068&swItemId=MTX_5e52f965d84f41c2bb65d33b58&swEnvOid=4103#tab3
say that the following bug was fixed:
Problems Fixed:
Running SMARTCTL (smartmontools) on HP Proliant G6/G7 (Px1x) Smart Array controllers that have firmware version 5.70 to 6.62 installed with SATA drives attached may result in system not responding or reboot. When reboot occurred, a reboot 1719 POST error message with lockup 0x15 displayed.