First, let's see which disks we have.
```
root@il-nv-s06:~# lshw -c disk
  *-disk:0
       description: SCSI Disk
       product: SMC2108
       vendor: SMC
       physical id: 2.0.0
       bus info: scsi@0:2.0.0
       logical name: /dev/sda
       version: 2.90
       serial: 0074df64060b7e521510538600800403
       size: 2791GiB (2996GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=02712922-3f89-4077-8a1b-2ed197f3c54c
  *-disk:1
       description: SCSI Disk
       product: SMC2108
       vendor: SMC
       physical id: 2.1.0
       bus info: scsi@0:2.1.0
       logical name: /dev/sdb
       version: 2.90
       serial: 00405d940d100d0a1810538600800403
       size: 54GiB (58GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=992168b5-1ecd-4e43-ab0f-f2e0b945ab27
  *-disk:2
       description: SCSI Disk
       product: SMC2108
       vendor: SMC
       physical id: 2.2.0
       bus info: scsi@0:2.2.0
       logical name: /dev/sdc
       version: 2.90
       serial: 00074cce4a116a071810538600800403
       size: 7446GiB (7995GB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=92c542ab-7199-4525-89e3-057744b8397d
```
```
root@il-nv-s06:~# cat /proc/devices | grep mega
250 megaraid_sas_ioctl
```
```
root@il-nv-s06:~# echo 'deb http://hwraid.le-vert.net/ubuntu precise main' > /etc/apt/sources.list.d/raid.list
root@il-nv-s06:~# wget -O - http://hwraid.le-vert.net/debian/hwraid.le-vert.net.gpg.key | sudo apt-key add -
root@il-nv-s06:~# apt-get update
root@il-nv-s06:~# apt-get install megacli
```
Check a physical disk behind the MegaRAID controller for errors using megacli.
```
root@il-nv-s06:~# megacli -pdinfo -physdrv [4:0] -aALL

Enclosure Device ID: 4
Slot Number: 0
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 0
WWN: 5000C5002130CD08
Sequence Number: 2
Media Error Count: 38
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 930.390 GB [0x744c8000 Sectors]
Sector Size: 0
Firmware state: Online, Spun Up
Device Firmware Level: 0005
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c5002130cd09
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: SEAGATE ST31000424SS 00059WK1D042
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :29C (84.20 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: Unknown
Drive has flagged a S.M.A.R.T alert : No
```
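Most of this output is static inventory data; the lines that change when a disk starts to fail are Media Error Count, Other Error Count and Predictive Failure Count. As a minimal sketch, the output can be piped through awk to keep only those counters. The function name `pd_error_counts` and the output format are my own assumptions for illustration, not part of megacli.

```shell
# pd_error_counts: reads `megacli -pdinfo ...` or `megacli -PDList -aALL`
# output on stdin and prints one line per drive:
#   slot=<n> media=<n> other=<n> predictive=<n>
# The function name and output format are illustrative assumptions.
pd_error_counts() {
    awk -F': *' '
        /^Slot Number/              { slot = $2 }
        /^Media Error Count/        { media = $2 }
        /^Other Error Count/        { other = $2 }
        # "Last Predictive Failure Event Seq Number" starts with "Last",
        # so the anchored pattern below does not match it.
        /^Predictive Failure Count/ {
            printf "slot=%s media=%s other=%s predictive=%s\n", slot, media, other, $2
        }
    '
}
```

It could be used as `megacli -PDList -aALL | pd_error_counts`, which gives one short line per disk that is easy to feed into a monitoring system.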
You should also monitor the following parameters, using this command:
```
root@il-nv-s06:~# megacli -LdPdInfo -aALL | grep -E "(Id|State |Bad Blocks|Firmware state|Error Count|Predictive Failure Count)"
# First virtual drive, i.e. /dev/sda
Virtual Drive: 0 (Target Id: 0)
# RAID state (Degraded: one of the disks has a problem; Optimal: normal state)
State : Degraded
# Whether the virtual drive has bad blocks
Bad Blocks Exist: No
# ID of the physical disk
Device Id: 14
# Number of errors that could not be corrected; the most important counter
Media Error Count: 0
# Number of other errors, not related to bad blocks
Other Error Count: 0
# Number of predicted failures
Predictive Failure Count: 0
# Physical disk state (Rebuild: being added to the RAID; Online: part of the RAID)
# Other values: "Failed", "Online, Spun Up", "Online, Spun Down", "Unconfigured(bad)", "Unconfigured(good), Spun down", "Hotspare, Spun down", "Hotspare, Spun up" or "not Online".
Firmware state: Rebuild
Device Id: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Device Id: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Device Id: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Virtual Drive: 1 (Target Id: 1)
State : Optimal
Bad Blocks Exist: No
Device Id: 13
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Media Type: Solid State Device
Device Id: 12
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Media Type: Solid State Device
```
```
root@il-nv-s06:~# cat megaraid.sh
#!/bin/bash
# All the information about the physical and logical drives
VD_PDID_ERRORS=$(megacli -ldpdinfo -aALL | grep -E "(Id|State |Media Error|Firmware state)")
# All the information about the battery
BBU_OUT=$(megacli -AdpBbuCmd -aAll | grep -E "(Full Charge|^Max Error|Battery State)")
# Start at zero so that a healthy system prints 0 instead of an empty line
ERROR_COUNT=0

while read -r line
do
    # Catch the name (ID) of the logical drive
    VD=$(echo "${line}" | grep -Eo "Virtual Drive: [0-9]+")
    # Catch the name (ID) of the physical disk
    PD_ID=$(echo "${line}" | grep -E "Device Id:")
    # Catch the important physical disk errors
    PD_ERRORS=$(echo "${line}" | grep -E "(Media Error)")
    # Catch the RAID state (anchored, so "Media Type: Solid State Device" is not matched)
    RAID_STAT=$(echo "${line}" | grep -E "^State")
    # Catch the firmware state
    PD_FIRMWARE=$(echo "${line}" | grep -E "Firmware")
    if [ -n "${VD}" ]
    then
        DRIVE="${VD} ==> "
    elif [ -n "${RAID_STAT}" ]
    then
        VD_RAID_STAT=$(echo "${RAID_STAT}" | awk '{print $3}')
        VD_RAID="${DRIVE}${RAID_STAT} ==> "
        # If the RAID state is not the normal one, the error count grows
        if [ "${VD_RAID_STAT}" != 'Optimal' ]
        then
            #echo "Raid with problem"
            VDRIVE_WITH_FAIL="${VD_RAID} ${VDRIVE_WITH_FAIL}"
            let "ERROR_COUNT += 1"
        fi
    elif [ -n "${PD_ID}" ]
    then
        PD_DRIVE="${DRIVE}${PD_ID} ==> "
    elif [ -n "${PD_ERRORS}" ]
    then
        # If there are errors, catch how many
        PD_ERR=${PD_DRIVE}${PD_ERRORS}
        TRAP=$(echo "${PD_ERRORS}" | awk '{print $4}')
        let "ERROR_COUNT += ${TRAP}"
        if [ "${TRAP}" -ne 0 ]
        then
            DISK_WITH_FAIL="${PD_ERR} ${DISK_WITH_FAIL}"
        fi
    elif [ -n "${PD_FIRMWARE}" ]
    then
        # Check whether the firmware state is fine; if not, the error count grows
        PD_FIRM_STATUS=$(echo "${PD_FIRMWARE}" | cut --delimiter=":" -f2 | sed 's/ //g')
        PD_FIRM=${PD_DRIVE}${PD_FIRMWARE}
        if [ "${PD_FIRM_STATUS}" != "Online,SpunUp" ]
        then
            #echo "PD firmware with problem"
            PDFIRM_WITH_FAIL="${PD_FIRM} ${PDFIRM_WITH_FAIL}"
            let "ERROR_COUNT += 1"
        fi
    fi
done <<< "${VD_PDID_ERRORS}"

while read -r bbu_log
do
    BBU_STATE=$(echo "${bbu_log}" | grep -E "Battery State")
    BBU_ERROR=$(echo "${bbu_log}" | grep -E "Max Error")
    BBU_CHARGE=$(echo "${bbu_log}" | grep -E "Full Charge")
    if [ -n "${BBU_STATE}" ]
    then
        BBU_ST=$(echo "${BBU_STATE}" | awk '{print $3}')
        #echo ${BBU_ST}
        if [ "${BBU_ST}" = "Unknown" ]
        then
            #echo "Battery status is Unknown"
            let "ERROR_COUNT = 250"
            BBUSU_WITH_FAIL="${BBU_STATE}"
        elif [ "${BBU_ST}" != "Optimal" ]
        then
            #echo "Battery STATUS is BAD"
            BBUS_WITH_FAIL="${BBU_STATE}"
            let "ERROR_COUNT = 251"
        fi
    elif [ -n "${BBU_ERROR}" ]
    then
        BBU_ER=$(echo "${BBU_ERROR}" | awk '{print $4}')
        #echo ${BBU_ER}
        if [ "${BBU_ER}" -ge "11" ]
        then
            #echo "Battery has ERRORS"
            BBUE_WITH_FAIL="${BBU_ERROR}"
            let "ERROR_COUNT = 252"
        fi
    elif [ -n "${BBU_CHARGE}" ]
    then
        BBU_CHAR=$(echo "${BBU_CHARGE}" | awk '{print $4}')
        #echo ${BBU_CHAR}
        if [ "${BBU_CHAR}" -lt "675" ]
        then
            #echo "Battery has low CHARGE"
            BBUC_WITH_FAIL="${BBU_CHARGE}"
            let "ERROR_COUNT = 253"
        fi
    fi
done <<< "${BBU_OUT}"

# With the "log" argument print the failing items, otherwise print the error count
if [ "$1" = 'log' ]
then
    echo "${VDRIVE_WITH_FAIL} ${DISK_WITH_FAIL} ${PDFIRM_WITH_FAIL} ${BBUS_WITH_FAIL} ${BBUSU_WITH_FAIL} ${BBUE_WITH_FAIL} ${BBUC_WITH_FAIL} "
else
    echo "${ERROR_COUNT}"
fi
exit 0
```
```
root@il-nv-s06:~# ./megaraid.sh
252
root@il-nv-s06:~# ./megaraid.sh log
Max Error = 14 %
```
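The number printed by megaraid.sh is either a running count of disk and RAID problems or one of the fixed values 250 to 253 that the script assigns for battery problems. A small helper can turn it back into words. This is only a sketch: the function name `describe_megaraid_code` and the message texts are my assumptions, while the numeric values come from the script itself.

```shell
# describe_megaraid_code: turn the number printed by megaraid.sh into text.
# The message wording is an illustrative assumption; the values 250-253
# are the ones megaraid.sh assigns for battery (BBU) problems.
describe_megaraid_code() {
    case "$1" in
        ''|0)        echo "OK" ;;
        250)         echo "BBU: battery state is Unknown" ;;
        251)         echo "BBU: battery state is not Optimal" ;;
        252)         echo "BBU: Max Error is 11% or higher" ;;
        253)         echo "BBU: Full Charge Capacity is below 675 mAh" ;;
        *[!0-9]*)    echo "UNKNOWN: not a number" ;;
        *)           echo "DISK: error counter is $1" ;;
    esac
}
```

For example, `describe_megaraid_code "$(./megaraid.sh)"`. Keep in mind that the battery values overwrite the disk counter, so a battery code can hide disk errors; `./megaraid.sh log` shows the details.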
There is a whole reference manual on working with megacli.
Some useful commands:
```
# View the firmware terminal log (it includes BBU events), where you can find information about checks and automatic repair of bad sectors
megacli -fwtermlog -dsply -aall > /tmp/ttylog.txt
# Full information about all controller adapters
megacli -AdpAllInfo -aALL
# Full information about the configuration and the disks
megacli -CfgDsply -aALL
# Information about recent events, where you can find information about disk failures
megacli -AdpEventLog -GetLatest 4000 -f events.log -aALL
megacli -AdpEventLog -GetEvents -f events.log -aALL
# Information about all enclosures available to the controller
megacli -EncInfo -aALL
# List of all logical drives and the RAID type they are built with
megacli -LDInfo -Lall -aALL
# List of all physical disks
megacli -PDList -aALL
# Information about a specific physical disk
# Generic form of the command
megacli -pdinfo -physdrv [E1:S2] -aALL
# E1 - Enclosure Device ID: 1, S2 - Slot Number: 2
# To get these values, run: megacli -LdPdInfo -aALL | grep -E "ID|Slot"
megacli -pdinfo -physdrv [4:2] -aALL
# Make a disk's LED blink
# Start blinking
megacli -PdLocate -start -physdrv\[4:3\] -aALL
megacli -PdLocate -start -physdrv\[4:2\] -aALL
megacli -PdLocate -start -physdrv\[4:1\] -aALL
# Stop blinking
megacli -PdLocate -stop -physdrv\[4:1\] -aALL
megacli -PdLocate -stop -physdrv\[4:2\] -aALL
megacli -PdLocate -stop -physdrv\[4:3\] -aALL
# Check the state of the BBU (Battery Backup Unit)
megacli -adpbbucmd -aall
# Show the progress of adding a disk to the RAID (rebuild)
megacli -pdrbld -showprog -physdrv[4:0] -aAll
```
Monitoring disks with smartctl
For this we need the same megacli, which tells us the IDs of the physical disks and the logical drives they belong to. Let's begin. First we find the IDs of all physical disks behind the MegaRAID controller, along with the numbers of the corresponding logical drives.
```
root@il-nv-s06:~# megacli -LdPdInfo -aALL | grep Id
Virtual Drive: 0 (Target Id: 0)
Device Id: 0
Device Id: 1
Device Id: 2
Device Id: 3
Virtual Drive: 1 (Target Id: 1)
Device Id: 13
Device Id: 12
Virtual Drive: 2 (Target Id: 2)
Device Id: 11
Device Id: 10
Device Id: 9
Device Id: 6
Device Id: 7
Device Id: 8
```
- -LdPdInfo: get information (Info) about logical (Ld) and physical (Pd) devices ...
- -aALL: ... on all adapters
```
root@il-nv-s06:~# ls /dev/sd[a-z]
/dev/sda /dev/sdb /dev/sdc
```
- Virtual Drive: 0 == /dev/sda, which consists of 4 physical disks with ID=0,1,2,3
- Virtual Drive: 1 == /dev/sdb, which consists of 2 physical disks with ID=13,12
- Virtual Drive: 2 == /dev/sdc, which consists of 6 physical disks with ID=6,7,8,9,10,11
```
root@il-nv-s06:~# cat smartcheck.sh
#!/bin/bash
echo "============================================="
echo "================== /dev/sda ================="
echo "============================================="
smartctl -d megaraid,0 -a /dev/sda
smartctl -d megaraid,1 -a /dev/sda
smartctl -d megaraid,2 -a /dev/sda
smartctl -d megaraid,3 -a /dev/sda
echo "============================================="
echo "================== /dev/sdb ================="
echo "============================================="
smartctl -d megaraid,13 -a /dev/sdb
smartctl -d megaraid,12 -a /dev/sdb
echo "============================================="
echo "================== /dev/sdc ================="
echo "============================================="
smartctl -d megaraid,11 -a /dev/sdc
smartctl -d megaraid,10 -a /dev/sdc
smartctl -d megaraid,9 -a /dev/sdc
smartctl -d megaraid,6 -a /dev/sdc
smartctl -d megaraid,7 -a /dev/sdc
smartctl -d megaraid,8 -a /dev/sdc
```
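The device IDs in smartcheck.sh are written out by hand, so the script has to be edited whenever a disk is replaced or a virtual drive is added. As a rough sketch, the same list of smartctl commands can be generated from the megacli output. The function name `smartctl_commands` is hypothetical, and the mapping of Virtual Drive 0, 1, 2 to /dev/sda, /dev/sdb, /dev/sdc is an assumption that holds on this host but is not guaranteed elsewhere.

```shell
# smartctl_commands: reads `megacli -LdPdInfo -aALL` output on stdin and
# prints one smartctl command per physical disk.
# Assumption (true on the host in this article, not guaranteed elsewhere):
# Virtual Drive 0 is /dev/sda, Virtual Drive 1 is /dev/sdb, and so on.
smartctl_commands() {
    awk '
        /^Virtual Drive:/ {
            # "Virtual Drive: 1 (Target Id: 1)": the third field is the number
            dev = "/dev/sd" substr("abcdefghijklmnopqrstuvwxyz", $3 + 1, 1)
        }
        /^Device Id:/ {
            # "Device Id: 13": the third field is the physical disk ID
            printf "smartctl -d megaraid,%s -a %s\n", $3, dev
        }
    '
}
```

Running `megacli -LdPdInfo -aALL | smartctl_commands` prints the commands for review; appending `| sh` would execute them.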
```
root@il-nv-s06:~# smartctl -d megaraid,0 -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.8.0-26-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE
Product:              ST31000424SS
Revision:             0005
User Capacity:        1,000,204,886,016 bytes [1.00 TB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c5002130cd0b
Serial number:        9WK1D0420000C1051TRW
Device type:          disk
Transport protocol:   SAS
Local Time is:        Fri Feb  7 20:24:25 2014 IST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        68 C
Manufactured in week 32 of year 2010
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  30
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 920579338
  Blocks received from initiator = 3734205770
  Blocks read from cache and sent to initiator = 2669309657
  Number of read and write commands whose size <= segment size = 101596876
  Number of read and write commands whose size > segment size = 1211
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 24230.63
  number of minutes until next internal SMART test = 20

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3033913199      210         0  3033913409   3033913469      39052.656          60
write:           0        0         0           0            0       4141.743           0
verify:   75533051       10         0    75533061     75533061       1001.100           0

Non-medium error count:       14

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   24200                 - [-   -    -]

Long (extended) Self Test duration: 11100 seconds [185.0 minutes]
```
Let me decode the error output a little:
The error counter log (if it is available) is displayed on separate lines:
- write error counters: write errors
- read error counters: read errors
- verify error counters (shown only when non-zero): verify errors
- non-medium error counter (a single number): the number of recoverable errors other than write/read/verify errors
```
Error 3 occurred at disk power-on lifetime: 23855 hours (993 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 51 08 4c 08 0f e0  Error: IDNF at LBA = 0x000f084c = 985164

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 08 4c 08 0f 00 08  19d+06:08:39.873  WRITE DMA
  ca 00 08 5c 05 0f 00 08  19d+06:08:39.873  WRITE DMA
  c8 00 10 9c a0 25 00 08  19d+06:08:39.866  READ DMA
  c8 00 08 94 a0 25 00 08  19d+06:08:39.866  READ DMA
  c8 00 08 8c a0 25 00 08  19d+06:08:39.862  READ DMA
```
Errors Corrected by ECC, fast [Errors corrected without substantial delay: 00h]. An error correction was applied to get perfect data (a.k.a. ECC on-the-fly). «Without substantial delay» means the correction did not postpone reading of later sectors (e.g. a revolution was not lost). The counter is incremented once for each logical block that requires correction. Two different blocks corrected during the same command are counted as two events.
Errors Corrected by ECC: delayed [Errors corrected with possible delays: 01h]. An error code or algorithm (e.g. ECC, checksum) is applied in order to get perfect data with substantial delay. «With possible delay» means the correction took longer than a sector time so that reading/writing of subsequent sectors was delayed (e.g. a lost revolution). The counter is incremented once for each logical block that requires correction. A block with a double error that is correctable counts as one event and two different blocks corrected during the same command count as two events.
Error corrected by rereads/rewrites [Total (e.g. rewrites and rereads): 02h]. This parameter code specifies the counter counting the number of errors that are corrected by applying retries. This counts errors recovered, not the number of retries. If five retries were required to recover one block of data, the counter increments by one, not five. The counter is incremented once for each logical block that is recovered using retries. If an error is not recoverable while applying retries and is recovered by ECC, it isn’t counted by this counter; it will be counted by the counter specified by parameter code 01h — Errors Corrected With Possible Delays.
Total errors corrected [Total errors corrected: 03h]. This counter counts the total of parameter code errors 00h, 01h and 02h (i.e. error corrected by ECC: fast and delayed plus errors corrected by rereads and rewrites). There is no «double counting» of data errors among these three counters. The sum of all correctable errors can be reached by adding parameter code 01h and 02h errors, not by using this total. [The author does not understand the previous sentence from the Seagate manual.]
Correction algorithm invocations [Total times correction algorithm processed: 04h]. This parameter code specifies the counter that counts the total number of retries, or «times the retry algorithm is invoked». If after five attempts a counter 02h type error is recovered, then five is added to this counter. If three retries are required to get stable ECC syndrome before a counter 01h type error is corrected, then those three retries are also counted here. The number of retries applied to unsuccessfully recover an error (counter 06h type error) are also counted by this counter.
Gigabytes processed {10^9} [Total bytes processed: 05h]. This parameter code specifies the counter that counts the total number of bytes either successfully or unsuccessfully read, written or verified (depending on the log page) from the drive. If a transfer terminates early because of an unrecoverable error, only the logical blocks up to and including the one with the uncorrected data are counted. [smartmontools divides this counter by 10^9 before displaying it with three digits to the right of the decimal point. This makes this 64 bit counter easier to read.]
Total uncorrected errors [Total uncorrected errors: 06h]. This parameter code specifies the counter that contains the total number of blocks for which an uncorrected data error has occurred.
Of all this, the parameter we care about is Total uncorrected errors, which shows the number of errors that could not be corrected. If this number is large, run a long self-test and also check the parameters of the physical disk in the MegaRAID controller.
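Since Total uncorrected errors is the last column of the read, write and verify rows in the error counter log, it is easy to pull out for monitoring. The following is a minimal sketch for SCSI/SAS output like the one shown earlier; the function name `uncorrected_errors` and the "operation value" output format are assumptions made for illustration.

```shell
# uncorrected_errors: reads `smartctl -a` output for a SCSI/SAS disk on stdin
# and prints the "Total uncorrected errors" column of the error counter log,
# one line per operation, for example "read 60".
# The function name and output format are illustrative assumptions.
uncorrected_errors() {
    awk '
        /^(read|write|verify):/ {
            op = $1
            sub(/:$/, "", op)   # "read:" becomes "read"
            print op, $NF       # the last column is Total uncorrected errors
        }
    '
}
```

It could be used as `smartctl -d megaraid,0 -a /dev/sda | uncorrected_errors`.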
Monitoring disks with smartd
The previous methods of monitoring disks were manual: you either run the checks by hand while logged in to a particular server, or set up a monitoring system that uses the scripts above to collect information about the state of the disks. There is one more way to monitor disks: the smartd daemon, which will email us about problem disks. The smartd settings are described in detail in the smartd documentation.
First, enable the daemon at boot.
```
root@il-nv-s06:~# cat /etc/default/smartmontools
start_smartd=yes
smartd_opts="--interval=3600"
```
```
root@il-nv-s06:~# cat /etc/smartd.conf
# Disks to monitor
/dev/sda -d megaraid,0 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sda -d megaraid,1 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sda -d megaraid,2 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sda -d megaraid,3 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdb -d megaraid,13 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdb -d megaraid,12 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,11 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,10 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,9 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,6 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,7 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
/dev/sdc -d megaraid,8 -o on -S on -m your@emailaddress.com -M diminishing -a -s (S/../.././00|L/../../7/03)
root@il-nv-s06:~# /etc/init.d/smartd restart
```
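Every line in this file differs only in the device and the megaraid ID. The `-s` argument is a schedule in the form T/MM/DD/d/HH: `S/../.././00` runs a short self-test every day at 00:00, and `L/../../7/03` runs a long self-test every Sunday at 03:00. Because the lines are so repetitive, they can be generated. The sketch below does that; the function name `smartd_lines` and its argument order are my own assumptions, and the directives simply follow the configuration above.

```shell
# smartd_lines: prints smartd.conf entries for one logical device.
# Usage: smartd_lines <device> <email> <id> [<id> ...]
# The function name and argument order are illustrative assumptions;
# the directives follow the configuration shown in this article.
smartd_lines() {
    dev=$1
    email=$2
    shift 2
    for id in "$@"; do
        printf '%s -d megaraid,%s -o on -S on -m %s -M diminishing -a -s (S/../.././00|L/../../7/03)\n' \
            "$dev" "$id" "$email"
    done
}
```

For example, `smartd_lines /dev/sdb your@emailaddress.com 13 12` prints the two entries for the second virtual drive, ready to be appended to /etc/smartd.conf.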
In the next article I will try to describe how to solve battery problems on MegaRAID and on any other RAID controller. After that we will talk about monitoring disks behind an HP controller (HP/Compaq SmartArray).
Monitoring the BBU of RAID controllers
The previous article covered installing megacli to monitor disks behind LSI 2108 MegaRAID controllers. Here I want to describe monitoring of the battery (BBU) for RAID controllers in general: which steps to take, and how to avoid creating extra problems for yourself when the BBU (Battery Backup Unit) reports errors or malfunctions.
The state of the battery needs to be checked periodically. A RAID controller may not have this component at all, because its main purpose is to hold in the cache the data that has not yet been written to disk, i.e. to preserve data integrity on power failure (a sudden loss of electricity).
Nowadays almost all RAID controllers support caching data at the controller level. That is, each physical disk has its own cache, and the controller has a cache on top of that. This improves performance when storing large amounts of data or when serving content to end users at a very high rate.
If a RAID controller can cache data, you can configure read, write and cache (buffering) policies on it.
Read Policy: the read policy specifies how the controller should read sectors of the logical drives when looking for the requested data.
- Read-Ahead. With the read-ahead policy the controller reads sectors of the logical drive sequentially, ahead of the requested data. Performance improves if the data is written sequentially, sector after sector, on the logical drives.
- No-Read-Ahead. Disables the read-ahead policy on the controller.
- Adaptive Read-Ahead. With the adaptive read-ahead policy the controller starts reading ahead only when recent read requests accessed sequential sectors of the logical drive. If the requests access random sectors, the controller falls back to no-read-ahead mode.
Write Policy: The write policies specify whether the controller sends a write-request completion signal as soon as the data is in the cache or after it has been written to disk.
- Write-Back. When using write-back caching, the controller sends a write-request completion signal as soon as the data is in the controller cache but has not yet been written to disk. Write-back caching may provide improved performance since subsequent read requests can more quickly retrieve data from the controller cache than they could from the disk. Write-back caching also entails a data security risk, however, since a system failure could prevent the data from being written to disk even though the controller has sent a write-request completion signal. In this case, data may be lost. Other applications may also experience problems when taking actions that assume the data is available on the disk.
- Write-Through. When using write-through caching, the controller sends a write-request completion signal only after the data is written to the disk. Write-through caching provides better data security than write-back caching, since the system assumes the data is available only after it has been safely written to the disk.
Cache Policy: The Direct I/O and Cache I/O cache policies apply to reads on a specific virtual disk. These settings do not affect the read-ahead policy. The cache policies are as follows:
- Cache I/O. Specifies that all reads are buffered in cache memory.
- Direct I/O. Specifies that reads are not buffered in cache memory. When using direct I/O, data is transferred to the controller cache and the host system simultaneously during a read request. If a subsequent read request requires data from the same data block, it can be read directly from the controller cache. The direct I/O setting does not override the cache policy settings. Direct I/O is also the default setting.
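These policies matter for BBU monitoring because a controller with a bad battery may silently switch a virtual drive from write-back to write-through. One way to notice that is to compare the default and the current cache policy of every virtual drive. The sketch below assumes that `megacli -LDInfo -Lall -aALL` prints `Default Cache Policy:` and `Current Cache Policy:` lines for each virtual drive; that output is not shown in this article, so treat the format as an assumption. The function name `cache_policy_drift` is hypothetical.

```shell
# cache_policy_drift: reads `megacli -LDInfo -Lall -aALL` output on stdin and
# prints the virtual drives whose current cache policy differs from the
# configured default (for example write-back downgraded to write-through).
# Assumes "Default Cache Policy:" and "Current Cache Policy:" lines exist.
cache_policy_drift() {
    awk -F': *' '
        /^Virtual Drive:/         { vd = $2; sub(/ .*/, "", vd) }
        /^ *Default Cache Policy/ { dflt = $2 }
        /^ *Current Cache Policy/ {
            if ($2 != dflt)
                printf "VD %s: default [%s] current [%s]\n", vd, dflt, $2
        }
    '
}
```

An empty result from `megacli -LDInfo -Lall -aALL | cache_policy_drift` would mean every virtual drive is running with its configured policy.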
Now let's turn to practice, using a MegaRAID controller as the example.
First, check the logs:
```
root@il:~# megacli -fwtermlog -dsply -aall
```
If the output is a pile of lines like these:
```
02/13/14  6:30:57: EVT#715494-02/13/14  6:30:57: 150=Battery needs replacement - SOH Bad
02/13/14  6:32:02: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:32:02: Battery needs replacement
02/13/14  6:32:02: EVT#715495-02/13/14  6:32:02: 150=Battery needs replacement - SOH Bad
02/13/14  6:33:07: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:33:07: Battery needs replacement
02/13/14  6:33:07: EVT#715496-02/13/14  6:33:07: 150=Battery needs replacement - SOH Bad
02/13/14  6:34:12: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:34:12: Battery needs replacement
02/13/14  6:34:12: EVT#715497-02/13/14  6:34:12: 150=Battery needs replacement - SOH Bad
02/13/14  6:35:17: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:35:17: Battery needs replacement
02/13/14  6:35:17: EVT#715498-02/13/14  6:35:17: 150=Battery needs replacement - SOH Bad
02/13/14  6:36:22: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:36:22: Battery needs replacement
02/13/14  6:36:22: EVT#715499-02/13/14  6:36:22: 150=Battery needs replacement - SOH Bad
02/13/14  6:37:27: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:37:27: Battery needs replacement
02/13/14  6:37:27: EVT#715500-02/13/14  6:37:27: 150=Battery needs replacement - SOH Bad
02/13/14  6:38:32: Not enough charge capacity left in battery for expeded data retention duration
02/13/14  6:38:32: Battery needs replacement
02/13/14  6:38:32: EVT#715501-02/13/14  6:38:32: 150=Battery needs replacement - SOH Bad
02/13/14  6:39:37: Not enough charge capacity left in battery for expeded data retention duration
```
... the first thing to do is check the BBU status with the tool we installed:
```
root@il:~# megacli -adpbbucmd -aall

BBU status for Adapter: 0

BatteryType: iBBU
Voltage: 4008 mV
Current: 0 mA
Temperature: 26 C
Battery State: Failed
BBU Firmware Status:

  Charging Status : None
  Voltage : OK
  Temperature : OK
  Learn Cycle Requested : No
  Learn Cycle Active : No
  Learn Cycle Status : OK
  Learn Cycle Timeout : No
  I2c Errors Detected : No
  Battery Pack Missing : No
  Battery Replacement required : Yes
  Remaining Capacity Low : Yes
  Periodic Learn Required : No
  Transparent Learn : No
  No space to cache offload : No
  Pack is about to fail & should be replaced : No
  Cache Offload premium feature required : No
  Module microcode update required : No

GasGuageStatus:
  Fully Discharged : No
  Fully Charged : Yes
  Discharging : Yes
  Initialized : Yes
  Remaining Time Alarm : No
  Discharge Terminated : No
  Over Temperature : No
  Charging Terminated : No
  Over Charged : No
  Relative State of Charge: 96 %
  Charger System State: 49168
  Charger System Ctrl: 0
  Charging current: 0 mA
  Absolute state of charge: 5161 %
  Max Error: 19 %

  Battery backup charge time : 0 hours

BBU Capacity Info for Adapter: 0

Relative State of Charge: 96 %
Absolute State of charge: 5161 %
Remaining Capacity: 62706 mAh
Full Charge Capacity: 65467 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 119
Max Error = 19 %
Remaining Capacity Alarm = 120 mAh
Remining Time Alarm = 10 Min

BBU Design Info for Adapter: 0

Date of Manufacture: 12/01, 2010
Design Capacity: 1215 mAh
Design Voltage: 3700 mV
Specification Info: 33
Serial Number: 3241
Pack Stat Configuration: 0x64a0
Manufacture Name: LS1121001A
Firmware Version :
Device Name: 3150301
Device Chemistry: LION
Battery FRU: N/A
Transparent Learn = 0
App Data = 0

BBU Properties for Adapter: 0

Auto Learn Period: 30 Days
Next Learn time : Thu Feb 20 06:33:27 2014
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
```
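Only a handful of these fields are needed to judge the battery. As a minimal sketch, they can be extracted into key=value pairs for a monitoring script. The function name `bbu_summary` and the key names are assumptions made for illustration; the field names are the ones that appear in the megacli output.

```shell
# bbu_summary: reads `megacli -adpbbucmd -aall` output on stdin and prints
# the fields discussed in this article as key=value pairs.
# The function name and key names are illustrative assumptions.
bbu_summary() {
    awk -F' *[:=] *' '
        /^ *Battery State/                { print "state=" $2 }
        /^ *Battery Replacement required/ { print "replace=" $2 }
        /^ *Full Charge Capacity/         { print "full_charge=" $2 }
        # Only the "Max Error = N %" form; "Max Error: N %" is skipped
        /^ *Max Error *=/                 { print "max_error=" $2 }
    '
}
```

With the output above, `megacli -adpbbucmd -aall | bbu_summary` would report a failed state and a required replacement in four short lines.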
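Confirm that the problem is real by checking the parameters Battery Replacement required: Yes and Battery State: Failed. Before panicking and replacing the battery, also look at the parameter Run time to empty: Battery is not being charged. It means that the battery needs to be charged, since it has not been charged yet, and the state of charge is:
Relative State of Charge: 96 %
Absolute State of charge: 5161 %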
So the first thing to do is to try charging the BBU and, if that does not help, check that everything is connected properly. If the problem persists, the battery has to be replaced.
You can use these steps to identify and resolve a problem with the BBU:
| Num | Type | Description | Indication | Actions |
|-----|------|-------------|------------|---------|
| 2 | F | Unable to recover cache data from TBBU | A,B | 1 |
| 10 | F | Controller cache discarded due to memory/battery problems | A,B | 1 |
| 11 | F | Unable to recover cache data due to configuration mismatch | A,B,C | 1 |
| 146 | W | Battery voltage low | N/A | 1,2 |
| 162 | W | Current capacity of the battery is below threshold | B | 1,2 |
| 150 | F | Battery needs replacement - SOH Bad | D | 1,2 |
| 154 | W | Battery relearn timed out | D | 1,2 |
| 161 | W | Battery removed | D | 1,2 |
| 200 | C | Battery/charger problems detected: SOH Bad | D | 1,2 |
| 211 | C | BBU Retention test failed! | G | 1,2 |
| 142 | W | Battery Not Present | N/A | 1,2 |
| 253 | W | Battery requires reconditioning: please initiate a LEARN cycle | N/A | 3 |
| 307 | W | Periodic Battery Relearn is pending. Please initiate manual learn cycle as Automatic learn is not enabled | H | 3 |
| 195 | W | BBU disabled: changing WB to WT | E,F | 4,5 |
| 330 | W | Detected error with the remote battery connector cable | N/A | 6 |
Type:
F = Fatal. W = Warning. C = Critical.
Indication:
A) Sudden power loss or system hang, when BBU is not fully charged and Write Back mode is forcefully enabled
B) The extended power loss to the system has resulted in the BBU being thoroughly discharged before power recovery.
C) The specific virtual drive configuration may have changed, so that previous virtual drive information cannot be recovered from BBU data
D) BBU failure or it is installed or connected incorrectly.
E) BBU not connected or not fully charged.
F) If WB mode was enabled before BBU charge, then it will be automatically re-enabled after the charge
G) BBU not able to keep cache data long enough during system power off.
H) The battery requires a relearn cycle to re-calibrate itself.
Action
1) Check the BBU status to see if the BBU should be charged or replaced.
2) Check the cable, power connection, backplane, SATA/SAS port, and make sure the BBU is installed and connected correctly.
3) Use RAID Web Console 2 or RAID BIOS Console to initiate a battery re-learn cycle.
4) Wait until the BBU is fully charged before rebooting the system.
5) WB can still be used through Bad BBU mode under RAID Web Console 2 but unexpected power failure may cause data loss.
6) Check if the remote battery connector cable is properly connected and functional.
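The table can be applied mechanically: extract the event numbers from the firmware log and look up the suggested actions. The sketch below shows one way to do it. Both function names are hypothetical, and the sed pattern assumes EVT lines of the form shown in the log excerpt earlier (`EVT#<seq>-<date> <time>: <num>=<text>`).

```shell
# fwtermlog_event_codes: reads `megacli -fwtermlog -dsply -aall` output on
# stdin and prints the distinct event numbers found in EVT lines.
# Assumes lines such as:
#   02/13/14 6:30:57: EVT#715494-02/13/14 6:30:57: 150=Battery needs ...
fwtermlog_event_codes() {
    sed -n 's/.*EVT *#[0-9]*-.*: \([0-9][0-9]*\)=.*/\1/p' | sort -un
}

# bbu_event_actions: prints the action numbers that the table lists for a
# BBU-related event number; prints nothing for events not in the table.
bbu_event_actions() {
    case "$1" in
        2|10|11)                          echo "1" ;;
        142|146|150|154|161|162|200|211)  echo "1,2" ;;
        253|307)                          echo "3" ;;
        195)                              echo "4,5" ;;
        330)                              echo "6" ;;
    esac
}
```

For the log shown earlier, the only event number is 150, and `bbu_event_actions 150` points to actions 1 and 2: check whether the BBU should be charged or replaced, then check the cabling and connections.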