SATA hotswap with mdadm RAID

Posted in bash, Debian, OS, Scripts

Happy as I was to have received the new hot-swap enclosure a while ago (more than a year), I had never actually performed a hot swap. Today I sinned by ignoring one of the maxims of IT: if it works, do NOT touch it.
Since I know that misfortune sits down beside us precisely when everything is blowing up, I went ahead and tested it.

Environment:
I have a file server and, unfortunately, I was left with a RAID1 running on a single disk. Since I use it as the Recycle Bin for the Samba accounts, I honestly do not care much about what is stored there, and if it breaks the damage would not be serious.
The current state is:

root@server:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdf[1]
      976761424 blocks super 1.2 [2/1] [_U]
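
The [2/1] [_U] in the output above means the array is configured for two members but only one is active. As a rough sketch, the same check can be scripted; the function name degraded_arrays is made up for this example, and it only looks at the [configured/active] counter on each array's status line:

```shell
#!/bin/sh
# Print the names of degraded md arrays found in mdstat-formatted text on stdin.
# An array counts as degraded when its [configured/active] counter, e.g. [2/1],
# shows fewer active members than configured ones.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # Strip the brackets and split "2/1" into its two numbers.
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (name != "" && n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    '
}

# Usage on a live system: degraded_arrays < /proc/mdstat
```

With the state shown above as input, this should print md0.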

These were the steps I followed:

  1. Unmount the filesystem:
    umount /mnt/md0/
  2. Stop the RAID array:
    mdadm --manage /dev/md0 --stop
  3. Put the hard disk to sleep (hdparm -Y selects the drive's lowest-power sleep mode; plain standby would be -y):
    hdparm -Y /dev/sdf
  4. And pull it out hot.

I read somewhere that we should also run the following (using the name of the disk being removed):

echo 1 > /sys/block/sdb/device/delete

Just in case the kernel has not realized that the drive has been physically pulled out. (Source: http://serverfault.com/questions/5336/how-do-i-make-linux-recognize-a-new-sata-dev-sda-drive-i-hot-swapped-in-without)

Since I had not done this, I checked the logs:

Dec  2 12:51:45 server kernel: [793838.144423] ata6: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Dec  2 12:51:45 server kernel: [793838.144481] ata6: irq_stat 0x00400040, connection status changed
Dec  2 12:51:45 server kernel: [793838.144528] ata6: SError: { PHYRdyChg 10B8B DevExch }
Dec  2 12:51:45 server kernel: [793838.144572] ata6: hard resetting link
Dec  2 12:51:46 server kernel: [793838.863111] ata6: SATA link down (SStatus 0 SControl 300)
Dec  2 12:51:51 server kernel: [793843.852521] ata6: hard resetting link
Dec  2 12:51:51 server kernel: [793844.171849] ata6: SATA link down (SStatus 0 SControl 300)
Dec  2 12:51:51 server kernel: [793844.171860] ata6: limiting SATA link speed to 1.5 Gbps
Dec  2 12:51:56 server kernel: [793849.161272] ata6: hard resetting link
Dec  2 12:51:57 server kernel: [793849.480594] ata6: SATA link down (SStatus 0 SControl 310)
Dec  2 12:51:57 server kernel: [793849.480603] ata6.00: disabled
Dec  2 12:51:57 server kernel: [793849.480614] ata6: EH complete
Dec  2 12:51:57 server kernel: [793849.480623] ata6.00: detaching (SCSI 5:0:0:0)
Dec  2 12:51:57 server kernel: [793849.480940] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
Dec  2 12:51:57 server kernel: [793849.480977] sd 5:0:0:0: [sdf]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Dec  2 12:51:57 server kernel: [793849.480981] sd 5:0:0:0: [sdf] Stopping disk
Dec  2 12:51:57 server kernel: [793849.480988] sd 5:0:0:0: [sdf] START_STOP FAILED
Dec  2 12:51:57 server kernel: [793849.480990] sd 5:0:0:0: [sdf]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
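
Most of that log is error-handling noise; the lines that matter are the link state changes and the final "disabled". A small filter can pull those out. This is only a sketch: link_events is a name invented here, and the log location in the usage comment is an assumption that depends on how syslog is set up:

```shell
#!/bin/sh
# link_events PORT: from kernel log text on stdin, keep only the lines that
# report the SATA link going up or down, or the device being disabled,
# for the given ATA port (e.g. ata6).
link_events() {
    grep -E " $1(\.[0-9]+)?: (SATA link (up|down)|disabled)"
}

# Usage (log path is an assumption; adjust to your syslog configuration):
#   link_events ata6 < /var/log/kern.log
#   dmesg | link_events ata6
```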

Since I am very curious, I tried the steps again, this time including the delete:

  1. Unmount the filesystem:
    umount /mnt/md0/
  2. Stop the RAID array:
    mdadm --manage /dev/md0 --stop
  3. Put the hard disk to sleep:
    hdparm -Y /dev/sdf
  4. Remove the disk from the kernel (a logical removal of the SCSI device, done before touching the hardware):
    echo 1 > /sys/block/sdf/device/delete
  5. And pull it out hot.
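
The software part of the steps above can be wrapped in a small script. This is a sketch under assumptions, not a tested tool: the DRY_RUN switch and the function names are inventions of this example, and by default it only prints what it would run, since every one of these commands needs root and takes storage offline.

```shell
#!/bin/sh
# Sketch of the removal sequence from the steps above.
# DRY_RUN is a convention of this sketch, not of mdadm or hdparm:
# with DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# detach_disk DISK: ask the kernel to drop the SCSI device behind DISK (e.g. sdf).
detach_disk() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ echo 1 > /sys/block/$1/device/delete"
    else
        echo 1 > "/sys/block/$1/device/delete"
    fi
}

# prepare_removal MOUNTPOINT MD_DEVICE DISK
# Stops at the first command that fails, so the disk is never detached
# while the filesystem is still mounted or the array is still running.
prepare_removal() {
    run umount "$1" &&
    run mdadm --manage "$2" --stop &&
    run hdparm -Y "/dev/$3" &&
    detach_disk "$3"
}

prepare_removal /mnt/md0 /dev/md0 sdf
```

Run as is, it lists the four commands; running it as root with DRY_RUN=0 would execute them, after which the disk can be pulled.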

Logs:

Dec  2 14:36:15 srv-it kernel: [800094.929215] sd 5:0:0:0: [sdf] Synchronizing SCSI cache
Dec  2 14:36:15 srv-it kernel: [800094.929259] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Dec  2 14:36:15 srv-it kernel: [800094.929328] ata6.00: waking up from sleep
Dec  2 14:36:15 srv-it kernel: [800094.929367] ata6: hard resetting link
Dec  2 14:36:16 srv-it kernel: [800095.245764] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  2 14:36:16 srv-it kernel: [800095.246223] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20110623/psargs-359)
Dec  2 14:36:16 srv-it kernel: [800095.246232] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT5._GTF] (Node ffff8806060c0588), AE_NOT_FOUND (20110623/psparse-536)
Dec  2 14:36:16 srv-it kernel: [800095.246854] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20110623/psargs-359)
Dec  2 14:36:16 srv-it kernel: [800095.246860] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT5._GTF] (Node ffff8806060c0588), AE_NOT_FOUND (20110623/psparse-536)
Dec  2 14:36:16 srv-it kernel: [800095.247041] ata6.00: configured for UDMA/133
Dec  2 14:36:16 srv-it kernel: [800095.247045] ata6.00: retrying FLUSH 0xea Emask 0x0
Dec  2 14:36:16 srv-it kernel: [800095.247112] ata6: EH complete
Dec  2 14:36:16 srv-it kernel: [800095.247186] sd 5:0:0:0: [sdf] Stopping disk
Dec  2 14:36:16 srv-it kernel: [800095.247235] sdf: detected capacity change from 0 to 1000204886016
Dec  2 14:36:20 srv-it kernel: [800099.758977] ata6.00: disabled
Dec  2 14:36:38 srv-it kernel: [800117.854892] ata6: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Dec  2 14:36:38 srv-it kernel: [800117.854950] ata6: irq_stat 0x00400040, connection status changed
Dec  2 14:36:38 srv-it kernel: [800117.854997] ata6: SError: { PHYRdyChg 10B8B DevExch }
Dec  2 14:36:38 srv-it kernel: [800117.855043] ata6: hard resetting link
Dec  2 14:36:39 srv-it kernel: [800118.576295] ata6: SATA link down (SStatus 0 SControl 300)
Dec  2 14:36:39 srv-it kernel: [800118.576306] ata6: EH complete

 
Now to reconnect it: plug the disk back in and then run:

  1. Assemble the already configured array:
    mdadm -A /dev/md0
  2. Check whether it came up:
    cat /proc/mdstat
  3. If it shows up as (auto-read-only), run:
    mdadm --readwrite /dev/md0
  4. Now the filesystem can be mounted:
    mount -a
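
The conditional in step 3 is easy to script. In the sketch below, needs_readwrite and bring_up are hypothetical names, and bring_up assumes, as in this post, that the array is already described in the mdadm configuration and that the mount point is listed in /etc/fstab:

```shell
#!/bin/sh
# needs_readwrite NAME: succeed if the mdstat-formatted text on stdin shows
# array NAME (e.g. md0) as active (auto-read-only).
needs_readwrite() {
    grep -q "^$1 : active (auto-read-only)"
}

# bring_up NAME: assemble the array, switch it to read-write only if the
# kernel started it as auto-read-only, then mount everything from fstab.
# Needs root; it is defined here but deliberately not called.
bring_up() {
    mdadm -A "/dev/$1" || return 1
    if needs_readwrite "$1" < /proc/mdstat; then
        mdadm --readwrite "/dev/$1" || return 1
    fi
    mount -a
}

# Example (as root): bring_up md0
```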

In the end it looks something like this:

root@server:~# mdadm -A /dev/md0
mdadm: /dev/md0 has been started with 1 drive (out of 2).
root@server:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active (auto-read-only) raid1 sdf[1]
      976761424 blocks super 1.2 [2/1] [_U]
root@server:~# mdadm --readwrite /dev/md0
root@server:~# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdf[1]
      976761424 blocks super 1.2 [2/1] [_U]
root@server:~# mount -a
root@server:~# dfc
FILESYSTEM               (=) USED      FREE (-) %USED AVAILABLE     TOTAL MOUNTED ON
/dev/md0                 [==------------------]    5%    870.1G    916.9G /mnt/md0
root@server:~#

Connection log:

Dec  2 14:50:21 server kernel: [800938.675339] ata6: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
Dec  2 14:50:21 server kernel: [800938.675397] ata6: irq_stat 0x00000040, connection status changed
Dec  2 14:50:21 server kernel: [800938.675444] ata6: SError: { CommWake DevExch }
Dec  2 14:50:21 server kernel: [800938.675485] ata6: hard resetting link
Dec  2 14:50:27 server kernel: [800944.425243] ata6: link is slow to respond, please be patient (ready=0)
Dec  2 14:50:28 server kernel: [800945.934056] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  2 14:50:28 server kernel: [800945.950971] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20110623/psargs-359)
Dec  2 14:50:28 server kernel: [800945.950980] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT5._GTF] (Node ffff8806060c0588), AE_NOT_FOUND (20110623/psparse-536)
Dec  2 14:50:28 server kernel: [800945.951226] ata6.00: ATA-8: ST1000DM003-9YN162, CC4B, max UDMA/133
Dec  2 14:50:28 server kernel: [800945.951229] ata6.00: 1953525168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Dec  2 14:50:28 server kernel: [800945.951618] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20110623/psargs-359)
Dec  2 14:50:28 server kernel: [800945.951624] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT5._GTF] (Node ffff8806060c0588), AE_NOT_FOUND (20110623/psparse-536)
Dec  2 14:50:28 server kernel: [800945.951826] ata6.00: configured for UDMA/133
Dec  2 14:50:28 server kernel: [800945.951832] ata6: EH complete
Dec  2 14:50:28 server kernel: [800945.951931] scsi 5:0:0:0: Direct-Access     ATA      ST1000DM003-9YN1 CC4B PQ: 0 ANSI: 5
Dec  2 14:50:28 server kernel: [800945.952113] sd 5:0:0:0: [sdf] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
Dec  2 14:50:28 server kernel: [800945.952116] sd 5:0:0:0: [sdf] 4096-byte physical blocks
Dec  2 14:50:28 server kernel: [800945.952158] sd 5:0:0:0: Attached scsi generic sg5 type 0
Dec  2 14:50:28 server kernel: [800945.952240] sd 5:0:0:0: [sdf] Write Protect is off
Dec  2 14:50:28 server kernel: [800945.952245] sd 5:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Dec  2 14:50:28 server kernel: [800945.952286] sd 5:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Dec  2 14:50:28 server kernel: [800945.968907]  sdf: unknown partition table
Dec  2 14:50:28 server kernel: [800945.969128] sd 5:0:0:0: [sdf] Attached SCSI disk

Sources:
http://blog.kihltech.com/2012/12/sata-hotswap-drive-in-mdadm-raid-array/