FS#9122 - Proposition of a new raid hook & few assorted changes

Attached to Project: Arch Linux
Opened by Michal Soltys (msoltyspl) - Monday, 07 January 2008, 18:10 GMT
Last edited by Tobias Powalowski (tpowa) - Thursday, 12 March 2009, 17:47 GMT
Task Type Feature Request
Category System
Status Closed
Assigned To Tobias Powalowski (tpowa)
Simo Leone (neotuli)
Aaron Griffin (phrakture)
Thomas Bächler (brain0)
Architecture All
Severity Medium
Priority Normal
Reported Version 2007.08-2
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 1
Private No

Details

[ I've sent this to arch-general@ as well, but it still hasn't shown up, so I'm posting it here as well ]

Hello

For some time I've been using different hooks for raid assembly, along with
a few assorted changes. I planned to post this on bugs.archlinux.org, but
I'm not sure if it's the proper place for things like this.

Anyway, for your consideration.

Advantages of this approach:

- only the required modules are loaded
- a single hook for both partitionable and non-partitionable raids
- no more tedious listing of components during boot, as
  everything is assembled as per mdadm.conf
- if the user insists on creating the initramfs w/o mdadm.conf, one will
  be created on the fly during boot. The user can inspect it and adjust
  the root parameter, should it be inappropriate (if booting fails)

Below are udev rules compatible with the default (and recommended in the man page) mdadm behaviour. The current udev rules ignore partitionable raids completely, and get in the way of the default mdadm behaviour (potentially in both raid cases).

# md block devices

## regular non-partitionable raid
KERNEL=="md[0-9]*", NAME="md/%n", SYMLINK="%k"

## partitionable raid, volume or partition
KERNEL=="md_d[0-9]*", PROGRAM="/bin/sh -c 'K=%k ; echo $${K#md_}'", NAME="md/%c", SYMLINK="%k"
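For clarity, the PROGRAM stanza above just strips the md_ prefix from the kernel name, so %c becomes e.g. d0 and the node ends up as /dev/md/d0. A quick sketch of that expansion (device names here are only examples):

```shell
# The udev rule's helper boils down to this POSIX parameter expansion:
K=md_d0
stripped=${K#md_}    # remove the shortest leading "md_" match
echo "$stripped"     # -> d0, so NAME="md/%c" gives /dev/md/d0

# Partitions of a partitionable array follow the same scheme:
K=md_d0p1
echo "${K#md_}"      # -> d0p1, i.e. /dev/md/d0p1
```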

Alternatively, when the rules create nodes in /dev instead of /dev/md (in this case, mdadm itself doesn't create any extra links):

# md block devices (legacy)

## regular non-partitionable raid
KERNEL=="md[0-9]*", NAME="%k", SYMLINK="md/%n"

## partitionable raid, volume or partition
KERNEL=="md_d[0-9]*", PROGRAM="/bin/sh -c 'K=%k ; echo $${K#md_}'", NAME="%k", SYMLINK="md/%c"

Also, one could write a slightly more complex rule that uses a helper to check mdadm.conf and, based on the nodes there, chooses the correct approach. OTOH, providing two files, i.e. 60-md-new.rules and 60-md-legacy.rules, is simpler. Let the user choose an approach and stick to it, IMO. Btw - why are the udev rules kept in one big file? The nn-something.rules scheme makes it easier for users to introduce their own changes, and imposes a proper order in which they are parsed.

The above rules work fine both during initramfs and rc.sysinit. They won't collide with custom naming if one is used in mdadm.conf, usually with 1.x superblocks - unless the names are deliberately chosen to cause conflicts.

Finally - initialization in rc.sysinit can be greatly simplified:

# If necessary, find md devices and manually assemble RAID arrays
if grep -q ^ARRAY /etc/mdadm.conf 2>/dev/null ; then
    status "Activating RAID arrays" /sbin/mdadm -As
fi
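As a standalone illustration of that guard (the file path here is hypothetical), the grep simply checks whether any arrays are defined before bothering mdadm:

```shell
# Sketch: only attempt assembly when mdadm.conf actually defines arrays.
conf=/tmp/mdadm.conf.demo            # hypothetical path, for illustration only
printf 'ARRAY /dev/md0 level=raid1 UUID=0:0:0:0\n' > "$conf"

action=""
if grep -q '^ARRAY' "$conf" 2>/dev/null ; then
    action="/sbin/mdadm -As"         # rc.sysinit would run this via status()
fi
echo "${action:-no arrays configured}"
```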

That's all that is required.
The previous comment 'udev won't create these md nodes, so we do it ourselves' wasn't really relevant, as that was mdadm's default job, unless explicitly blocked (which Arch didn't do).

#RAID2 hook for mkinitcpio

r2_am ()
{
    modprobe -q "${1}" >/dev/null 2>&1
    eval "${1}=1"
}

run_hook ()
{
    ## 1) discover which modules are actually needed
    fil=/etc/mdadm.conf
    if [ ! -f "${fil}" ] ; then
        # If the user didn't supply a proper mdadm.conf, create one.
        # Arrays in this case will be assembled as partitionable;
        # the user SHOULD provide a proper mdadm.conf for their system and
        # rebuild the initramfs afterwards.
        echo "CREATE mode=0660 owner=0 group=6 auto=part16 symlinks=yes" >"${fil}"
        mdadm -Es >>"${fil}"
    fi
    while read dat ; do
        # get the raid level per device
        lev="${dat##*level=}"
        lev="${lev%% metadata*}"
        # and load the required module
        case "${lev}" in
            raid0 ) [ -z "${raid0}" ] && r2_am raid0 ;;
            raid1 ) [ -z "${raid1}" ] && r2_am raid1 ;;
            raid4 ) [ -z "${raid456}" ] && r2_am raid456 ;;
            raid5 ) [ -z "${raid456}" ] && r2_am raid456 ;;
            raid6 ) [ -z "${raid456}" ] && r2_am raid456 ;;
            linear ) [ -z "${linear}" ] && r2_am linear ;;
            multipath ) [ -z "${multipath}" ] && r2_am multipath ;;
            faulty ) [ -z "${faulty}" ] && r2_am faulty ;;
        esac
    done <"${fil}"
    ## 2) modules ready, now assembly
    # Things to watch out for:
    # raid partitions require an access to the device to make linux actually
    # notice existing partitions. Without that, there will be no uevents
    # right after (dis)assembly, and lvm using sysfs will not check them
    # for lvm volumes (unless sysfs_scan = 0). The easiest way to do
    # that is just to run mdadm -As twice.
    # Neil knows about it, and according to him, current mdadm (2.6.4)
    # doesn't do the mentioned access in one case. It should get fixed in
    # one of the future versions. If the udev rules access the device later
    # while processing the rules, i.e. through the vol_id helper,
    # that would "self"-trigger the partition uevents.
    # Either way, it's safer to just do it explicitly.

    /sbin/mdadm -As >/dev/null 2>&1
    /sbin/mdadm -As >/dev/null 2>&1
}
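To make the hook's parsing step concrete, here is roughly what the on-the-fly config could look like, and what the level extraction yields. The file content below is a hypothetical mdadm -Es style example (the UUID is made up):

```shell
# Hypothetical mdadm.conf, as the hook would write it when none is supplied:
conf=/tmp/mdadm.conf.example
cat > "$conf" <<'EOF'
CREATE mode=0660 owner=0 group=6 auto=part16 symlinks=yes
ARRAY /dev/md/d0 level=raid1 metadata=0.90 UUID=00000000:00000000:00000000:00000000
EOF

# The same two expansions the hook uses to pull the level out of ARRAY lines
# (the CREATE line falls through, since it contains no "level=" token):
found=""
while read dat ; do
    lev="${dat##*level=}"       # drop everything up to and including "level="
    lev="${lev%% metadata*}"    # drop " metadata..." and everything after
    if [ "$lev" != "$dat" ] ; then
        found="$lev"
        echo "$lev"             # prints: raid1
    fi
done < "$conf"
```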

#RAID2 install for mkinitcpio

install ()
{
    MODULES=" $(checked_modules "drivers/md/*" | grep -v "dm-") "
    BINARIES="mdadm"
    FILES=""
    SCRIPT="raid2"
    add_dir "/dev/md"
    add_file "/etc/mdadm.conf"
}

help ()
{
cat<<HELPEOF
  This hook is responsible for raid initialisation. It supports both
  partitionable and non-partitionable arrays.

  A proper mdadm.conf SHOULD be included. If it's not, all raid devices
  will be detected at run time and assembled as partitionable devices.
  The user should rebuild the initramfs with a proper mdadm.conf.
HELPEOF
}


Regarding LVM - in its hook, there's no need to create mapper/control.

--- lvm2_hook 2008-01-05 14:21:14.000000000 +0100
+++ lvm2 2008-01-05 14:21:30.000000000 +0100
@@ -3,13 +3,11 @@
 {
     /sbin/modprobe -q dm-mod >/dev/null 2>&1
     if [ -e "/sys/class/misc/device-mapper" ]; then
-        read dev_t < /sys/class/misc/device-mapper/dev
-        /bin/mknod "/dev/mapper/control" c $(/bin/replace "${dev_t}" ':')
 
         [ "${quiet}" = "y" ] && LVMQUIET=">/dev/null"
 
         msg "Scanning logical volumes..."
-        eval /bin/lvm vgscan --ignorelockingfailure $LVMQUIET
+        eval /bin/lvm vgscan --mknodes --ignorelockingfailure $LVMQUIET
         msg "Activating logical volumes..."
         eval /bin/lvm vgchange --ignorelockingfailure -ay $LVMQUIET
     fi

Closed by  Tobias Powalowski (tpowa)
Thursday, 12 March 2009, 17:47 GMT
Reason for closing:  Implemented
Additional comments about closing:  2.6.8-2
Comment by Michal Soltys (msoltyspl) - Monday, 07 January 2008, 18:15 GMT
Hmm, looks like the code lost all its formatting, so I'm attaching a file with it.
   code (3.6 KiB)
Comment by Michal Soltys (msoltyspl) - Friday, 11 January 2008, 08:31 GMT
I've made one fix and some obvious simplifications. All the current code is in the attached file (hook, install, udev rules, diffs). Please use this one :)
   code2 (3.6 KiB)
Comment by Michal Soltys (msoltyspl) - Tuesday, 22 January 2008, 07:06 GMT
So, any comments about the idea in general and the example code ('code2') in my previous entry?
Comment by Aaron Griffin (phrakture) - Monday, 23 February 2009, 18:03 GMT
I wish I was more versed in RAID stuff, but I am not.

So a few questions: are there any cases where this could fail? RAID handling has always seemed very complex, leading me to believe that it requires a lot of stuff that can't be autodetected. This hook looks very simple, which is why I fear it :)

re: the lvm device mapper creation - are you sure it's not needed anymore? Maybe you have another hook that is creating it, and using LVM by itself (no RAID) may still need that code.
Comment by Michal Soltys (msoltyspl) - Tuesday, 03 March 2009, 23:23 GMT
Regarding lvm - manual creation has already been removed from the hooks as of the recent mkinitcpio version(s).

As for mdadm and device node creation - both mdadm and its simplified version, mdassemble.auto (but /not/ plain mdassemble), will create device nodes and potentially partitions. Generally it goes like this:

- unless explicitly blocked with the option auto=no, mdadm will create the device with the required name. If auto=mdpN / partN / pN (all 3 do the same), a partitionable raid will be assembled. If N is omitted, 4 device nodes for partitions will be created, otherwise N. As of recent kernels, every raid is partitionable (so both the mdN and md_dN naming schemes, or the equivalent md/N and md/dN), but mdadm will itself create partition nodes for the old partitionable raid naming scheme only (it will actually complain if you try e.g. /dev/md1 with auto=mdp4). Kind of a moot point these days, when udev takes care of creating nodes, but it can still matter for udevless initramfses.
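A hedged summary of the auto= variants described above, as they would appear in an mdadm.conf CREATE line (values are illustrative, not a recommendation):

```shell
# mdadm.conf CREATE-line examples (illustrative only):
# CREATE auto=yes     # create missing nodes, non-partitionable naming (mdN)
# CREATE auto=mdp4    # partitionable array with 4 partition nodes
# CREATE auto=part4   # same as mdp4
# CREATE auto=p4      # same as mdp4
# CREATE auto=part    # partitionable, default of 4 partition nodes (N omitted)
# CREATE auto=no      # never create nodes; assembly without a precreated node fails
```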

- if the required name makes sense from the kernel's perspective, node and device will be the same - otherwise /sys/block will have something like e.g. md_d127 if one requested something like /dev/mybackupraid (and md_d127p1 / mybackupraid1 etc. respectively).

- udev will do whatever, well, its rules tell it to do. As of the current version, it creates kernel-named device nodes and makes symlinks from /dev/md/<customname> to /dev/m..., where <customname> is the one recorded in the 1.x superblock (0.9x superblocks don't record any names). The latter doesn't work with the stock rules at the moment due to the wrong environment variable being checked (it should get fixed; I have a tiny patch ready).

- mdadm itself can create names in two ways - directly under /dev, or under /dev/md/ with legacy symlinks from /dev. The latter is the one I used in the above patches, but it ideologically conflicts with what the stock udev rules prefer. It's harmless though; still, the above patch would have to be adjusted to keep things clean and simple. mdadm -Es will always output the /dev/md/* scheme.

- if the main device name ends with a letter, partition numbers will come directly after it (e.g. /dev/myraid1, /dev/myraid2 ...), but if it ends with a number, there will be a 'p' in between (e.g. /dev/myraid1p1, /dev/myraid1p2 ...)
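That suffix rule can be captured in a tiny helper (this function is hypothetical, written just to illustrate the naming behaviour described above):

```shell
# Hypothetical helper mirroring the mdadm/kernel partition-naming rule:
# a 'p' separator is inserted only when the base name ends in a digit.
part_name () {
    case "$1" in
        *[0-9]) echo "${1}p${2}" ;;
        *)      echo "${1}${2}" ;;
    esac
}

part_name /dev/myraid 1     # -> /dev/myraid1
part_name /dev/myraid1 1    # -> /dev/myraid1p1
```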

Well, can it fail? Not really, although one can create a pathological case where names can be confused (for example - create /dev/md0 with a 1.x superblock and change the name in the superblock from 0 to 1, then create another raid array as /dev/md1, leave the default name and change it to 0. Udev would then create "x-winged" symlinks - /dev/md/1 -> /dev/md0 and /dev/md/0 -> /dev/md1 - now have root on, let's say, /dev/md0p1 and boot with root=/dev/md/0p1 :) ). Or one can accidentally leave auto=no in the CREATE line of the mdadm.conf pulled into the initramfs - then any raid assembly without an explicitly precreated main device node will fail. I could probably think of other creative ways to make it more or less confused.

If the udev rules don't touch the activated raid array in any way (the stock ones do, of course, for the sake of symlinks, especially /dev/disk/...), then you still need mdadm -As followed by some access to the device to trigger uevents for the partitions (that's why I simply run mdadm -As twice).

Other conflicts that can follow (usually harmless as well) are different views on mode/group/owner from the udev and mdadm perspectives (the latter controls that through the CREATE line in mdadm.conf).

What else is there, hmmm - mdadm array names and their device nodes will "stack" in sysfs - they are not removed when an array is stopped. They will be gone if one can rmmod the appropriate modules. This will possibly change when mdadm 3.x is introduced.
Comment by Tobias Powalowski (tpowa) - Sunday, 08 March 2009, 18:24 GMT
Could you give mdadm from testing a shot with mdadm.hook? I haven't tested raid partitions at the moment, so be careful.
