Speeding up RAID resync

http://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html

Current RAID settings:
--------------------------------------------------------------------S
root# sysctl -a | grep raid
dev.raid.speed_limit_max = 200000
dev.raid.speed_limit_min = 1000
--------------------------------------------------------------------E

/proc/sys/dev/raid/speed_limit_max is the "goal" rebuild speed for times
when no non-rebuild activity is happening on an array.  My default is
200000.

/proc/sys/dev/raid/speed_limit_min is the "goal" rebuild speed for times
when non-rebuild activity IS happening on an array.  The speed is in
Kbytes per second, and is a per-device rate, not a per-array rate.  My
default is 1000 -- set it to 50000:
--------------------------------------------------------------------S
root# sysctl -w dev.raid.speed_limit_min=50000
--------------------------------------------------------------------E

To make the override permanent, add this to /etc/sysctl.conf:
--------------------------------------------------------------------S
dev.raid.speed_limit_min = 50000
--------------------------------------------------------------------E

Someone with a similar drive setup (5x 2TB in a RAID5) tweaks the drives
and volume like so:
--------------------------------------------------------------------S
#!/bin/bash

# Note: this names seven devices, though only sda-sde are tuned below.
blockdev --setra 16384 /dev/sd[abcdefg]

echo 1024 > /sys/block/sda/queue/read_ahead_kb
echo 1024 > /sys/block/sdb/queue/read_ahead_kb
echo 1024 > /sys/block/sdc/queue/read_ahead_kb
echo 1024 > /sys/block/sdd/queue/read_ahead_kb
echo 1024 > /sys/block/sde/queue/read_ahead_kb

echo 256 > /sys/block/sda/queue/nr_requests
echo 256 > /sys/block/sdb/queue/nr_requests
echo 256 > /sys/block/sdc/queue/nr_requests
echo 256 > /sys/block/sdd/queue/nr_requests
echo 256 > /sys/block/sde/queue/nr_requests

# Set read-ahead.
echo "Setting read-ahead to 64 MiB for /dev/md0"
blockdev --setra 65536 /dev/md0

# Set stripe_cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md0"
echo 16384 > /sys/block/md0/md/stripe_cache_size
# stripe_cache_active is a read-only counter on most kernels, so this
# write probably fails:
echo 8192 > /sys/block/md0/md/stripe_cache_active

# Disable NCQ on all disks.
echo "Disabling NCQ on all disks..."
echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth
echo 1 > /sys/block/sdc/device/queue_depth
echo 1 > /sys/block/sdd/device/queue_depth
echo 1 > /sys/block/sde/device/queue_depth

exit 0
--------------------------------------------------------------------E

Ran this version for now:
--------------------------------------------------------------------S
#!/bin/sh
# raise some limits on RAID drives
export PATH=/bin:/sbin:/usr/sbin:/usr/bin

# contradicts read_ahead_kb below: --setra counts 512-byte sectors, so
# this asks for 8192 KB, and the 1024 KB written in the loop wins.
blockdev --setra 16384 /dev/sd[cdefghi]

for dsk in sdc sdd sde sdf sdg sdh sdi
do
    echo 256  > /sys/block/$dsk/queue/nr_requests
    echo 1024 > /sys/block/$dsk/queue/read_ahead_kb
done
exit 0
--------------------------------------------------------------------E
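To see which read-ahead value actually stuck (and check the other queue
settings), read them back after the script runs.  A minimal sketch:
--------------------------------------------------------------------S
#!/bin/sh
# Sketch: report the queue settings after the tuning script has run.
# blockdev --getra reports 512-byte sectors; read_ahead_kb is Kbytes.
for dsk in sdc sdd sde sdf sdg sdh sdi
do
    echo "$dsk: getra=$(blockdev --getra /dev/$dsk)" \
         "read_ahead_kb=$(cat /sys/block/$dsk/queue/read_ahead_kb)" \
         "nr_requests=$(cat /sys/block/$dsk/queue/nr_requests)"
done
exit 0
--------------------------------------------------------------------E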
Set up partitions
--------------------------------------------------------------------S
root# cat bigfs
#!/bin/ksh -x
# make filesystems with fewer inodes, larger files.
export PATH=/sbin:/bin:/usr/bin
date; mkfs.ext3 /dev/sdj1
date; mkswap -L SWAP-sdj2 /dev/sdj2
date; mkfs.ext3 -J size=400 -i 65536 -m 2 /dev/sdj3; date
exit 0

root# ./bigfs
+ PATH=/sbin:/bin:/usr/bin
+ export PATH
+ date
Wed Apr 13 23:57:13 EDT 2011
+ mkfs.ext3 /dev/sdj1
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
2375680 inodes, 4751215 blocks
237560 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
145 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 32 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
+ date
Wed Apr 13 23:59:32 EDT 2011
+ mkswap -L SWAP-sdj2 /dev/sdj2
Setting up swapspace version 1, size = 1028153 kB
LABEL=SWAP-sdj2, no uuid
+ date
Wed Apr 13 23:59:32 EDT 2011
+ mkfs.ext3 -J size=400 -i 65536 -m 2 /dev/sdj3
mke2fs 1.39 (29-May-2006)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
22577152 inodes, 361205460 blocks
7224109 blocks (2.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
11024 block groups
32768 blocks per group, 32768 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968, 102400000, 214990848

Writing inode tables: done
Creating journal (102400 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 30 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
+ date
Thu Apr 14 00:10:27 EDT 2011
+ exit 0
--------------------------------------------------------------------E

Duplicate partition layout over all RAID drives (the sfdisk dump was
also saved to a file named sdj, which the sed loop reads):
--------------------------------------------------------------------S
root# sfdisk -d /dev/sdj
# partition table of /dev/sdj
unit: sectors

/dev/sdj1 : start=       63, size=  38009727, Id=83, bootable
/dev/sdj2 : start= 38009790, size=   2008125, Id=82
/dev/sdj3 : start= 40017915, size=2889643680, Id=83
/dev/sdj4 : start=        0, size=         0, Id= 0

root# foreach x (c d e f g h i)
foreach> sed -e "s/sdj/sd$x/" sdj > sd$x
foreach> end

root# cat sdc sde
# partition table of /dev/sdc
unit: sectors

/dev/sdc1 : start=       63, size=  38009727, Id=83, bootable
/dev/sdc2 : start= 38009790, size=   2008125, Id=82
/dev/sdc3 : start= 40017915, size=2889643680, Id=83
/dev/sdc4 : start=        0, size=         0, Id= 0
# partition table of /dev/sde
unit: sectors

/dev/sde1 : start=       63, size=  38009727, Id=83, bootable
/dev/sde2 : start= 38009790, size=   2008125, Id=82
/dev/sde3 : start= 40017915, size=2889643680, Id=83
/dev/sde4 : start=        0, size=         0, Id= 0
--------------------------------------------------------------------E

Resulting partitions look like this (1M blocks; each filesystem was
mounted on /mnt in turn):
--------------------------------------------------------------------S
Filesystem     1M-blocks  Used  Available  Use%  Mounted on
/dev/sdj1          18269   173      17169    1%  /mnt
/dev/sdj3        1408111   470    1379422    1%  /mnt
--------------------------------------------------------------------E

Used the new partition tables to set up all remaining RAID drives.
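In retrospect the intermediate files aren't needed: sfdisk -d can be
piped straight back into sfdisk.  A sketch of the equivalent loop (it
rewrites partition tables, so double-check the device list first):
--------------------------------------------------------------------S
#!/bin/sh
# Sketch: clone sdj's partition table onto the other RAID drives,
# rewriting the device name in the dump on the fly.
for x in c d e f g h i
do
    sfdisk -d /dev/sdj | sed "s/sdj/sd$x/" | sfdisk /dev/sd$x
done
exit 0
--------------------------------------------------------------------E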
Left the warning "If you created or changed a DOS partition..." out of
all but the first drive's output:
--------------------------------------------------------------------S
root# foreach x (c d e f g h i)
foreach> sfdisk /dev/sd$x < sd$x
foreach> end
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdc: 182363 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End    #cyls     #blocks  Id  System
/dev/sdc1   *      0+     16      17-      136521  83  Linux
/dev/sdc2         17    2507    2491    20008957+  83  Linux
/dev/sdc3       2508    2632     125     1004062+  82  Linux swap / Solaris
/dev/sdc4       2633  182362  179730  1443681225    5  Extended
/dev/sdc5       2633+   3629     997-    8008371   83  Linux
/dev/sdc6       3630+ 182362  178733- 1435672791   fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start        End    #sectors  Id  System
/dev/sdc1   *        63   38009789    38009727  83  Linux
/dev/sdc2      38009790   40017914     2008125  82  Linux swap / Solaris
/dev/sdc3      40017915 2929661594  2889643680  83  Linux
/dev/sdc4             0          -           0   0  Empty
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

[ The output for sdd through sdi was identical apart from the device
  names. ]
--------------------------------------------------------------------E
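A quick sanity check that the clones really match (a sketch; it compares
each drive's dump against sdj's with the device names normalized):
--------------------------------------------------------------------S
#!/bin/sh
# Sketch: confirm each cloned table matches the master, modulo names.
sfdisk -d /dev/sdj | sed 's/sdj/sdX/g' > /tmp/master
for x in c d e f g h i
do
    if sfdisk -d /dev/sd$x | sed "s/sd$x/sdX/g" | cmp -s - /tmp/master
    then echo "sd$x: matches"
    else echo "sd$x: DIFFERS"
    fi
done
exit 0
--------------------------------------------------------------------E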
The kernel log for sdc looks like this; the others are similar:
--------------------------------------------------------------------S
Apr 14 16:01:33 bk002 kernel: SCSI device sdc: 2929666048 512-byte hdwr sectors (1499989 MB)
Apr 14 16:01:33 bk002 kernel: sdc: Write Protect is off
Apr 14 16:01:33 bk002 kernel: sdc: Mode Sense: 23 00 00 00
Apr 14 16:01:33 bk002 kernel: SCSI device sdc: drive cache: none
Apr 14 16:01:33 bk002 kernel: sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 >
Apr 14 16:01:36 bk002 kernel: SCSI device sdc: 2929666048 512-byte hdwr sectors (1499989 MB)
Apr 14 16:01:36 bk002 kernel: sdc: Write Protect is off
Apr 14 16:01:36 bk002 kernel: sdc: Mode Sense: 23 00 00 00
Apr 14 16:01:36 bk002 kernel: SCSI device sdc: drive cache: none
Apr 14 16:01:36 bk002 kernel: sdc: sdc1 sdc2 sdc3
--------------------------------------------------------------------E

Simple performance test: 1 GB of direct I/O with large and small blocks.
--------------------------------------------------------------------S
me% dd if=/dev/zero of=test.dat bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 94.4423 s, 11.4 MB/s

me% dd if=test.dat of=/dev/null bs=1M iflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.68675 s, 189 MB/s

me% dd if=/dev/zero of=test.dat bs=4K count=256K oflag=direct
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 2301.75 s, 466 kB/s

me% dd if=test.dat of=/dev/null bs=4K iflag=direct
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 35.7054 s, 30.1 MB/s
--------------------------------------------------------------------E

Compare to a plain partition:
--------------------------------------------------------------------S
me% dd if=/dev/zero of=test.dat bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 173.661 s, 6.2 MB/s

me% dd if=test.dat of=/dev/null bs=1M iflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 9.43734 s, 114 MB/s

me% dd if=/dev/zero of=test.dat bs=4K count=256K oflag=direct
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 2209.6 s, 486 kB/s

me% dd if=test.dat of=/dev/null bs=4K iflag=direct
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 33.5308 s, 32.0 MB/s
--------------------------------------------------------------------E

The kernlog directory holds the kernel-log output captured during testing.
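Note that oflag=direct and iflag=direct bypass the page cache, so every
4K request goes to the array synchronously -- that's why the small-block
write numbers are so dismal.  For a buffered comparison, something like
this sketch (conv=fsync makes dd flush before reporting, so the write
timing stays honest; dropping caches before the read needs root):
--------------------------------------------------------------------S
me% dd if=/dev/zero of=test.dat bs=4K count=256K conv=fsync
root# sync; echo 3 > /proc/sys/vm/drop_caches
me% dd if=test.dat of=/dev/null bs=4K
--------------------------------------------------------------------E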
Apparently I'm tickling a bug:
--------------------------------------------------------------------S
kernel: md: bug in file drivers/md/md.c, line 1659
--------------------------------------------------------------------E
According to similar reports that Google turns up, a counter reaches
zero and confuses the rest of the system.  It's not really a bug, but it
still might be time to try a later version of mdadm.  Then try creating
a filesystem using stride, stripe-width, and a larger journal.

Test script
--------------------------------------------------------------------S
#!/bin/ksh
# time tests with small and large blocks
export PATH=/usr/local/bin:/bin:/usr/bin:/sbin

runs='
bs=1M   count=1K
bs=256K count=4K
bs=64K  count=16K
bs=16K  count=64K
bs=4K   count=256K
'
echo "$runs" |
while read bs count
do
    test -z "$bs" && continue
    echo $bs $count
    dd if=/dev/zero of=test.dat $bs $count oflag=direct
    dd if=test.dat of=/dev/null $bs iflag=direct
    echo
done
exit 0
--------------------------------------------------------------------E

Try a smaller stripesize and a higher commit time:
--------------------------------------------------------------------S
root# lvcreate --size 4G --name lvt3 --stripesize 128 --stripes 4 vg1
  Logical volume "lvt3" created
root# mkfs.ext3 -v -i 65536 /dev/vg1/lvt3
root# mount -o commit=60 /dev/vg1/lvt3 /lv3
--------------------------------------------------------------------E

Results:
--------------------------------------------------------------------S
bs=1M count=1K
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 50.4096 s, 21.3 MB/s    [ 0.8 min ]
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.47079 s, 240 MB/s

bs=256K count=4K
4096+0 records in
4096+0 records out
1073741824 bytes (1.1 GB) copied, 84.8414 s, 12.7 MB/s    [ 1.4 min ]
4096+0 records in
4096+0 records out
1073741824 bytes (1.1 GB) copied, 5.69598 s, 189 MB/s

bs=64K count=16K
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 157.154 s, 6.8 MB/s    [ 2.6 min ]
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 8.3078 s, 129 MB/s

bs=16K count=64K
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB) copied, 807.51 s, 1.3 MB/s    [ 13.5 min ]
65536+0 records in
65536+0 records out
1073741824 bytes (1.1 GB) copied, 15.2816 s, 70.3 MB/s
--------------------------------------------------------------------E

OK, we have a good stripesize: 64, the default used when lv2 was created.

Next, test copying over the network to any drive.  Then create two
physical volumes instead of one: md1-2 for the first, md3-4 for the
second (see the sketch below).  This has two advantages:

* A drive failure in md1-2 does NOT affect the other physical volume, so
  I could do something like move logical volumes elsewhere until the
  drive was repaired.

* The two sets of 4 drives can be treated as separate devices, so I
  should be able to run two sets of network copies at the same time, as
  long as I aim them at different volumes.
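A sketch of that split (the text above says "physical volumes", but the
isolation described maps to separate volume groups; the array and VG
names here are made up):
--------------------------------------------------------------------S
#!/bin/sh
# Sketch only: two independent volume groups, each backed by md arrays
# built on its own set of 4 drives, so a drive failure in one set
# cannot touch the logical volumes in the other group.
pvcreate /dev/md1 /dev/md2
pvcreate /dev/md3 /dev/md4
vgcreate vgfirst  /dev/md1 /dev/md2
vgcreate vgsecond /dev/md3 /dev/md4
--------------------------------------------------------------------E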