### Better RAIDs

Regardless of what the hard disk vendors claim to be the mean time before failure, based on roughly twenty of my most recent disks from several different vendors I'ld claim a more realistic half-life of a normal off-the-shelf PC hard disks is at most four years, or approximately 35000 hours. Assuming there's one million hard disks in Finland, on average one hard disk dies while I engulf a fluffy Big Mac. Sorry, I'll try to cut down on junk food.

I personally replicate my essential data with unison on three hard disks on separate locations, but this occasionally leads to annoying conflicts and even data losses when I forget to promptly synchronize the replacas. And for bulk data, such as scanned images, digital camera pictures and videos, triplicating data becomes unnecessarily expensive.

Engineers have found smarter ways to improve the reliability of disk systems by organizing them into redundant arrays of independent disks, RAIDs. A typical RAID-4 might consist of four disks which contain the data and a fifth disk containing the parity (bitwise exclusive-or) of the other disks. If any of the disks crash, it can be replaced and its data can be reconstructed from the other disks. In RAID-5 the role of the parity disk is distributed among all the disks in order to avoid performance bottlenecks in updating the parity information.

According to theory, when one of the disks die, you hot replace it from a reserve of pristine disks, and maybe in an hour you've reached the original reliability. So, if individual disks would be replaced by RAIDs of four data and one parity disk, the MTBF would be increased from 35 thousand to over 60 million hours. Or in other words, the number of permanent data losses in Finland would be reduced to below one every two days. Sound comforting, but in practice any less than a professional sysadmin might need days or until the next paycheck to buy a new disk, and the MTBF reduces in proportion to how much longer than one hour the replacement drags.

Suppose we had more than five disks, could we do better? That's the question I felt compelled to study (motivated also by a possible future hobby project). A toy example demonstrates a positive answer: ten disks set up as two independent RAID-5s can recover only from those two-disk crashes where the two crashed disks reside in different sub-RAIDs. There's only a 50% chance of that. But if we use one of the data disks in

Guaranteeing recovery from two simultaneous disk crashes, in other words reaching RAID level 6, is actually harder than one might think. Given one data disk, simple mirroring on two additional disks would do. Up to four data disks, three parity disks is sufficient. The parities could, for example, be computed like this:

Four parity disks is sufficient to recover any two-disk crash in a system of eleven data disks.

Five parity disks is sufficient to reover any two-disk crash in a system of twenty disks.

For those who need truly high reliability, nested RAIDs are the way to go. Here one builds a RAID-5 of other RAID-5 elements instead of individual disks. For example, the twenty data disks are placed in four RAID-5s, and a fifth RAID-5 (of five data and one parity disk) is allocated to serve as a parity for those sub-RAIDs. Yes, it is reliable: it can handle all three-disk crashes and 99.45% of all four disk crashes. This happens at the expense of disk space: for twenty disks containing user data such a nested RAID consumes ten disks for redundancy.

But also this nested RAID can be improved. The set of parity equations below almost doubles the reliability, in particular now 99.72% of four-disk crashes can be recovered.

To generalize, scientists have developed numerous other kinds of Forward Error Correcting codes. The Reed-Solomon is probably the most widely known with uses ranging from CDs and xDSL-lines to communicating with space probes. Reed-Solomon becomes computationally expensive for a single large disk block, but we could employ it "horizontally" so that the i'th s-bits of each disk block form a message which is then appended with Reed-Solomon parity information. Because all errors will be erasures in our case, we can expect to recover from as many disk crashes as we have redundant disks. Despite this advantage, recomputation costs of an incremental update, overhead caused by shortening or rounding s upwards, and perhaps also complexity has kept Reed-Solomon codes out of RAIDs. The only work I know of is this.

What I've done in this blog entry is essentially equivalent to the basic ideas of LDPC-codes invented by Gallager in the early sixties, but with the exception that I employ a genetic algorithm to find the excellent encodings. Current research focus seems to be in developing codes for media delivery where the number of blocks can be tens or hundreds of thousands, and consequently the choice of the equations can be random. Unfortunately the domain is patent-ridden, but the LDGM approach seems to be free.

Oh, my hobby project? I won't reveal that yet as I haven't really decided whether I'll pursue it. But suffice it to say, that if all Finns would share 1% of their disk space for parity information with each other and recovering a disk would take one hour, we would have to wait over 10^750000 years for the next loss of data.

(Unfortunately all the assumptions in that calculation can't be taken into practice without significantly reducing the glamour of the result.)

I personally replicate my essential data with unison on three hard disks on separate locations, but this occasionally leads to annoying conflicts and even data losses when I forget to promptly synchronize the replacas. And for bulk data, such as scanned images, digital camera pictures and videos, triplicating data becomes unnecessarily expensive.

Engineers have found smarter ways to improve the reliability of disk systems by organizing them into redundant arrays of independent disks, RAIDs. A typical RAID-4 might consist of four disks which contain the data and a fifth disk containing the parity (bitwise exclusive-or) of the other disks. If any of the disks crash, it can be replaced and its data can be reconstructed from the other disks. In RAID-5 the role of the parity disk is distributed among all the disks in order to avoid performance bottlenecks in updating the parity information.

According to theory, when one of the disks die, you hot replace it from a reserve of pristine disks, and maybe in an hour you've reached the original reliability. So, if individual disks would be replaced by RAIDs of four data and one parity disk, the MTBF would be increased from 35 thousand to over 60 million hours. Or in other words, the number of permanent data losses in Finland would be reduced to below one every two days. Sound comforting, but in practice any less than a professional sysadmin might need days or until the next paycheck to buy a new disk, and the MTBF reduces in proportion to how much longer than one hour the replacement drags.

Suppose we had more than five disks, could we do better? That's the question I felt compelled to study (motivated also by a possible future hobby project). A toy example demonstrates a positive answer: ten disks set up as two independent RAID-5s can recover only from those two-disk crashes where the two crashed disks reside in different sub-RAIDs. There's only a 50% chance of that. But if we use one of the data disks in

*both*parity disks, then we reach a 73.3% chance of recovering from a two-disk crash.Guaranteeing recovery from two simultaneous disk crashes, in other words reaching RAID level 6, is actually harder than one might think. Given one data disk, simple mirroring on two additional disks would do. Up to four data disks, three parity disks is sufficient. The parities could, for example, be computed like this:

x0 = d0 ^ d1 ^ d3The equations are laid out so that no two disks occur in edgemost expressions in all equations. I prefer this formulation because it also states the recovery algorithm by repeated solving for only one unknown and substitution to other equations.

x1 = d0 ^ d2 ^ d3 = d1 ^ d2 ^ x0

x2 = d0 ^ d1 ^ d2 = d1 ^ d3 ^ x1

Four parity disks is sufficient to recover any two-disk crash in a system of eleven data disks.

x0 = d0 ^ d1 ^ d2 ^ d4 ^ d6 ^ d7 ^ d8This scheme can also recover a three-disk crash with a 84.84% probability and a four-disk crash with a 39.78% probability.

x1 = d0 ^ d4 ^ d5 ^ d8 ^ d9 ^ d10 ^ x0

x2 = d0 ^ d2 ^ d3 ^ d6 ^ d9 ^ x0 ^ x1

x3 = d0 ^ d1 ^ d3 ^ d5 ^ d6 ^ d8 ^ d9

Five parity disks is sufficient to reover any two-disk crash in a system of twenty disks.

x0 = d0 ^ d1 ^ d5 ^ d6 ^ d8 ^ d10 ^ d12 ^ d13 ^ d18 ^ d19A three-disk crash is recovered with a 93.22% and a four-disk crash with a 66.57% probability. Notice that this scheme reaches the same storage overhead as the typical four data and one parity disk RAID-5, but with a hugely improved reliability - less than one irrecoverable crash every five years in one million such installments - of course making the lofty assumptions the disk losses are independent and not caused by an EMP or a virus.

x1 = d1 ^ d2 ^ d3 ^ d6 ^ d7 ^ d11 ^ d12 ^ d13 ^ d14 ^ d16

x2 = d0 ^ d2 ^ d4 ^ d6 ^ d8 ^ d9 ^ d13 ^ d14 ^ d15 ^ d19 ^ x1

x3 = d5 ^ d7 ^ d9 ^ d12 ^ d14 ^ d16 ^ d17 ^ d18 ^ d19 ^ x2

x4 = d0 ^ d1 ^ d2 ^ d3 ^ d4 ^ d5 ^ d6 ^ d7 ^ d10 ^ d12 ^ d19 ^ x2 ^ x3

For those who need truly high reliability, nested RAIDs are the way to go. Here one builds a RAID-5 of other RAID-5 elements instead of individual disks. For example, the twenty data disks are placed in four RAID-5s, and a fifth RAID-5 (of five data and one parity disk) is allocated to serve as a parity for those sub-RAIDs. Yes, it is reliable: it can handle all three-disk crashes and 99.45% of all four disk crashes. This happens at the expense of disk space: for twenty disks containing user data such a nested RAID consumes ten disks for redundancy.

But also this nested RAID can be improved. The set of parity equations below almost doubles the reliability, in particular now 99.72% of four-disk crashes can be recovered.

x0 = d0 ^ d1 ^ d3 ^ d6 ^ d11 ^ d15 ^ d18 ^ d19Alternatively, one could add three data disks and still have a higher reliability than with the smaller nested RAID scheme:

x1 = d7 ^ d8 ^ d9 ^ d13 ^ d18 ^ x0

x2 = d2 ^ d6 ^ d9 ^ d11 ^ d14 ^ d15 ^ d16 ^ d17

x3 = d0 ^ d3 ^ d4 ^ d7 ^ d12 ^ d15 ^ d17 ^ d18

x4 = d1 ^ d3 ^ d8 ^ d10 ^ d15 ^ d16 ^ d17 ^ d18 ^ x1

x5 = d0 ^ d5 ^ d6 ^ d12 ^ d13 ^ d14 ^ d15 ^ d16 ^ d19 ^ x0

x6 = d6 ^ d7 ^ d9 ^ d10 ^ d15 ^ x2

x7 = d3 ^ d4 ^ d5 ^ d8 ^ d9 ^ d18 ^ x6

x8 = d0 ^ d3 ^ d4 ^ d7 ^ d13 ^ d16 ^ d17 ^ x2 ^ x5

x9 = d4 ^ d6 ^ d7 ^ d12 ^ d18 ^ d19 ^ x5 ^ x7

x0 = d2 ^ d4 ^ d5 ^ d9 ^ d10 ^ d16 ^ d18 ^ d20 ^ d21 ^ d22

x1 = d0 ^ d1 ^ d2 ^ d3 ^ d5 ^ d6 ^ d9 ^ d10 ^ d13 ^ d17 ^ d20

x2 = d1 ^ d3 ^ d5 ^ d7 ^ d8 ^ d15 ^ d17 ^ x0

x3 = d0 ^ d10 ^ d11 ^ d13 ^ d15 ^ d16 ^ d17 ^ d19 ^ x0

x4 = d0 ^ d2 ^ d5 ^ d8 ^ d11 ^ d12 ^ d16 ^ d21 ^ d22

x5 = d0 ^ d1 ^ d2 ^ d6 ^ d7 ^ d8 ^ d12 ^ d14 ^ d18 ^ d19 ^ d21 ^ x0

x6 = d4 ^ d5 ^ d6 ^ d7 ^ d12 ^ d17 ^ d20 ^ x2 ^ x5

x7 = d0 ^ d3 ^ d4 ^ d5 ^ d6 ^ d8 ^ d9 ^ d11 ^ d13 ^ d18 ^ d20 ^ d22 ^ x6

x8 = d3 ^ d8 ^ d9 ^ d14 ^ d16 ^ d18 ^ d20 ^ d22 ^ x0 ^ x4 ^ x6

x9 = d0 ^ d1 ^ d16 ^ d17 ^ d20 ^ d21 ^ x5

To generalize, scientists have developed numerous other kinds of Forward Error Correcting codes. The Reed-Solomon is probably the most widely known with uses ranging from CDs and xDSL-lines to communicating with space probes. Reed-Solomon becomes computationally expensive for a single large disk block, but we could employ it "horizontally" so that the i'th s-bits of each disk block form a message which is then appended with Reed-Solomon parity information. Because all errors will be erasures in our case, we can expect to recover from as many disk crashes as we have redundant disks. Despite this advantage, recomputation costs of an incremental update, overhead caused by shortening or rounding s upwards, and perhaps also complexity has kept Reed-Solomon codes out of RAIDs. The only work I know of is this.

What I've done in this blog entry is essentially equivalent to the basic ideas of LDPC-codes invented by Gallager in the early sixties, but with the exception that I employ a genetic algorithm to find the excellent encodings. Current research focus seems to be in developing codes for media delivery where the number of blocks can be tens or hundreds of thousands, and consequently the choice of the equations can be random. Unfortunately the domain is patent-ridden, but the LDGM approach seems to be free.

Oh, my hobby project? I won't reveal that yet as I haven't really decided whether I'll pursue it. But suffice it to say, that if all Finns would share 1% of their disk space for parity information with each other and recovering a disk would take one hour, we would have to wait over 10^750000 years for the next loss of data.

(Unfortunately all the assumptions in that calculation can't be taken into practice without significantly reducing the glamour of the result.)

## 2 Comments:

Traditionally Reed-Solomon codes are calculated using Vandermonde matrices, but this is inefficient to implement in software. Reasonably recently proposed alternative is to transform problem to Cauchy matrix form. Some solutions are more computationally efficient than others - but one can't achieve considerable optimizations without concentrating on specific parts of the algorithm, like only encoding on specific M,N pair.

Cauchy solutions achieve better encoding and decoding bandwidth, but other problems are unaffected: particularly, tradeoff between per-site disk usage vs. global bandwidth usage. To some extent this could be avoided by using multicast, but multicast isn't really usable for peer-to-peer applications in its current form (if it works at all).

Some links related to Cauchy-based Reed-Solomon codes:

http://www.icsi.berkeley.edu/~luby/

http://www.cs.utk.edu/~plank/plank/papers/CS-05-569.html

http://www.cs.utk.edu/~plank/plank/papers/CS-05-570.html

In conjunction with RAIDs the messages are so small that computational efficiency certainly won't be an issue of Reed-Solomon codes. With s=4 you could already implement a 15-disk RAID, and the net is abundant with efficient s=8 implementations.

But as you too point out, the bigger issue is in the cost of updating a block containing old data O and new data N. With XOR-based schemes you simply XOR the value O XOR N with the existing parity blocks (one in a RAID-5, three in a nested RAID), but most specifically you don't have to read in the data of other data blocks which affect the parity block(s). Quite often you also have the old data O in your buffer cache, so the only read activity is in reading in the old parity block. I've also seen special-purpose disks with two heads per platter that can perform a XOR at normal write speeds and without sending the block to the host computer, but I don't know to what extent they are used today. AFAIK, with RS-codes: you will have to recompute the parity blocks by first reading in all other blocks that the parity (well, "parity") checks before you can compute the new parity. Should there be any ideas to alleviate this issue, I'ld be more than happy to know.

Avoiding stepping on patents is also an issue should I embark on that hobby project.

Post a Comment

<< Home