A Complete Guide to FreeNAS Hardware Design, Part III: Pools, Performance, and Cache

Written by Joshua Paetzel.

ZFS Pool Configuration

ZFS storage pools are composed of vdevs which are striped together. A vdev can be a single disk, an N-way mirror, a RAIDZ (similar to RAID 5), a RAIDZ2 (similar to RAID 6), or a RAIDZ3 (a triple-parity stripe with no common hardware RAID analog). The key thing to know here is that a ZFS vdev delivers roughly the IOPS of a single device in that vdev. That means if you create a RAIDZ2 of ten drives, it will have the capacity of eight drives but the IOPS performance of a single drive. IOPS become important when providing storage to things like database servers or virtualization platforms, which rarely issue sequential transfers. In these scenarios, you'll find larger numbers of mirrors or very small RAIDZ groups are appropriate choices. At the other end of the scale, a single user doing sequential reads or writes will benefit from a larger RAIDZ[1|2|3] vdev. Many home media server applications do quite well with a pool comprising a single 3-8 drive RAIDZ[1|2|3] vdev.
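To make the capacity and IOPS tradeoff concrete, here is a rough back-of-the-envelope sketch in Python. It only applies the two rules above (usable space scales with data disks, random IOPS scale with the number of vdevs); the per-drive size and IOPS figures are illustrative assumptions, and real pools are further affected by compression, metadata, and reserved space.

```python
# Back-of-the-envelope comparison of vdev layouts (raw numbers only;
# ignores compression, metadata, and reserved slop space).

def pool_estimate(vdev_count, drives_per_vdev, parity, drive_tb, drive_iops):
    """Rough capacity and random-IOPS estimate for a pool of identical vdevs.

    parity: 1-3 for RAIDZ1/2/3; for an N-way mirror, count the redundant
    copies here so the capacity math works out the same way.
    Each vdev delivers roughly the IOPS of a single member drive.
    """
    data_drives = drives_per_vdev - parity
    capacity_tb = vdev_count * data_drives * drive_tb
    random_iops = vdev_count * drive_iops   # one drive's worth per vdev
    return capacity_tb, random_iops

# Example: ten 4TB drives at ~150 IOPS each, arranged two ways.
print(pool_estimate(vdev_count=1, drives_per_vdev=10, parity=2,
                    drive_tb=4, drive_iops=150))   # (32, 150)  one RAIDZ2
print(pool_estimate(vdev_count=5, drives_per_vdev=2, parity=1,
                    drive_tb=4, drive_iops=150))   # (20, 750)  five mirrors
```

The same ten drives give far more random IOPS as five mirrors, at the cost of a third of the usable space.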

RAIDZ1 gets a special note here. When a RAIDZ1 loses a drive, every other drive in the vdev becomes a single point of failure, and a ZFS storage pool will not operate if any vdev fails. This means that if you have a pool made up of a single 10-drive RAIDZ1 vdev and one drive fails, continued pool operation depends on none of the remaining 9 drives failing. In addition, with modern drives being as large as they are, rebuild times are not trivial, and during the rebuild all of the drives are doing increased I/O. This additional stress can cause further drives in the array to fail. Since a degraded RAIDZ1 can withstand no additional failures, you are very close to "game over" there.

Powers-of-two pool configuration: there is much wisdom out there on the internet about the value of sizing ZFS vdevs around a power of two. This made some sense when building ZFS pools that did not utilize compression. Since FreeNAS utilizes compression by default (and there are zero cases where it makes sense to change that default!), any attempt to optimize performance through vdev width is foiled by the compressor. Pick your vdev configuration based on the IOPS needed, space required, and desired resilience. In most cases, your performance will be limited by your networking anyway.
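As a rough illustration of why a degraded RAIDZ1 is uncomfortable, the sketch below estimates the chance that one of the remaining drives fails before the resilver finishes. The resilver time and annual failure rate are assumptions, not measurements, and the model ignores unrecoverable read errors during the resilver, which in practice are often the bigger threat to a degraded RAIDZ1.

```python
# Rough odds that a degraded RAIDZ1 loses a second drive during resilver.
# The AFR and resilver-time figures below are illustrative assumptions.

def second_failure_probability(remaining_drives, resilver_hours, annual_failure_rate):
    """P(at least one remaining drive fails during the resilver window),
    assuming independent failures spread evenly across the year."""
    p_single = annual_failure_rate * resilver_hours / (365 * 24)
    return 1 - (1 - p_single) ** remaining_drives

# 10-drive RAIDZ1 with one drive already gone, 24-hour resilver, 5% AFR
# (drives under rebuild stress often do worse than their rated AFR).
print(f"{second_failure_probability(9, 24, 0.05):.2%}")  # ~0.1% with these assumptions
```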

ZIL Devices

ZFS can use dedicated devices for its ZIL (ZFS intent log). This is essentially the write cache for synchronous writes. Some workflows generate very little traffic that would benefit from a dedicated ZIL device; others use synchronous writes exclusively and, for all practical purposes, require one. The key thing to remember here is that the ZIL always exists in memory. If you have a dedicated device, the in-memory ZIL is mirrored to that device; otherwise it is mirrored to your pool. By using an SSD, you reduce latency and contention by not utilizing your data pool (which is presumably comprised of spinning disks) for mirroring the in-memory ZIL.

There's a lot of confusion surrounding ZFS and ZIL device failure. When ZFS was first released, dedicated ZIL devices were essential to data pool integrity: a missing ZIL vdev would render the entire pool unusable, so mirroring the ZIL devices was essential to prevent a failed ZIL device from destroying the entire pool. This is no longer the case with ZFS. A missing ZIL vdev will impact performance but will not cause the entire pool to become unavailable. However, the conventional wisdom that the ZIL must be mirrored to prevent data loss in the case of ZIL failure lives on. Keep in mind that the dedicated ZIL device is merely mirroring the real, in-memory ZIL. Data loss can only occur if your dedicated ZIL device fails and the system then crashes with writes still in flight in the unmirrored in-memory ZIL. As soon as the dedicated ZIL device fails, the mirror of the in-memory ZIL moves back to the pool, so in practice you have a window of a few seconds following a ZIL device failure where the system is vulnerable to data loss.

After a crash, ZFS will attempt to replay the ZIL contents. SSDs themselves have a volatile write cache, so they may lose data during a bad shutdown. To ensure the ZIL replay has all of your in-flight writes, the SSDs used as dedicated ZIL devices should have power protection. HGST makes a number of devices that are specifically targeted as dedicated ZFS ZIL devices, and other manufacturers such as Intel offer appropriate devices as well. In practice, only the designer of the system can determine whether the use case warrants a professional enterprise-grade SSD with power protection or whether a consumer-level device will suffice. The primary characteristics here are low latency, high random write performance, high write endurance, and, depending on the situation, power protection.
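For readers who want to see what "synchronous writes" means from an application's point of view, here is a minimal Python sketch; the file path is purely illustrative. The fsync call is the kind of request that cannot complete until ZFS has the data on stable storage, which is exactly the traffic a dedicated ZIL device accelerates.

```python
# Minimal sketch of the synchronous-write pattern that exercises the ZIL.
# An async write can return once the data lands in RAM; a sync write does
# not return until ZFS has committed it to stable storage (ZIL or SLOG).
import os

def write_record(path, payload, synchronous=True):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        if synchronous:
            os.fsync(fd)   # blocks until the data is on stable storage
    finally:
        os.close(fd)

# Databases and NFS servers issue this kind of fsync-heavy traffic constantly,
# which is why they benefit most from a low-latency dedicated ZIL device.
write_record("/mnt/tank/db/journal.log", b"transaction record\n")
```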

L2ARC Devices

ZFS allows you to equip your system with dedicated read cache devices (L2ARC). Typically, you'll want these devices to be lower latency than your main storage pool. Remember that the primary read cache used by the system is system RAM (the ARC), which is orders of magnitude faster than any SSD. If you can satisfy your read cache requirements with RAM, you'll enjoy better performance than if you use an SSD read cache.

In addition, there is a scenario where an L2ARC read cache can actually hurt performance. Consider a system with 6GB of memory cache (ARC) and a working set of 5.9GB. This system might enjoy a read cache hit ratio of nearly 100%. If an SSD L2ARC is added to the system, the L2ARC requires space in RAM to map its address space. This space comes at the cost of evicting data from memory and placing it in the L2ARC. The ARC hit rate will drop, and the misses will be satisfied from the (far slower) SSD L2ARC. In short, not every system can benefit from an L2ARC.

FreeNAS includes tools in the GUI and at the command line that show ARC sizing and hit rates. If the ARC size is hitting the maximum allowed by RAM and the hit rate is below 90%, the system can benefit from L2ARC. If the ARC is smaller than RAM allows, or the hit rate is already at 99.X%, adding L2ARC to the system will not improve performance.

As far as selecting appropriate devices for L2ARC, they should be biased towards random read performance. The data on them is not persistent, and ZFS behaves quite well when faced with L2ARC device failure. There is no need or provision to mirror or otherwise make L2ARC devices redundant, nor is there a need for power protection on these devices.
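The decision rule in the previous paragraphs is simple enough to write down. The sketch below is just that rule of thumb expressed in Python, assuming you have pulled the ARC size, ARC maximum, and hit ratio from the FreeNAS reporting graphs or command-line ARC statistics; the 95% "effectively full" cutoff is my own illustrative threshold, not something FreeNAS enforces.

```python
# Rule-of-thumb check for whether an SSD L2ARC is likely to help.
# Inputs come from the FreeNAS reporting GUI or ARC statistics tools.

def l2arc_likely_to_help(arc_size_gb, arc_max_gb, hit_ratio):
    """True only if the ARC is pinned at its RAM limit *and* still missing often."""
    arc_is_full = arc_size_gb >= 0.95 * arc_max_gb   # effectively at its ceiling
    return arc_is_full and hit_ratio < 0.90

print(l2arc_likely_to_help(arc_size_gb=60, arc_max_gb=62, hit_ratio=0.82))  # True: L2ARC can help
print(l2arc_likely_to_help(arc_size_gb=38, arc_max_gb=62, hit_ratio=0.97))  # False: RAM already covers it
```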

Joshua Paetzel
iXsystems Director of IT


A Complete Guide to FreeNAS Hardware Design, Part II: Hardware Specifics

Written by Joshua Paetzel.

General Hardware Recommendations

I’ve built a lot of ZFS storage hardware and have two decades of experience with FreeBSD. The following are some thoughts on hardware.


Intel Versus AMD

FreeNAS is based on FreeBSD. FreeBSD has a long history of working better on Intel than AMD. Things like (but not limited to) the watchdog controllers, USB controllers, and temperature monitoring all have a better chance of being well supported when they are on an Intel platform. This is not to say that AMD platforms won’t work, that there aren’t AMD platforms that work flawlessly with FreeNAS, or even that there aren’t Intel platforms that are poor choices for FreeNAS, but all things being equal, you’ll have better luck with Intel than AMD.

The Intel Avoton platforms are spendy but attractive: ECC support, low power, AES-NI support (a huge boon for encrypted pools). On the desktop side of things, there are Core i3 platforms with ECC support, and of course there are many options in the server arena. The single socket E3 Xeons are popular in the community, and of course for higher end systems, the dual package Xeon platforms are well supported.

Storage Controllers

LSI is the best game in town for add-on storage controllers. Avoid their MegaRAID solutions and stick with their HBAs. You'll see three generations of HBAs commonly available today. The oldest (and slowest) are the SAS2008-based controllers such as the 9211 or the very popular IBM M1015. The next generation was based on the SAS2308, which added PCIe 3.0 support and more CPU horsepower on the controller itself; an example here is the 9207. Both the 2008- and 2308-based solutions are 6Gbps SAS parts. The newest generation are 12Gbps parts such as the 9300.

The FreeNAS driver for the 6Gbps parts is based on version 16 of the stock LSI driver, with many enhancements that LSI never incorporated into their driver. In addition, many of the changes after version 16 were specifically targeted at the Integrated RAID functionality that can be flashed onto these cards. As a result, "upgrading" the driver manually to the newer versions found on the LSI website can actually downgrade its reliability or performance. I highly recommend running version 16 firmware on these cards: it's the configuration tested by LSI, and it's the configuration tested by the FreeNAS developers. Running newer firmware should work, but running older firmware is not recommended or supported, as there are known flaws when the FreeNAS driver runs against a controller with older firmware. FreeNAS will warn you if the firmware on an HBA is incompatible with the driver; heed this warning or data loss can occur. The newer 12Gbps parts use version 5 of the LSI driver, and cards using this driver should run version 5 firmware.

Most motherboards have some number of SATA ports built in. Certain models of Marvell and J-Micron controllers are used on motherboards with large numbers of SATA ports; some of these controllers have compatibility issues with FreeNAS, and some also include their own forms of RAID. As a general rule, the integrated chipset AHCI SATA ports have no issues when used with FreeNAS; they just tend to be limited to 10 ports (and often far fewer) on most motherboards.

Hard Drives

Desktop drives should be avoided whenever possible. In a desktop, if an I/O fails, the data is simply gone, so desktop drives will retry I/Os endlessly. In a storage device, you want redundancy at the storage level: if an individual drive fails an I/O, ZFS will retry the I/O on a different drive, and the faster that happens, the faster the array can cope with hardware faults. For larger arrays, desktop drives (yes, I've seen attempts to build 1PB arrays with ZFS and desktop drives) are simply not usable in many cases. For small to medium size arrays, a number of manufacturers produce a "NAS" hard drive that is rated for arrays of modest size (typically 6-8 drives or so). These drives are worth the additional cost.

At the high end, if you are building an array with SAS controllers and expanders, consider getting the nearline 7200 RPM SAS drives. These carry only a small premium over enterprise SATA drives. Running SATA drives behind SAS expanders, while supported, is a less desirable configuration than using SAS end to end because of the difficulty of translating SATA errors across the SAS bus.

Josh Paetzel
iXsystems Director of IT


A Complete Guide to FreeNAS Hardware Design, Part I: Purpose and Best Practices

Written by Joshua Paetzel.

A guide to selecting and building FreeNAS hardware, written by the FreeNAS Team, is long overdue, and for that we apologize. The delay came down to the depth and complexity of the subject, as you'll see from the extensive nature of this four-part guide, and to the variety of ways FreeNAS can be utilized. There is no "one-size-fits-all" hardware recipe. Instead, there is a wealth of hardware available, with varying levels of compatibility with FreeNAS, and there are many things to take into account beyond the basic components, from use case and application to performance, reliability, redundancy, capacity, budget, need for support, and so on. This document draws on years of experience with FreeNAS, ZFS, and the OS that lives underneath FreeNAS, FreeBSD. Its purpose is to give guidance on intelligently selecting hardware for use with the FreeNAS storage operating system, taking the complexity of its myriad uses into account, and to provide some insight into both pathological and optimal configurations for ZFS and FreeNAS.

A word about software defined storage:

FreeNAS is an implementation of Software Defined Storage; although software and hardware are both required to create a functional system, they are decoupled from one another. We develop and provide the software and leave the hardware selection to the user. Implied in this model is the fact that there are a lot of moving pieces in a storage device (figuratively, not literally). Although these parts are all supposed to work together, the reality is that all parts have firmware, many devices require drivers, and the potential for there to be subtle (or gross) incompatibilities is always present.

Best Practices

ECC RAM or Not?

This is probably the most contested issue surrounding ZFS (the filesystem that FreeNAS uses to store your data) today. I've run ZFS with ECC RAM and I've run it without. I've been involved in the FreeNAS community for many years and have seen people argue that ECC is required and others argue that it is a pointless waste of money. ZFS does something no other filesystem you'll have available to you does: it checksums your data, it checksums the metadata used by ZFS, and it checksums the checksums. If your data is corrupted in memory before it is written, ZFS will happily write (and checksum) the corrupted data. Additionally, ZFS has no pre-mount consistency checker or tool that can repair filesystem damage. This is very nice when dealing with large storage arrays, as a 64TB pool can be mounted in seconds, even after a bad shutdown. However, if a non-ECC memory module goes haywire, it can cause irreparable damage to your ZFS pool and complete loss of the storage.

For this reason, I highly recommend the use of ECC RAM with "mission-critical" ZFS. Systems with ECC RAM will correct single-bit errors on the fly, and will halt the system before multiple-bit errors can do any damage to the array. If it's imperative that your ZFS-based system always be available, ECC RAM is a requirement. If restoring your ZFS system from backups would only be some level of annoying (slightly, moderately...), non-ECC RAM will fit the bill.

How Much RAM is needed?

FreeNAS requires 8GB of RAM for the base configuration. If you are using plugins and/or jails, 12GB is a better starting point. There's a lot of advice about how RAM-hungry ZFS is and how it requires massive amounts of RAM; an oft-quoted number is 1GB of RAM per TB of storage. The reality is that it's complicated. ZFS does require a base level of RAM to be stable, and the amount it needs does grow with the size of the storage. 8GB of RAM will get you through the 24TB range; beyond that, 16GB is a safer minimum, and once you get past 100TB of storage, 32GB is recommended. However, that's just to satisfy the stability side of things.

ZFS performance lives and dies by its caching, and there are no good guidelines for how much cache a given storage size with a given number of simultaneous users will need. You can have a 2TB array with 3 users that needs 1GB of cache, and a 500TB array with 50 users that needs 8GB of cache. Neither scenario is likely, but both are possible. The optimal cache size for an array tends to increase with the size of the array; outside of that guidance, the only thing we can recommend is to measure and observe as you go. FreeNAS includes tools in the GUI and at the command line to see cache utilization. If your cache hit ratio is below 90%, you will see performance improvements by adding cache to the system in the form of RAM or SSD L2ARC (dedicated read cache devices in the pool).
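As a starting point, the stability guidance above can be summarized in a few lines of Python. This only captures the baseline minimums from this section; it says nothing about the extra RAM a busy pool may want for caching.

```python
# Rough starting-point RAM recommendation following the guidance above.
# These are minimums for stability; cache-driven performance may want more.

def recommended_ram_gb(storage_tb, uses_jails_or_plugins=False):
    base = 12 if uses_jails_or_plugins else 8
    if storage_tb > 100:
        return max(base, 32)
    if storage_tb > 24:
        return max(base, 16)
    return base

for tb in (12, 48, 200):
    print(tb, "TB ->", recommended_ram_gb(tb), "GB RAM minimum")
```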

RAID vs. Host Bus Adapters (HBAs)

ZFS wants direct control of the underlying storage that it is putting your data on. Nothing will make ZFS more unstable than something manipulating bits underneath ZFS. Therefore, connecting your drives to an HBA or directly to the ports on the motherboard is preferable to using a RAID controller; fortunately, HBAs are cheaper than RAID controllers to boot! If you must use a RAID controller, disable all write caching on it and disable all consistency checks. If the RAID controller has a passthrough or JBOD mode, use it. RAID controllers will complicate disk replacement and improperly configuring them can jeopardize the integrity of your volume (Using the write cache on a RAID controller is an almost sure-fire way to cause data loss with ZFS, to the tune of losing the entire pool).

Virtualization vs. Bare Metal

FreeBSD (the underlying OS of FreeNAS) is not the best virtualization guest: it lacks some virtio drivers, it lacks some OS features that would make it a better-behaved guest, and, most importantly, it lacks full support from some virtualization vendors. In addition, ZFS wants direct access to your storage hardware, while many virtualization solutions only support hardware RAID locally (I'm looking at you, VMware). That combination encourages a worst-case scenario: passing a virtual disk, on a datastore backed by a hardware RAID controller, through to a VM running FreeNAS. This puts two layers between ZFS and your data, one being the hypervisor's filesystem on the datastore and the other the RAID controller itself. If you can do PCI passthrough of an HBA to a FreeNAS VM and get all the moving pieces to work properly, you can successfully virtualize FreeNAS. We even include the VMware guest tools in FreeNAS, mainly because we use VMware to do a lot of FreeNAS development. However, if you have problems, there are no developer resources devoted to running FreeNAS as a production VM, and help will be hard to come by. For this reason, I highly recommend that FreeNAS be run "on the metal" as the only OS on dedicated hardware.

Josh Paetzel
iXsystems Director of IT
