Improving the reliability of Flash

Flash storage is ubiquitous in our modern world, used in everything from our smartphones, to laptops, to servers running cloud applications. It’s so prevalent that most of us don’t realise that Flash memory is an inherently unreliable medium. In fact, Flash cells have a limited service life, and the nature of Flash means that strong wear-leveling technology is required for modern Flash to perform as well as it does.

The good news is, wear-leveling technology in modern Flash controllers has advanced to overcome weaknesses inherent in the underlying Flash medium and emphasise the benefits of Flash. For modern Flash systems, the choice of Flash is not as important as the selection of the Flash controller. By selecting the proper Flash controller for the application, improved endurance and reliability can be achieved.

For end-users and device-makers alike this is a boon, as it means that lower-cost, higher-capacity multi-level cell (MLC) Flash can increasingly be used for more critical applications, as long as it is paired with a suitable high-quality controller.

Flash reliability challenges

With Flash being used in nearly every electronic device we touch today, it’s easy to forget that it is an inherently finicky medium that faces a number of reliability challenges.

While Flash cells can be read from near-indefinitely, they can only be programmed or erased a limited number of times. This program/erase (P/E) endurance varies with Flash type, but is generally in the thousands of cycles per cell for the commercial MLC-type Flash used in most NAND Flash storage devices today, such as SSDs or eMMC.

To make matters worse, while Flash can be read from without much issue, writing to it is decidedly more involved. Flash is written at the page level, with a page sized in the kilobytes. However, a page must be erased before it can be written to. Unfortunately, Flash can only be erased a block at a time, with a block sized in the megabytes. Updating data in place therefore requires first erasing the larger block that contains the page, and then rewriting any other pages in that block that are still in use. This reduces overall service life, since updating a single page can wear every cell in the block. The effect is known as write amplification.
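As a rough illustration, write amplification can be expressed as the ratio of bytes physically programmed into Flash to bytes the host asked to write. The sketch below assumes a 16 KiB page and a 4 MiB block, which are illustrative sizes rather than figures for any particular device, and looks at the worst case in which updating one page forces the whole block to be rewritten.

```python
PAGE_SIZE = 16 * 1024          # assumed page size: 16 KiB
BLOCK_SIZE = 4 * 1024 * 1024   # assumed block size: 4 MiB


def write_amplification(host_bytes: int, flash_bytes: int) -> float:
    """Ratio of bytes physically programmed to bytes the host requested."""
    return flash_bytes / host_bytes


# Worst case: the host updates one page and the device rewrites the whole block.
worst_case_waf = write_amplification(PAGE_SIZE, BLOCK_SIZE)
print(worst_case_waf)  # 256.0
```

Real devices rarely hit this worst case on every write, but the figure shows why the mapping and wear-leveling strategy matters so much.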

To reduce wear on Flash cells, all Flash storage devices must use wear-leveling techniques. These techniques aim to spread wear evenly across a drive in order to maximise the system's endurance. Temporary buffers in DRAM, SRAM or unused Flash cells are used to keep track of where the drive should write next, as well as old locations that need to be erased.

Another major issue for Flash drives is power failure protection. The temporary buffers containing metadata about where the drive should write next, as well as old locations that must be erased, may be stored in volatile memory. If this is the case, a sudden power loss can wipe the buffer, causing catastrophic damage to the data on the drive.

A final issue affecting Flash reliability is the rise in error counts as process geometries shrink and Flash increases in density and performance. While the original Flash drives used single-level cell (SLC) Flash, where each cell stored one bit, modern Flash drives generally store multiple bits in a single cell (MLC/TLC Flash). Storing more bits in each physical cell increases storage density but narrows the margin between the charge levels that represent each stored value. This both raises the bit error rate and shortens the service life. As process sizes shrink, Flash densities increase even further, and error rates rise with them.
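The arithmetic behind this can be sketched in a few lines. The example assumes evenly spaced charge levels across a fixed usable window, which is a simplification of how real cells behave, and the function names are purely illustrative.

```python
def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct charge states a cell must hold."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int) -> float:
    """Spacing between adjacent states as a fraction of the usable window,
    assuming evenly spaced levels (a simplification)."""
    return 1 / (voltage_levels(bits_per_cell) - 1)


for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3)]:
    print(name, voltage_levels(bits), round(relative_margin(bits), 3))
# SLC 2 1.0
# MLC 4 0.333
# TLC 8 0.143
```

Under this simple model, each extra bit per cell doubles the number of states to distinguish and cuts the spacing between neighbouring states by more than half.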

Advanced controller technology

Despite all the fundamental challenges of Flash storage reliability, we’re still able to use it in everyday consumer devices, business applications and even mission-critical use cases. This is in large part thanks to advanced Flash controller technology. These controllers incorporate advances in wear leveling, power failure management and error correction, which allows us to use today’s high-density Flash safely and reliably.

Wear Leveling

The Flash translation layer (FTL) is one of the most important aspects of the Flash controller. By translating logical addresses from the host to physical addresses on the Flash, it enables the SSD to perform wear leveling. If a host system updates data at the same address, for instance, the FTL will translate that logical address to a new physical address in order to spread wear evenly around the Flash drive and maximise endurance.
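The remapping idea can be sketched with a toy model. The class below is a hypothetical, block-granular FTL written for illustration only: it ignores pages, bad blocks and garbage collection, and assumes a spare erased block is always available. Each rewrite of a logical address goes to the least-worn free physical block, and the stale copy is erased and returned to the free pool.

```python
class SimpleFTL:
    """Toy block-granular FTL: remaps each rewrite to the least-worn free block."""

    def __init__(self, num_physical_blocks: int):
        self.erase_counts = [0] * num_physical_blocks
        self.free = set(range(num_physical_blocks))
        self.l2p = {}    # logical block -> physical block
        self.data = {}   # physical block -> payload

    def write(self, logical: int, payload: bytes) -> int:
        # Pick the free physical block with the fewest erases so far.
        target = min(self.free, key=lambda b: (self.erase_counts[b], b))
        self.free.remove(target)
        self.data[target] = payload

        old = self.l2p.get(logical)
        self.l2p[logical] = target
        if old is not None:
            # The stale copy is erased and returned to the free pool.
            del self.data[old]
            self.erase_counts[old] += 1
            self.free.add(old)
        return target

    def read(self, logical: int) -> bytes:
        return self.data[self.l2p[logical]]


ftl = SimpleFTL(num_physical_blocks=4)
for i in range(8):
    ftl.write(0, f"version {i}".encode())

print(ftl.read(0))        # b'version 7'
print(ftl.erase_counts)   # [2, 2, 2, 1]
```

Although the host wrote to the same logical address eight times, the seven resulting erases are spread across all four physical blocks instead of landing on one.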

The granularity of the logical-to-physical mapping in the FTL can have a big impact on performance as well as endurance. Block-based mapping, used by simpler Flash media such as consumer USB drives and SD cards, maps addresses at the level of the block (sized in the megabytes). Wear leveling happens at the block level, but no optimisation happens at the page level, since each logical page is simply mapped to a fixed physical page within its block.

Since blocks are the same size as the minimum size for erase operation, this kind of mapping is very simple to implement and has low overhead. However, this simplistic approach also generates a great deal of write amplification, and shortens the service life of the device.

Page-based mapping is commonly used in modern SSDs. This maps a more granular logical page of data (measured in kilobytes) to a physical page of data. With this type of mapping, logical pages can be mapped to any physical page within a block, allowing for both block and page-level wear leveling. However, in form factors other than SSDs, page-based mapping is not yet widely used.

More granular approaches like page mapping require more computational power and a larger mapping table; however, increasing granularity can greatly reduce write amplification. This matters especially for industrial, embedded and IoT applications, where small, random I/O operations are the norm: granular page mapping can greatly reduce write amplification and improve the service life of the device.
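To give a feel for the difference, the following sketch simulates random single-page writes on a page-mapped device with simple greedy garbage collection and compares the result with the block-mapped worst case, where every page update rewrites a whole block. All sizes, the amount of spare capacity, the greedy victim selection and the function name are assumptions chosen for illustration, not a description of any real controller.

```python
import random


def page_mapped_waf(num_blocks=16, pages_per_block=32, logical_pages=384,
                    host_writes=20_000, seed=1):
    """Simulate random single-page writes on a page-mapped device with greedy GC."""
    rng = random.Random(seed)
    valid = [set() for _ in range(num_blocks)]  # logical pages still valid per block
    used = [0] * num_blocks                     # pages programmed per block
    where = {}                                  # logical page -> physical block
    free_blocks = list(range(1, num_blocks))    # erased blocks ready for use
    active = 0                                  # block currently being filled
    flash_writes = 0

    def program(lpage):
        nonlocal active, flash_writes
        if used[active] == pages_per_block:     # active block full: open a new one
            active = free_blocks.pop()
        old = where.get(lpage)
        if old is not None:
            valid[old].discard(lpage)           # previous copy becomes stale
        valid[active].add(lpage)
        used[active] += 1
        where[lpage] = active
        flash_writes += 1

    def collect():
        # Greedy garbage collection: reclaim the full block with the fewest valid pages.
        full = [b for b in range(num_blocks) if b != active and b not in free_blocks]
        victim = min(full, key=lambda b: len(valid[b]))
        for lpage in list(valid[victim]):
            program(lpage)                      # relocate still-valid data
        used[victim] = 0                        # erase the victim block
        free_blocks.append(victim)

    for _ in range(host_writes):
        while len(free_blocks) < 2:             # keep a spare block for relocation
            collect()
        program(rng.randrange(logical_pages))
    return flash_writes / host_writes


block_mapped_worst_case = 32  # one page update rewrites a whole 32-page block
print("page-mapped WAF:", round(page_mapped_waf(), 2))
print("block-mapped worst case:", block_mapped_worst_case)
```

The exact page-mapped figure depends heavily on the workload and on how much spare capacity the device reserves, so the output should be read as a qualitative comparison only.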

Power Loss Protection

Because mapping information for the SSD’s wear-leveling algorithms is often stored in volatile DRAM, power failures can cause catastrophic loss of information and damage to the drive. To protect against this possibility, many industrial SSDs will incorporate super-capacitors to store backup power in case of a power failure, allowing time for the DRAM content to be flushed to non-volatile Flash storage.

While this method is feasible, it is non-ideal. By relying on backup power from super-capacitors, these SSDs not only add cost but also introduce another possible failure point, which may impact reliability and service life. Smaller form factors like µSD make it almost completely infeasible to integrate DRAM and a capacitor.

Storage devices using Flash controllers with hyMap technology from Hyperstone store mapping information in non-volatile memory directly. This not only eliminates DRAM and capacitor costs but also ensures data safety at all times and under all circumstances.

Error Correction

Error correction is the final piece of the puzzle for Flash storage reliability. While older-generation Flash could rely on simple Hamming-code error correction (ECC), newer-generation, high-density MLC Flash requires stronger error correction. Modern MLC ECC must be capable of correcting multiple bits per sector.
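To show why a Hamming code is not enough once several bits in a sector can fail, here is a minimal sketch of the textbook Hamming(7,4) code, which protects four data bits with three parity bits. It is only an illustration of the principle: the function names are ours, and real controllers apply ECC to far larger codewords.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

one_error = list(codeword)
one_error[4] ^= 1                     # a single bit flip is corrected
print(hamming74_decode(one_error))    # [1, 0, 1, 1]

two_errors = list(codeword)
two_errors[0] ^= 1
two_errors[1] ^= 1                    # two bit flips are silently miscorrected
print(hamming74_decode(two_errors))   # [0, 0, 1, 1]
```

A single flipped bit is repaired, but two flipped bits produce wrong data without any warning, which is why denser Flash needs codes that can correct many bits per sector.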

While consumer SSDs may use lower-cost, lower-quality LDPC-code implementations for this type of ECC, the requirements of industrial Flash favour BCH or other higher-reliability methods. With 96-bit BCH ECC, multiple-bit error correction can be provided without adding overhead to I/O operations.

Controlling Flash reliability

Creating reliable Flash is fraught with challenges. Though solid-state storage has no moving parts and is physically more reliable than hard drives, the limited lifespan of Flash cells, power failure issues, and error correction issues of Flash create challenges for data reliability, especially over the long service lives of embedded and industrial drives.

In the past, simply buying SLC-type Flash storage was enough to ensure a relatively reliable system. However, with shrinking process geometries and increasing Flash densities, the difference in reliability and error rates between types of Flash is less stark today than before. Instead, the greatest determinant of the reliability of today’s storage systems is the Flash controller design.

For applications requiring reliability and long service life, it’s essential to choose controllers that target the embedded and industrial markets, rather than ones designed to maximise performance at the cost of service life or data integrity. Through advanced wear-leveling techniques, power failure-proof design, and strong ECC, storage devices based on Hyperstone controllers enable highly reliable solutions.

www.hyperstone.com
