You need to decide which factor is your primary concern, data durability (data loss), or data availability. As mentioned backups dramatically improve data durability. But if you are after data availability, you'll need to handle all the hardware factors power supplies (as mentioned), memory (ECC and DIMM fail/sparing), cooling, and probably networking (lacp, etc).
The sparing process can be scripted. As a subject matter expert, and your vast experience, this will be straight forward. Perl and python are available in the Nerd Tools. This may allow you to worry less while working.
However, I am not sure it would be "hot" as the array must shutdown to reassign the drive. You could implement NetApp's maintenance garage function, to test, and then resume or fail the drive.