It may be of interest that in a product that sold purely on performance (against a standard spec), we coded and debugged a fully general implementation first, and only THEN optimised the special cases (as opposed to writing the easy, fast special cases first and then adding the hard cases).
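
To illustrate that ordering, here is a minimal sketch in Python. It is not the product's code: the function names and the strided-sum task are hypothetical, chosen only to show the shape of the approach. The general path is written and debugged first, a fast path is added afterwards for one special case, and the general path stays around as the reference the fast path is checked against.

```python
def sum_strided_general(data, start, stop, step):
    """Fully general reference path (hypothetical): any positive step."""
    total = 0
    i = start
    while i < stop:
        total += data[i]
        i += step
    return total


def sum_strided(data, start, stop, step):
    """Dispatcher: use the optimised special case if it applies, else fall back."""
    if step == 1:
        # Special case added later: a contiguous range can use the built-in sum.
        return sum(data[start:stop])
    return sum_strided_general(data, start, stop, step)


def check_against_reference(data):
    """Compare the dispatcher with the general path on every small range and step."""
    n = len(data)
    for start in range(n + 1):
        for stop in range(start, n + 1):
            for step in range(1, 4):
                expected = sum_strided_general(data, start, stop, step)
                if sum_strided(data, start, stop, step) != expected:
                    return False
    return True
```

One benefit of doing it in this order is that every special case added later has a known-good answer to be compared with, and any input the fast paths do not cover still gets a correct result from the general path.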