CPU: optimize generic SWAB32 implementation and prefer it over builtin for Cortex-M3