This may be a bit elementary for this crowd, but regarding the balance of data cost vs. capturing the most significant features: we use a simple decision tree to identify significance clusters, then optimize our data munging around those clusters.
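
A minimal sketch of that idea, under stated assumptions: a shallow tree is grown with Gini impurity, each feature is scored by the impurity decrease it contributes, and only the features that score above zero are kept for further munging. The Gini criterion, the depth limit, the toy data, and every name here (`tree_importance`, `feature_names`, `significant`) are illustrative assumptions, not a description of the actual pipeline.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best split, or None."""
    n = len(rows)
    parent = gini(labels)
    best = None
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = (parent
                    - (len(left) / n) * gini(left)
                    - (len(right) / n) * gini(right))
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def tree_importance(rows, labels, max_depth=2, total=None, scores=None):
    """Grow a shallow tree; sum the weighted impurity decrease per feature."""
    if total is None:
        total = len(rows)
        scores = [0.0] * len(rows[0])
    if max_depth == 0 or gini(labels) == 0.0:
        return scores
    split = best_split(rows, labels)
    if split is None or split[0] <= 0.0:
        return scores
    gain, f, t = split
    # Weight the gain by the share of samples that reach this node.
    scores[f] += gain * len(rows) / total
    for keep_left in (True, False):
        part = [(r, y) for r, y in zip(rows, labels) if (r[f] <= t) == keep_left]
        tree_importance([r for r, _ in part], [y for _, y in part],
                        max_depth - 1, total, scores)
    return scores

# Toy data (hypothetical): f0 separates the classes, f1 and f2 are noise.
feature_names = ["f0", "f1", "f2"]
rows = [
    [1.0, 5.0, 0.2], [2.0, 3.0, 0.9], [3.0, 8.0, 0.4], [4.0, 1.0, 0.7],
    [6.0, 4.0, 0.1], [7.0, 9.0, 0.8], [8.0, 2.0, 0.3], [9.0, 7.0, 0.6],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

scores = tree_importance(rows, labels, max_depth=2)
# Features the tree actually split on; the rest are candidates to stop collecting.
significant = [name for name, s in zip(feature_names, scores) if s > 0.0]
print(significant, scores)
```

The depth limit is what makes this a cost trade-off: a shallower tree splits on fewer features, so fewer columns need to be collected and cleaned, at the price of ignoring weaker signals.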

On some levels it is anti-diversity, but given real-world constraints it has yielded the best results. Any thoughts or links on this topic would be appreciated.