Help needed: benchmarking tests for improving neThing.xyz

one of the hard problems with text-to-CAD is that the “evaluation criteria” aren’t trivial.

does the code compile? yes.
is it the “right” shape? not sure, that’s much harder.

i’ve gone into more detail about this here:

and i even asked jerry liu how he might approach eval (he wasn’t sure):

i’m realizing that much of what i want to add to neThing is going to first require a benchmark to evaluate against. this is where i need your help.

what prompts have you given neThing where it failed, but you really feel that it should have gotten the answer correct?
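
to make the ask concrete, here is a minimal sketch of how a submitted prompt could become a benchmark case. everything in it is an assumption for illustration, not how neThing works today: the names `BenchmarkCase`, `Measured`, and `score` are hypothetical, and the measured volume and bounding box are assumed to come from whatever CAD kernel runs the generated code. it only checks coarse proxies (does it compile, is the volume close, is the bounding box close), so passing it does not prove the shape is “right”.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCase:
    """One prompt plus coarse geometric expectations (hypothetical schema)."""
    prompt: str
    expected_volume: float                      # mm^3
    expected_bbox: tuple[float, float, float]   # bounding-box extents in mm


@dataclass
class Measured:
    """What was measured from the generated model (assumed to come from a CAD kernel)."""
    compiled: bool
    volume: float
    bbox: tuple[float, float, float]


def within(actual: float, expected: float, rel_tol: float) -> bool:
    """True if actual is within rel_tol (a fraction) of expected."""
    return abs(actual - expected) <= rel_tol * abs(expected)


def score(case: BenchmarkCase, result: Measured, rel_tol: float = 0.05) -> dict:
    """Return pass/fail flags for each proxy check."""
    if not result.compiled:
        return {"compiles": False, "volume_ok": False, "bbox_ok": False}
    # Sort the extents so the check does not depend on which axis is which.
    bbox_ok = all(
        within(a, e, rel_tol)
        for a, e in zip(sorted(result.bbox), sorted(case.expected_bbox))
    )
    return {
        "compiles": True,
        "volume_ok": within(result.volume, case.expected_volume, rel_tol),
        "bbox_ok": bbox_ok,
    }


# Example: a 20 mm cube, once modeled correctly and once at half height.
case = BenchmarkCase("a 20mm cube", 8000.0, (20.0, 20.0, 20.0))
good = Measured(compiled=True, volume=8000.0, bbox=(20.0, 20.0, 20.0))
bad = Measured(compiled=True, volume=4000.0, bbox=(20.0, 20.0, 10.0))
print(score(case, good))
print(score(case, bad))
```

so if you share a failing prompt, it helps a lot to also say what the part should roughly measure, since that is what a check like this would compare against.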