New idea: intelligent interpolation.
There are always some undetected or improperly detected faces. Some are fixable with manual extract, some aren't.
So my idea is to have a way to mark the unfixable faces, for example in the manual re-extract GUI. Let's say DEL or some other shortcut marks these pics.
Then, in merging, there would be a new option to either simply cut these frames from the merge (cutting the audio track in the same place too), or have the merger replace the unfixable frames with a smooth morph/transition/interpolation. I guess it would use a system similar to Hybrid or Twixtor, but only working on short intervals of a few unfixable frames. If there are 6 unfixable faces, this option would cut them and replace the 6 frames with 1-3 transition/morph frames between the last OK face before the unfixable run and the first OK face after it.
For example:
ok 01 - unfix01 - unfix02 - unfix03 - unfix04 - unfix05 - unfix06 - ok 02
is replaced by
ok 01 - transi01 - transi02 - ok 02
The audio track would also be cut accordingly.
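To make the bookkeeping concrete, here is a rough Python sketch of the planning step only. Everything in it is hypothetical and not taken from the existing merger code: the function names (`find_unfixable_runs`, `build_timeline`, `audio_cuts`), the per-frame list of "unfixable" flags, and the choice to cut the audio at the tail of each run. The blend weight is just a plain cross-fade factor standing in for a real motion-based interpolation like Twixtor-style tools do.

```python
from typing import List, Tuple, Union

# A timeline entry is either an original frame index (int), or a tuple
# (before_idx, after_idx, weight) describing one transition frame:
#   frame = (1 - weight) * frame[before_idx] + weight * frame[after_idx]
# (a simple cross-fade; an assumption, not a real morph).
Entry = Union[int, Tuple[int, int, float]]


def find_unfixable_runs(unfixable: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for each run of marked frames."""
    runs = []
    start = None
    for i, flag in enumerate(unfixable):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(unfixable)))
    return runs


def build_timeline(unfixable: List[bool], n_transition: int = 2) -> List[Entry]:
    """Replace each unfixable run with up to n_transition blended frames."""
    total = len(unfixable)
    runs = dict(find_unfixable_runs(unfixable))
    timeline: List[Entry] = []
    i = 0
    while i < total:
        if i in runs:
            end = runs[i]
            # A transition needs an OK frame on both sides; otherwise just cut.
            if i > 0 and end < total:
                n = min(n_transition, end - i)
                for k in range(1, n + 1):
                    timeline.append((i - 1, end, k / (n + 1)))
            i = end
        else:
            timeline.append(i)
            i += 1
    return timeline


def audio_cuts(unfixable: List[bool], fps: float,
               n_transition: int = 2) -> List[Tuple[float, float]]:
    """Audio intervals (seconds, in ORIGINAL time) to remove so audio stays in sync.

    Assumption: the audio under the first few frames of a run is kept to cover
    the transition frames, and the rest of the run is cut.
    """
    total = len(unfixable)
    cuts = []
    for start, end in find_unfixable_runs(unfixable):
        kept = min(n_transition, end - start) if (start > 0 and end < total) else 0
        if start + kept < end:
            cuts.append(((start + kept) / fps, end / fps))
    return cuts
```

With the example above (one OK frame, six unfixable, one OK frame), `build_timeline` keeps frame 0, adds two blended frames between frames 0 and 7, then keeps frame 7. Since the cut times are in original time, they would need to be applied from the last cut to the first (or shifted) so that earlier cuts don't move later ones.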
This is all stuff you can do in external apps, but being able to merge without these unfixable frames would speed up the workflow. No need to watch the whole vid and hunt for those frames to cut.
Anyway, this would probably require a lot of complex coding, but if someone is out of ideas one day...