So ... the results are interesting of course, and I do understand the dataset/training aspect at a contextual level. But are there any breakdowns available out there explaining how the pictures are actually being blended into the final pixels? By that I mean an actual breakdown that illustrates what is really happening - not just some vague description.
Now I do understand that merely overlapping many pictures at low opacity can give some interesting results (as seen a few years ago with the "perfect faces" experiments), yet the outcome of such an approach is always fuzzy and soft. So, what is actually going on here behind the scenes?
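(For concreteness, the kind of averaging I mean is roughly this - a per-pixel mean over a stack of pre-aligned photos. The folder name and the alignment step are assumed here:)

```python
# Rough sketch of the "average face" approach: stack pre-aligned face photos
# and take the per-pixel mean. Assumes the images are the same size and
# roughly aligned (eyes/nose in the same place), otherwise the result is mush.
import glob
import numpy as np
from PIL import Image

paths = glob.glob("aligned_faces/*.png")  # hypothetical folder of aligned faces
stack = np.stack([np.asarray(Image.open(p), dtype=np.float32) for p in paths])

average = stack.mean(axis=0)  # per-pixel mean -> soft, blurry composite
Image.fromarray(average.astype(np.uint8)).save("average_face.png")
```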
That's sort of the thing with ML techniques - nobody really knows what happens inside the model.
I've only dabbled and am very much not an expert in this, but I'm fairly confident the following is not complete horseshit...
You show your ML model lots of Disney princess pictures; it detects that they have common features (big eyes, being white and 14 years old, etc.) and stores these features in a big array of matrices. This is a sort of intermediate space that you could consider similar to what happens with projection matrices in 3D graphics.
You then show it your source image; it identifies the same set of features and transforms them into the same intermediate space.
You then basically ask it to lerp between the source array and (let's say) the average of the Disney princess array, and put the results back into the space of the original image (I believe that's a simple reversal of the source image's original transformation).
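As a very rough sketch of those three steps - encode() and decode() here are placeholders standing in for whatever the trained network actually does, not a real API:

```python
# Minimal sketch of the "intermediate space" idea: project into a feature/latent
# space, lerp towards the average target, project back into image space.
import numpy as np

def encode(image):    # image -> feature/latent vector (learned by the model)
    raise NotImplementedError

def decode(latent):   # feature/latent vector -> image (learned by the model)
    raise NotImplementedError

def blend(source_image, princess_latents, amount=0.7):
    z_source = encode(source_image)
    z_target = np.mean(princess_latents, axis=0)             # "average Disney princess"
    z_mixed = (1.0 - amount) * z_source + amount * z_target  # plain lerp in latent space
    return decode(z_mixed)                                   # back into image space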
The key parts are:
detecting features - plenty of algorithms for detecting faces/noses etc. floating around; this is easy
storing and filtering the data - this is where the clever shit really lies, and is what you'd call the model.
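For the detection part, one classical off-the-shelf option is OpenCV's bundled Haar cascade face detector - not necessarily what these tools use (a neural model learns its own features), but it shows how commoditised that step is compared to the modelling side:

```python
# Classical face detection with OpenCV's Haar cascade, just to illustrate the
# "detecting features" step. Input filename is hypothetical.
import cv2

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                              # bounding boxes of detected faces
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_marked.jpg", img)
```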
Well, the above is pretty much exactly what I understand already. My question is more about the actual mixing technique being used.
For instance, I have a fairly clear understanding of how a trained ML model could work for, say, a superhuman run of a Mario level - since the winning condition (progressing to the right), the failure points (either specific things like hitting enemies, or more globally anything that makes the life counter drop), and all the other inputs (distance to a wall, and so on) can be clearly defined. And then it's just a matter of selecting the best outcome.
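(To make that concrete, the scoring I have in mind boils down to a single reward number per step - the state fields below are hypothetical, just to illustrate the idea:)

```python
# Sketch of the Mario-style scoring: "winning condition" and "failure points"
# collapsed into one reward number the training process tries to maximise.
def reward(prev_state, state):
    r = state["x_position"] - prev_state["x_position"]  # progress to the right
    if state["lives"] < prev_state["lives"]:             # hit an enemy, fell in a pit...
        r -= 100.0
    if state["flag_reached"]:                            # level cleared
        r += 1000.0
    return r
```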
But for anything image-related, I can't shake the feeling that there is some clever, non-trivial math that would make sense if I could see it explained in layman's terms and illustrated with simple examples (like a rectangle fusing with a circle). In other words, I am not talking about the training itself (isolating the good solutions and rejecting the bad ones) - I am really just talking about the image manipulation side of things. Because sure enough, if it were just a random process at the pixel level it would never lead to any cohesive results...
I guess what I am asking is: starting from two images with identified noses/ears/eyes/mouths, what is the image processing technique used to blend the two? Perhaps the same treatment used in image morphing programs from 10-ish years ago? And then, how are things like "potato" (without identifiable features) thrown into the mix?
Short answer - it literally repaints the image using the transformed data.
For positional stuff (e.g. eye size/position) it transforms UV space - in the same way as a flowmap shader or a warp node in Designer does.
edit: the model is what informs the amount of change at a given pixel
For color it's another space transform - as I'm sure you know, a hue shift requires that you convert RGB to a cylindrical space, rotate it and convert back.
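A rough sketch of both transforms on a float image in [0, 1] - the hue rotation via HSV (cylindrical space) and a UV warp driven by a per-pixel offset field. The offset field here is made up; in the real thing it's what the model would supply:

```python
# Hue rotation (RGB -> HSV -> rotate hue -> RGB) and a UV warp that pulls each
# output pixel from an offset position, like a flowmap/warp node would.
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb
from scipy.ndimage import map_coordinates

def hue_shift(rgb, degrees):
    hsv = rgb_to_hsv(rgb)
    hsv[..., 0] = (hsv[..., 0] + degrees / 360.0) % 1.0   # rotate around the hue axis
    return hsv_to_rgb(hsv)

def uv_warp(img, offset_y, offset_x):
    h, w = img.shape[:2]
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([yy + offset_y, xx + offset_x])      # where each output pixel samples from
    out = np.empty_like(img)
    for c in range(img.shape[2]):                           # warp each channel separately
        out[..., c] = map_coordinates(img[..., c], coords, order=1, mode="nearest")
    return out
```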
Style is a lot more abstract and requires more dimensions - this is where the intermediate space comes in, and I'm unqualified to explain the decisions that go into designing that space. But the output is always going to be a combination of color and UV space transforms applied to the original image; it'll be more like a 90s-style morph effect than anything else.
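And a very rough sketch of that 90s-style morph, reusing the uv_warp helper from the previous snippet and assuming we already have a dense flow field (flow_y, flow_x) that maps features of image A onto the matching features of image B - which is exactly the part the model / feature matching would provide:

```python
# Classic warp + cross-dissolve morph: warp each image part of the way towards
# the other's geometry, then blend the two warped results.
import numpy as np

def morph(img_a, img_b, flow_y, flow_x, t):
    a_warped = uv_warp(img_a, t * flow_y, t * flow_x)                # A pulled towards B
    b_warped = uv_warp(img_b, -(1.0 - t) * flow_y, -(1.0 - t) * flow_x)  # B pulled back towards A
    return (1.0 - t) * a_warped + t * b_warped                       # cross-dissolve
```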
I remember this being quite enlightening:
https://www.youtube.com/watch?v=HLRdruqQfRk&t=1026s