
Ready Player Me with Microsoft Speech SDK #3

Open
sabithpocker opened this issue Jul 28, 2023 · 4 comments

Comments

@sabithpocker

Hey,

First of all, really amazing work. I came here from some of your YouTube videos; very interesting stuff with RPM and Unity.

Following this tutorial, I built an example where I talk to OpenAI directly using the Microsoft Speech SDK.

Most of the work is done, but my lip sync is not that great. Microsoft gives me the lip sync data as arrays of blend shape frames with a FrameIndex:

{
    "FrameIndex":0,
    "BlendShapes":[
        [0.021,0.321,...,0.258],
        [0.045,0.234,...,0.288],
        ...
    ]
}

https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?pivots=programming-language-csharp&tabs=3dblendshapes#viseme-id

Can you suggest a way to use it with the Ready Player Me model?

Please ignore this if this is not something that interests you!!

@srcnalt
Owner

srcnalt commented Jul 28, 2023

Hi @sabithpocker, this project was built to work directly with Google MediaPipe. But I have access to Azure at the moment, so I think I can try this. There might be some mapping differences; if MS provides a way to map these array elements to blend shapes, that should be easy.

@sabithpocker
Author

sabithpocker commented Aug 8, 2023

Here is some relevant code if you want to take a quick look:

  useFrame(() => {
    if (audioPlaying && player && masterViseme && masterViseme.length > 0) {
      if (player.privIsPaused) {
        player.resume();
      }
      // Map the current audio time to an index into the collected blend shape frames.
      blendShapeFrame = Math.round(audioFrametoBlendShapeFrame(player.currentTime, 0, duration.duration, 0, masterViseme.length));
      // Apply the frame, or fall back to a neutral face if the index has no frame.
      headMesh[0].morphTargetInfluences = masterViseme[blendShapeFrame] && masterViseme[blendShapeFrame].length > 0 ? masterViseme[blendShapeFrame] : Array(52).fill(0);
    }
  });

VISEME RECEIVED EVENT

    synthesizer.visemeReceived = function (s: any, e: any) {
      // BlendShapes is an array of frames; each frame is an array of weights.
      let animationData: {BlendShapes: number[][], FrameIndex: number} = JSON.parse(e.animation);
      masterViseme.push(...animationData.BlendShapes);
    };
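
A minimal sketch of an alternative way to collect the frames, in case events ever arrive out of order or get dropped. It assumes the payload has the shape shown in the sample response in this thread, and that `FrameIndex` is the position of the chunk's first frame in the overall sequence (worth checking against the Azure documentation). `BlendShapeChunk` and `appendChunk` are hypothetical names, not part of the Speech SDK.

```typescript
// Assumed payload shape, based on the sample response in this thread.
interface BlendShapeChunk {
  FrameIndex: number;      // assumed: index of the first frame in this chunk
  BlendShapes: number[][]; // one array of weights per frame
}

// Hypothetical helper: parse one event payload and write its frames into
// `timeline` at the position given by FrameIndex, instead of blindly pushing.
function appendChunk(timeline: number[][], animationJson: string): number[][] {
  const chunk = JSON.parse(animationJson) as BlendShapeChunk;
  chunk.BlendShapes.forEach((frame, i) => {
    timeline[chunk.FrameIndex + i] = frame;
  });
  return timeline;
}
```

Inside the handler this would be called as `appendChunk(masterViseme, e.animation)`.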

Sample response for blendshapes:

{"FrameIndex":249,"BlendShapes":[[0.423,0.215,0,0.008,0,0.208,0,0.423,0.214,0.119,0,0,0.208,0,0.05,0.021,0,0.172,0.132,0.116,0.065,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.076,0.076,0.106,0,0,0.016,0.041,0.044,0.029,0.029,0,0.015,0,0.005],[0.502,0.282,0,0.002,0,0.222,0,0.502,0.281,0.112,0,0,0.223,0,0.05,0.021,0,0.172,0.133,0.116,0.066,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.074,0.074,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0,0.006],[0.464,0.247,0,0.011,0,0.23,0,0.464,0.247,0.122,0,0,0.23,0,0.05,0.021,0,0.173,0.133,0.116,0.067,0.008,0.003,0.005,0.015,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.017,0.001,0.006],[0.35,0.186,0,0.012,0,0.234,0,0.35,0.186,0.123,0,0,0.234,0,0.05,0.021,0,0.173,0.133,0.117,0.067,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.043,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.004],[0.229,0.12,0,0.017,0,0.233,0,0.229,0.119,0.128,0,0,0.233,0,0.05,0.021,0,0.173,0.134,0.117,0.068,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.039,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.114,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.003],[0.142,0.063,0,0.027,0,0.225,0,0.143,0.063,0.139,0,0,0.225,0,0.05,0.021,0,0.174,0.134,0.117,0.069,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.178,0.173,0.015,0.015,0.072,0.072,0.113,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.002],[0.103,0.032,0,0.022,0,0.213,0,0.103,0.032,0.134,0,0,0.213,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.005,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.072,0.072,0.111,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0.001],[0.072,0.012,0,0.019,0,0.203,0,0.072,0.012,0.131,0,0,0.203,0,0.05,0.021,0,0.174,0.135,0.117,0.07,0.008,0.003,0.006,0.014,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.073,0.073,0.108,0,0,0.016,0.041,0.044,0.029,0.029,0,0.019,0,0],[0.04,0.001,0,0.016,0,0.195,0,0.04,0.001,0.128,0,0,0.195,0,0.05,0.021,0,0.175,0.136,0.117,0.071,0.008,0.003,0.006,0.015,0.018,0.012,0.042,0.038,0.092,0.074,0.055,0.044,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.075,0.075,0.104,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,0,0],[0.022,0,0,0.016,0.001,0.188,0,0.022,0,0.128,0,0.001,0.188,0,0.05,0.021,0,0.175,0.137,0.116,0.073,0.008,0.003,0.007,0.016,0.018,0.012,0.042,0.039,0.092,0.074,0.056,0.045,0.014,0.075,0.017,0.018,0.177,0.172,0.015,0.015,0.08,0.08,0.099,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.012,0,0,0.012,0.002,0.182,0,0.012,0,0.125,0,0.002,0.182,0,0.05,0.021,0,0.178,0.14,0.116,0.076,0.008,0.003,0.007,0.016,0.019,0.013,0.042,0.039,0.092,0.074,0.057,0.046,0.014,0.075,0.017,0.018,0.177,0.171,0.015,0.015,0.085,0.085,0.096,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0],[0.007,0,0,0.01,0.005,0.177,0,0.007,0,0.123,0,0.005,0.178,0,0.05,0.021,0,0.178,0.141,0.117,0.077,0.008,0.003,0.007,0.015,0.019,0.013,0.042,0.038,0.092,0.074,0.057,0.045,0.014,0.075,0.017,0.018,0.176,0.171,0.015,0.015,0.088,0.088,0.093,0,0,0.016,0.041,0.044,0.029,0.029,0,0.018,-0,-0]]}

I'm trying to distribute the received blend shape frames across the duration of the audio:

// Linearly maps a value from the audio range [audioMin, audioMax] to the frame range [blendFrameMin, blendFrameMax].
const audioFrametoBlendShapeFrame = (audioFrame: number, audioMin = 0, audioMax: number, blendFrameMin = 0, blendFrameMax: number): number => {
  return (audioFrame - audioMin) * (blendFrameMax - blendFrameMin) / (audioMax - audioMin) + blendFrameMin;
}
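
One thing to watch with the mapping above: `Math.round` over the range `[0, masterViseme.length]` can return `masterViseme.length` near the end of the audio, which is one past the last valid index. A minimal sketch of a clamped variant follows; `timeToFrameIndex` is a hypothetical helper name, and it assumes the frames are spread evenly over the audio duration.

```typescript
// Hypothetical helper: map playback time to a frame index that is always
// inside [0, frameCount - 1]. Assumes frames are evenly spaced over `duration`.
function timeToFrameIndex(currentTime: number, duration: number, frameCount: number): number {
  if (frameCount <= 0 || duration <= 0) return 0;
  // floor keeps each frame active for its whole time slot
  const raw = Math.floor((currentTime / duration) * frameCount);
  // clamp so early or late timestamps never index outside the array
  return Math.min(Math.max(raw, 0), frameCount - 1);
}
```

If the frames actually arrive at a fixed rate (the Azure viseme documentation describes 60 FPS for blend shapes, which is worth verifying), `Math.floor(player.currentTime * 60)` with the same clamp may track the audio more closely than stretching the frames across the duration.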

@srcnalt
Owner

srcnalt commented Aug 9, 2023

Thanks for the details. I took a look at the Azure blendshapes, and it seems they map better onto the Viseme blendshapes than the ARKit ones.

Sadly, this is not going to be 100% accurate, but it should help:
https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-speech-synthesis-viseme?tabs=visemeid&pivots=programming-language-csharp

Also, you can pass group names as the morphTargets value to shorten the URL:
https://docs.readyplayer.me/ready-player-me/api-reference/rest-api/avatars/get-3d-avatars#examples-7

// Key: Azure viseme ID (0-21). Value: index of the target viseme blendshape (named in each comment).
public static Dictionary<int, int> VisemeMap = new Dictionary<int, int>()
{
    {0, 0},   // viseme_sil
    {1, 10},  // viseme_aa
    {2, 10},  // viseme_aa
    {3, 13},  // viseme_O
    {4, 11},  // viseme_E
    {5, 11},  // viseme_E
    {6, 12},  // viseme_I
    {7, 14},  // viseme_U
    {8, 13},  // viseme_O
    {9, 10},  // viseme_aa
    {10, 13}, // viseme_O
    {11, 10}, // viseme_aa
    {12, 3},  // viseme_TH
    {13, 13}, // viseme_O
    {14, 12}, // viseme_I
    {15, 7},  // viseme_SS
    {16, 6},  // viseme_CH
    {17, 4},  // viseme_DD
    {18, 2},  // viseme_FF
    {19, 8},  // viseme_nn
    {20, 5},  // viseme_kk
    {21, 1},  // viseme_PP
};
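
For the React/three.js setup used earlier in this thread, the same mapping could be applied roughly as sketched below. This is a sketch under assumptions, not a tested integration: the `VISEME_NAMES` order assumes the usual Oculus viseme ordering, with names as exposed on Ready Player Me avatars; `MorphMesh` is a stand-in for the two morph target fields of a three.js mesh; and `applyAzureViseme` is a hypothetical helper. It looks targets up by name rather than by raw index, since the index of a morph target on the mesh depends on which groups were requested.

```typescript
// Assumed viseme blendshape names, in the index order the mapping above refers to.
const VISEME_NAMES: string[] = [
  "viseme_sil", "viseme_PP", "viseme_FF", "viseme_TH", "viseme_DD",
  "viseme_kk", "viseme_CH", "viseme_SS", "viseme_nn", "viseme_RR",
  "viseme_aa", "viseme_E", "viseme_I", "viseme_O", "viseme_U",
];

// Azure viseme ID (array position) -> index into VISEME_NAMES, copied from the C# map.
const AZURE_TO_VISEME_INDEX: number[] = [
  0, 10, 10, 13, 11, 11, 12, 14, 13, 10, 13, 10, 3, 13, 12, 7, 6, 4, 2, 8, 5, 1,
];

// Stand-in for the morph target fields of a three.js mesh.
interface MorphMesh {
  morphTargetDictionary?: Record<string, number>;
  morphTargetInfluences?: number[];
}

// Hypothetical helper: clear all viseme targets, then activate the one mapped
// from the Azure viseme ID. Returns false if the mesh lacks the target.
function applyAzureViseme(mesh: MorphMesh, azureVisemeId: number, weight = 1): boolean {
  const dict = mesh.morphTargetDictionary;
  const influences = mesh.morphTargetInfluences;
  if (!dict || !influences) return false;

  // Reset every viseme target that exists on this mesh.
  for (const name of VISEME_NAMES) {
    const i = dict[name];
    if (i !== undefined) influences[i] = 0;
  }

  const visemeIndex = AZURE_TO_VISEME_INDEX[azureVisemeId];
  if (visemeIndex === undefined) return false;
  const name = VISEME_NAMES[visemeIndex];
  const target = name === undefined ? undefined : dict[name];
  if (target === undefined) return false;

  influences[target] = weight;
  return true;
}
```

Snapping the weight to 1 per viseme will look abrupt; blending the weight over a few frames is a likely next step.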
