
Lip Sync System

This page details the lip-sync system used in the application to animate the avatar's mouth movements in synchronization with audio. It covers the technical aspects of the implementation, including viseme data generation, viseme-to-blend-shape mapping, and integration with the avatar animation system.

Overview

The lip-sync system uses viseme data generated from audio clips to animate the avatar's mouth movements. It leverages the morph targets (blend shapes) provided by Ready Player Me avatars to create realistic lip movements synchronized with speech.

Viseme Data Generation

We use Rhubarb Lip Sync to generate viseme data from audio clips. The output is a JSON file containing timing information for each viseme.

Example of generated viseme data:

{
  "mouthCues": [
    {"start": 0.00, "end": 0.05, "value": "X"},
    {"start": 0.05, "end": 0.10, "value": "A"},
    // ... more cues ...
  ],
  "metadata": {
    "duration": 1.5
  }
}
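
For reference, the viseme data can be loaded and typed on the client before playback. The sketch below is a minimal example, assuming the JSON file is served alongside the audio clip; the MouthCue and LipsyncData type names and the loadLipsyncData helper are illustrative, not part of the actual codebase:

// Hypothetical types mirroring Rhubarb's JSON output
interface MouthCue {
  start: number; // cue start time in seconds
  end: number;   // cue end time in seconds
  value: string; // Rhubarb viseme label: 'A'–'H' or 'X'
}

interface LipsyncData {
  mouthCues: MouthCue[];
  metadata: { duration: number };
}

// Fetch a pre-generated viseme file (path and helper name are examples)
const loadLipsyncData = async (url: string): Promise<LipsyncData> => {
  const response = await fetch(url);
  return (await response.json()) as LipsyncData;
};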

Viseme to Blend Shape Mapping

Ready Player Me avatars come with a set of blend shapes compatible with the Oculus LipSync SDK. We map Rhubarb's visemes to these blend shapes:

const visemeToBlendShape: { [key: string]: string[] } = {
  'X': ['viseme_sil'],
  'A': ['viseme_PP'],
  'B': ['viseme_kk'],
  'C': ['viseme_I'],
  'D': ['viseme_aa'],
  'E': ['viseme_O'],
  'F': ['viseme_U'],
  'G': ['viseme_FF'],
  'H': ['viseme_TH'],
};

Available blend shapes include:

  • Basic visemes: viseme_sil, viseme_PP, viseme_FF, viseme_TH, viseme_DD, viseme_kk, viseme_CH, viseme_SS, viseme_nn, viseme_RR, viseme_aa, viseme_E, viseme_I, viseme_O, viseme_U
  • Additional shapes: mouthOpen, mouthSmile, eyesClosed, eyesLookUp, eyesLookDown
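
To verify which blend shapes a particular avatar actually exposes, the morph target dictionaries of its meshes can be inspected at runtime. A minimal sketch, assuming the loaded GLTF scene is available as avatarScene:

// List the morph targets exposed by each mesh in the avatar
avatarScene.traverse((object) => {
  const mesh = object as THREE.Mesh;
  if (mesh.isMesh && mesh.morphTargetDictionary) {
    console.log(mesh.name, Object.keys(mesh.morphTargetDictionary));
  }
});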

Implementation Details

The lip-sync animation is implemented in the animateLipsync function:

const animateLipsync = (delta: number) => {
  if (!lipsyncDataRef.current || !isLipsyncPlaying || lipsyncStartTimeRef.current === null) {
    return;
  }

  const currentTime = performance.now() / 1000 - lipsyncStartTimeRef.current;
  const currentCue = lipsyncDataRef.current.mouthCues.find((cue: any) => 
    currentTime >= cue.start && currentTime < cue.end
  );

  if (currentCue && currentCue !== currentCueRef.current) {
    const blendShapes = visemeToBlendShape[currentCue.value];
    if (blendShapes) {
      Object.values(visemeToBlendShape).flat().forEach(shape => {
        targetValuesRef.current[shape] = blendShapes.includes(shape) ? 1 : 0;
      });
    }
    currentCueRef.current = currentCue;
    lerpFactorRef.current = 0;
  }

  // Interpolate between current and target values
  lerpFactorRef.current = Math.min(lerpFactorRef.current + delta * 5, 1);
  const shapes = Object.keys(targetValuesRef.current);
  const values = shapes.map(shape => {
    const current = currentValuesRef.current[shape] || 0;
    const target = targetValuesRef.current[shape] || 0;
    const value = THREE.MathUtils.lerp(current, target, lerpFactorRef.current);
    currentValuesRef.current[shape] = value;
    return value;
  });

  setBlendShapes(shapes, values);
};

This function is called every frame when lip-sync is active, updating the blend shape values based on the current viseme.
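
The setBlendShapes helper referenced above is not reproduced in full here. A rough sketch of such a helper, assuming an avatarRef that points at the loaded avatar scene and omitting the special teeth handling described below, could look like this:

// Hypothetical helper: write blend shape values onto every morph-target mesh
const setBlendShapes = (shapes: string[], values: number[]) => {
  avatarRef.current?.traverse((object) => {
    const mesh = object as THREE.Mesh;
    if (!mesh.isMesh || !mesh.morphTargetDictionary || !mesh.morphTargetInfluences) {
      return;
    }
    shapes.forEach((shape, index) => {
      const morphIndex = mesh.morphTargetDictionary![shape];
      if (morphIndex !== undefined) {
        mesh.morphTargetInfluences![morphIndex] = values[index];
      }
    });
  });
};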

Teeth Animation

For teeth animation, we use a simplified approach:

if (meshName === 'Wolf3D_Teeth') {
  // For teeth, only use 'mouthOpen'
  const mouthOpenIndex = mesh.morphTargetDictionary?.['mouthOpen'];
  if (mouthOpenIndex !== undefined && mesh.morphTargetInfluences) {
    let mouthOpenValue = 0;
    shapes.forEach((shape, index) => {
      if (teethMovingVisemes.includes(shape)) {
        mouthOpenValue = Math.max(mouthOpenValue, values[index]);
      }
    });
    mesh.morphTargetInfluences[mouthOpenIndex] = mouthOpenValue;
  }
}

This approach uses the 'mouthOpen' morph target for teeth movement, as most visemes don't significantly affect teeth visibility.
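
The teethMovingVisemes array referenced above holds the visemes that open the jaw enough for the teeth to be visible. Its exact contents are not shown in the snippet; an illustrative definition (the membership here is an assumption) might be:

// Assumed contents: open-mouth visemes that should drive the teeth
const teethMovingVisemes = ['viseme_aa', 'viseme_E', 'viseme_I', 'viseme_O', 'viseme_U'];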

Synchronization with Audio

Audio synchronization is achieved by using the timing information from the viseme data:

lipsyncStartTimeRef.current = performance.now() / 1000;

We use this start time to calculate the current position in the audio playback and apply the appropriate viseme.
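
In practice, the start time is recorded at the moment audio playback begins, so that the elapsed time lines up with the cue timestamps. A minimal sketch, where audioRef and setIsLipsyncPlaying are assumed names:

// Start audio playback and lip-sync timing together
audioRef.current?.play();
lipsyncStartTimeRef.current = performance.now() / 1000;
setIsLipsyncPlaying(true);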

Extensibility

The current system can be adapted to use different viseme sets or audio processing methods:

  1. Modify the visemeToBlendShape mapping to accommodate new viseme sets.
  2. Update the viseme data loading and parsing if using a different audio processing tool.
  3. Adjust the animateLipsync function to handle different data formats if necessary.

Performance Considerations

The lip-sync system is designed to be performant:

  1. It uses efficient lerping between viseme states.
  2. Blend shape calculations are optimized to run every frame without significant overhead.
  3. The system only updates when audio is playing, reducing unnecessary computations.

Integration with Other Animations

The lip-sync system is designed to work seamlessly with other facial and full-body animations:

  1. Lip-sync blend shapes are applied independently of other animations.
  2. The useFrame hook from React Three Fiber ensures that lip-sync updates are synchronized with the render loop:

useFrame((state, delta) => {
  if (mixerRef.current) {
    mixerRef.current.update(delta);
  }
  if (isLipsyncPlaying) {
    animateLipsync(delta);
  }
});

This approach allows lip-sync to be active during any full-body animation without conflicts.
