This New Mocha Wan Model is INSANE (ComfyUI → Promptus Workflow Tutorial)

Imagine a world where you can seamlessly swap the main character in any video, while preserving the original lighting, expressions, and movements with uncanny accuracy. Sounds like science fiction, right? Well, it's not anymore. Enter Mocha, a groundbreaking open-source AI model that's poised to revolutionize video editing and content creation.

In this comprehensive guide, we'll dive deep into the capabilities of Mocha (also known as Mocha Wan), exploring its potential to replace human actors in various projects, from short films to creative videos. We'll demonstrate how to harness its power within the popular ComfyUI environment, leveraging the intuitive workflow provided by Promptus. Get ready to witness AI in action and discover how you can integrate this game-changing technology into your own creative process – all using free tools!

This post will cover:

• What Mocha (Wan) is and why it's a game changer
• Installing and running Mocha in Promptus + ComfyUI, step by step
• Side-by-side demos of real vs. AI-swapped footage
• The workflow setup for accurate facial and hand tracking

What is Mocha (Wan) and Why is it a Game Changer?

Mocha, developed by the Orange-3DV-Team, is a cutting-edge AI model designed for video character replacement. It leverages advanced deep learning techniques to analyze video footage, identify the target character, and seamlessly replace them with a new character or identity, all while maintaining visual consistency.

But what makes Mocha so revolutionary?

The core principle behind Mocha's success lies in its ability to:

  1. Analyze the original video: Mocha first analyzes the video to understand the scene's lighting, camera angles, and the character's movements.
  2. Track facial and body features: Advanced tracking algorithms identify and track key facial features and body movements of the target character.
  3. Generate a new character or identity: Based on the user's input, Mocha generates a new character or identity that matches the style and characteristics of the original video.
  4. Seamlessly integrate the new character: The new character is then seamlessly integrated into the video, with careful attention paid to lighting, shadows, and reflections to ensure a realistic appearance.
  5. Maintain consistency: Through its tracking data, Mocha can maintain the original expressions, head movements, and body language, ensuring the new character acts naturally within the scene.

Installing and Running Mocha in Promptus + ComfyUI: A Step-by-Step Guide

Now, let's get our hands dirty! This section will guide you through the process of installing and running Mocha within ComfyUI using the Promptus workflow.

Prerequisites:

• A working ComfyUI installation, launched through Promptus (see the setup guide linked at the end of this post)
• The ComfyUI Manager (optional, but it makes installing custom nodes much easier)
• A source video containing the character you want to replace
• A high-quality image of the replacement face or identity

Step 1: Installing the Required Custom Nodes

Unfortunately, as of this writing, there isn't a dedicated ComfyUI node specifically labeled "Mocha." However, Mocha's functionality relies heavily on other tools and techniques widely available in ComfyUI, particularly those related to face swapping, video processing, and image manipulation. Therefore, we'll be using existing ComfyUI nodes to achieve the Mocha-like effect.

This will involve using nodes like:

• A "Load Video" node and frame-extraction nodes to process the clip frame by frame
• A face detection node (e.g., from ComfyUI-FaceDetailer or similar)
• A face swap node (e.g., from ComfyUI-ReActor)
• Image compositing nodes to blend the swapped face back into each frame
• Optional ControlNet and VAE encode/decode nodes when Stable Diffusion is involved

Important: You'll likely need to install several custom nodes to achieve the desired Mocha-like effect. Use the ComfyUI Manager (if you have it installed) or manually clone the necessary repositories from GitHub into your ComfyUI/custom_nodes directory.
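
For the manual route, the usual pattern is to clone each node pack into ComfyUI/custom_nodes, install its Python dependencies, and restart ComfyUI. A rough sketch (the ReActor repository URL is one common example; always confirm the correct URL in the node pack's own README):

```bash
# From your ComfyUI root directory. The repository URL below is illustrative --
# check each node pack's README for its current location.
cd ComfyUI/custom_nodes
git clone https://github.com/Gourieff/comfyui-reactor-node.git
cd comfyui-reactor-node
pip install -r requirements.txt   # then restart ComfyUI
```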

Step 2: Setting up the ComfyUI Workflow in Promptus

  1. Open ComfyUI through Promptus: Launch ComfyUI using the Promptus interface. This provides a more organized and user-friendly environment for building your workflow.
  2. Create a New Workflow: Start with a blank canvas in ComfyUI.
  3. Load the Video: Add a "Load Video" node to your workflow. Configure it to load the video file containing the character you want to replace.
  4. Frame Extraction: Add nodes to extract individual frames from the video. You might need a "Frame Sampler" or similar node to process the video frame by frame.
  5. Face Detection: Add a face detection node (e.g., from ComfyUI-FaceDetailer or similar). Connect the output of the frame extraction node to the input of the face detection node. Configure the face detection node to accurately identify the face of the target character in each frame.
  6. Face Analysis (Optional): If you want more control over the expressions and features of the replaced face, you can add nodes to analyze the face and extract key landmarks or facial features.
  7. Face Swapping: Add a face swap node (e.g., from ComfyUI-ReActor or another face swap custom node).
    • Connect the output of the face detection node (containing the detected face) to the input of the face swap node.
    • Provide the face swap node with the image of the person you want to use as the replacement. This could be a static image or even a dynamically generated image from a Stable Diffusion model.
  8. Image Compositing: Add image compositing nodes to seamlessly blend the swapped face back into the original frame. Pay attention to color correction, lighting, and blending modes to ensure a natural look.
  9. Video Encoding: Add nodes to re-encode the processed frames back into a video file. You'll need to specify the output format, codec, and frame rate.
  10. ControlNet Integration (Optional but Recommended): Incorporating ControlNet can significantly improve the consistency and stability of the face swap. Use ControlNet models like "ControlNet Tile" or "ControlNet Face" to guide the image generation process and ensure that the swapped face aligns with the original character's head position, pose, and expressions.
  11. VAE Encode/Decode: Ensure you are properly encoding and decoding images between pixel space and latent space where necessary, especially when using Stable Diffusion models for generating the replacement face.
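
For readers who prefer scripting, here is a rough sketch of what a graph like this looks like when submitted to ComfyUI's HTTP API (a POST to /prompt on the default port 8188). The class_type names below (VHS_LoadVideo, ReActorFaceSwap, VHS_VideoCombine) are assumptions based on common custom-node packs; check the exact names exposed by the nodes installed in your ComfyUI:

```python
import json
import urllib.request

# Minimal workflow in ComfyUI's API format. The class_type values are
# assumptions based on the VideoHelperSuite and ReActor node packs; verify
# them against the nodes actually installed in your ComfyUI.
workflow = {
    "1": {"class_type": "VHS_LoadVideo",
          "inputs": {"video": "source_clip.mp4", "frame_load_cap": 0}},
    "2": {"class_type": "LoadImage",
          "inputs": {"image": "replacement_face.png"}},
    "3": {"class_type": "ReActorFaceSwap",
          "inputs": {"input_image": ["1", 0],     # frames from the video loader
                     "source_image": ["2", 0]}},  # the replacement identity
    "4": {"class_type": "VHS_VideoCombine",
          "inputs": {"images": ["3", 0], "frame_rate": 24,
                     "filename_prefix": "mocha_swap"}},
}

# Queue the graph exactly as the "Queue Prompt" button would.
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```

The same graph can of course be built visually on the Promptus/ComfyUI canvas; the API form is just convenient for batch processing.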

Step 3: Configuring the Nodes

This is where the magic happens. You'll need to carefully configure each node to achieve the desired results. Here are some key considerations:

• Face Detection Accuracy: Experiment with different face detection models and parameters to ensure accurate and reliable face detection in all frames of the video (see the sketch after this list for a quick way to test this).
• Face Swap Model Selection: Choose a face swap model that produces realistic and high-quality results. Consider factors like model size, training data, and performance.
• Image Compositing Techniques: Experiment with different blending modes, color correction techniques, and masking strategies to seamlessly integrate the swapped face into the original frame.
• ControlNet Settings: Fine-tune the ControlNet settings to achieve the desired level of control and consistency. Pay attention to the ControlNet strength and the specific ControlNet model being used.
• Prompting and Seed Values: If you're using Stable Diffusion to generate the replacement face, experiment with different prompts and seed values to achieve the desired appearance and style.
• Frame-by-Frame Adjustments: In some cases, you may need to make frame-by-frame adjustments to the face swap or image compositing to address any inconsistencies or artifacts.
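
As a concrete example of the first point, here is a minimal sketch that checks how reliably faces are detected across a handful of frames at different confidence thresholds, using MediaPipe's face detector (the file name and threshold values are placeholders):

```python
import cv2
import mediapipe as mp

# Grab a sample of frames from the clip to test detection reliability
# before committing to a full workflow run.
cap = cv2.VideoCapture("source_clip.mp4")
frames = []
while len(frames) < 30:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()

for conf in (0.3, 0.5, 0.7):
    detector = mp.solutions.face_detection.FaceDetection(
        model_selection=1,              # 1 = full-range model, better for small faces
        min_detection_confidence=conf)
    hits = sum(
        1 for f in frames
        if detector.process(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)).detections)
    print(f"confidence {conf}: face found in {hits}/{len(frames)} frames")
```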

Step 4: Running the Workflow

Once you've configured all the nodes, click the "Queue Prompt" button in ComfyUI (or the equivalent button in Promptus) to start the workflow. ComfyUI will process the video frame by frame, performing the face swap and image compositing operations.

Step 5: Reviewing and Refining the Results

After the workflow has finished running, review the output video carefully. Look for any inconsistencies, artifacts, or areas that need improvement. You may need to adjust the node settings and re-run the workflow multiple times to achieve the desired results.

Example Workflow (Simplified):

[Load Video] --> [Frame Sampler] --> [Face Detection] --> [Face Swap] --> [Image Compositing] --> [Video Encode]

Important Notes:

• This is a complex process, and achieving good results may require significant experimentation and fine-tuning.
• The specific nodes and parameters you need will depend on the characteristics of your video and the desired outcome.
• Be prepared to troubleshoot and debug your workflow as you go.
• Refer to the documentation and tutorials for the individual ComfyUI nodes you are using for more detailed information.

Side-by-Side Demos of Real vs AI-Swapped Footage: Seeing is Believing!

While we can't embed interactive demos in this text-based format, we can describe the kind of results you can expect and what to look for when evaluating the quality of the AI-swapped footage.

What to Look For:

• Consistency: Does the swapped face maintain a consistent appearance throughout the video? Are there any noticeable changes in skin tone, lighting, or facial features? (A simple way to put a number on this is sketched after this list.)
• Realism: Does the swapped face look natural and believable? Does it blend seamlessly with the original video footage?
• Expression Matching: Does the swapped face accurately reflect the expressions and emotions of the original character?
• Motion Tracking: Does the swapped face track the movements of the original character's head and face accurately? Are there any noticeable jitters or distortions?
• Lighting and Shadows: Does the swapped face interact realistically with the lighting and shadows in the scene?
• Artifacts: Are there any noticeable artifacts, such as blurring, ghosting, or color bleeding?
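
If you want a number to go with the consistency check rather than eyeballing it, a minimal sketch like the following can help. It assumes you have exported the swapped frames and have a face bounding box per frame (both input names are hypothetical):

```python
import cv2
import numpy as np

def face_crop_drift(frames, boxes, size=(128, 128)):
    """Mean absolute pixel difference between consecutive face crops.

    frames: list of BGR frames; boxes: matching list of (x, y, w, h) face boxes.
    A sudden spike suggests a flickering or inconsistent swap at that frame.
    """
    crops = []
    for frame, (x, y, w, h) in zip(frames, boxes):
        crop = cv2.resize(frame[y:y + h, x:x + w], size)
        crops.append(crop.astype(np.float32))
    return [float(np.abs(a - b).mean()) for a, b in zip(crops, crops[1:])]

# Usage sketch: flag frames whose drift exceeds 3x the median.
# drift = face_crop_drift(frames, boxes)
# bad = [i + 1 for i, d in enumerate(drift) if d > 3 * np.median(drift)]
```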

Typical Results:

With careful configuration and fine-tuning, you can achieve surprisingly realistic results. The best results are typically obtained when:

• The original video has good lighting and stable camera angles.
• The replacement face is of high quality and matches the style of the original video.
• The face swap model is well-trained and accurate.
• The image compositing is done carefully and with attention to detail.

Potential Challenges:

• Occlusion: When the original character's face is partially obscured by objects or other characters, the face swap may be less accurate.
• Extreme Angles: When the original character's face is viewed from extreme angles, the face swap may be distorted.
• Fast Motion: When the original character is moving quickly, the face swap may be blurry or jittery.
• Lighting Changes: Dramatic changes in lighting can make it difficult to maintain consistency in the swapped face.

Despite these challenges, Mocha (through this ComfyUI/Promptus implementation) offers a powerful tool for video character replacement, enabling creators to achieve impressive results with relatively little effort.

The Workflow Setup for Perfect Facial and Hand Tracking

Achieving accurate facial and hand tracking is crucial for seamless character replacement. Here's a breakdown of the key elements and techniques involved:

1. Robust Face Detection and Tracking:

• Choose the Right Model: Select a face detection model that is specifically designed for video and can handle variations in lighting, pose, and expression. Models like RetinaFace or similar advanced detectors are often preferred.
• Implement Tracking Algorithms: Use tracking algorithms, such as Kalman filters or optical flow, to maintain a consistent track of the face throughout the video. This helps to prevent the face detection from "drifting" or losing track of the face (a minimal Kalman-filter sketch follows this list).
• Handle Occlusion: Implement techniques to handle occlusion, such as using multiple face detectors or interpolating the face position when it is temporarily obscured.
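
To make the Kalman-filter suggestion concrete, here is a minimal sketch using OpenCV's cv2.KalmanFilter with a constant-velocity model over the face-box centre. It smooths jitter and coasts on its own prediction when the detector misses a frame:

```python
import cv2
import numpy as np

# Constant-velocity Kalman filter: state is (x, y, dx, dy), measurements are
# the detector's (x, y) face-box centre.
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def track_centres(detections):
    """detections: per-frame (x, y) face centre, or None when detection failed.
    Returns a smoothed centre for every frame, coasting through dropouts."""
    smoothed = []
    for det in detections:
        pred = kf.predict()
        if det is not None:
            est = kf.correct(np.array([[det[0]], [det[1]]], dtype=np.float32))
            smoothed.append((float(est[0, 0]), float(est[1, 0])))
        else:
            # No measurement this frame: fall back to the predicted position.
            smoothed.append((float(pred[0, 0]), float(pred[1, 0])))
    return smoothed
```

The noise covariances control the smoothing-versus-lag trade-off; the values above are starting points to tune per clip.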

2. Precise Facial Landmark Detection:

• High-Resolution Landmarks: Use a facial landmark detection model that provides a high number of landmarks (e.g., 68 landmarks or more). This allows for more accurate and detailed tracking of facial features.
• Robustness to Expression: Choose a landmark detection model that is robust to variations in facial expression.
• Calibration: Calibrate the landmark detection model to the specific characteristics of your video. This can involve training the model on a dataset of images that are similar to your video.

3. Hand Tracking and Pose Estimation:

• Dedicated Hand Tracking Models: Utilize dedicated hand tracking models like MediaPipe Hands or similar to accurately detect and track hands in the video (a worked MediaPipe example appears later in this section).
• Pose Estimation: Incorporate pose estimation models to understand the overall body pose of the character. This can help to improve the accuracy of hand tracking, especially when the hands are interacting with the body.
• 3D Hand Reconstruction: Consider using 3D hand reconstruction techniques to estimate the 3D pose of the hands. This can be particularly useful for creating realistic hand interactions with virtual objects.

4. Data Smoothing and Filtering:

• Apply Smoothing Filters: Apply smoothing filters, such as Kalman filters or moving average filters, to the tracking data to reduce noise and jitter (see the sketch after this list).
• Outlier Removal: Implement techniques to identify and remove outliers from the tracking data.
• Temporal Consistency: Ensure that the tracking data is temporally consistent, meaning that the movements are smooth and natural over time.
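
For the simplest case, an exponential moving average over the landmark coordinates already removes most frame-to-frame jitter. A minimal sketch (the alpha value is just a starting point):

```python
import numpy as np

def smooth_landmarks(landmark_seq, alpha=0.6):
    """Exponential moving average over per-frame landmark arrays.

    landmark_seq: array of shape (frames, n_landmarks, 2). Lower alpha means
    heavier smoothing but more lag; tune it per clip.
    """
    landmark_seq = np.asarray(landmark_seq, dtype=np.float32)
    out = np.empty_like(landmark_seq)
    out[0] = landmark_seq[0]
    for t in range(1, len(landmark_seq)):
        out[t] = alpha * landmark_seq[t] + (1 - alpha) * out[t - 1]
    return out
```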

5. Integration with Character Replacement:

• Map Tracking Data to Character: Map the tracking data from the original character to the replacement character. This involves aligning the coordinate systems and scaling the movements appropriately (a minimal alignment sketch follows this list).
• Retargeting: Use retargeting techniques to transfer the movements of the original character to the replacement character.
• Blending: Blend the movements of the original character with the movements of the replacement character to create a seamless transition.
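
One common way to do the coordinate alignment is a similarity transform (rotation, uniform scale, translation) estimated between corresponding landmarks, for example with OpenCV's estimateAffinePartial2D. A minimal sketch, with all three input names hypothetical:

```python
import cv2
import numpy as np

def retarget_landmarks(src_landmarks, dst_reference, frame_landmarks):
    """Map per-frame motion from the original character onto the replacement.

    src_landmarks: (n, 2) neutral-pose landmarks of the original character.
    dst_reference: (n, 2) matching landmarks on the replacement character.
    frame_landmarks: (n, 2) original-character landmarks for the current frame.
    Returns the replacement character's landmarks for this frame.
    """
    # Similarity transform (rotation + uniform scale + translation) aligning
    # the original neutral pose to the replacement's coordinate system.
    M, _ = cv2.estimateAffinePartial2D(
        src_landmarks.astype(np.float32), dst_reference.astype(np.float32))
    pts = np.hstack([frame_landmarks, np.ones((len(frame_landmarks), 1))])
    return (pts @ M.T).astype(np.float32)
```

Note this transfers global head motion only; expression detail still has to come from the face swap or landmark-driven deformation.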

Practical Example: Using MediaPipe Hands with ComfyUI

While a direct MediaPipe Hands node might not be readily available, you can integrate MediaPipe functionality through custom Python nodes within ComfyUI. This would involve:

  1. Installing MediaPipe: pip install mediapipe
  2. Creating a Custom Node: Write a Python script that uses MediaPipe Hands to detect and track hands in the video frames. This script takes frames as input and outputs the hand landmark coordinates.
  3. Integrating the Node in ComfyUI: Place the script in your ComfyUI/custom_nodes directory and expose the class through a NODE_CLASS_MAPPINGS dictionary so ComfyUI picks it up and lists it in the node menu. Connect the output of the frame extraction node to the input of your custom node (a minimal sketch follows this list).
  4. Using the Hand Landmarks: Use the hand landmark coordinates to control the movement of virtual hands or objects in the scene. You can use these coordinates to drive the animation of a 3D hand model or to manipulate the position and orientation of virtual objects.
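
Here is a minimal sketch of what steps 2 and 3 might look like as an actual custom node. The class name, the STRING return type, and the assumption that ComfyUI IMAGE tensors are float RGB in [0, 1] with shape batch x height x width x channels are all ours; treat this as a starting point, not a finished node:

```python
import mediapipe as mp
import numpy as np

class HandLandmarkNode:
    """Hypothetical ComfyUI custom node wrapping MediaPipe Hands.

    Takes a batch of frames (assumed float RGB [0, 1], BHWC) and returns
    normalized hand-landmark coordinates per frame as a string, which keeps
    the sketch simple; a real node might define a custom output type.
    """
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"images": ("IMAGE",)}}

    RETURN_TYPES = ("STRING",)
    FUNCTION = "track"
    CATEGORY = "tracking"

    def track(self, images):
        hands = mp.solutions.hands.Hands(
            static_image_mode=False, max_num_hands=2,
            min_detection_confidence=0.5)
        all_frames = []
        for img in images:
            # Convert the float [0, 1] tensor to the uint8 RGB MediaPipe expects.
            rgb = (img.cpu().numpy() * 255).astype(np.uint8)
            result = hands.process(rgb)
            frame_hands = []
            if result.multi_hand_landmarks:
                for hand in result.multi_hand_landmarks:
                    frame_hands.append(
                        [(lm.x, lm.y, lm.z) for lm in hand.landmark])
            all_frames.append(frame_hands)
        hands.close()
        return (repr(all_frames),)

# Registration hook that ComfyUI scans for in custom_nodes/:
NODE_CLASS_MAPPINGS = {"HandLandmarkNode": HandLandmarkNode}
```

Dropping this file into ComfyUI/custom_nodes and restarting ComfyUI should make the node appear in the menu under the "tracking" category.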

By implementing these techniques, you can achieve highly accurate and realistic facial and hand tracking, which is essential for seamless and believable character replacement.

Conclusion: The Future of Video Editing is Here

Mocha, combined with the power of ComfyUI and the user-friendliness of Promptus, represents a significant leap forward in video editing and content creation. While still in its early stages, this technology has the potential to revolutionize the way we create videos, opening up new possibilities for creativity and expression.

From replacing actors in short films to generating personalized avatars for virtual meetings, the applications of Mocha are vast and diverse. By embracing this open-source AI model and exploring its capabilities, you can gain a competitive edge in the rapidly evolving world of video production.

Key Takeaways:

• Mocha enables seamless character replacement in videos while preserving visual consistency.
• ComfyUI and Promptus provide a powerful and intuitive environment for working with Mocha.
• Accurate facial and hand tracking is crucial for achieving realistic results.
• The potential applications of Mocha are vast and diverse.

Call to Action:

• Explore the Resources: Dive into the resources mentioned in this post, including the Mocha GitHub repository and the Promptus setup guide.
  • Mocha GitHub: https://github.com/Orange-3DV-Team/MoCha
  • ComfyUI with Promptus Setup Guide: https://www.promptus.ai/blog/how-to-use-promptus-offline
• Experiment with ComfyUI and Promptus: Download ComfyUI and Promptus and start experimenting with the workflows described in this post.
• Join the Community: Connect with other creators and share your experiences in the Promptus Discord community.
  • Discord / Community: https://discord.com/invite/gTTKzXKNay
• Share Your Creations: Show us what you've created using Mocha, ComfyUI, and Promptus! Share your videos on social media and tag us so we can see your amazing work. Use the hashtags: #aitools #MochaAI #promptusai #comfyui #aianimation #aivideo #aifilmmaking

The future of video editing is here. Are you ready to be a part of it?