Machine Learning Scientist – Vision-Language-Action (VLA) for Humanoids (Junior/Senior)

  • Hanoi / Ho Chi Minh City
  • Full-time

Overview

You will design and train next-generation Vision-Language-Action (VLA) models that let humanoid robots understand instructions, perceive complex scenes, and act safely in real industrial environments.

Your focus is learning from limited real-world teleoperation data and closing the distribution gap between scarce real demonstrations and rich synthetic worlds. You’ll explore new model architectures, training schemes, and loss functions, and combine them with randomized, high-fidelity simulation and world-model–based data generation (e.g., NVIDIA Cosmos, Isaac/Omniverse) to build generalizable VLA policies for humanoids in factories and logistics.

You’ll work closely with our Teleoperation, RL & Controls, Simulation, and Platform teams to bring these models from research into production robots.

Key Responsibilities

Design and implement VLA architectures for humanoids

  • Build multi-modal policies that ingest RGB/Depth, language, robot state, and task history to generate actions (pose targets, motion primitives, or low-level controls).

  • Explore transformers, diffusion-style policies, hierarchical VLA, recurrent memory, and world-model–augmented controllers.
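
To give a flavor of this work, below is a minimal sketch of a transformer policy that fuses vision, language, and robot-state tokens into actions; every module, feature dimension, and design choice here is an illustrative assumption, not our production architecture.

```python
import torch
import torch.nn as nn

class MiniVLAPolicy(nn.Module):
    """Toy multi-modal policy: fuse image, language, and state tokens, decode actions."""

    def __init__(self, d_model=256, action_dim=24, n_layers=4, n_heads=8):
        super().__init__()
        # Illustrative projections; a real system would sit on pretrained VLM backbones.
        self.img_proj = nn.Linear(512, d_model)    # features from a (frozen) vision encoder
        self.lang_proj = nn.Linear(768, d_model)   # features from a (frozen) text encoder
        self.state_proj = nn.Linear(32, d_model)   # proprioceptive robot state
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, action_dim)  # pose targets or joint commands

    def forward(self, img_feats, lang_feats, state):
        # img_feats: (B, Ni, 512), lang_feats: (B, Nl, 768), state: (B, 32)
        tokens = torch.cat([
            self.img_proj(img_feats),
            self.lang_proj(lang_feats),
            self.state_proj(state).unsqueeze(1),
        ], dim=1)
        fused = self.fusion(tokens)
        # Pool the fused sequence and decode a single action vector.
        return self.action_head(fused.mean(dim=1))
```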

Learn effectively from scarce, noisy teleoperation data

  • Work with the teleop team to define data schemas, logging, and dataset curation from real humanoid operators.

  • Develop training strategies for low-data regimes: strong augmentations, self-/semi-supervised pretraining, contrastive objectives, multi-task learning, and behavior cloning / offline RL hybrids.

  • Propose loss designs and regularizers (e.g., action smoothness, safety margins, temporal consistency, language-grounding consistency) to mitigate overfitting and distribution shift.
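
To make the loss-design bullet concrete, here is a minimal sketch of a composite imitation objective; the regularizers and their weights are illustrative assumptions, not a prescribed recipe.

```python
import torch.nn.functional as F

def vla_loss(pred_actions, expert_actions, w_smooth=0.1, w_temporal=0.05):
    """Behavior cloning plus illustrative regularizers.

    pred_actions / expert_actions: (B, T, action_dim) action sequences.
    Weights are placeholders; in practice they are tuned per task.
    """
    # Core imitation term: match the teleoperator's actions.
    bc = F.mse_loss(pred_actions, expert_actions)
    # Action-smoothness penalty: discourage jerky consecutive commands.
    smooth = (pred_actions[:, 1:] - pred_actions[:, :-1]).pow(2).mean()
    # Temporal-consistency penalty: predicted action deltas should track expert deltas.
    temporal = F.mse_loss(
        pred_actions[:, 1:] - pred_actions[:, :-1],
        expert_actions[:, 1:] - expert_actions[:, :-1],
    )
    return bc + w_smooth * smooth + w_temporal * temporal
```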

Tackle distribution shift between real-world demos and simulation / synthetic data

  • Design domain randomization and sim parameter sampling (lighting, materials, sensor noise, robot dynamics, task layouts, human styles) to cover real-world variation; see the sampler sketch after this list.

  • Set up pipelines where VLA policies are trained jointly on real teleop demos and large synthetic datasets.

  • Analyze failure modes (out-of-distribution visual scenes, unseen language instructions, contact edge cases) and iteratively refine data, models, and objectives.
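
The sampler sketch referenced in the domain randomization bullet above; the parameter names and ranges are invented for illustration, not calibrated values.

```python
import random

# Illustrative randomization ranges; real values come from sim calibration.
RANDOMIZATION_SPACE = {
    "light_intensity_lux": (300.0, 1500.0),
    "camera_noise_std": (0.0, 0.02),   # normalized pixel noise
    "friction_coeff": (0.4, 1.2),
    "payload_mass_kg": (0.0, 5.0),
    "table_height_m": (0.70, 0.95),
}

def sample_sim_params(rng=random):
    """Draw one randomized scene configuration for a simulated episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_SPACE.items()}

# Usage: generate configs for a batch of synthetic episodes.
episode_configs = [sample_sim_params() for _ in range(1000)]
```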

Build synthetic and simulated data pipelines (Isaac / Omniverse / Cosmos)

  • Configure high-fidelity humanoid simulation environments (manipulation cells, factory workcells, shared spaces with humans).

  • Integrate or prototype workflows that use world foundation models (e.g., NVIDIA Cosmos Predict/Transfer/Reason) to generate diverse video and interaction data for downstream VLA training and evaluation.

  • Automate large-scale curriculum & scenario generation (edge cases, rare events, long-horizon tasks).
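
One common way to automate this is a weighted scenario catalog that oversamples edge cases and rare events; the scenario names and weights below are hypothetical.

```python
import random

# Hypothetical scenario catalog; weights deliberately oversample hard cases.
SCENARIOS = [
    ("nominal_pick_place", 0.5),
    ("occluded_object", 0.2),       # edge case: target partially hidden
    ("human_in_workcell", 0.2),     # rare event: shared-space interaction
    ("long_horizon_kitting", 0.1),  # multi-step, long-horizon task
]

def sample_curriculum(n_episodes, rng=random):
    """Sample a scenario name per episode according to the catalog weights."""
    names, weights = zip(*SCENARIOS)
    return rng.choices(names, weights=weights, k=n_episodes)
```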

Evaluation, benchmarking, and deployment support

  • Define metrics and test suites: task success, safety violations, instruction following, sim-to-real gap, robustness to visual/language perturbations (see the aggregation sketch after this list).

  • Run structured ablations (architecture × data mix × losses) and communicate findings with clear plots, reports, and logs.

  • Collaborate with RL/Controls and Platform teams to integrate VLA policies into the humanoid stack and run on real robots under safety constraints.
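
The aggregation sketch referenced in the metrics bullet above; the per-episode fields are assumptions, not our actual logging schema.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    # Minimal per-episode record; fields are illustrative only.
    task_success: bool
    safety_violations: int
    instruction_followed: bool

def summarize(results):
    """Aggregate a test suite of episodes into headline metrics."""
    n = len(results)
    return {
        "success_rate": sum(r.task_success for r in results) / n,
        "safety_violation_rate": sum(r.safety_violations > 0 for r in results) / n,
        "instruction_following_rate": sum(r.instruction_followed for r in results) / n,
    }
```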

Required Qualifications

Core skills

  • Strong background in deep learning for sequence / multimodal modeling (e.g., transformers, diffusion models, recurrent architectures, latent world models).

  • Hands-on experience building and training vision-language or VLA-style models (e.g., VLMs, embodied LLMs, policy networks conditioned on language).

  • Solid understanding of at least one of:

    • Imitation learning / behavior cloning

    • Offline / batch RL

    • Inverse RL or preference-based learning

  • Proven ability to work in low-data regimes: data augmentation, self-supervised representation learning, regularization, careful validation design.

  • Experience with robot learning from demonstration or teleoperation data (any platform; humanoid experience is a plus).

  • Strong engineering skills in Python and modern ML frameworks (PyTorch preferred; JAX/TF is a plus), including:

    • Writing clean training loops and data pipelines

    • Profiling and debugging training/inference

    • Managing experiments at scale (config systems, logging, basic MLOps)

General

  • Bachelor’s/Master’s/Ph.D. in Computer Science, Robotics, EE, or related field; or equivalent industry experience.

  • Ability to work cross-functionally with controls, hardware, and teleoperation teams.

Preferred Qualifications

  • Experience with NVIDIA physical-AI stacks: Isaac (Sim/Lab), Omniverse, or NVIDIA Cosmos world foundation models for synthetic data generation and sim-to-real workflows.

  • Comfort designing synthetic datasets: specifying scenario distributions, parameter ranges, and validation protocols.

  • Prior work on humanoid robots (control, perception, or policy learning) or other complex articulated robots in industrial settings.

  • Contributions to embodied AI / robot learning research: publications, open-source projects, or widely used codebases.

  • Familiarity with safety-critical robotics (safe action constraints, human-in-the-loop supervision, fallbacks).

  • Experience deploying models on GPU clusters and edge devices (profiling latency, memory usage, batching, mixed precision).
