Machine Learning Scientist – Vision-Language-Action (VLA) for Humanoids (Junior/Senior)
- Hanoi / Ho Chi Minh City
- Full-time
Overview
You will design and train next-generation Vision-Language-Action (VLA) models that let humanoid robots understand instructions, perceive complex scenes, and act safely in real industrial environments.
Your focus is learning from limited real-world teleoperation data and closing the distribution gap between scarce real demonstrations and rich synthetic worlds. You’ll explore new model architectures, training schemes, and loss functions, and combine them with randomized, high-fidelity simulation and world-model–based data generation (e.g., NVIDIA Cosmos, Isaac/Omniverse) to build generalizable VLA policies for humanoids in factories and logistics.
You’ll work closely with our Teleoperation, RL & Controls, Simulation, and Platform teams to bring these models from research into production robots.
Key Responsibilities
Design and implement VLA architectures for humanoids
Build multi-modal policies that ingest RGB/Depth, language, robot state, and task history to generate actions (pose targets, motion primitives, or low-level controls).
Explore transformers, diffusion-style policies, hierarchical VLA, recurrent memory, and world-model–augmented controllers.
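To give a purely illustrative flavor of this work, here is a minimal PyTorch sketch of a multi-modal policy that fuses visual tokens, language tokens, and robot state with a small transformer and decodes an action chunk. All names and dimensions (e.g., SimpleVLAPolicy) are hypothetical placeholders, not our production architecture.
```python
# Minimal illustrative sketch of a multi-modal VLA-style policy (not our production model).
# Assumes pre-extracted vision/language embeddings; all dimensions are placeholder values.
import torch
import torch.nn as nn


class SimpleVLAPolicy(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=768, state_dim=32,
                 d_model=256, action_dim=24, chunk_len=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)       # project image features
        self.lang_proj = nn.Linear(lang_dim, d_model)     # project language tokens
        self.state_proj = nn.Linear(state_dim, d_model)   # project proprioceptive state
        self.action_queries = nn.Parameter(torch.randn(chunk_len, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)  # e.g., pose targets or joint commands

    def forward(self, vis_tokens, lang_tokens, robot_state):
        # vis_tokens: (B, Nv, vis_dim), lang_tokens: (B, Nl, lang_dim), robot_state: (B, state_dim)
        tokens = torch.cat([
            self.vis_proj(vis_tokens),
            self.lang_proj(lang_tokens),
            self.state_proj(robot_state).unsqueeze(1),
            self.action_queries.expand(vis_tokens.size(0), -1, -1),
        ], dim=1)
        fused = self.encoder(tokens)
        # Read the action chunk off the trailing query tokens -> (B, chunk_len, action_dim)
        return self.action_head(fused[:, -self.action_queries.size(0):])
```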
Learn effectively from scarce, noisy teleoperation data
Work with the teleop team to define data schemas, logging, and dataset curation for demonstrations collected from real humanoid teleoperation.
Develop training strategies for low-data regimes: strong augmentations, self-/semi-supervised pretraining, contrastive objectives, multi-task learning, and behavior cloning / offline RL hybrids.
Propose loss designs and regularizers (e.g., action smoothness, safety margins, temporal consistency, language-grounding consistency) to mitigate overfitting and distribution shift.
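As a rough, hypothetical illustration of this kind of loss shaping, the sketch below combines a behavior-cloning term with action-smoothness and temporal-consistency regularizers; the weights and function name are placeholders, not a prescribed recipe.
```python
# Illustrative composite loss for low-data imitation learning (weights and names are hypothetical).
import torch
import torch.nn.functional as F


def vla_training_loss(pred_actions, expert_actions, w_smooth=0.1, w_consistency=0.1):
    """pred_actions, expert_actions: (B, T, action_dim) action chunks."""
    # Behavior cloning: match expert demonstrations.
    bc_loss = F.mse_loss(pred_actions, expert_actions)

    # Action smoothness: penalize large step-to-step changes in the predicted chunk.
    smooth_loss = (pred_actions[:, 1:] - pred_actions[:, :-1]).pow(2).mean()

    # Temporal consistency: predicted per-step deltas should track expert deltas.
    pred_delta = pred_actions[:, 1:] - pred_actions[:, :-1]
    expert_delta = expert_actions[:, 1:] - expert_actions[:, :-1]
    consistency_loss = F.mse_loss(pred_delta, expert_delta)

    return bc_loss + w_smooth * smooth_loss + w_consistency * consistency_loss
```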
Tackle distribution shift between real-world demos and simulation / synthetic data
Design domain randomization and sim parameter sampling (lighting, materials, sensor noise, robot dynamics, task layouts, human styles) to cover real-world variation.
Set up pipelines where VLA policies are trained jointly on real teleop demos and large synthetic datasets (see the sketch after this list).
Analyze failure modes (out-of-distribution visual scenes, unseen language instructions, contact edge cases) and iteratively refine data, models, and objectives.
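One simple way to realize such joint training, sketched below under assumed dataset objects and a made-up real/synthetic ratio, is to oversample the scarce real demos with a weighted sampler so each batch keeps a fixed mix.
```python
# Illustrative real/synthetic data mixing with a fixed sampling ratio (names are placeholders).
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler


def make_mixed_loader(real_dataset, synthetic_dataset, real_fraction=0.3, batch_size=64):
    """Oversample scarce real teleop demos so batches keep a target real/synthetic ratio."""
    combined = ConcatDataset([real_dataset, synthetic_dataset])
    n_real, n_syn = len(real_dataset), len(synthetic_dataset)
    # Per-sample weights: real samples are drawn with total probability `real_fraction`.
    weights = torch.cat([
        torch.full((n_real,), real_fraction / n_real),
        torch.full((n_syn,), (1.0 - real_fraction) / n_syn),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```
In practice the mixing ratio itself is a design variable, tuned alongside augmentations and regularizers to trade real-data fidelity against synthetic coverage.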
Build synthetic and simulated data pipelines (Isaac / Omniverse / Cosmos)
Configure high-fidelity humanoid simulation environments (manipulation cells, factory workcells, shared spaces with humans).
Integrate or prototype workflows that use world foundation models (e.g., NVIDIA Cosmos Predict/Transfer/Reason) to generate diverse video and interaction data for downstream VLA training and evaluation.
Automate large-scale curriculum & scenario generation (edge cases, rare events, long-horizon tasks).
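As a toy illustration of scenario generation, the sketch below samples randomized scene and task parameters; the parameter names and ranges are invented for illustration, and the actual hand-off to Isaac Sim / Omniverse / Cosmos is omitted.
```python
# Toy scenario sampler for domain randomization (fields and ranges are illustrative only).
import random
from dataclasses import dataclass


@dataclass
class ScenarioParams:
    light_intensity: float      # arbitrary relative units
    table_friction: float
    camera_noise_std: float
    object_count: int
    instruction_template: str


def sample_scenario(rng: random.Random) -> ScenarioParams:
    """Draw one randomized scenario; a real pipeline would feed this to the simulator."""
    return ScenarioParams(
        light_intensity=rng.uniform(0.3, 2.0),
        table_friction=rng.uniform(0.4, 1.2),
        camera_noise_std=rng.uniform(0.0, 0.02),
        object_count=rng.randint(1, 8),
        instruction_template=rng.choice([
            "pick up the {obj} and place it in the bin",
            "hand the {obj} to the operator",
        ]),
    )


# Seeded sampling makes large scenario batches reproducible for curriculum generation.
scenarios = [sample_scenario(random.Random(seed)) for seed in range(10_000)]
```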
Evaluation, benchmarking, and deployment support
Define metrics and test suites: task success, safety violations, instruction following, sim-to-real gap, robustness to visual/language perturbations (see the illustrative sketch after this list).
Run structured ablations (architecture × data mix × losses) and communicate findings with clear plots, reports, and logs.
Collaborate with RL/Controls and Platform teams to integrate VLA policies into the humanoid stack and run on real robots under safety constraints.
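For a concrete sense of the metrics above, here is a minimal, hypothetical sketch that aggregates per-episode evaluation records into success, instruction-following, safety, and sim-to-real-gap numbers; the episode schema is invented for illustration.
```python
# Minimal aggregation of per-episode evaluation results (episode schema is hypothetical).
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EpisodeResult:
    success: bool               # task completed end-to-end
    safety_violations: int      # e.g., workspace or force-limit breaches
    followed_instruction: bool  # language-grounding check
    domain: str                 # "real" or "sim"


def summarize(results: List[EpisodeResult]) -> Dict[str, float]:
    """Aggregate per-episode records into headline metrics (assumes a non-empty list)."""
    def success_rate(rs: List[EpisodeResult]) -> float:
        return sum(r.success for r in rs) / len(rs) if rs else float("nan")

    n = len(results)
    real = [r for r in results if r.domain == "real"]
    sim = [r for r in results if r.domain == "sim"]
    return {
        "task_success_rate": success_rate(results),
        "instruction_following_rate": sum(r.followed_instruction for r in results) / n,
        "safety_violations_per_episode": sum(r.safety_violations for r in results) / n,
        # Sim-to-real gap reported as the difference in success rates between domains.
        "sim_to_real_gap": success_rate(sim) - success_rate(real),
    }
```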
Required Qualifications
Core skills
Strong background in deep learning for sequence / multimodal modeling (e.g., transformers, diffusion models, recurrent architectures, latent world models).
Hands-on experience building and training vision-language or VLA-style models (e.g., VLMs, embodied LLMs, policy networks conditioned on language).
Solid understanding of at least one of:
Imitation learning / behavior cloning
Offline / batch RL
Inverse RL or preference-based learning
Proven ability to work in low-data regimes: data augmentation, self-supervised representation learning, regularization, careful validation design.
Experience with robot learning from demonstration or teleoperation data (any platform; humanoid experience is a plus).
Strong engineering skills in Python and modern ML frameworks (PyTorch preferred; JAX/TF is a plus), including:
Writing clean training loops and data pipelines
Profiling and debugging training/inference
Managing experiments at scale (config systems, logging, basic MLOps)
General
Bachelor’s/Master’s/Ph.D. in Computer Science, Robotics, EE, or related field; or equivalent industry experience.
Ability to work cross-functionally with controls, hardware, and teleoperation teams.
Preferred Qualifications
Experience with NVIDIA physical-AI stacks: Isaac (Sim/Lab), Omniverse, or NVIDIA Cosmos world foundation models for synthetic data generation and sim-to-real workflows.
Comfortable designing synthetic datasets: specifying scenario distributions, parameter ranges, and validation protocols.
Prior work on humanoid robots (control, perception, or policy learning) or other complex articulated robots in industrial settings.
Contributions to embodied AI / robot learning research: publications, open-source projects, or widely-used codebases.
Familiarity with safety-critical robotics (safe action constraints, human-in-the-loop supervision, fallbacks).
Experience deploying models on GPU clusters and edge devices (profiling latency, memory usage, batching, mixed precision).