Edge Optimization Engineer (Junior/Senior)

  • Hanoi / Ho Chi Minh City
  • Full-time

Overview

You’ll own end-to-end performance on the robot edge: building deep-dive profilers, pinpointing kernel/op/memory bottlenecks across the full data path (model → dataloader/IPC → I/O → control loops), and shipping deterministic builds that stay rock-solid under real duty cycles (thermal, brownouts, EMI). Your scope spans GPU/NPU edge targets (e.g., Nvidia Jetson Orin, TI EVMs) with clear latency budgets per subsystem (perception/VLA, policy/RL, control).

Key Responsibilities

  • Own end-to-end profiling. Run Nsight/PyTorch Profiler/tegrastats traces across the full perception → policy → control path; attribute stalls to specific ops/kernels, memory moves, cache misses, or IPC (see the profiler sketch after this list).

  • Set and enforce budgets. Define per-module latency/throughput/energy targets (e.g., camera→VLA ≤X ms, policy step ≤Y ms, control loop 1 kHz) and guard them with automated checks (see the budget-gate sketch after this list).

  • Kill bottlenecks fast. Apply op fusion, graph capture, async pipelines, memory tiling, pinned buffers, and stream concurrency; rewrite hot paths in CUDA/Triton when needed (see the graph-capture sketch after this list).

  • Tune per device. Generate deterministic builds for edge devices; measure kernel occupancy, tensor-core usage, DMA overlap; lock clocks when appropriate.

  • Package optimized artifacts. Emit TensorRT engines/ONNX bundles, quantized weights, calibration sets, and config manifests; version everything for rollback (see the packaging sketch after this list).

  • On-call for performance. Triage field logs, root-cause perf regressions, and ship hotfix engines without violating safety or RT guarantees.

  • Continuously raise the bar. Track new compiler/runtime features (TensorRT/TVM/Inductor), add them behind flags, and land wins once validated against budgets.
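
The profiler sketch referenced above is a minimal, illustrative example of the attribution workflow: capture a torch.profiler trace around one inference step and rank ops by GPU time. The toy model, shapes, and trace filename are placeholders, not project code.

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Toy stand-in for a perception/policy module; shapes are arbitrary.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 64, 3, padding=1),
        torch.nn.ReLU(),
    ).cuda().eval()
    x = torch.randn(1, 3, 224, 224, device="cuda")

    with torch.no_grad():
        for _ in range(5):  # warm-up keeps one-time CUDA init out of the trace
            model(x)
        with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
            model(x)
            torch.cuda.synchronize()

    # Rank ops by GPU time to see which kernels dominate the step.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # Export a Chrome trace to inspect alongside an Nsight Systems timeline.
    prof.export_chrome_trace("policy_step_trace.json")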
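
The budget-gate sketch referenced above shows one way an automated per-module latency check could fail a build when a p99 measurement exceeds its budget. The budget numbers and the p99_latency_ms/check helpers are illustrative assumptions, not the team's real harness.

    import statistics
    import sys
    import time

    # Example budgets in milliseconds; real values come from the subsystem owners.
    BUDGETS_MS = {"camera_to_vla": 40.0, "policy_step": 10.0}

    def p99_latency_ms(fn, iters=200):
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - t0) * 1e3)
        # Gate on a high percentile rather than the mean so rare stalls still fail.
        return statistics.quantiles(samples, n=100)[98]

    def check(name, fn):
        p99 = p99_latency_ms(fn)
        ok = p99 <= BUDGETS_MS[name]
        print(f"{name}: p99={p99:.2f} ms, budget={BUDGETS_MS[name]:.1f} ms {'OK' if ok else 'FAIL'}")
        return ok

    if __name__ == "__main__":
        # Replace the lambda with the real subsystem entry point under test.
        passed = check("policy_step", lambda: time.sleep(0.004))
        sys.exit(0 if passed else 1)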
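
The graph-capture sketch referenced above demonstrates one of the named techniques, CUDA graph capture in PyTorch for a fixed-shape hot path, using the standard warm-up/capture/replay pattern. The two-layer model is a stand-in for a real policy network.

    import torch

    # Toy stand-in for a fixed-shape policy network on the hot path.
    model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).cuda().eval()
    static_in = torch.zeros(1, 256, device="cuda")

    # Warm up on a side stream so capture does not record lazy initialization.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            static_out = model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Capture one step into a CUDA graph; replaying it removes per-kernel
    # launch overhead from the steady-state loop.
    g = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(g):
        static_out = model(static_in)

    # Steady state: copy fresh inputs into the captured buffer, then replay.
    static_in.copy_(torch.randn(1, 256, device="cuda"))
    g.replay()
    print(static_out.shape)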
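
The packaging sketch referenced above illustrates one possible artifact flow: export ONNX, build a TensorRT engine with trtexec, and write a hash manifest so a bad engine can be rolled back to a known-good version. The file names, manifest layout, and FP16 choice are assumptions; only the torch.onnx.export call and the trtexec flags shown are standard tooling.

    import hashlib
    import json
    import subprocess

    import torch

    # Toy stand-in for an exported policy head.
    model = torch.nn.Linear(256, 7).eval()
    dummy = torch.zeros(1, 256)
    torch.onnx.export(model, dummy, "policy.onnx", input_names=["obs"], output_names=["action"])

    # trtexec ships with TensorRT on Jetson images; build an FP16 engine from the ONNX file.
    subprocess.run(
        ["trtexec", "--onnx=policy.onnx", "--saveEngine=policy.plan", "--fp16"],
        check=True,
    )

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Record what was shipped so field rollbacks can target a known-good hash.
    manifest = {
        "model": "policy",
        "onnx": {"file": "policy.onnx", "sha256": sha256("policy.onnx")},
        "engine": {"file": "policy.plan", "sha256": sha256("policy.plan"), "precision": "fp16"},
    }
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)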

Required Qualifications

  • Profiling & diagnosis. Expert with Nsight, PyTorch Profiler, tegrastats; you attribute latency/throughput regressions to exact kernels/ops and memory traffic, then present a fix plan with a clear ROI.

  • Real-time mindset. You set/meet per-module latency budgets and protect hard real-time control paths during optimization and deployment.

  • Deployment engineering. You ship containerized, deterministic builds that emit TRT engines and quantized weights, with repeatable, one-command device bring-up.

  • Hardware breadth. Hands-on with Nvidia Jetson Orin, TI EVMs, etc.; comfortable validating under thermal/brownout/EMI stress so behavior stays stable.

Preferred Qualifications

  • Compiler/runtime chops (TensorRT, ONNX Runtime, TVM, Triton, ExecuTorch) and op-fusion/graph-capture to reduce stalls and memory traffic (aligns with packaging TRT engines and deterministic pipelines).

  • VLA/RL awareness to set subsystem budgets and validate perception-to-action latency end-to-end.
