of offer. This position pays $27.00 per hour USD. This position pays $27.00 per hour CAD. TD SYNNEX – May 2026 Hybrid Internships... of a dynamic team at TD SYNNEX entirely dedicated to "Making IT Personal". Starting in May, you will have the opportunity...
Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training...
You'll also execute business plans to ensure continuous operations. Partner closely with MAI Leadership to drive strategic..., as well as budget management. Work in close collaboration with the broader business management community to align with MAI...
Applies engineering principles and AI techniques to solve complex problems through sound and creative engineering. Works with appropriate stakeholders to determine user requirements for a feature. Quickly learns new engineering methods, eme...
Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters. Build and evolve telemetry systems to provide visibility into infrastructure & ML model performance, utilization, and ...
Design and develop features for our capacity management portal. Design and develop features to provide visibility into model performance and quality across our fleet. Partner with ML researchers and PMs to translate functional requirements in...
Design, develop and maintain large-scale multimodal data processing pipelines. Design, develop and maintain large-scale multimodal model pretraining and post-training frameworks. Design, develop and maintain large-scale multimodal model inf...
Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks. Drive architectural decisions by analyzing rea...
Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems. Observability: Design and help maintain monitoring, alerting, and logging systems to provide real-...
Act as the technical lead and owner for infrastructure analytics across compute, storage, and networking. Design and build durable, scalable data pipelines that ingest telemetry from clusters, schedulers, health systems, and capacity tracke...