Staff AI Infrastructure Site Reliability Engineer
Full-time | More than 100K RMB per month | United States - Santa Clara, CA
xiaopeng(X)
Refreshed 22 days ago | 119 views

Job Responsibilities

- Architect and lead the development of scalable, secure AI infrastructure on cloud-native platforms to support autonomous driving technologies
- Collaborate closely with ML teams to facilitate seamless integration and optimal performance of AI algorithms
- Identify and address system bottlenecks and instabilities, applying innovative solutions to enhance system reliability and efficiency
- Foster technological advancements through research and implementation of state-of-the-art AI tools and methodologies
- Act as a key technical leader and mentor, promoting a culture of technical excellence and collaborative innovation within the AI infrastructure team

Job Requirements

Minimum Skill Requirements:
- Bachelor's or Master's in Computer Science, Engineering, or a related technical field
- 5+ years of experience in designing, deploying, and managing GPU clusters for high-performance computing in AI applications, particularly within cloud environments
- Proficiency in cloud services (AWS, Azure, ALI Cloud) and in building containerized applications using Kubernetes and Docker
- Strong programming skills in Python and Golang, and experience with AI/ML frameworks (TensorFlow, PyTorch)

Preferred Skill Requirements:
- Expertise in designing and managing high-availability, high-throughput systems that support machine learning and deep learning workloads
- Demonstrable leadership skills with a track record of mentoring and leading technical teams
- In-depth understanding of data structures, algorithms, and software engineering principles relevant to AI and autonomous systems

Required Languages

English

Job Details

Position type

Other

Experience

5-10 years
