Challenges of Modern AI Infrastructure: A Failure Analysis Approach to Enhance System Reliability

Thursday, November 20, 2025: 8:00 AM
2 (Pasadena Convention Center)
Mr. Huei Hao Yap, Meta, Menlo Park, CA
Mr. Pradip Sairam Pichumani, Meta, Menlo Park, CA
Mr. Nikesh Tayi Dhar, Meta, Menlo Park, CA
Mr. Dustin Kendig, BS EE, Microsanj LLC, Santa Clara, CA
Mr. Jeff Kuo, Microsanj LLC, Santa Clara, CA

Summary:

The increasing power demands of modern Artificial Intelligence (AI) infrastructure pose significant challenges to design, manufacturing, and assembly quality, leading to reduced system lifetimes. This study uses the EZ100A-irOpen TEST system and 3D X-Ray Circular Scan to identify and mitigate issues affecting AI system reliability, focusing on three key areas: (1) intermittent and mismatched signals on GPU boards caused by poor soldering; (2) power-up and signal integrity issues in high-speed PCIe connector cables, potentially affected by copper via voids; and (3) NPU switching issues resulting from electrical overstress in the field. By addressing these issues, the study aims to improve the reliability specification and lifespan of AI infrastructure, ultimately supporting the development of more robust and efficient AI systems.
See more of: Boards and System Level FA