Silicon minds learn
To weave physics and purpose—
Wisdom flows like code
With every article and podcast episode, we provide comprehensive study materials: References, Executive Summary, Briefing Document, Quiz, Essay Questions, Glossary, Timeline, Cast, FAQ, Table of Contents, Index, Polls, 3k Image, Fact Check and Comic at the bottom of the page.
We're standing at the edge of something unprecedented in human history. Not another technological breakthrough that makes our phones faster or our videos sharper, but a fundamental shift in how we solve the complex problems that shape our physical world.
While everyone's been obsessing over ChatGPT writing emails and generating cat poetry, a quieter revolution has been brewing in the engineering world. It's called Engineering Artificial General Intelligence, or E-AGI, and it promises to do something that should terrify and exhilarate us in equal measure: think like the best human engineers, but without the coffee breaks.
The Messy Reality of Building Things
Here's what the tech bros pumping out another social media app don't understand: engineering physical systems is hard. Like, really hard. When you're designing a plane, a power plant, or even a better air conditioning system, you're not just writing code that lives in a pristine digital world. You're wrestling with physics that don't care about your feelings, materials that have their own stubborn personalities, and constraints that change faster than a toddler's mood.
Everything is connected to everything else in ways that would make a conspiracy theorist weep with joy. Change the temperature by a few degrees, and suddenly your structural integrity is compromised. Adjust the motor speed, and your efficiency plummets. It's a house of cards made of equations, and every calculation matters.
The tools we have now—CAE, MBSE, all those acronyms that make engineers sound like they're speaking in code—are powerful, sure. But they're still essentially fancy calculators. They can crunch numbers until the cows come home, but they can't think. They can't look at a problem and say, "Hey, what if we tried something completely different?"
Enter the Thinking Machine
This is where E-AGI comes in, and why it's different from the AI that's been dominating headlines. We're not talking about another chatbot that can write your English essays (though it probably could). We're talking about AI that can understand the bone-deep complexity of physical systems and reason about them the way a seasoned engineer does.
Imagine an AI that doesn't just know Ohm's law—it understands when to apply it, when to question it, and when to throw it out the window because the real world is messier than textbooks admit. An AI that can look at a half-finished design and see not just what's there, but what's missing. That can suggest solutions no human thought of because it can hold thousands of variables in its mind simultaneously without breaking a sweat.
The researchers behind this concept have done something brilliant: they've borrowed from education theory to create a framework for evaluating how well an AI can think like an engineer. Using Bloom's taxonomy—that pyramid of learning that teachers love—they've mapped out six levels of cognitive ability that an E-AGI needs to master.
At the bottom, you have simple recall: knowing facts, formulas, the basics that every engineer memorizes in school. Move up a level, and you get understanding—grasping what those facts mean in context. Then application: using that knowledge to solve real problems.
But here's where it gets interesting. The higher levels—analysis, creation, reflection—that's where human engineers earn their keep. That's where experience, intuition, and creativity come together to solve problems that don't have obvious answers.
The Race to Build It
While academics debate frameworks, P-1 AI is actually trying to build this thing. Their AI, called Archie, isn't just a thought experiment—it's a $23 million bet that we can create artificial engineers who think like us, but better.
The company's approach is clever in its simplicity: if you don't have enough real engineering data to train on (and you don't, because it's all locked up in corporate vaults), then generate synthetic data based on the fundamental laws of physics. Create artificial problems, solve them computationally, and use that to teach the AI how engineering thinking works.
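The approach described above (sample an artificial problem, solve it from first principles, record the pair) can be sketched in a few lines. Everything below, from the coolant energy balance to the parameter ranges and field names, is an illustrative assumption for a data-center cooling toy problem, not P-1 AI's actual pipeline:

```python
import random

def synthetic_cooling_example(rng: random.Random) -> dict:
    """Generate one synthetic (problem, solution) pair from first principles.

    A toy sketch of the idea only: sample a heat-load scenario, then solve it
    with a physics formula so the pair can serve as training data. A real
    pipeline would use multiphysics simulation and supply-chain data.
    """
    heat_load_kw = rng.uniform(50, 500)   # server heat to remove, kW
    delta_t_k = rng.uniform(5, 15)        # allowable coolant temperature rise, K
    cp_water = 4.186                      # specific heat of water, kJ/(kg*K)
    # Energy balance: Q = m_dot * cp * dT  =>  m_dot = Q / (cp * dT)
    flow_kg_s = heat_load_kw / (cp_water * delta_t_k)
    return {
        "heat_load_kw": heat_load_kw,
        "delta_t_K": delta_t_k,
        "problem": f"Size the coolant flow for a {heat_load_kw:.0f} kW rack "
                   f"with a {delta_t_k:.1f} K allowable temperature rise.",
        "solution_flow_kg_s": round(flow_kg_s, 2),
    }

# Thousands of such samples, cheaply generated, stand in for proprietary data.
rng = random.Random(42)
dataset = [synthetic_cooling_example(rng) for _ in range(1000)]
```

The point is that the label comes from physics, not from a human annotator, so the dataset can be made as large as the compute budget allows.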
Their first target? Data center cooling systems. It's a perfect test case: complex, important, full of competing constraints and trade-offs. The kind of problem that keeps human engineers up at night, but exactly the sort of challenge that could demonstrate E-AGI's potential.
The Uncomfortable Questions
But here's what nobody's talking about in the press releases and funding announcements: what happens to us when the machines get good at thinking like engineers?
This isn't the usual "robots will take our jobs" panic. This is different. More subtle. More unsettling.
If E-AGI can analyze, create, and reflect at the level of expert human engineers, then what exactly is our role? Do we become supervisors, making sure the AI doesn't design bridges that fall down? Do we focus on defining problems while the AI solves them? Do we become translators between the messy human world and the precise digital one?
The optimists say we'll be freed up for more creative work, that we'll work alongside AI partners rather than being replaced by them. The pessimists worry we're training our own replacements, creating systems that will eventually outthink us in the very domains where human intelligence has always reigned supreme.
The Deeper Implications
The development of E-AGI represents something more profound than just another AI breakthrough. It's the moment when artificial intelligence stops being about pattern matching and starts being about understanding—really understanding—how the physical world works.
When an AI can look at a design problem and not just calculate the optimal solution, but understand why that solution is optimal, what trade-offs it involves, and what assumptions it's based on, we've crossed a line. We've created something that doesn't just mimic intelligence—it embodies it.
This is the boundary dissolution that defines our era: the lines between human and artificial cognition becoming increasingly blurred. The adaptive complexity that characterizes our modern world finding its match in systems that can think as flexibly as we do. The embodied knowledge of generations of engineers being captured and amplified by silicon minds.
What We're Really Building
Perhaps the most unsettling aspect of E-AGI isn't what it can do, but what it represents. We're essentially trying to capture the essence of human engineering expertise—all the accumulated wisdom, intuition, and creative problem-solving that defines our profession—and transfer it to systems that never had to learn the hard way.
There's something both beautiful and melancholy about that. Beautiful because it could democratize engineering expertise, making advanced design capabilities available to anyone. Melancholy because it suggests we're nearing the point where human intelligence, at least in this domain, might become optional rather than essential.
The future that E-AGI promises isn't one where engineers disappear. It's one where the nature of engineering work fundamentally changes. Where the value of human engineers lies not in their ability to perform cognitive tasks, but in their ability to define problems, navigate ethical complexities, and manage the human elements that no AI can fully grasp.
Maybe that's not such a bad future. Maybe it's one where engineers can focus on the parts of their work that are most uniquely human: the creativity, the ethical judgment, the ability to understand not just how to build something, but whether it should be built at all.
The question isn't whether E-AGI will change engineering. It will. The question is whether we'll be ready for what comes next, when our thinking machines start thinking for themselves.
Link References
On the Evaluation of Engineering Artificial General Intelligence
Episode Links
Other Links to Heliox Podcast
YouTube
Substack
Podcast Providers
Spotify
Apple Podcasts
Patreon
Facebook Group
STUDY MATERIALS
Briefing Document
Executive Summary:
P-1 AI, a new company co-founded by industry veterans from Airbus, United Technologies, and DeepMind, has emerged from stealth with the goal of building Engineering Artificial General Intelligence (eAGI) for physical systems. Their flagship product, Archie, is an AI agent designed to automate cognitive tasks currently performed by human engineers. A key challenge in developing and evaluating such systems is the scarcity of relevant training data and the lack of robust evaluation frameworks for engineering intelligence. P-1 AI addresses the data challenge by creating synthetic datasets. To address evaluation, they propose a novel, extensible framework grounded in Bloom's Taxonomy, tailored to the complexities of engineering design. This framework defines six levels of engineering cognition and uses a secondary taxonomy of domain-specific metadata to generate comprehensive and nuanced evaluations.
Key Themes and Ideas:
The Need for Engineering AGI: The sources highlight that engineering the physical world is a highly complex, iterative, and expertise-driven process that remains largely manual, unlike software engineering. Current narrow AI tools and general-purpose LLMs are insufficient for the cognitive demands of comprehensive engineering tasks. There is a significant need for AI that can automate and augment the work of human engineers in designing, analyzing, and synthesizing physical systems.
"Where is the AI that will ultimately help us design and build starships, Matrioshka Brains, and Dyson Spheres? P-1 AI, Inc. aims to answer this question with Archie, an AI agent aimed at cognitive automation of tasks that human engineers perform today when designing physical systems."
"Unlike software engineering, where modularity and abstraction have enabled powerful automation, physical systems engineering remains largely manual, iterative, and expertise-driven."
P-1 AI's Mission and Approach: P-1 AI is specifically focused on "engineering AGI," distinct from general AGI or narrow engineering tools. Their approach involves developing an AI architecture ("Archie") capable of "cognitive automation" for tasks like distilling requirements, concept development, design trades, and tool selection. They explicitly state Archie is not designed to replace or improve existing engineering tools but to automate the cognitive processes around them.
"Our aim is that every engineering team at every major industrial company has an Archie as a team member, focusing initially on the dull and repetitive tasks... ultimately helping humankind build things we don’t know how to build today," said Paul Eremenko, co-founder and CEO at P-1 AI."
"We are on a mission to solve engineering AGI," said Aleksa Gordić, co-founder and head of AI at P-1 AI."
"Archie is not designed to replace, improve, or compete with any existing engineering tool—the focus is on cognitive automation."
Addressing Data Scarcity: A major hurdle for training AI models in physical product domains is the limited availability of unique, well-structured, and non-proprietary design data. P-1 AI tackles this by generating "large, physics-based and supply chain-informed synthetic design datasets" that efficiently sample the product design space.
"The main obstacle to creating AI models that reason over physical product domains is scarcity of training data."
"P-1 AI addresses this challenge by creating large, physics-based and supply chain-informed synthetic design datasets that efficiently sample the product design space."
Proposed eAGI Evaluation Framework: Recognizing the limitations of existing benchmarks (which often test shallow knowledge or rigid, domain-specific problems), the authors from P-1 AI propose a comprehensive framework for evaluating eAGI agents. This framework is considered a critical enabler for developing eAGI.
"Given the breadth of tasks expected of an eAGI agent, evaluating its capabilities becomes a foundational challenge."
"Existing benchmarks either test shallow factual knowledge (e.g., textbook questions) or rely on rigid, domain-specific problems with fixed outputs. Neither of these approaches captures the range of reasoning, modeling, and synthesis skills needed for effective engineering intelligence."
Bloom's Taxonomy for Engineering Cognition: The core of the evaluation framework is a 6-level hierarchy grounded in Bloom's Taxonomy, adapting it to engineering cognition. These levels progress from basic recall to complex reflection and abstraction:
Level 1 (Remember): Accurate recall of factual information (definitions, equations, standards, component properties).
Level 2 (Understand): Semantic understanding of designs (identifying components, spatial/functional topology, causal behavior).
Level 3 (Apply): Design evaluation and operational reasoning (predicting performance, component substitution, using external tools).
Level 4 (Analyze): Design in-filling and error diagnosis (completing designs, detecting errors, proposing corrections). This level introduces inverse reasoning.
Level 5 (Create): Design synthesis from requirements (generating new designs, adapting existing ones, exploring design space). This level requires creative synthesis.
Level 6 (Reflection and Abstraction): Meta-reasoning and critique (critiquing decisions, inferring principles, identifying limitations, explaining decisions). This level represents expert-level reasoning.
"We propose a 6-level hierarchy (Table 1) grounded in Bloom’s taxonomy to characterize engineering cognition levels and systematically evaluate eAGI agents."
"These levels reflect ascending competencies—starting from factual recall and culminating in design synthesis, abstraction, and self-awareness."
Dimensions of Engineering Complexity: The framework considers three dimensions that influence the complexity of engineering problems and map to the Bloom's levels:
Directionality: Forward (analysis) vs. Inverse (synthesis). Levels 1-3 are primarily forward, 4-6 involve inverse reasoning.
Design Behavior: Static vs. Dynamic analysis. Higher levels involve reasoning over both static and dynamic behavior.
Design Scope: Closed-world (bounded decisions/requirements) vs. Open-world (underspecified objectives, conflicting constraints). Levels 1-4 are closed-world, Level 5 is semi-open-world, and Level 6 is fully open-world.
"We identify three major dimensions to measure the complexity of engineering problems."
"These levels can be located along the three dimensions for measuring the complexity of engineering problems identified earlier."
Domain-Specific Metadata Taxonomy: Complementing the cognitive hierarchy, a secondary taxonomy uses structured metadata tags (System Type, Design Scope, Domain, Modeling Requirements, Applicable Standards) to ensure comprehensive coverage and allow for targeted, context-aware evaluation generation.
"To achieve this, we complement the cognitive taxonomy discussed above with a secondary taxonomy that includes aspects of engineering domain coverage."
"These metadata tags serve not only as classification tools but also as dynamic control parameters for adaptive test generation pipelines."
Evaluation Generation and Scoring: The framework outlines how evaluations are generated using reusable templates combined with system context, cognitive targets, task types, and metadata tags. Scoring varies by level, ranging from automated methods (Levels 1-3, partial for 4-5) to human-in-the-loop review (Level 6). Simulation-based validation is crucial for higher levels.
"Each evaluation prompt is synthesized by selecting from structured templates that integrate: A System-level Context... A Cognitive Target... A Task Type..."
"Levels 1 to 3 (Remember, Understand, Apply) are largely objectively scorable... These levels typically involve deterministic or well-constrained answers, making full automation feasible."
"Level 5 (Create) involves open-ended design synthesis. Automated scoring here focuses on partial correctness and feasibility..."
"Level 6 (Reflect) At Level 6, an eAGI demonstrates meta-cognitive capabilities... These often require human-in-the-loop scoring..."
Application to Propeller-Motor Matching (eVTOL): The paper grounds the evaluation framework using the example of propeller-motor sizing and matching for an eVTOL aircraft, demonstrating sample questions and expected answers for each of the six cognitive levels in this specific domain.
"To ground our approach to evaluating eAGI, we chose an electric vertical takeoff and landing (eVTOL) aircraft as a representative example."
"Specifically, for this discussion we limit our scope to a propeller-motor sizing and matching problem for an eVTOL aircraft."
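As a flavor of what a Level 3 (Apply) question in this domain might test, here is a toy hover check based on actuator-disk momentum theory. The formula, figure of merit, and numbers are standard textbook assumptions, not taken from the paper's benchmark:

```python
import math

def hover_match(mass_kg: float, prop_diameter_m: float,
                motor_max_power_w: float, figure_of_merit: float = 0.7,
                rho: float = 1.225) -> bool:
    """Check whether a motor can hover one rotor's share of an eVTOL.

    Actuator-disk momentum theory gives the ideal hover power
    P = T**1.5 / sqrt(2 * rho * A); dividing by a figure of merit
    approximates real rotor losses. The motor "matches" the propeller
    if its maximum power covers the requirement.
    """
    thrust_n = mass_kg * 9.81                       # hover: thrust = weight
    disk_area = math.pi * (prop_diameter_m / 2) ** 2
    p_ideal = thrust_n ** 1.5 / math.sqrt(2 * rho * disk_area)
    p_required = p_ideal / figure_of_merit
    return motor_max_power_w >= p_required
```

For a 50 kg per-rotor share on a 1.5 m propeller, the required power works out to roughly 7.5 kW, so an 8 kW motor matches and a 5 kW motor does not; an eAGI at Level 3 should reach the same verdict, and at Level 5 should pick the motor in the first place.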
Most Important Facts:
P-1 AI's Core Mission: Build Engineering AGI (eAGI) for physical systems with an AI agent named Archie.
Leadership: Co-founded by Paul Eremenko (former CTO of Airbus/UTC, DARPA exec) and Aleksa Gordić (former DeepMind/Microsoft researcher, co-developer of llm.c).
Funding: Announced a $23 million seed round led by Radical Ventures, with participation from notable investors including Jeff Dean and Peter Welinder (OpenAI).
Initial Focus: Data center cooling systems, with planned expansion into other domains like industrial, building, automotive, heavy machinery, and aerospace/defense.
Data Strategy: Overcome data scarcity by creating large, synthetic physics-based and supply chain-informed datasets.
Evaluation Framework: Proposing a novel, extensible framework for eAGI evaluation based on a 6-level Bloom's Taxonomy adaptation for engineering tasks and a secondary domain-specific metadata taxonomy.
Cognitive Levels: The six levels are Remember, Understand, Apply, Analyze, Create, and Reflect, representing increasing complexity and generalization in engineering reasoning.
Evaluation Metrics: Range from objective scoring for lower levels to simulation-based validation and expert human review for higher, more open-ended tasks.
Implications:
P-1 AI's emergence signifies a focused effort on applying AI to the highly complex domain of physical systems engineering. Their approach to both data generation and evaluation highlights the unique challenges of this field compared to software or general language tasks. The proposed evaluation framework, grounded in established educational psychology but tailored to engineering, could provide a much-needed standard for benchmarking and developing eAGI systems. Successful development of Archie and similar eAGI agents could significantly accelerate innovation and productivity in various industrial sectors, potentially enabling the design and construction of previously unimaginable physical systems. The reliance on synthetic data and the need for human-in-the-loop evaluation at higher cognitive levels underscore the current limitations and ongoing research challenges in achieving truly general engineering intelligence.
Key Concepts
Detailed Review Topics
Quiz & Answer Key
Quiz
What is the primary goal of P-1 AI's Archie?
Who are the co-founders of P-1 AI?
What is the main obstacle P-1 AI addresses in training AI models for physical product domains?
What are some of the initial product domains where Archie will be deployed?
What does the arXiv paper discuss regarding eAGI?
What is eAGI, as defined in the arXiv paper?
Why are existing general AI benchmarks insufficient for evaluating eAGI?
According to the arXiv paper, how does Bloom's Taxonomy provide a more practical foundation for evaluating eAGI compared to models like the Dreyfus Model?
What are the three major dimensions used in the arXiv paper to measure the complexity of engineering problems?
Briefly describe the difference between Level 1 (Remember) and Level 2 (Understand) in the proposed eAGI evaluation framework.
Quiz Answer Key
The primary goal of P-1 AI's Archie is the cognitive automation of tasks that human engineers perform today when designing physical systems.
The co-founders of P-1 AI are Paul Eremenko (CEO), Aleksa Gordić (Head of AI), and Adam Nagel (Head of Engineering).
The main obstacle P-1 AI addresses is the scarcity of training data for creating AI models that reason over physical product domains.
Archie will initially be deployed to help engineer data center cooling systems, with plans to expand to industrial systems, building systems, automotive and heavy machinery, and aerospace and defense.
The arXiv paper discusses the challenges and proposes a framework for evaluating engineering artificial general intelligence (eAGI) agents.
eAGI is defined as a specialization of artificial general intelligence (AGI) capable of addressing a broad range of problems in the engineering of physical systems and associated controllers, excluding software engineering.
Existing general AI benchmarks are insufficient for evaluating eAGI because they test shallow factual knowledge or rely on rigid, domain-specific problems with fixed outputs, which do not capture the range of reasoning, modeling, and synthesis skills needed for effective engineering intelligence.
Bloom's Taxonomy provides a hierarchical, task-oriented structure that is more amenable to evaluating AI systems through benchmarkable assessments, unlike the Dreyfus model which requires subjective evaluation of behavior over time.
The three major dimensions are Directionality (forward vs. inverse reasoning), Design Behavior (static or dynamic), and Design Scope (closed world or open world).
Level 1 (Remember) involves the accurate recall of factual, domain-specific knowledge such as definitions or equations, while Level 2 (Understand) assesses the AI's ability to interpret and explain the structure and function of a given design, such as identifying components and their roles.
Essay Questions
Discuss the challenges P-1 AI faces in developing an engineering AGI like Archie, particularly concerning data scarcity and the need for fundamental breakthroughs, and explain how their approach of using synthetic datasets aims to mitigate these challenges.
Analyze the six levels of engineering cognition proposed in the arXiv paper based on Bloom's Taxonomy. Explain how these levels build upon each other and what each level signifies about an eAGI agent's capabilities in an engineering context.
Compare and contrast the evaluation methodologies for general LLMs, domain-specific AI benchmarks, and AGI evaluation as discussed in the arXiv paper. Explain why a specialized framework is necessary for evaluating eAGI.
Explain the role of the secondary taxonomy and metadata tagging in the eAGI evaluation framework presented in the arXiv paper. How do these tags contribute to the coverage, completeness, and sufficiency of evaluations, and how can they be used for targeted and adaptive testing?
Using the propeller-motor matching problem for an eVTOL aircraft as an example, illustrate how the proposed six-level eAGI evaluation framework can be applied. Provide hypothetical examples of questions and expected responses for at least three different cognitive levels, explaining how they align with the described capabilities for each level.
Glossary of Key Terms
AGI (Artificial General Intelligence): Artificial intelligence that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks at a human level or beyond.
Archie: The AI agent being developed by P-1 AI Inc. aimed at cognitive automation of tasks performed by human engineers when designing physical systems.
Bloom's Taxonomy: A hierarchical classification of cognitive skills used in educational psychology, adapted in the arXiv paper to evaluate the capabilities of eAGI agents. The levels discussed are Remember, Understand, Apply, Analyze, Create, and Reflect.
Cognitive Automation: The use of AI to perform tasks that require human-like cognitive abilities, such as reasoning, problem-solving, and decision-making, as opposed to merely automating repetitive or physical tasks.
eAGI (Engineering Artificial General Intelligence): A specialization of AGI capable of addressing a broad range of problems in the engineering of physical systems and associated controllers.
Multiphysics: The study and simulation of systems involving multiple interacting physical phenomena, such as thermal, electrical, structural, and fluid dynamics.
P-1 AI, Inc.: The company developing Archie with the aim of building engineering AGI for physical systems.
Physical Systems: Systems that exist and operate in the physical world, involving tangible components and interactions governed by physical laws.
Synthetic Design Datasets: Datasets created artificially to train AI models, especially in domains where real-world data is scarce or proprietary. In the context of P-1 AI, these datasets are physics-based and supply chain-informed.
SysML (Systems Modeling Language): A general-purpose modeling language for systems engineering applications, mentioned as a structured design artifact that eAGI agents should be able to work with and that can be evaluated in the proposed framework.
Timeline of Main Events
1956: Benjamin S. Bloom et al. publish "Taxonomy of Educational Objectives," which introduces Bloom's Taxonomy, a hierarchical framework for evaluating human learning. (Referenced in "https://arxiv.org/pdf/2505.10653")
1983: Jens Rasmussen publishes "Skills, rules, and knowledge; signals, signs, and symbols, and other distinctions in human performance models," introducing the Skill-Rule-Knowledge (SRK) framework. (Referenced in "https://arxiv.org/pdf/2505.10653")
1984: Patricia Benner et al. publish "From novice to expert : Excellence and power in clinical nursing practice," adapting the Dreyfus model for clinical nursing. (Referenced in "https://arxiv.org/pdf/2505.10653")
1986: Hubert Dreyfus and Stuart E. Dreyfus publish "Mind over machine," outlining the Dreyfus Model of Skill Acquisition. (Referenced in "https://arxiv.org/pdf/2505.10653")
2004: David C. Berliner publishes "Expert teachers: Their characteristics, development and accomplishments," describing the Berliner Model in pedagogy. (Referenced in "https://arxiv.org/pdf/2505.10653")
2014: Ben Goertzel publishes "Artificial general intelligence: concept, state of the art, and future prospects," discussing the concept and state of AGI. (Referenced in "https://arxiv.org/pdf/2505.10653")
2019: Alex Wang et al. publish "Superglue: A stickier benchmark for general-purpose language understanding systems," introducing the SuperGLUE benchmark for NLP. (Referenced in "https://arxiv.org/pdf/2505.10653")
Qiao Jin et al. publish "MedQA: A dataset for biomedical question answering," introducing the MedQA benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Qiao Jin et al. publish "PubmedQA: A dataset for biomedical research question answering," introducing the PubMedQA benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
2020: Dan Hendrycks et al. publish "Measuring massive multitask language understanding," introducing the MMLU benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
2021: Karl Cobbe et al. publish "Training verifiers to solve math word problems," introducing the GSM8K benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Dan Hendrycks et al. publish "Measuring mathematical problem solving with the math dataset," introducing the MATH benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
2022: Ilias Chalkidis et al. publish "Lexglue: A benchmark dataset for legal nlp tasks," introducing the LexGLUE benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Ankit Pal et al. publish "Medmcqa: A multi-choice benchmark for medical q&a," introducing the MedMCQA benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Aarohi Srivastava et al. publish "Beyond the imitation game: Measuring and extending the capabilities of language models," introducing the BIG-bench benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Yujia Li et al. publish "Competition-level code generation with alphacode," discussing Alphacode. (Referenced in "https://arxiv.org/pdf/2505.10653")
2023: Zhiwei Fei et al. publish "Lawbench: Evaluating legal knowledge and reasoning in llms," introducing the LawBench benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Alejandro Romero et al. publish "A perspective on lifelong open-ended learning autonomy for robotics through cognitive architectures," discussing cognitive architectures for robotics. (Referenced in "https://arxiv.org/pdf/2505.10653")
Carlos E. Jimenez et al. publish "Swe-bench: Can language models resolve real-world github issues?," introducing the SWE-bench benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Lianmin Zheng et al. publish "Judging llm-as-a-judge with mt-bench and chatbot arena," discussing LLM-as-a-judge. (Referenced in "https://arxiv.org/pdf/2505.10653")
Meredith Ringel Morris et al. publish "Levels of agi for operationalizing progress on the path to agi," proposing a framework for evaluating AGI progress. (Referenced in "https://arxiv.org/pdf/2505.10653")
2024: Cognition.ai introduces Devin, an autonomous AI software engineer. (Referenced in "https://arxiv.org/pdf/2505.10653")
Tristan Coignion et al. publish "A performance study of llm-generated code on leetcode," discussing LLM performance on LeetCode. (Referenced in "https://arxiv.org/pdf/2505.10653")
Md Ferdous Alam et al. publish "From automation to augmentation: Redefining engineering design and manufacturing in the age of nextgen-ai," discussing the role of AI in engineering. (Referenced in "https://arxiv.org/pdf/2505.10653")
Xiaoyu Zhang et al. publish "Sciknoweval: A multi-level scientific knowledge evaluation benchmark for large language models," introducing the SciKnowEval benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Yubo Wang et al. publish "Mmlu-pro: A more robust and challenging multi-task language understanding benchmark," introducing MMLU-Pro. (Referenced in "https://arxiv.org/pdf/2505.10653")
Francois Chollet et al. publish "Arc prize 2024: Technical report," discussing ARC-Challenge tasks. (Referenced in "https://arxiv.org/pdf/2505.10653")
Mingchen Zhuge et al. publish "Agent-as-a-judge: Evaluate agents with agents," discussing Agent-as-a-judge. (Referenced in "https://arxiv.org/pdf/2505.10653")
2025: Anna C. Doris et al. publish "Designqa: A multimodal benchmark for evaluating large language models’ understanding of engineering documentation," introducing the DesignQA benchmark. (Referenced in "https://arxiv.org/pdf/2505.10653")
Apr 28, 2025, 9:08 AM Eastern Daylight Time: P-1 AI, Inc. officially comes out of stealth mode, announces its aim to build engineering AGI for physical systems, and reveals a $23 million seed financing round led by Radical Ventures. The company introduces Archie, an AI agent targeting cognitive automation in engineering design. (Source: "P-1 AI Aims for Engineering AGI")
Later in 2025: P-1 AI plans to initially deploy Archie to help engineer data center cooling systems. (Source: "P-1 AI Aims for Engineering AGI")
May 2025 (arXiv submission date): A paper titled "On the Evaluation of Engineering Artificial General Intelligence" is submitted to arXiv, authored by several individuals associated with P-1 AI, including Sandeep Neema, Susmit Jha, Adam Nagel, Ethan Lew, Chandrasekar Sureshkumar, Aleksa Gordic, Chase Shimmin, Hieu Nguyen, and Paul Eremenko. The paper proposes a framework for evaluating eAGI agents grounded in Bloom's taxonomy. (Source: "https://arxiv.org/pdf/2505.10653")
Cast of Characters
Paul Eremenko: Co-founder and CEO of P-1 AI, Inc. Formerly served as CTO at Airbus and United Technologies Corporation (now RTX). He was also an engineering director at Google and headed DARPA's Tactical Technology Office, where he initiated advanced design tools and manufacturing initiatives. He co-founded Universal Hydrogen Co. (Source: "P-1 AI Aims for Engineering AGI" and "https://arxiv.org/pdf/2505.10653")
Aleksa Gordić: Co-founder and Head of AI at P-1 AI, Inc. Previously a research engineer at Google DeepMind and Microsoft. He was a main developer/top contributor to llm.c with Andrej Karpathy, founder of Runa AI, and creator of the AI Epiphany community. (Source: "P-1 AI Aims for Engineering AGI" and "https://arxiv.org/pdf/2505.10653")
Adam Nagel: Co-founder and Head of Engineering at P-1 AI, Inc. Formerly the director of engineering at Airbus-Silicon Valley, leading advanced design and manufacturing tools. He was also an associate director at the Model-Based Digital Thread Center of Competence at United Technologies and Raytheon, and CTO of MetaMorph Software, where he was a key architect of the DARPA-funded OpenMETA toolchain. (Source: "P-1 AI Aims for Engineering AGI" and "https://arxiv.org/pdf/2505.10653")
Sandeep Neema: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Susmit Jha: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Ethan Lew: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Chandrasekar Sureshkumar: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Chase Shimmin: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Hieu Nguyen: Co-author of the paper "On the Evaluation of Engineering Artificial General Intelligence" and affiliated with P-1 AI, Inc. (Source: "https://arxiv.org/pdf/2505.10653")
Molly Welch: Partner at Radical Ventures, the lead investor in P-1 AI's seed financing round. (Source: "P-1 AI Aims for Engineering AGI")
Jeff Dean: AI luminary and investor in P-1 AI's seed round. (Source: "P-1 AI Aims for Engineering AGI")
Peter Welinder: VP Product at OpenAI and investor in P-1 AI's seed round. (Source: "P-1 AI Aims for Engineering AGI")
Bob van Luijt: Co-Founder and CEO of Weaviate and investor in P-1 AI's seed round. (Source: "P-1 AI Aims for Engineering AGI")
Reid Hoffman: Chairman of Village Global, an early-stage investor in P-1 AI. (Source: "P-1 AI Aims for Engineering AGI")
Benjamin S. Bloom: Educational psychologist and lead author of "Taxonomy of Educational Objectives," a framework adapted by P-1 AI for evaluating engineering cognition. (Referenced in "https://arxiv.org/pdf/2505.10653")
Andrej Karpathy: Co-founder of OpenAI, with whom Aleksa Gordić collaborated on llm.c. (Source: "P-1 AI Aims for Engineering AGI")
Bill Gates: Investor backing Village Global, an investor in P-1 AI. (Source: "P-1 AI Aims for Engineering AGI")
Jeff Bezos: Investor backing Village Global, an investor in P-1 AI. (Source: "P-1 AI Aims for Engineering AGI")
Mark Zuckerberg: Investor backing Village Global, an investor in P-1 AI. (Source: "P-1 AI Aims for Engineering AGI")
FAQ
What is Engineering Artificial General Intelligence (eAGI)?
eAGI is a specialized form of Artificial General Intelligence (AGI) designed to address a wide array of problems in the engineering of physical systems and their associated controllers. Unlike general-purpose AI or narrow AI applications in engineering, eAGI aims for cognitive automation, possessing capabilities akin to human engineers such as background knowledge, familiarity with tools and processes, understanding of industrial components, and creative problem-solving. It explicitly excludes software engineering and instead focuses on physical system design.
What is the main goal of P-1 AI and their product, Archie?
P-1 AI aims to build engineering AGI for the physical world, specifically targeting the design and building of complex physical systems. Their initial product, Archie, is an AI agent focused on automating the cognitive tasks performed by human engineers today, such as distilling design drivers, developing concepts, performing design trades, and utilizing engineering tools. The long-term goal is to integrate an "Archie" onto every engineering team at major industrial companies, starting with assisting with dull and repetitive tasks and ultimately enabling the construction of systems currently beyond human capability.
What is the primary challenge in creating AI models for physical product domains, and how does P-1 AI address it?
The main obstacle is the scarcity of training data for AI models to reason over physical product domains. Unlike other fields with vast datasets, the number of unique physical designs is often much smaller, and existing designs are frequently proprietary and lack cohesive multi-physics models. P-1 AI tackles this by creating large, physics-based and supply chain-informed synthetic design datasets. They then train an AI architecture on this synthetic data, allowing it to implicitly learn the underlying physics and perform quantitative and spatial reasoning tasks.
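The synthetic-data idea above can be made concrete with a minimal sketch: sample design parameters, run a first-order physics model to label each sample, and collect the pairs as training data. Everything here is an illustrative assumption, not P-1 AI's actual pipeline: the single-equation heat-exchanger model (Q = U · A · ΔT), the parameter ranges, and the function names are all hypothetical.

```python
import random

def heat_exchanger_label(area_m2, u_w_per_m2k, delta_t_k):
    """First-order physics label: heat transfer rate Q = U * A * dT (watts)."""
    return u_w_per_m2k * area_m2 * delta_t_k

def generate_synthetic_designs(n, seed=0):
    """Sample hypothetical heat-exchanger designs and label each with physics."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        design = {
            "area_m2": rng.uniform(0.5, 20.0),        # heat-transfer surface area
            "u_w_per_m2k": rng.uniform(50.0, 500.0),  # overall heat-transfer coefficient
            "delta_t_k": rng.uniform(5.0, 40.0),      # temperature difference across exchanger
        }
        design["q_watts"] = heat_exchanger_label(
            design["area_m2"], design["u_w_per_m2k"], design["delta_t_k"]
        )
        dataset.append(design)
    return dataset

samples = generate_synthetic_designs(1000)
```

A model trained on such pairs never sees the governing equation explicitly; the claim is that with enough labeled samples it learns the underlying relationship implicitly, which is the essence of the physics-based synthetic-data strategy described above.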
How does P-1 AI plan to deploy Archie initially and expand its capabilities?
Archie will initially be deployed later this year to assist in engineering data center cooling systems. P-1 AI plans a rapid expansion to other product domains in the built and mobile world, including industrial and building systems, automotive and heavy machinery, and eventually aerospace and defense applications. This staged rollout allows Archie to learn from real-world feedback and data, progressively improving its capabilities.
What skillset is expected of an eAGI agent beyond general AI capabilities?
While general AI benchmarks focus on natural language understanding, reasoning, and code generation, eAGI requires a more specialized skillset for the engineering domain. This includes: accurate recall of engineering facts, formulas, and standards; familiarity with engineering tools and workflows; contextual understanding of components and design patterns; creative and adaptive reasoning to explore novel solutions; and the ability to collaborate and communicate effectively with human engineers. eAGI is expected to synthesize designs from requirements, conduct first-order sizing using tools, and understand multi-physics interactions.
How is engineering cognition evaluated in the proposed framework?
The proposed framework for evaluating eAGI agents uses a 6-level hierarchy grounded in Bloom's Taxonomy, specifically adapted for engineering tasks. These levels, from "Remember" to "Reflect," represent ascending competencies:
Remember: Accurate recall of factual engineering information.
Understand: Interpreting and explaining the structure and function of a design.
Apply: Applying engineering principles to evaluate or manipulate designs and utilize tools.
Analyze: Diagnosing and resolving design issues, including completing partially specified designs.
Create: Synthesizing new designs from requirements, including under novel constraints.
Reflect: Exhibiting meta-cognitive abilities, critiquing design decisions, and understanding limitations.
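Because the six levels form an ordered hierarchy, they lend themselves to a simple ordered-enum representation. The sketch below is an illustrative assumption, not the paper's implementation; the `needs_expert_review` cutoff reflects the framework's general point that higher, open-ended levels call for expert-in-the-loop scoring, but the exact threshold is hypothetical.

```python
from enum import IntEnum

class CognitiveLevel(IntEnum):
    """Six ascending competency levels adapted from Bloom's taxonomy."""
    REMEMBER = 1    # recall facts, formulas, standards
    UNDERSTAND = 2  # explain structure and function of a design
    APPLY = 3       # apply principles and tools to a design
    ANALYZE = 4     # diagnose issues, complete partial designs
    CREATE = 5      # synthesize new designs from requirements
    REFLECT = 6     # critique decisions, understand own limitations

def needs_expert_review(level: CognitiveLevel) -> bool:
    """Hypothetical policy: open-ended levels get expert-in-the-loop scoring."""
    return level >= CognitiveLevel.CREATE
```

Using `IntEnum` makes the ordering explicit, so "Create and above" is a plain comparison rather than a lookup table.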
What additional dimensions are used to measure the complexity of engineering problems in the evaluation framework?
Beyond the cognitive levels, the framework uses three dimensions to measure problem complexity:
Directionality: Whether the problem requires forward analysis (evaluating a design) or inverse reasoning (synthesizing a design from requirements). Inverse reasoning is typically more challenging.
Design Behavior: The level of detail regarding the behavior of the design, including static analysis (structure) and dynamic analysis (time-varying behavior), with dynamic being more complex.
Design Scope: Whether the design or evaluation problem is closed-world (bounded decisions and requirements) or open-world (underspecified objectives, conflicting constraints, inferred metrics), with open-world problems being more difficult.
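The three dimensions above can be read as coordinates locating a problem's difficulty; each axis has an easier and a harder pole. The sketch below is illustrative only: the field names and the crude counting heuristic are assumptions, not part of the published framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Complexity:
    """Three-axis difficulty coordinates for an engineering problem."""
    inverse: bool     # True = synthesize from requirements (harder than forward analysis)
    dynamic: bool     # True = time-varying behavior (harder than static structure)
    open_world: bool  # True = underspecified/conflicting constraints (harder than closed)

    def score(self) -> int:
        """Crude heuristic: count how many axes sit at their harder pole."""
        return int(self.inverse) + int(self.dynamic) + int(self.open_world)

easiest = Complexity(inverse=False, dynamic=False, open_world=False)  # score 0
hardest = Complexity(inverse=True, dynamic=True, open_world=True)     # score 3
```

A real framework would weight the axes rather than count them, but the point is that difficulty is multi-dimensional: a forward, static, closed-world question and an inverse, dynamic, open-world design task can sit at the same cognitive level yet differ sharply in complexity.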
How is the evaluation framework made comprehensive and adaptable?
The evaluation framework combines the 6-level cognitive taxonomy with a secondary taxonomy of domain-specific metadata tags. These tags include system type (e.g., eVTOL, HVAC), design scope (component, subsystem, system), domain (thermal, electrical), modeling requirements (steady-state, transient, multiphysics), and applicable standards (AHRI, ASME). This dual taxonomy allows for comprehensive coverage across engineering domains and cognitive levels. It also enables dynamic test generation and customization, allowing evaluations to be tailored to specific industries, physics domains, or cognitive difficulties. Scoring varies by level, from objective automated evaluation for lower levels to partial automation and expert-in-the-loop review for higher-level, open-ended tasks.
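Dynamic test generation from the dual taxonomy can be pictured as filtering a bank of tagged items, selecting by any combination of cognitive level and domain metadata. The sketch below is a hedged illustration: the field names, tag values, and sample prompts are invented for the example and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class TestItem:
    """One evaluation question, tagged along both taxonomies."""
    prompt: str
    cognitive_level: str  # "Remember" through "Reflect"
    system_type: str      # e.g. "HVAC", "eVTOL"
    domain: str           # e.g. "thermal", "electrical"
    open_world: bool      # underspecified objectives / conflicting constraints

def select_tests(bank, **criteria):
    """Keep items whose tags match every requested criterion."""
    return [item for item in bank
            if all(getattr(item, k) == v for k, v in criteria.items())]

bank = [
    TestItem("State the AHRI rating conditions.", "Remember", "HVAC", "thermal", False),
    TestItem("Size a chiller plant for a 2 MW data hall.", "Create", "HVAC", "thermal", True),
    TestItem("Explain this motor controller's topology.", "Understand", "eVTOL", "electrical", False),
]

# Tailor an evaluation to one industry and one cognitive level:
hvac_create = select_tests(bank, system_type="HVAC", cognitive_level="Create")
```

This is what "dynamic test generation and customization" amounts to operationally: the same tagged bank yields an HVAC-focused synthesis exam, a thermal-domain recall quiz, or an open-world subset, simply by varying the filter.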
Table of Contents with Timestamps
Introduction & Welcome
00:00 - Opening theme and mission statement for The Deep Dive
Defining the Challenge
00:39 - Introduction to Engineering Artificial General Intelligence (E-AGI) and the complexity of physical systems engineering
Current Tools & Limitations
01:26 - Overview of existing CAE and MBSE tools and their constraints in modern engineering practice
The E-AGI Vision
02:18 - Moving from calculation to cognition: What E-AGI aims to achieve beyond traditional software tools
Core Characteristics of E-AGI
03:10 - Five essential capabilities: background knowledge, tool familiarity, contextual understanding, creative reasoning, and collaboration
Evaluation Framework: Bloom's Taxonomy for Engineering
07:04 - Six-level hierarchy for assessing E-AGI cognitive abilities, from basic recall to meta-reasoning
Complexity Dimensions
09:18 - Three dimensions of problem complexity: directionality, behavior type, and design scope
Metadata & Systematic Testing
10:07 - Tags and classification systems for comprehensive E-AGI evaluation across engineering domains
Scoring Methodologies
11:59 - From automated assessment to expert human judgment across different cognitive levels
P-1 AI & Archie: Theory to Practice
13:07 - Real-world implementation of E-AGI concepts with focus on multi-physics reasoning
Data Challenges & Solutions
14:09 - Addressing scarcity of engineering training data through synthetic dataset generation
Future Vision & Human-AI Partnership
16:44 - The evolving role of engineers in an E-AGI-enabled world
Closing Thoughts
17:25 - Reflections on boundary dissolution and adaptive complexity in modern engineering
Index with Timestamps
A
Adaptive complexity, 17:31
Aerospace, 15:04
AGI (Artificial General Intelligence), 02:47
Airbus, 13:20
Analysis, 05:11, 08:06
Apply (Bloom's level), 07:45, 11:26
Archie, 13:11, 16:51
Artificial General Intelligence, 02:47
Assessment framework, 16:39
Automated scoring, 12:07
B
Background knowledge, 03:12
Bloom's taxonomy, 03:28, 07:09
Boundary dissolution, 17:31
C
CAE (Computer Aided Engineering), 01:31
Cognitive automation, 02:40, 06:05
Collaboration, 05:29
Complexity dimensions, 09:18
Computer aided engineering, 01:31
Contextual understanding, 04:26
Create (Bloom's level), 08:24, 11:34
Creative reasoning, 04:51
D
DARPA, 13:20
Data center cooling, 14:57
DeepMind, 13:20
Design scope, 09:38
Design synthesis, 08:28
Directionality, 09:23
E
E-AGI (Engineering AGI), 00:46, 02:47
Embodied knowledge, 17:31
Engineering, 00:39, 07:09
F
Framework, 07:04, 13:03
H
HVAC, 10:21, 11:44
Human-AI partnership, 17:08
L
Large language models, 06:08
LLMs, 06:19
M
MBSE (Model Based Systems Engineering), 01:34
Meta reasoning, 09:07
Microsoft, 13:20
Multi-physics, 13:39
P
P-1 AI, 13:09, 16:22
Physical systems, 00:39, 01:10
Physics principles, 14:40
Q
Quantum-like uncertainty, 17:31
R
Recall, 03:12, 07:26
Reflect (Bloom's level), 09:07, 11:44
S
Simulation, 12:24
Spatial reasoning, 13:39
Synthetic data, 14:32
T
Trade-offs, 05:02
U
Understand (Bloom's level), 07:33
United Technologies, 13:20
Poll: The Future of Engineering with AI
Post-Episode Fact Check
✅ VERIFIED CLAIMS
P-1 AI Company Information
Funding: $23 million seed funding - VERIFIED through multiple venture capital reporting sources
Co-founder backgrounds: Team includes veterans from Airbus, United Technologies, DARPA, DeepMind, and Microsoft - VERIFIED through LinkedIn profiles and company announcements
AI system name: "Archie" - VERIFIED through company press releases
Focus area: Multi-physics reasoning and spatial understanding for physical systems - VERIFIED through company materials
Bloom's Taxonomy
Six levels: Remember, Understand, Apply, Analyze, Create, Reflect - PARTIALLY VERIFIED: the first five map onto the revised Bloom's Taxonomy, whose standard sequence is Remember, Understand, Apply, Analyze, Evaluate, Create; "Reflect" is the paper's engineering-specific adaptation, not part of the standard taxonomy
Educational framework: Original taxonomy published by Benjamin Bloom and colleagues in 1956, revised in 2001, and widely used in education - VERIFIED historical fact
Engineering Tools Referenced
CAE (Computer Aided Engineering): Established field using computational methods for engineering analysis - VERIFIED
MBSE (Model Based Systems Engineering): Standard methodology using digital models as central blueprint - VERIFIED through engineering literature
⚠️ CLAIMS REQUIRING CONTEXT
E-AGI as Distinct Field
The podcast presents E-AGI as an emerging, well-defined field with established terminology
Context: While the concepts discussed are real research areas, "E-AGI" as a formal field designation appears to be primarily from the specific research paper referenced, not universally adopted terminology in AI/engineering communities
Data Scarcity Claims
Claim: Engineering data is scarce compared to text/image data for AI training
Context: TRUE but nuanced - engineering design data is largely proprietary and scarce in public corpora, though significant engineering datasets do exist in academic and government repositories
Current LLM Limitations
Claim: LLMs lack structured understanding of physics and causal reasoning
Context: ACCURATE but evolving - this reflects current LLM limitations, though recent models show improving capabilities in mathematical and scientific reasoning
🔍 UNVERIFIED/SPECULATIVE CLAIMS
Synthetic Data Generation Strategy
P-1 AI's approach to generating physics-based synthetic training data
Status: Company strategy as stated, but specific implementation details and effectiveness not independently verified
Timeline and Capabilities
Specific claims about Archie's current capabilities in multi-physics reasoning
Status: Based on company statements; independent performance validation not publicly available
Market Impact Predictions
Claims about transformative potential for engineering practice
Status: Reasonable extrapolation based on current trends, but inherently speculative
📊 TECHNICAL ACCURACY CHECK
Physics and Engineering Concepts
✅ Multi-physics coupling effects (temperature affecting structural strength)
✅ Design iteration challenges in physical systems
✅ Complexity of constraint management in engineering
✅ HVAC system complexity as test case
AI/ML Concepts
✅ Distinction between narrow AI tools and general reasoning systems
✅ Training data requirements for specialized AI systems
✅ Challenges in evaluating AI reasoning capabilities
🎯 METHODOLOGY NOTES
Sources Verified:
Company websites and press releases
Academic papers on engineering AI
Venture capital reporting databases
Educational psychology literature
Engineering professional publications
Limitations:
Some claims based on single research paper (evaluation framework)
Company performance claims not independently validated
Field terminology may not be universally adopted
Future predictions inherently speculative
📝 OVERALL ASSESSMENT
Factual Accuracy: HIGH
Core technical concepts accurately presented
Company information verified through multiple sources
Educational frameworks correctly described
Speculation Level: MODERATE
Clear distinction maintained between current capabilities and future potential
Reasonable extrapolations based on stated company goals
Appropriately presents emerging field status
Bias Check: MINIMAL
Balanced presentation of challenges and opportunities
Multiple perspectives on human-AI collaboration
Acknowledges limitations alongside potential benefits
Image (3000 x 3000 pixels)
Mind Map
"In retrospect, the bigger question wasn't whether AI could think like engineers, but whether engineers were ready to think like AI."