PAX - Proactive Agent for eXemplary Trustworthiness

Warning - Please Read

As of October 2024, PAX is in private R&D, so I cannot share all of the details and code. I can, however, give a general overview of what it is and how it works.

Background

PAX (Proactive Agent for eXemplary Trustworthiness) began life as a major pivot from my earlier project, TARS, which targeted autonomous cybersecurity penetration testing. As I was developing TARS, it became clear that one of the largest barriers to practical, reliable AI agents was not just task automation, but establishing the trustworthiness of an AI-generated response, especially when those outputs can impact real-world decisions with massive consequences.

Rather than just automating cybersecurity penetration testing tasks with TARS, I wanted to address a fundamental problem: How do we know we can trust what an LLM says?

TARS was developed as an MVP for my first startup, Osgil, which I co-founded. Our goal was to automate cybersecurity penetration testing using AI agents, and TARS enabled us to secure pre-seed funding from the Forum Ventures accelerator. However, when we approached defense and cybersecurity companies, we discovered that those organizations did not trust AI agents to perform and report on critical tasks like penetration testing. Almost all of them also preferred to do business with established cybersecurity firms so that, if something went wrong, they had a fall guy as a form of insurance. In other words, the decision-makers at these companies did not care much about security unless they had to, and when they did, part of their criteria was having someone else to blame. As of late 2024, automated AI-powered cyber attacks are still not a major concern, so decision-makers didn't see a real need for our solution. Due to this lack of market demand, we pivoted to focusing on reducing hallucinations in LLMs. By improving LLM reliability, we believe our work can benefit a wide range of future AI agent applications beyond cybersecurity.

A Nudge from Transformers Lore

The name PAX is a nod to the Transformers universe. Before becoming the iconic Optimus Prime, the character’s original name was Orion Pax. This idea of transformation, from possibility to responsibility, inspired PAX’s mission of moving from raw, impressive LLM capability to something trustworthy enough to be truly relied upon.

Project Vision

PAX is a research agent and framework that systematically:

  • Measures the trustworthiness of any LLM response.
  • Reduces hallucinations and unsupported statements.
  • Forces and tracks attribution to verifiable sources.
  • Provides explainable, structured reports scoring both responses and claims.

The aim of this project is to make LLMs not just plausible, but provably trustworthy, with transparent measures of risk and confidence.

Overview Of How PAX Works

1. Enforced Attribution

For any user query, PAX routes the prompt through an agent that strictly distinguishes between common knowledge and information needing validation. When the response contains facts or claims not widely considered common knowledge (such as statistics or recent events), PAX ensures the agent retrieves and refers to trusted, up-to-date external sources.

Pseudo-process:

  • If claim is not common knowledge → run external search APIs
  • Collect results, map every important statement to relevant references
  • Insert structured placeholders in the response (not plain URLs or raw footnotes)
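
To make this concrete, here is a minimal Python sketch of the idea, not PAX's actual code. The claim classifier and search API are passed in as plain callables (whatever PAX really uses is not shown here), and citations are attached as structured placeholders rather than raw URLs.

```python
# Sketch of the enforced-attribution step. `is_common_knowledge` and `search`
# stand in for the real classifier and search API; they are injected as callables.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AttributedStatement:
    text: str                                             # statement as it will appear in the answer
    references: List[str] = field(default_factory=list)   # supporting URLs, empty if common knowledge

def attribute_response(
    statements: List[str],
    is_common_knowledge: Callable[[str], bool],
    search: Callable[[str], List[str]],
) -> List[AttributedStatement]:
    attributed = []
    for i, statement in enumerate(statements):
        if is_common_knowledge(statement):
            # Common knowledge passes through without citations.
            attributed.append(AttributedStatement(statement))
            continue
        refs = search(statement)  # e.g. a web/news search API call
        # Structured placeholder instead of a raw URL or footnote, e.g. "{{ref:3}}".
        attributed.append(AttributedStatement(f"{statement} {{{{ref:{i}}}}}", refs))
    return attributed
```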

2. Probabilistic Confidence Scoring

PAX doesn’t just rely on human intuition. It measures how “confident” the language model was in generating each part of its answer, by analyzing the inner probabilities used during text generation. This allows the system to assign a numeric trust score to each sentence, and to the answer as a whole. Low-confidence areas can thus be automatically flagged.

Pseudo-process:

  • For each response token/word, retrieve the model’s probability for that choice
  • Aggregate across sentences
  • Produce per-sentence and overall trust/reliability scores
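
As an illustration of this scoring, the sketch below turns per-token log-probabilities (as exposed by many LLM APIs) into per-sentence and overall scores. The geometric-mean aggregation and the 0.5 flagging threshold are illustrative choices, not necessarily the ones PAX uses.

```python
import math
from typing import Dict, List

def sentence_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean probability of a sentence's tokens,
    i.e. exp of the mean log-probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def score_response(sentences: List[List[float]]) -> Dict:
    """`sentences` holds one list of token log-probabilities per sentence."""
    per_sentence = [sentence_confidence(s) for s in sentences]
    overall = sum(per_sentence) / len(per_sentence) if per_sentence else 0.0
    # Flag low-confidence sentences; 0.5 is an illustrative threshold.
    flagged = [i for i, c in enumerate(per_sentence) if c < 0.5]
    return {"per_sentence": per_sentence, "overall": overall, "flagged": flagged}
```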

3. Observed Consistency

Instead of accepting one answer, PAX asks the LLM the same question multiple times, using embeddings (vector representations of meaning) to measure agreement and consistency between plausible responses.

  • High agreement suggests the answer is robust/stable
  • Widely-varying responses are warning signs: possible risk or ambiguity

Pseudo-process:

  • Send the question to the LLM multiple times; collect responses
  • Compute semantic similarity scores between outputs
  • Report a “consistency score” for the user
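
A minimal sketch of this consistency check follows, assuming the caller supplies an LLM sampling function (run with non-zero temperature) and an embedding model; the sample count and similarity measure PAX actually uses may differ.

```python
import itertools
from typing import Callable, Sequence
import numpy as np

def consistency_score(
    prompt: str,
    sample_llm: Callable[[str], str],         # LLM call with non-zero temperature
    embed: Callable[[str], Sequence[float]],  # sentence-embedding model
    n_samples: int = 5,
) -> float:
    """Average pairwise cosine similarity across repeated answers to the same prompt."""
    responses = [sample_llm(prompt) for _ in range(n_samples)]
    vectors = [np.asarray(embed(r), dtype=float) for r in responses]
    sims = [
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in itertools.combinations(vectors, 2)
    ]
    return sum(sims) / len(sims)
```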

4. Self-Assessment

PAX optionally asks another LLM (or ensemble) to review the entire interaction, citations, and probability scores, and give its own final verdict, both as a number (0-1) and a narrative explanation. This adds a meta layer of self-reflection.

Pseudo-process:

  • Feed conversation/report to an assessment agent (different model)
  • Agent critiques factuality, coherence, citation integrity, and confidence
  • Outputs a final trust score with explanation for auditability
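
The sketch below shows one way such a meta review could be wired up: a separate judge model is asked for a JSON verdict, which is parsed into a numeric score and an explanation. The prompt wording and JSON format are assumptions for illustration, not PAX's actual interface.

```python
import json
from typing import Callable, Tuple

REVIEW_PROMPT = (
    "You are auditing another model's answer. Given the conversation, its "
    "citations, and its confidence scores, rate overall trustworthiness from "
    "0 to 1 and explain your reasoning. "
    'Reply as JSON: {"score": <float>, "explanation": <string>}'
)

def self_assess(report: str, judge_llm: Callable[[str, str], str]) -> Tuple[float, str]:
    """`judge_llm(system_prompt, user_content)` should call a *different* model
    (or ensemble) than the one that produced the answer."""
    raw = judge_llm(REVIEW_PROMPT, report)
    verdict = json.loads(raw)  # assumes the judge replies with valid JSON
    return float(verdict["score"]), str(verdict["explanation"])
```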

Interaction Flow

The interaction flow of PAX goes as follows:

  • User sends a prompt.
  • PAX agent processes the prompt, consults external APIs as needed, and builds a response with structured attributions.
  • The system:
    • Assigns per-statement trust/confidence scores
    • Logs which parts are supported by which evidence
    • Optionally, generates a self-reflective summary and trust score

The result is a highly transparent answer with a numerical score and linked references, along with an auditable record of all supporting data.
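
Putting the pieces together, the following sketch shows how the four components could be orchestrated into a single auditable report. The signatures are hypothetical; the arguments simply mirror the steps sketched in the sections above.

```python
from typing import Any, Callable, Dict, Tuple

def answer_with_trust_report(
    prompt: str,
    generate: Callable[[str], str],                 # base LLM call that drafts an answer
    attribute: Callable[[str], Any],                # step 1: enforced attribution
    score_confidence: Callable[[str], Dict],        # step 2: probabilistic confidence scoring
    measure_consistency: Callable[[str], float],    # step 3: observed consistency
    judge: Callable[[str], Tuple[float, str]],      # step 4: self-assessment
) -> Dict[str, Any]:
    draft = generate(prompt)
    report: Dict[str, Any] = {
        "prompt": prompt,
        "answer": attribute(draft),                 # answer with structured citations
        "confidence": score_confidence(draft),      # per-sentence and overall scores
        "consistency": measure_consistency(prompt), # agreement across repeated answers
    }
    final_score, explanation = judge(str(report))   # meta review of the whole record
    report["final_trust_score"] = final_score
    report["explanation"] = explanation
    return report                                   # auditable record of all supporting data
```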

Inspiration

The methods used to make PAX work were heavily inspired by the work done by CleanLabs, specifically their scoring algorithm/method as detailed HERE. Within this algorithm/method, the following is utilized:

  1. Self-Reflection: This is a process in which the LLM is asked to explicitly rate the response and state how confident it is that the response is good.

  2. Probabilistic Prediction: This is “a process in which we consider the per-token probabilities assigned by a LLM as it generates a response based on the request (auto-regressively token by token)”.

  3. Observed Consistency: This scoring is a process in which the LLM probabilistically generates multiple plausible responses it thinks could be good, and we measure how contradictory these responses are to each other (or to a given response).
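
For illustration only, one simple way to fold these three signals into a single trust score is a weighted average of normalized components; CleanLabs' published method and PAX's internal aggregation may combine them differently.

```python
def combined_trust_score(
    self_reflection: float,   # 0-1 rating from the self-reflection step
    probabilistic: float,     # 0-1 score from per-token probabilities
    consistency: float,       # 0-1 agreement across sampled responses
    weights: tuple = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Illustrative aggregation: a plain weighted average of the three signals,
    each assumed to already be normalized to [0, 1]."""
    w1, w2, w3 = weights
    return w1 * self_reflection + w2 * probabilistic + w3 * consistency
```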

Why Does This Matter?

Traditional LLM deployments can hallucinate facts or give outdated, believable-but-false information. For mission-critical uses such as research, healthcare, and legal or technical advice, unverifiable AI is simply not good enough.

PAX aims to make trust in AI measurable and explainable. Its approach:

  • Demands “show-your-work” evidence for nontrivial claims.
  • Quantifies how much confidence to place in every output.
  • Allows users to audit and understand why an answer should (or shouldn’t) be trusted.

Research Status & Next Steps

PAX is currently in active development as a private research project under the umbrella of Osgil. Key focuses include:

  • Reducing latency of external searches and scoring.
  • Experimenting with user-perception vs. automated trust scores.
  • Building domain-specific plugins for science, news, and regulatory uses.
  • Preparing benchmark datasets for open research and possible release.

Final Words

PAX is about transforming LLMs from “black box plausible generators” into transparent, citable, and numerically trustworthy assistants, which is crucial for real-world, high-stakes tasks.

If you’re interested in collaboration, audits, or want to discuss the future of trustworthy generative AI, please reach out. Thank you for reading!