Stop tuning prompts by gut feel. globalMOO treats the LLM as a black-box twin — turning prompt engineering into a measurable, multi-objective optimization problem.
Production LLM prompts have dozens of implicit dials — formality, brevity, empathy, assertiveness, technical depth — and each response is judged on dozens of quality metrics. Teams currently tune these by hand, one A/B at a time, with no way to see how the dials trade off against each other. globalMOO replaces that workflow with a learned surrogate model that solves the inverse problem: given the response qualities you want, find the persona parameters that produce them.
What used to be weeks of manual A/B testing, with no guarantee of finding a globally good configuration, collapses to a single optimization run that pursues every quality target at once and makes visible the trade-offs that one-variable-at-a-time tuning hides.
The calibrator wraps any LLM in the same twin abstraction globalMOO uses for chemical reactors and solar plants: parameter vectors go in, measurable outputs come out. globalMOO learns the mapping — including the messy, nonlinear interactions that make manual prompt tuning so frustrating — and then inverts it to recommend the exact persona configuration for your goals.
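The twin idea can be illustrated with a toy example. The sketch below is not globalMOO's algorithm: `toy_twin` is a hypothetical stand-in for an LLM evaluation, and `invert_by_grid` is a deliberately naive brute-force search used only to show what "inverting the mapping" means. All names and coefficients are assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Sequence

# A "twin" here is just a function from a parameter vector to an output vector.
Twin = Callable[[Sequence[float]], list]


def toy_twin(params: Sequence[float]) -> list:
    """Hypothetical forward model: (formality, brevity) -> (tone score, word count)."""
    formality, brevity = params
    # The formality * brevity term mimics a nonlinear interaction between dials.
    tone = 0.8 * formality + 0.2 * formality * brevity
    length = 400.0 * (1.0 - brevity) + 50.0 * formality
    return [tone, length]


def invert_by_grid(twin: Twin, targets, scales, n_params=2, steps=21):
    """Naive inverse: return the grid point whose outputs land closest to the targets."""
    grid = [i / (steps - 1) for i in range(steps)]

    def loss(point):
        outputs = twin(point)
        # Scale each error so outputs with different units are comparable.
        return sum(((o - t) / s) ** 2 for o, t, s in zip(outputs, targets, scales))

    return min(product(grid, repeat=n_params), key=loss)


# Ask for a tone score near 0.5 and a reply of about 200 words.
best = invert_by_grid(toy_twin, targets=[0.5, 200.0], scales=[1.0, 400.0])
print(best, toy_twin(best))
```

A grid search like this scales exponentially with the number of parameters, which is why a real system needs a learned model rather than exhaustive evaluation.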
Inputs: continuous persona parameters such as formality, brevity, assertiveness, empathy, technical depth, urgency, friendliness, structure, and any other prompt dimension you can describe. Each parameter has natural-language anchors at calibration points along its range.
Outputs: self-reported scores extracted from the LLM’s JSON response (confidence, word count, sentiment) plus rubric-graded quality metrics scored by a separate grader LLM (tone, clarity, safety, helpfulness, factual accuracy).
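As a rough illustration of the self-reported half of that pipeline, the sketch below pulls numeric fields out of a JSON object embedded in a model reply. The field names (`confidence`, `word_count`, `sentiment`) and the function `extract_self_reported` are assumptions chosen to mirror the examples above, not the calibrator's actual schema or API.

```python
import json


def extract_self_reported(raw_response, keys=("confidence", "word_count", "sentiment")):
    """Return the requested numeric fields from the JSON object in a model reply."""
    # Models often wrap JSON in extra prose, so slice from the first "{" to the last "}".
    start, end = raw_response.find("{"), raw_response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    payload = json.loads(raw_response[start:end + 1])
    # Missing keys are skipped rather than guessed.
    return {key: float(payload[key]) for key in keys if key in payload}


raw = (
    "Here is my answer.\n"
    '{"reply": "Sorry for the delay.", "confidence": 0.82, "word_count": 57, "sentiment": 0.4}'
)
scores = extract_self_reported(raw)
print(scores)
```

Rubric-graded metrics would come from a second model call and are not shown here.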
Real telemetry from a 20-iteration calibration run of a customer-service persona under deliberately aggressive target bands. The optimizer moves all seven quality dimensions in parallel — response length, formality, emotional intensity, solution specificity, response efficiency, apology calibration, and customer empowerment — instead of optimizing one at a time and praying nothing regresses.
Manual prompt engineering optimizes one trait at a time and hopes the others don’t regress. globalMOO orchestrates every persona dial simultaneously while monitoring every quality metric in parallel.
The Persona Calibrator ships with six fully specified problems covering the most common production prompt patterns, ready to run, modify, or use as templates for your own:
Professional email drafting calibrated for tone, structure, and length across recipient contexts.
Support agent persona balancing empathy, resolution speed, and policy compliance.
20-input, 50-output content generator tuned for brand voice and conversion-oriented metrics.
Clinical decision support calibrated for accuracy, caution, and escalation behavior.
Moderation policy tuned for precision, recall, and explanation quality.
Safety review agent calibrated to surface risks while keeping false positives low.
A full optimization run costs less than lunch, without sacrificing rigor. Built-in Anthropic Batch API support, prompt caching, and Haiku-class graders keep the per-run bill to a few dollars even for 50-objective problems.
A browser-based workbench wraps the entire workflow — problem authoring, prompt preview, evaluation, training, optimization, and run inspection — behind a guided UI. Every screen exposes the same primitives the CLI uses, so analysts and engineers work from the same source of truth.
Drag the parameter sliders and watch the system prompt regenerate in real time — including natural-language interpolation between calibration anchors, so the model sees prose like “between professional-friendly and formal” rather than raw numbers.
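One way to picture the anchor interpolation is the sketch below, which turns a slider value into a phrase based on its two nearest anchors. The anchor positions, the "casual" label, the snapping tolerance, and the function name `describe` are all hypothetical; the workbench's real wording rules may differ.

```python
from bisect import bisect_right

# Hypothetical anchors for a "formality" dial on [0, 1]; labels are illustrative.
FORMALITY_ANCHORS = [(0.0, "casual"), (0.5, "professional-friendly"), (1.0, "formal")]


def describe(value, anchors, snap=0.05):
    """Render a numeric dial value as prose relative to its surrounding anchors."""
    positions = [pos for pos, _ in anchors]
    if not positions[0] <= value <= positions[-1]:
        raise ValueError("value lies outside the anchored range")
    # Values close enough to an anchor simply use that anchor's label.
    for pos, label in anchors:
        if abs(value - pos) <= snap:
            return label
    # Otherwise locate the anchors on either side of the value.
    hi = bisect_right(positions, value)
    (lo_pos, lo_label), (hi_pos, hi_label) = anchors[hi - 1], anchors[hi]
    frac = (value - lo_pos) / (hi_pos - lo_pos)
    phrase = f"between {lo_label} and {hi_label}"
    if frac < 1 / 3:
        return f"{phrase}, closer to {lo_label}"
    if frac > 2 / 3:
        return f"{phrase}, closer to {hi_label}"
    return phrase


print(describe(0.75, FORMALITY_ANCHORS))
```

The phrase would then be substituted into the system prompt template in place of the raw number.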
Each output metric can be targeted, minimized, maximized, or left unconstrained, with explicit tolerance bounds. Configuration is saved back to YAML with comments and formatting preserved — never a lossy round-trip.
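The four objective modes could be modeled roughly as follows. This is a sketch under stated assumptions: the `Mode` and `Objective` names are invented for illustration, and treating the tolerance as slack around a goal value for minimize and maximize is one plausible reading, not the documented semantics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    TARGET = "target"
    MINIMIZE = "minimize"
    MAXIMIZE = "maximize"
    UNCONSTRAINED = "unconstrained"


@dataclass(frozen=True)
class Objective:
    """Hypothetical per-metric objective; field names are illustrative."""
    name: str
    mode: Mode
    goal: Optional[float] = None
    tolerance: float = 0.0

    def is_met(self, value: float) -> bool:
        if self.mode is Mode.UNCONSTRAINED:
            return True  # nothing to satisfy
        if self.goal is None:
            raise ValueError(f"{self.name}: mode {self.mode.value} needs a goal")
        if self.mode is Mode.TARGET:
            # Inside the band [goal - tolerance, goal + tolerance].
            return abs(value - self.goal) <= self.tolerance
        if self.mode is Mode.MINIMIZE:
            # Assumed reading: at or below the goal, with tolerance as slack.
            return value <= self.goal + self.tolerance
        # MAXIMIZE, assumed reading: at or above the goal, with tolerance as slack.
        return value >= self.goal - self.tolerance


length = Objective("response_length", Mode.TARGET, goal=120, tolerance=20)
print(length.is_met(135), length.is_met(150))
```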
The Inspector lets you replay any past run iteration by iteration: the exact system prompt, the LLM’s raw response, the extracted output values, the grader’s reasoning, and the resulting scores. Reproducibility is not a bolt-on — it’s the default.