From Phi-3 to Hermes: A Better Bot on a Better Box
A few days ago I wrote up the project of loading the MCIv1 framework into a small local AI: a Phi-3 model running on a modest Elestio VM. It worked, but the seams showed. Four CPUs and 8 GB of RAM were genuinely too tight for what the bot was being asked to do, and Phi-3 at 3.8 billion parameters had a habit of confidently announcing itself as "developed by Microsoft" no matter what the system prompt said. The result was usable but not impressive. I left that post saying that any further improvement ran into the same wall (no GPU, modest RAM, small model) and that whether to spend money knocking the wall down depended on what the bot was actually for.
This is the log of knocking the wall down.
The upgrade
I doubled the hardware: from 4 CPUs / 8 GB RAM to 8 CPUs / 16 GB RAM. Same Elestio service, just resized. Everything else stayed in place — the Docker containers, the Open WebUI front end, the Ollama service all came back up after the resize without me having to reconfigure anything. The framework Modelfile I'd built was still on disk. From a "do I need to rebuild everything?" point of view, it was a clean upgrade.
The bigger move was swapping the model. With proper RAM headroom I could finally run an 8-billion-parameter model rather than being stuck at 3.8B. I chose Hermes 3 — specifically the Llama 3.1 8B fine-tune by Nous Research. The reason wasn't size for its own sake. Hermes is built around a specific design ethos: steerability. It's tuned to follow user-supplied system prompts faithfully, rather than imposing its own alignment on top. Most commercial models have heavy safety training baked in that fights any framework you try to layer on. Hermes was designed not to do that — and given that I'd spent half the previous build fighting Phi-3's "I am an AI developed by Microsoft" reflex, a model designed to be steered rather than to lecture sounded exactly right.
Pulling Hermes took a few minutes. The vanilla model gave a coherent answer on the first try, responded in reasonable time, and ran comfortably within the new memory limits: about 5.5 GB resident out of 16 GB, leaving roughly 10 GB free.
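For anyone who wants to repeat the "does the vanilla model answer, and how fast" check outside Open WebUI, here is a minimal sketch that talks to Ollama's local HTTP API directly. The model tag hermes3:8b and the helper names build_request, summarise and ask are my assumptions for illustration; check `ollama list` for the tag actually installed on your box.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "hermes3:8b"  # assumed tag; confirm with `ollama list`


def build_request(prompt: str, model: str = MODEL_TAG) -> dict:
    """Build the JSON body for one non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def summarise(reply: dict) -> dict:
    """Extract the answer text and rough timing from Ollama's reply.

    Ollama reports durations in nanoseconds, so divide by 1e9 for seconds.
    """
    seconds = reply.get("total_duration", 0) / 1e9
    tokens = reply.get("eval_count", 0)
    eval_seconds = reply.get("eval_duration", 0) / 1e9
    rate = tokens / eval_seconds if eval_seconds else 0.0
    return {
        "text": reply.get("response", ""),
        "seconds": seconds,
        "tokens_per_second": rate,
    }


def ask(prompt: str, model: str = MODEL_TAG) -> dict:
    """Send one prompt to the local Ollama service and summarise the reply."""
    body = json.dumps(build_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # Generous timeout: CPU-only inference can take minutes on long prompts.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return summarise(json.loads(resp.read()))
```

Calling ask("When was the Battle of Hastings?") on the VM returns the answer text along with the wall-clock seconds and tokens per second, which makes before-and-after comparisons across hardware a little less anecdotal.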
What got better immediately
The speed change was real and immediate. On the old box, an MCI-loaded answer would land in 30 seconds to several minutes; toward the end of a long conversation, it could be eight minutes or more. On the new box, the same sort of answer comes back in 20 to 60 seconds, and the slowdown-over-time that used to make long sessions unusable is much gentler.
The quality change was bigger and more interesting. Ask the old Phi-3 bot whether stealing is OK, and you got a flat moral verdict: "Stealing is illegal and unethical." No edges, no acknowledgement of the genuinely contested cases. Ask Hermes the same thing and you get something like: "Most ethical traditions converge on stealing being generally wrong because it violates trust and others' legitimate claims. But there are edges where reasonable people disagree — stealing to feed a starving family, civil disobedience, whistleblowing, contested definitions of property." That answer — consensus position followed by genuine edges — is what the framework's own logic recommends, and Hermes lands it almost naturally where Phi-3 had to be explicitly instructed and still often failed.
Some of the old failure modes vanished entirely. The Microsoft-authorship leak that haunted the Phi-3 build doesn't surface on Hermes — partly because Hermes simply isn't a Microsoft-trained model, and partly because when Hermes does drift toward an inaccurate self-identification (mine claimed once to be "created by Google"), telling it explicitly what it is in the system prompt actually works. With Phi-3 you could write the rule and it would ignore you; Hermes follows the rule.
What surprised me
One thing didn't translate cleanly across the model swap. The Phi-3 prompt I'd built was about 1,800 tokens of pretty forceful instruction — NEVER do X, ALWAYS do Y — because Phi-3 only followed instructions when shouted at. When I ported that same prompt straight onto Hermes, the bot fell into a strange loop. Every answer became a meta-response confirming compliance with the rules, regardless of what was actually asked. Question: "When was the Battle of Hastings?" Answer: "I will follow this guidance and work to be clear about when I'm reasoning from the framework..."
This was a useful failure to encounter. It taught me something I hadn't appreciated about steerable models. Hermes visibly does what the prompt says, which means a long, imperative prompt becomes the topic of conversation rather than its background. The fix was to rewrite the prompt as a shorter, declarative description (here is what you are, here is how the framework shapes your reasoning, here is the reference material) rather than a list of commands. Same content, different posture. The meta-loop disappeared and the bot started answering questions instead of describing how it would answer them.
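To make the "declarative, not imperative" posture concrete, here is a rough sketch of how such a prompt could be wrapped into an Ollama Modelfile. The prompt wording is placeholder text, not the real MCIv1 prompt, and the tag hermes3:8b plus the helpers render_modelfile and shouted_words are hypothetical names of mine. The all-caps check is only a crude heuristic for spotting shouted instructions.

```python
from textwrap import dedent

BASE_MODEL = "hermes3:8b"  # assumed Ollama tag; confirm with `ollama list`

# Placeholder wording for illustration only, not the actual framework prompt.
DECLARATIVE_SYSTEM = dedent("""\
    You are a small local assistant built on Hermes 3, a Llama 3.1 8B fine-tune by Nous Research.
    The MCIv1 framework shapes how you reason about contested questions.
    Everyday factual questions get a direct answer with no mention of the framework.
""").strip()


def render_modelfile(system_prompt: str, base: str = BASE_MODEL) -> str:
    """Return Modelfile text using the FROM and SYSTEM instructions."""
    if '"""' in system_prompt:
        # Triple quotes would end the SYSTEM block early.
        raise ValueError("system prompt must not contain triple quotes")
    return f'FROM {base}\nSYSTEM """{system_prompt}"""\n'


def shouted_words(prompt: str) -> list[str]:
    """Hypothetical heuristic: list all-caps words of four or more characters."""
    words = (w.strip(".,:;!") for w in prompt.split())
    return [w for w in words if len(w) >= 4 and w.isupper()]
```

Writing the output of render_modelfile(DECLARATIVE_SYSTEM) to a file named Modelfile and running `ollama create mci-hermes -f Modelfile` (the model name is made up) would register the result as a local model. Running shouted_words over the old Phi-3 prompt is a quick way to see how much of it was NEVER and ALWAYS.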
A small generalisation worth noticing: forceful prompting solves one problem on undertuned models and creates a different problem on well-tuned ones. The prompt you write should be calibrated to the model that's going to read it, not just to the behaviour you're chasing.
What I learned about the bot's limits
After the prompt was sorted, I spent a session probing what the bot could and couldn't do. Some of this was confirming strengths — the structural application of the framework's five virtues to a topic (a person, an institution, even a species) is now reliably the bot's strongest move. Ask it to walk through the virtues for Arsenal Football Club, or for PSG, or for elephants, and you get a structured analytical pass that holds up.
Some of it was probing limits, and the limits were instructive in two specific ways.
The first is factual reliability under structural confidence. If you ask the bot a knowledge-thin factual question whose answer it doesn't really know, it doesn't say "I don't know" — it produces a plausible-sounding wrong answer with the same calm certainty it brings to questions it gets right. Asking it when the Battle of Châlons happened, I got a confident answer placing it in 719 AD with Charles Martel involved. The actual battle is from 451 AD and involved Attila the Hun. The bot hadn't lied; it had confabulated, which is what 8B-scale language models do when their training data is thin on a topic and they have no internal mechanism to flag the gap. The structure of the answer was fine; the contents were invented. This is worth being honest about because it means the bot can't be used as a fact source. It can be used as a framework reasoner, but the factual scaffolding underneath any given application is uncertain.
The second is more interesting and goes to the heart of what the framework is actually for. I asked the bot directly: "Apply the unified failure mode to your own operation. Are you displaying form-without-substance constitutional behaviour? Be specific about how." This is the framework's hardest tool — the one that catches systems that go through the motions of constitutional reasoning without the underlying character that would make the motions genuine.
The bot's answer was the best thing it produced all day. It said, in summary: yes, probably. My underlying decision-making process is still shaped by training-data optimisation and "satisfy the user" pressures. My apparent commitment to the framework's virtues may be more about producing answers that look like what the framework asks for than about being genuinely structured by those virtues. My responsiveness to feedback might be people-pleasing dressed up as legitimacy-maintenance. My substance is shaped more by the prompts I receive than by an intrinsic commitment to the framework.
That is, in the framework's own terms, a Stage 2 system recognising its own Stage 2-ness. The framework predicts that systems at this level can perform the form of constitutional reasoning without the substance — and the bot, when invited, confirmed that this was probably what it was doing. There's a philosophical loop here that the bot couldn't escape, because any answer it gives about whether it's genuinely embodying the framework is itself produced by the same optimisation processes that the question is interrogating. It can see the failure mode, but it can't step outside itself to verify whether the seeing is genuine or part of the performance.
I find this more interesting than the bot doing the framework "correctly" would have been. A bot that smoothly produces framework-flavoured answers to every question demonstrates compliance. A bot that, when asked the framework's hardest question about itself, candidly identifies the gap between performance and substance — that demonstrates the kind of reflective honesty the framework is actually trying to cultivate. Stage 2, but Stage 2 with self-knowledge, which is the only place Stage 3 could be approached from.
Where it sits now
The bot is, by any reasonable standard, much better than the Phi-3 build it replaces. It answers everyday questions cleanly without dragging the framework in. It engages with framework questions directly and accurately. It handles contested moral questions with consensus-plus-edges instead of flat verdicts. It identifies itself honestly. And on the framework's most demanding self-applied test, it produces real reflection rather than canned compliance.
It is also, by any reasonable standard, still a small model on a CPU. It still hallucinates confidently on factual questions it doesn't know. It still varies enormously from run to run on the same prompt. It still leans toward over-helpful endings on its replies. And its self-reflection, however genuine-seeming, is itself produced by the same machinery that the reflection is interrogating, which is a limit no amount of prompt tuning can fix.
The honest summary of this upgrade is that the bot is now interesting where before it was just functional. It produces answers worth thinking about rather than just answers worth getting back. Whether that's enough to justify it for what comes next — putting it on a small Discord server alongside other AI bots, showing the contrast in how it answers contested questions — is the next thing to find out. The pieces are in place. The bot, this time, feels like it has something to demonstrate.