Blog

Research, results, and what we found when we looked.

We Verified Code from 4 AI Platforms. Average Score: 40/100

Scores out of 100: Bolt 42, Lovable 42, Replit 44, Claude 35. We found 21 bugs across the 4 projects, and none passed.

Four independent proofs from four research groups point to the same result: every future model will hallucinate. This changes how you build.

120 curated pairs: 91.5%. 2,000 pairs: 77.4%. More data caused catastrophic collapse. Verification stays external.