Google Gemini 2.5 Pro GA now live on Bollwerk AI

Written by Oisin Maher

Published Jun 17, 2025

Google Gemini 2.5 Pro GA is now available on Bollwerk AI.

Google has made Gemini 2.5 Pro generally available (GA) and added new endpoint locations! Finally, we get to use the new 2.5 Pro in Europe as well :)

We provide access to all Google models via Vertex AI, meaning all your AI interactions are covered by Vertex AI's enterprise-grade data protections.
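For illustration, here is a minimal sketch of calling the GA model through Vertex AI from a European region, using the google-genai Python SDK. The project ID and region below are placeholders, not Bollwerk AI configuration; check which locations serve the model in your own setup.

```python
# Minimal sketch: calling Gemini 2.5 Pro on Vertex AI from a European region.
# Assumptions: the google-genai SDK is installed (pip install google-genai);
# "my-gcp-project" and "europe-west1" are placeholders -- substitute your own
# project ID and one of the newly supported endpoint locations.
from google import genai

client = genai.Client(
    vertexai=True,
    project="my-gcp-project",  # placeholder GCP project ID
    location="europe-west1",   # placeholder European endpoint location
)

response = client.models.generate_content(
    model="gemini-2.5-pro",  # the GA model ID on Vertex AI
    contents="Summarize the key changes in the Gemini 2.5 Pro GA release.",
)
print(response.text)
```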

Preview versions of the 2.5 Pro model have been around for almost three months now, so we'll focus on the main differences between those previews and the new GA release.


Differences vs. previous preview releases


Strong Gains in Core Intelligence: The GA version shows clear improvements in foundational areas. The +5.0 point increase on the AIME math benchmark and the +3.4 point increase on the GPQA science benchmark suggest that the model's core reasoning and problem-solving abilities were significantly enhanced for the final release.


Mixed Results in Coding: This is the most interesting trade-off.

  • Regression in Generation/Agentic Tasks: The GA version scored lower on LiveCodeBench (-6.6 pts) and SWE-bench (-3.6 pts). This could indicate that raw capability present in the preview was later tuned down, perhaps in favor of safety, instruction following, or other general-purpose skills.

  • Improvement in Editing: Conversely, the GA model is significantly better at Aider Polyglot (+5.7 pts), which focuses on editing existing code based on user requests. This suggests the GA model may be better fine-tuned for interactive, iterative coding tasks.


Long Context Benchmark Change: The long-context scores are not directly comparable between the preview and GA releases.

  • The preview used the MRCR benchmark.

  • The GA version used MRCR V2 (8-needle).

  • The "8-needle" version is a significantly harder "needle in a haystack" test, designed to be more challenging. The drastic drop in scores (e.g., 93.0% down to 58.0% at 128k) is a reflection of the increased test difficulty, not necessarily a regression in the model's actual long-context capability.


Benchmarks Not in Both Lists:

  • Only in Preview Data: Video-MME (84.8%) and Global MMLU Lite (88.6%).

  • Only in GA Data: FACTS Grounding (87.8%). This is a new benchmark that highlights the model's ability to ground its responses in provided sources; a toy illustration of grounding follows.
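As a rough sketch of what "grounding" means in practice, the snippet below constrains the model to answer only from a supplied source. The source text and prompt wording are made-up illustrations, not part of the FACTS benchmark itself.

```python
# Toy illustration of grounded generation: answer only from a provided source.
# The source text is made up for this example; FACTS Grounding itself uses its
# own documents and scoring.
source = (
    "Acme Corp's Q1 report: revenue was EUR 12.4M, up 8% year over year; "
    "headcount grew from 85 to 97."
)
grounded_prompt = (
    "Using ONLY the source below, answer the question. "
    "If the source does not contain the answer, say so.\n\n"
    f"Source:\n{source}\n\n"
    "Question: What was Acme Corp's Q1 revenue?"
)
# e.g. client.models.generate_content(model="gemini-2.5-pro", contents=grounded_prompt)
```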