Replies

ϻค𝔬ᑭ maop@mstdn.mx
Jun 9, 2026, 4:33 AM
@Migueldeicaza This seems very similar to Google's new Gemma 4 QAT models. Testing them with Ollama yesterday, I noticed that instead of loading the full model into VRAM, it only loads a small portion, likely using mmap to read the weights from disk during inference. Impressive for those of us with limited GPU memory!
Bartosz Sypytkowski horusiath@fosstodon.org
Jun 9, 2026, 6:16 AM
@Migueldeicaza Yeah, but will that mean that the hardware prices will go back to reasonable levels?
Matt Gallagher cocoawithlove@mastodon.social
Jun 9, 2026, 6:36 AM
@Migueldeicaza So "Apple Intelligence" is at least 5 different models: 2 on device, 2 on Apple Silicon in the cloud, and the final one is basically Gemini running on NVIDIA in Google Cloud.
GNU/Knoppers gnuplusknoppers@troet.cafe
Jun 9, 2026, 9:46 AM
@Migueldeicaza sounds similar to speculative decoding