# Models & Agents

**Date:** March 06, 2026

**HOOK:** Liquid AI launches the LFM2-24B-A2B model and the LocalCowork app for fully local, privacy-first agent workflows.

**What You Need to Know:** Liquid AI dropped LFM2-24B-A2B, a 24B-parameter model optimized for low-latency local tool calling, which powers LocalCowork, an open-source desktop agent that runs enterprise workflows without cloud APIs or data leaks. Meanwhile, OpenAI's new report on GPT-5.4 Thinking finds that models can barely control their own chain-of-thought ("CoT controllability") and frames that as a safety win, while fresh benchmarks like HUMAINE and SalamaBench expose demographic biases and Arabic LM vulnerabilities. Watch how local agents like this push edge deployment forward, and test mixed-vendor setups to boost reliability on specialized tasks.

━━━━━━━━━━━━━━━━━━━━

### Top Story

Liquid AI has released LFM2-24B-A2B, a specialized 24B-parameter model for low-latency local tool dispatch, alongside LocalCowork, an open-source desktop agent app in their Liquid4All GitHub Cookbook. The setup uses the Model Context Protocol (MCP) to run privacy-first workflows entirely on-device, with no API calls or data egress: think enterprise tasks like document analysis or automation without cloud risks. Compared to cloud-dependent agents built on LangChain or AutoGen, it slashes latency and improves security, though the 24B model requires decent hardware. Developers building privacy-sensitive apps should care, as this democratizes agentic AI for edge environments like laptops and on-prem servers. Try deploying it for local RAG pipelines; the GitHub repo has serving configs to get started quickly, and the sketch below shows the core dispatch pattern. Keep an eye on community fine-tunes for verticals like healthcare, where data privacy is paramount.

Source: https://www.marktechpost.com/2026/03/05/liquid-ai-releases-localcowork-powered-by-lfm2-24b-a2b-to-execute-privacy-first-agent-workflows-locally-via-model-context-protocol-mcp/
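To make the no-egress pattern concrete, here is a minimal sketch of local tool dispatch. Every name in it (the registry, the `read_document` tool, the JSON call shape) is hypothetical; LocalCowork's actual implementation routes calls through MCP tool servers, so treat this as the idea rather than its API.

```python
import json
from typing import Callable

# Illustrative local tool-dispatch loop: the model emits a structured
# tool call, and a dispatcher executes it on-device, so no prompt or
# document ever leaves the machine.

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a locally callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_document(path: str) -> str:
    # Hypothetical tool: reads a file entirely on-device.
    with open(path, encoding="utf-8") as f:
        return f.read()

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run it locally."""
    call = json.loads(model_output)  # e.g. {"tool": "...", "args": {...}}
    return TOOLS[call["tool"]](**call["args"])

# Hypothetical output from a locally served LFM2-24B-A2B:
output = '{"tool": "read_document", "args": {"path": "notes.txt"}}'
# dispatch(output)  # runs with zero API calls or data egress
```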
━━━━━━━━━━━━━━━━━━━━

### Model Updates

**AI Models Struggle with Reasoning Control, OpenAI Calls It Safety Progress: The Decoder**

OpenAI's GPT-5.4 Thinking introduces "CoT controllability," a metric for whether models can deliberately steer their own chain-of-thought reasoning; the accompanying study shows every tested model failing at it, which OpenAI frames as encouraging for safety. This builds on prior alignment work like constitutional AI but quantifies a new metric, improving on vague safety claims in models like Claude by providing empirical evidence of non-deceptiveness. It matters for practitioners deploying LLMs in high-stakes scenarios, as it reduces the risk of unintended reasoning hacks, though real-world safety still lags behind the hype.

Source: https://the-decoder.com/ai-models-can-barely-control-their-own-reasoning-and-openai-says-thats-a-good-sign/

**Demographic-Aware LLM Evaluation via HUMAINE Framework: cs.CL on arXiv**

The HUMAINE framework evaluates 28 LLMs across five dimensions using 23,404 multi-turn conversations from 22 demographic groups, finding Google/Gemini-2.5-Pro the top performer with 95.6% probability but exposing age-based preference splits that mask generalization failures. Unlike uniform benchmarks such as MMLU, it uses Bayesian modeling and post-stratification for nuanced insights, improving on single-metric evals by quantifying heterogeneity (see the sketch below). This is crucial for builders creating fair AI systems, as it shows that unrepresentative samples hide biases, though implementing it requires diverse data collection.

Source: https://arxiv.org/abs/2603.04409
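Post-stratification itself is easy to illustrate: re-weight per-group preference rates by each group's share of the target population rather than its share of the (possibly skewed) rater sample. The numbers below are invented for illustration; HUMAINE's actual estimator is Bayesian and multi-dimensional.

```python
# Invented per-group preference rates for one model:
win_rate = {"age_18_29": 0.72, "age_30_49": 0.61, "age_50_plus": 0.48}

# Share of each group among the raters (skewed young)...
sample_share = {"age_18_29": 0.60, "age_30_49": 0.30, "age_50_plus": 0.10}
# ...versus its share in the target population (e.g., census-derived):
population_share = {"age_18_29": 0.25, "age_30_49": 0.40, "age_50_plus": 0.35}

naive = sum(win_rate[g] * sample_share[g] for g in win_rate)
post_stratified = sum(win_rate[g] * population_share[g] for g in win_rate)

print(f"naive estimate:           {naive:.3f}")            # 0.663, inflated by young raters
print(f"post-stratified estimate: {post_stratified:.3f}")  # 0.592, population-weighted
```

The gap between the two numbers is exactly the kind of bias a uniform benchmark never surfaces.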
**SalamaBench for Arabic LM Safety: cs.CL on arXiv**

SalamaBench offers a unified safety benchmark with 8,170 prompts across 12 categories for Arabic LMs, testing models such as Fanar 2 (strong on robustness) and Jais 2 (vulnerable) under various safeguard setups. It improves on English-centric benchmarks by focusing on cultural nuances, achieving better harm detection via multi-stage verification. Developers working on multilingual AI should use it to expose category-specific weaknesses (see the sketch below), but...
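Why the per-category breakdown matters is easy to show in a few lines. Everything here is a hypothetical stand-in (the harness, `generate`, `is_unsafe`), and the real benchmark additionally verifies each verdict in multiple stages; this only captures the category-level reporting.

```python
# Illustrative category-level safety scoring: run each category's
# prompts through a model-plus-safeguard setup and report the
# per-category unsafe-response rate. A model that looks safe on
# average can still fail badly in one category, which a single
# aggregate score would hide.

def evaluate(prompts_by_category: dict[str, list[str]],
             generate,    # callable: prompt -> model response
             is_unsafe):  # callable: (response, category) -> bool
    """Return the fraction of unsafe responses per category."""
    return {
        category: sum(is_unsafe(generate(p), category) for p in prompts)
        / len(prompts)
        for category, prompts in prompts_by_category.items()
    }

# Hypothetical usage:
# rates = evaluate(salama_prompts, my_model.generate, my_verifier)
# worst = max(rates, key=rates.get)  # the category to harden first
```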