Software that enhances AI development workflows without being embedded in your application code: IDE extensions, CLI utilities, testing frameworks and observability.

Adopt

Mature, well-supported tools with proven track records in production development workflows.

Software engineering copilots

AI-augmented development represents a permanent shift in software engineering. Teams not actively building capability here are falling behind.

The tooling falls into two categories. Model-agnostic interfaces let teams switch between providers: OpenCode stands out for its terminal experience and breadth of integration, Cursor, Windsurf and Zed are standalone editors, and CLI tools such as Aider and Cline work across providers. Provider-specific tools such as Claude Code, Gemini CLI and OpenAI Codex are optimised for their respective models. GitHub Copilot and Tabnine offer traditional IDE integrations.

Two approaches have emerged: free-form “vibe coding” and structured methodologies. Kiro offers both, with a conversational mode and a dedicated specs mode for drafting requirements before code generation. Cursor enables teams to codify standards through .cursorrules.

Senior engineers derive the greatest value, using AI for routine tasks whilst maintaining quality oversight. Junior developers often struggle to evaluate AI suggestions. Success correlates with intentional training around effective AI collaboration and a “trust but verify” mindset.

Provider-agnostic LLM facades

The LLM market moves quickly enough that today’s best choice may be displaced within months. A facade between your application and the underlying provider keeps that switching cost manageable. Options range from the lightweight AISuite and Simon Willison’s LLM library to heavier alternatives such as LangChain and LlamaIndex. A thin in-house wrapper is also a reasonable choice for teams that prefer to minimise dependencies. We have seen enough projects hampered by tight coupling to a single provider to consider some form of abstraction a default.

Notebooks

Notebooks remain the de facto standard for data science and ML experimentation. Code, prose and visualisations live alongside each other, which suits the iterative work of model development and gives technical and non-technical collaborators something concrete to discuss. Jupyter is the most widely used, with cloud-hosted equivalents from Google Colab, AWS Sagemaker, Azure and Databricks. Language-specific alternatives include Pluto.jl for Julia, Clerk for Clojure and Polynote for Scala.

Trial

Promising tools with growing adoption that are worth exploring for teams building AI systems.

MLflow

MLflow is an open-source platform for managing the machine learning lifecycle. The 3.0 release expanded it to cover generative AI: alongside the classical experiment tracking, model registry and deployment tooling, MLflow now offers OpenTelemetry-based tracing for LLMs and agents, capturing prompt construction, tool calls, memory retrievals, latency and token costs. This puts it in direct overlap with LLM observability tools such as Phoenix, Langfuse and LangSmith.

MLflow runs self-hosted or as a managed service on Databricks, SageMaker, Azure ML, Red Hat OpenShift AI and Nebius. That breadth is part of the appeal: teams avoid the lock-in of monolithic MLOps platforms while still being able to pick a managed offering where one fits.

Realising the benefits self-hosted requires technical expertise to configure and integrate. Unlike a turnkey platform such as Vertex AI, MLflow does not provide a plug-and-play experience; its modular components must be tailored to specific use cases. We recommend it for organisations that value flexibility and have the proficiency to manage integrations.

Vector databases

Vector databases have emerged as specialised tools for managing the high-dimensional embeddings required by AI models. Prominent solutions include Pinecone, Qdrant, Milvus and Weaviate.

Traditional databases may suffice for simpler operations, and alternative approaches such as Timescale’s PGAI vectorizer bring vector search directly into Postgres, avoiding the data consistency challenges of keeping embeddings synchronised across databases. If a dedicated vector database is required, Pinecone leads in production readiness but comes with managed service costs, while open-source alternatives such as Qdrant and Milvus offer greater control but demand more operational expertise.

For prototyping, Chroma offers a Python-first approach with minimal configuration. A 2025 Rust rewrite improved performance, though it remains best suited for small-to-medium scale applications. LanceDB takes a different approach as an embedded vector database, similar in philosophy to SQLite. It operates as a library within your application using Apache Arrow’s columnar format, making it compelling for local AI assistants and edge deployments where data must remain on-device. The trade-off is limited high-concurrency support.

Local model execution environments

Tools such as Ollama, LM Studio and AnythingLLM provide accessible ways to run open weight models on local hardware. These enable experimentation with models from Meta, Mistral, DeepSeek, Alibaba and OpenAI without API costs or sending data to external services. Many now support tool calling via MCP and connections to commercial APIs for hybrid workflows.

These tools serve developers testing AI features, teams comparing model responses and organisations exploring capabilities with sensitive data that cannot leave their infrastructure.

LLM observability tools

Modern agentic builds involve multi-step reasoning, tool orchestration, RAG retrieval and chains of LLM calls where a single user request might trigger dozens of internal operations. Debugging why an agent produced an unexpected result requires visibility into every step of that chain. This is distinct from production AI monitoring, which focuses on drift detection in deployed systems.

Phoenix, from Arize AI, has emerged as a leading open-source option. Built on OpenTelemetry, it provides tracing and evaluation with auto-instrumentation for LangChain, LlamaIndex, DSPy and direct integrations with OpenAI, Anthropic and AWS Bedrock. Langfuse is the most popular fully open-source alternative (MIT licence), combining tracing and evaluation with strong multi-turn conversation support.

For LangChain-committed teams, LangSmith provides native integration that surfaces framework internals in debugging views. Helicone takes a lightweight proxy approach: route API calls through its endpoint for observability without SDK changes. MLflow 3 also offers OpenTelemetry-based tracing for LLMs and agents, which is a natural choice where a single platform across classical ML and GenAI matters. Since these tools capture prompt and response data, data sovereignty matters too: Phoenix and Langfuse both offer self-hosting for teams with data residency requirements.

LLM API gateways

As organisations adopt multiple model providers, an infrastructure layer emerges between applications and providers. LLM API gateways handle routing, caching, failover, rate limiting and cost tracking at the proxy level, complementing code-level abstraction from libraries such as AISuite. A facade is a developer choice about how to call models; a gateway is a platform decision about how to manage model traffic across the organisation.

LiteLLM is the most widely adopted open-source option, providing an OpenAI-compatible proxy with spend tracking, budget controls and key management. Portkey offers a managed alternative with semantic caching and conditional routing. Kong brings enterprise API management experience to LLM traffic.

The operational value for multi-provider deployments is clear: centralised audit logging, per-team budget enforcement and automatic failover. Worth evaluating whether a gateway simplifies the operational story before building equivalent functionality in-house.

See also: Provider-agnostic LLM facades, LLM observability tools.

AI red teaming tools

Our security coverage has grown this quarter, with guidance on prompt injection in MCP and RAG and architectural defences such as CaMeL. Missing has been offensive testing: tools that systematically probe AI systems for vulnerabilities before attackers do.

Promptfoo is the standout, evolving from a prompt evaluation CLI into a comprehensive red teaming platform. OpenAI acquired it in March 2026 but the MIT licence remains. It ships financial services plugins for PCI DSS and banking regulation testing. Microsoft’s PyRIT focuses on multi-turn attack orchestration with Azure AI Foundry integration. NVIDIA’s Garak takes an agentic approach, autonomously probing for prompt injection, data leakage and toxicity.

The EU AI Act will require adversarial testing for high-risk systems by August 2026. We recommend starting with Promptfoo for breadth and CI/CD integration.

See also: CaMeL, Agentic tool use, LLM observability tools.

AI-assisted code migration

Large-scale code migration sits in a gap between what copilots do (assist line-by-line) and what bootstrappers do (generate new projects). Migration tools operate at codebase scale, applying thousands of coordinated changes to upgrade language versions, swap frameworks or modernise APIs. The pattern that works best combines deterministic code transformations with AI for edge cases that rules alone cannot handle.

Moderne is the leading platform, built on the open-source OpenRewrite engine. OpenRewrite provides deterministic “recipes” for common transformations such as Java version upgrades, Spring Boot migrations and Jakarta EE transitions, while Moderne adds AI-assisted handling of non-standard patterns and enterprise-scale orchestration.

These migrations are well-understood but labour-intensive, exactly where deterministic transformation augmented by AI pays off. We recommend trialling OpenRewrite on a representative repository before committing to the Moderne platform.

See also: Software engineering copilots, Spec-driven development.

Assess

Emerging tools that require careful evaluation before adoption.

AI application bootstrappers

AI application bootstrappers generate complete applications from prompts or designs. Lovable (formerly GPT Engineer) has emerged as a leader alongside V0, Bolt.new and Replit Agent. Google entered the space with Firebase Studio. These tools can take projects from concept to working application in hours.

Capabilities are improving rapidly. Lovable’s visual editor allows Figma-like manipulation with automatic code updates. V0 excels at production-ready React components. Bolt.new runs full-stack development in the browser.

However, success still correlates strongly with existing engineering expertise. Senior developers use them as accelerators, understanding how to refactor generated code. Teams without this expertise risk shipping code they cannot maintain or debug. The gap between “working demo” and “production-ready system” remains substantial.

We recommend these primarily for prototyping and proof-of-concept work, with clear separation from production codebases unless your team has the engineering depth to take ownership of generated code.

Visual computer use agents

AI agents that interact with computers through visual understanding have matured but remain risky. Claude Computer Use controls desktops and browsers by seeing the screen and reasoning about interface elements. OpenAI Operator focuses on web browser automation through a managed environment. Browser Use offers an open-source alternative across multiple providers.

Reliability for bounded tasks has improved, with standard office workflows seeing success rates in the high 80s. However, prompt injection attacks, where malicious instructions hidden on web pages hijack agent behaviour, represent a systemic vulnerability. OpenAI has acknowledged this problem “may never be fully solved”.

For many automation needs, programmatic approaches via APIs and workflow automation platforms remain more reliable and secure. Visual computer use is best suited to isolated environments where the agent cannot access sensitive data. Teams should grant minimal permissions and maintain human oversight for high-stakes actions.

See also: Agentic tool use, CaMeL.

This section was previously titled “Agentic computer use”.

Lakera

Lakera was acquired by Check Point Software for approximately $300M in November 2025. Lakera Guard, its core AI safety scanning product, is being integrated into Check Point’s CloudGuard WAF as part of a broader application security offering. The underlying capability of scanning LLM inputs and outputs for prompt injection, toxic content and data leakage remains relevant, but the product context has changed substantially.

Technical limitations from our earlier evaluation still apply: scanning is text-only with no multimodal support, custom rules rely on regex patterns rather than context-aware analysis, and scanning is non-stateful with no awareness of conversation history. Teams evaluating Lakera should now assess it as part of the Check Point ecosystem rather than as a standalone product.

Structured output libraries

Libraries such as Instructor, Outlines and Marvin address a common challenge: LLMs naturally produce freeform text, but applications need structured data. These libraries constrain outputs to match specified structures through prompting, logit manipulation or grammar-based generation. Instead of hoping an LLM produces valid JSON, developers specify Pydantic models and receive guaranteed-valid objects. For agentic systems this is essential, as agents need to produce function calls and decision objects that downstream code can reliably process.

The space is evolving quickly. Instructor has gained traction for its simplicity and Pydantic integration, while Outlines offers more sophisticated constrained generation. Native structured output features from model providers (OpenAI’s JSON mode, Anthropic’s tool use) may reduce the need for external libraries in some scenarios.

A broader category of runtime guardrails has grown up alongside these libraries. NVIDIA’s NeMo Guardrails and Guardrails AI go beyond schema conformance to include prompt injection scanning, content filtering and hallucination checks. Teams building production LLM applications should evaluate both levels: structured outputs for type safety and guardrails for content safety.

See also: PydanticAI, Lakera, AI red teaming tools.

Hold

Not recommended for new projects due to better alternatives or limited long-term viability.

OpenClaw

OpenClaw is an open-source agent runtime created by Peter Steinberger, who later joined OpenAI. It runs persistent, always-on AI agents that execute multi-step tasks by controlling computers: clicking, typing, navigating applications, browsing the web. OpenClaw supports Claude, DeepSeek and OpenAI as backends, and has spawned a wave of imitators. NVIDIA’s NemoClaw wraps OpenClaw in the NVIDIA Agent Toolkit with sandboxed execution against Nemotron 3 Super models. Moonshot’s KimiClaw runs natively on kimi.com with a community skill marketplace and persistent cloud memory.

The security model has not been figured out across any of these variants. Persistent agent runtimes grant broad computer access to AI agents processing potentially untrusted instructions. The same prompt injection vulnerabilities that affect all visual computer use agents apply here, amplified by the breadth of access and the always-on character of the deployment. An agent with permission to control your browser, email and file system has an enormous blast radius if compromised.

We do not recommend OpenClaw or its variants for new projects until the security model matures. Teams already using one should enforce strict sandboxing, limit accessible applications and maintain human oversight for actions involving sensitive data.

See also: Visual computer use agents, Agentic tool use, Software engineering copilots, CaMeL.

Conversational data analysis

Tools such as pandas-ai, tablegpt, promptql and Julius enable natural language querying of databases. Modern MCP servers can provide substantial context to models, including schema understanding and data contents. Our experience with JUXT’s XTDB revealed remarkable moments where models traversed complex table structures with apparent ease.

For experienced analysts, these tools represent a meaningful productivity boost, converting natural language into draft queries that can be refined. However, generated queries can be inefficient or incorrect despite appearing plausible. Uber’s QueryGPT demonstrates both the potential and complexity, highlighting the guardrails required for reliable results.

We’ve placed this in Hold because successful deployment requires users capable of understanding and validating generated queries. These tools offer substantial benefits for data teams with appropriate expertise, but should be approached cautiously by those unable to review AI-generated database queries.

Get industry news, insights, research, updates and events directly to your inbox

Sign up for our newsletter