Overview
The langchain-nvidia-ai-endpoints package contains LangChain integrations for chat models and embeddings powered by NVIDIA AI Foundation Models and hosted on the NVIDIA API Catalog.
A strong starting point is Nemotron, NVIDIA’s open model family purpose-built for agentic AI. Nemotron models use a hybrid Mamba-Transformer mixture-of-experts architecture that delivers leading accuracy and up to 3x higher throughput than comparable models, with up to 1M token context windows. Model weights, training data, and implementation recipes are published openly under the NVIDIA Open Model License.
NVIDIA AI Foundation models run on NIM microservices: container images distributed through the NVIDIA NGC Catalog that expose a standard OpenAI-compatible API, optimized with TensorRT-LLM for maximum throughput. They can be accessed via the hosted NVIDIA API Catalog or deployed on-premises with an NVIDIA AI Enterprise license.
This page covers how to use LangChain to interact with NVIDIA models via ChatNVIDIA, including Nemotron and other models from the API Catalog.
For more information on accessing embedding models through this API, refer to the NVIDIAEmbeddings documentation.
Integration details
| Class | Package | Serializable | JS support | Downloads | Version |
|---|---|---|---|---|---|
| ChatNVIDIA | langchain-nvidia-ai-endpoints | beta | ❌ | | |
Model features
| Tool calling | Structured output | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs |
|---|---|---|---|---|---|---|---|---|
| ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ |
Install the package
Access the NVIDIA API Catalog
To get access to the NVIDIA API Catalog, do the following:
- Create a free account on the NVIDIA API Catalog and log in.
- Click your profile icon, and then click API Keys. The API Keys page appears.
- Click Generate API Key. The Generate API Key window appears.
- Click Generate Key. You should see API Key Granted, and your key appears.
- Copy and save the key as NVIDIA_API_KEY.
- To verify your key, use the following code.
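A minimal sketch of wiring up the key; the placeholder value is an assumption you must replace with your own key:

```python
import os

# ChatNVIDIA reads the key from the NVIDIA_API_KEY environment variable.
# Keys issued by the API Catalog begin with "nvapi-"; replace the placeholder.
os.environ.setdefault("NVIDIA_API_KEY", "nvapi-REPLACE-WITH-YOUR-KEY")

assert os.environ["NVIDIA_API_KEY"].startswith("nvapi-"), "Key should start with 'nvapi-'"
```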
Instantiation
Now we can access models in the NVIDIA API Catalog. Nemotron models are a recommended starting point for agentic and reasoning workloads.
Invocation
Self-host with NVIDIA NIM Microservices
When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. For more information, refer to NVIDIA NIM Microservices. The code in this section connects to locally hosted NIM microservices.
Stream, batch, and async
These models natively support streaming and, as with all LangChain chat models, expose a batch method for handling concurrent requests, as well as async counterparts of invoke, stream, and batch. This section shows a few examples.
Supported models
Querying available_models will still give you all of the other models offered by your API credentials.
The playground_ prefix is optional.
Model types
All of the models above are supported and can be accessed via ChatNVIDIA.
Some model types support unique prompting techniques and chat messages. We will review a few important ones below.
To find out more about a specific model, navigate to the API section of its AI Foundation model page.
Nemotron models for agentic AI
Nemotron is NVIDIA’s open model family purpose-built for agentic workflows. Key characteristics:
- Efficiency: hybrid Mamba-Transformer MoE architecture delivers up to 3x higher throughput than comparable dense models
- Long context: native support for up to 1M token context windows
- Agentic reasoning: trained specifically for multi-step planning, tool use, and autonomous software engineering tasks
- Open: weights, training recipes, and curated datasets published under the NVIDIA Open Model License
General chat
Models such as meta/llama3-8b-instruct and mistralai/mixtral-8x22b-instruct-v0.1 are good all-around models that you can use with any LangChain chat messages. Example below.
Code generation
These models accept the same arguments and input structure as regular chat models, but they tend to perform better on code-generation and structured code tasks. An example of this is meta/codellama-70b.
Multimodal
NVIDIA also supports multimodal inputs, meaning you can provide both images and text for the model to reason over. An example model supporting multimodal inputs is nvidia/neva-22b.
Below is an example use:
Passing an image as a URL
Passing an image as a base64 encoded string
At the moment, some extra processing happens client-side to support larger images like the one above. But for smaller images (and to better illustrate the process going on under the hood), we can pass the image in directly.
Directly within the string
The NVIDIA API uniquely accepts images as base64 strings inlined within <img/> HTML tags. While this isn’t interoperable with other LLMs, you can directly prompt the model accordingly.
Example usage within a RunnableWithMessageHistory
Like any other integration, ChatNVIDIA supports chat utilities such as RunnableWithMessageHistory, which is analogous to using ConversationChain. Below, we show the LangChain RunnableWithMessageHistory example applied to the mistralai/mixtral-8x22b-instruct-v0.1 model.
Tool calling
Starting in v0.2, ChatNVIDIA supports bind_tools.
ChatNVIDIA provides integration with the variety of models on build.nvidia.com as well as with local NIMs. Not all of these models are trained for tool calling. Be sure to select a model that supports tool calling for your experimentation and applications.
You can get a list of models that are known to support tool calling.
Use with NVIDIA Dynamo
NVIDIA Dynamo is a distributed inference-serving framework built to deploy models in multi-node environments at data center scale. It simplifies and automates the complexities of distributed serving by disaggregating the various phases of inference across different GPUs, intelligently routing requests to the appropriate GPU to avoid redundant computation, and extending GPU memory through data caching to cost-effective storage tiers.
ChatNVIDIADynamo is a drop-in replacement for ChatNVIDIA that automatically injects nvext.agent_hints into every request. These hints tell the Dynamo deployment:
- osl (output sequence length) — how many tokens to expect, so the scheduler can plan memory allocation
- iat (inter-arrival time) — how quickly requests arrive, so the router can anticipate load
- latency_sensitivity — how latency-critical a request is, so interactive calls get priority routing
- priority — request priority, so background work can yield to critical-path requests
prefix_id is auto-generated for every request, enabling the router to track KV cache affinity.
This section assumes you have a running NVIDIA Dynamo deployment.
Basic usage
Swap ChatNVIDIA for ChatNVIDIADynamo and every request automatically includes routing hints. All standard ChatNVIDIA parameters are supported.
ChatNVIDIADynamo accepts four additional parameters beyond those supported by ChatNVIDIA:
| Parameter | Type | Default | Description |
|---|---|---|---|
| osl | int | 512 | Expected output sequence length (tokens) |
| iat | int | 250 | Expected inter-arrival time (ms) |
| latency_sensitivity | float | 1.0 | Higher values get priority routing |
| priority | int | 1 | Lower values receive higher scheduling priority |
Set defaults at construction time
Configure Dynamo hints when creating the model instance. This is useful when a model instance always serves a particular role, such as a high-priority interactive assistant versus a low-priority background summarizer.
Override per invocation
Dynamo parameters can also be overridden on each call. This is useful when the same model instance handles requests with varying characteristics.
Stream with Dynamo hints
Dynamo hints are included in the initial streaming request, and Dynamo uses them to select the optimal worker before tokens start flowing.
Inspect the payload
For debugging, you can inspect the exact payload that ChatNVIDIADynamo sends to the NIM endpoint using the internal _get_payload method.
Look for the nvext.agent_hints section in the output.
API reference
For detailed documentation of all ChatNVIDIA features and configurations, head to the API reference.
Related topics
- langchain-nvidia-ai-endpoints package README
- Nemotron model family — NVIDIA’s open models for agentic AI
- NVIDIA API Catalog — browse and try all available models
- Overview of NVIDIA NIM for Large Language Models (LLMs)
- Overview of NeMo Retriever Embedding NIM
- Overview of NeMo Retriever Reranking NIM
- NVIDIAEmbeddings model for RAG workflows
- NVIDIA Provider Page
- NVIDIA Dynamo — open-source inference framework
- Dynamo Quickstart Guide — get a local deployment running
- KV Cache-Aware Routing — how the Smart Router works

