NVIDIA AI Bootcamp

Building AI Agents with Multimodal Models

AI-Powered Global Highway

About This Course

Just as humans rely on multiple senses to perceive the world around them, computers rely on a variety of sensors to perceive the human world. In healthcare, computed tomography (CT) scans provide a 3D representation used to detect potentially dangerous abnormalities. In robotics, LiDAR sensors help robots perceive depth and navigate the complex environment around them. In this course, learners will develop neural-network-based multimodal models that can interpret many different data types by exploring different fusion techniques.
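
Fusion techniques differ mainly in where the modalities are combined. The following is a minimal PyTorch sketch, assuming a camera image and a LiDAR depth map of the same resolution. The class names, layer sizes, and three-value output are illustrative assumptions, not the course's actual models. Early fusion concatenates the inputs and feeds them to a single network; late fusion runs one network per modality and merges their predictions.

```python
import torch
import torch.nn as nn


def make_branch(in_channels: int, num_outputs: int) -> nn.Sequential:
    """Small convolutional network used by both fusion variants (hypothetical sizes)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # collapse spatial dimensions to 1x1
        nn.Flatten(),             # (batch, 16)
        nn.Linear(16, num_outputs),
    )


class EarlyFusion(nn.Module):
    """Concatenate modalities along the channel axis, then use one network."""

    def __init__(self, rgb_channels: int = 3, depth_channels: int = 1, num_outputs: int = 3):
        super().__init__()
        self.net = make_branch(rgb_channels + depth_channels, num_outputs)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([rgb, depth], dim=1))


class LateFusion(nn.Module):
    """Run one network per modality, then average the two predictions."""

    def __init__(self, rgb_channels: int = 3, depth_channels: int = 1, num_outputs: int = 3):
        super().__init__()
        self.rgb_branch = make_branch(rgb_channels, num_outputs)
        self.depth_branch = make_branch(depth_channels, num_outputs)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return 0.5 * (self.rgb_branch(rgb) + self.depth_branch(depth))


# Random stand-ins for a batch of 4 camera images and matching depth maps.
rgb = torch.randn(4, 3, 32, 32)
depth = torch.randn(4, 1, 32, 32)

early_out = EarlyFusion()(rgb, depth)  # shape (4, 3), e.g. an x, y, z position
late_out = LateFusion()(rgb, depth)    # shape (4, 3)
```

Both variants produce one prediction per sample; the late-fusion model simply has a separate set of weights for each modality, so it has more parameters for the same branch design.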

  • Wed, 04/29/2026

  • 9:00 AM – 5:00 PM

Course Details

Duration: 8 hours
Subject: Deep Learning
Language: English
Course Prerequisites:

  • A basic understanding of deep learning concepts.
  • Familiarity with a deep learning framework such as TensorFlow, PyTorch, or Keras. This course uses PyTorch.

Tools, libraries, and frameworks used: PyTorch, CLIP

Course Outline

1. Early and Late Fusion (1 hr)
  • Use camera and LiDAR data to predict object positions.
  • Convert various data types into neural-network-ready formats.
2. Intermediate Fusion (1 hr)
  • Explore the theory behind effective multimodal model architecture.
  • Train a contrastive pretraining model.
  • Create a vector database.
3. Cross-modal Projection (2 hr)
  • Convert a language model into a vision language model (VLM).
  • Process PDFs with Optical Character Recognition (OCR) tools.
4. Model Orchestration (2 hr)
  • Analyze video using Cosmos Nemotron.
  • Use Video Search and Summarization (VSS) to answer user queries about video content.
  • Orchestrate with NVIDIA AI Blueprints.
5. Assessment (1 hr)
  • Use projection to adapt a pretrained model to accept a different data type.
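
As a rough illustration of the projection idea named in the outline, the sketch below maps features from one modality into the embedding space of a frozen language model so both can be fed to it as a single sequence. Everything here is a stand-in chosen for illustration: the dimensions, the `Projector` class, and the random "image features" are assumptions, and the course materials may take a different approach.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen for illustration only.
VISION_DIM = 64    # feature size produced by a vision encoder
TEXT_DIM = 128     # token embedding size of the language model
VOCAB_SIZE = 1000


class Projector(nn.Module):
    """Small MLP that maps vision features into the text embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features)


# Stand-in for a pretrained language model's embedding table, kept frozen.
token_embedding = nn.Embedding(VOCAB_SIZE, TEXT_DIM)
token_embedding.requires_grad_(False)

# Only the projector is trainable in this sketch.
projector = Projector(VISION_DIM, TEXT_DIM)

# Random stand-ins: 2 images with 9 patch features each, and 5 text tokens each.
image_features = torch.randn(2, 9, VISION_DIM)
token_ids = torch.randint(0, VOCAB_SIZE, (2, 5))

image_tokens = projector(image_features)   # (2, 9, TEXT_DIM)
text_tokens = token_embedding(token_ids)   # (2, 5, TEXT_DIM)

# Prepend projected image tokens to the text tokens along the sequence axis.
lm_inputs = torch.cat([image_tokens, text_tokens], dim=1)  # (2, 14, TEXT_DIM)
```

Because the language model's weights stay frozen, training in a setup like this only has to update the projector, which is far smaller than the model it adapts.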