GPU 101
Why Oracle AI
Competitors
Gen AI
Why OCI?
100

What does GPU stand for? 

Graphics Processing Unit 

100

Why buy NVIDIA GPUs from Oracle?

Oracle's AI Infrastructure sets us apart 

100

List a couple of reasons to choose Oracle over AWS

  • Pay less for the same compute capacity

    Oracle Cloud Infrastructure (OCI) consistently charges less than Amazon Web Services (AWS) for the equivalent compute capacity. For a typical 4 vCPU AMD-based virtual machine with 16 GB of memory (m6a.xlarge), AWS charges more than 2X as much in its cheapest US region.

    Even compared with the cost when using an AWS compute savings plan, which requires a commitment of at least a year, OCI is still cheaper.

  • Scale your compute infrastructure precisely to your workload

    OCI Compute offers flexible virtual machines that let you scale performance and capacity by a single core. Flexible VMs allow you to pay only for the compute you need, scaling as necessary, rather than being forced to purchase fixed sizes that might be too big for your workload.

    AWS doesn’t offer flexible sizing for its EC2 instances—you must pick from their existing sizes. As you require more performance, the difference between sizes increases dramatically. If your needs fall between sizes, you must choose either an undersized instance type or an oversized, and more expensive, instance type.

  • Pay less for block storage with higher performance

    OCI Block Storage provides high performance volumes that are attached to virtual machines. OCI not only lets you change the performance characteristic of block volumes during active use, but it also allows you to set up elastic performance so the volumes dynamically change performance based on actual use.

    AWS requires you to pick from multiple options with different capacity, performance, and cost characteristics. For the equivalent capacity and performance, OCI Block Storage is cheaper than the multiple options from AWS. For example, for a 5 TB volume with the 375,000 IOPS that a high performance database could require, AWS charges 35X as much (based on io2 pricing) in their cheapest region while only providing 256,000 IOPS.

  • Pay less to move your data where you want it

    We believe in letting customers move their data. OCI charges significantly less than AWS for data leaving a cloud region using the public internet.

    For 50 TB of data egress from a US region in one month, AWS charges almost 13X as much as OCI.

    In addition, OCI includes the first 10 TB of data egress using the public internet per month at no additional charge. AWS only includes 100 GB, which is 1% of what OCI provides.

  • Pay the same for cloud services in all regions, including OCI Dedicated Region

    OCI has globally consistent pricing. We designed OCI for a consistent experience, both in performance and cost, wherever you want to deploy. If you run applications and workloads in multiple regions, this makes it easier to plan and budget for cloud expenditure.

    This also holds true for on-premises deployments of OCI Dedicated Region, which has the same per-service pricing as public regions. A minimum commitment is required.

    In contrast, AWS charges differently in different regions for the same instance types, which makes it costlier to run applications in multiple regions, especially outside the US. Comparing the costs of a typical 4 vCPU AMD-based virtual machine with 16 GB of memory (m6a.xlarge), AWS charges more than 2X as much in its eu-west-2 (London) region and almost 4X as much in its sa-east-1 (Brazil) region as it does in its us-east-1 region.

  • Simplify budgeting and forecasting with simpler pricing models

    OCI offers simpler pricing models for its cloud services. Compute is priced by the number of processor cores and amount of memory. Block storage is priced based on the amount of data and desired performance. Network data egress fees are charged when data moves outside a region, and the first 10 TB per month is included at no additional charge.

    Many services are included at no additional charge, such as secrets on OCI Vault, OCI Vulnerability Scanning Service, Oracle Cloud Guard, and distributed denial-of-service protection. This reduces your overall spend and makes it simpler to plan your budget.

    AWS charges for its multiple GuardDuty security protection plans, including the S3, EKS, malware, RDS, and Lambda protection plans. Data movement fees can be charged depending on whether the data is moving between virtual machines, to particular AWS services, inside a region, or between Availability Zones. Other networking services, such as Direct Connect, have per-hour and per-byte charges that you need to plan for. Unexpected usage can lead to surprises on your bill.
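The 50 TB egress comparison above can be reproduced arithmetically. A minimal sketch using illustrative per-GB rates and free tiers (the rates below are assumptions for illustration, not quoted prices):

```python
GB_PER_TB = 1024

def egress_cost(total_tb, free_tb, rate_per_gb):
    """Monthly egress cost after subtracting the included free tier."""
    billable_gb = max(0.0, (total_tb - free_tb) * GB_PER_TB)
    return billable_gb * rate_per_gb

# Illustrative rates: OCI ~$0.0085/GB after 10 TB free,
# AWS ~$0.09/GB after ~100 GB (0.1 TB) free
oci = egress_cost(50, 10, 0.0085)   # ~ $348
aws = egress_cost(50, 0.1, 0.09)    # ~ $4,599
ratio = aws / oci                   # ~ 13x, consistent with the claim above
```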

100

Define Fine Tuning 

  • Fine-tuning - optimizing a pretrained foundation model on a smaller, domain-specific dataset
    • Improves model performance on specific tasks
    • Improves model efficiency

100

What does RDMA stand for? 

Remote Direct Memory Access

200

How many GPUs are in a Node? 

Nodes of 8 GPUs


200

Benefit of MultiCloud for AI

1. Access to Best-of-Breed AI Services Across Clouds

A multicloud Oracle setup allows you to run Oracle databases or applications while leveraging AI/ML tools from other cloud providers such as Google Cloud's Vertex AI, AWS SageMaker, and Azure Machine Learning.

This means you can use Oracle for your core enterprise data management while using cutting-edge AI platforms optimized for model training and deployment.


2. Data Proximity for AI/ML Workloads

Oracle offers services like Oracle Cloud Infrastructure (OCI) Interconnect (e.g., with Azure) and low-latency cross-cloud connectivity. This enables fast, secure access to data across clouds, essential for:

- AI model training (which is data-intensive)

- Real-time inference and analytics


3. Simplified AI Model Training with Integrated Oracle AI Services

Oracle provides built-in AI services (like Oracle AI, Oracle Digital Assistant, and OCI Data Science).


These services can work in conjunction with AI services on other clouds, allowing flexibility in where and how models are trained and deployed.

4. Better Governance, Compliance & Security for AI Workloads

Oracle's strength in data governance and security supports AI models that require compliance with GDPR, HIPAA, and industry-specific regulations.

Multicloud allows AI workloads to remain compliant by processing sensitive data in Oracle Cloud, while offloading general AI tasks to other environments.

5. Scalability & Resilience for AI Applications

AI models and data pipelines benefit from a resilient multicloud architecture, ensuring high availability, fault tolerance, and disaster recovery.

You can train models in one cloud and deploy them in another depending on latency, location, or cost considerations.

6. Cost Optimization & Flexibility

AI training can be expensive. Multicloud allows you to:

- Use cheaper compute resources (e.g., GPU instances) from different clouds

- Store data in Oracle Cloud, which may offer a better performance-cost balance for structured enterprise data

7. AI-Driven Insights for Oracle Workloads

AI can be applied directly to Oracle applications (like Oracle ERP, HCM, SCM) for predictive analytics, anomaly detection, and intelligent automation.

Multicloud enhances this by integrating external AI capabilities for advanced modeling or custom ML pipelines.

200

Name 3 of Oracle's tier-2 competitors 

CoreWeave

RunPod

Voltage Park

Lambda

Crusoe

Baseten

Modal

200

What are the top personas to target for Gen AI? 

Model Refiners
• Orgs with large amounts of intellectual property
• Use that data with foundation models
• Building with AI as a core differentiator

Model Consumers
• Building AI solutions; agentic AI focused
• Leverage appropriate base models
• Enterprise data integration (RAG, etc.)

200

Provide an analogy to describe flexible infrastructure 

With OCI you can scale up or down, pricing is flexible allowing us to meet you where
you are in your journey.


When a company contracts with one of our competitors they must select from predefined
shapes and sizes.

Analogy: In the city, you don’t have to commit to a fixed service contract. Instead, you
can spend credits as needed and use them for a variety of purposes: whether it’s road
repair, electricity, or healthcare services. This credit-based system lets you scale up
or down as the need arises.

Or think about buying a T-shirt:
  • With AWS you must buy an XL and keep that size the entire year, whether or not you need a different size.
  • With OCI you can go up or down sizes as you need.


OCI differentiator: Oracle’s Universal Credit Program lets businesses pay as they go,
offering flexibility and scalability in how they consume cloud services.

300

What is the difference between a GPU and a CPU 

A CPU (Central Processing Unit) and a GPU (Graphics Processing Unit) are both processors, but they have different architectures and are designed for different tasks. 

CPUs are general-purpose processors that handle a wide range of tasks, while GPUs are specialized for parallel processing, particularly for graphics and other compute-intensive operations

A CPU does one task at a time - serial processing.

A GPU runs processes in parallel and can do many things at once.
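The serial-vs-parallel distinction can be sketched in code, with a thread pool standing in loosely for a GPU's many cores (this is an analogy, not how CUDA actually schedules work):

```python
from concurrent.futures import ThreadPoolExecutor

# Serial, CPU-style: handle work items one at a time.
def process_serial(items, fn):
    return [fn(x) for x in items]          # each call finishes before the next starts

# Parallel, GPU-style (conceptually): apply the same operation to many
# items at once. The thread pool stands in for thousands of GPU cores.
def process_parallel(items, fn, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))   # all items can be in flight together

square = lambda x: x * x
assert process_serial(range(5), square) == process_parallel(range(5), square)
```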

300

how do we help customers plan their GPU roadmap?

We provide custom tooling for cluster management 

Customers have direct access to our topology API, enhanced cluster health and monitoring tools, and scripts to get the highest performance out of their cluster. We're happy to co-develop solutions to solve unique problems.

300

How is Oracle's throughput better than our competitors?

4x more throughput than AWS

8x more throughput than GCP

Same as Azure 

300

What does "dedicated AI cluster" mean, and what is the benefit of it?

  • Dedicated AI Clusters - GPU-based compute resources that host the customer's fine-tuning and inference workloads
  • Establishes a dedicated AI cluster, which includes dedicated GPUs and an exclusive RDMA cluster network connecting them
    • The GPUs allocated for a customer's gen AI tasks are isolated from other GPUs

300

Define bare metal 

Bonus 100: what is the advantage of bare metal?

Bare metal GPUs give users direct access to the physical GPU hardware without any virtualization layer. This means no resource sharing with other tenants - just raw, dedicated performance.

Advantage of bare metal: direct access to hardware resources eliminates the overhead of virtualization, leading to faster processing speeds and reduced latency. This makes bare metal ideal for high-performance workloads such as large-scale data processing, AI, and financial trading.


What is the virtualization layer?
  • Abstraction: the virtualization layer abstracts physical hardware resources, making them appear as logical, independent resources to each VM.
  • Resource sharing: it allows multiple VMs to share the same physical resources (CPU, memory, storage, network).
  • Isolation: it ensures that each VM operates in its own isolated environment, preventing one VM's issues from impacting others.
  • Flexibility and scalability: it enables on-demand allocation of resources, making it easier to scale resources up or down as needed.


400

What is the fastest NVIDIA GPU shape?

GB300 (hasn't fully launched yet)

400

InfiniBand vs RoCE

RoCE offers greater versatility and relatively lower cost.

  • InfiniBand excels in raw performance: It typically offers lower latency and higher bandwidth compared to RoCE, especially in scenarios demanding ultra-low latency and consistent performance
  • InfiniBand's strengths: InfiniBand's design, with features like adaptive routing and robust congestion control, makes it particularly suitable for high-performance computing (HPC) clusters and demanding applications like real-time data processing


RoCE provides true RDMA semantics for Ethernet because it does not require the complex, low-performance TCP transport (needed for iWARP, for example). RoCE is the most efficient low-latency Ethernet solution today. It requires very low CPU overhead and takes advantage of Priority Flow Control in Data Center Bridging Ethernet for lossless connectivity.

400

Why is Oracle support better than competitors 

HPC, NCCL, and CUDA engineers aligned to you at no cost

Hands-on support: we help deploy, test, tune, and optimize GPU clusters.

On OCI you don't pay for broken nodes, support is included, and networking costs are negligible.

Bonus points (extra 200): what is an NCCL engineer? 

400

List 3 of the Gen AI models that Oracle sells

xAI - Grok 3, Grok 3 Mini, Grok 3 Fast, Grok 3 Mini Fast; Grok 4 in August!

Cohere - Command A, Command R+ 08-2024, Command R 08-2024, Embed, Rerank

Meta - Llama 4, Llama 3.3, Llama 3.2 with vision, Llama 3.1 405B

NVIDIA - AI Enterprise, BYOM

Coming soon: Gemini and OpenAI

OpenAI and Anthropic are third-party hosted models supported via OCI's LLM Gateway.

400

tell me about our supercluster manager 

Supercluster manager - why we might win the race: it's how we actually maintain our zettascale clusters (over 131,000 GPUs). Lots of GPUs fail - a much higher failure rate than CPUs - and our supercluster manager allows us to keep our clusters as healthy as possible.

AI/ML on OCI NVIDIA GPU Superclusters launched earlier this year. For many large labs running these types of GPU-based workloads at scale, monitoring can be a huge challenge. Out of the box, OCI has excellent monitoring solutions, including some GPU instance metrics, but deeper integration into GPU metrics for OCI Superclusters is a big differentiator.

Setting up your GPU monitoring is a very straightforward process, especially if you're already using the HPC Marketplace stack - most of the tooling is already included in that deployment image. Once you install and run the NVIDIA DCGM exporter Docker container, set up Prometheus on a separate compute node, and set up the Grafana dashboard, the NVIDIA DCGM dashboard will display GPU information for the cluster hosts targeted by Prometheus. Out of the box, the dashboard includes valuable information for a given date/time range.

This monitoring data is invaluable when you need deeper insight into your infrastructure while running AI/ML workloads on OCI GPU Superclusters. If you have AI/ML GPU-based workloads that require ultrafast cluster networking, consider OCI for its industry-leading scalability at a much better price than other cloud providers.

In a nutshell: a dashboard that shows how, and how much, you are utilizing your GPUs.
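The same DCGM-exporter metrics that feed the Grafana dashboard can be queried programmatically through Prometheus. A minimal sketch (the Prometheus address is hypothetical; `DCGM_FI_DEV_GPU_UTIL` is the exporter's GPU-utilization gauge, and the `Hostname` label is assumed to be present in your deployment):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Prometheus endpoint scraping the NVIDIA DCGM exporter
PROM_URL = "http://prometheus.example:9090"

# Average GPU utilization per host, from the DCGM exporter's gauge
GPU_UTIL_QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"

def parse_instant_query(payload):
    """Flatten a Prometheus instant-query response into {host: utilization %}."""
    return {
        r["metric"].get("Hostname", "unknown"): float(r["value"][1])
        for r in payload["data"]["result"]
    }

def gpu_utilization(prom_url=PROM_URL, query=GPU_UTIL_QUERY):
    url = f"{prom_url}/api/v1/query?query={urllib.parse.quote(query)}"
    with urllib.request.urlopen(url) as resp:  # needs a reachable Prometheus
        return parse_instant_query(json.load(resp))
```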

500

on demand vs reserved contracts

Reserved
- Long-term commitment (1 or 3 years) with a discounted rate
- Lower cost compared to on-demand
- Less flexible, as you commit to using the instance for a fixed period
- Ideal for predictable workloads with consistent usage

On-demand - pay-as-you-go model with no long-term commitment
- Higher cost compared to reserved
- Highly flexible, as you can start and stop instances as needed
- Suitable for unpredictable workloads or short-term projects

Spot - utilizes spare capacity at a discounted rate, but can be interrupted
- Lowest cost, but with the risk of interruption
- Least flexible, as instances can be terminated with little notice
- Best for flexible, fault-tolerant workloads that can handle interruptions
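The reserved-vs-on-demand trade-off boils down to a break-even calculation. A minimal sketch with hypothetical hourly rates (the prices below are assumptions for illustration, not quoted rates):

```python
# Hypothetical rates for the same GPU instance (illustration only)
ON_DEMAND_RATE = 4.00   # $/hour, pay only while running
RESERVED_RATE = 2.60    # $/hour effective, billed for the whole 1-year term

HOURS_PER_YEAR = 8760

def annual_cost_on_demand(hours_used):
    return hours_used * ON_DEMAND_RATE

def annual_cost_reserved():
    # Reserved capacity is paid for whether or not you use it
    return HOURS_PER_YEAR * RESERVED_RATE

def break_even_utilization():
    """Fraction of the year you must run before reserved beats on-demand."""
    return RESERVED_RATE / ON_DEMAND_RATE
```

With these assumed rates, the break-even point is 65% utilization: below it, on-demand costs less; above it, reserved wins.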


500

How has Oracle tweaked our RoCE v2 architecture? 

we added in collision control - collisions are the death of an AI network 


500

Oracle vs competitors in performance 

Competitors offer on-demand and spot, but performance isn't great, with noisy-neighbor issues when you share with other customers. With us, customers get 100% dedicated bare metal infrastructure and direct access to the server and GPU.

500

Why was our partnership with Cohere so important? 


Cohere is one of the early OGs of the whole AI transformer world. NVIDIA and Oracle both invested in Cohere, seeing that they were going to do things in enterprise very differently.

Transformers are a type of deep learning architecture that helps AI read sentences, using self-attention to determine the relevance and importance of each word.

Transformer models have encoder and decoder parts: the encoder reads the input text and encodes it into embeddings that capture the meaning of the text; the decoder uses these embeddings to generate output text.

Language models understand "tokens" rather than words - one token can be part of a word, an entire word, or punctuation.
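The token idea can be illustrated with a toy tokenizer (this is not a real LLM tokenizer such as BPE; it just shows that one token can be a whole word, part of a word, or punctuation):

```python
import re

def toy_tokenize(text, max_piece=4):
    """Toy subword tokenizer: words longer than max_piece characters are
    split into pieces; punctuation becomes its own token."""
    tokens = []
    for chunk in re.findall(r"\w+|[^\w\s]", text):
        if chunk.isalnum():
            # break long words into fixed-size pieces, like subword units
            tokens += [chunk[i:i + max_piece] for i in range(0, len(chunk), max_piece)]
        else:
            tokens.append(chunk)   # punctuation is its own token
    return tokens

toy_tokenize("Transformers rock!")  # ['Tran', 'sfor', 'mers', 'rock', '!']
```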

500

List all the differentiators in the Why OCI pitch

Off box virtualization

Core density per MW

Flexible Infrastructure 

Networking 

Regional Pricing 

Bonus: late mover advantage, white glove support, multicloud 

Bonus 350 if you can say what each differentiator means 
