Skip to content

High-Performance Cluster Deployments#

Organizations looking to accelerate their analytics and model training processes can leverage VergeOS in a high-performance cluster deployment model. This document provides an overview of deploying VergeOS high-performance clusters using powerful, scalable hardware, outlines the key benefits, and details the different high-performance cluster configurations available.

Data Science Inc. Example#

Data Science Inc. will be used throughout this document to explain the key concepts and details around a high-performance cluster deployment. Data Science Inc. approached VergeOS to enhance their machine learning and data processing workloads with a solution that offers high performance, scalability, and simplified management.

What is a High-Performance Cluster#

In the context of VergeOS, a high-performance cluster refers to a robust, scalable solution designed for managing intensive compute tasks such as model training and large-scale data analysis. A typical VergeOS high-performance cluster consists of multiple nodes optimized for high IO-demand environments, featuring high-performance storage (SSD or NVMe) and a memory-to-core ratio of 8-1/16-1.

Image title
Typical 4 node High-Performance cluster

Why You Would Want a High-Performance Cluster#

High-performance clusters are essential for data-driven organizations due to several key benefits:

  • Performance Optimization: Automatically allocates resources based on the demands of specific data science workloads, including support for GPU virtualization to handle intensive compute tasks efficiently.

  • Scalability: Easy scaling of resources to manage increases in data volume and computational needs without significant downtime or configuration changes. Additional nodes can be seamlessly integrated with existing infrastructure.

  • Simplified Management: A centralized management dashboard provides insights into resource usage, performance metrics, and health monitoring of the HCI infrastructure, reducing IT overhead.

  • Reliability and Availability: High availability features ensure minimal downtime and maintain data integrity during node failures or maintenance. Backup and disaster recovery solutions are tailored to large-scale data sets and machine learning models.

  • Security: Robust security measures, including data encryption, ensure the protection of sensitive information.

Overview#

VergeOS offers a highly adaptable and configurable solution for customers looking to deploy high-performance workloads. This deployment will enable Data Science Inc. to leverage advanced computing capabilities for their machine learning and data processing tasks. The solution is designed to be centrally managed, providing seamless integration and scalability to handle increasing data volumes and computational demands.

Summary#

This deployment allows Data Science Inc. IT staff to centrally manage their machine learning and data processing workloads. They will be deploying separate clusters for workload and vSAN.

Hardware and Features#

Hardware#

For Data Science Inc.'s VergeOS deployment, the following hardware is recommended:

Networking#

  • Top of rack switching
  • (2x) FS N8560-32C, 32-Port 100Gb QSFP28

vSAN Storage Cluster#

  • (2x) vergeOS Controller Nodes

  • 8 core HIGH clock speeds CPU

  • 768 GB ECC RAM
  • (2x) 2TB NVMe (vergeOS TIER0, 3DWPD)
  • (2x) 100Gbe vergeOS Core Nics
  • (2x) 25Gbe External Nics
  • On Board IPMI
  • Redundant Power Supplies

  • (8x) vergeOS vSAN Nodes

  • (2x) 12 Core @3Ghz (Xeon Gold 5317)

  • 512 GB ECC RAM
  • (22x) 7.68TB 2.5" NVMe SSD (1DWPD)
  • (2x) 100Gbe vergeOS Core Nics
  • On Board IPMI
  • Redundant Power Supplies

Workload Cluster#

  • (35x) vergeOS Compute Nodes

  • (2x) 32 Core @2.1 Ghz (Xeon Gold 6530)

  • 768 GB ECC RAM
  • (1x) 960GB NVMe vergeOS Boot Drive (1 DWPD)
  • (2x) 25Gbe vergeOS Core Nics
  • On Board IPMI
  • Redundant Power Supplies

VergeOS Features#

The deployment will leverage the following VergeOS features:

  • Resource Allocation: Automatic resource allocation based on workload demands
  • Global Inline Deduplication: Reduces storage consumption by eliminating duplicate data
  • GPU Virtualization: Efficient handling of intensive compute tasks
  • Catalog Repositories: Centralized management of VM recipes
  • Cloud Snapshots: Local snapshots for quick recovery of a site
  • Security: Robust security measures, including data encryption