Resource Actions
CloudBank
Description
CloudBank enables researchers to access commercial cloud platforms through the ACCESS and NAIRR Pilot allocation processes. Awards are made in dollars and are usable across the following commercial cloud platforms: Amazon Web Services, Google Cloud, IBM Cloud, and Microsoft Azure. After onboarding, the CloudBank portal automatically creates cloud accounts for the platforms listed in your request, with the option to add more later. In the CloudBank Portal, you can manage which authorized users have access to your cloud accounts and sign in to the cloud vendor platforms using federated login with your ACCESS credentials. CloudBank continuously monitors resource usage, applies charges against your allocation, and delivers automated spend alerts. Regulated data is also supported on CloudBank.
Description
CloudBank Classroom is a cloud-hosted, small-scale JupyterHub designed for teaching, built on an open-source interactive computing stack (Python, R, JupyterLab, and related libraries). It provides each learner with a modest allocation of persistent storage, memory, and CPUs as well as seamless authentication through their university’s identity management system, allowing them to log in with existing campus credentials. The environment comes preconfigured with widely used data science tools, enabling students to run code, analyze data, and complete assignments directly in the browser without local setup. Instructors can easily distribute materials, manage assignments, and support learners through integrated tools like nbgitpuller and grading extensions. By removing technical barriers, ensuring equity of access, and offering a consistent, reproducible workflow, the JupyterHub lowers the cost of teaching with data while remaining scalable to support classes of varying sizes. GPUs are expected to be available in the platform by fall 2026.
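For instructors, distributing material through nbgitpuller usually comes down to sharing a link that pulls (or updates) a Git repository into each student's hub workspace. Below is a minimal sketch of constructing such a link; the hub address, repository, branch, and notebook path are placeholders for illustration, not actual CloudBank Classroom values.

```python
from urllib.parse import urlencode

# Hypothetical hub address and course repository -- substitute your own.
HUB_URL = "https://classroom.example.edu"
params = {
    "repo": "https://github.com/example-course/materials",   # course materials repo
    "branch": "main",
    # Open the pulled notebook directly in JupyterLab after sign-in.
    "urlpath": "lab/tree/materials/week01/assignment.ipynb",
}

# nbgitpuller links target the hub's user-redirect git-pull endpoint.
link = f"{HUB_URL}/hub/user-redirect/git-pull?{urlencode(params)}"
print(link)
```

A student who follows the generated link authenticates with campus credentials and lands in an up-to-date copy of the course repository.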
Georgia Institute of Technology
Description
Nexus is a collaborative computing platform jointly developed by the Georgia Institute of Technology (Georgia Tech) and the National Center for Supercomputing Applications (NCSA). It enables researchers to move seamlessly between local and national computing environments, combining AI, data science, and simulation into a unified, adaptive ecosystem. Nexus lowers barriers to high-performance computing, making advanced cyberinfrastructure accessible to a broader scientific community.
Indiana University
Description
Jetstream2 is a hybrid-cloud platform that provides flexible, on-demand, programmable cyberinfrastructure tools ranging from interactive virtual machine services to a variety of infrastructure and orchestration services for research and education. The primary resource is a standard CPU resource consisting of AMD EPYC 7713 (Milan) CPUs with 128 cores per node and 512 GB of RAM per node, connected by 100 Gbps Ethernet to the spine.
Description
Jetstream2 GPU is a hybrid-cloud platform that provides flexible, on-demand, programmable cyberinfrastructure tools ranging from interactive virtual machine services to a variety of infrastructure and orchestration services for research and education. This portion of the resource is allocated separately from the primary resource and contains 360 NVIDIA A100 GPUs, with 4 GPUs, 128 AMD Milan cores, and 512 GB of RAM per node, connected by 100 Gbps Ethernet to the spine.
Description
Jetstream2 LM is a hybrid-cloud platform that provides flexible, on-demand, programmable cyberinfrastructure tools ranging from interactive virtual machine services to a variety of infrastructure and orchestration services for research and education. This portion of the resource is allocated separately from the primary resource and contains 32 GPU-ready large-memory compute nodes, each with 1 TB of RAM and AMD EPYC 7713 (Milan) CPUs with 128 cores per node, connected by 100 Gbps Ethernet to the spine.
Institute for Advanced Computational Science at Stony Brook University
Description
AMA27 is an ARM-based HPC cluster featuring the AmpereOne A192-32M processor. The first phase of the cluster's deployment consists of 312 compute nodes, each with a single-socket CPU with 192 cores and 384 GB of DDR5 memory. The system is capable of delivering more than 500 million CPU core-hours per year. AMA27 also features a 5 PB GPFS file system. Compute nodes are connected via a 200 Gbps NDR InfiniBand network.
NRP
Description
The Prototype National Research Platform (PNRP) is a Category II NSF-funded system integrated into the Nautilus cluster operated jointly by the San Diego Supercomputer Center (SDSC) at UC San Diego, the Massachusetts Green High Performance Computing Center (MGHPCC), and the University of Nebraska–Lincoln (UNL). The system features a novel, extremely low-latency fabric from GigaIO that allows dynamic composition of hardware, including FPGAs, GPUs, and NVMe storage. Each of the three sites (SDSC, UNL, and MGHPCC) includes ~1 PB of usable disk space. The three storage systems function as data origins of a content delivery network (CDN), providing data access anywhere in the country within a round-trip delay of ~10 ms via network caches at the three sites and five Internet2 network colocation facilities. The Nautilus cluster serves the broader National Research Platform (NRP), a community-owned research and education platform connecting researchers and educators to foster collaboration, accelerate innovation, and share resources. The PNRP contribution to the cluster comprises 1) an HPC subsystem at SDSC with 8 HGX A100 servers, each with 8 80 GB A100 GPUs, 512 GB of memory, and 1 TB of NVMe storage, plus 32 Alveo U55C FPGAs available to composed nodes via the GigaIO fabric and 122 TB of FabreX-connected NVMe; 2) two FP32 subsystems, one each at UNL and MGHPCC, each with 18 GPU nodes with 8 A10 GPUs, 512 GB of memory, and 8 TB of NVMe per node; and 3) 8 distributed data caches of 50 TB each. In addition, the distributed Kubernetes cluster architecture enables other institutions to incorporate their own resources into the cluster.
NSF National Center for Atmospheric Research
Description
The Derecho-GPU allocated resource is composed of 82 nodes, each with a single-socket AMD Milan processor, 512 GB of memory, and 4 NVIDIA A100 Tensor Core GPUs connected by a 600 GB/s NVLink GPU interconnect, for a total of 328 A100 GPUs. Each A100 GPU has 40 GB of HBM2 memory. The Derecho-GPU nodes each have four injection ports into Derecho's Slingshot interconnect. The NSF National Center for Atmospheric Research (NCAR) operates the Derecho system to support Earth system science and related research by researchers at U.S. institutions.
Description
The Derecho supercomputer is a 19.87-petaflops HPE Cray EX cluster with 2,488 nodes, each with two 64-core AMD EPYC 7763 Milan processors, for a total of 323,712 processor cores. Each node has 256 GB of DDR4 memory. The Derecho nodes are connected by an HPE Slingshot v11 high-speed interconnect in a dragonfly topology. The NSF National Center for Atmospheric Research (NCAR) operates the Derecho system to support Earth system science and related research by researchers at U.S. institutions.
National Center for Supercomputing Applications
Description
The NCSA Granite Tape Archive is architected around a 19-frame Spectra TFinity tape library outfitted with 20 LTO-9 tape drives, providing a total capacity of over 170 PB of accessible and replicated data, of which 3.6 PB is currently available for ACCESS allocations. The additional capacity is reserved for other NCSA use, and additional space remains available to expand the archive when needed. The archive operates on Versity's ScoutAM/ScoutFS products, giving users a single archive namespace from which to stage data in and out. Access to the Granite system is available directly via Globus and S3 tools.
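Because Granite is reachable through Globus, a typical archiving step is a Globus transfer task from a project collection into the archive namespace. The sketch below uses the globus-sdk Python package; the access token, collection UUIDs, and paths are placeholders that would come from your own Globus account and allocation, not values defined by NCSA.

```python
import globus_sdk

# Placeholders -- the Granite collection UUID, your source collection UUID,
# and a transfer-scoped access token come from your own Globus setup.
TRANSFER_TOKEN = "..."
SOURCE_COLLECTION = "00000000-0000-0000-0000-000000000000"
GRANITE_COLLECTION = "11111111-1111-1111-1111-111111111111"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe one archive operation: stage a project directory into Granite.
task = globus_sdk.TransferData(
    tc, SOURCE_COLLECTION, GRANITE_COLLECTION, label="stage dataset to Granite"
)
task.add_item("/projects/mylab/dataset/", "/archive/mylab/dataset/", recursive=True)

result = tc.submit_transfer(task)
print("Submitted Globus transfer task:", result["task_id"])
```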
Description
The DeltaAI resource comprises 114 NVIDIA quad Grace Hopper nodes interconnected by HPE's Slingshot interconnect. Each Grace Hopper node consists of four NVIDIA superchips, each with one ARM-based CPU, 128 GB of LPDDR5 RAM, and one H100 GPU with 96 GB of HBM. The four superchips are tightly coupled with NVLink and share a unified memory space.
Description
The Delta GPU resource comprises 5 different node configurations intended to support accelerated computation across a broad range of domains, including traditional simulation and AI/ML work. Delta is designed to support the transition of applications from CPU-only to GPU or hybrid CPU-GPU models. Delta GPU resource capacity is predominantly provided by 200 single-socket nodes, each configured with one AMD EPYC 7763 (“Milan”) processor with 64 cores/socket (64 cores/node) at 2.55 GHz and 256 GB of DDR4-3200 RAM. Half of these single-socket GPU nodes (100 nodes) are configured with 4 NVIDIA A100 GPUs with 40 GB HBM2 RAM and NVLink (400 total A100 GPUs); the remaining half (100 nodes) are configured with 4 NVIDIA A40 GPUs with 48 GB GDDR6 RAM and PCIe 4.0 (400 total A40 GPUs). Rounding out the GPU resource are 14 additional “dense” GPU nodes, containing 8 GPUs each in a dual-socket CPU configuration (128 cores per node) with 2 TB of DDR4-3200 RAM, but otherwise configured similarly to the single-socket GPU nodes. Within the “dense” GPU nodes, 5 nodes employ NVIDIA A100 GPUs (40 total A100 GPUs in the “dense” configuration), 1 node employs AMD MI100 GPUs (8 total MI100 GPUs) with 32 GB HBM2 RAM, and 8 nodes have NVIDIA H200 GPUs with 141 GB of HBM each. A 1.6 TB NVMe solid-state disk is available for use as local scratch space during job execution on each GPU node type. All Delta GPU compute nodes are interconnected to each other and to the Delta storage resource by a 200 Gb/s HPE Slingshot network fabric. One Delta GPU SU is equal to one A100 GPU-hour in the standard quad-A100 partition. Other node types have charge factors that reflect their relative cost, with H200s costing 3x an A100 GPU-hour.
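As a rough illustration of the charge model described above, the SU cost of a job is the product of GPU count, wall-clock hours, and the per-GPU charge factor. The sketch below encodes only the two factors stated here (A100 at 1x, H200 at 3x); factors for the other node types are set by the resource provider and are not assumed.

```python
# Charge factors stated in the description: 1 SU = 1 A100 GPU-hour in the
# standard quad-A100 partition; H200 GPU-hours are charged at 3x.
CHARGE_FACTOR = {"A100": 1.0, "H200": 3.0}

def delta_gpu_sus(gpu_type: str, gpus: int, hours: float) -> float:
    """Estimate the SU charge for a job using `gpus` GPUs for `hours` hours."""
    return CHARGE_FACTOR[gpu_type] * gpus * hours

# Example: a 4-GPU, 10-hour job on A100s vs. the same job on H200s.
print(delta_gpu_sus("A100", 4, 10))  # 40.0 SUs
print(delta_gpu_sus("H200", 4, 10))  # 120.0 SUs
```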
Description
The Delta CPU resource comprises 124 dual-socket compute nodes for general-purpose computation across a broad range of domains that benefit from the scalar and multi-core performance of the CPUs, such as appropriately scaled weather and climate, hydrodynamics, astrophysics, and engineering modeling and simulation, as well as other domains using algorithms not yet adapted for the GPU. Each Delta CPU node is configured with 2 AMD EPYC 7763 (“Milan”) processors with 64 cores/socket (128 cores/node) at 2.45 GHz and 256 GB of DDR4-3200 RAM. An 800 GB NVMe solid-state disk is available for use as local scratch space during job execution. All Delta CPU compute nodes are interconnected to each other and to the Delta storage resource by a 100 Gb/s HPE Slingshot network fabric.
OSG Consortium
Description
A virtual HTCondor pool made up of resources from the OSG Consortium.
Open Storage Network
Description
The Open Storage Network (OSN) is an NSF-funded cloud storage resource, geographically distributed among storage pods. OSN is a collaboration between MGHPCC, SDSC, NCSA, Rice, JHU, and RENCI, with a federation of pod-owning sites and contributions from other advanced computing centers. Each OSN pod currently hosts 1.5 PB or more of storage and is connected to R&E networks at speeds between 40 and 100 Gbps. OSN storage is allocated in buckets and is accessible using S3 interfaces, including tools such as Rclone, Cyberduck, and the AWS CLI, or via REST API interfaces.
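Because OSN buckets speak the S3 protocol, any S3 client can be pointed at the pod hosting an allocation. The sketch below uses boto3; the endpoint URL, bucket name, and credentials are placeholders, since the actual pod endpoint and keys are issued with the allocation.

```python
import boto3

# Placeholders -- the S3 endpoint depends on which OSN pod hosts the bucket,
# and the access keys are issued with the allocation.
s3 = boto3.client(
    "s3",
    endpoint_url="https://osn-pod.example.org",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

BUCKET = "your-allocated-bucket"

# Upload a file into the allocated bucket, then list what is stored there.
s3.upload_file("results.tar.gz", BUCKET, "project/results.tar.gz")
for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The same endpoint and credentials work equally well with Rclone, Cyberduck, or the AWS CLI.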
Pittsburgh Supercomputing Center
Description
Bridges-2 combines high-performance computing (HPC), high performance artificial intelligence (HPAI), and large-scale data management to support simulation and modeling, data analytics, community data, and complex workflows.
Bridges-2 Extreme Memory (EM) nodes enable memory-intensive genome sequence assembly, graph analytics, in-memory databases, statistics, and other applications that need a large amount of memory and for which distributed-memory implementations are not available. Each Bridges-2 EM node consists of 4 Intel Xeon Platinum 8260M “Cascade Lake” CPUs, 4 TB of DDR4-2933 RAM, and a 7.68 TB NVMe SSD. They are connected to Bridges-2's other compute nodes and its Ocean parallel filesystem and archive by two HDR-200 InfiniBand links, providing 400 Gbps of bandwidth to read or write data from each EM node.
Description
The Brain Image Library (BIL) is a national public resource enabling researchers to deposit, analyze, mine, share and interact with large brain image datasets. BIL encompasses the deposition of datasets, the integration of datasets into a searchable web-accessible system, the redistribution of datasets, and a computational enclave to allow researchers to process datasets in-place and share restricted and pre-release datasets. The BIL is operated as a partnership between the Biomedical Applications Group at the Pittsburgh Supercomputing Center and the Center for Biological Imaging at the University of Pittsburgh.
Description
Neocortex is a highly innovative advanced computing system ideal for foundation and large language models. Neocortex, which captures promising specialized innovative hardware technologies, is designed to vastly accelerate large deep learning (DL) models and high-performance computing (HPC) research in pursuit of science, discovery, and societal good. Neocortex features two Cerebras CS-2 systems, provisioned by an HPE Superdome Flex HPC server and the Bridges-2 filesystems. Each CS-2 system features a Cerebras WSE-2 (Wafer Scale Engine 2), the largest chip ever built, with 850,000 Sparse Linear Algebra Compute cores, 40 GB of on-chip SRAM, 20 PB/s aggregate memory bandwidth, and 220 Pb/s interconnect bandwidth. The HPE Superdome Flex (SDF) features 32 Intel Xeon Platinum 8280L CPUs with 28 cores (56 threads) each, 2.70-4.0 GHz, 38.5 MB cache, 24 TiB RAM, aggregate memory bandwidth of 4.5 TB/s, and 204.6 TB aggregate local storage capacity with 150 GB/s read bandwidth. The SDF can provide 1.2 Tb/s to each CS-2 system and 1.6 Tb/s from the Bridges-2 filesystems. Jobs are submitted via SLURM. The CS-2 systems can run customized TensorFlow and PyTorch containers, as well as programs written using the Cerebras SDK or the WSE Field Equation API.
Description
The Bridges-2 Ocean data management system provides a unified, high-performance filesystem for active project data, archive, and resilience. Ocean consists of two tiers, disk and tape, transparently managed by HPE DMF as a single, highly usable namespace.
Ocean's disk subsystem, for active project data, is a high-performance, internally resilient Lustre parallel filesystem with 15 PB of usable capacity, configured to deliver up to 129 GB/s and 142 GB/s of read and write bandwidth, respectively.
Ocean's tape subsystem, for archive and additional resilience, is a high-performance tape library with 7.2 PB of uncompressed capacity (estimated 8.6 PB compressed, with compression done transparently in hardware with no performance overhead), configured to deliver 50 TB/hour.
Description
Bridges-2 combines high-performance computing (HPC), high performance artificial intelligence (HPAI), and large-scale data management to support simulation and modeling, data analytics, community data, and complex workflows. Bridges-2 Accelerated GPU (GPU) nodes are optimized for scalable artificial intelligence (AI; deep learning). They are also available for accelerated simulation and modeling applications. Bridges-2 has four types of GPU nodes: 10 HPE Cray 670 h100-80 nodes with eight H100-SXM5-80GB GPUs, each with 80 GB of GPU memory, and a total of 2 TB of RAM per node; 24 HPE v100-32 nodes with eight V100 GPUs with NVLink, each with 32 GB of GPU memory, and a total of 512 GB of RAM per node; 9 v100-16 nodes containing eight V100 GPUs without NVLink, each with 16 GB of GPU memory, and a total of 192 GB of RAM per node; and 3 HPE l40s-48 nodes with 8 L40S GPUs without NVLink, each with 48 GB of GPU memory, and a total of 1 TB of RAM per node.
The nodes are connected to Bridges-2's other compute nodes and its Ocean parallel filesystem and archive by two HDR-200 InfiniBand links, providing 400Gbps of bandwidth to enhance scalability of deep learning training.
Description
Neocortex is a highly innovative resource that targets the acceleration of AI-powered scientific discovery by vastly shortening the time required for deep learning training, fostering greater integration of artificial deep learning with scientific workflows, and providing revolutionary new hardware for the development of more efficient algorithms for artificial intelligence and high performance computing.
The HPE Superdome Flex (SDFlex) features 32 Intel Xeon Platinum 8280L CPUs with 28 cores (56 threads) each, 2.70-4.0 GHz, 38.5 MB cache, 24 TiB RAM, aggregate memory bandwidth of 4.5 TB/s, and 204.6 TB aggregate local storage capacity with 150 GB/s read bandwidth. The SDFlex can provide 1.2 Tb/s to each CS-2 system and 1.6 Tb/s from the Bridges-2 filesystems.
SDFlex service units are calculated as chassis-hours. Each chassis has 112 CPU cores, so one SDFlex SU equals 112 core-hours.
Description
Bridges-2 combines high-performance computing (HPC), high performance artificial intelligence (HPAI), and large-scale data management to support simulation and modeling, data analytics, community data, and complex workflows.
Bridges-2 Regular Memory (RM) nodes provide extremely powerful general-purpose computing, machine learning and data analytics, AI inferencing, and pre- and post-processing. Each Bridges-2 RM node consists of two AMD EPYC “Rome” 7742 64-core CPUs, 256-512 GB of RAM, and a 3.84 TB NVMe SSD. 488 Bridges-2 RM nodes have 256 GB of RAM, and 16 have 512 GB of RAM for more memory-intensive applications (see also Bridges-2 Extreme Memory nodes, each of which has 4 TB of RAM). Bridges-2 RM nodes are connected to other Bridges-2 compute nodes and the Ocean parallel filesystem and archive by HDR-200 InfiniBand.
Purdue University
Description
The Purdue Anvil AI system has 21 nodes, each with four NVIDIA H100 SXM GPUs with 80 GB of memory, to support machine learning and artificial intelligence applications.
Description
16 nodes, each with four NVIDIA A100 Tensor Core GPUs, providing 1.5 PF of single-precision performance to support machine learning and artificial intelligence applications.
Description
Purdue's Anvil cluster, built in partnership with Dell and AMD, consists of 1,000 nodes with two 64-core AMD EPYC "Milan" processors each and delivers over 1 billion CPU core hours each year, with a peak performance of 5.1 petaflops. Each of these nodes has 256 GB of DDR4-3200 memory. A separate set of 32 large-memory nodes has 1 TB of DDR4-3200 memory each. Anvil's nodes are interconnected with 100 Gbps Mellanox HDR100 InfiniBand.
San Diego Supercomputer Center
Description
The Cosmos supercomputer is built on the HPE Cray Supercomputing EX2500 platform, incorporating innovative AMD Instinct™ MI300A accelerated processing units (APUs), the HPE Slingshot interconnect, and a flash-based filesystem. The APU uniquely features an in-chip memory layout that is integrated and shared between CPU and GPU resources. This type of memory architecture facilitates an incremental programming approach, which enables many communities to adopt GPUs and eases the process of porting and optimizing a range of applications. Cosmos has 42 nodes, each with 4 APUs in a fully connected network based on AMD's Infinity xGMI (socket-to-socket global memory interface) technology, which provides 768 GBps aggregate and 256 GBps peer-to-peer bi-directional bandwidth between APUs. Nodes are interconnected with a high-performance interconnect based on HPE's Slingshot technology, which provides low latency and congestion control. The high-performance VAST filesystem (551 TB usable) incorporates flash-based storage and provides the high IOPS and bandwidth needed for the anticipated mixed-application workload. Cosmos also has access to 4.9 PB of Ceph capacity storage to provide excellent I/O performance for most applications and to store persistent project data.
Description
Voyager is a heterogeneous system designed to support complex deep learning AI workflows. The system features 42 Intel Habana Gaudi training nodes, each with 8 training processors (336 in total). Each training node has 512 GB of memory and 6.4 TB of node-local NVMe storage. The Gaudi training processors feature specialized hardware units for AI, HBM2, and on-chip high-speed Ethernet. The on-chip Ethernet ports are used in a non-blocking all-to-all network between processors on a node, and the remaining ports are aggregated into six 400G connections on each node that are plugged into a 400G Arista switch to provide network scale-out. Voyager also has two first-generation inference nodes, each with 8 inference processors (16 in total). In addition to the custom AI hardware, the system has 36 Intel x86 compute nodes for general-purpose computing and data processing. Voyager features 3 PB of storage currently deployed as a Ceph filesystem.
Description
Expanse is a Dell integrated compute cluster with AMD Rome processors, 128 cores per node, interconnected with Mellanox HDR InfiniBand in a hybrid fat-tree topology. The compute node section of Expanse has a peak performance of 3.373 PF. Full bisection bandwidth is available at rack level (56 compute nodes) with HDR100 connectivity to each node. HDR200 switches are used at the rack level, with 3:1 oversubscription across racks. Compute nodes feature 1 TB of NVMe storage and 256 GB of DRAM per node. The system also features 12 PB of Lustre-based performance storage (140 GB/s aggregate) and 7 PB of Ceph-based object storage.
Description
5 PB of storage on a Lustre-based filesystem.
Description
Expanse is a Dell integrated compute cluster with AMD Rome processors and NVIDIA V100 GPUs, interconnected with Mellanox HDR InfiniBand in a hybrid fat-tree topology. The GPU component of Expanse features 52 GPU nodes, each containing four NVIDIA V100s (32 GB SXM2) connected via NVLink and dual 20-core Intel Xeon 6248 CPUs. The nodes feature 1.6 TB of NVMe storage and 256 GB of DRAM per node. There is HDR100 connectivity to each node. The system also features 12 PB of Lustre-based performance storage (140 GB/s aggregate) and 7 PB of Ceph-based object storage.
Texas A&M University
Description
Launch is a regional computational resource that supports researchers incorporating computational and data-enabled approaches in their scientific workflows at The Texas A&M University System Schools. A portion is offered to the national community. Researchers must be based in the US and associated with a US academic research institution.
Launch is a Dell Linux cluster with 45 compute nodes (8,640 cores) and 2 login nodes. There are 35 compute nodes with 384 GB of memory each and 10 GPU compute nodes with 768 GB of memory and two NVIDIA A30 GPUs each. The interconnecting fabric uses a single NVIDIA HDR100 InfiniBand switch.
Description
ACES is a Dell cluster with a rich accelerator testbed consisting of Intel Max GPUs (Graphics Processing Units), Intel FPGAs (Field Programmable Gate Arrays), NVIDIA H100 and A30 GPUs, NEC Vector Engines, NextSilicon co-processors, and Graphcore IPUs (Intelligence Processing Units). Researchers must be based in the US and associated with a US academic research institution.
The ACES cluster consists of compute nodes using a mix of the following processors:
Intel Xeon 8468 Sapphire Rapids processors
Intel Xeon Ice Lake 8352Y processors
Intel Xeon Cascade Lake 8268 processors
AMD Epyc Rome 7742 processors
The compute nodes are interconnected with NVIDIA NDR200 connections for MPI and access to the Lustre storage. The Intel Optane SSDs and all accelerators (except the Graphcore IPUs and NEC Vector Engines) are accessed using Liqid's composable infrastructure via PCIe (Peripheral Component Interconnect express) Gen4 and Gen5 fabrics.
Texas Advanced Computing Center
Description
TACC's High Performance Computing (HPC) systems are used primarily for scientific computing and, while their disk systems are large, they cannot provide long-term storage for the final data generated on these systems. The Ranch archive system fills this need for high-capacity, long-term storage by providing a massive, high-performance file system and a tape-based backing store designed, implemented, and supported specifically for archival purposes.
Ranch is a Quantum StorNext-based system, with a DDN-provided front-end disk system (30 PB raw) and a 5,000-slot Quantum Scalar i6000 library for its back-end tape archive.
Ranch is an allocated resource, meaning that Ranch is available only to users with an allocation on one of TACC's computational resources such as Frontera, Stampede3, or Lonestar6. ACCESS PIs will be prompted automatically for the companion storage allocation as part of the proposal submission process and should include a justification of the storage needs in their proposal. The default allocation on Ranch for users is 2TB. To request a shared Ranch project space for your team's use, please submit a TACC Helpdesk ticket.
Description
Stampede3 is generously funded by the National Science Foundation and is designed to serve today's researchers as well as support the research community on an evolutionary path toward many-core processors and accelerated technologies. Stampede3 maintains the familiar programming model for all of today's users and thus will be broadly useful for traditional simulation users, users performing data-intensive computations, and emerging classes of new users.
Texas Tech University
Description
The REmotely-managed Power Aware Computing Systems and Services (REPACSS) resource is a high-performance computing (HPC) cluster, powered by multiple forms of energy, developed to support research into advanced data center control for running scalable scientific workflows and data-intensive research in remotely managed settings. The focus of the project is on improvements to data center and infrastructure control to provide adaptability to emergent conditions and the ability to adjust workloads to match data center load conditions, including the availability and cost of electrical power. The GPU nodes feature dual-socket Intel Xeon Gold 6448Y processors, 512 GB of RAM, and 4 H100 GPUs connected as two H100-NVL pairs per node. The GPU nodes are interconnected with the rest of the cluster and with storage by NVIDIA ConnectX-7 NDR InfiniBand adapters running at 200 Gbps per card, with two InfiniBand cards per node. The Hammerspace storage provides nearly 3 PB of combined NVMe and HDD storage, supporting large-scale data throughput. All nodes are also controlled and provisioned through high-bandwidth Dell PowerSwitch S5248-ON and S5232-ON Ethernet switches at 25 Gbps per node. The cluster supports intelligent workload placement and adaptive scheduling tools that aim to match as much of the workload as possible to the availability of low-cost energy. REPACSS also features advanced remote management capabilities and automation tools for managing scientific workflows, specifically targeted for adoption at scale by other resource facilities and industry.
Description
The REmotely-managed Power Aware Computing Systems and Services (REPACSS) resource is a high-performance computing (HPC) cluster, powered by multiple forms of energy, developed to support research into advanced data center control for running scalable scientific workflows and data-intensive research in remotely managed settings. The focus of the project is on improvements to data center and infrastructure control to provide adaptability to emergent conditions and the ability to adjust workloads to match data center load conditions, including the availability and cost of electrical power. The CPU infrastructure comprises 110 AMD EPYC 9754 compute nodes with access to high-speed cluster-wide storage. Each CPU compute node offers 256 cores and 1.5 TB of DDR5 memory, supported by local NVMe swap and temporary storage (1.92 TB) for high-speed checkpoint and restore and local ephemeral usage. The CPU nodes are interconnected with the rest of the cluster and with storage by NVIDIA ConnectX-7 NDR InfiniBand adapters running at 200 Gbps per card, with two InfiniBand cards per node. The Hammerspace storage provides nearly 3 PB of combined NVMe and HDD storage, supporting large-scale data throughput. All nodes are controlled and provisioned through high-bandwidth Dell PowerSwitch S5248-ON and S5232-ON Ethernet switches at 25 Gbps per node. The cluster supports intelligent workload placement and adaptive scheduling tools that aim to match as much of the workload as possible to the availability of low-cost energy. REPACSS also features advanced remote management capabilities and automation tools for managing scientific workflows, specifically targeted for adoption at scale by other resource facilities and industry.
University of Kentucky
Description
Five large-memory compute nodes dedicated to XSEDE allocation. Each of these nodes has 40 cores (Broadwell-class Intel(R) Xeon(R) CPU E7-4820 v4 @ 2.00GHz, with 4 sockets and 10 cores/socket), 3 TB of RAM, and 6 TB of SSD storage. The 5 dedicated XSEDE nodes have exclusive access to approximately 300 TB of network-attached disk storage. All of these compute nodes are interconnected through a 100 Gigabit Ethernet (100GbE) backbone, and the cluster login and data transfer nodes are connected through a 100Gb uplink to Internet2 for external connections.