Using GPU-accessible VS Code Server on UIUC Delta

Why I am writing this blog post: Many UIUC students rely on Delta to access GPU resources for their research. Delta provides 4 SSH-enabled login nodes and many compute nodes with GPUs. Usually, we must first SSH to a login node (with a password and Duo 2FA OTP) and then use srun to request GPU resources to run our code. However, in my experience, we can run into many problems when using Delta: ...

2024-12-22 · 3 min · Monsoon

NFS Performance Tuning

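Preface: This article is a guide to NFS performance tuning over a 10 Gbps network in production scenarios, summarized from my own practice, with a particular focus on optimizing reads and writes of lots of small files (LOSF). Tuning: Hardware: On the network hardware side, both bandwidth and latency matter. ...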

2024-02-16 · 4 min · Monsoon

Building WireGuard VPN for Machine Learning Server Cluster

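Motivation: A machine learning cluster needs a secure way to expose services to users and to interconnect servers across the public internet, which calls for deploying a VPN. Deploying the VPN requires considering the following factors: network topology (a suitable topology must be chosen to keep latency as low as possible); user management (users can be easily added, removed, and authorized); and simplicity of use and maintenance. Design: Network topology: The network topology determines latency. ...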

2024-01-29 · 2 min · Monsoon

Building Storage System for Machine Learning Server Cluster

This is an unfinished blog.

2023-11-24 · 1 min · Monsoon

Custom PyTorch Operators on Ascend 910B

Environment: This article is based on Ascend 910B3 hardware. The software environment includes CANN 7.0-RC1, PyTorch 1.11.0, and Ascend PyTorch Adapter v5.0.rc3-pytorch1.11.0. Behavior on other CANN and PyTorch versions may differ slightly. ...

2023-11-14 · 2 min · Monsoon

Catching Mining Virus

Problem: On October 30, 2023, I received a warning message from the data center administrator, informing me that the firewall had detected mining traffic coming from a server I manage. The "mining traffic" was a DNS request for bitcoin.sipa.be sent to 223.5.5.5. Initially, I thought finding the virus process would be a simple task, just like my previous encounter with another mining virus. In that case, the hacker logged in to the server by cracking a weak SSH password and gained root permission, possibly by exploiting a privilege escalation vulnerability (the server was running EOL Ubuntu 16.04). Then a cron job was set up to run a mining virus. ...

2023-11-01 · 2 min · Monsoon

Optimizing MKL Performance on AMD CPUs

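Problem: Our lab has some servers with AMD EPYC 7713 processors. They were purchased because some people in the group run programs with very high CPU load (I don't know what the workloads are or why they can't run on GPUs, and I don't have the energy to help solve them one by one), and AMD processors, with their many cores, are a very good fit for this need. ...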

2023-06-19 · 2 min · Monsoon