Environment Setup
Build an AI development environment for learning.
The setup uses CUDA 11.7 and PyTorch 2.0 and covers configuring an SSH service, setting up the NVIDIA drivers, and a test script to verify that PyTorch and the GPU are working correctly.
Updated: an improved version of this setup is available here:
https://github.com/Microfish31/ai-env
Install docker-ce and the Docker client
- docker-ce
```bash
sudo apt-get update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-cache policy docker-ce
sudo apt install docker-ce
```
- docker client
When you install Docker CE, the Docker client tools are installed automatically.
- Configuration
This command sequence creates a docker group, adds the current user to the group, and refreshes the session so the user can run Docker commands without sudo immediately:
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```
- Docker Desktop (or use this)
https://docs.docker.com/desktop/gpu/
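To confirm the daemon is reachable and the group change took effect, a quick smoke test (hello-world is Docker's standard test image):
```bash
# Should succeed without sudo after the group configuration above
docker run --rm hello-world
docker version
```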
Setting Up NVIDIA Drivers and Docker Container Toolkit
- To enable GPU usage in the Docker container, install the NVIDIA Container Toolkit:
```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && sudo apt-get update
```
- Install the NVIDIA Container Toolkit and restart Docker:
```bash
sudo apt install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
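To check that the nvidia runtime was registered with Docker (a quick sanity check, not part of the original steps):
```bash
# The Runtimes line should now list nvidia alongside runc
docker info | grep -i runtimes
```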
Creating the Dockerfile
We will use the official PyTorch CUDA 11.7 container image and install OpenSSH for remote access.
- Create a file named Dockerfile in the current path and paste the content below:
```Dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

# Set the environment variable
ENV PATH="/opt/conda/bin:$PATH"

# Install openssh-server
RUN apt-get update && apt-get install -y openssh-server

# SSH configuration
RUN mkdir -p /var/run/sshd && \
    echo 'root:root' | chpasswd && \
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config

# Install JupyterLab
RUN pip install jupyterlab

# Expose port 22 for SSH
EXPOSE 22

# Expose port 8888 for JupyterLab
EXPOSE 8888

# Declare the workspace volume (host:container syntax is not valid in a
# Dockerfile; bind the host path with `docker run -v` at run time instead)
VOLUME ["/workspace"]

# Start SSH service and add env path
CMD ["sh", "-c", "echo 'export PATH=\"/opt/conda/bin:$PATH\"' >> ~/.bashrc && . ~/.bashrc && /usr/sbin/sshd -D"]
```
Building and Running the Docker Container
- Pull the PyTorch container image from Docker Hub:
```bash
docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
```
- Build the Docker image from the Dockerfile:
```bash
docker build -t cuda11.7-ssh-jupyter .
```
- Run the Docker container with GPU support and SSH enabled:
```bash
docker run --name my-ai-env --gpus all -p 3131:22 -p 8888:8888 -w /workspace -d cuda11.7-ssh-jupyter
```
- Connect to the container via SSH (replace the IP address as necessary; the root password is root, as set in the Dockerfile):
```bash
ssh root@<container-IP-address> -p 3131
```
PS: you can find the IP by using ifconfig.
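The Dockerfile's original VOLUME line intended to map /home/ubuntu/ai on the host into /workspace; since a Dockerfile cannot reference host paths, that mapping belongs on docker run instead (a sketch using the same paths):
```bash
# Bind-mount the host workspace so files persist across container rebuilds
docker run --name my-ai-env --gpus all -p 3131:22 -p 8888:8888 \
  -v /home/ubuntu/ai:/workspace -w /workspace -d cuda11.7-ssh-jupyter
```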
Verify your GPU in a new container
```bash
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi
```
Verify Jupyter
```bash
jupyter server --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''
```
Visit http://172.31.230.225:8888/lab in a Chrome browser (replace the IP with your server's address).
Verify GPU with PyTorch
You can run the following Python script to verify that PyTorch is correctly installed and that the GPU is available:
```python
import torch

def check_pytorch_and_gpu():
    # Check if PyTorch is installed
    if torch.__version__:
        print(f"PyTorch version: {torch.__version__} is installed.")
    else:
        print("PyTorch is not installed.")

    # Check if a GPU is available
    if torch.cuda.is_available():
        print(f"GPU is available. GPU name: {torch.cuda.get_device_name(0)}")
        print(f"CUDA version: {torch.version.cuda}")
    else:
        print("GPU is not available. Running on CPU.")

if __name__ == "__main__":
    check_pytorch_and_gpu()
```
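Availability alone does not prove kernels actually execute; a small matrix multiply on the device is a stronger check (an extra step beyond the original script, run inside the container):
```bash
# Run a 1024x1024 matrix multiply on the GPU and print a checksum
python -c "import torch; x = torch.rand(1024, 1024, device='cuda'); print((x @ x).sum().item())"
```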

Backup docker images
- Save a docker image
```bash
docker image save -o <output-file>.tar <image-name>:<tag>
```
- Load a docker image
```bash
docker load -i /path/to/your-image-file.tar
```
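For example, backing up and restoring the image built earlier in this guide:
```bash
docker image save -o cuda11.7-ssh-jupyter.tar cuda11.7-ssh-jupyter:latest
docker load -i cuda11.7-ssh-jupyter.tar
```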
Setting the Container Path (moved to dockerfile)
Since SSH-ing into a container does not initialize the environment as expected, we need to manually add the Anaconda path to ensure that Python packages are accessible.
- To add the Anaconda path to the container, execute the following command:
```bash
echo 'export PATH="/opt/conda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```
Config for containerd
- Path: /etc/containerd/config.toml
```toml
version = 2

[plugins]
  [plugins."io.containerd.runtime.v1.linux"]
    shim_debug = true

  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

[debug]
  level = "info"

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[grpc]
  address = "/run/containerd/containerd.sock"
  uid = 0
  gid = 0

[timeouts]
  task_shutdown = "15s"

[ttrpc]
  address = ""

[proxy_plugins]
  [proxy_plugins."snapshot-overlayfs"]
    type = "snapshot"
    address = "/run/containerd/snapshotter-overlayfs.sock"

[plugins."io.containerd.snapshotter.v1.overlayfs"]
  root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
```
- After configuring, restart the containerd service:
```bash
sudo systemctl restart containerd
```
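To confirm the file parses and the nvidia runtime was picked up, containerd can dump its merged configuration (a quick sanity check using containerd's standard subcommand):
```bash
# The nvidia runtime entry should appear in the merged config
sudo containerd config dump | grep -A 2 nvidia
systemctl status containerd
```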
Next -> Reduce image size
Build the image from nvidia/cuda, adding the following yourself (see the sketch after the links below):
- pytorch
- python
- pip
- miniconda
https://hub.docker.com/r/nvidia/cuda
https://pytorch.org/
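A minimal sketch of the idea, assuming the nvidia/cuda 11.7 base tag used earlier and PyTorch's cu117 wheel index; package names and versions are illustrative, and a real build would bake these steps into a Dockerfile instead:
```bash
# Pull the slimmer CUDA base image (no PyTorch, no conda baked in)
docker pull nvidia/cuda:11.7.1-base-ubuntu20.04

# Install Python, pip, and a CUDA 11.7 PyTorch wheel, then verify
docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 bash -c '
  apt-get update && apt-get install -y python3 python3-pip &&
  pip3 install torch==2.0.0 --index-url https://download.pytorch.org/whl/cu117 &&
  python3 -c "import torch; print(torch.cuda.is_available())"
'
```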
Docker add user
- Write this in the Dockerfile:
```Dockerfile
FROM python:3.10-slim
ARG UID=1001
RUN useradd -u $UID -m appuser
USER appuser
```
- Pass the user id into the Dockerfile:
```bash
docker build --build-arg UID=$(id -u) -t myimage .
```
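To confirm the container runs as the intended non-root user (myimage is the tag from the build above):
```bash
# Should print the UID passed at build time rather than 0 (root)
docker run --rm myimage id
```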
Wifi Settings
```bash
# Create a new connection profile
sudo nmcli connection add type wifi ifname wlx00ad244780e7 con-name "LIN-static" ssid "LIN"

# Set the password and encryption method
sudo nmcli connection modify "LIN-static" wifi-sec.key-mgmt wpa-psk
sudo nmcli connection modify "LIN-static" wifi-sec.psk "your password"

# Set a static IP (remember to adjust for your network environment)
sudo nmcli connection modify "LIN-static" ipv4.addresses 192.168.1.100/24
sudo nmcli connection modify "LIN-static" ipv4.gateway 192.168.1.1
sudo nmcli connection modify "LIN-static" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify "LIN-static" ipv4.method manual

# Activate the new profile
sudo nmcli connection up "LIN-static"
```
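To verify that the profile came up with the static address (standard NetworkManager checks, not part of the original notes):
```bash
nmcli connection show --active
ip addr show wlx00ad244780e7
ping -c 3 192.168.1.1
```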
Display Setting
- Successfully tells the X server: "use the NVIDIA GTX 1060 for display output"
- Explicitly pins the GPU, avoiding confusion with the GTX 750 Ti
- Works well on the current hardware setup
For dual monitors, multi-GPU use, or specific HDMI/DP outputs later, this file can be customized further.
```bash
sudo tee /etc/X11/xorg.conf > /dev/null <<EOF
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "ZAxisMapping" "4 5 6 7"
EndSection

Section "InputDevice"
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
EOF
```
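The BusID in the Device section has to match the GPU's actual PCI address; lspci confirms it (a standard check, not from the original notes):
```bash
# "01:00.0 VGA compatible controller: NVIDIA ..." corresponds to BusID "PCI:1:0:0"
lspci | grep -i vga
```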
VPN
Tailscale Free Tier Summary:
- Free to use, indefinitely
- Up to 3 users
- Up to 100 devices
- Access to nearly all Tailscale features
- Additional devices: $0.50 per device per month
Official website: https://tailscale.com/
Device A (Client) ---\
                      +--> Tailscale control server (exchanges information, helps traverse NAT)
Device B (Client) ---/
=> A and B then try to communicate directly (using the WireGuard protocol)
https://login.tailscale.com/admin/machines/
https://tailscale.com/download
https://tailscale.com/blog/how-tailscale-works
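A minimal install-and-join sequence, assuming the official Linux install script from the download page above:
```bash
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up        # prints a login URL on first run to join your tailnet
tailscale ip -4          # this device's Tailscale IPv4 address
tailscale status         # peers currently visible in the tailnet
```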
Power Control
- Manually shut down the server
- Use the TP-Link app to switch off the smart plug (a full power cut)
- Wait about 30 seconds, then switch the power back on
- The server detects that power has been restored and boots automatically
ASUS B85M PLUS
Advanced > APM Configuration > Restore AC Power Loss >
[Power Off]: after AC power is lost and then restored, the computer stays off and does not boot automatically
[Power On]: after AC power is lost and then restored, the computer boots automatically without pressing the case power button
[Last State]: after AC power is lost and then restored, the computer returns to the state it was in before the outage, for example:
a. If the system was powered on, sleeping, or hibernating before the outage, it returns to the corresponding state once power is reconnected
b. If the system was shut down before the outage, it stays shut down once power is reconnected
Mail Notification
When the server finishes booting up, it sends an email to you.
msmtp is a lightweight SMTP mail-sending tool.
msmtp is an SMTP client; its job is to:
- Connect to someone else's SMTP server (e.g. Gmail, Yahoo, or a company mail server)
- Log in with your account and password (App Passwords)
- Hand the mail over for delivery
mail composes the email for you and then passes it to msmtp to send.
Flow
Your script, or a command run by hand
↓
mail (user interface)
↓
msmtp (SMTP client)
↓
smtp.mail.yahoo.com (Yahoo's SMTP server)
↓
Your email inbox 📨
Steps
- Install the SMTP client
```bash
sudo apt update
sudo apt install msmtp
```
- Set up ~/.msmtprc
Get the password from https://login.yahoo.com/account/security by generating an App Password.
```
defaults
auth on
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
logfile ~/.msmtp.log

account yahoo
host smtp.mail.yahoo.com
port 587
from <your yahoo mail>
user <your yahoo mail>
password <your psw>

account default : yahoo
```
- Restrict the file's permissions
```bash
chmod 600 ~/.msmtprc
```
- Test
```bash
echo -e "Subject: Test mail from msmtp\n\nThis is a test email" | msmtp <your yahoo mail>
```
- Create the notification script
```bash
#!/bin/bash
# boot_notify_email.sh

# Get the time
NOW=$(date +"%Y-%m-%d %H:%M:%S")
SUBJECT_TEXT="🟢 Server Boot Notification $NOW"
TO="<your yahoo mail>"

# MIME-encode the subject (Base64 + UTF-8)
SUBJECT="=?UTF-8?B?$(echo -n "$SUBJECT_TEXT" | base64)?="

# System info
HOSTNAME=$(hostname)
LOCAL_IP=$(hostname -I)
PUBLIC_IP=$(curl -s ifconfig.me)
UPTIME=$(uptime -p)
DATE=$(date)
DISK=$(df -h --output=source,size,used,avail,pcent,target | tail -n +2 | awk 'BEGIN {print "<table border=1 cellpadding=5 cellspacing=0><tr><th>Filesystem</th><th>Size</th><th>Used</th><th>Avail</th><th>Use%</th><th>Mounted on</th></tr>"} {printf "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>", $1, $2, $3, $4, $5, $6} END {print "</table>"}')
MEM=$(free -h | awk 'NR==1 {print "<table border=1 cellpadding=5 cellspacing=0><tr>"; for(i=1;i<=NF;i++) printf "<th>%s</th>", $i; print "</tr>"} NR==2 || NR==3 {printf "<tr>"; for(i=1;i<=NF;i++) printf "<td>%s</td>", $i; print "</tr>"} END {print "</table>"}')

# Combine into an HTML mail
BODY=$(cat <<EOF
Content-Type: text/html; charset=UTF-8
Subject: $SUBJECT
To: $TO
From: $TO

<html>
<body style="font-family: sans-serif;">
<h2>✅ The server has booted up!</h2>
<p><strong>🖥️ Hostname:</strong> $HOSTNAME</p>
<p><strong>🌐 Local IP:</strong> $LOCAL_IP</p>
<p><strong>🌍 Public IP:</strong> $PUBLIC_IP</p>
<p><strong>📈 Uptime:</strong> $UPTIME</p>
<p><strong>🕒 Time:</strong> $DATE</p>
<h3>💾 Disk Usage:</h3>
$DISK
<h3>🧠 Memory:</h3>
$MEM
</body>
</html>
EOF
)

# Retry sending up to MAX_RETRIES times before giving up
MAX_RETRIES=3
RETRY_DELAY=5
COUNT=0
SUCCESS=0

while [ $COUNT -lt $MAX_RETRIES ]; do
    echo "$BODY" | msmtp --read-envelope-from -t
    if [ $? -eq 0 ]; then
        echo "Mail sent successfully."
        SUCCESS=1
        break
    else
        echo "Send failed. Retrying... ($((COUNT+1))/$MAX_RETRIES)"
        sleep $RETRY_DELAY
        ((COUNT++))
    fi
done

if [ $SUCCESS -ne 1 ]; then
    echo "Failed to send mail after $MAX_RETRIES attempts."
    exit 1
fi
```
- Set up the start-up script: make the script executable, open the crontab, paste the @reboot line below, and save
```bash
chmod +x ~/boot_notify_email.sh
crontab -e
```
```
@reboot /home/your_username/boot_notify_email.sh
```
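After the next reboot, delivery can be verified from the log file configured in ~/.msmtprc and the cron log (standard Ubuntu locations):
```bash
cat ~/.msmtp.log
grep -i cron /var/log/syslog | tail
```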
References
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/release-notes.html#
https://github.com/NVIDIA/nvidia-container-toolkit/issues/154
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.
https://blog.csdn.net/haima95/article/details/139169784
https://docs.docker.com/engine/install/ubuntu/#uninstall-docker-engine