Environment Setup

Notes on building an AI development environment for learning.

This note is about enabling GPU support in Docker and connecting to the container via SSH.
The setup uses CUDA 11.7 and PyTorch 2.0, and covers configuring the SSH service, setting up the NVIDIA drivers and Container Toolkit, and a test script to verify that PyTorch and the GPU are working correctly.

Updated: a more polished version of this setup lives at:
https://github.com/Microfish31/ai-env

Install docker-ce and docker client

  • docker-ce
    sudo apt-get update
    sudo apt install apt-transport-https ca-certificates curl software-properties-common
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    apt-cache policy docker-ce
    sudo apt install docker-ce
    
  • docker client
    When you install Docker CE, the Docker client tools are automatically installed.
  • Configuration
    This command sequence creates a docker group, adds the current user to the group, and refreshes the session to allow the user to run Docker commands without sudo immediately.
    sudo groupadd docker
    sudo usermod -aG docker $USER
    newgrp docker
    
  • Docker Desktop (alternative to the steps above)
    https://docs.docker.com/desktop/gpu/

Setting Up NVIDIA Drivers and Docker Container Toolkit

  1. To enable GPU usage in Docker containers, add the NVIDIA Container Toolkit repository:
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
    && sudo apt-get update
    
  2. Install the NVIDIA Container Toolkit and restart Docker:
    sudo apt install nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
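As a quick sanity check, `nvidia-ctk runtime configure` should have registered an `nvidia` runtime in Docker's daemon config (by default `/etc/docker/daemon.json`). A minimal sketch of that check, assuming the default path:

```python
import json

def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if a Docker daemon config registers an 'nvidia' runtime."""
    config = json.loads(daemon_json_text)
    return "nvidia" in config.get("runtimes", {})

# What /etc/docker/daemon.json typically contains after
# `sudo nvidia-ctk runtime configure --runtime=docker`
sample = """
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
"""
print(has_nvidia_runtime(sample))  # True
```

On the host you would read the real file instead of `sample`: `has_nvidia_runtime(open("/etc/docker/daemon.json").read())`.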
    

Creating the Dockerfile

We will use the official PyTorch CUDA 11.7 container image and install OpenSSH for remote access.

  • Create a file named Dockerfile in the current directory and paste in the content below:
    FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
    
    # Set the environment variable
    ENV PATH="/opt/conda/bin:$PATH"
    
    # Install openssh-server
    RUN apt-get update && apt-get install -y openssh-server
    
    # SSH Configurations
    RUN mkdir -p /var/run/sshd && \
        echo 'root:root' | chpasswd && \
        sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
        sed -i 's/#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
    
    # Install JupyterLab
    RUN pip install jupyterlab
    
    # Expose port 22 for ssh
    EXPOSE 22
    
    # Expose port 8888 for JupyterLab
    EXPOSE 8888
    
    # Declare the mount point (VOLUME does not accept host:container pairs;
    # bind the host directory at run time with -v instead)
    VOLUME ["/workspace"]
    
    # Start SSH service and add env path
    CMD ["sh", "-c", "echo 'export PATH=\"/opt/conda/bin:$PATH\"' >> ~/.bashrc && . ~/.bashrc && /usr/sbin/sshd -D"]
    

Building and Running the Docker Container

  1. Pull the PyTorch container image from Docker Hub:
    docker pull pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
    
  2. Build the Docker image from the Dockerfile:
    docker build -t cuda11.7-ssh-jupyter .
    
  3. Run the Docker container with GPU support and SSH enabled:
    docker run --name my-ai-env --gpus all -v /home/ubuntu/ai:/workspace -p 3131:22 -p 8888:8888 -w /workspace -d cuda11.7-ssh-jupyter
    
  4. Connect to the container via SSH (replace the IP address as necessary):
    ssh root@<container-IP-address> -p 3131
    (password: root)
    
    P.S. You can find the IP address with ifconfig or ip addr.
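Before SSH-ing in, you can confirm the mapped port is actually reachable. A small sketch (the host IP below is a placeholder):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds
    (e.g. the SSH port mapped to 3131 above)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (replace with your host's IP):
# print(port_open("192.168.1.100", 3131))
```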

Verify the GPU in a new container

docker run --rm --gpus all nvidia/cuda:11.7.1-base-ubuntu20.04 nvidia-smi

Verify Jupyter

jupyter server --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''

Visit http://172.31.230.225:8888/lab (replace with your host's IP) in a browser.

Verify GPU with PyTorch

You can run the following Python script to verify that PyTorch is correctly installed and that the GPU is available:

import torch

def check_pytorch_and_gpu():
    # Check if PyTorch is installed
    if torch.__version__:
        print(f"PyTorch version: {torch.__version__} is installed.")
    else:
        print("PyTorch is not installed.")

    # Check if a GPU is available
    if torch.cuda.is_available():
        print(f"GPU is available. GPU name: {torch.cuda.get_device_name(0)}")
        print(f"CUDA version: {torch.version.cuda}")
    else:
        print("GPU is not available. Running on CPU.")

if __name__ == "__main__":
    check_pytorch_and_gpu()
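A variant of the check above that also runs outside the container, where torch may not be installed at all (a hypothetical helper, not part of the original script):

```python
def collect_env_info():
    """Gather PyTorch/GPU availability into a dict; degrade gracefully
    when torch is missing instead of crashing on import."""
    info = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["cuda_version"] = torch.version.cuda
    return info

if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```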


Backup docker images

  • Save docker image
    docker image save -o <output-file>.tar <image-name>:<tag>
    
  • Load docker image
    docker load -i /path/to/your-image-file.tar
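Before copying a saved image tar to another machine, it can help to record a checksum so the transfer can be verified after loading. A small streaming sketch (the file name here is a throwaway stand-in for the real tar):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 of a file in 1 MiB chunks
    (streaming, so it works for multi-GB image tars)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a throwaway file standing in for the image tar
with open("demo.tar", "wb") as f:
    f.write(b"not a real docker image")
print(file_sha256("demo.tar"))
```

Compare the hex digest on both machines before and after `docker load`.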
    

Setting the Container Path (moved to dockerfile)

Since SSH-ing into a container does not initialize the environment as expected, we need to manually add the Anaconda path to ensure that Python packages are accessible.

  • To add the Anaconda path to the container, execute the following command:
    echo 'export PATH="/opt/conda/bin:$PATH"' >> ~/.bashrc
    source ~/.bashrc
    

config for containerd

  1. path: /etc/containerd/config.toml
    version = 2
    
    [plugins]
    [plugins."io.containerd.runtime.v1.linux"]
        shim_debug = true
    
    [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
        default_runtime_name = "runc"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
            runtime_type = "io.containerd.runc.v2"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
            runtime_type = "io.containerd.runc.v2"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
    
    [debug]
    level = "info"
    
    [metrics]
    address = "127.0.0.1:1338"
    grpc_histogram = false
    
    [grpc]
    address = "/run/containerd/containerd.sock"
    uid = 0
    gid = 0
    
    [timeouts]
    task_shutdown = "15s"
    
    [ttrpc]
    address = ""
    
    [proxy_plugins]
    [proxy_plugins."snapshot-overlayfs"]
        type = "snapshot"
        address = "/run/containerd/snapshotter-overlayfs.sock"
    
    [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
    
  2. After editing the config, restart the containerd service:
    sudo systemctl restart containerd
    

Next -> Reduce image size

Build the image from nvidia/cuda

  • pytorch
  • python
  • pip
  • miniconda

https://hub.docker.com/r/nvidia/cuda https://pytorch.org/

Docker add user

  1. Write in Dockerfile
    FROM python:3.10-slim
    
    ARG UID=1001
    RUN useradd -u $UID -m appuser
    USER appuser
    
  2. Pass user id into Dockerfile
    docker build --build-arg UID=$(id -u) -t myimage .
    

Wifi Settings

# Create a new connection profile
sudo nmcli connection add type wifi ifname wlx00ad244780e7 con-name "LIN-static" ssid "LIN"

# Set the password and encryption method
sudo nmcli connection modify "LIN-static" wifi-sec.key-mgmt wpa-psk
sudo nmcli connection modify "LIN-static" wifi-sec.psk "your password"

# Set a static IP (adjust to your network environment)
sudo nmcli connection modify "LIN-static" ipv4.addresses 192.168.1.100/24
sudo nmcli connection modify "LIN-static" ipv4.gateway 192.168.1.1
sudo nmcli connection modify "LIN-static" ipv4.dns "8.8.8.8 1.1.1.1"
sudo nmcli connection modify "LIN-static" ipv4.method manual

# Activate the new profile
sudo nmcli connection up "LIN-static"
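"Adjust to your network environment" above mainly means the static address and the gateway must sit in the same subnet. A quick sanity check using Python's stdlib ipaddress module, with the same values as the nmcli commands:

```python
import ipaddress

# The values used in the nmcli commands above
address = ipaddress.ip_interface("192.168.1.100/24")
gateway = ipaddress.ip_address("192.168.1.1")

# The gateway must fall inside the interface's network
print(address.network)            # 192.168.1.0/24
print(gateway in address.network) # True
```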

Display Setting

  1. Tells the X server to use the NVIDIA GTX 1060 for display output
  2. Explicitly pins the GPU, avoiding confusion with the GTX 750 Ti
  3. Works well with the current hardware setup

Later needs such as dual monitors, multi-GPU use, or specific HDMI/DP outputs can all be customized further in this file.

sudo tee /etc/X11/xorg.conf > /dev/null <<EOF
Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "ZAxisMapping" "4 5 6 7"
EndSection

Section "InputDevice"
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BusID          "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
EOF

VPN

Tailscale Free Tier Summary:

  • Free to use, indefinitely
  • Up to 3 users
  • Up to 100 devices
  • Access to nearly all Tailscale features
  • Additional devices: $0.50 per device per month

Official website: https://tailscale.com/

Device A (Client)  ---\
                       +--> Tailscale control server (exchanges info, helps traverse NAT)
Device B (Client)  ---/

=> A and B then try to communicate directly (using the WireGuard protocol)

https://login.tailscale.com/admin/machines/ https://tailscale.com/download https://tailscale.com/blog/how-tailscale-works

Power Control

  1. Shut down manually
  2. Cut the outlet power completely with the TP-Link app
  3. Wait about 30 seconds, then turn the power back on
  4. The server detects that power has been restored and boots automatically

ASUS B85M PLUS
Advanced > APM Configuration > Restore AC Power Loss >
[Power Off]: after a power interruption, the machine stays off when power returns; it will not boot automatically
[Power On]: after a power interruption, the machine boots automatically when power returns, without pressing the case power button
[Last State]: after a power interruption, the machine returns to its pre-interruption state when power returns. For example:
a. If the system was on, sleeping, or hibernating before the interruption, it returns to the corresponding state once power is restored
b. If the system was off before the interruption, it stays off once power is restored

Mail Notification

When the server finishes booting, have it send an email to yourself.
msmtp is a lightweight SMTP mail-sending tool.
msmtp is an SMTP client; its job is to:

  1. Connect to someone else's SMTP server (e.g. Gmail, Yahoo, a company mail server)
  2. Log in with your account and password (App Passwords)
  3. Deliver the mail

mail composes the email for you, then hands it to msmtp to send.

Flow

Your script or a manual command
         ↓
  mail (user interface)
         ↓
    msmtp (SMTP client)
         ↓
smtp.mail.yahoo.com (Yahoo's SMTP server)
         ↓
     Your email inbox 📨
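For comparison, the same delivery chain can be exercised from Python's stdlib: email.message builds the message (the role of mail) and smtplib speaks SMTP (the role of msmtp). The addresses below are placeholders, and the actual send is commented out since it needs real credentials:

```python
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Server boot notification"
msg["From"] = "you@example.com"   # placeholder
msg["To"] = "you@example.com"     # placeholder
msg.set_content("The server has booted up.")

# Actual delivery (requires a Yahoo App Password):
# with smtplib.SMTP("smtp.mail.yahoo.com", 587) as smtp:
#     smtp.starttls()
#     smtp.login("you@example.com", "app-password")
#     smtp.send_message(msg)
print(msg["Subject"])
```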

Steps

  1. Install the SMTP client
    sudo apt update
    sudo apt install msmtp
    
  2. Configure ~/.msmtprc
    defaults
    auth on
    tls on
    tls_trust_file /etc/ssl/certs/ca-certificates.crt
    logfile ~/.msmtp.log
    
    account yahoo
    host smtp.mail.yahoo.com
    port 587
    from <your yahoo mail>
    user <your yahoo mail>
    password <your psw>
    
    account default : yahoo
    
    Get the password from https://login.yahoo.com/account/security by generating an App Password.
  3. Restrict file permissions
    chmod 600 ~/.msmtprc
    
  4. Test
    echo -e "Subject: Test mail from msmtp\n\nThis is a test message" | msmtp <your yahoo mail>
    
  5. Create notification scripts
    #!/bin/bash
    # boot_notify_email.sh
    
    # Get Time
    NOW=$(date +"%Y-%m-%d %H:%M:%S")
    
    SUBJECT_TEXT="🟢 Server Boot Notification $NOW"
    TO="<your yahoo mail>"
    
    # MIME encoding(Base64 + UTF-8)
    SUBJECT="=?UTF-8?B?$(echo -n "$SUBJECT_TEXT" | base64)?="
    
    # System Info
    HOSTNAME=$(hostname)
    LOCAL_IP=$(hostname -I)
    PUBLIC_IP=$(curl -s ifconfig.me)
    UPTIME=$(uptime -p)
    DATE=$(date)
    
    DISK=$(df -h --output=source,size,used,avail,pcent,target | tail -n +2 | awk 'BEGIN {print "<table border=1 cellpadding=5 cellspacing=0><tr><th>Filesystem</th><th>Size</th><th>Used</th><th>Avail</th><th>Use%</th><th>Mounted on</th></tr>"} {printf "<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>", $1, $2, $3, $4, $5, $6} END {print "</table>"}')
    
    MEM=$(free -h | awk 'NR==1 {print "<table border=1 cellpadding=5 cellspacing=0><tr>"; for(i=1;i<=NF;i++) printf "<th>%s</th>", $i; print "</tr>"} NR==2 || NR==3 {printf "<tr>"; for(i=1;i<=NF;i++) printf "<td>%s</td>", $i; print "</tr>"} END {print "</table>"}')
    
    # Combine to HTML
    BODY=$(cat <<EOF
    Content-Type: text/html; charset=UTF-8
    Subject: $SUBJECT
    To: $TO
    From: $TO
    
    <html>
    <body style="font-family: sans-serif;">
    <h2>✅ The server has booted up!</h2>
    
    <p><strong>🖥️ Hostname:</strong> $HOSTNAME</p>
    <p><strong>🌐 Local IP:</strong> $LOCAL_IP</p>
    <p><strong>🌍 Public IP:</strong> $PUBLIC_IP</p>
    <p><strong>📈 Uptime:</strong> $UPTIME</p>
    <p><strong>🕒 Time:</strong> $DATE</p>
    
    <h3>💾 Disk Usage:</h3>
    $DISK
    
    <h3>🧠 Memory:</h3>
    $MEM
    
    </body>
    </html>
    EOF
    )
    
    
    MAX_RETRIES=3
    RETRY_DELAY=5
    COUNT=0
    SUCCESS=0
    
    while [ $COUNT -lt $MAX_RETRIES ]; do
        echo "$BODY" | msmtp --read-envelope-from -t
        if [ $? -eq 0 ]; then
            echo "Mail sent successfully."
            SUCCESS=1
            break
        else
            echo "Send failed. Retrying... ($((COUNT+1))/$MAX_RETRIES)"
            sleep $RETRY_DELAY
            ((COUNT++))
        fi
    done
    
    if [ $SUCCESS -ne 1 ]; then
        echo "Failed to send mail after $MAX_RETRIES attempts."
        exit 1
    fi
    
  6. Set up the start-up script
    chmod +x ~/boot_notify_email.sh
    crontab -e
    
    Paste the line below and save:
    @reboot /home/your_username/boot_notify_email.sh
    

References

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/release-notes.html#
https://github.com/NVIDIA/nvidia-container-toolkit/issues/154
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.
https://blog.csdn.net/haima95/article/details/139169784
https://docs.docker.com/engine/install/ubuntu/#uninstall-docker-engine
