I followed the instructions here, since we are trying to run on a cloud instance but inside a k8s environment:
https://docs.habana.ai/en/latest/Getting_Started_Guide_EKS/Getting_Started_Guide_EKS_with_Habana.html
First I used a Habana self-built AMI and installed k8s on it.
The device plugin starts successfully:
$ kubectl create -f JFrog
[ec2-user@node1 ~]$ kubectl get pods -n habana-system
NAME                                             READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-gaudi-jd9kh   1/1     Running   0          19m
First I tried the job-hl.yaml example, which seems to work, although I had to change habana.ai/gaudi: 8 to 1, otherwise not enough resources were found. But this example only prints device info; it does not do any real training/inference.
[ec2-user@node1 ~]$ kubectl logs job-hl-xb6lx
+-----------------------------------------------------------------------------+
| HL-SMI Version:           hl-1.2.0-fw-32.5.0.0                              |
| Driver Version:           1.2.0-124dd38                                     |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-205             N/A   | 0000:90:1e.0     N/A |                    0 |
| N/A  55C    N/A  104W / 350W  |   512MiB / 32768MiB  |     2%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
+-----------------------------------------------------------------------------+
We then created our own job in k8s and ran it with kubectl apply -f our_job.yaml, but it failed.
This job just runs the following example code inside the docker container: /Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py.
The errors we got are shown below; could you help? Thanks!
[1,5]:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,5]:    "__main__", mod_spec)
[1,5]:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[1,5]:    exec(code, run_globals)
[1,5]:  File "/Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py", line 162, in <module>
[1,5]:    train_mnist(args.use_hpu, args.batch_size, args.use_bfloat, args.epochs)
[1,5]:  File "/Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py", line 93, in train_mnist
[1,5]:    from habana_frameworks.tensorflow import load_habana_module
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/__init__.py", line 12, in <module>
[1,5]:    from .sysconfig import version
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/sysconfig.py", line 22, in <module>
[1,5]:    version_dict = get_version_dict()
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/version_getter.py", line 48, in get_version_dict
[1,5]:    [sys.executable, __file__], env=env, encoding="ascii")
[1,5]:  File "/usr/lib/python3.7/subprocess.py", line 411, in check_output
[1,5]:    **kwargs).stdout
[1,5]:  File "/usr/lib/python3.7/subprocess.py", line 512, in run
[1,5]:    output=stdout, stderr=stderr)
[1,5]:subprocess.CalledProcessError: Command '['/usr/bin/python3', '/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/version_getter.py']' died with <Signals.SIGBUS: 7>.
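As context for reading the traceback: the CalledProcessError wrapping Signals.SIGBUS is just how subprocess.check_output reports a child process that was killed by a signal, so the actual crash happens inside the version_getter.py subprocess, not in the Python code that launched it. A minimal stand-alone sketch of that error shape (the child here deliberately kills itself, as a stand-in for the real crash):

```python
import signal
import subprocess
import sys

# Hypothetical child that kills itself with SIGBUS, standing in for the
# crashing version_getter.py subprocess from the log above.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"

try:
    subprocess.check_output([sys.executable, "-c", child_code], encoding="ascii")
except subprocess.CalledProcessError as e:
    # A negative returncode means the child died from signal -returncode;
    # check_output formats this as "died with <Signals.SIGBUS: 7>".
    print(e.returncode == -signal.SIGBUS)  # True
```

SIGBUS in a container often points at memory-mapping trouble (for example an exhausted or missing /dev/shm, or a device-access problem), which is worth ruling out here, though the log alone does not prove the cause.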
Our k8s YAML file:
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark
spec:
  template:
    spec:
      containers:
      - name: benchmark
        image: mnist-dl1
        imagePullPolicy: Always
        env:
        - name: HABANA_VISIBLE_DEVICES
          value: "all"
        - name: HCL_CONFIG_PATH
          value: "/etc/hcl/worker_config.json"
        - name: DEVICE_TYPE
          value: "hpu"
        - name: BATCH_SIZE
          value: "256"
        - name: DATA_TYPE
          value: "fp"
        - name: EPOCH
          value: "200"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
        resources:
          limits:
            habana.ai/gaudi: 1
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
      restartPolicy: Never
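Aside on the volume above: the emptyDir with medium: Memory and sizeLimit: "16Gi" is what backs /dev/shm in the pod. A tiny helper of my own (not a k8s API) to turn such binary-suffix quantities into bytes when sanity-checking shared-memory sizing:

```python
# Hypothetical helper: convert a Kubernetes binary-suffix quantity
# (e.g. "16Gi") to a byte count.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def quantity_to_bytes(q: str) -> int:
    for suffix, mult in _UNITS.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * mult
    return int(q)  # plain integer: already bytes

print(quantity_to_bytes("16Gi"))  # 17179869184
```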
Docker image mnist-dl1 Dockerfile:
ARG TENSORFLOW_VER=2.7.0
ARG DLC_VER=1.2.0
ARG DLC_VER_MINOR=585
ARG DLC_REPO=vault.habana.ai/gaudi-docker/${DLC_VER}/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-${TENSORFLOW_VER}:${DLC_VER}-${DLC_VER_MINOR}
ARG TENSORFLOW_REPO=${DLC_REPO}

FROM ${DLC_REPO}

ENV LD_LIBRARY_PATH=/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/tf2_7_0/lib/habanalabs:/usr/lib/habanalabs/openmpi/lib

RUN apt-get update && apt-get install -y lsof git
RUN python3 -m pip install tensorflow_datasets mpi4py

# install MNIST dataset
#ARG MNIST_DATASET_VER="-"
#ARG MNIST_DATASET_REPO=https://af01p-igk.devtools.intel.com/artifactory/platform_hero-igk-local/hero_features_assets/ai/dataset/mnist.npz
#ADD ${MNIST_DATASET_REPO} /root/.keras/datasets/mnist.npz

ARG MODEL_REFERENCE_VER=10ed2098c36
ARG MODEL_REFERENCE_REPO=https://github.com/HabanaAI/Model-References.git
RUN git clone ${MODEL_REFERENCE_REPO} && \
    cd /Model-References && \
    git checkout ${MODEL_REFERENCE_VER} && \
    rm -rf .git

ENV PYTHONPATH=/Model-References:$PYTHONPATH

COPY run_test.sh /
RUN mkfifo /export-logs

CMD ( ./run_test.sh; echo $? >status; ) 2>&1 | tee output.logs && \
    tar cf /export-logs status *.logs && \
    sleep infinity
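The mkfifo /export-logs plus "tar cf /export-logs ... && sleep infinity" pattern in the CMD works because a write to a FIFO blocks until a reader (e.g. something exec'd into the container later) attaches. A small stand-alone sketch of that handshake, using plain files and threads instead of a container:

```python
import os
import tempfile
import threading

# Create a FIFO in a temp directory, mimicking the /export-logs fifo.
path = os.path.join(tempfile.mkdtemp(), "export-logs")
os.mkfifo(path)

def writer():
    # open() for writing blocks until a reader attaches, just like the
    # backgrounded tar in the Dockerfile's CMD.
    with open(path, "w") as f:
        f.write("status and logs")

t = threading.Thread(target=writer)
t.start()
with open(path) as f:       # reader attaches, unblocking the writer
    print(f.read())         # status and logs
t.join()
```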
$ more run_test.sh
#!/bin/bash -e
export NUM_WORKERS=${NUM_WORKERS:-8}
cd Model-References/TensorFlow/examples/distribute_with_hpu_strategy
time -p ./run_mnist_keras.sh -b ${BATCH_SIZE:-256} -t ${DATA_TYPE:-fp} -e ${EPOCH:-1} -d ${DEVICE_TYPE:-hpu}
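One thing worth double-checking: the Job's env block only takes effect through run_test.sh's ${VAR:-default} fallbacks, so a misspelled variable name silently falls back to the default (note also that NUM_WORKERS defaults to 8 here while the Job only requests habana.ai/gaudi: 1). The equivalent lookup in Python, purely for illustration:

```python
import os

# Mirror of bash's ${BATCH_SIZE:-256}: use the Job's value when set,
# otherwise fall back to the script default. (bash's :- also treats an
# empty string as unset, which os.environ.get does not.)
os.environ.pop("BATCH_SIZE", None)
print(os.environ.get("BATCH_SIZE", "256"))  # 256  (default)
os.environ["BATCH_SIZE"] = "1024"
print(os.environ.get("BATCH_SIZE", "256"))  # 1024 (from the Job env)
```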