Not able to run the Model-References example in Kubernetes mode

I followed the instructions here, since we are trying to run on a cloud instance, but inside a k8s environment.
https://docs.habana.ai/en/latest/Getting_Started_Guide_EKS/Getting_Started_Guide_EKS_with_Habana.html

First, I used a Habana self-built AMI and installed k8s on it.

The device plugin started successfully:
$ kubectl create -f JFrog
[ec2-user@node1 ~]$ kubectl get pods -n habana-system
NAME READY STATUS RESTARTS AGE
habanalabs-device-plugin-daemonset-gaudi-jd9kh 1/1 Running 0 19m

I first tried this example. job-hl.yaml seems to work, though I had to change habana.ai/gaudi: 8 to 1, since otherwise not enough resources were found. But this example just prints out device info; it does not do any real training/inference.
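For reference, the change described above amounts to this fragment of job-hl.yaml (the rest of the file is unchanged):

```yaml
resources:
  limits:
    habana.ai/gaudi: 1   # changed from 8; only one Gaudi card was available
```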
[ec2-user@node1 ~]$ kubectl logs job-hl-xb6lx
+-----------------------------------------------------------------------------+
| HL-SMI Version:                            hl-1.2.0-fw-32.5.0.0             |
| Driver Version:                            1.2.0-124dd38                    |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf Pwr:Usage/Cap |         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-205             N/A   | 0000:90:1e.0     N/A |                    0 |
| N/A  55C   N/A  104W / 350W   |   512MiB / 32768MiB  |    2%            N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID  Type  Process name                               Usage      |
|=============================================================================|
|   0        N/A  N/A   N/A                                        N/A        |

We then created a job in k8s and ran it with kubectl apply -f our_job.yaml, but it failed.
The job just runs the following example code inside the Docker container: "/Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py".

But the errors we got are as below. Could you help? Thanks!

[1,5]:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1,5]:    "__main__", mod_spec)
[1,5]:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[1,5]:    exec(code, run_globals)
[1,5]:  File "/Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py", line 162, in <module>
[1,5]:    train_mnist(args.use_hpu, args.batch_size, args.use_bfloat, args.epochs)
[1,5]:  File "/Model-References/TensorFlow/examples/distribute_with_hpu_strategy/mnist_keras.py", line 93, in train_mnist
[1,5]:    from habana_frameworks.tensorflow import load_habana_module
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/__init__.py", line 12, in <module>
[1,5]:    from .sysconfig import version
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/sysconfig.py", line 22, in <module>
[1,5]:    version_dict = get_version_dict()
[1,5]:  File "/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/version_getter.py", line 48, in get_version_dict
[1,5]:    [sys.executable, __file__], env=env, encoding="ascii")
[1,5]:  File "/usr/lib/python3.7/subprocess.py", line 411, in check_output
[1,5]:    **kwargs).stdout
[1,5]:  File "/usr/lib/python3.7/subprocess.py", line 512, in run
[1,5]:    output=stdout, stderr=stderr)
[1,5]:subprocess.CalledProcessError: Command '['/usr/bin/python3', '/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/version_getter.py']' died with <Signals.SIGBUS: 7>.

k8s yaml file:
apiVersion: batch/v1
kind: Job
metadata:
name: benchmark
spec:
template:
spec:
containers:
- name: benchmark
image: mnist-dl1
imagePullPolicy: Always
env:
- name: HABANA_VISIBLE_DEVICES
value: "all"
- name: HCL_CONFIG_PATH
value: "/etc/hcl/worker_config.json"
- name: DEVICE_TYPE
value: "hpu"
- name: BATCH_SIZE
value: "256"
- name: DATA_TYPE
value: "fp"
- name: EPOCH
value: "200"
volumeMounts:
- mountPath: /dev/shm
name: dshm
securityContext:
capabilities:
add: ["SYS_RAWIO"]
resources:
limits:
habana.ai/gaudi: 1
volumes:
- name: dshm
emptyDir:
medium: Memory
sizeLimit: "16Gi"
restartPolicy: Never

Docker image mnist-dl1 Dockerfile:

# mnist-dl1

ARG TENSORFLOW_VER=2.7.0
ARG DLC_VER=1.2.0
ARG DLC_VER_MINOR=585
ARG DLC_REPO=vault.habana.ai/gaudi-docker/${DLC_VER}/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-${TENSORFLOW_VER}:${DLC_VER}-${DLC_VER_MINOR}
ARG TENSORFLOW_REPO=${DLC_REPO}
FROM ${DLC_REPO}

ENV LD_LIBRARY_PATH=/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/tf2_7_0/lib/habanalabs:/usr/lib/habanalabs/openmpi/lib
RUN apt-get update && apt-get install -y lsof git
RUN python3 -m pip install tensorflow_datasets mpi4py

# install MNIST dataset

#ARG MNIST_DATASET_VER="-"
#ARG MNIST_DATASET_REPO=https://af01p-igk.devtools.intel.com/artifactory/platform_hero-igk-local/hero_features_assets/ai/dataset/mnist.npz
#ADD ${MNIST_DATASET_REPO} /root/.keras/datasets/mnist.npz

ARG MODEL_REFERENCE_VER=10ed2098c36
ARG MODEL_REFERENCE_REPO=https://github.com/HabanaAI/Model-References.git
RUN git clone ${MODEL_REFERENCE_REPO} && \
    cd /Model-References && \
    git checkout ${MODEL_REFERENCE_VER} && \
    rm -rf .git
ENV PYTHONPATH=/Model-References:$PYTHONPATH

COPY run_test.sh /
RUN mkfifo /export-logs
CMD ( ./run_test.sh; echo $? >status; ) 2>&1 | tee output.logs && \
    tar cf /export-logs status *.logs && \
    sleep infinity

$ more run_test.sh
#!/bin/bash -e

export NUM_WORKERS=${NUM_WORKERS:-8}
cd Model-References/TensorFlow/examples/distribute_with_hpu_strategy
time -p ./run_mnist_keras.sh -b ${BATCH_SIZE:-256} -t ${DATA_TYPE:-fp} -e ${EPOCH:-1} -d ${DEVICE_TYPE:-hpu}
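For clarity, the `${VAR:-default}` expansions in run_test.sh mean the env values set in the k8s yaml override the defaults only when they are actually set. A quick standalone illustration:

```shell
#!/bin/sh
# ${VAR:-default} expands to the default only when VAR is unset or empty.
unset BATCH_SIZE
echo "${BATCH_SIZE:-256}"    # prints 256 (unset, default used)
BATCH_SIZE=512
echo "${BATCH_SIZE:-256}"    # prints 512 (set, override wins)
```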

Hi @yx_intel, thanks for posting, we’re reviewing your data and will respond soon.


Hi @yx_intel I will work on replicating your scenario and follow up with you.

In the meantime, can you please clarify what you mean by "self-built AMI"? Is it not from the AWS Marketplace (and if so, can you please shed some light on what you built)?

Sorry, my bad. I actually meant that I used a Habana prebuilt AMI. I searched for the right one: us-east-1, ami-00701710e64c2a5a3.

We use a self-built Docker image, which inherits from:
vault.habana.ai/gaudi-docker/1.2.0/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.7.0:1.2.0-585

@yx_intel could you please upload the mnist yaml file here? The indentation is off because of the pasting, and I get an error when trying to run it: "error converting YAML to JSON".

Also for the mnist, it says:
MODEL_REFERENCE_VER=10ed2098c36
git checkout ${MODEL_REFERENCE_VER}

This commit id failed for me, so for now I used "git checkout 1.2.0" instead.
Can you confirm whether 10ed2098c36 is correct?

Yes, the version number for git checkout should be 1.2.0. It has been fixed.

I have the files, but how do I upload them here?

You can use the “preformatted text” mode (The icon marked </>)

apiVersion: batch/v1
kind: Job
metadata:
    name: job-hl

k8s yaml file:
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark
spec:
  template:
    spec:
      containers:
      - name: benchmark
        image: mnist-dl1
        imagePullPolicy: Always
        env:
        - name: HABANA_VISIBLE_DEVICES
          value: "all"
        - name: HCL_CONFIG_PATH
          value: "/etc/hcl/worker_config.json"
        - name: DEVICE_TYPE
          value: "hpu"
        - name: BATCH_SIZE
          value: "256"
        - name: DATA_TYPE
          value: "fp"
        - name: EPOCH
          value: "200"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
        resources:
          limits:
            habana.ai/gaudi: 1
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
      restartPolicy: Never


----------------------------------------------------------------------------------------------------------------------------

MNIST-DL1 Dockerfile:
# mnist-dl1

ARG TENSORFLOW_VER=2.7.0
ARG DLC_VER=1.2.0
ARG DLC_VER_MINOR=585
ARG DLC_REPO=vault.habana.ai/gaudi-docker/${DLC_VER}/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-${TENSORFLOW_VER}:${DLC_VER}-${DLC_VER_MINOR}
ARG TENSORFLOW_REPO=${DLC_REPO}
FROM ${DLC_REPO}

ENV LD_LIBRARY_PATH=/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/tf2_7_0/lib/habanalabs:/usr/lib/habanalabs/openmpi/lib
RUN apt-get update && apt-get install -y lsof git
RUN python3 -m pip install tensorflow_datasets mpi4py

# install MNIST dataset
#ARG MNIST_DATASET_VER="-"
#ARG MNIST_DATASET_REPO=https://af01p-igk.devtools.intel.com/artifactory/platform_hero-igk-local/hero_features_assets/ai/dataset/mnist.npz
#ADD ${MNIST_DATASET_REPO} /root/.keras/datasets/mnist.npz

ARG MODEL_REFERENCE_VER=1.2.0
ARG MODEL_REFERENCE_REPO=https://github.com/HabanaAI/Model-References.git
RUN git clone ${MODEL_REFERENCE_REPO} && \
    cd /Model-References && \
    git checkout ${MODEL_REFERENCE_VER} && \
    rm -rf .git
ENV PYTHONPATH=/Model-References:$PYTHONPATH

COPY run_test.sh /
RUN mkfifo /export-logs
CMD ( ./run_test.sh; echo $? >status; ) 2>&1 | tee output.logs && \
    tar cf /export-logs status *.logs && \
    sleep infinity
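As an aside, the CMD's subshell trick captures run_test.sh's exit code into a file, since `$?` after the pipeline would report tee's status instead. A minimal sketch, using `false` as a stand-in for a failing run_test.sh:

```shell
#!/bin/sh
# Capture the inner command's exit code in a file; the pipe to tee would
# otherwise hide it ($? after the pipeline reports tee's status).
( false; echo $? > status; ) 2>&1 | tee output.logs
cat status    # prints 1, the exit code of `false`
```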


----------------------------------------------------------------------------------------------------------------------------

run_test.sh file used by Dockerfile:
#!/bin/bash -e

export NUM_WORKERS=${NUM_WORKERS:-8} 

cd Model-References/TensorFlow/examples/distribute_with_hpu_strategy
time -p ./run_mnist_keras.sh -b ${BATCH_SIZE:-256} -t ${DATA_TYPE:-fp} -e ${EPOCH:-1} -d ${DEVICE_TYPE:-hpu}

Hi @yx_intel, I am able to reproduce your issue. We’ll work on a solution and get back to you.

Hi @yx_intel

in the wiki example we invoke the command to run (hl-smi) from the yaml file rather than in the Dockerfile's last line. With this change I am able to run mnist (note that I added command: ["/run_test_mnist.sh"] to the yaml, and commented out the last line of the Dockerfile).

docker build . -t subprocess_mnist

run_test_mnist.sh:

#!/bin/bash -e
cd /
export NUM_WORKERS=${NUM_WORKERS:-8}
cd Model-References/TensorFlow/examples/distribute_with_hpu_strategy
time -p ./run_mnist_keras.sh -b ${BATCH_SIZE:-256} -t ${DATA_TYPE:-fp} -e ${EPOCH:-1} -d ${DEVICE_TYPE:-hpu}

job_mnist.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-mnist
spec:
  template:
    metadata:
      labels:
        app: benchmark-mnist
    spec:
      containers:
      - name: benchmark-mnist
        image: subprocess_mnist
        imagePullPolicy: IfNotPresent
        command: ["/run_test_mnist.sh"]
        workingDir: /home
        resources:
          limits:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 720Gi
          requests:
            habana.ai/gaudi: 8
            hugepages-2Mi: "21000Mi"
            memory: 700Gi
        securityContext:
          capabilities:
            add: ["SYS_RAWIO"]
      hostNetwork: true
      restartPolicy: Never
  backoffLimit: 0

dockerfile

#mnist-dl1
ARG TENSORFLOW_VER=2.8.0
ARG DLC_VER=1.3.0
ARG DLC_VER_MINOR=499
ARG DLC_REPO=vault.habana.ai/gaudi-docker/${DLC_VER}/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-${TENSORFLOW_VER}:${DLC_VER}-${DLC_VER_MINOR}
ARG TENSORFLOW_REPO=${DLC_REPO}
FROM ${DLC_REPO}

ENV LD_LIBRARY_PATH=/usr/local/lib/python3.7/dist-packages/habana_frameworks/tensorflow/tf2_7_0/lib/habanalabs:/usr/lib/habanalabs/openmpi/lib
RUN apt-get update && apt-get install -y lsof git
RUN python3 -m pip install tensorflow_datasets mpi4py

#install MNIST dataset
#ARG MNIST_DATASET_VER="-"
#ARG MNIST_DATASET_REPO=https://af01p-igk.devtools.intel.com/artifactory/platform_hero-igk-local/hero_features_assets/ai/dataset/mnist.npz
#ADD ${MNIST_DATASET_REPO} /root/.keras/datasets/mnist.npz

ARG MODEL_REFERENCE_VER=1.3.0
ARG MODEL_REFERENCE_REPO=https://github.com/HabanaAI/Model-References.git
RUN git clone ${MODEL_REFERENCE_REPO} && cd /Model-References && git checkout ${MODEL_REFERENCE_VER} && rm -rf .git && cd /
ENV PYTHONPATH=/Model-References:$PYTHONPATH

COPY run_test_mnist.sh / 
RUN mkfifo /export-logs
#CMD ( ./run_test_gauditest.sh; echo $? >status; ) 2>&1 | tee output.logs && tar cf /export-logs status *.logs && sleep infinity

Thanks a lot for the help! It seems we missed the hugepages and memory in the resources settings, which caused the module to fail to load.
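For anyone hitting the same SIGBUS: the missing piece was the hugepages/memory resources. The fragment below mirrors the working job_mnist.yaml above; the values are node-dependent, so tune them to your instance's capacity:

```yaml
resources:
  limits:
    habana.ai/gaudi: 8
    hugepages-2Mi: "21000Mi"
    memory: 720Gi
  requests:
    habana.ai/gaudi: 8
    hugepages-2Mi: "21000Mi"
    memory: 700Gi
```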

Another thing is I used DLC_VER=1.3.0 in your Dockerfile but got the following errors:
[1,1]:Habana-TensorFlow(1.3.0) and Habanalabs Driver(1.2.0-124dd38) versions differ!

[1,1]:/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/hcl_channel.cpp::50(writeWqes): The condition [ m_swq && m_rwq ] failed.

So I switched back to the version we were using, 1.2.0, and it seems OK now.

Right, yes. I had run on 1.3, so I guess you got a mismatch error.

Another observation: if you use your original docker/yaml combination, you can change the CMD to ENTRYPOINT (i.e., run_mnist_keras.sh is still invoked by the Dockerfile rather than by the yaml as in the previous solution). That will also avoid the crash.
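A sketch of that variant, assuming the rest of the original Dockerfile stays unchanged (shell form, matching the original CMD):

```dockerfile
# Replace the original CMD with an ENTRYPOINT so the container always runs
# the test without needing a `command:` override in the k8s yaml:
ENTRYPOINT ( ./run_test.sh; echo $? >status; ) 2>&1 | tee output.logs && \
    tar cf /export-logs status *.logs && \
    sleep infinity
```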

Hopefully you are unblocked by these suggestions. Please reach out if you see any other issues.
