ctodd
November 20, 2023, 3:44am
1
Environment: AWS DL1, Ubuntu 22.04 (bare metal driver install), Python 3.10.12, SynapseAI 1.12.1
Running in habanalabs-venv on the host OS (no container)
Followed instructions from Readme.md
$ python3 scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt v2-1_768-ema-pruned.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768 --n_samples 1 --n_iter 3 --use_hpu_graph
Seed set to 42
Loading model from v2-1_768-ema-pruned.ckpt
Global Step: 110000
LatentDiffusion: Running in v-prediction mode
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM : 784282744 KB
Data shape for DDIM sampling is (1, 4, 96, 96), eta 0.0
Compiling HPU graph encode_with_transformer
Traceback (most recent call last):
File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 360, in <module>
main(opt)
File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 300, in main
c_in = runner.run(model.cond_stage_model.encode_with_transformer, tokens)
File "/home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1/scripts/txt2img.py", line 222, in run
graph.capture_begin()
File "/home/ubuntu/habanalabs-venv/lib/python3.10/site-packages/habana_frameworks/torch/hpu/graphs.py", line 34, in capture_begin
_hpu_C.capture_begin(self.hpu_graph, dry_run)
RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].
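As a possible way to isolate this, a minimal HPU graph capture along the following lines could show whether capture fails even outside of Stable Diffusion. This is only a sketch based on the documented habana_frameworks.torch HPUGraph API, not code from the repo; if it also fails with synStatus 26, the problem would sit in the SynapseAI/driver stack rather than in txt2img.py.

import torch
import habana_frameworks.torch as htorch  # provides the hpu device and HPUGraph

device = torch.device("hpu")
model = torch.nn.Linear(16, 16).to(device)
x = torch.randn(1, 16, device=device)

g = htorch.hpu.HPUGraph()
s = htorch.hpu.Stream()
with htorch.hpu.stream(s):
    g.capture_begin()   # same call that raises synStatus 26 inside txt2img.py
    y = model(x)
    g.capture_end()

g.replay()
print(y.cpu())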
Issue also logged on GitHub (opened 18 Nov 2023, 4:51 PM UTC). Additional diagnostics from that report:
$ pip list
Package Version Editable project location
------------------------- ------------------ ----------------------------------------------------------------------------------------------
absl-py 2.0.0
accelerate 0.24.1
aiohttp 3.9.0
aiosignal 1.3.1
altair 5.1.2
annotated-types 0.6.0
antlr4-python3-runtime 4.8
anyio 3.7.1
appdirs 1.4.4
arrow 1.3.0
async-timeout 4.0.3
attrs 23.1.0
av 9.2.0
backoff 2.2.1
beautifulsoup4 4.12.2
blessed 1.20.0
boto3 1.29.3
botocore 1.32.3
cachetools 5.3.2
certifi 2023.11.17
cffi 1.15.1
cfgv 3.4.0
charset-normalizer 3.3.2
clean-fid 0.1.35
click 8.1.7
clip-anytorch 2.5.2
cmake 3.27.7
contourpy 1.2.0
croniter 1.4.1
cycler 0.12.1
dateutils 0.6.12
deepdiff 6.7.1
distlib 0.3.7
docker-pycreds 0.4.0
einops 0.3.0
exceptiongroup 1.1.3
expecttest 0.1.6
fastapi 0.104.1
ffmpy 0.3.1
filelock 3.13.1
fonttools 4.44.3
frozenlist 1.4.0
fsspec 2023.10.0
ftfy 6.1.1
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.4
google-auth-oauthlib 0.4.6
gradio 3.13.1
grpcio 1.59.3
h11 0.12.0
habana-gpu-migration 1.12.1.10
habana-media-loader 1.12.1.10
habana-pyhlml 1.12.1.10
habana-torch-dataloader 1.12.1.10
habana-torch-plugin 1.12.1.10
httpcore 0.15.0
httpx 0.25.1
huggingface-hub 0.19.4
identify 2.5.31
idna 3.4
imageio 2.32.0
inquirer 3.1.3
intel-openmp 2023.2.0
itsdangerous 2.1.2
Jinja2 3.1.2
jmespath 1.0.1
jsonmerge 1.9.2
jsonschema 4.20.0
jsonschema-specifications 2023.11.1
k-diffusion 0.0.14
kiwisolver 1.4.5
kornia 0.7.0
lazy_loader 0.3
lightning 2.0.6
lightning-cloud 0.5.54
lightning-habana 1.0.1
lightning-utilities 0.10.0
linkify-it-py 2.0.2
Markdown 3.5.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.2
mdit-py-plugins 0.4.0
mdurl 0.1.2
mkl 2023.1.0
mkl-include 2023.1.0
mpi4py 3.1.4
mpmath 1.3.0
multidict 6.0.4
networkx 3.2.1
ninja 1.11.1.1
nodeenv 1.8.0
numpy 1.23.5
oauthlib 3.2.2
omegaconf 2.1.1
open-clip-torch 2.7.0
ordered-set 4.1.0
orjson 3.9.10
packaging 23.2
pandas 2.0.1
pathspec 0.11.2
perfetto 0.7.0
Pillow 10.0.1
Pillow-SIMD 7.0.0.post3
pip 22.3.1
platformdirs 3.11.0
pre-commit 3.3.3
protobuf 3.20.3
psutil 5.9.6
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.10.4
pycparser 2.21
pycryptodome 3.19.0
pydantic 2.0.3
pydantic_core 2.3.0
pydub 0.25.1
Pygments 2.17.0
PyJWT 2.8.0
pynvml 8.0.4
pyparsing 3.1.1
python-dateutil 2.8.2
python-editor 1.0.4
python-multipart 0.0.6
pytorch-lightning 2.1.2
pytz 2023.3.post1
PyYAML 6.0
readchar 4.0.5
referencing 0.31.0
regex 2023.5.5
requests 2.31.0
requests-oauthlib 1.3.1
resize-right 0.0.2
rich 13.7.0
rpds-py 0.13.0
rsa 4.9
s3transfer 0.7.0
scikit-image 0.22.0
scipy 1.11.3
sentry-sdk 1.35.0
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
soupsieve 2.5
stable-diffusion 0.0.1 /home/ubuntu/habanalabs-venv/Model-References/PyTorch/generative_models/stable-diffusion-v-2-1
starlette 0.27.0
starsessions 1.3.0
sympy 1.12
tbb 2021.11.0
tdqm 0.0.1
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tifffile 2023.9.26
tokenizers 0.12.1
toolz 0.12.0
torch 2.0.1a0+gitf520939
torch-tb-profiler 0.4.0
torchaudio 2.0.1+3b40834
torchdata 0.6.1+e1feeb2
torchdiffeq 0.2.3
torchmetrics 1.2.0
torchsde 0.2.6
torchtext 0.15.2a0+4571036
torchvision 0.15.1a0+42759b1
tqdm 4.66.1
traitlets 5.13.0
trampoline 0.1.2
transformers 4.19.2
types-python-dateutil 2.8.19.14
typing_extensions 4.8.0
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 1.26.18
uvicorn 0.24.0.post1
virtualenv 20.24.6
wandb 0.16.0
wcwidth 0.2.10
websocket-client 1.6.4
websockets 12.0
Werkzeug 3.0.1
wheel 0.41.3
yamllint 1.33.0
yarl 1.9.2
$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.12.1-fw-46.0.5.0 |
| Driver Version: 1.12.1-cb7a7bc |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-205 N/A | 0000:90:1d.0 N/A | 0 |
| N/A 49C N/A 104W / 350W | 512MiB / 32768MiB | 2% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-205 N/A | 0000:20:1d.0 N/A | 0 |
| N/A 50C N/A 105W / 350W | 512MiB / 32768MiB | 3% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-205 N/A | 0000:20:1e.0 N/A | 0 |
| N/A 50C N/A 102W / 350W | 512MiB / 32768MiB | 2% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-205 N/A | 0000:10:1e.0 N/A | 0 |
| N/A 48C N/A 97W / 350W | 512MiB / 32768MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-205 N/A | 0000:10:1d.0 N/A | 0 |
| N/A 49C N/A 101W / 350W | 512MiB / 32768MiB | 1% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-205 N/A | 0000:90:1e.0 N/A | 0 |
| N/A 48C N/A 108W / 350W | 512MiB / 32768MiB | 4% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-205 N/A | 0000:a0:1d.0 N/A | 0 |
| N/A 46C N/A 104W / 350W | 512MiB / 32768MiB | 2% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-205 N/A | 0000:a0:1e.0 N/A | 0 |
| N/A 40C N/A 108W / 350W | 512MiB / 32768MiB | 4% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
Could you please let me know which stable-diffusion model you are using? There are three in Model-References:
Probably one of stable-diffusion-v-2-1 or stable-diffusion-finetuning?
I am able to run this on Gaudi2 with the 1.13-463 docker (1.13.0 branch of Model-References) and 1.13 firmware (as shown by hl-smi).
On the 1.12.1 docker, if I check out the 1.12.1 branch of Model-References, I can run it as well.
I see that if I run with Model-References on branch 1.12.1 but docker/firmware on 1.13, it errors out. Can you please confirm that your Model-References branch, firmware, and docker are all on the same version?
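One rough way to gather that in one place (a sketch only; it assumes the standard Habana pip package names and the Model-References path shown in the traceback):

import subprocess
from importlib.metadata import version

# Habana PyTorch bridge packages installed in the venv (should all be on the same release, e.g. 1.12.1.x)
for pkg in ("habana-torch-plugin", "habana-torch-dataloader", "habana-pyhlml"):
    print(pkg, version(pkg))

# Driver / firmware as reported by the host tools
subprocess.run(["hl-smi"], check=False)

# Branch of the Model-References checkout actually being run
repo = "/home/ubuntu/habanalabs-venv/Model-References"
subprocess.run(["git", "-C", repo, "branch", "--show-current"], check=False)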