Issue running Llama2 pretraining using Megatron-DeepSpeed

I am following the documentation from the Megatron-DeepSpeed fork to run Llama2 7B FP8 training, and I encounter this error:
AssertionError: allreduce_gradients() is not valid when bfloat+pipeline_parallelism is enabled.

Can you please provide the command line and the release (1.16?) you are using?

I am copy-pasting the command from the Megatron-DeepSpeed repo. This is the command I use:
HL_LLAMA_MODEL_SIZE=7 HL_NUM_NODES=1 HL_PP=1 HL_TP=1 HL_DP=8 HL_CKP_ACT=2 HL_SEQ_LEN=4096 HL_ZERO_STAGE=1 HL_USE_FAST_SOFTMAX=1 HL_MICRO_BATCH=1 HL_GRAD_ACCUM_DTYPE=bf16 HL_USE_TRANSFORMER_ENGINE=1 HL_USE_CACHE_FP8_WEIGHT_FWD=1 HL_USE_CACHE_FP8_WEIGHT=1 scripts/run_llama.sh
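
For context, my understanding is that these HL_* variables end up being translated into a DeepSpeed config roughly along the following lines. This is a sketch I reconstructed from the flags, not the exact JSON the run script writes; the key names are standard DeepSpeed config fields, but the mapping from the HL_* variables is my assumption:

# Rough reconstruction (my assumption) of the DeepSpeed config implied by the HL_* flags above.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,         # HL_MICRO_BATCH=1
    "bf16": {"enabled": True},                   # bf16 training, which the assertion message refers to
    "zero_optimization": {"stage": 1},           # HL_ZERO_STAGE=1
    "data_types": {"grad_accum_dtype": "bf16"},  # HL_GRAD_ACCUM_DTYPE=bf16
}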

I am using release 1.16.2.