Gaudi eval dataset in tfrecord format to get accuracy of run

Hi, I am trying to train BERT on Gaudi1 on the Wikipedia dataset.
The Gaudi README mentioned defines the steps to run BERT training on Gaudi, but it uses the bookswiki dataset. By combining the TensorFlow data preparation from the Gaudi2 README with the packing from the Gaudi1 README, I was able to run training. But because the evaluation data is in a txt file and the Gaudi1 training process expects tfrecord or packed tfrecord format, I am not able to get the accuracy of my run.

Could anyone point out whether there is a way to generate the Wikipedia eval dataset in tfrecord format and then pack it?

Thank you

I am now able to generate the eval dataset. Currently the run_pretraining script only performs evaluation once the entire training finishes or a predefined number of steps is reached. Is there a way to do periodic model evaluation on the latest saved checkpoint?

Currently the code uses estimator.train, but you can try replacing it with estimator.train_and_evaluate

Here’s another reference
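An alternative to changing the training call is to evaluate each checkpoint as it appears in the output directory. The following is a minimal sketch, assuming checkpoints are saved as model.ckpt-<step> with a matching .index file (the naming seen later in this thread). The helper names and the evaluate_fn callback are hypothetical; in a real run the callback could wrap estimator.evaluate with its checkpoint_path argument.

```python
import os
import re

# TensorFlow writes a "model.ckpt-<step>.index" file for each saved checkpoint.
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def list_checkpoint_steps(output_dir):
    """Return the sorted global steps that have a checkpoint index file."""
    steps = []
    for name in os.listdir(output_dir):
        match = _CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


def evaluate_new_checkpoints(output_dir, evaluate_fn, already_done):
    """Call evaluate_fn(checkpoint_prefix) once for each unseen checkpoint.

    evaluate_fn is a hypothetical callback supplied by the caller.
    already_done is a set of steps that is updated in place.
    """
    results = {}
    for step in list_checkpoint_steps(output_dir):
        if step in already_done:
            continue
        prefix = os.path.join(output_dir, "model.ckpt-%d" % step)
        results[step] = evaluate_fn(prefix)
        already_done.add(step)
    return results
```

Calling evaluate_new_checkpoints in a loop with a sleep between calls gives periodic evaluation without touching the training process.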

I ran the unmodified code as well, and it shows a similar result.
Why is the model not able to learn anything on Gaudi1 with the Wikipedia dataset?

Command that I use to train:

time mpirun --allow-run-as-root \
    --tag-output \
    --merge-stderr-to-stdout \
    --output-filename /data3/tensorflow/bert_pur/artifacts/bert_phase_2_log \
    --bind-to core \
    --map-by socket:PE=6 \
    -np 8 \
    -x TF_BF16_CONVERSIIN=/root/Model-References/TensorFlow/nlp/bert/bf16_config/bert.json \
    $PYTHON \
    --input_files_dir=/root/datasets/train_packed/ \
    --init_checkpoint /root/datasets/MLPerf_BERT_checkpoint/model.ckpt-28252 \
    --eval_files_dir=/root/datasets/mlperf_bert_eval_dataset/ \
    --output_dir=/data3/tensorflow/bert_pur/artifacts/phase_2 \
    --bert_config_file=/data3/datasets/MLPerf_BERT_checkpoint/bert_config.json \
    --do_train=True \
    --do_eval=True \
    --train_batch_size=8 \
    --eval_batch_size=8 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --num_train_steps=100000 \
    --num_accumulation_steps=1 \
    --num_warmup_steps=0 \
    --save_checkpoints_steps=1500 \
    --learning_rate=0.0005 \
    --horovod \
    --noamp \
    --nouse_xla \
    --allreduce_post_accumulation=True \
    --dllog_path=/root/dlllog/bert_dllog.json \
    --resume=False 2>&1 | tee bert_phase2_re.log

Thanks @Sayantan_S. I modified the code to use train_and_evaluate. I am facing two problems:

  1. It uses all workers to do evaluation. How can I use only worker 0 for evaluation?
  2. My mask_mlm_accuracy decreases steadily as I train more:
    using pretrained checkpoint : 0.34
    checkpoint stored at 1500 steps : 0.20
    checkpoint stored at 3000 steps : 0.07
    checkpoint stored at 4500 steps : 0.05

What could be the reason for this result?
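On the first question, one way to restrict work to a single worker is to gate it on the process rank. The sketch below assumes the job is launched with Open MPI's mpirun, which exports OMPI_COMM_WORLD_RANK to every process it starts. In a Horovod script the equivalent check is hvd.rank() == 0. The function names and the evaluate_fn callback are hypothetical, and this only shows the gating, not how to wire it into train_and_evaluate.

```python
import os


def is_chief_worker(environ=None):
    """Return True on the process that Open MPI launched as rank 0.

    When OMPI_COMM_WORLD_RANK is absent (for example in a single-process
    run), the process is treated as the chief.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("OMPI_COMM_WORLD_RANK", "0")) == 0


def maybe_evaluate(evaluate_fn, environ=None):
    """Run the hypothetical evaluate_fn on rank 0 only; other ranks get None."""
    if is_chief_worker(environ):
        return evaluate_fn()
    return None
```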

I see a typo here. It should be TF_BF16_CONVERSION instead of TF_BF16_CONVERSIIN.
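A misspelled name passed to mpirun -x is exported without complaint, so a typo like this is easy to miss. As a rough sketch, the names exported by a launch command can be compared against the names one expects to set. The EXPECTED_VARS tuple and the function name are assumptions for illustration, not an official list of Habana variables.

```python
import difflib

# Names the launch command is expected to export. This tuple is an
# assumption for illustration, not an official list.
EXPECTED_VARS = ("TF_BF16_CONVERSION",)


def find_suspect_names(exported, expected=EXPECTED_VARS, cutoff=0.8):
    """Map each exported name that nearly matches an expected name to it."""
    suspects = {}
    for name in exported:
        if name in expected:
            continue  # spelled exactly right
        close = difflib.get_close_matches(name, list(expected), n=1, cutoff=cutoff)
        if close:
            suspects[name] = close[0]
    return suspects
```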

@Sayantan_S. I generated the training data following the Habana BERT MLCommons submission, where max_predictions_per_seq was 76, but the README for Gaudi TensorFlow BERT mentions --max_predictions_per_seq=80. Could this cause the behaviour shown?

@Sayantan_S I have attached TensorBoard charts here. The purple one is the original code trained without any modification, and it also doesn't converge.

@Sayantan_S even after correcting the command, I am facing a similar issue.

model.ckpt-1000 0.2810375988483429
model.ckpt-2000 0.2611505091190338
model.ckpt-3000 0.252014696598053
model.ckpt-4000 0.24776245653629303
model.ckpt-5000 0.22954654693603516
model.ckpt-6000 0.056894563138484955
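The numbers above fall slowly up to step 5000 and then collapse between 5000 and 6000. A small sketch for locating such a drop in a list of "checkpoint accuracy" lines follows. The function names are hypothetical and the line format is assumed to match the lines posted above.

```python
def parse_eval_log(lines):
    """Parse 'model.ckpt-<step> <accuracy>' lines into sorted (step, accuracy) pairs."""
    points = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[0].startswith("model.ckpt-"):
            continue  # skip anything that is not a result line
        step = int(parts[0].rsplit("-", 1)[1])
        points.append((step, float(parts[1])))
    return sorted(points)


def largest_drop(points):
    """Return (from_step, to_step, delta) for the steepest fall between
    consecutive checkpoints, or None if accuracy never falls."""
    worst = None
    for (s0, a0), (s1, a1) in zip(points, points[1:]):
        delta = a1 - a0
        if delta < 0 and (worst is None or delta < worst[2]):
            worst = (s0, s1, delta)
    return worst
```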

What release (1.8.0?) and machine (gaudi1 or gaudi2) are you using?

Parsing the previous posts and summarizing so that I can repro it on my end

  1. machine used: gaudi 1, 8x, sw stack: 1.8
  2. Data generation instructions:
    Model-References/MLPERF2.1/Habana/benchmarks at master · HabanaAI/Model-References · GitHub
  3. model run instructions used: Model-References/TensorFlow/nlp/bert at master · HabanaAI/Model-References · GitHub
    Some comments in the middle suggest you tried making some code changes, but in the end it seems you used the original code and instructions and are unable to get good accuracy.

Please correct/add any info if I have missed anything above.


  1. Machine used: Gaudi 1, 8x; SW stack: 1.7. Everything else is the same as you mentioned.
    It managed to reach 72% eval accuracy after changing to --num_accumulation_steps=512, but it took 8x as long as on an Nvidia A100, where even using --num_accumulation_steps=1 was faster. Any insight would be helpful. Thank you.
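For context on the --num_accumulation_steps change, the sketch below works out the number of samples per optimizer update. It assumes the global batch is the per-worker batch times the number of workers times the accumulation steps, which is an assumption about how run_pretraining combines these flags and is not confirmed in this thread. It also assumes the other values stay as in the command posted earlier (--train_batch_size=8, -np 8).

```python
def global_batch_size(train_batch_size, num_workers, num_accumulation_steps):
    """Samples contributing to one optimizer update (assumed formula)."""
    return train_batch_size * num_workers * num_accumulation_steps


# Values taken from the command posted earlier in the thread.
before = global_batch_size(train_batch_size=8, num_workers=8, num_accumulation_steps=1)
after = global_batch_size(train_batch_size=8, num_workers=8, num_accumulation_steps=512)

print(before)  # 64
print(after)   # 32768
```

Under that assumption, each optimizer update with 512 accumulation steps processes 512 times as many samples as with 1, which changes both the time per update and the effective learning dynamics.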


The MLPerf version uses a dataset (Wikipedia) provided by MLCommons that they share on Google Drive; the link is mentioned here

The non-MLPerf version uses a combination of books corpus and wiki dataset. Downloading it can be tricky. Download info here

Running a non-MLPerf version on a Wikipedia-only dataset will not have good accuracy and would require a lot of hyperparameter tuning.

If you want to reproduce our results, you need to follow the exact steps from the README, and not mix the dataset from one with the hyperparameters from another.

@Sayantan_S. That’s right. I am using the Wikipedia dataset provided by MLCommons, which I downloaded from Google Drive.

Our experiments suggest that mixing the dataset prep from one README with the run command (hyperparameters) from another will not give good accuracy.
Non-MLPerf data is finicky to download; MLPerf data is more easily available. If you are sticking to MLPerf data, can you please try the MLPerf run command?