Gaudi eval dataset in tfrecord format to get accuracy of run

Hi, I am trying to train BERT on Gaudi1 with the Wikipedia dataset.
The Gaudi README mentioned above describes the steps to run BERT training on Gaudi, but it uses the BooksWiki dataset. By combining the TensorFlow data preparation from the Gaudi2 README with the packing steps from the Gaudi1 README, I was able to run training. However, the evaluation data is a plain text file, while the Gaudi1 training process expects TFRecord (or packed TFRecord) format, so I am not able to get the accuracy of my run.

Could anyone point out whether there is a way to generate the Wikipedia eval dataset in TFRecord format and then pack it?
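
For reference, my understanding is that the eval records need the same layout as the training records before packing, i.e. what create_pretraining_data.py produces. A minimal sketch of that record layout (standard BERT pretraining feature names, with dummy values just to show the schema; this is only my assumption of what run_pretraining.py expects):

    import collections
    import tensorflow as tf

    MAX_SEQ_LENGTH = 512
    MAX_PREDICTIONS_PER_SEQ = 76  # must match what the training data was generated with

    def int_feature(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

    def float_feature(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))

    # Dummy, already-tokenized/masked example; real values come from the
    # tokenization and masking step.
    features = collections.OrderedDict()
    features["input_ids"] = int_feature([0] * MAX_SEQ_LENGTH)
    features["input_mask"] = int_feature([1] * MAX_SEQ_LENGTH)
    features["segment_ids"] = int_feature([0] * MAX_SEQ_LENGTH)
    features["masked_lm_positions"] = int_feature([0] * MAX_PREDICTIONS_PER_SEQ)
    features["masked_lm_ids"] = int_feature([0] * MAX_PREDICTIONS_PER_SEQ)
    features["masked_lm_weights"] = float_feature([0.0] * MAX_PREDICTIONS_PER_SEQ)
    features["next_sentence_labels"] = int_feature([0])

    with tf.io.TFRecordWriter("eval.tfrecord") as writer:
        example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(example.SerializeToString())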

Thank you

I am able to generate the eval dataset. Currently the run_pretraining script only performs evaluation once the entire training finishes or a predefined number of steps is reached. Is there a way to do periodic model evaluation on the latest saved checkpoint?

Currently the code uses estimator.train, but you can try replacing it with tf.estimator.train_and_evaluate
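
A minimal sketch of what that replacement can look like (the estimator and input_fn arguments stand in for whatever run_pretraining.py already builds; these are not the script's actual variable names):

    import tensorflow as tf

    def train_and_evaluate_periodically(estimator, train_input_fn, eval_input_fn,
                                        num_train_steps, max_eval_steps):
        # Drop-in replacement for the single estimator.train(...) call.
        # Evaluation re-runs on the latest checkpoint every time one is written,
        # so the cadence is governed by save_checkpoints_steps in the RunConfig.
        train_spec = tf.estimator.TrainSpec(
            input_fn=train_input_fn, max_steps=num_train_steps)
        eval_spec = tf.estimator.EvalSpec(
            input_fn=eval_input_fn,
            steps=max_eval_steps,
            start_delay_secs=0,
            throttle_secs=0)  # 0 = evaluate as soon as a new checkpoint appears
        tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)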

Here’s another reference

I ran with the unmodified run_pretraining.py as well and it also shows a similar result.
Why, on Gaudi1 with the Wikipedia dataset, is the model not able to learn anything?

Command that I use to train:

time mpirun --allow-run-as-root \
    --tag-output \
    --merge-stderr-to-stdout \
    --output-filename /data3/tensorflow/bert_pur/artifacts/bert_phase_2_log \
    --bind-to core \
    --map-by socket:PE=6 \
    -np 8 \
    -x TF_BF16_CONVERSIIN=/root/Model-References/TensorFlow/nlp/bert/bf16_config/bert.json \
    $PYTHON run_pretraining.py \
        --input_files_dir=/root/datasets/train_packed/ \
        --init_checkpoint /root/datasets/MLPerf_BERT_checkpoint/model.ckpt-28252 \
        --eval_files_dir=/root/datasets/mlperf_bert_eval_dataset/ \
        --output_dir=/data3/tensorflow/bert_pur/artifacts/phase_2 \
        --bert_config_file=/data3/datasets/MLPerf_BERT_checkpoint/bert_config.json \
        --do_train=True \
        --do_eval=True \
        --train_batch_size=8 \
        --eval_batch_size=8 \
        --max_seq_length=512 \
        --max_predictions_per_seq=76 \
        --num_train_steps=100000 \
        --num_accumulation_steps=1 \
        --num_warmup_steps=0 \
        --save_checkpoints_steps=1500 \
        --learning_rate=0.0005 \
        --horovod \
        --noamp \
        --nouse_xla \
        --allreduce_post_accumulation=True \
        --dllog_path=/root/dlllog/bert_dllog.json \
        --resume=False \
    2>&1 | tee bert_phase2_re.log

Thanks @Sayantan_S. I modified the code to use train_and_evaluate. There are two problems I am facing:

  1. It uses all workers to do evaluation. How can I use only worker 0 to do evaluation? (See the sketch below.)
  2. My masked_lm_accuracy is constantly decreasing as I train more:
    using pretrained checkpoint: 0.34
    checkpoint stored at 1500 steps: 0.20
    checkpoint stored at 3000 steps: 0.07
    checkpoint stored at 4500 steps: 0.05

What could be the reason for this result?
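
For point 1, one possible approach (just a sketch, assuming hvd.init() has already been called as run_pretraining.py does with --horovod; the argument names are placeholders) is to let only rank 0 run train_and_evaluate and keep the other workers on plain estimator.train:

    import horovod.tensorflow as hvd
    import tensorflow as tf

    def run(estimator, train_input_fn, eval_input_fn, num_train_steps, max_eval_steps):
        if hvd.rank() == 0:
            # Rank 0 trains and periodically evaluates the latest checkpoint.
            tf.estimator.train_and_evaluate(
                estimator,
                tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=num_train_steps),
                tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=max_eval_steps,
                                      throttle_secs=0))
        else:
            # The other ranks only train; they block at the next allreduce while
            # rank 0 is busy evaluating, then training resumes on all workers.
            estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)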

I see a typo here. Should be TF_BF16_CONVERSION instead of TF_BF16_CONVERSIIN

@Sayantan_S . I generated the training data following the Habana BERT MLCommons submission, where max_predictions_per_seq was 76, but the README for Gaudi TensorFlow BERT mentions --max_predictions_per_seq=80. Is this something that could cause the behaviour shown?
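
For context, the standard BERT pretraining input pipeline parses the records with fixed-length features whose sizes come from these flags, so the value passed at training/eval time has to match the value the data was generated with. A sketch of that parsing spec (standard BERT feature names; this is an illustration, not the Habana fork's exact code):

    import tensorflow as tf

    def pretraining_feature_spec(max_seq_length, max_predictions_per_seq):
        # The lengths below must match what the TFRecords were written with;
        # a mismatch typically shows up as a parse/shape error rather than
        # silently degraded accuracy.
        return {
            "input_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
            "input_mask": tf.io.FixedLenFeature([max_seq_length], tf.int64),
            "segment_ids": tf.io.FixedLenFeature([max_seq_length], tf.int64),
            "masked_lm_positions": tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_ids": tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
            "masked_lm_weights": tf.io.FixedLenFeature([max_predictions_per_seq], tf.float32),
            "next_sentence_labels": tf.io.FixedLenFeature([1], tf.int64),
        }

    # Data generated with 76 predictions per sequence must also be parsed with 76:
    spec = pretraining_feature_spec(max_seq_length=512, max_predictions_per_seq=76)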

@Sayantan_S Here I have attached the TensorBoard charts. The purple one is the original code training without any modification, and it also does not converge.

@Sayantan_S Even after correcting the command, I am facing a similar issue.

checkpoint       masked_lm_accuracy
model.ckpt-1000  0.2810375988483429
model.ckpt-2000  0.2611505091190338
model.ckpt-3000  0.252014696598053
model.ckpt-4000  0.24776245653629303
model.ckpt-5000  0.22954654693603516
model.ckpt-6000  0.056894563138484955

What release (1.8.0?) and machine (gaudi1 or gaudi2) are you using?

@Purvang1
Parsing the previous posts and summarizing so that I can repro it on my end

  1. machine used: gaudi 1, 8x, sw stack: 1.8
  2. Data generation instructions:
    Model-References/MLPERF2.1/Habana/benchmarks at master · HabanaAI/Model-References · GitHub
  3. model run instructions used: Model-References/TensorFlow/nlp/bert at master · HabanaAI/Model-References · GitHub
    Some comments in the middle suggest you tried making some code changes, but in the end it seems you used the original code and instructions and are unable to get good accuracy

Please correct/add any info if I have missed anything above.

@Sayantan_S

  1. machine used: Gaudi1, 8x, SW stack: 1.7. Everything else is the same as you mentioned.
    It managed to reach 72% eval accuracy after changing --num_accumulation_steps=512, but it took 8x longer compared to an Nvidia A100, where even using --num_accumulation_steps=1 was faster. Any insight would be helpful (rough effective-batch-size arithmetic below). Thank you
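
For context on what that flag change means for the effective batch size, a rough calculation (assuming the per-HPU batch of 8 and the 8 workers from the command above):

    # Effective (global) batch = per-device batch * workers * accumulation steps
    per_device_batch = 8    # --train_batch_size=8
    workers = 8             # -np 8
    for accumulation_steps in (1, 512):
        print(accumulation_steps, per_device_batch * workers * accumulation_steps)
    # prints "1 64" then "512 32768"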

@Purvang1

The MLPerf version uses a dataset (Wikipedia) provided by MLCommons that they share on Google Drive, which is the link mentioned here

The non-MLPerf version uses a combination of the Books corpus and the Wiki dataset. Downloading it can be tricky. Download info is here

Running a non-MLPerf version on a Wikipedia-only dataset will not have good accuracy and would require a lot of hyperparameter tuning.

If you want to reproduce our results, you need to follow the exact steps from the README, and not mix the dataset from one and the hyperparameters from another.

@Sayantan_S . That's right. I am using the Wikipedia dataset provided by MLCommons, which I downloaded from Google Drive.

Our experiments suggest that mixing the dataset prep from one recipe with the run command (hyperparameters) from another will not give good accuracy.
The non-MLPerf data is finicky to download; the MLPerf data is more easily available. If you are sticking to the MLPerf data, can you please try the MLPerf run command?