When posting a technical issue, please describe the issue and be as descriptive as possible; you can include things like:
• What was the expected behavior:
• What is the observed result:
• Is the issue consistently reproducible? How long does it take to reproduce:
• If you are using an AWS DL1 instance, please report the AMI name that you are using:
• What is the minimal script/command to reproduce the issue:
• Please include any error message or stack trace observed:
• Please run the Snapshot for Debug tool and post the output to the issue:
• git clone https://github.com/HabanaAI/Snapshot_For_Debug (snapshot scripts for gathering information about the model and the Habana training session, for Habana analysis and debug)
• touch OUT_DOCKER.txt
• python src/gather_info_docker.py --lite --cmd=<command_script> -s OUT_DOCKER.txt
• post the generated tar file (gather_info_docker.tar.gz) after checking its contents
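For example, a quick way to review the archive contents before posting it (a minimal sketch using Python's standard tarfile module; the archive name matches the one generated above):

```python
import tarfile

# List the contents of the debug archive before posting it,
# so you can confirm no sensitive data is included.
with tarfile.open("gather_info_docker.tar.gz", "r:gz") as archive:
    for name in archive.getnames():
        print(name)
```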
Hi, I am trying to train BERT on Gaudi1 on the Wikipedia dataset.
The Gaudi README mentioned above defines the steps to run BERT training on Gaudi, but it uses the BooksWiki dataset. By combining the TensorFlow data preparation from the Gaudi2 README with the packing from the Gaudi1 README, I was able to run training. However, the evaluation data is in txt format, while the Gaudi1 training process expects tfrecord or packed tfrecord format, so I am not able to get the accuracy of my run.
Could anyone point out whether there is a way to generate the Wikipedia eval dataset in tfrecord format and then pack it?
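For reference, the eval records need the same feature layout as the training tfrecords (packing would then be a separate step, per the Gaudi1 README). A minimal sketch of writing one record in that layout, assuming TensorFlow is installed; the token IDs here are placeholders, and max_predictions_per_seq must match the training setup:

```python
import tensorflow as tf

def int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=list(values)))

max_seq_length = 512
max_predictions_per_seq = 76  # must match the value used for the training data

# Placeholder values; in practice these come from the tokenized Wikipedia text.
features = {
    "input_ids": int64_feature([0] * max_seq_length),
    "input_mask": int64_feature([0] * max_seq_length),
    "segment_ids": int64_feature([0] * max_seq_length),
    "masked_lm_positions": int64_feature([0] * max_predictions_per_seq),
    "masked_lm_ids": int64_feature([0] * max_predictions_per_seq),
    "masked_lm_weights": float_feature([0.0] * max_predictions_per_seq),
    "next_sentence_labels": int64_feature([0]),
}

with tf.io.TFRecordWriter("eval.tfrecord") as writer:
    example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(example.SerializeToString())
```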
I am able to generate the eval dataset. Currently the run_pretraining script only performs evaluation once the entire training finishes or a predefined number of steps is reached. Is there a way to do periodic model evaluation on the latest saved checkpoint?
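One option that looks relevant is tf.estimator.train_and_evaluate, which re-runs evaluation on the newest checkpoint as training progresses. A minimal sketch, assuming the estimator and input functions already built in run_pretraining.py:

```python
import tensorflow as tf

# train_and_evaluate interleaves evaluation with training, re-evaluating
# on the newest checkpoint at most every `throttle_secs` seconds.
# `estimator`, `train_input_fn`, and `eval_input_fn` are assumed to be
# the ones already constructed in run_pretraining.py.
def train_with_periodic_eval(estimator, train_input_fn, eval_input_fn,
                             train_steps, eval_steps):
    train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                        max_steps=train_steps)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                      steps=eval_steps,
                                      throttle_secs=600)  # eval ~every 10 min
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```

Note that evaluation only fires when a new checkpoint actually exists, so the checkpoint save interval also bounds the eval frequency.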
I ran with the unmodified run_pretraining.py as well, and it shows a similar result.
Why is the model not able to learn anything on Gaudi1 with the Wikipedia dataset?
Thanks @Sayantan_S. I modified the code to use train_and_evaluate. There are two problems I am facing:
• It uses all workers to do evaluation. How can I use only worker 0 to do evaluation? (A possible pattern is sketched below, after the numbers.)
• My mask_mlm_accuracy is constantly decreasing as I train more:
using pretrained checkpoint : 0.34
checkpoint stored at 1500 steps : 0.20
checkpoint stored at 3000 steps : 0.07
checkpoint stored at 4500 steps : 0.05
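On the first problem, one pattern worth trying (assuming Horovod drives the 8x run, as in the Habana reference scripts, and that hvd.init() has already been called in run_pretraining.py) is to fall back from train_and_evaluate to an explicit train step on all workers, followed by evaluation gated on rank 0:

```python
import horovod.tensorflow as hvd

def train_then_eval_on_rank0(estimator, train_input_fn, eval_input_fn,
                             train_steps, eval_steps):
    # All workers participate in training.
    estimator.train(input_fn=train_input_fn, max_steps=train_steps)
    # Only worker 0 runs evaluation, so the eval set is not
    # processed redundantly on every rank.
    if hvd.rank() == 0:
        return estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
    return None
```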
@Sayantan_S, I generated the training data following the Habana BERT MLCommons submission, where max_predictions_per_seq was 76, but the README for Gaudi TensorFlow BERT mentions --max_predictions_per_seq=80. Is that something that can cause the behaviour shown above?
Machine used: Gaudi1, 8x; SW stack: 1.7. Everything else is the same as you mentioned.
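For what it's worth, the run flag has to match the value the tfrecords were written with: in the stock (unpacked) BERT input pipeline, the masked-LM fields are parsed as fixed-length features, so records generated with 76 predictions per sequence cannot be decoded with a spec expecting 80. A rough illustration (feature names follow the standard BERT pretraining layout):

```python
import tensorflow as tf

max_predictions_per_seq = 80  # run flag; suppose the records store 76

name_to_features = {
    "masked_lm_positions":
        tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_ids":
        tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
    "masked_lm_weights":
        tf.io.FixedLenFeature([max_predictions_per_seq], tf.float32),
}

def decode(serialized_record):
    # Parsing a record whose fields hold only 76 values with this spec
    # fails with an InvalidArgumentError rather than silently degrading,
    # so a run that trains at all suggests the flag and data already agree.
    return tf.io.parse_single_example(serialized_record, name_to_features)
```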
It managed to reach 72% eval accuracy after changing --num_accumulation_steps=512, but the run took 8x longer than on an NVIDIA A100, where even --num_accumulation_steps=1 was faster. Any insight would be helpful. Thank you.
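For context, --num_accumulation_steps multiplies the effective batch size: gradients are accumulated over that many micro-batches before each optimizer step. A rough sketch of the arithmetic (the per-device batch size here is a placeholder; use your actual --train_batch_size):

```python
# Effective global batch size per optimizer step:
#   per_device_batch * num_accumulation_steps * num_workers
per_device_batch = 8          # placeholder; use your actual --train_batch_size
num_accumulation_steps = 512  # value from the run above
num_workers = 8               # 8x Gaudi1

effective_batch = per_device_batch * num_accumulation_steps * num_workers
print(effective_batch)  # 32768 sequences consumed per optimizer step
```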
Our experiments suggest that mixing the dataset preparation from one recipe with the run command (hyperparameters) of another will not give good accuracy.
The non-MLPerf data is finicky to download; the MLPerf data is more easily available. If you are sticking to the MLPerf data, can you please try the MLPerf run command?