2023, No. 06, Vol. 21, pp. 9-15
A Survey of Distributed Training Methods for Large Models
Released: 2023-12-28
Published: 2023-12-28
Abstract:

Since the advent of ChatGPT, Large Language Model (LLM) products have emerged in rapid succession, and an era of large models is arriving. However, because large models have very large parameter counts and long training times, traditional machine learning training methods are not suitable for them, and new distributed training methods and strategies are urgently needed. To address these problems, this paper reviews the progress of distributed training methods for large models over the past decade or so from three aspects: distributed training architectures, parallel acceleration strategies, and memory and computation optimization. It concludes by proposing research directions that can be explored in the future.

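Basic information:

CLC number: TP18

Citation:

[1] Jiang Fengze (蒋丰泽). A survey of distributed training methods for large models [J]. Journal of Shenzhen Institute of Information Technology (深圳信息职业技术学院学报), 2023, 21(06): 9-15.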

