Journal of Shenzhen Institute of Information Technology (深圳信息职业技术学院学报)

2023, Vol. 21, No. 06, pp. 9-15

A Survey of Distributed Training Methods for Large Models

Jiang Fengze (蒋丰泽)

1. School of Software, Shenzhen Institute of Information Technology


Published online: 2023-12-28

Publication date: 2023-12-28


Abstract:

Since the release of ChatGPT, products built on large language models (LLMs) have emerged in rapid succession, and an era of large models is arriving. However, because large models have very large parameter counts and long training times, traditional machine learning training methods are not suited to them, and new distributed training methods and strategies are urgently needed. To address these problems, this paper surveys the progress of distributed training methods for large models over the past decade or so from three aspects: distributed training architectures, parallel acceleration strategies, and memory and computation optimization. It concludes by proposing research directions for future exploration.

Keywords: large language model; distributed training; parallel processing

For the full text, visit cnki.net.


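Basic information:

CLC number: TP18

Citation:

[1] Jiang Fengze. A Survey of Distributed Training Methods for Large Models[J]. Journal of Shenzhen Institute of Information Technology, 2023, 21(06): 9-15.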
