1. Introduction

Language models estimate the probability of a sequence of words or tokens. In general, they assign low probability to rare sequences or to sequences containing grammatical errors. For example, a language model trained on computer science articles would assign a higher probability to s1: "VLDB is a database conference" than to s2: "VLDB eases a data base conference". Language models are widely used in natural language processing [15], especially in applications that generate text, including automatic speech recognition (ASR), machine translation, and information retrieval. Typically, a language model is used to rank the candidate outputs produced by a generator. In ASR, for instance, the generator is an acoustic model, which takes audio as input and outputs candidate word sequences. When the acoustic model produces several candidates with similar scores, such as s1 and s2, the language model is crucial for selecting the correct answer.
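To make this reranking role concrete, below is a minimal sketch, with hypothetical names not taken from the paper, of how a language model score can be combined with acoustic model scores to choose among ASR candidates.

```python
# A minimal sketch of LM-based reranking of ASR candidates.
# The names (rerank, lm_log_prob, alpha) are illustrative, not from the paper.

def rerank(candidates, lm_log_prob, alpha=0.5):
    """candidates: list of (word_sequence, acoustic_log_prob) pairs.

    Returns the sequence maximizing a weighted sum of the acoustic
    score and the language model score.
    """
    return max(
        candidates,
        key=lambda c: c[1] + alpha * lm_log_prob(c[0]),
    )[0]

# With two near-tied acoustic candidates s1 and s2,
# rerank([(s1, -10.1), (s2, -10.0)], lm_log_prob) returns s1
# whenever lm_log_prob(s1) is sufficiently higher than lm_log_prob(s2).
```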
The n-gram model is a simple and highly effective language model. It estimates the probability of a word sequence from statistics (e.g., frequencies) of the n-grams in the sequence, where an n-gram is a subsequence of n words. For example, "VLDB" and "database" are 1-grams, while "VLDB is" and "VLDB eases" are 2-grams. An n-gram language model assigns higher probability to sequences whose n-grams occur frequently. The underlying statistics are computed from a specific text corpus, so the estimated probability reflects how likely the sequence is to be generated from the training corpus. For the sample sequences s1 and s2, a 3-gram model gives s1 a higher probability, because "VLDB is a" is more common than "VLDB eases a", and likewise "a database conference" is more common than "data base conference" in computer science articles.
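As a sketch of this estimation (a simple maximum-likelihood version, assuming the tokenization and helper names shown, which are not from the paper), the probability of each word given its n-1 predecessors is the count of the n-gram divided by the count of its prefix:

```python
from collections import Counter
import math

def train_ngram(corpus_sentences, n=3):
    """Count n-grams and their (n-1)-gram prefixes from tokenized sentences."""
    ngrams, prefixes = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            prefixes[gram[:-1]] += 1
    return ngrams, prefixes

def sequence_log_prob(sent, ngrams, prefixes, n=3):
    """Sum of log P(w_i | previous n-1 words) under maximum-likelihood estimates."""
    tokens = ["<s>"] * (n - 1) + sent + ["</s>"]
    logp = 0.0
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1:i + 1])
        c_gram, c_prefix = ngrams[gram], prefixes[gram[:-1]]
        if c_gram == 0 or c_prefix == 0:
            return float("-inf")  # unseen n-gram; real systems back off instead
        logp += math.log(c_gram / c_prefix)
    return logp
```

Under a 3-gram model trained on computer science text, sequence_log_prob(s1.split(), ...) would exceed the value for s2, matching the example above.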
A major issue with n-gram language models is their high storage cost. First, for accuracy, a larger n is better. For example, a 1-gram model would give s2 a higher score than s1, because "data" and "base" each occur more frequently than "database". In contrast, a 2-gram model is likely to give s1 a higher probability than s2, because "database conference" is more common than "data base conference". Second, a larger n-gram collection contains more n-grams and thus has better coverage, which yields better probability estimates even for relatively uncommon sequences. For example, if the collection does not contain "database conference", the probability of s1 is estimated from its suffix, i.e., "conference". This process, called backoff (see Section 2.2), reduces accuracy; a sketch follows below. Experiments confirm that larger models indeed perform better. However, large models are not easy to store in the memory of a single machine, and fast access to them becomes very inefficient.
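The following sketch illustrates the backoff idea in the simplified "stupid backoff" style of [5], not the exact scheme of Section 2.2; it assumes a counts dictionary mapping k-gram tuples of every order to corpus frequencies:

```python
def backoff_score(gram, counts, total_tokens, penalty=0.4):
    """Score the last word of gram given its preceding words.

    counts: dict mapping k-gram tuples (all orders k >= 1) to frequencies.
    Follows the simple backoff of [5]; it yields relative scores rather
    than normalized probabilities.
    """
    if len(gram) == 1:
        return counts.get(gram, 0) / total_tokens
    if counts.get(gram, 0) > 0:
        return counts[gram] / counts[gram[:-1]]
    # Full n-gram unseen: estimate from the shorter suffix, discounted.
    return penalty * backoff_score(gram[1:], counts, total_tokens, penalty)
```

Production systems typically use smoothed backoff weights, e.g., Kneser-Ney smoothing [17], rather than a fixed penalty.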
References

[1] C. Allauzen, M. Riley, and B. Roark. Distributed representation and estimation of WFST-based n-gram models. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata (StatFSM), pages 32-41, 2016.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Y. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu. Deep Speech 2: End-to-end speech recognition in English and Mandarin. CoRR, abs/1512.02595, 2015.
[3] P. Bailis, E. Gan, S. Madden, D. Narayanan, K. Rong, and S. Suri. MacroBase: Prioritizing attention in fast data. In SIGMOD, pages 541-556, 2017.
[4] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137-1155, Mar. 2003.
[5] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In EMNLP-CoNLL, pages 858-867, Prague, Czech Republic, June 2007.
[6] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359-394, 1999.
[7] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica. Clipper: A low-latency online prediction serving system. In NSDI, pages 613-627, Boston, MA, 2017. USENIX Association.
[8] A. Emami, K. Papineni, and J. Sorensen. Large-scale distributed language modeling. In ICASSP, volume 4, pages IV-37-IV-40, April 2007.
[9] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 32, pages II-1764-II-1772. JMLR.org, 2014.
[10] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
[11] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, Nov 2012.
[12] S. Ji, S. V. N. Vishwanathan, N. Satish, M. J. Anderson, and P. Dubey. BlackOut: Speeding up recurrent neural network language models with very large vocabularies. CoRR, abs/1511.06909, 2015.
[13] J. Jiang, F. Fu, T. Yang, and B. Cui. SketchML: Accelerating distributed machine learning with data sketches. In SIGMOD, pages 1269-1284, New York, NY, USA, 2018. ACM.
[14] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016.
[15] D. Jurafsky and J. H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.
[16] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. NoScope: Optimizing neural network queries over video at scale. PVLDB, 10(11):1586-1597, 2017.
[17] R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, pages 181-184, May 1995.
[18] Y. Lu, A. Chowdhery, S. Kandula, and S. Chaudhuri. Accelerating machine learning inference with probabilistic predicates. In SIGMOD, pages 1493-1508, New York, NY, USA, 2018. ACM.
[19] A. L. Maas, A. Y. Hannun, D. Jurafsky, and A. Y. Ng. First-pass large vocabulary continuous speech recognition using bidirectional recurrent DNNs. CoRR, abs/1408.2873, 2014.
[20] C. Mandery. Distributed n-gram language models: Application of large models to automatic speech recognition. 2011.
[21] T. Mikolov, S. Kombrink, L. Burget, J. Cernocky, and S. Khudanpur. Extensions of recurrent neural network language model. In ICASSP, pages 5528-5531, 2011.
[22] M. Mohri, F. Pereira, and M. Riley. Weighted finite-state transducers in speech recognition. Comput. Speech Lang., 16(1):69-88, Jan. 2002.
[23] Y. Shen, G. Chen, H. V. Jagadish, W. Lu, B. C. Ooi, and B. M. Tudor. Fast failure recovery in distributed graph processing systems. PVLDB, 8(4):437-448, 2014.
[24] D. Shi. A study on neural network language modeling. CoRR, abs/1708.07252, 2017.
[25] A. Stolcke. Entropy-based pruning of backoff language models. CoRR, cs.CL/0006025, 2000.
[26] W. Williams, N. Prasad, D. Mrva, T. Ash, and T. Robinson. Scaling recurrent neural network language models. In ICASSP, pages 5391-5395, April 2015.
[27] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book. 2002.
[28] Y. Zhang, A. S. Hildebrand, and S. Vogel. Distributed language modeling for n-best list re-ranking. In EMNLP, pages 216-223, Stroudsburg, PA, USA, 2006.