EasyNLP Makes Text Summarization (News Headline) Generation Easy

Authors: Ming Wang (王明), Jun Huang (黃俊)

Introduction

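Text generation is an important research direction in natural language processing, with a wide range of practical applications and considerable research value. Abstractive text summarization is a key subtask of text generation; in practice it takes forms such as news headline generation, summary generation, and keyword generation. Pre-trained language models such as BERT, MASS, and UniLM achieve impressive performance in NLU scenarios, but the word-level and subword-level masked language modeling they use is not well suited to text generation, and to abstractive summarization in particular. The reason is that abstractive summarization usually requires the model to understand semantics at a coarser granularity, such as sentences and paragraphs, in order to generate a summary. To address this, the PEGASUS model (PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization) designs an unsupervised pre-training task for summarization, Gap Sentence Generation (GSG): several complete sentences in a document are masked, and the model is trained to generate the masked sentences. This pre-training task matches the actual summarization task well, so the pre-trained model reaches good summarization quality after simple fine-tuning. We have therefore integrated the PEGASUS algorithm and model into the EasyNLP framework, so that users can conveniently train and run prediction for summarization-related tasks with it.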

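EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and feature-rich Chinese NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch. It supports commonly used Chinese pre-trained models and techniques for deploying large models, and provides a one-stop NLP development experience from training to deployment. EasyNLP offers concise interfaces for developing NLP models, including the NLP application AppZoo and the pre-trained ModelZoo, along with technical support that helps users efficiently bring very large pre-trained models into production. Text generation is a major subtask of NLP with many practical applications, including headline generation, text summarization, machine translation, question answering, and dialogue systems. EasyNLP is therefore gradually adding support for text generation subtasks, in the hope of serving more NLP and NLG algorithm developers and researchers and of working with the community to advance NLG technology and its adoption.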

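This article gives a technical overview of PEGASUS and shows how to use PEGASUS-related text summarization (news headline) generation models in the EasyNLP framework.

PEGASUS Model in Detail

Before PEGASUS, pre-trained text generation models such as T5 and BART had achieved clear gains on many text generation tasks, but for summarization there was still a large gap between their pre-training objectives and the summarization objective. As a result, when such models are transferred to summarization tasks in different domains, they still need a large amount of training data for fine-tuning to reach good results. To mitigate this problem, PEGASUS adds a whole-sentence masking loss on top of the original subword masking loss: several complete sentences in the input document are masked, and the model is asked to recover them.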

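Specifically, as shown in the figure above, PEGASUS uses an encoder-decoder architecture (the standard Transformer architecture). The model applies two kinds of masking to the input. The first is the subword masking used by BERT, denoted [MASK2], where the encoder restores the masked subwords (ablation experiments showed that this loss brings no gain on downstream tasks, so it is not used in the final PEGASUS model). The second is GSG, denoted [MASK1], where the decoder generates the complete sentences masked in the input. For this loss the authors propose three sentence-selection strategies: Random (select m sentences at random), Lead (select the first m sentences), and Ind-Orig (select m sentences by importance score). The importance score of a sentence is the ROUGE score between that sentence and the set of all other sentences in the document, so this strategy can be seen as masking the sentences that best represent the rest of the document. The figure below shows an example of the three strategies, with the selected sentences marked in green, red-brown, and blue respectively. Experiments show that the model using the third strategy achieves the best performance.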
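The importance-based selection described above can be sketched in a few lines of Python. This is an illustrative sketch, not the PEGASUS or EasyNLP implementation: it assumes whitespace tokenization (Chinese text would first need a word or character tokenizer), uses a simplified unigram-overlap F1 in place of a full ROUGE implementation, and the function names `rouge1_f1`, `select_gap_sentences`, and `mask_sentences` are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    if not candidate or not reference:
        return 0.0
    # Counter intersection keeps repeated tokens, clipped to the smaller count.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def select_gap_sentences(sentences, m):
    """Return the indices of the m sentences that best match the rest of the document."""
    tokenized = [s.lower().split() for s in sentences]  # assumption: whitespace tokens
    scores = []
    for i, tokens in enumerate(tokenized):
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(rouge1_f1(tokens, rest))
    # Each sentence is scored independently; keep the top m, in document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:m]
    return sorted(top)


def mask_sentences(sentences, selected, mask_token="[MASK1]"):
    """Build the (encoder input, decoder target) pair for gap sentence generation."""
    chosen = set(selected)
    source = " ".join(mask_token if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in selected)
    return source, target


if __name__ == "__main__":
    doc = [
        "the storm hit the coast",
        "the storm closed roads and schools on the coast",
        "schools and roads will reopen monday",
    ]
    picked = select_gap_sentences(doc, 1)
    print(picked)
    print(mask_sentences(doc, picked))
```

In this toy document the middle sentence shares the most words with the other two, so it is the one masked; the Random and Lead strategies would only change how the indices are chosen, not how the input and target are built.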

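Text Summarization Model Tutorial

Below we briefly describe how to use PEGASUS and other text summarization models in the EasyNLP framework.

Installing EasyNLP

Users can follow the instructions on GitHub (https://github.com/alibaba/EasyNLP) to install the EasyNLP framework.

Data Preparation

In a concrete summarization scenario, users need to provide training and validation data for the downstream task as tsv files. For summarization, each file contains two columns separated by a tab (\t): the first column is the summary and the second is the original text. An example: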

湖北:“四上企業”復工率已達93.8% 央視網消息:4月1日,記者從湖北省新冠肺炎疫情防控工作新聞發佈會上獲悉,在各方面共同努力下,湖北省復工復產工作取得了階段性成效。截至3月31日,湖北省“四上企業”包括規模以上工業、規模以上服務業法人單位等的復工率已達93.8%,復崗率69.3%。武漢市的復工率、復崗率也分別達到了85.4%、40.4%。責任編輯:王詩堯
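A minimal sketch of producing and checking a file in this two-column layout follows. It uses only the Python standard library; the helper names `write_summary_tsv` and `read_summary_tsv` and the file name are hypothetical and are not part of EasyNLP.

```python
import os
import tempfile


def write_summary_tsv(path, pairs):
    """Write (summary, source_text) pairs as a two-column, tab-separated file."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        for summary, text in pairs:
            # Collapse tabs and newlines inside a field; they would break the columns.
            clean = [" ".join(field.split()) for field in (summary, text)]
            f.write("\t".join(clean) + "\n")


def read_summary_tsv(path):
    """Read the file back and verify that every line has exactly two columns."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                raise ValueError(f"line {line_no}: expected 2 columns, got {len(parts)}")
            rows.append((parts[0], parts[1]))
    return rows


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_train.tsv")
    write_summary_tsv(demo_path, [("Short headline", "Full article text\twith a stray tab.")])
    print(read_summary_tsv(demo_path))
```

Validating the column count before training catches stray tabs in scraped news text early, which is usually easier than diagnosing a schema error from the training job.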

The following file contains preprocessed training and validation data for news headline generation and can be used for testing:

https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/title_gen.zip

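Chinese News Headline Generation

Because the models released with the original PEGASUS paper support only English, and to make things easier for the Chinese community, we pre-trained mT5, a model for Chinese news headline summarization based on the mT5 architecture, and integrated it into the EasyNLP model zoo. We also integrated Randeng, a Chinese text summarization model pre-trained by IDEA (it can be regarded as a Chinese version of PEGASUS), so that users can explore how different models perform. The table below lists the models available in EasyNLP and compares their performance on the dataset above. We recommend the first two models for text summarization and the last three for news headline generation.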

Chinese model                                     News headline (Rouge1/2/L)   Paper title summary (Rouge1/2/L)
hfl/randeng-238M-Summary-Chinese                  59.66/46.26/55.95            54.55/39.37/50.69
hfl/randeng-523M-Summary-Chinese                  62.86/49.67/58.89            53.83/39.17/49.92
alibaba-pai/mt5-title-generation-zh-275m          62.35/48.63/58.96            54.28/40.26/50.55
alibaba-pai/randeng-238M-Summary-Chinese-tuned    64.31/51.80/60.97            58.83/45.28/55.72
alibaba-pai/randeng-523M-Summary-Chinese-tuned    64.76/51.65/61.06            59.27/45.58/55.92

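For news headline generation, we train the model with the following command. The hyperparameter 'save_checkpoint_steps' sets the number of steps between checkpoints; at each of these points the framework evaluates the model being trained and decides, based on its performance on the validation set, whether to update the saved model parameters. The main.py file being run is in the EasyNLP/examples/appzoo_tutorials/sequence_generation directory, and the training and validation data must be placed in that directory as well. A model from the table above can be specified through 'pretrain_model_name_or_path' under the 'user_defined_parameters' hyperparameter.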

python main.py \
    --mode train \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./cn_train.tsv,./cn_dev.tsv  \
    --input_schema=title_tokens:str:1,content_tokens:str:1 \
    --first_sequence=content_tokens \
    --second_sequence=title_tokens \
    --label_name=title_tokens \
    --checkpoint_dir=./finetuned_zh_model \
    --micro_batch_size=8 \
    --sequence_length=512 \
    --epoch_num=1  \
    --save_checkpoint_steps=150 \
    --export_tf_checkpoint_type none \
    --user_defined_parameters 'pretrain_model_name_or_path=alibaba-pai/mt5-title-generation-zh language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

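Users can also generate summaries with the model using the following command; the model path is specified by 'checkpoint_dir'. Input columns to be copied into the output file can be specified with 'append_cols'; if none are needed, set it to none.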

python main.py \
    --mode=predict \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./cn_dev.tsv  \
    --outputs=./cn.preds.txt \
    --input_schema=title:str:1,content:str:1,title_tokens:str:1,content_tokens:str:1,tag:str:1 \
    --output_schema=predictions,beams \
    --append_cols=content,title,tag \
    --first_sequence=content_tokens \
    --checkpoint_dir=./finetuned_zh_model \
    --micro_batch_size=32 \
    --sequence_length=512 \
    --user_defined_parameters 'language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

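Below are several predictions the model made on recent news events. Each example contains 5 tab-separated (\t) columns: the predicted summary (news headline), the 5 beam search candidates (separated by ||), and the input news headline, original text, and news tag. The last three columns are copied directly from the corresponding input data. Because the news text is long, only the first four columns of each example are shown.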

**費德勒告別信:未來我還會打更多的網球**  費德勒告別信:未來我還會打更多的網球||費德勒告別信:未來我還會打更多網球||費德勒告別信:未來我還會打更多網球但不是在大滿貫或巡迴賽||費德勒告別信:未來我還會打更多的網球||詳訊:費德勒宣佈退役,並告別信  **一代傳奇落幕!網球天王費德勒宣佈退役**  央視網消息:北京時間9月15日晚,網球天王羅傑-費德勒在個人社媒上宣佈退役。41歲的費德勒是男子網壇歷史最偉大球員之一,曾103次斬獲單打冠軍,大滿貫單打奪冠20次(澳網6冠、法網1冠、温網8冠、美網5冠),共計310周位於男單世界第一。附費德勒告別信:在這些年網球給我的所有禮物中,最棒的毫無疑問是我一路上所遇到的人:我的朋友、我的競爭對手、以及最重要的球迷,是他們給予了這項運動生命。今天,我想和大家分享一些消息。正如你們中的許多人所知道的,過去三年中,我遇到了受傷和手術的挑戰。......
**颱風“梅花”將在大連沿海登陸將逐步變性為温帶氣旋**  颱風“梅花”將在大連沿海登陸將逐步變性為温帶氣旋||颱風“梅花”將在大連沿海登陸後逐漸變性為温帶氣旋||颱風“梅花”將在大連沿海登陸將逐漸變性為温帶氣旋||颱風“梅花”將在大連沿海登陸後變性為温帶氣旋||颱風“梅花”將在大連沿海登陸後逐漸變性 **颱風“梅花”將於16日傍晚前後在遼寧大連沿海登陸**  記者9月16日從遼寧省大連市氣象部門獲悉,今年第12號颱風“梅花”將於16日傍晚前後在大連市旅順口區至莊河市一帶沿海登陸,之後逐漸變性為温帶氣旋。  受颱風“梅花”影響,14日8時至16日10時,大連全市平均降雨量為132毫米,最大降雨量出現在金普新區大李家街道正明寺村,為283.6毫米;一小時最大降雨量出現在長海縣廣鹿島鎮,為49.4毫米......
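The prediction file can be post-processed with a small parser such as the sketch below. It assumes each line holds the 'output_schema' columns (predictions, beams) followed by the 'append_cols' columns in the order given on the command line, with beam candidates joined by ||; the actual column order should be checked against your own output file. The function name `parse_prediction_line` is hypothetical.

```python
def parse_prediction_line(line, appended_cols=("content", "title", "tag")):
    """Split one line of the prediction file into a dictionary.

    Assumed layout: prediction, beams, then the appended input columns,
    all separated by tabs; beam candidates are separated by "||".
    """
    fields = line.rstrip("\n").split("\t")
    expected = 2 + len(appended_cols)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    record = {"prediction": fields[0], "beams": fields[1].split("||")}
    record.update(zip(appended_cols, fields[2:]))
    return record


if __name__ == "__main__":
    sample = "Headline A\tHeadline A||Headline B\tArticle body\tOriginal headline\tsports\n"
    parsed = parse_prediction_line(sample)
    print(parsed["prediction"], len(parsed["beams"]))
```

Keeping all beam candidates alongside the top prediction makes it easy to inspect how much the candidates differ, as in the two samples above where most candidates vary by only a word or two.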

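English Text Summarization

The EasyNLP model zoo also includes English text summarization models, namely PEGASUS and BRIO. The table below shows the performance of the two models on English summarization data. The commands above can likewise be used to train and run prediction with these models. Note that EasyNLP processes Chinese by default, so for English text, language must be set to en in 'user_defined_parameters'; if it is not provided, it defaults to Chinese (zh).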

English model                                Text summarization (Rouge1/2/L)
alibaba-pai/pegasus-summary-generation-en    37.79/18.69/35.44
hfl/brio-cnndm-uncased                       41.46/23.34/38.91

The training procedure is as follows:

wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_train.tsv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_dev.tsv

python main.py \
    --mode train \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./en_train.tsv,./en_dev.tsv  \
    --input_schema=title:str:1,content:str:1 \
    --first_sequence=content \
    --second_sequence=title \
    --label_name=title \
    --checkpoint_dir=./finetuned_en_model \
    --micro_batch_size=1 \
    --sequence_length=512 \
    --epoch_num 1 \
    --save_checkpoint_steps=500 \
    --export_tf_checkpoint_type none \
    --user_defined_parameters 'language=en pretrain_model_name_or_path=alibaba-pai/pegasus-summary-generation-en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

The prediction procedure is as follows:

python main.py \
    --mode=predict \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./en_dev.tsv  \
    --outputs=./en.preds.txt \
    --input_schema=title:str:1,content:str:1 \
    --output_schema=predictions,beams \
    --append_cols=title,content \
    --first_sequence=content \
    --checkpoint_dir=./finetuned_en_model \
    --micro_batch_size 32 \
    --sequence_length 512 \
    --user_defined_parameters 'language=en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

The following shows the model's summary predictions for a recent technology news article:

With the image generator Stable Diffusion, you can conjure within seconds a potrait of Beyoncé as if painted by Vincent van Gogh, a cyberpunk cityscape in the style of 18th century Japanese artist Hokusai and a complex alien world straight out of science fiction. Released to the public just two weeks ago, it’s become one of several popular AI-powered text-to-image generators, including DALL-E 2, that have taken the internet by storm. Now, the company behind Stable Diffusion is in discussions to raise $100 million from investors, according to three people with knowledge of the matter. Investment firm Coatue expressed initial interest in a deal that would value the London-based startup Stability AI at $500 million, according to two of the people. Lightspeed Venture Partners then entered talks — which are still underway — to invest at a valuation up to $1 billion, two sources said. Stability AI, Coatue and Lightspeed declined requests for comment. The London-based startup previously raised at least $10 million in SAFE notes (a form of convertible security popular among early-stage startups) at a valuation of up to $100 million, according to one of the sources. An additional fourth source with direct knowledge confirmed Stability AI’s previous round. Much of the company’s funds came directly from founder and CEO Emad Mostaque, a former hedge fund manager. News of the prior financing was previously unreported. By nature of being open source, Stability AI’s underlying technology is free to use. So far, the company does not have a clear business model in place, according to three of the sources. However, Mostaque said in an interview last month with Yannic Kilcher, a machine learning engineer and YouTube personality, that he has already penned partnerships with “governments and leading institutions” to sell the technology. “We’ve negotiated massive deals so we’d be profitable at the door versus most money-losing big corporations,” he claims. 
The first version of Stable Diffusion itself cost just $600,000 to train, he wrote on Twitter — a fraction of the company’s total funding. Mostaque, 39, hails from Bangladesh and grew up in England. He received a master’s degree in mathematics and computer science from Oxford University in 2005 and spent 13 years working at U.K. hedge funds. In 2019, he launched Symmitree, a startup that aimed to reduce the cost of technology for people in poverty; it shuttered after one year, according to his LinkedIn profile. He then founded Stability AI in late 2020 with the mission of building open-source AI projects. According to its website, text-to-image generation is only one component of a broader apparatus of AI-powered offerings that the company is helping to build. Other open-source research groups it backs are developing tools for language, audio and biology. Stable Diffusion — created in collaboration with RunwayML, a video editing startup also backed by Coatue, and researchers at the Ludwig Maximilian University of Munich — has generated by far the most buzz among the company’s projects. It comes as AI image generators entered the zeitgeist this year, with the release of OpenAI’s DALL-E 2 in April and independent research lab Midjourney’s eponymous product in July. Google also revealed a text-to-image system, Imagen, in May, though it is not available to the public. Mostaque and his peers have said that the existing technology only represents the tip of the iceberg of what AI art is capable of creating: Future use cases could include drastically improved photorealism, video and animation. These image generators are already facing controversy: Many of them have been trained by processing billions of images on the internet without the consent of the copyright holder, prompting debate over ethics and legality. Last week, a testy debate broke out online after a Colorado fine arts competition awarded a top prize to an AI-generated work of art. 
Moreover, unlike DALL-E and Midjourney, which have restrictions in place to prevent the generation of gory or pornographic images, Stable Diffusion’s open source nature allows users to bypass such a block. On 4chan, numerous threads have appeared with AI-generated deepfakes of celebrity nudes, while Reddit has banned at least four communities that were dedicated to posting “not safe for work” AI imagery made using Stable Diffusion. It’s a double-edged sword for Stability AI, which has accumulated community goodwill precisely due to its open source approach that gives its users full access to its code. The company’s website states that the company is “building open AI tools,” a mission that mirrors the initial intent of OpenAI to democratize access to artificial intelligence. OpenAI was launched as a nonprofit research organization by prominent technologists including Sam Altman and Elon Musk, but upon accepting a $1 billion investment from Microsoft in 2019, it became a for-profit business. The move led it to focus on commercializing its technology rather than making it more widely available, drawing criticism from the AI community — and Musk himself.  Stability AI has been a for-profit corporation from its inception, which Mostaque has said is meant to allow the open source research to reach more people. In an interviewwith TechCrunch last month, he said that the company was fully independent. “Nobody has any voting rights except our 75 employees — no billionaires, big funds, governments or anyone else with control of the company or the communities we support,” he said. At a $1 billion valuation, Mostaque would be ceding up to 10% of the company to the new financiers. Venture capital investors who take significant stakes in startups typically ask for board positions so they can influence the decisions the company is making using their money. 
Lightspeed, which manages $10 billion of assets, and Coatue, which is in charge of $73 billion, both have a track record of taking board seats, though it’s unclear if that will be the case with Stability AI. Follow me on Twitter. Send me a secure tip. 

The text above is from https://www.forbes.com/sites/kenrickcai/2022/09/07/stability-ai-funding-round-1-billion-valuation-stable-diffusion-text-to-image/?sh=33ecbe8724d6

For the news article above, the summaries generated by the two latest models are:

stable Diffusion is in discussions to raise $100 million from investors, three people say. The image generator is one of several popular AI-powered text-to-image generators.
company behind the popular image generator Stable Diffusion is in talks to raise $100 million from investors, according to sources

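That covers the full process of training and running prediction with text summarization models in EasyNLP. For a more detailed tutorial, see the following course: 標題黨速成班:基於機器學習PAI EasyNLP的中文新聞標題生成 (Clickbait Crash Course: Chinese News Headline Generation with Machine Learning PAI EasyNLP).

Future Outlook

In the future, we plan to integrate knowledge-oriented Chinese pre-trained models into the EasyNLP framework, covering common Chinese NLU and NLG domains; stay tuned. We will also integrate more SOTA models (especially Chinese models) into EasyNLP to support a variety of NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team continues to develop its own Chinese text generation and Chinese multimodal models. We welcome users to keep following our work and to join our open-source community to build Chinese NLP and multimodal algorithm libraries together!

GitHub: https://github.com/alibaba/EasyNLP

References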

  1. Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin. EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing. arXiv
  2. Zhang, Jingqing, et al. "Pegasus: Pre-training with extracted gap-sentences for abstractive summarization." International Conference on Machine Learning. PMLR, 2020.
  3. Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
  4. Lewis, Mike, et al. "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
  5. Song, Kaitao, et al. "Mass: Masked sequence to sequence pre-training for language generation." arXiv preprint arXiv:1905.02450 (2019).
  6. Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in Neural Information Processing Systems 32 (2019).
  7. Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing Order to Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903, Dublin, Ireland. Association for Computational Linguistics.
