[NLP] Vulnerability Intelligence Information Extraction -- Model Analysis and Training Data Generation





Keep creating, keep growing! This is day 3 of my participation in the "Juejin Daily New Plan · October Writing Challenge".

Preface

The previous two posts covered how vulnerability intelligence is collected and processed, turning the raw crawled data into sequence-labelling data with BIOES tags. This post covers what the model needs next: building the token dictionary and the label dictionary, describing the model architecture, and implementing the model with TensorFlow 1.4.


Named Entity Recognition

Named entity recognition (NER) is one of the sequence-labelling tasks in natural language processing and has received wide attention in recent years. It refers to identifying entities with specific meaning in text, mainly person names, place names, organization names, times, and so on, and it is an important step or component of information extraction, question answering, syntactic parsing, part-of-speech tagging, knowledge graph construction, and other tasks. Its applications also keep broadening, for example the agricultural entity extraction task in the iFLYTEK AI algorithm competition and the drug package-insert recognition track on Alibaba Tianchi.

NER approaches are usually divided into supervised, unsupervised, and rule-based. The supervised approach is typically framed as sequence labelling: a tag is predicted for every token in a sentence, which determines the start and end positions of each entity; this is the method used in this post. Some researchers instead use unsupervised clustering: based on differences in entity features, entities with similar features are grouped into clusters, and at prediction time the nearest cluster determines the entity label. Rule-based methods define rule templates from syntactic and grammatical analysis and match entity categories against them; they work well when there are few entity types and the corpus is fairly homogeneous.

This post trains the entity extraction model with the most basic BiLSTM+CRF setup. The model input is a vulnerability intelligence sentence, and the output is a tag for every token; after mapping the tags back through the label dictionary, we can extract which vendors, versions, CVE IDs, product names, and so on the text contains. The model architecture is shown below:

(Figure: BiLSTM+CRF model architecture)
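To make the mapping from predicted tags back to entities concrete, here is a minimal sketch; the decode_entities helper, the sample tokens, and the expected output are illustrative only and not part of the project code:

```
# Minimal sketch: decoding a BIOES tag sequence back into entity spans.
def decode_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from a BIOES tag sequence."""
    entities, buf, etype = [], [], None

    def flush():
        # emit the currently open span, if any
        if buf and etype:
            entities.append(("".join(buf), etype))

    for token, tag in zip(tokens, tags):
        if tag.startswith(("B_", "S_")):          # a new entity starts here
            flush()
            buf, etype = [token], tag[2:]
        elif tag.startswith(("I_", "E_")) and etype == tag[2:]:
            buf.append(token)                      # continue the open entity
        else:                                      # "O" or inconsistent tag
            flush()
            buf, etype = [], None
    flush()
    return entities

tokens = ["umbraco", "cms", "8.5", ".", "3", "版本"]
tags   = ["B_company", "O", "B_version", "I_version", "E_version", "O"]
print(decode_entities(tokens, tags))
# -> [('umbraco', 'company'), ('8.5.3', 'version')]
```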

Model Preparation

Before building the model, we need to generate the label dictionary and the token dictionary, which map text to indices into the embedding matrix; the [UNK] and [PAD] tokens also have to be added. The labels likewise need to be converted to label indices so that the loss can measure the deviation between predicted and true labels. The code is as follows:

```
import codecs
import pickle
import random
from tqdm import tqdm


def write_pickle(fileName, obj):
    f = open(fileName, 'wb')
    pickle.dump(obj, f)
    f.close()


def load_pickle(fileName):
    f = open(fileName, 'rb')
    d = pickle.load(f)
    f.close()
    return d


def make_dict():
    # Build the token dictionary: count word frequencies, keep words that
    # appear more than 5 times, and reserve indices 0/1 for [PAD]/[UNK].
    print("Building token dictionary...")
    vocabulary = {}
    lines = codecs.open("NER_data.txt", 'r', 'UTF-8').readlines()
    for line in tqdm(lines):
        line = line.strip()
        if line != "":
            word = line.split('\t')[0]
            if word not in vocabulary:
                vocabulary[word] = 1
            else:
                vocabulary[word] += 1
    print(len(vocabulary))
    vocabulary_other = {}
    vocabulary_other["[PAD]"] = 0
    vocabulary_other["[UNK]"] = 1
    for k, v in vocabulary.items():
        if v > 5:
            vocabulary_other[k] = len(vocabulary_other)
    print(len(vocabulary_other))
    for k, v in vocabulary_other.items():
        print(k, v)
    write_pickle("word_dic.pkl", vocabulary_other)


def make_label_dic():
    # Build the label dictionary: map each BIOES tag to an integer index.
    vocabulary = {}
    lines = codecs.open("NER_data.txt", 'r', 'UTF-8').readlines()
    for line in tqdm(lines):
        line = line.strip()
        if line != "" and len(line.split('\t')) > 1:
            label = line.split('\t')[1]
            if label not in vocabulary:
                vocabulary[label] = len(vocabulary)
    for k, v in vocabulary.items():
        print(k, v)
    write_pickle("label_dic.pkl", vocabulary)


def make_dataset():
    # Group the labelled corpus into sentences (separated by blank lines)
    # and write a 90/10 train/test split.
    lines = codecs.open("NER_data.txt", 'r', 'UTF-8').readlines()
    total = []
    temp = []
    for line in lines:
        if len(line.strip().split("\t")) > 1:
            temp.append(line)
        if line.strip() == "":
            total.append(temp)
            temp = []

    print(len(total))
    random.shuffle(total)
    train = total[:int(len(total) * 0.9)]
    test = total[int(len(total) * 0.9):]
    print(len(train))
    print(len(test))
    writer = codecs.open("train.txt", 'w', "UTF-8")
    for item in train:
        for word in item:
            writer.write(word)
        writer.write("\n")
    writer.close()

    writer = codecs.open("test.txt", 'w', "UTF-8")
    for item in test:
        for word in item:
            writer.write(word)
        writer.write("\n")
    writer.close()


if __name__ == '__main__':
    make_dict()
    make_label_dic()
    make_dataset()
    load_pickle("label_dic.pkl")
```

Here NER_data.txt is the output of the earlier crawling and preprocessing steps, one token and its label per line; its format is as follows:

```
安全漏洞 O
umbraco B_company
是 O
丹麥 O
umbraco B_company
公司 O
的 O
一套 O
c O
# O
編寫 O
的 O
開源 O
的 O
內容 O
管理系統 O
( O
cms O
) O
。 O
umbraco B_company
cms O
8.5 B_version
. I_version
3 E_version
版本 O
中 O
存在 O
安全漏洞 O
。 O
攻擊者 O
可 O
藉助 O
install O
package O
功能 O
利用 O
該 O
漏洞 O
上傳 O
檔案 O
, O
執行 O
程式碼 O
。 O
```

In the scripts above, pickle is Python's own serialization/deserialization module; the files it produces cannot be read or written by other languages. The basic idea is to write Python objects directly into a binary file without first converting them to strings or another format; when they are needed again, loading the file gives back the original Python objects. The module stores complex Python data structures efficiently; its drawback is that the files cannot be read from other languages.
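A quick round-trip illustration (the file name and object here are just examples):

```
import pickle

# A nested Python object is written to a binary file and restored unchanged,
# without any manual conversion to strings.
obj = {"labels": ["O", "B_company"], "counts": {"umbraco": 12}}
with open("demo.pkl", "wb") as f:
    pickle.dump(obj, f)
with open("demo.pkl", "rb") as f:
    restored = pickle.load(f)
assert restored == obj
```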


The files generated by the scripts above are listed below; a short usage sketch follows the list:

  • training set: train.txt
  • validation set: test.txt
  • token dictionary (pickle): word_dic.pkl
  • label dictionary (pickle): label_dic.pkl
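The following is a minimal usage sketch (illustrative, not part of the original scripts) showing how the generated dictionaries would typically be used to turn a tokenized sentence and its tags into index sequences, mapping unknown tokens to [UNK] and padding with [PAD]; the sentence_to_ids helper is hypothetical:

```
import pickle

with open("word_dic.pkl", "rb") as f:
    word_dic = pickle.load(f)
with open("label_dic.pkl", "rb") as f:
    label_dic = pickle.load(f)

def sentence_to_ids(tokens, tags, max_len):
    # out-of-vocabulary tokens fall back to [UNK]
    word_ids = [word_dic.get(t, word_dic["[UNK]"]) for t in tokens]
    label_ids = [label_dic[t] for t in tags]
    pad = max_len - len(tokens)
    # labels are padded with 0, matching pad_mark=0 used during training
    return word_ids + [word_dic["[PAD]"]] * pad, label_ids + [0] * pad

tokens = ["umbraco", "cms", "8.5", ".", "3"]
tags = ["B_company", "O", "B_version", "I_version", "E_version"]
print(sentence_to_ids(tokens, tags, max_len=8))
```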

Model Construction

The model uses the most basic BiLSTM+CRF structure and is implemented with TensorFlow 1.4. The code is as follows:

```
import numpy as np
import os, time, sys
import tensorflow as tf
from tensorflow.contrib.rnn import LSTMCell
from tensorflow.contrib.crf import crf_log_likelihood
from tensorflow.contrib.crf import viterbi_decode

from data_helper import batch_yield, pad_sequences
from utils import get_logger
from eval import conlleval


class BiLSTM_CRF(object):
    def __init__(self, args, embeddings, tag2label, vocab, paths, config):
        # model hyper-parameters, dictionaries and output paths
        self.batch_size = args.batch_size
        self.epoch_num = args.epoch
        self.hidden_dim = args.hidden_dim
        self.embeddings = embeddings
        self.CRF = args.CRF
        self.update_embedding = args.update_embedding
        self.dropout_keep_prob = args.dropout
        self.optimizer = args.optimizer
        self.lr = args.lr
        self.clip_grad = args.clip
        self.tag2label = tag2label
        self.num_tags = len(tag2label)
        self.vocab = vocab
        self.shuffle = args.shuffle
        self.model_path = paths['model_path']
        self.summary_path = paths['summary_path']
        self.logger = get_logger(paths['log_path'])
        self.result_path = paths['result_path']
        self.config = config

    def build_graph(self):
        # Build the computation graph.
        self.add_placeholders()  # input placeholders
        self.lookup_layer_op()   # embedding lookup: word_id -> embedding
        self.biLSTM_layer_op()   # BiLSTM sentence encoder + projection to tag scores
        self.softmax_pred_op()   # softmax prediction (only used when CRF is disabled)
        self.loss_op()           # loss (CRF log-likelihood or softmax cross-entropy)
        self.trainstep_op()      # training op
        self.init_op()           # variable initializer

    def add_placeholders(self):
        self.word_ids = tf.placeholder(tf.int32, shape=[None, None], name="word_ids")
        self.labels = tf.placeholder(tf.int32, shape=[None, None], name="labels")
        self.sequence_lengths = tf.placeholder(tf.int32, shape=[None], name="sequence_lengths")

        self.dropout_pl = tf.placeholder(dtype=tf.float32, shape=[], name="dropout")
        self.lr_pl = tf.placeholder(dtype=tf.float32, shape=[], name="lr")

    def lookup_layer_op(self):
        with tf.variable_scope("words"):
            _word_embeddings = tf.Variable(self.embeddings,
                                           dtype=tf.float32,
                                           trainable=self.update_embedding,
                                           name="_word_embeddings")
            word_embeddings = tf.nn.embedding_lookup(params=_word_embeddings,
                                                     ids=self.word_ids,
                                                     name="word_embeddings")
        self.word_embeddings = tf.nn.dropout(word_embeddings, self.dropout_pl)

    def biLSTM_layer_op(self):
        with tf.variable_scope("bi-lstm"):
            # forward and backward LSTM cells
            cell_fw = LSTMCell(self.hidden_dim)
            cell_bw = LSTMCell(self.hidden_dim)
            # forward outputs, backward outputs (final states are discarded)
            (output_fw_seq, output_bw_seq), _ = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=cell_fw,
                cell_bw=cell_bw,
                inputs=self.word_embeddings,
                sequence_length=self.sequence_lengths,
                dtype=tf.float32)
            # concatenate forward and backward outputs
            output = tf.concat([output_fw_seq, output_bw_seq], axis=-1)
            # dropout to reduce overfitting
            output = tf.nn.dropout(output, self.dropout_pl)
        # fully connected projection from hidden states to tag scores
        with tf.variable_scope("proj"):
            W = tf.get_variable(name="W",
                                shape=[2 * self.hidden_dim, self.num_tags],
                                initializer=tf.contrib.layers.xavier_initializer(),
                                dtype=tf.float32)

            b = tf.get_variable(name="b",
                                shape=[self.num_tags],
                                initializer=tf.zeros_initializer(),
                                dtype=tf.float32)

            s = tf.shape(output)
            output = tf.reshape(output, [-1, 2 * self.hidden_dim])
            pred = tf.matmul(output, W) + b

            self.logits = tf.reshape(pred, [-1, s[1], self.num_tags])

    def loss_op(self):
        # CRF loss: negative mean log-likelihood over the batch
        if self.CRF:
            log_likelihood, self.transition_params = crf_log_likelihood(inputs=self.logits,
                                                                        tag_indices=self.labels,
                                                                        sequence_lengths=self.sequence_lengths)
            self.loss = -tf.reduce_mean(log_likelihood)

        else:
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=self.logits,
                                                                    labels=self.labels)
            mask = tf.sequence_mask(self.sequence_lengths)
            losses = tf.boolean_mask(losses, mask)
            self.loss = tf.reduce_mean(losses)

        tf.summary.scalar("loss", self.loss)

    def softmax_pred_op(self):
        if not self.CRF:
            self.labels_softmax_ = tf.argmax(self.logits, axis=-1)
            self.labels_softmax_ = tf.cast(self.labels_softmax_, tf.int32)

    def trainstep_op(self):
        # optimizer selection and gradient clipping
        with tf.variable_scope("train_step"):
            self.global_step = tf.Variable(0, name="global_step", trainable=False)
            if self.optimizer == 'Adam':
                optim = tf.train.AdamOptimizer(learning_rate=self.lr_pl)
            elif self.optimizer == 'Adadelta':
                optim = tf.train.AdadeltaOptimizer(learning_rate=self.lr_pl)
            elif self.optimizer == 'Adagrad':
                optim = tf.train.AdagradOptimizer(learning_rate=self.lr_pl)
            elif self.optimizer == 'RMSProp':
                optim = tf.train.RMSPropOptimizer(learning_rate=self.lr_pl)
            elif self.optimizer == 'Momentum':
                optim = tf.train.MomentumOptimizer(learning_rate=self.lr_pl, momentum=0.9)
            elif self.optimizer == 'SGD':
                optim = tf.train.GradientDescentOptimizer(learning_rate=self.lr_pl)
            else:
                optim = tf.train.GradientDescentOptimizer(learning_rate=self.lr_pl)

            grads_and_vars = optim.compute_gradients(self.loss)
            grads_and_vars_clip = [[tf.clip_by_value(g, -self.clip_grad, self.clip_grad), v] for g, v in grads_and_vars]
            self.train_op = optim.apply_gradients(grads_and_vars_clip, global_step=self.global_step)

    def init_op(self):
        self.init_op = tf.global_variables_initializer()

    def add_summary(self, sess):
        """Merge all summaries and create a FileWriter for TensorBoard."""
        self.merged = tf.summary.merge_all()
        self.file_writer = tf.summary.FileWriter(self.summary_path, sess.graph)

    def train(self, train_data, dev_data, train_label, dev_label):
        saver = tf.train.Saver(tf.global_variables())
        with tf.Session(config=self.config) as sess:
            sess.run(self.init_op)
            self.add_summary(sess)
            for epoch in range(self.epoch_num):
                self.run_one_epoch(sess, [train_data, train_label], [dev_data, dev_label], epoch, saver)

    def test(self, test):
        saver = tf.train.Saver()
        with tf.Session(config=self.config) as sess:
            self.logger.info('=========== testing ===========')
            saver.restore(sess, self.model_path)
            label_list, seq_len_list = self.dev_one_epoch(sess, test)
            self.evaluate(label_list, seq_len_list, test)

    def demo_one(self, sess, sent):
        """Predict the tag sequence for a single sentence."""
        label_list = []
        for seqs, labels in batch_yield(sent, 1, self.vocab, self.tag2label, shuffle=False):
            label_list_, _ = self.predict_one_batch(sess, seqs)
            label_list.extend(label_list_)
        label2tag = {}
        for tag, label in self.tag2label.items():
            label2tag[label] = tag if label != 0 else label
        tag = [label2tag[label] for label in label_list[0]]
        return tag

    def run_one_epoch(self, sess, train, dev, epoch, saver):
        train_length = np.array(train).shape[1]
        num_batches = (train_length + self.batch_size - 1) // self.batch_size
        print('num_batches :{}'.format(num_batches))
        start_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
        batches = batch_yield(train, self.batch_size, self.vocab, self.tag2label, shuffle=self.shuffle)
        for step, (seqs, labels) in enumerate(batches):
            # print(' processing: {} batch / {} batches.'.format(step + 1, num_batches) + '\r')
            step_num = epoch * num_batches + step + 1
            feed_dict, _ = self.get_feed_dict(seqs, labels, self.lr, self.dropout_keep_prob)
            _, loss_train, summary, step_num_ = sess.run([self.train_op, self.loss, self.merged, self.global_step], feed_dict=feed_dict)
            if step + 1 == 1 or (step + 1) % 300 == 1 or step + 1 == num_batches:
                print('{} epoch {}, step {}, loss: {:.4}, global_step: {}'.format(start_time, epoch + 1, step + 1, loss_train, step_num))
                self.file_writer.add_summary(summary, step_num)
                saver.save(sess, self.model_path, global_step=step_num)
                print('===========validation / test===========')
                label_list_dev, seq_len_list_dev = self.dev_one_epoch(sess, dev)
                self.evaluate(label_list_dev, seq_len_list_dev, dev, epoch)

    def get_feed_dict(self, seqs, labels=None, lr=None, dropout=None):
        """Pad the batch and build the feed_dict; returns (feed_dict, seq_len_list)."""
        word_ids, seq_len_list = pad_sequences(seqs, pad_mark=0)

        feed_dict = {self.word_ids: word_ids,
                     self.sequence_lengths: seq_len_list}
        if labels is not None:
            labels_, _ = pad_sequences(labels, pad_mark=0)
            feed_dict[self.labels] = labels_
        if lr is not None:
            feed_dict[self.lr_pl] = lr
        if dropout is not None:
            feed_dict[self.dropout_pl] = dropout

        return feed_dict, seq_len_list

    def dev_one_epoch(self, sess, dev):
        """Predict tag sequences for the whole dev/test set."""
        label_list, seq_len_list = [], []
        for seqs, labels in batch_yield(dev, self.batch_size, self.vocab, self.tag2label, shuffle=False):
            label_list_, seq_len_list_ = self.predict_one_batch(sess, seqs)
            label_list.extend(label_list_)
            seq_len_list.extend(seq_len_list_)
        return label_list, seq_len_list

    def predict_one_batch(self, sess, seqs):
        """Predict one batch; returns (label_list, seq_len_list)."""
        feed_dict, seq_len_list = self.get_feed_dict(seqs, dropout=1.0)

        if self.CRF:
            logits, transition_params = sess.run([self.logits, self.transition_params],
                                                 feed_dict=feed_dict)
            label_list = []
            for logit, seq_len in zip(logits, seq_len_list):
                # Viterbi decoding over the valid (unpadded) part of each sequence
                viterbi_seq, _ = viterbi_decode(logit[:seq_len], transition_params)
                label_list.append(viterbi_seq)
            return label_list, seq_len_list

        else:
            label_list = sess.run(self.labels_softmax_, feed_dict=feed_dict)
            return label_list, seq_len_list

    def evaluate(self, label_list, seq_len_list, data, epoch=None):
        # token-level accuracy between predicted tags and ground truth
        label2tag = {}
        for tag, label in self.tag2label.items():
            label2tag[label] = tag
        label = data[1]
        total = 0
        true = 0
        for index, item in enumerate(label_list):
            predict_result = [label2tag[label_] for label_ in item]
            ground_truth = label[index]
            assert len(predict_result) == len(ground_truth)
            total += len(predict_result)
            for i in range(len(ground_truth)):
                if ground_truth[i] == predict_result[i]:
                    true += 1
        print('Evaluate accuracy is :{}'.format(true / total))
```
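The class above imports pad_sequences and batch_yield from data_helper, which is not shown in this post. Roughly, pad_sequences pads every sequence in a batch to the same length, and batch_yield converts tokens and tags to ids and yields mini-batches. The following is a minimal sketch under the assumption that the data is passed as a pair [sentences, tag_sequences]; the project's actual data_helper may differ:

```
import random

def pad_sequences(sequences, pad_mark=0):
    """Pad every sequence in the batch to the batch's maximum length."""
    max_len = max(len(seq) for seq in sequences)
    seq_list, seq_len_list = [], []
    for seq in sequences:
        seq = list(seq)
        seq_list.append(seq + [pad_mark] * (max_len - len(seq)))
        seq_len_list.append(len(seq))
    return seq_list, seq_len_list

def batch_yield(data, batch_size, vocab, tag2label, shuffle=False):
    """Map tokens/tags to ids and yield batches of (word_ids, label_ids)."""
    sentences, tag_seqs = data
    pairs = list(zip(sentences, tag_seqs))
    if shuffle:
        random.shuffle(pairs)
    seqs, labels = [], []
    for sent, tags in pairs:
        word_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sent]
        label_ids = [tag2label[tag] for tag in tags]
        if len(seqs) == batch_size:
            yield seqs, labels
            seqs, labels = [], []
        seqs.append(word_ids)
        labels.append(label_ids)
    if seqs:
        yield seqs, labels
```

Padding is done per batch rather than globally, so the sequence_lengths placeholder tells the BiLSTM and the CRF how much of each row is real input.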

I have added basic comments; if you are interested, take a look at the source code. Tomorrow's post will cover the training and testing code and explain what each part of the model code does. Thanks!