AKT (Context-Aware Knowledge Tracing): Code and Principles Explained



This post is mainly meant to help beginners (myself included) get a handle on the framework and on how the AKT code is implemented. Wherever I am unsure of my own understanding, or where I may well be wrong, I will mark it and number it; let's discuss in the comments, learn from each other, and improve together. Please don't hold back your advice!

In the AKT code there are five files you mainly need to read.

Feel free to post the acc and auc you get in the comments; I have a feeling my numbers are a bit off.

I get an AUC of about 0.90 and an ACC of a bit over 0.81.

Contents

prepare_dataset.ipynb

The outer AKT.py file

load_data.py

Back to the outer AKT.py

The AKT file in the EduKTM package

The loss function

The AUC metric

The ACC metric

train_one_epoch

The AKTNet.py file

There is also an order to reading the code: of these five files, start with prepare_dataset.ipynb, i.e. first see how it processes the data and what form the data ends up in. It doesn't matter if you don't yet understand how AKT itself is built; this file only prepares the data.

prepare_dataset.ipynb

A .ipynb file is the Jupyter Notebook file format. Calling between .ipynb and .py files is not a problem, but Jupyter Notebook runs .ipynb files while PyCharm runs .py files, so the entry-point file has to match what the tool expects. How do you work around this? The simplest way is to create a new .py file and paste the contents of the .ipynb file into it (and the same in reverse). #1: Are there other ways?
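One other option, assuming Jupyter is installed, is to let nbconvert export the notebook to a script from the command line:

jupyter nbconvert --to script prepare_dataset.ipynb

This writes a prepare_dataset.py next to the notebook, which PyCharm can then run directly.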

# -*- coding: utf-8 -*-
from EduData import get_data
get_data("assistment-2009-2010-skill", "../../data")

EduData here is a published package; if you don't have it, pip install it first.

The details of EduData are documented here: https://edudata.readthedocs.io/en/latest/tutorial/zh/DataSet.html

The code above downloads the assistment-2009-2010-skill dataset and saves it in the folder at the relative path ../../data.

# -*- coding: utf-8 -*-
import random
import pandas as pd
import tqdm
data = pd.read_csv(
    '../../data/2009_skill_builder_data_corrected/skill_builder_data_corrected.csv',
    usecols=['order_id', 'user_id', 'skill_id', 'problem_id', 'correct']
).dropna(subset=['skill_id', 'problem_id'])

When reading the file, only five of its columns are used, and rows whose skill_id or problem_id is empty are dropped.

pd.read_csv returns data of type DataFrame.

Let's take a look:

print(data.head(10))

'''
This is a example for pid.
If the dataset you use doesn't have the field of problem id, please remove problem id used in this example.
'''
raw_skill = data.skill_id.unique().tolist()
raw_problem = data.problem_id.unique().tolist()
num_skill = len(raw_skill)
n_problem = len(raw_problem)
# question id from 1 to #num_skill
skills = { p: i+1 for i, p in enumerate(raw_skill) }
problems = { p: i+1 for i, p in enumerate(raw_problem) }
print("number of skills: %d" % num_skill)
print("number of problems: %d" % n_problem)

Output:

number of skills: 123
number of problems: 17751

This step renumbers all the skills and problems. They do already have IDs of their own, but as the data shown above illustrates, there is a problem_id of 51424 while the total number of problems is only 17751, so the original IDs are not a contiguous numbering starting from 0 or 1. Using the raw IDs would still work; the problem is that embedding and one-hot indices start from 0, so you would end up with a large number of unused parameters, or the one-hot vectors would become needlessly long.

So the skills and problems are renumbered.
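As a toy illustration of the dict comprehension (the raw IDs below are made up), each raw ID is mapped to a compact index starting from 1:

raw_problem = [51424, 7, 99]     # hypothetical raw problem ids
problems = {p: i + 1 for i, p in enumerate(raw_problem)}
print(problems)                  # {51424: 1, 7: 2, 99: 3}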

def parse_all_seq(students):
    all_sequences = []
    for student_id in tqdm.tqdm(students, 'parse student sequence\t'):
        student_sequence = parse_student_seq(data[data.user_id == student_id])
        all_sequences.extend([student_sequence])
    return all_sequences
def parse_student_seq(student):
    seq = student.sort_values('order_id')
    s = [skills[q] for q in seq.skill_id.tolist()]
    p = [problems[q] for q in seq.problem_id.tolist()]
    a = seq.correct.tolist()
    return s, p, a
# [(skill_seq_0, problem_seq_0, answer_seq_0), ..., (skill_seq_n, problem_seq_n, answer_seq_n)]
sequences = parse_all_seq(data.user_id.unique())
print(len(data.user_id.unique()))

The order to read this code in is: the second-to-last line (sequences = ...), and then the parse_all_seq function.

It separates the records by student and stores each student's information as three sequences: the sequence of skills (knowledge points) the student practiced, the sequence of problems attempted, and the corresponding sequence of right/wrong answers.

all_sequences.extend([student_sequence]) then stores each student's three sequences together as one element of sequences (a toy example follows).
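A toy example of what one element of sequences looks like (hypothetical values, after renumbering):

# (skill_seq, problem_seq, answer_seq) for one hypothetical student
student_sequence = ([3, 3, 7], [12, 13, 98], [1, 0, 1])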

def train_test_split(data, train_size=.7, shuffle=True):
    if shuffle:
        random.shuffle(data)
    boundary = round(len(data) * train_size)
    return data[: boundary], data[boundary:]
train_sequences, test_sequences = train_test_split(sequences)

The constructed sequences are split into a training set and a test set at a ratio of 7:3.

def sequences2l(sequences, trgpath):
    with open(trgpath, 'a', encoding='utf8') as f:
        for seq in tqdm.tqdm(sequences, 'write into file: '):
            skills, problems, answers = seq
            seq_len = len(skills)
            f.write(str(seq_len) + '\n')
            f.write(','.join([str(q) for q in problems]) + '\n')
            f.write(','.join([str(q) for q in skills]) + '\n')
            f.write(','.join([str(a) for a in answers]) + '\n')
# save triple line format for other tasks
sequences2l(train_sequences, '../../data/2009_skill_builder_data_corrected/train_pid.txt')
sequences2l(test_sequences, '../../data/2009_skill_builder_data_corrected/test_pid.txt')

The training and test data are written to text files for storage.

On top of the original three lines per student, a fourth line is added: the length of that student's practice sequence (students attempt different numbers of exercises, so the sequences have different lengths). Each of a student's sequences is written on one line, with the elements converted to str and separated by ','. There is no separator between students; they are distinguished purely by line count. An example record is shown below.
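One student's record in train_pid.txt therefore looks like this (hypothetical values): the sequence length, then the problem ids, then the skill ids, then the answers, one line each.

3
12,13,98
3,3,7
1,0,1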

The outer AKT.py file

The remaining files are related as container and contained, or rather as caller and callee, so we will read them in an interleaved way.

# coding: utf-8
# 2021/8/5 @ zengxiaonan
from load_data import DATA, PID_DATA
import logging
from EduKTM import AKT
batch_size = 64
model_type = 'pid'
n_question = 123
n_pid = 17751
seqlen = 200
n_blocks = 1
d_model = 256
dropout = 0.05
kq_same = 1
l2 = 1e-5
maxgradnorm = -1

load_data is a module written for this repo; how it is used is covered later.

logging is a Python standard-library module for emitting logs; nothing here would change without it.

For example, logging's debug feature, as illustrated in this article:

https://blog.csdn.net/Nana8874/article/details/126041032

import logging
# 配置logger并设置等级为DEBUG
logger = logging.getLogger('logging_debug')
logger.setLevel(logging.DEBUG)
# 配置控制台Handler并设置等级为DEBUG
consoleHandler = logging.StreamHandler()
consoleHandler.setLevel(logging.DEBUG)
# 将Handler加入logger
logger.addHandler(consoleHandler)
logger.debug('This is a logging.debug')

Output:

This is a logging.debug

This is nowhere near as handy as PyCharm's debugger; it just prints a predefined message at debug time, and I don't see what such "debugging" is good for. #2: Can someone show me how logging is really meant to be used?
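Where logging starts to pay off over print() is when you want levels, timestamps, and output to a file. A minimal sketch (the file name and values are just examples):

import logging

# write INFO and above to a file, with a timestamp and level on every line
logging.basicConfig(
    filename='train.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)
logging.info('epoch %d finished, auc=%.4f', 3, 0.87)        # hypothetical values
logging.debug('this line is NOT written, since the level is INFO')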

OK, moving on.

# coding: utf-8
# 2021/8/5 @ zengxiaonan
from load_data import DATA, PID_DATA
import logging
from EduKTM import AKT
batch_size = 64
model_type = 'pid'
n_question = 123
n_pid = 17751
seqlen = 200
n_blocks = 1
d_model = 256
dropout = 0.05
kq_same = 1
l2 = 1e-5
maxgradnorm = -1

EduKTM is the repo's own package; it contains the concrete code of the AKT model.

if model_type == 'pid':
    dat = PID_DATA(n_question=n_question, seqlen=seqlen, separate_char=',')
else:
    dat = DATA(n_question=n_question, seqlen=seqlen, separate_char=',')
train_data = dat.load_data('../../data/2009_skill_builder_data_corrected/train_pid.txt')
test_data = dat.load_data('../../data/2009_skill_builder_data_corrected/test_pid.txt')

From the previous code block we can see that model_type is assigned 'pid'.

So whether PID_DATA or DATA is used is already decided: PID_DATA handles the four-lines-per-student format, DATA the three-lines format. In prepare_dataset (above) each student was already stored as four lines. The concrete usage is in the load_data.py file below.

Since model_type is 'pid', the PID_DATA class from the load_data module is used, so let's look at load_data.py now.

load_data.py

import math          # needed for math.floor below
import numpy as np   # needed for np.zeros below


class PID_DATA(object):
    def __init__(self, n_question, seqlen, separate_char):
        self.separate_char = separate_char
        self.seqlen = seqlen
        self.n_question = n_question
    # data format
    # length
    # pid1, pid2, ...
    # 1,1,1,1,7,7,9,10,10,10,10,11,11,45,54
    # 0,1,1,1,1,1,0,0,1,1,1,1,1,0,0
    def load_data(self, path):
        f_data = open(path, 'r')
        q_data = []
        qa_data = []
        p_data = []
        for lineID, line in enumerate(f_data):
            line = line.strip()
            if lineID % 4 == 2:
                Q = line.split(self.separate_char)
                if len(Q[len(Q) - 1]) == 0:
                    Q = Q[:-1]
                # print(len(Q))
            if lineID % 4 == 1:
                P = line.split(self.separate_char)
                if len(P[len(P) - 1]) == 0:
                    P = P[:-1]
            elif lineID % 4 == 3:
                A = line.split(self.separate_char)
                if len(A[len(A) - 1]) == 0:
                    A = A[:-1]
                # print(len(A),A)
                # start split the data
                n_split = 1
                # print('len(Q):',len(Q))
                if len(Q) > self.seqlen:
                    n_split = math.floor(len(Q) / self.seqlen)
                    if len(Q) % self.seqlen:
                        n_split = n_split + 1
                # print('n_split:',n_split)
                for k in range(n_split):
                    question_sequence = []
                    problem_sequence = []
                    answer_sequence = []
                    if k == n_split - 1:
                        endINdex = len(A)
                    else:
                        endINdex = (k + 1) * self.seqlen
                    for i in range(k * self.seqlen, endINdex):
                        if len(Q[i]) > 0:
                            Xindex = int(Q[i]) + int(A[i]) * self.n_question
                            question_sequence.append(int(Q[i]))
                            problem_sequence.append(int(P[i]))
                            answer_sequence.append(Xindex)
                        else:
                            print(Q[i])
                    q_data.append(question_sequence)
                    qa_data.append(answer_sequence)
                    p_data.append(problem_sequence)
        f_data.close()
        # data: [[],[],[],...] <-- set_max_seqlen is used
        # convert data into ndarrays for better speed during training
        q_dataArray = np.zeros((len(q_data), self.seqlen))
        for j in range(len(q_data)):
            dat = q_data[j]
            q_dataArray[j, :len(dat)] = dat
        qa_dataArray = np.zeros((len(qa_data), self.seqlen))
        for j in range(len(qa_data)):
            dat = qa_data[j]
            qa_dataArray[j, :len(dat)] = dat
        p_dataArray = np.zeros((len(p_data), self.seqlen))
        for j in range(len(p_data)):
            dat = p_data[j]
            p_dataArray[j, :len(dat)] = dat
        return q_dataArray, qa_dataArray, p_dataArray

line.strip() removes the given characters from the beginning and end of the string; with no argument it strips whitespace, including the trailing '\n' newline.

Let's go through it piece by piece.

            if lineID % 4 == 2:
                Q = line.split(self.separate_char)
                if len(Q[len(Q) - 1]) == 0:
                    Q = Q[:-1]
                # print(len(Q))
            if lineID % 4 == 1:
                P = line.split(self.separate_char)
                if len(P[len(P) - 1]) == 0:
                    P = P[:-1]
            elif lineID % 4 == 3:
                A = line.split(self.separate_char)
                if len(A[len(A) - 1]) == 0:
                    A = A[:-1]
                # print(len(A),A)

The first if (lineID % 4 == 2) reads the skill (knowledge-point) ids, the second if (lineID % 4 == 1) reads the problem ids, and the final elif (lineID % 4 == 3) reads the answers the student gave.

  if len(P[len(P) - 1]) == 0:
      P = P[:-1]

Each branch contains a nested if whose job is: if the last element is '' (an empty string with nothing in it), drop it. #3: Although I don't know why a '' element would appear in the first place.
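My guess (an assumption on my part) is that some preprocessed files end a line with a trailing separator, and str.split then yields an empty string at the end:

line = '1,7,9,'             # hypothetical line ending with a trailing comma
print(line.split(','))      # ['1', '7', '9', '']  ->  the last element is ''
print('1,7,9'.split(','))   # ['1', '7', '9']      ->  nothing to strip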

                # start split the data
                n_split = 1
                # print('len(Q):',len(Q))
                if len(Q) > self.seqlen:
                    n_split = math.floor(len(Q) / self.seqlen)
                    if len(Q) % self.seqlen:
                        n_split = n_split + 1
                # print('n_split:',n_split)
                for k in range(n_split):
                    question_sequence = []
                    problem_sequence = []
                    answer_sequence = []
                    if k == n_split - 1:
                        endINdex = len(A)
                    else:
                        endINdex = (k + 1) * self.seqlen
                    for i in range(k * self.seqlen, endINdex):
                        if len(Q[i]) > 0:
                            Xindex = int(Q[i]) + int(A[i]) * self.n_question
                            question_sequence.append(int(Q[i]))
                            problem_sequence.append(int(P[i]))
                            answer_sequence.append(Xindex)
                        else:
                            print(Q[i])
                    q_data.append(question_sequence)
                    qa_data.append(answer_sequence)
                    p_data.append(problem_sequence)

The data is split into chunks according to seqlen, which is 200 here.

The answers are also re-encoded: Xindex = int(Q[i]) + int(A[i]) * self.n_question. If the answer is 0 the value is just the skill id; if the answer is 1 the value is the skill id plus n_question (123 here, not 200). For example, skill 45 answered correctly becomes 45 + 123 = 168.

At this point the sequences are not yet padded to a common length, although most of them are 200 long.

        q_dataArray = np.zeros((len(q_data), self.seqlen))
        for j in range(len(q_data)):
            dat = q_data[j]
            q_dataArray[j, :len(dat)] = dat
        qa_dataArray = np.zeros((len(qa_data), self.seqlen))
        for j in range(len(qa_data)):
            dat = qa_data[j]
            qa_dataArray[j, :len(dat)] = dat
        p_dataArray = np.zeros((len(p_data), self.seqlen))
        for j in range(len(p_data)):
            dat = p_data[j]
            p_dataArray[j, :len(dat)] = dat
        return q_dataArray, qa_dataArray, p_dataArray

Because some sequences are shorter than seq_len (200), all of the data is now padded with zeros to the same length of 200.

Printing the shape of q_data:

Training data: (3470, 200)

Test data: (1507, 200)

Back to the outer AKT.py

from EduKTM import AKT
akt = AKT(n_question, n_pid, n_blocks, d_model, dropout, kq_same, l2, batch_size, maxgradnorm)
akt.train(train_data, test_data, epoch=10)
akt.save("akt.params")
akt.load("akt.params")
auc, accuracy, _ = akt.eval(test_data,1)
print("auc: %.6f, accuracy: %.6f" % (auc, accuracy))

Lines 2 and 3 mean: create an instance akt of the AKT class from the EduKTM package, and call that instance's train method.

So now we need to look at the AKT file inside the EduKTM package.

The AKT file in the EduKTM package

# coding: utf-8
# 2021/7/15 @ sone
import logging
import math
import torch
import numpy as np
from sklearn import metrics
import tqdm
from EduKTM import KTM
from .AKTNet import AKTNet
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def binary_entropy(target, pred):
    loss = target * np.log(np.maximum(1e-10, pred)) + (1.0 - target) * np.log(np.maximum(1e-10, 1.0 - pred))
    return np.average(loss) * -1.0
def compute_auc(all_target, all_pred):
    return metrics.roc_auc_score(all_target, all_pred)
def compute_accuracy(all_target, all_pred):
    all_pred[all_pred > 0.5] = 1.0
    all_pred[all_pred <= 0.5] = 0.0
    return metrics.accuracy_score(all_target, all_pred)
def train_one_epoch(net, params, optimizer, q_data, qa_data, pid_data ,id):
    net.train()
    pid_flag, batch_size, n_question, maxgradnorm = (
        params['is_pid'], params['batch_size'], params['n_question'], params['maxgradnorm'])
    n = int(math.ceil(len(q_data) / batch_size))
    q_data = q_data.T
    qa_data = qa_data.T
    # shuffle the data
    shuffled_ind = np.arange(q_data.shape[1])
    np.random.shuffle(shuffled_ind)
    q_data = q_data[:, shuffled_ind]
    qa_data = qa_data[:, shuffled_ind]
    if pid_flag:
        pid_data = pid_data.T
        pid_data = pid_data[:, shuffled_ind]
    pred_list = []
    target_list = []
    true_el = 0
    for idx in tqdm.tqdm(range(n),"Train Epoch %s" % id):
        optimizer.zero_grad()
        q_one_seq = q_data[:, idx * batch_size: (idx + 1) * batch_size]
        qa_one_seq = qa_data[:, idx * batch_size: (idx + 1) * batch_size]
        input_q = np.transpose(q_one_seq[:, :])
        input_qa = np.transpose(qa_one_seq[:, :])
        target = np.transpose(qa_one_seq[:, :])
        target = (target - 1) / n_question
        target_1 = np.floor(target)
        input_q = torch.from_numpy(input_q).long().to(device)
        input_qa = torch.from_numpy(input_qa).long().to(device)
        target = torch.from_numpy(target_1).float().to(device)
        if pid_flag:
            pid_one_seq = pid_data[:, idx * batch_size: (idx + 1) * batch_size]
            input_pid = np.transpose(pid_one_seq[:, :])
            input_pid = torch.from_numpy(input_pid).long().to(device)
            loss, pred, true_ct = net(input_q, input_qa, target, input_pid)
        else:
            loss, pred, true_ct = net(input_q, input_qa, target)
        pred = pred.detach().cpu().numpy()
        loss.backward()
        true_el += true_ct.cpu().numpy()
        if maxgradnorm > 0.:
            torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=maxgradnorm)
        optimizer.step()
        # correct: 1.0; wrong 0.0; padding -1.0
        target = target_1.reshape((-1,))
        nopadding_index = np.flatnonzero(target >= -0.9)
        nopadding_index = nopadding_index.tolist()
        pred_nopadding = pred[nopadding_index]
        target_nopadding = target[nopadding_index]
        pred_list.append(pred_nopadding)
        target_list.append(target_nopadding)
    all_pred = np.concatenate(pred_list, axis=0)
    all_target = np.concatenate(target_list, axis=0)
    loss = binary_entropy(all_target, all_pred)
    auc = compute_auc(all_target, all_pred)
    accuracy = compute_accuracy(all_target, all_pred)
    return loss, auc, accuracy
def test_one_epoch(net, params, q_data, qa_data, pid_data ,id):
    pid_flag, batch_size, n_question = params['is_pid'], params['batch_size'], params['n_question']
    net.eval()
    n = int(math.ceil(len(q_data) / batch_size))
    q_data = q_data.T
    qa_data = qa_data.T
    if pid_flag:
        pid_data = pid_data.T
    seq_num = q_data.shape[1]
    pred_list = []
    target_list = []
    count = 0
    true_el = 0
    for idx in tqdm.tqdm(range(n),'Test Epoch %s' % id):
        q_one_seq = q_data[:, idx * batch_size: (idx + 1) * batch_size]
        qa_one_seq = qa_data[:, idx * batch_size: (idx + 1) * batch_size]
        input_q = np.transpose(q_one_seq[:, :])
        input_qa = np.transpose(qa_one_seq[:, :])
        target = np.transpose(qa_one_seq[:, :])
        target = (target - 1) / n_question
        target_1 = np.floor(target)
        input_q = torch.from_numpy(input_q).long().to(device)
        input_qa = torch.from_numpy(input_qa).long().to(device)
        target = torch.from_numpy(target_1).float().to(device)
        if pid_flag:
            pid_one_seq = pid_data[:, idx * batch_size: (idx + 1) * batch_size]
            input_pid = np.transpose(pid_one_seq[:, :])
            input_pid = torch.from_numpy(input_pid).long().to(device)
        with torch.no_grad():
            if pid_flag:
                loss, pred, ct = net(input_q, input_qa, target, input_pid)
            else:
                loss, pred, ct = net(input_q, input_qa, target)
        pred = pred.cpu().numpy()
        true_el += ct.cpu().numpy()
        if (idx + 1) * batch_size > seq_num:
            real_batch_size = seq_num - idx * batch_size
            count += real_batch_size
        else:
            count += batch_size
        # correct: 1.0; wrong 0.0; padding -1.0
        target = target_1.reshape((-1,))
        nopadding_index = np.flatnonzero(target >= -0.9)
        nopadding_index = nopadding_index.tolist()
        pred_nopadding = pred[nopadding_index]
        target_nopadding = target[nopadding_index]
        pred_list.append(pred_nopadding)
        target_list.append(target_nopadding)
    assert count == seq_num, 'Seq not matching'
    all_pred = np.concatenate(pred_list, axis=0)
    all_target = np.concatenate(target_list, axis=0)
    loss = binary_entropy(all_target, all_pred)
    auc = compute_auc(all_target, all_pred)
    accuracy = compute_accuracy(all_target, all_pred)
    return loss, auc, accuracy
class AKT(KTM):
    def __init__(self, n_question, n_pid, n_blocks, d_model, dropout, kq_same, l2, batch_size, maxgradnorm,
                 separate_qa=False):
        super(AKT, self).__init__()
        self.params = {
            'is_pid': n_pid > 0,
            'batch_size': batch_size,
            'n_question': n_question,
            'maxgradnorm': maxgradnorm,
        }
        self.akt_net = AKTNet(n_question=n_question, n_pid=n_pid, n_blocks=n_blocks, d_model=d_model, dropout=dropout,
                              kq_same=kq_same, l2=l2, separate_qa=separate_qa).to(device)
    def train(self, train_data, test_data=None, *, epoch: int, lr=0.002) -> ...:
        optimizer = torch.optim.Adam(self.akt_net.parameters(), lr=lr, betas=(0.0, 0.999), eps=1e-8)
        maxAcc=0
        maxAuc=0
        minLoss=100
        for idx in range(epoch):
            train_loss, train_accuracy, train_acc = train_one_epoch(self.akt_net, self.params, optimizer, *train_data,idx)
            print("[Epoch %d] LogisticLoss: %.6f" % (idx, train_loss))
            if test_data is not None:
                valid_loss, valid_accuracy, valid_acc = self.eval(test_data,idx)
                print("[Epoch %d] loss:%.6f,auc: %.6f, accuracy: %.6f" % (idx, valid_loss,valid_acc, valid_accuracy))
                if valid_loss < minLoss:
                    minLoss = valid_loss
                if valid_acc > maxAcc:
                    maxAcc = valid_acc
                if valid_accuracy > maxAuc:
                    maxAuc = valid_accuracy
        print('minimum loss: %f' % minLoss)
        print('maximum auc: %f' % maxAuc)
        print('maximum acc: %f' % maxAcc)
    def eval(self, test_data,id) -> ...:
        self.akt_net.eval()
        return test_one_epoch(self.akt_net, self.params, *test_data,id)
    def save(self, filepath) -> ...:
        torch.save(self.akt_net.state_dict(), filepath)
        logging.info("save parameters to %s" % filepath)
    def load(self, filepath) -> ...:
        self.akt_net.load_state_dict(torch.load(filepath))
        logging.info("load parameters from %s" % filepath)

This file contains five functions (def) and one class (class).

Let's analyze them one by one.

The loss function

def binary_entropy(target, pred):
    loss = target * np.log(np.maximum(1e-10, pred)) + (1.0 - target) * np.log(np.maximum(1e-10, 1.0 - pred))
    return np.average(loss) * -1.0

Binary cross-entropy is the loss function for binary classification. Packaged implementations exist and could be called directly, but here the author re-implements it in NumPy (it is only used for reporting the evaluation loss, not for backpropagation).
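For reference (my own comparison, not something the repo does), sklearn's log_loss computes the same quantity, up to the 1e-10 clipping:

import numpy as np
from sklearn import metrics

target = np.array([1., 0., 1., 1.])        # hypothetical labels
pred = np.array([0.9, 0.2, 0.6, 0.4])      # hypothetical predicted probabilities
print(metrics.log_loss(target, pred))      # should match binary_entropy(target, pred)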

The AUC metric

def compute_auc(all_target, all_pred):
    return metrics.roc_auc_score(all_target, all_pred)

This simply calls roc_auc_score from sklearn.metrics.

The ACC metric

def compute_accuracy(all_target, all_pred):
    all_pred[all_pred > 0.5] = 1.0
    all_pred[all_pred <= 0.5] = 0.0
    return metrics.accuracy_score(all_target, all_pred)

This is also called directly from sklearn.metrics.

However, accuracy_score expects discrete class labels for both arguments; in this data the targets are already the integers 0 or 1.

The predictions, on the other hand, are probabilities between 0 and 1. The two middle lines threshold them: a prediction greater than 0.5 counts as predicting the student will answer correctly, and one less than or equal to 0.5 as predicting they will not. A different threshold would of course give a different ACC.

train_one_epoch

def train_one_epoch(net, params, optimizer, q_data, qa_data, pid_data ,id):
    net.train()
    pid_flag, batch_size, n_question, maxgradnorm = (
        params['is_pid'], params['batch_size'], params['n_question'], params['maxgradnorm'])
    n = int(math.ceil(len(q_data) / batch_size))
    q_data = q_data.T
    qa_data = qa_data.T
    # shuffle the data
    shuffled_ind = np.arange(q_data.shape[1])
    np.random.shuffle(shuffled_ind)
    q_data = q_data[:, shuffled_ind]
    qa_data = qa_data[:, shuffled_ind]
    if pid_flag:
        pid_data = pid_data.T
        pid_data = pid_data[:, shuffled_ind]
    pred_list = []
    target_list = []
    true_el = 0
    for idx in tqdm.tqdm(range(n),"Train Epoch %s" % id):
        optimizer.zero_grad()
        q_one_seq = q_data[:, idx * batch_size: (idx + 1) * batch_size]
        qa_one_seq = qa_data[:, idx * batch_size: (idx + 1) * batch_size]
        input_q = np.transpose(q_one_seq[:, :])
        input_qa = np.transpose(qa_one_seq[:, :])
        target = np.transpose(qa_one_seq[:, :])
        target = (target - 1) / n_question
        target_1 = np.floor(target)
        input_q = torch.from_numpy(input_q).long().to(device)
        input_qa = torch.from_numpy(input_qa).long().to(device)
        target = torch.from_numpy(target_1).float().to(device)
        if pid_flag:
            pid_one_seq = pid_data[:, idx * batch_size: (idx + 1) * batch_size]
            input_pid = np.transpose(pid_one_seq[:, :])
            input_pid = torch.from_numpy(input_pid).long().to(device)
            loss, pred, true_ct = net(input_q, input_qa, target, input_pid)
        else:
            loss, pred, true_ct = net(input_q, input_qa, target)
        pred = pred.detach().cpu().numpy()
        loss.backward()
        true_el += true_ct.cpu().numpy()
        if maxgradnorm > 0.:
            torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=maxgradnorm)
        optimizer.step()
        # correct: 1.0; wrong 0.0; padding -1.0
        target = target_1.reshape((-1,))
        nopadding_index = np.flatnonzero(target >= -0.9)
        nopadding_index = nopadding_index.tolist()
        pred_nopadding = pred[nopadding_index]
        target_nopadding = target[nopadding_index]
        pred_list.append(pred_nopadding)
        target_list.append(target_nopadding)
    all_pred = np.concatenate(pred_list, axis=0)
    all_target = np.concatenate(target_list, axis=0)
    loss = binary_entropy(all_target, all_pred)
    auc = compute_auc(all_target, all_pred)
    accuracy = compute_accuracy(all_target, all_pred)
    return loss, auc, accuracy

That is the whole function; below we read through its key parts.

net.train()

net.train() puts the network into training mode (which matters for layers such as Dropout): throughout train_one_epoch the model's parameters will be adjusted to better fit the data.

    pid_flag, batch_size, n_question, maxgradnorm = (
        params['is_pid'], params['batch_size'], params['n_question'], params['maxgradnorm'])
    n = int(math.ceil(len(q_data) / batch_size))
    q_data = q_data.T
    qa_data = qa_data.T
    # shuffle the data
    shuffled_ind = np.arange(q_data.shape[1])
    np.random.shuffle(shuffled_ind)
    q_data = q_data[:, shuffled_ind]
    qa_data = qa_data[:, shuffled_ind]
    if pid_flag:
        pid_data = pid_data.T
        pid_data = pid_data[:, shuffled_ind]
    pred_list = []
    target_list = []
    true_el = 0
    for idx in tqdm.tqdm(range(n),"Train Epoch %s" % id):
        optimizer.zero_grad()
        q_one_seq = q_data[:, idx * batch_size: (idx + 1) * batch_size]
        qa_one_seq = qa_data[:, idx * batch_size: (idx + 1) * batch_size]
        input_q = np.transpose(q_one_seq[:, :])
        input_qa = np.transpose(qa_one_seq[:, :])
        target = np.transpose(qa_one_seq[:, :])
        target = (target - 1) / n_question
        target_1 = np.floor(target)
        input_q = torch.from_numpy(input_q).long().to(device)
        input_qa = torch.from_numpy(input_qa).long().to(device)
        target = torch.from_numpy(target_1).float().to(device)

Notice that there is a sizeable chunk of code before the network (net) is ever called, so what is it doing?

q_data = q_data.T

.T transposes the array: q_data was (3470, 200) and is now (200, 3470).

For example:

Input:

import numpy as np
a=[1,2,3,4,5]
print('a')
print(a)
b=np.zeros((5,10))
print('b')
print(b)
for i in range(len(a)):
    b[i,:]=a[i]
print('b')
print(b)
c=b.T
print('c=b.T')
print(c)

Output:

a
[1, 2, 3, 4, 5]
b
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
b
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
 [4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
 [5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]]
c=b.T
[[1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]
 [1. 2. 3. 4. 5.]]
n = int(math.ceil(len(q_data) / batch_size))

This n is how many batches the data is split into. But why split into batches at all, if everything ends up going through the model anyway?

This comes down to the different flavors of gradient descent:

When we compute a loss in order to do gradient descent and update the weights, there are a few options. Using all of the samples to compute the loss is batch gradient descent (BGD); randomly picking a single sample is stochastic gradient descent (SGD); and in between sits mini-batch gradient descent (MBGD), which uses a small batch at a time. (Paraphrased from another article; the link was in the original post.)

That is why the data is split into batches. But how do you tell from the code which of the three is being used?

Searching the code for the lines that involve loss (shown as a screenshot in the original post), we eventually land at nn.BCEWithLogitsLoss.

BCELoss and BCEWithLogitsLoss compute the same binary cross-entropy; the difference is that BCEWithLogitsLoss applies a sigmoid to the raw outputs (logits) internally before computing the loss, which is more numerically stable. BCE stands for binary cross-entropy, the loss for binary classification. So I think BGD/SGD/MBGD is a broad categorization of how data is fed in; which one this code actually is depends on how the loss is computed over batches, and we'll return to it when we reach the loss computation.

    shuffled_ind = np.arange(q_data.shape[1])
    np.random.shuffle(shuffled_ind)
    q_data = q_data[:, shuffled_ind]
    qa_data = qa_data[:, shuffled_ind]
    if pid_flag:
        pid_data = pid_data.T
        pid_data = pid_data[:, shuffled_ind]

These seven lines shuffle the data.

In fact, when using a DataLoader this can be done with a single argument:

Just pass shuffle=True to the DataLoader; the seven lines above are what that option does under the hood. A minimal sketch follows.
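A rough sketch of the DataLoader route (this is not what the repo does, just the equivalent idea; the tensors are placeholders):

import torch
from torch.utils.data import TensorDataset, DataLoader

# hypothetical tensors standing in for q_data / qa_data
q = torch.randint(1, 124, (3470, 200))
qa = torch.randint(1, 247, (3470, 200))

loader = DataLoader(TensorDataset(q, qa), batch_size=64, shuffle=True)
for q_batch, qa_batch in loader:   # each epoch sees the data in a new random order
    pass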

optimizer.zero_grad()

This clears the previously accumulated gradients. But why do gradients need to be cleared, and where do they accumulate?

1. Because of PyTorch's dynamic computation graph, gradients are not cleared automatically when we call loss.backward() and optimizer.step() to update the parameters; the two are independent operations.

2. backward(): backpropagation, which computes the gradients.

3. step(): updates the weight parameters.

Given the above, one characteristic of PyTorch is that each step is an independent operation, hence the need to clear gradients explicitly: if you do not call optimizer.zero_grad(), backward() will keep accumulating gradients on top of the old ones.

The passage above answers the question nicely; the explanation comes from this link:
https://blog.csdn.net/WhiffeYF/article/details/105053952

q_one_seq = q_data[:, idx * batch_size: (idx + 1) * batch_size]
qa_one_seq = qa_data[:, idx * batch_size: (idx + 1) * batch_size]

These two lines pick out the data for the current batch. (I initially thought each epoch trained on only one batch, but the loop over range(n) visits all n batches, so every epoch still sees the entire dataset. As for going out of range: if (idx + 1) * batch_size exceeds the number of sequences, NumPy slicing simply returns whatever columns remain, so the last, smaller batch is handled automatically.)

q_one_seq has shape (200, 64)

qa_one_seq has shape (200, 64)

input_q = np.transpose(q_one_seq[:, :])
input_qa = np.transpose(qa_one_seq[:, :])
target = np.transpose(qa_one_seq[:, :])
target = (target - 1) / n_question
target_1 = np.floor(target)

Recall how the answers were encoded earlier (qa = skill id + answer * n_question); now let's work out what target = (target - 1) / n_question means.

Honestly some of this code feels quite redundant to me; why not simply pass the labels in directly, rather than encoding them and then decoding them again......

target = (target - 1) / n_question: at a high level, this turns the encoded qa values back into 0 or 1, with 0 meaning the student answered wrong and 1 meaning correct. Note that '/' here is ordinary float division; the rounding down happens on the next line with np.floor. Why subtract 1? It guards against one edge case: when int(Q[i]) equals the largest skill id, i.e. n_question itself, and the answer is correct (A[i] = 1), then Xindex = 2 * n_question; without the -1 this would floor to 2 instead of 1. Subtracting 1 avoids exactly that.

target_1 = np.floor(target)

np.floor rounds those fractions down, so padding positions (qa = 0) become -1, wrong answers become 0, and correct answers become 1; the result is later converted to a float tensor because the loss expects float targets. A small worked example follows.
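A worked example of the decoding with n_question = 123 (the qa values are chosen for illustration):

import numpy as np

n_question = 123
qa = np.array([0, 50, 173])        # padding, skill 50 answered wrong, skill 50 answered right
target = (qa - 1) / n_question     # [-0.008..., 0.398..., 1.398...]
print(np.floor(target))            # [-1.  0.  1.]  ->  padding / wrong / correct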

input_q = torch.from_numpy(input_q).long().to(device)
input_qa = torch.from_numpy(input_qa).long().to(device)
target = torch.from_numpy(target_1).float().to(device)

These lines convert the data we need into torch tensors; .to(device) decides whether they live on the CPU or the GPU.

torch.from_numpy() turns a NumPy array into a tensor, and the two share memory: modifying the tensor (for example re-assigning values) changes the original array as well.

"Array" here means a NumPy array; a plain Python list cannot be passed to from_numpy.
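A quick check of the shared memory (a toy demonstration, not from the repo):

import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)   # t shares a's memory buffer
t[0] = 5.0
print(a)                  # [5. 0. 0.]  ->  the NumPy array changed as well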

        if pid_flag:
            pid_one_seq = pid_data[:, idx * batch_size: (idx + 1) * batch_size]
            input_pid = np.transpose(pid_one_seq[:, :])
            input_pid = torch.from_numpy(input_pid).long().to(device)
            loss, pred, true_ct = net(input_q, input_qa, target, input_pid)
        else:
            loss, pred, true_ct = net(input_q, input_qa, target)

At last we reach the place where net is used.

What does the if pid_flag: ... else branch mean?

There are three related places in the code (shown as a screenshot in the original post). From them you can see the data is split into skills and problems, so p refers to problems, and pid_flag simply indicates whether the problem-id information is used.

pid_one_seq = pid_data[:, idx * batch_size: (idx + 1) * batch_size]
input_pid = np.transpose(pid_one_seq[:, :])
input_pid = torch.from_numpy(input_pid).long().to(device)

If it is used, pid_data first goes through the same processing as above,

and is then fed into net.

The AKTNet.py file

self.akt_net = AKTNet(n_question=n_question, n_pid=n_pid, n_blocks=n_blocks, d_model=d_model, dropout=dropout,
                              kq_same=kq_same, l2=l2, separate_qa=separate_qa).to(device)

The structure of AKTNet.py (shown as a diagram in the original post) is: an attention function plus the AKTNet, Architecture, TransformerLayer and MultiHeadAttention classes (and a small Dim enum).

If you are not clear on the difference between functions and classes, this post may help: https://zhuanlan.zhihu.com/p/113472619

attention:

def attention(q, k, v, d_k, mask, dropout, zero_pad, gamma=None):
    """
    This is called by Multi-head atention object to find the values.
    """
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    bs, head, seqlen = scores.size(0), scores.size(1), scores.size(2)
    x1 = torch.arange(seqlen).expand(seqlen, -1).to(device)
    x2 = x1.transpose(0, 1).contiguous()
    with torch.no_grad():
        scores_ = scores.masked_fill(mask == 0, -1e32)
        scores_ = F.softmax(scores_, dim=-1)
        scores_ = scores_ * mask.float().to(device)
        distcum_scores = torch.cumsum(scores_, dim=-1)
        disttotal_scores = torch.sum(scores_, dim=-1, keepdim=True)
        position_effect = torch.abs(x1 - x2)[None, None, :, :].type(torch.FloatTensor).to(device)
        dist_scores = torch.clamp((disttotal_scores - distcum_scores) * position_effect, min=0.)
        dist_scores = dist_scores.sqrt().detach()
    m = nn.Softplus()
    gamma = -1. * m(gamma).unsqueeze(0)
    # Now after do exp(gamma*distance) and then clamp to 1e-5 to 1e5
    total_effect = torch.clamp(torch.clamp((dist_scores * gamma).exp(), min=1e-5), max=1e5)
    scores = scores * total_effect
    scores.masked_fill(mask == 0, -1e23)
    scores = F.softmax(scores, dim=-1)
    if zero_pad:
        pad_zero = torch.zeros(bs, head, 1, seqlen).to(device)
        scores = torch.cat([pad_zero, scores[:, :, 1:, :]], dim=2)
    scores = dropout(scores)
    output = torch.matmul(scores, v)
    return output

For matrix multiplication, this blog post is very intuitive: torch.matmul()用法介绍 (CSDN blog by 明日何其多).

Note: torch.expand (not "expend") can only expand dimensions of size 1; here the 1-D arange is broadcast up to (seqlen, seqlen).

Some torch operations such as narrow(), view(), expand() and transpose() do not copy data. transpose(), for example, does not create a new transposed tensor; it only changes metadata (offset and stride), so the transposed tensor shares memory with the original and modifying one modifies the other. contiguous() behaves more like a deep copy: it returns a tensor with its own contiguous memory layout, so later operations no longer touch the original data.

 

scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
output = torch.matmul(scores, v)

These two lines are already the whole of (scaled dot-product) attention, so what is the remaining code for? Part of it builds the mask, i.e. decides which positions may be attended to; and the torch.no_grad() block together with gamma and total_effect appears to be exactly AKT's monotonic attention: it computes a context-aware distance between time steps and scales the attention scores by an exponential decay exp(gamma * distance) before the softmax (gamma is negative, so interactions from the distant past are down-weighted).

       self.blocks_1 = nn.ModuleList([
            TransformerLayer(d_model, d_feature, d_ff, n_heads, dropout, kq_same)
            for _ in range(n_blocks)
            # why is the for loop written after like this?
        ])
        # print(' self.blocks_1')
        # print( self.blocks_1)
        self.blocks_2 = nn.ModuleList([
            TransformerLayer(d_model, d_feature, d_ff, n_heads, dropout, kq_same)
            for _ in range(n_blocks * 2)
        ])
What is the difference between self.blocks_1 and self.blocks_2, and what does writing the for loop like this mean?

Printing self.blocks_1 and self.blocks_2, the two look exactly the same.

So what is the difference, and what does each do? (Judging from Architecture.forward() further below: blocks_1 self-attends over the question-answer sequence y, while blocks_2 alternates between self-attention over the questions x and attention from x onto the encoded y; they also differ in depth, n_blocks versus n_blocks * 2.)

nn.ModuleList and nn.Sequential

The usage of, and differences between, nn.ModuleList and nn.Sequential

triu and tril

numpy's triu and tril functions and the parameter k explained (CSDN blog by znsoft)
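Since np.triu is used later to build the attention mask, a tiny example (size 4 instead of 200) of how k shifts the diagonal:

import numpy as np

print(np.triu(np.ones((4, 4)), k=0))   # upper triangle including the main diagonal
print(np.triu(np.ones((4, 4)), k=1))   # strictly above the main diagonal
# in TransformerLayer, positions where nopeek_mask is 1 are the ones masked out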

The output of one layer's nodes and the input of the next layer's nodes are related by a function; that function is called the activation function.

The ReLU activation function (Zhihu article)

I won't walk through the remaining code piece by piece; some of my thoughts are put directly into the comments.

# coding: utf-8
# 2021/7/15 @ sone
import math
import torch
from torch import nn
from torch.nn.init import xavier_uniform_, constant_
import torch.nn.functional as F
from enum import IntEnum
import numpy as np
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class Dim(IntEnum):
    batch = 0
    seq = 1
    feature = 2
class AKTNet(nn.Module):
    def __init__(self, n_question, n_pid, d_model, n_blocks, kq_same, dropout, final_fc_dim=512, n_heads=8,
                 d_ff=2048, l2=1e-5, separate_qa=False):
        # what does separate_qa mean? From the code below: whether qa gets its own embedding indexed by Q[i] + A[i] * self.n_question, or just a 0/1 answer embedding added to the question embedding
        super(AKTNet, self).__init__()
        """
        Input:
            d_model: dimension of attention block
            final_fc_dim: dimension of final fully connected net before prediction
            n_heads: number of heads in multi-headed attention
            d_ff : dimension for fully connected net inside the basic block
        """
        self.n_question = n_question
        self.dropout = dropout
        self.kq_same = kq_same
        self.n_pid = n_pid
        self.l2 = l2
        self.separate_qa = separate_qa
        embed_l = d_model
        # d_model=256
        if self.n_pid > 0:
            self.difficult_param = nn.Embedding(self.n_pid + 1, 1)
            # difficult_param: a scalar difficulty parameter for every problem
            self.q_embed_diff = nn.Embedding(self.n_question + 1, embed_l)
            self.qa_embed_diff = nn.Embedding(2 * self.n_question + 1, embed_l)
            # what are the two lines above for?
        # n_question+1 ,d_model
        self.q_embed = nn.Embedding(self.n_question + 1, embed_l)
        if self.separate_qa:
            self.qa_embed = nn.Embedding(2 * self.n_question + 1, embed_l)
        else:
            self.qa_embed = nn.Embedding(2, embed_l)
        # Architecture Object. It contains stack of attention block
        self.model = Architecture(n_blocks, d_model, d_model // n_heads, d_ff, n_heads, dropout, kq_same)
        self.out = nn.Sequential(
            nn.Linear(d_model + embed_l, final_fc_dim),
            nn.ReLU(),
            nn.Dropout(self.dropout),
            nn.Linear(final_fc_dim, 256),
            nn.ReLU(),
            nn.Dropout(self.dropout),
            nn.Linear(256, 1)
            # since we know which question is being attempted, the final output dimension is 1; the final fully connected part has three linear layers
        )
        self.reset()
    def reset(self):
        for p in self.parameters():
            if p.size(0) == self.n_pid + 1 and self.n_pid > 0:
                constant_(p, 0.)
                # fill these parameters with 0; I don't really understand why reset() does this
    def forward(self, q_data, qa_data, target, pid_data=None):
        # Batch First
        q_embed_data = self.q_embed(q_data)
        if self.separate_qa:
            qa_embed_data = self.qa_embed(qa_data)
        else:
            qa_data = (qa_data - q_data) // self.n_question
            qa_embed_data = self.qa_embed(qa_data) + q_embed_data
            # the question (skill) information is already folded in here
        if self.n_pid > 0:
            # what is the intent of the following lines?
            q_embed_diff_data = self.q_embed_diff(q_data)
            # print('q_embed_diff_data ')
            # print(q_embed_diff_data.shape)
            # torch.Size([64, 200, 256])
            pid_embed_data = self.difficult_param(pid_data)
            # print('pid_embed_data')
            # print(pid_embed_data.shape)
            # torch.Size([64, 200, 1])
            q_embed_data = q_embed_data + pid_embed_data * q_embed_diff_data
            # print((pid_embed_data * q_embed_diff_data).shape)
            # torch.Size([64, 200, 256])
            qa_embed_diff_data = self.qa_embed_diff(qa_data)
            if self.separate_qa:
                qa_embed_data = qa_embed_data + pid_embed_data * qa_embed_diff_data
            else:
                qa_embed_data = qa_embed_data + pid_embed_data * (qa_embed_diff_data + q_embed_diff_data)
                # print('inside the else branch')
                # yes
            c_reg_loss = (pid_embed_data ** 2.).sum() * self.l2
        else:
            c_reg_loss = 0.
        # BS.seqlen,d_model
        # Pass to the decoder
        # output shape BS,seqlen,d_model or d_model//2
        d_output = self.model(q_embed_data, qa_embed_data)
        concat_q = torch.cat([d_output, q_embed_data], dim=-1)
        output = self.out(concat_q)
        labels = target.reshape(-1)
        m = nn.Sigmoid()
        preds = output.reshape(-1)
        mask = labels > -0.9
        masked_lables = labels[mask].float()
        masked_preds = preds[mask]
        loss = nn.BCEWithLogitsLoss(reduction='none')
        output = loss(masked_preds, masked_lables)
        return output.sum() + c_reg_loss, m(preds), mask.sum()
class Architecture(nn.Module):
    def __init__(self, n_blocks, d_model, d_feature, d_ff, n_heads, dropout, kq_same):
        super(Architecture, self).__init__()
        """
            n_block : number of stacked blocks in the attention
            d_model : dimension of attention input/output
            d_feature : dimension of input in each of the multi-head attention part.
            n_head : number of heads. n_heads*d_feature = d_model
        """
        self.d_model = d_model
        self.blocks_1 = nn.ModuleList([
            TransformerLayer(d_model, d_feature, d_ff, n_heads, dropout, kq_same)
            for _ in range(n_blocks)
            # why is the for loop written after like this? It is a list comprehension
        ])
        # print(' self.blocks_1')
        # print( self.blocks_1)
        self.blocks_2 = nn.ModuleList([
            TransformerLayer(d_model, d_feature, d_ff, n_heads, dropout, kq_same)
            for _ in range(n_blocks * 2)
        ])
        # print(' self.blocks_2')
        # print( self.blocks_1)
    def forward(self, q_embed_data, qa_embed_data):
        x = q_embed_data
        y = qa_embed_data
        # print(x.shape)
        # torch.Size([64, 200, 256])
        # print(y.shape)
        # torch.Size([64, 200, 256])
        # encoder
        for block in self.blocks_1:  # encode qas
            y = block(mask=1, query=y, key=y, values=y)
            # print('in blocks_1')
        flag_first = True
        for block in self.blocks_2:
            if flag_first:  # peek current question
                x = block(mask=1, query=x, key=x, values=x, apply_pos=False)
                flag_first = False
            else:  # dont peek current response
                x = block(mask=0, query=x, key=x, values=y, apply_pos=True)
                flag_first = True
            # print('in blocks_2')
        return x
class TransformerLayer(nn.Module):
    def __init__(self, d_model, d_feature, d_ff, n_heads, dropout, kq_same):
        super(TransformerLayer, self).__init__()
        """
        This is a Basic Block of Transformer paper. It contains one Multi-head attention object. Followed by layer
        norm and position wise feedforward net and dropout layer.
        """
        kq_same = kq_same == 1
        # Multi-Head Attention Block
        self.masked_attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout, kq_same=kq_same)
        # Two layer norm layer and two dropout layer
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.linear1 = nn.Linear(d_model, d_ff)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.dropout2 = nn.Dropout(dropout)
    def forward(self, mask, query, key, values, apply_pos=True):
        # print(query.shape)
        # torch.Size([64, 200, 256])
        """
        Input:
            block : object of type BasicBlock(nn.Module).
                    It contains masked_attn_head objects which is of type MultiHeadAttention(nn.Module).
            mask : 0 means, it can peek only past values. 1 means, block can peek only current and pas values
            query : Query. In transformer paper it is the input for both encoder and decoder
            key : Keys. In transformer paper it is the input for both encoder and decoder
            values: In transformer paper,
                    it is the input for encoder and encoded output for decoder (in masked attention part)
        Output:
            query: Input gets changed over the layer and returned.
        """
        seqlen = query.size(1)
        # torch.Size([64, 200, 256])
        nopeek_mask = np.triu(np.ones((1, 1, seqlen, seqlen)), k=mask).astype('uint8')
        # from how it is called, mask is either 0 or 1: with k=0 the upper triangle starts at the main diagonal,
        # so a position cannot attend to itself; with k=1 the triangle is shifted up one step, so the current position is kept.
        # positions where nopeek_mask is 1 are the ones that get masked out (an upper-triangular matrix);
        # np.ones takes the shape, so this is a 4-D array
        src_mask = (torch.from_numpy(nopeek_mask) == 0).to(device)
        if mask == 0:  # If 0, zero-padding is needed.
            # Calls block.masked_attn_head.forward() method
            query2 = self.masked_attn_head(query, key, values, mask=src_mask, zero_pad=True)
        else:
            query2 = self.masked_attn_head(query, key, values, mask=src_mask, zero_pad=False)
        query = query + self.dropout1(query2)
        query = self.layer_norm1(query)
        # is the monotonic attention here? (the exponential decay itself lives inside attention() below)
        if apply_pos:
            query2 = self.linear2(self.dropout(self.activation(self.linear1(query))))
            query = query + self.dropout2(query2)
            query = self.layer_norm2(query)
        return query
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, d_feature, n_heads, dropout, kq_same, bias=True):
        super(MultiHeadAttention, self).__init__()
        """
        It has projection layer for getting keys, queries and values. Followed by attention and a connected layer.
        """
        self.d_model = d_model
        self.d_k = d_feature
        self.h = n_heads
        self.kq_same = kq_same
        self.v_linear = nn.Linear(d_model, d_model, bias=bias)
        self.k_linear = nn.Linear(d_model, d_model, bias=bias)
        if kq_same is False:
            self.q_linear = nn.Linear(d_model, d_model, bias=bias)
        self.dropout = nn.Dropout(dropout)
        self.proj_bias = bias
        self.out_proj = nn.Linear(d_model, d_model, bias=bias)
        self.gammas = nn.Parameter(torch.zeros(n_heads, 1, 1))
        xavier_uniform_(self.gammas)
    def forward(self, q, k, v, mask, zero_pad):
        bs = q.size(0)
        # perform linear operation and split into h heads
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        if self.kq_same is False:
            q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        else:
            q = self.k_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        # transpose to get dimensions bs * h * sl * d_model
        k = k.transpose(1, 2)
        q = q.transpose(1, 2)
        v = v.transpose(1, 2)
        # calculate attention using function we will define next
        scores = attention(q, k, v, self.d_k, mask, self.dropout, zero_pad, self.gammas)
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1, 2).contiguous().view(bs, -1, self.d_model)
        output = self.out_proj(concat)
        return output
def attention(q, k, v, d_k, mask, dropout, zero_pad, gamma=None):
    """
    This is called by Multi-head atention object to find the values.
    """
    # print(q.shape)
    # torch.Size([64, 8, 200, 32])
    # print(k.shape) same
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    # print(scores.shape)
    # ([64, 8, 200, 200])
    bs, head, seqlen = scores.size(0), scores.size(1), scores.size(2)
    x1 = torch.arange(seqlen).expand(seqlen, -1).to(device)
    # print(x1.shape) 200*200
    x2 = x1.transpose(0, 1).contiguous()
    with torch.no_grad():
        scores_ = scores.masked_fill(mask == 0, -1e32)
        scores_ = F.softmax(scores_, dim=-1)
        scores_ = scores_ * mask.float().to(device)
        distcum_scores = torch.cumsum(scores_, dim=-1)
        disttotal_scores = torch.sum(scores_, dim=-1, keepdim=True)
        position_effect = torch.abs(x1 - x2)[None, None, :, :].type(torch.FloatTensor).to(device)
        dist_scores = torch.clamp((disttotal_scores - distcum_scores) * position_effect, min=0.)
        dist_scores = dist_scores.sqrt().detach()
    m = nn.Softplus()
    gamma = -1. * m(gamma).unsqueeze(0)
    # Now after do exp(gamma*distance) and then clamp to 1e-5 to 1e5
    total_effect = torch.clamp(torch.clamp((dist_scores * gamma).exp(), min=1e-5), max=1e5)
    scores = scores * total_effect
    scores.masked_fill(mask == 0, -1e23)
    scores = F.softmax(scores, dim=-1)
    if zero_pad:
        pad_zero = torch.zeros(bs, head, 1, seqlen).to(device)
        scores = torch.cat([pad_zero, scores[:, :, 1:, :]], dim=2)
    scores = dropout(scores)
    output = torch.matmul(scores, v)
    return output

The one part of this code I was unsure about:

One of AKT's innovations is a new monotonic attention mechanism that uses an exponential decay curve to reduce the importance of questions from the distant past. At first I couldn't find where it lives; as far as I can tell it is the torch.no_grad() block inside attention(): dist_scores is a context-aware distance between time steps, gamma = -softplus(gammas) is the learned (negative) decay rate, and total_effect = exp(gamma * dist_scores) scales the raw attention scores before the softmax, which is exactly the exponential decay.

An aside:

In moments of happiness, learn gratitude.

In moments of bitterness, keep your grace.

I recommend a documentary, "Woman": https://www.bilibili.com/video/BV1fQ4y1y7V5/

It is not about feminism; it is about understanding women as a group a little better, and thinking about what women are and what strength they hold.
