nanobot-mid-train

nanobot

mid_train

Published on

任务相关数据(但仍是next-token prediction),学习数学、选择题、对话等”局部模式”

举个例子:

Base Model看过: "What is the capital of France?"

Mid-Train后: "What is the capital of France? The capital of France is Paris."(看了大量MMLU/GSM8K问&答对)

SFT后: [user_start] What is the capital of France? [user_end][assistant_start]The capital of France is Paris.[assistant_end]"(学会了对话格式和停止)

具体的训练数据848K rows = 460K + 100K + 8K + 200K + 80K

train_dataset = TaskMixture([
    SmolTalk(split="train"), # 460K rows of general conversations
    MMLU(subset="auxiliary_train", split="train"), # 100K rows of multiple choice problems drawn from ARC, MC_TEST, OBQA, RACE
    GSM8K(subset="main", split="train"), # 8K rows teaching simple math and (calculator) tool use
    CustomJSON(filepath=identity_conversations_filepath), # 1000 rows of synthetic identity conversations
    CustomJSON(filepath=identity_conversations_filepath), # let's do 2 epochs of these
    SimpleSpelling(size=200000, split="train"), # 200K rows of Simple Spelling (e.g. spell the word 'apple')
    SpellingBee(size=80000, split="train"), # 80K rows of Spelling Bee (e.g. how many 'r' are in 'strawberry'?)
])