Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination": generating textual outputs that are not grounded by the multimodal information in context.
Based on LLaVA, we collect a 10k human preference dataset, with inputs sampled from 10k LLaVA-Instruct examples and outputs sampled from LLaVA at temperature 0.7.
| Data file name | File Size | Sample Size |
|---|---|---|
| human_preference_10k.json | 31 MB | 10K |
| vqav2_83k.json | 13 MB | 83K |
| aokvqa_16k.json | 4.9 MB | 16K |
| flickr_23k.json | 3.9 MB | 23K |
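As a minimal sketch of how one might load the released preference file, the snippet below reads `human_preference_10k.json`; note that the field names it prints (`question`, `output_1`, `output_2`, `preference`) are assumptions for illustration and should be checked against the actual JSON schema.

```python
import json

# Hypothetical loader for the released preference file. The field names
# below (question, output_1, output_2, preference) are assumptions, not
# the confirmed schema of human_preference_10k.json.
with open("human_preference_10k.json", "r") as f:
    preference_data = json.load(f)

print(f"Loaded {len(preference_data)} preference examples")

# Each record is assumed to pair one LLaVA-Instruct question with two
# LLaVA responses sampled at temperature 0.7, plus a human preference label.
example = preference_data[0]
for key in ("question", "output_1", "output_2", "preference"):
    print(key, "->", str(example.get(key))[:80])
```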
LLaVA-RLHF connects the pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna using a simple projection matrix and a LoRA module. We consider a three-stage alignment procedure: multimodal supervised fine-tuning, factually augmented reward modeling, and reinforcement learning from human feedback.
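To illustrate the "simple projection matrix", here is a minimal PyTorch sketch of a linear layer that maps CLIP ViT-L/14 patch features into the LLM's token-embedding space. The dimensions (1024 for CLIP ViT-L/14, 4096 for a 7B Vicuna) and the class name are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Sketch of the projection that maps frozen CLIP ViT-L/14 patch
    features into the Vicuna token-embedding space (dims are assumptions)."""

    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer, i.e. the "simple projection matrix".
        self.proj = nn.Linear(clip_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, clip_dim) from the frozen CLIP encoder
        # returns:        (batch, num_patches, llm_dim) visual "tokens" for the LLM
        return self.proj(image_features)


if __name__ == "__main__":
    projector = VisionProjector()
    fake_clip_features = torch.randn(1, 256, 1024)  # dummy CLIP output
    visual_tokens = projector(fake_clip_features)
    print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```

The projected visual tokens are then prepended to the text embeddings before they enter the language model, while the LoRA module adapts the language model itself during alignment.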
On 90 language-image instructions from LLaVA-Bench, we evaluate LLaVA-RLHF, LLaVA, and GPT-4, and use GPT-4 to rate their responses on a scale of 1 to 10. The summed score and the relative score per instruction type are reported. Overall, LLaVA-RLHF achieves a 95.6% relative score compared with GPT-4 (vs. LLaVA's 85.1%).
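As we understand it, the relative score is the ratio of a model's summed GPT-4 ratings to GPT-4's own summed ratings; the sketch below shows that arithmetic with placeholder numbers, not the actual per-question ratings.

```python
# Sketch of the relative-score computation: each model's summed GPT-4
# ratings (1-10 per question) divided by GPT-4's own summed ratings.
def relative_score(model_scores, gpt4_scores):
    return 100.0 * sum(model_scores) / sum(gpt4_scores)


model_ratings = [8, 7, 9]  # placeholder GPT-4 ratings of the model's answers
gpt4_ratings = [9, 8, 9]   # placeholder GPT-4 ratings of GPT-4's own answers
print(f"relative score: {relative_score(model_ratings, gpt4_ratings):.1f}%")
```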
@article{2023llavarlhf,
  author  = {Zhiqing Sun and Sheng Shen and Shengcao Cao and Haotian Liu and Chunyuan Li and Yikang Shen and Chuang Gan and Liang-Yan Gui and Yu-Xiong Wang and Yiming Yang and Kurt Keutzer and Trevor Darrell},
  title   = {Aligning Large Multimodal Models with Factually Augmented RLHF},
  journal = {arXiv preprint arXiv:2309.14525},
  year    = {2023}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and we thank the open-source projects Alpaca and Vicuna.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Related Links: [LLaVA]