Aligning Large Multimodal Models with Factually Augmented RLHF

UC Berkeley CMU UIUC UW–Madison Microsoft Research MIT-IBM Watson AI Lab   *Equal Contribution, Equal Advising

LLaVA-RLHF is the first open-source RLHF-trained large multimodal model for general-purpose visual and language understanding. It achieves impressive visual reasoning and perception capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on LLaVA-Bench, MMBench, and MMHal-Bench.
We propose a new alignment algorithm called Factually Augmented RLHF (Fact-RLHF) that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance.
LLaVA-RLHF combines a CLIP vision encoder and Vicuna, is fine-tuned with high-quality vision instruction tuning data and Fact-RLHF, and is shown to be more helpful and to hallucinate less than LLaVA or other open-source LMMs.


Large Multimodal Models (LMMs) are built across modalities, and misalignment between the two modalities can result in "hallucination": generating textual outputs that are not grounded in the multimodal information in context.

  1. High-Quality Instruct Data. We convert VQA-v2 (83k) and A-OKVQA (16k) into multi-round QA tasks, and Flickr30k (23k) into a Spotting Captioning task, and train the LLaVA-SFT+ models on the new data mixture, which also includes LLaVA-Instruct-90k (randomly sampled from LLaVA-Instruct-150K).
  2. Factually-Augmented RLHF. We introduce a novel algorithm named Factually Augmented RLHF (Fact-RLHF), which calibrates the reward signals by augmenting them with additional information such as image captions or ground-truth multi-choice options. The reward model is trained on 10k hallucination-aware human preference data.
  3. MMHal-Bench. To evaluate hallucination in real-world scenarios, we develop a new evaluation benchmark, MMHal-Bench, with a special focus on penalizing hallucinations. It contains 96 image-question pairs, spanning 8 question categories × 12 object topics from OpenImages.
  4. Performance. Our early experiments show that LLaVA-RLHF demonstrates impressive visual reasoning and perception abilities while hallucinating less and being more human-aligned, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions. It yields a 95.6% (vs. LLaVA's 85.1%) relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and 60.1 (vs. LLaVA's 47.5) overall performance on MMBench.
  5. Open-source. We make human preference data, high-quality visual instruction tuning data, our model and code base publicly available.
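
The idea behind item 2 can be illustrated with a small sketch. Only the notion of giving the reward model extra factual context (image captions or a ground-truth option) comes from the text above; the function name `build_reward_input`, the prompt template, and the example strings are hypothetical and are not the authors' actual format.

```python
from typing import Optional, Sequence


def build_reward_input(
    question: str,
    response: str,
    captions: Sequence[str] = (),
    gt_option: Optional[str] = None,
) -> str:
    """Assemble the text a reward model would score (hypothetical template)."""
    parts = [f"Question: {question}", f"Response: {response}"]
    if captions:
        # Factual augmentation: captions describe what is really in the image,
        # so the reward model can check the response against them.
        parts.append("Image captions (factual context):")
        parts.extend(f"- {caption}" for caption in captions)
    if gt_option is not None:
        # For multi-choice questions, the ground-truth option can be appended.
        parts.append(f"Ground-truth option: {gt_option}")
    return "\n".join(parts)


# Made-up example strings, for illustration only.
plain = build_reward_input("What is the dog holding?", "A red frisbee.")
augmented = build_reward_input(
    "What is the dog holding?",
    "A red frisbee.",
    captions=["A dog runs on grass with a stick in its mouth."],
)
print(augmented)
```

With the caption in the input, a reward model has the evidence it needs to penalize the ungrounded "red frisbee" answer, which is the kind of reward hacking the augmentation is meant to reduce.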

High-Quality Multimodal Instruction-Following Data and Human Preference Data

Based on LLaVA, we collect a 10k human preference dataset, with inputs sampled from 10k LLaVA-Instruct examples and outputs generated by LLaVA at a sampling temperature of 0.7.
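
As a reminder of what the temperature setting does, here is a minimal sketch of temperature-scaled softmax over next-token logits. The function name and the toy logits are illustrative assumptions; this is the standard formulation, not code from the LLaVA-RLHF release.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 0.7) -> List[float]:
    """Convert logits to sampling probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up values
p_t1 = softmax_with_temperature(toy_logits, temperature=1.0)
p_t07 = softmax_with_temperature(toy_logits, temperature=0.7)
print(p_t1, p_t07)
```

A temperature below 1 sharpens the distribution while keeping it stochastic, so sampling the same prompt more than once can still produce differing responses for annotators to compare.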

Data file name               File Size   Sample Size
human_preference_10k.json    31 MB       10K
vqav2_83k.json               13 MB       83K
aokvqa_16k.json              4.9 MB      16K
flickr_23k.json              3.9 MB      23K
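
For orientation, the sizes quoted above for the LLaVA-SFT+ training mixture can be tallied as follows. The counts are the rounded "k" figures from the text, so the total and shares are approximate; the dictionary keys are descriptive labels, not file names.

```python
# Rounded sample counts as quoted in the text (approximate).
sft_mixture = {
    "LLaVA-Instruct (sampled)": 90_000,
    "VQA-v2 multi-round QA": 83_000,
    "A-OKVQA multi-round QA": 16_000,
    "Flickr30k Spotting Captioning": 23_000,
}

total = sum(sft_mixture.values())
shares = {name: count / total for name, count in sft_mixture.items()}

for name, count in sft_mixture.items():
    print(f"{name:32s} {count:>8,d}  {shares[name]:6.1%}")
print(f"{'Total':32s} {total:>8,d}")
```

Under these rounded figures the mixture has about 212k examples, with the sampled LLaVA-Instruct data and the VQA-v2 conversion making up most of it.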

LLaVA-RLHF: Aligned Large Language-and-Vision Assistant

LLaVA-RLHF connects a pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna using a simple projection matrix and a LoRA module. We consider a three-stage alignment procedure:

  • Stage 1: Supervised Fine-tuning.
    • Following LLaVA, we conduct pre-training for Feature Alignment. Only the projection matrix is updated, based on a subset of CC3M.
    • Visual Chat and HQ Multimodal Instruction: LLaVA-SFT+ is fine-tuned on the 90k LLaVA-Instruct task, the 83k VQA-v2 and 16k A-OKVQA multi-round QA tasks, and the 23k Flickr30k Spotting Captioning task.
  • Stage 2: Human Preference Collection & Preference Modeling.
    • We collect 10k human preferences where human annotators are asked to compare two responses and pinpoint the more hallucinated one.
  • Stage 3: Factually-Augmented RLHF.
    • Only the LoRA module on top of LLaVA-SFT+ is fine-tuned, first to obtain the Reward Model from the 10k human preference data, and then to obtain the RL Model via reinforcement learning (PPO) from simulated human preferences.
Please check out our [LLaVA-RLHF-13bx336-v1.5] model checkpoint and [LLaVA-RLHF-7bx224-v1.5] model checkpoint.
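
To make Stages 2 and 3 more concrete, the sketch below shows a pairwise preference loss of the kind commonly used to train RLHF reward models. The page does not state the exact objective, so the Bradley-Terry style loss here is an assumption, and the function name and reward values are illustrative. The "preferred" response is taken to be the one annotators did not mark as more hallucinated.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)), computed stably.

    Assumed objective (Bradley-Terry style); not confirmed by the source.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up scalar rewards for two responses to the same image and question.
loss_correct_order = pairwise_preference_loss(2.0, 0.0)  # preferred scored higher
loss_wrong_order = pairwise_preference_loss(0.0, 2.0)    # preferred scored lower
print(loss_correct_order, loss_wrong_order)
```

The loss shrinks as the reward model scores the less hallucinated response higher than the more hallucinated one, and the trained reward model then supplies the scalar signal that PPO optimizes.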


Visual Chat: Towards building multimodal GPT-4 level chatbot

On 90 language-image instructions of LLaVA-Bench, we test LLaVA-RLHF, LLaVA and GPT-4, and use GPT-4 to rate their responses on a scale from 1 to 10. The summed score and relative score per type are reported. Overall, LLaVA-RLHF achieves a 95.6% (vs. LLaVA's 85.1%) relative score compared with GPT-4.
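
A relative score of this kind can be computed as the ratio of summed ratings. The sketch below assumes the relative score is the model's summed score divided by GPT-4's summed score on the same instructions, expressed as a percentage; the function name and the ratings are made up for illustration and are not benchmark results.

```python
from typing import Sequence


def relative_score(model_scores: Sequence[float], reference_scores: Sequence[float]) -> float:
    """Summed model ratings as a percentage of summed reference ratings."""
    if len(model_scores) != len(reference_scores):
        raise ValueError("both systems must be rated on the same instructions")
    reference_total = sum(reference_scores)
    if reference_total == 0:
        raise ValueError("reference ratings must not sum to zero")
    return 100.0 * sum(model_scores) / reference_total


# Hypothetical 1-10 ratings on four instructions (not real data).
model_ratings = [8, 7, 9, 6]
gpt4_ratings = [9, 8, 9, 8]
print(f"{relative_score(model_ratings, gpt4_ratings):.1f}%")
```

Because the ratio uses sums rather than per-item averages of ratios, instructions where the reference earns a high rating weigh more heavily in the final percentage.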

MMHal-Bench: New SoTA with the synergy of LLaVA with Fact-RLHF

Examples on more Helpful and less Hallucinated Visual Instruction Following


    author      = {Zhiqing Sun and Sheng Shen and Shengcao Cao and Haotian Liu and Chunyuan Li and Yikang Shen and Chuang Gan and Liang-Yan Gui and Yu-Xiong Wang and Yiming Yang and Kurt Keutzer and Trevor Darrell},
    title       = {Aligning Large Multimodal Models with Factually Augmented RLHF},
    publisher   = {arXiv:2309.14525},
    year        = {2023}


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.

Usage and License Notices: The data, code and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [LLaVA]