Multimodal Chain-of-Thought Reasoning in Language Models
Feb 2023
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola
[Shanghai Jiao Tong University, Amazon Web Services]
https://arxiv.org/abs/2302.00923
https://github.com/amazon-science/mm-cot (2.2k stars)
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT, which incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17% → 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at https://github.com/amazon-science/mm-cot.
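The abstract describes a two-stage pipeline: stage 1 generates a rationale from the text input fused with image features, and stage 2 appends that rationale to the original input to infer the answer. Below is a minimal sketch of that flow; the classes `RationaleModel` and `AnswerModel` are hypothetical stubs for illustration, not the released mm-cot API or model code.

```python
# Hypothetical sketch of two-stage Multimodal-CoT inference.
# RationaleModel / AnswerModel are placeholder stubs, not the mm-cot implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    context: str
    options: List[str]
    image_features: list  # e.g., features from a frozen vision encoder (assumption)


class RationaleModel:
    """Stage 1: fuse text and vision features, then generate a rationale (stub)."""

    def generate(self, ex: Example) -> str:
        return "Solution: <generated multimodal rationale>"


class AnswerModel:
    """Stage 2: condition on the original input plus the rationale to infer the answer (stub)."""

    def generate(self, ex: Example, rationale: str) -> str:
        return "The answer is (A)."


def multimodal_cot(ex: Example, stage1: RationaleModel, stage2: AnswerModel) -> str:
    rationale = stage1.generate(ex)        # rationale generation uses text + image features
    return stage2.generate(ex, rationale)  # answer inference reuses the generated rationale


if __name__ == "__main__":
    ex = Example(
        question="Which property do these objects have in common?",
        context="Look at each object.",
        options=["(A) hard", "(B) soft"],
        image_features=[],
    )
    print(multimodal_cot(ex, RationaleModel(), AnswerModel()))
```

The key design choice the paper reports is this separation: generating the rationale with multimodal grounding first, then feeding it back as extra input for answer inference, rather than producing rationale and answer in a single pass.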