5-generation

MMML Tutorial Challenge 4: Generation

generation的定义是生成raw modality,也就是说应该和input modalities是不同的modality:

Learning a generative process to produce raw modalities that reflects cross-modal interactions, structure, and coherence.

generation的两个维度:

Sub-challenge 1: Translation

translation定义:

Translating from one modality to another and keeping information content while being consistent with cross-modal interactions.

比如DALLE(Ramesh et al., Zero-Shot Text-to-Image Generation. ICML 2021):

从content和generation的角度来看,因为我们做的translation,因此我们不需要存在信息损失,所以利用coordination来保持两个模态的信息能够互相协作。

比如DALL E 2(Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022)和DALL-E核心原理是一致的:

Sub-challenge 2: Summarization

summarization的定义是缩减信息量并且找出重要的信息:

Summarizing multimodal data to reduce information content while highlighting the most salient parts of the input.

比如下面的例子,通过video和language生成summary(Palaskar et al., Multimodal Abstractive Summarization for How2 Videos. ACL 2019):

summarization的content就需要是进行模态的fusion,并且生成的时候需要进行信息的缩减:

Sub-challenge 3: Creation

creation需要创造新的modalities,是一个非常具有挑战性的方向:

Simultaneously generating multiple modalities to increase information content while maintaining coherence within and across modalities.

实际上现在没有特别符合creation方向的方法,一个非常初步的方法是(Tsai et al., Learning Factorized Multimodal Representations. ICLR 2019):

还存在很多的可以研究的点: