5-generation

Posted on 2022-09-04. Categories: tutorial, multimodal

MMML Tutorial Challenge 4: Generation

Generation is defined as producing a raw modality, which means the output modality should differ from the input modalities:

Learning a generative process to produce raw modalities that reflects cross-modal interactions, structure, and coherence.

Generation has two dimensions:

Definition of translation:

Translating from one modality to another and keeping information content while being consistent with cross-modal interactions.

For example, DALL-E (Ramesh et al., Zero-Shot Text-to-Image Generation. ICML 2021):

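From the content and generation perspectives: because the task is translation, no information should be lost, so coordination is used to keep the information in the two modalities consistent with each other.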
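To make the idea of coordination concrete, here is a minimal sketch of one common way to coordinate two modalities: a contrastive objective that pulls matched text and image embeddings together. This is an assumption about how coordination might be realized, not a description of how DALL-E is trained. The function names, the toy embeddings, and the temperature value are all hypothetical, and only the text-to-image direction is computed for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """Text-to-image contrastive loss; pair i is (text_embs[i], image_embs[i])."""
    total = 0.0
    for i, t in enumerate(text_embs):
        # Similarity of this text to every image in the batch.
        logits = [cosine(t, v) / temperature for v in image_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matched image.
        total += log_norm - logits[i]
    return total / len(text_embs)

# Hypothetical toy embeddings (3 text/image pairs, 3 dimensions).
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
misaligned = [aligned[1], aligned[2], aligned[0]]  # pairs deliberately shuffled

loss_aligned = contrastive_loss(text_embs, aligned)
loss_misaligned = contrastive_loss(text_embs, misaligned)
print(loss_aligned < loss_misaligned)
```

The loss is lower when each text embedding is closest to its own image embedding, which is the sense in which a coordinated space preserves information across the two modalities.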

Another example is DALL-E 2 (Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022), whose core principle is the same as DALL-E's:

Summarization is defined as reducing the amount of information while identifying the important parts:

Summarizing multimodal data to reduce information content while highlighting the most salient parts of the input.

For example, a summary can be generated from video and language (Palaskar et al., Multimodal Abstractive Summarization for How2 Videos. ACL 2019):

For summarization, the content dimension requires fusing the modalities, and the generation step requires reducing the information.
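The two ingredients, fusion and information reduction, can be illustrated with a toy sketch. It is extractive rather than abstractive, so it is not the method of Palaskar et al.; it only assumes that each video segment comes with a video feature vector and a text feature vector, fuses them by concatenation, and keeps the segments closest to the centroid of the fused features. The segment data, field names, and the centroid-based salience score are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuse(video_feat, text_feat):
    """Early fusion: concatenate the per-segment feature vectors."""
    return list(video_feat) + list(text_feat)

def summarize(segments, k):
    """Keep the k segments whose fused features are closest to the centroid."""
    fused = [fuse(s["video"], s["text"]) for s in segments]
    dim = len(fused[0])
    centroid = [sum(f[d] for f in fused) / len(fused) for d in range(dim)]
    # Salience score (an assumption): similarity to the overall content.
    scores = [cosine(f, centroid) for f in fused]
    top = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)[:k]
    # Return the selected segment ids in their original temporal order.
    return [segments[i]["id"] for i in sorted(top)]

# Hypothetical toy segments; segment 2 is off-topic relative to the rest.
segments = [
    {"id": 0, "video": [1.0, 0.0], "text": [1.0, 0.0]},
    {"id": 1, "video": [0.9, 0.1], "text": [0.8, 0.2]},
    {"id": 2, "video": [0.0, 1.0], "text": [0.0, 1.0]},
    {"id": 3, "video": [1.0, 0.1], "text": [0.9, 0.0]},
]

summary = summarize(segments, k=2)
print(summary)
```

Selecting a subset of segments is the simplest form of information reduction; an abstractive system would instead decode new text conditioned on the fused representation.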

Creation requires producing new modalities and is a very challenging direction:

Simultaneously generating multiple modalities to increase information content while maintaining coherence within and across modalities.

In practice, no existing method fits the creation direction particularly well. A very preliminary approach is (Tsai et al., Learning Factorized Multimodal Representations. ICLR 2019):

Many open research questions remain: