MAF
MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition
WSDM 2022; code available; Fudan University.
The authors estimate how well a post's text and image match, use that score to decide how much image information flows into the text representation, and additionally aim to keep the text and image representations consistent across modalities.
In this paper, we study multimodal named entity recognition in social media posts. Existing works mainly focus on using a cross-modal attention mechanism to combine text representation with image representation. However, they still suffer from two weaknesses: (1) current methods rest on the strong assumption that each text and its accompanying image are matched, and that the image can be used to help identify named entities in the text; this assumption is not always true in real scenarios, and relying on it may degrade the recognition performance of the MNER model; (2) current methods fail to construct a consistent representation to bridge the semantic gap between the two modalities, which prevents the model from establishing a good connection between the text and the image. To address these issues, we propose a general matching and alignment framework (MAF) for multimodal named entity recognition in social media posts. Specifically, to solve the first issue, we propose a novel cross-modal matching (CM) module to calculate a similarity score between the text and the image, and use the score to determine the proportion of visual information that should be retained. To solve the second issue, we propose a novel cross-modal alignment (CA) module to make the representations of the two modalities more consistent. We conduct extensive experiments, ablation studies, and case studies to demonstrate the effectiveness and efficiency of our method. The source code of this paper can be found at https://github.com/xubodhu/MAF.
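A minimal PyTorch sketch of the two ideas, not the authors' implementation: `CrossModalMatching` gates pooled image features by a predicted text-image match score before fusing them with the text features, and `cross_modal_alignment_loss` realizes the alignment objective as an InfoNCE-style contrastive loss (the concrete loss form, the fusion operator, and all names such as `text_feats`/`image_feats` are assumptions for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatching(nn.Module):
    """Sketch of the CM idea: predict a text-image match score and use it
    to control how much visual information is mixed into the text features."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, dim) pooled text representation
        # image_feats: (batch, dim) pooled image representation
        score = torch.sigmoid(
            self.scorer(torch.cat([text_feats, image_feats], dim=-1))
        )
        gated_image = score * image_feats   # retain a proportion of visual info
        fused = text_feats + gated_image    # illustrative fusion, not the paper's exact operator
        return fused, score

def cross_modal_alignment_loss(text_feats, image_feats, temperature=0.1):
    """Sketch of the CA idea: a symmetric contrastive loss that pulls matched
    text/image representations together and pushes mismatched pairs apart
    (an assumed instantiation of the 'consistent representation' goal)."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    logits = t @ v.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    text = torch.randn(4, 768)
    image = torch.randn(4, 768)
    fused, score = CrossModalMatching(768)(text, image)
    loss = cross_modal_alignment_loss(text, image)
    print(fused.shape, score.shape, loss.item())
```

In this sketch the gate score plays the role described in the abstract: when the image is judged irrelevant to the text, the score stays small and little visual information reaches the fused representation used for entity tagging.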