A General Survey on Attention Mechanisms in Deep Learning

TKDE 2021

Attention is an important mechanism that can be employed for a variety of deep learning models across many different domains and tasks. This survey provides an overview of the most important attention mechanisms proposed in the literature. The various attention mechanisms are explained by means of a framework consisting of a general attention model, uniform notation, and a comprehensive taxonomy of attention mechanisms. Furthermore, the various measures for evaluating attention models are reviewed, and methods to characterize the structure of attention models based on the proposed framework are discussed. Last, future work in the field of attention models is considered.

This paper surveys a large number of attention methods, focusing on supervised learning.
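
As one concrete instance of the query/key/value attention that such a general model covers, here is a minimal NumPy sketch of scaled dot-product attention. The function name, the shapes, and the choice of scoring function are illustrative assumptions, not notation taken from the survey.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Return (context, weights) for queries (n_q, d), keys (n_k, d), values (n_k, d_v)."""
    d = queries.shape[-1]
    # Score every query against every key, scaled by sqrt(d).
    scores = queries @ keys.T / np.sqrt(d)
    # Softmax over the keys; subtracting the row max keeps exp() stable.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each context vector is a weighted average of the value vectors.
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
k = rng.normal(size=(3, 4))   # 3 keys of dimension 4
v = rng.normal(size=(3, 5))   # 3 values of dimension 5
context, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` sums to 1, so every row of `context` is a convex combination of the rows of `v`.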


Language Models are Few-Shot Learners

GPT-3, NIPS 2020. The 63-page version is a technical report rather than a submitted paper. OpenAI, 2020-05

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Compared with GPT-2's emphasis on the zero-shot setting, GPT-3 steps back slightly and emphasizes the few-shot setting instead.
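
In the few-shot setting the task and its demonstrations are given purely as text, with no gradient updates. The sketch below shows one way such a prompt might be assembled. The function name and the `=>` separator are hypothetical formatting choices, and no model is called.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a text prompt: task description, K demonstrations, then the query."""
    lines = [instruction]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    # The final line is left open for the language model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)

# Few-shot: two demonstrations precede the query.
prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)

# Zero-shot: the same function with an empty demonstration list.
zero_shot_prompt = build_few_shot_prompt("Translate English to French:", [], "peppermint")
print(prompt)
```

Passing an empty demonstration list gives the zero-shot prompt, so the two settings differ only in how many examples are placed in the context.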


Language Models are Unsupervised Multitask Learners

GPT-2, OpenAI, 1.5 billion parameters, 2019-02

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

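Unfortunately, although GPT-2 is larger than BERT-large (about 340 million parameters), it does not outperform BERT. The GPT-2 paper therefore focuses mainly on the zero-shot setting.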


Improving Language Understanding by Generative Pre-Training

GPT-1, OpenAI, 2018-06

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).


How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks

ICLR 2021

This paper studies the extrapolation behavior of MLPs and GNNs from both theoretical and experimental perspectives.


Using Fast Weights to Attend to the Recent Past

NIPS 2016

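The authors introduce fast weights into RNNs and obtain better results. In essence, an additional inner recurrent loop is inserted between time steps t and t+1 of the RNN. Each inner step computes relation weights between the previous hidden states and the current hidden state and accumulates them, which ultimately yields better performance.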
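
The inner loop can be sketched in NumPy as follows. This is a simplified illustration: it assumes the fast-weight matrix is a decaying sum of outer products of hidden states, and it omits layer normalization. The function name and the values of `decay`, `lr`, and `inner_steps` are illustrative, not the paper's settings.

```python
import numpy as np

def fast_weights_step(h, x, A, W, C, decay=0.95, lr=0.5, inner_steps=3):
    """One outer RNN step (t -> t+1) with an inner fast-weights loop."""
    # Fast weights: a decaying sum of outer products of past hidden states.
    A = decay * A + lr * np.outer(h, h)
    # The slow-weight contribution stays fixed during the inner loop.
    boundary = W @ h + C @ x
    h_inner = np.tanh(boundary)
    for _ in range(inner_steps):
        # A @ h_inner weights each stored hidden state by its similarity to h_inner.
        h_inner = np.tanh(boundary + A @ h_inner)
    return h_inner, A

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = rng.normal(scale=0.1, size=(hidden, hidden))  # slow recurrent weights
C = rng.normal(scale=0.1, size=(hidden, inputs))  # slow input weights
h, A = np.zeros(hidden), np.zeros((hidden, hidden))
for x in rng.normal(size=(5, inputs)):
    h, A = fast_weights_step(h, x, A, W, C)
```

Because `A` is built from outer products, `A @ h_inner` equals a sum of past hidden states weighted by their dot product with `h_inner`, which is why the mechanism behaves like attention over the recent past.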


Meta-Learning: Learning to Learn Fast

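These are notes on the blog post Meta-Learning: Learning to Learn Fast, with additional reference to a corresponding Chinese-language blog post, as a brief introduction to what meta-learning is.

Meta-learning tries to address the problem that deep learning often needs a large amount of example data to converge. A good meta-learning model is expected to generalize and adapt well, learning useful information from only a few samples.

Meta-learning can solve a class of well-defined prediction tasks; the post mainly discusses meta-learning under supervised learning. For example, an image classifier whose training set contains no cats should be able to learn to recognize cats at test time after seeing only a few cat images.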
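
One common way to set this up is to train and evaluate on N-way K-shot episodes, with the test classes disjoint from the training classes. The sketch below only samples such episodes. The function name, the toy dataset, and the class names are hypothetical, and strings stand in for images.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample an N-way K-shot episode from a {label: [examples]} mapping."""
    labels = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in labels:
        examples = rng.sample(dataset[label], k_shot + n_query)
        # The first K examples are shown to the learner; the rest are held out.
        support += [(ex, label) for ex in examples[:k_shot]]
        query += [(ex, label) for ex in examples[k_shot:]]
    return support, query

# "cat" never appears in the meta-training classes.
train_classes = {c: [f"{c}_{i}" for i in range(10)] for c in ["dog", "bird", "car", "tree"]}
test_classes = {c: [f"{c}_{i}" for i in range(10)] for c in ["cat", "fish"]}

rng = random.Random(0)
support, query = sample_episode(train_classes, n_way=2, k_shot=3, n_query=2, rng=rng)
test_support, test_query = sample_episode(test_classes, n_way=2, k_shot=3, n_query=2, rng=rng)
```

At meta-test time the model sees only `test_support` (three cat images and three fish images here) and is evaluated on `test_query`.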


From Local Structures to Size Generalization in Graph Neural Networks

ICML 2021

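The authors study how well GNNs generalize across graph sizes, specifically the setting in which a GNN is trained on small graphs and then tested on larger graphs.

Main contributions:

  • Propose the d-pattern, a definition of a graph's local structure. A GNN produces the same output for nodes with the same d-pattern, so d-patterns can serve as an abstraction of a GNN's expressive power.
  • Show theoretically and empirically that a GNN trained on graphs of one size is not guaranteed to learn a model that generalizes well to graphs of a different size.
  • Propose a self-supervised learning (SSL) approach to improve size generalization, with two loss settings: unsupervised and semi-supervised. Training uses two different procedures, pretraining and multi-task learning.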

Deep Residual Learning for Image Recognition

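Deep residual networks: the milestone paper that extended CNNs to 152 layers and beyond while also achieving better results. The core idea is to introduce residual connections into deep CNNs so that a deeper model performs no worse than a shallower one.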
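
A residual connection adds the block's input back to its output, y = relu(F(x) + x). The NumPy sketch below uses two fully connected layers for F instead of convolutions, which is a simplifying assumption for illustration; the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Compute y = relu(F(x) + x), where F is linear -> ReLU -> linear."""
    residual = W2 @ relu(W1 @ x)
    # The skip connection adds the unchanged input to the learned residual.
    return relu(residual + x)

x = np.array([1.0, 2.0, 0.5])
zeros = np.zeros((3, 3))
# With all-zero weights F(x) = 0, so the block passes a non-negative input through unchanged.
y = residual_block(x, zeros, zeros)
```

Because a block can fall back to the identity mapping simply by driving its weights toward zero, stacking more blocks should not, in principle, make the model worse than its shallower counterpart.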

