Relation Extraction Models
Most previous models in this domain rely on features obtained from NLP tools such as part-of-speech (POS) taggers and named entity recognizers (NER). The well-known pretrained BERT model (Devlin et al., 2018) had not been applied to relation classification, a task that relies not only on the information of the whole sentence but also on the information of the specific target entities. Recently, several papers have been published that try to improve on these earlier results.
BiLSTM with Attention using Latent Entity Typing (LET)
An end-to-end recurrent neural network with entity-aware attention and LET was published in 2019 — “Semantic relation classification via bidirectional LSTM networks with entity-aware attention using latent entity typing”. It proposes a method that uses high-level features and extracts the important semantic information in the sentence for the prediction. The model effectively utilizes the entities and their latent types as features; for example, tagged entity pairs can be powerful hints for relation classification. It creates word representations using a self-attention mechanism in order to capture the context of the sentence.
As shown in the figure above, this model consists of four main components (a minimal code sketch follows the list):
- Word Representation — converts each word in the sentence into a vector representation; the resulting sequence is called X.
- Self Attention — captures the meaning of each word considering its context.
- BiLSTM — sequentially encodes the representations produced by the self-attention layer.
- Entity-aware Attention — calculates attention weights with respect to the entity pairs, word positions relative to these pairs, and their latent types obtained by LET. Then, the features are averaged along the time steps to produce the sentence-level features.
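The following is a minimal PyTorch sketch of how these four components could fit together. The class and argument names, the layer sizes, and the exact form of the entity-aware attention are my own simplifications (latent entity typing and relative position features are omitted), so treat it as an illustration of the idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class EntityAwareRelationClassifier(nn.Module):
    """Simplified sketch: word embeddings -> self-attention -> BiLSTM -> entity-aware attention -> classifier."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=150, num_classes=19):
        super().__init__()
        # Word Representation: token ids -> dense vectors (X)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Self Attention: contextualizes each word over the whole sentence
        self.self_attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)
        # BiLSTM: sequentially encodes the self-attended representations
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Entity-aware attention: scores each time step conditioned on both entity hidden states
        self.attn_score = nn.Linear(6 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids, e1_idx, e2_idx):
        x = self.embedding(token_ids)                      # (B, T, emb_dim)
        x, _ = self.self_attn(x, x, x)                     # self-attended word representations
        h, _ = self.bilstm(x)                              # (B, T, 2*hidden_dim)

        # Hidden states at the two entity positions
        e1 = h[torch.arange(h.size(0)), e1_idx]            # (B, 2*hidden_dim)
        e2 = h[torch.arange(h.size(0)), e2_idx]

        # Attention weights computed from each word together with both entities
        ent = torch.cat([e1, e2], dim=-1).unsqueeze(1).expand(-1, h.size(1), -1)
        scores = self.attn_score(torch.cat([h, ent], dim=-1)).squeeze(-1)   # (B, T)
        alpha = torch.softmax(scores, dim=-1)

        # Weighted average over time steps -> sentence-level feature
        z = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # (B, 2*hidden_dim)
        return self.classifier(z)

# Toy usage: batch of 2 sentences of length 8, entities at positions (1, 5) and (2, 6)
model = EntityAwareRelationClassifier(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 8))
logits = model(tokens, torch.tensor([1, 2]), torch.tensor([5, 6]))
print(logits.shape)  # torch.Size([2, 19])
```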
R-BERT
In 2019 another paper was published by Alibaba Group — Enriching Pre-trained Language Model with Entity Information for Relation Classification. Since the pretrained BERT model (Devlin et al., 2018) had not yet been applied to the relation classification task, the authors added information about the target entities to the original pretrained BERT in order to adapt it to the relation classification problem. Before the sentences are fed to the model for fine-tuning, the first target entity is wrapped with ‘$’ tokens and the second with ‘#’ tokens.
For example, after adding the markers the sentence becomes: “[CLS] The $ dog $ was vaccinated by the # veterinarian #”.
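Here is a small sketch of what that marker insertion might look like with the Hugging Face tokenizer. The naive str.replace() and the example sentence are my own toy assumptions, not the paper's preprocessing code; real preprocessing would use the known character offsets of the entities.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "The dog was vaccinated by the veterinarian"
entity1, entity2 = "dog", "veterinarian"

# Wrap the first entity with '$' and the second with '#'
marked = sentence.replace(entity1, f"$ {entity1} $").replace(entity2, f"# {entity2} #")

print(marked)
# The $ dog $ was vaccinated by the # veterinarian #

# The marker tokens now surround each entity's word pieces;
# encoding the string adds [CLS] and [SEP] as usual.
print(tokenizer.tokenize(marked))
```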
These markers identify the locations of the target entities and feed that information into BERT. The model then uses both the BERT output embeddings of the target entities and the sentence encoding (the embedding of the special first token, [CLS], in the BERT setting).
The diagram below shows the architecture of the model:
Suppose the sentence contains two target entities, Entity1 and Entity2, and the final hidden state output of the BERT module is H. The model locates Hi to Hj, the final hidden state vectors from BERT for Entity1, and Hk to Hm for Entity2. Each span of vectors is averaged to obtain a single vector representation per entity. Each average then goes through an activation (tanh) and a fully connected layer, producing H’1 and H’2 for the two entities respectively.
The fully connected layers that produce H’1 and H’2 share the same parameters, i.e., W1 = W2 and b1 = b2.
The [CLS] token goes through the same process (with its own parameters) and outputs H’0.
These three output vectors are concatenated into h’’ and passed to another fully connected layer; h’’ then goes through a softmax layer to produce the final probability distribution over the relation types.
A dropout layer is applied before each fully connected layer during training, and cross-entropy is used as the loss function.
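Below is a minimal PyTorch/Transformers sketch of this classification head. The class name, the mask-based entity averaging, and the default hyperparameters are assumptions made for illustration; the authors' implementation may differ in details.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RBERTHead(nn.Module):
    """Sketch of the R-BERT classification head described above (names are mine, not the authors')."""

    def __init__(self, num_classes=19, dropout=0.1, pretrained="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        hidden = self.bert.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        # [CLS] gets its own fully connected layer; the two entity averages
        # share one fully connected layer (W1 = W2, b1 = b2).
        self.fc_cls = nn.Linear(hidden, hidden)
        self.fc_entity = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(3 * hidden, num_classes)

    def forward(self, input_ids, attention_mask, e1_mask, e2_mask):
        # e1_mask / e2_mask: 1 at the token positions of each entity, 0 elsewhere
        H = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state  # (B, T, hidden)

        def entity_average(states, mask):
            mask = mask.unsqueeze(-1).float()
            return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

        # tanh activation, dropout, then a fully connected layer for each vector
        h0 = self.fc_cls(self.dropout(torch.tanh(H[:, 0])))                          # H'0 from [CLS]
        h1 = self.fc_entity(self.dropout(torch.tanh(entity_average(H, e1_mask))))    # H'1 for Entity1
        h2 = self.fc_entity(self.dropout(torch.tanh(entity_average(H, e2_mask))))    # H'2 for Entity2

        h_concat = torch.cat([h0, h1, h2], dim=-1)        # h''
        return self.classifier(self.dropout(h_concat))    # logits over relation types

# Training would apply nn.CrossEntropyLoss() to these logits
# (the softmax is folded into the loss), matching the description above.
```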
The final parameters were set as below:
R-BERT was evaluated on the SemEval-2010 Task 8 dataset, which contains 9 semantic relation types (Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer) and 1 artificial relation type, ‘Other’. This dataset contains regular, manually annotated data rather than distantly supervised data (i.e., no noisy labels). The final macro-averaged F1-score they achieved is 89.25.
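For intuition, here is a toy illustration of macro-averaged F1 with the ‘Other’ class excluded, using scikit-learn. The labels and predictions are invented, and the official SemEval-2010 Task 8 scorer additionally takes relation directionality into account, so this only approximates the metric.

```python
from sklearn.metrics import f1_score

labels = ["Cause-Effect", "Component-Whole", "Other"]
y_true = ["Cause-Effect", "Other", "Component-Whole", "Cause-Effect"]
y_pred = ["Cause-Effect", "Cause-Effect", "Component-Whole", "Other"]

# 'Other' is not scored; F1 is averaged over the remaining relation classes
relation_labels = [label for label in labels if label != "Other"]
print(f1_score(y_true, y_pred, labels=relation_labels, average="macro"))
```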
End Notes
CNNs and RNNs used to be the leading methods for solving relation extraction tasks. Lately, the pre-trained BERT model has achieved very strong results on many NLP classification tasks. Many new variants of BERT have come out and become the state of the art for multiple NLP tasks, including the relation classification task.
References
Wu, S., & He, Y. (2019). Enriching Pre-trained Language Model with Entity Information for Relation Classification. https://arxiv.org/pdf/1905.08284.pdf