BERT + KorQuAD
BERT
- https://github.com/google-research/bert (the BERT open-source repository)
- The GitHub repo provides:
- TensorFlow code for the BERT model (BERT-Base, BERT-Large)
- Pre-trained checkpoints for both the lowercase and cased version of BERT-Base and BERT-Large from the paper.
- TensorFlow code for push-button replication of the most important fine-tuning experiments from the paper, including SQuAD, MultiNLI, and MRPC.
- The BERT GitHub README describes the model as follows:
- first unsupervised, deeply bidirectional system for pre-training NLP
<Unsupervised>
: It can take a plain text corpus as input, which means the large amounts of text data available on the web, in many languages, can be used.
<Deeply Bidirectional System>
- BERT was built upon recent work in pre-training contextual representations. (e.g.) Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit
- but crucially these models are all unidirectional or shallowly bidirectional. This means that each word is only contextualized using the words to its left (or right).
- BERT is deeply bidirectional: to represent "bank", for example, it looks at the entire sentence "I made a --- deposit".
- (cf.) context-free embeddings (ignore the surrounding context) vs. contextual embeddings
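A minimal sketch of what "unidirectional vs. deeply bidirectional" means in terms of attention masks. This is not code from the BERT repo; the tokens and masks below are purely illustrative.

```python
# Minimal sketch (not from the BERT repo): the attention-mask difference between
# a unidirectional model and a deeply bidirectional encoder.
import numpy as np

tokens = ["I", "made", "a", "bank", "deposit"]
n = len(tokens)

# Unidirectional (left-to-right): token i may only attend to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=int))

# Deeply bidirectional (BERT encoder): every token attends to every position.
bidirectional_mask = np.ones((n, n), dtype=int)

print(causal_mask)         # lower-triangular: "bank" never sees "deposit"
print(bidirectional_mask)  # all ones: "bank" is contextualized by the whole sentence
```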
- How is this possible? By using a <Mask>!
- mask out 15% of the words in the input, run the entire sequence through a deep bidirectional Transformer encoder, and then predict only the masked words.
- On top of masking, Next Sentence Prediction (NSP) looks at the relationship between sentences (sketched below).
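A simplified, illustrative sketch of how the two pre-training inputs could be built. This is not the repo's create_pretraining_data.py (the real pipeline also applies an 80%/10%/10% [MASK]/random/keep rule to the selected tokens); the helper names mask_tokens and make_nsp_pair are made up for this sketch.

```python
# Simplified sketch of the two pre-training tasks: Masked LM + Next Sentence Prediction.
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, mask_prob=0.15):
    """Masked LM: hide ~15% of the tokens; the model predicts only these."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)      # prediction target
        else:
            masked.append(tok)
            labels.append(None)     # not predicted
    return masked, labels

def make_nsp_pair(sent_a, sent_b, corpus_sentences):
    """NSP: 50% of the time the real next sentence, 50% a random sentence."""
    if random.random() < 0.5:
        return [CLS] + sent_a + [SEP] + sent_b + [SEP], "IsNext"
    random_b = random.choice(corpus_sentences)
    return [CLS] + sent_a + [SEP] + random_b + [SEP], "NotNext"

sent_a = ["I", "made", "a", "bank", "deposit"]
sent_b = ["the", "teller", "was", "friendly"]
corpus = [["an", "unrelated", "sentence"], ["another", "one"]]

print(mask_tokens(sent_a))
print(make_nsp_pair(sent_a, sent_b, corpus))
```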
With these pre-training objectives (Masked LM + NSP),
a large model (a 12- to 24-layer Transformer) is trained on a large corpus for a long time,
and then ====>> BERT is born!
** Using BERT has two stages: Pre-training and Fine-tuning
--------------------
<BERT on KorQuAD>
- KorQuAD files
Download the train set, dev set, and evaluation script from KorQuAD v1.0
- Pretrained files
(checkpoint of the pretrained BERT model)
- bert_model.ckpt.meta
- bert_model.ckpt.index
- bert_model.ckpt.data
- vocab
- bert_config
* Replace the pretrained files with a smaller Korean BERT model - https://github.com/MrBananaHuman/KoreanCharacterBert
(3 hidden layers, syllable-level tokenization, vocabulary size of 7,000)
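A rough sketch of what syllable-level tokenization means here. The toy vocabulary, syllable_tokenize, and convert_to_ids below are made up for illustration and are not taken from the linked repo.

```python
# Illustrative sketch of syllable (character)-level tokenization for Korean text.
def syllable_tokenize(text):
    """Split text into individual syllables/characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

# Toy vocabulary; the real smaller Korean BERT uses ~7,000 entries.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
         "한": 5, "국": 6, "어": 7}

def convert_to_ids(tokens):
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = syllable_tokenize("한국어 질문")
print(tokens)                  # ['한', '국', '어', '질', '문']
print(convert_to_ids(tokens))  # syllables not in the vocab map to [UNK]
```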
- bert_files
modeling, optimization, run_squad
tokenization (use the tokenization module provided with the smaller model)
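KorQuAD v1.0 is distributed in the SQuAD 1.1 JSON format, so run_squad can read it directly. Below is a rough sketch of wiring the files above into run_squad; the flag names follow the SQuAD example in the BERT README, while the directory layout, KorQuAD file names, and hyperparameter values are assumptions to adjust for your own setup.

```python
# Rough sketch: launch run_squad.py with the files listed above.
# Flag names follow the SQuAD example in the google-research/bert README;
# the paths, file names, and hyperparameters below are assumptions for illustration.
import subprocess

BERT_DIR = "korean_character_bert"   # smaller Korean BERT checkpoint + vocab + config
KORQUAD_DIR = "korquad"              # KorQuAD v1.0 train/dev JSON files

subprocess.run([
    "python", "run_squad.py",
    f"--vocab_file={BERT_DIR}/vocab.txt",
    f"--bert_config_file={BERT_DIR}/bert_config.json",
    f"--init_checkpoint={BERT_DIR}/bert_model.ckpt",
    "--do_train=True",
    f"--train_file={KORQUAD_DIR}/KorQuAD_v1.0_train.json",
    "--do_predict=True",
    f"--predict_file={KORQUAD_DIR}/KorQuAD_v1.0_dev.json",
    "--max_seq_length=384",
    "--doc_stride=128",
    "--train_batch_size=12",
    "--learning_rate=3e-5",
    "--num_train_epochs=2.0",
    "--output_dir=output/korquad",
], check=True)
```

The resulting dev-set predictions can then be scored with the KorQuAD evaluation script.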
Transformer: "Attention Is All You Need" - https://arxiv.org/abs/1706.03762
BERT: "Pre-training of Deep Bidirectional Transformers for Language Understanding" - https://arxiv.org/pdf/1810.04805.pdf