End-to-End Spoken Language Understanding Using Speech-Text Cross-modal Knowledge Distillation

김성빈

Document type: Thesis

Author: 김성빈 (Inha University, Inha University Graduate School)

Advisor: 이상민

Year of publication: 2021

Copyright: Inha University theses are protected by copyright.


Abstract and keywords

Spoken language understanding (SLU) aims to recognize a speaker's command or intent from spoken audio. Traditional SLU systems combine automatic speech recognition (ASR), which transcribes speech into text, with natural language understanding (NLU), which classifies the intent or command from the transcript. The drawback of this pipeline is that ASR errors propagate to the NLU stage, so overall performance depends heavily on recognition accuracy. End-to-end SLU, which recognizes intent directly from speech without an intermediate transcript, has therefore been studied actively in recent years, but it is not yet mature enough to replace the conventional pipeline. End-to-end SLU also requires data that pairs speech with intents or commands, and such data remains scarce because it is expensive to build.
To address these problems, this thesis extracts speech features with the self-supervised models vq-wav2vec and RoBERTa, and proposes applying speech-text cross-modal knowledge distillation in both the pre-training stage and the fine-tuning stage for intent classification in order to improve intent classification performance. To compensate for the limited amount of data, it also proposes data augmentation through masking to improve end-to-end SLU performance.
To verify the proposed methods, we evaluate them on the Fluent Speech Commands (FSC) dataset, which is widely used for SLU evaluation, and on a subset containing only 10% of its training data to simulate a low-resource setting. To check whether the methods carry over to other datasets, we also run experiments on a speech version of the Snips (Snips-NLU) dataset, an NLU benchmark, created with speech synthesis, and on the Smartlights (Snips-SLU) dataset. The proposed methods prove effective in all of these settings. On FSC in particular, the model reaches 99.7% accuracy on the intent classification task, the highest result reported at the time of writing.
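The two ingredients named in the abstract, knowledge distillation and masking-based augmentation, can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that a text-side model acts as the teacher and a speech-side model as the student, that distillation uses the common temperature-scaled soft-target loss, and that masking replaces short spans of discrete input tokens. The function names, `temperature`, `alpha`, `span_length`, and `mask_prob` are all hypothetical choices made for illustration.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    Assumption: the alpha weighting and T^2 scaling follow the widely
    used soft-target formulation, not values taken from the thesis.
    """
    # Hard-label term: cross-entropy of the student at temperature 1.
    hard = -log_softmax(student_logits)[label]
    # Soft-target term: KL(teacher || student) at the given temperature.
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    soft = sum(math.exp(lt) * (lt - ls)
               for lt, ls in zip(log_p_teacher, log_p_student))
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft


def mask_spans(tokens, mask_token, span_length=2, mask_prob=0.15, rng=None):
    """Return a copy of tokens with randomly chosen spans masked.

    Assumption: each position starts a masked span with probability
    mask_prob; the thesis may select spans differently.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    for start in range(len(tokens)):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span_length, len(tokens))):
                masked[i] = mask_token
    return masked


if __name__ == "__main__":
    # Hypothetical intent logits over three classes.
    teacher = [3.0, 0.5, -1.0]   # text-side teacher (assumed role)
    student = [1.0, 0.8, 0.2]    # speech-side student (assumed role)
    print(distillation_loss(student, teacher, label=0))

    # Hypothetical discrete speech tokens; -1 stands in for a mask id.
    print(mask_spans([11, 42, 42, 7, 93, 5], mask_token=-1,
                     rng=random.Random(0)))
```

When the student and teacher logits agree, the soft-target term vanishes and only the weighted cross-entropy remains, which gives a quick sanity check on any implementation of this loss.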

#spoken language understanding #self-supervised learning #vq-wav2vec #BERT

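Table of contents

Chapter 1 Introduction
Chapter 2 Background
2.1 Conventional spoken language understanding
2.2 End-to-end spoken language understanding
2.3 Self-supervised speech feature extraction and knowledge distillation
2.3.1 wav2vec
2.3.2 vq-wav2vec
2.3.3 Knowledge distillation
Chapter 3 Proposed Method
3.1 Structure of the proposed end-to-end spoken language understanding model
3.2 Speech-text cross-modal knowledge distillation
3.2.1 Knowledge distillation in the pre-training stage
3.2.2 Knowledge distillation in the fine-tuning stage
3.3 Acoustic model pre-training
3.4 Data augmentation
3.4.1 Data augmentation through input masking
3.4.2 Data augmentation through time-channel embedding masking
Chapter 4 Experiments and Results
4.1 Experimental setup
4.2 Datasets
4.2.1 FSC dataset
4.2.2 Snips dataset
4.2.3 Smartlights dataset
4.3 Results
4.3.1 Intent classification results on the FSC dataset
4.3.2 Intent classification results on the Snips dataset
4.3.3 Intent classification results on the Smartlights dataset
Chapter 5 Conclusion
References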
