Thesis Information
- Material type
- Doctoral dissertation
- Author information
- Advisor
- 오희국
- Publication year
- 2022
- Copyright
- Hanyang University theses are protected by copyright.
Abstract · Keywords
Patch analysis is a reverse engineering technique for exploring the patched content in binary executables, and it is traditionally used in applications such as vulnerability discovery and 1-day exploit generation. For patch analysis at the binary level, we need to compare two different versions of a binary executable and find the functions that were patched or modified, while filtering out the unpatched functions. In reverse engineering, binary diffing is the process of discovering the differences and similarities in functionality between two binary programs, and it is traditionally considered the best choice for patch analysis. Previous research on binary diffing approaches it as a function matching problem: an initial 1:1 mapping between functions is formulated, and a sequence matching ratio is then computed to classify two functions as an exact match (unpatched), a partial match (patched), or no-match (error/new functions). In our empirical analysis of patch analysis, we discovered that the accuracy of existing techniques is best only when detecting exact matches; they are not efficient at detecting partially changed functions, especially those with minor patches such as CWE-478 and CWE-476.
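The classification step that prior tools derive from a sequence matching ratio can be sketched with Python's `difflib`. This is a minimal illustration of the traditional approach described above, not any specific tool's implementation; the `EXACT` and `PARTIAL` thresholds are assumed values chosen for the example.

```python
import difflib

# Illustrative thresholds (assumptions for this sketch, not from any tool).
EXACT, PARTIAL = 1.0, 0.5

def classify_pair(func_a: list[str], func_b: list[str]) -> str:
    """Classify two 1:1-mapped functions by the similarity ratio of
    their instruction sequences, the way traditional diffing does."""
    ratio = difflib.SequenceMatcher(None, func_a, func_b).ratio()
    if ratio >= EXACT:
        return "exact match"    # unpatched
    if ratio >= PARTIAL:
        return "partial match"  # likely patched
    return "no-match"           # error / new function

old = ["push rbp", "mov rbp, rsp", "mov eax, 0", "pop rbp", "ret"]
new = ["push rbp", "mov rbp, rsp", "mov eax, 1", "pop rbp", "ret"]
print(classify_pair(old, old))  # exact match
print(classify_pair(old, new))  # partial match (ratio 0.8)
```

A one-instruction change drops the ratio to 0.8, so the pair lands in the partial-match bucket; as the abstract argues next, this ratio alone cannot say whether the change is a patch or compiler noise.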
The drawbacks of existing research stem from two major challenges. (i) In the 1:1 mapping phase, a strict policy is used to match function features. Existing research defines a set of heuristics and applies them to match functions in a sequential manner. Some heuristics are over-trusted, and prioritizing them produces many false matching results. (ii) In the classification phase, an assembly snippet is treated as normal text, and a sequence matching algorithm is used for similarity comparison. Instructions have a unique structure: mnemonics and registers occupy specific positions within an instruction and also carry a semantic relationship, which makes assembly code different from general text. In our empirical analysis, we discovered that small instruction-level changes, whether caused by a patch or by compiler-introduced randomness, look much the same to a sequence matching algorithm. Sequence matching performs best on general text, but it fails to detect structural and semantic changes at the instruction level; thus, its use for classification produces many false results.
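The point that a sequence matcher cannot separate compiler noise from a real patch can be shown with a toy example. The instruction sequences and whitespace tokenization below are invented for illustration: a benign register rename touches more tokens, and therefore scores as more changed, than a security-relevant inversion of a branch condition.

```python
import difflib

def token_ratio(a: list[str], b: list[str]) -> float:
    """Text-style similarity over instruction tokens, ignoring the
    positional/semantic structure of the instructions."""
    ta = [t for ins in a for t in ins.replace(",", "").split()]
    tb = [t for ins in b for t in ins.replace(",", "").split()]
    return difflib.SequenceMatcher(None, ta, tb).ratio()

base    = ["mov eax, [rbp-8]", "test eax, eax", "je 0x40", "call free"]
# Compiler-introduced randomness: register renamed, semantics unchanged.
renamed = ["mov ecx, [rbp-8]", "test ecx, ecx", "je 0x40", "call free"]
# A real patch: the branch condition is inverted.
patched = ["mov eax, [rbp-8]", "test eax, eax", "jne 0x40", "call free"]

print(token_ratio(base, renamed))  # 0.7 -- benign change, lower score
print(token_ratio(base, patched))  # 0.9 -- semantic change, higher score
```

The semantically meaningless rename scores 0.7 while the behavior-altering patch scores 0.9, so a ratio threshold misorders exactly the cases patch analysis cares about.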
In this dissertation, we address the aforementioned underlying challenges with a two-fold solution. First, for the 1:1 mapping phase, we empirically analyzed the heuristics in Diaphora, an industry-standard tool, discovered the drawbacks of each heuristic, and proposed a set of computationally inexpensive feature vectors, which are then compared with a distance-based selection criterion to map similar functions and filter out unmatched functions. Second, for the classification phase, we proposed a Siamese binary-classification neural network in which each branch is an attention-based distributed-learning embedding neural network: it learns the semantic similarity among assembly instructions and learns to highlight the actual changes at the instruction level, while a final-stage fully connected layer learns to accurately classify two 1:1-mapped functions as either an exact or a partial match. The proposed neural network is sophisticated enough to differentiate between compiler-caused and patch-based changes at the instruction level. Finally, we proposed an efficient neural-network-assisted binary diffing algorithm that integrates our proposed 1:1 mapping phase and classification phase. The proposed binary diffing algorithm accurately classifies two binary functions as an exact match, a partial match, or no-match.
We thoroughly evaluated the proposed feature vectors, the different design choices, and the parameters of the neural network. For training, we used x86 XNU kernel binaries, and we evaluated the proposed neural-network-assisted binary diffing algorithm on kernel binaries not included in training and on the CWE dataset. We achieved ∼99% classification accuracy, which is higher than that of existing binary diffing techniques and tools.
Table of Contents
- 1 Introduction
- 1.1 Overview: Patch Analysis and Binary Diffing
- 1.1.1 Binary Diffing
- 1.2 Motivation
- 1.2.1 Problem Definition
- 1.2.2 Assumptions
- 1.3 Contributions
- 1.4 Organization of Dissertation
- 2 Background and Preliminaries
- 2.1 Function Matching and Binary Diffing
- 2.2 Function Matching Related Work
- 2.2.1 Binary Diffing
- 2.2.2 Binary Code Clones
- 2.2.3 Deep Learning
- 2.3 Binary Diffing Tool Analysis - Diaphora
- 3 Feature Engineering - 1:1 Mapping Phase
- 3.1 Tally Vector
- 3.2 Edge Type Vector
- 3.3 Vertex Type Vector
- 3.4 Vertex Degree Vector
- 3.5 Digraph Dominance Relationship (DDR)
- 3.5.1 Piecewise Hashing
- 3.5.2 Projection-based Hashing
- 3.6 Opcode Vector
- 3.7 Assembly Embedding Representation
- 3.7.1 An Instruction Embedding Model
- 3.7.2 Function Modeling
- 3.8 Proposed Function Matching (FMA)
- 3.8.1 Features Encoding
- 3.8.2 Representation Vector Matching
- 3.8.3 Function Matching Algorithm
- 4 Attention-based Siamese Binary Classification Neural Network
- 4.1 Modeling Assembly Functions
- 4.1.1 Assembly Formatting
- 4.1.2 Assembly Representation
- 4.1.3 Dataset Collection
- 4.2 Proposed Learning Model
- 4.2.1 Assembly as a Bag of Instructions
- 4.2.2 Attention Model
- 4.2.3 Siamese Binary-Classification Model
- 4.2.4 Training
- 4.2.5 Utility
- 4.3 Diffing Algorithm
- 4.4 Design Decisions and Limitations
- 4.4.1 Oneshot vs Sequential
- 4.4.2 Expressiveness of the Embedding Layers
- 4.4.3 Granularity of Operands Tokenization
- 4.4.4 Distance Function
- 5 Empirical Evaluation
- 5.1 Test Environment
- 5.2 Empirical Evaluation - 1:1 Mapping Phase
- 5.3 Empirical Evaluation - Classification Phase
- 5.3.1 RQ1a: Training Accuracy
- 5.3.2 RQ1b: Prediction Accuracy
- 5.3.3 RQ2: Alternative Designs Comparison
- 5.3.4 RQ3: Effect of Attention Mechanism
- 5.3.5 RQ4: Comparison with Binary Diffing Baselines
- 5.3.6 RQ5: Evaluation for CWEs Binary Dataset
- 6 Case Studies
- 6.1 Case Study - CVE-2019-8605
- 7 Conclusions
- 7.1 Summary
- 7.2 Precautions for Using Neural Network
- 7.2.1 Parameter l
- 7.2.2 Architecture
- 7.2.3 Optimizations
- 7.3 Importing to Register-based Architectures
- 7.3.1 1:1 Mapping Phase
- 7.3.2 Classification Phase
- 7.4 Importing to Stack-based Architectures