Application of Abstract Syntax Tree and the Damerau-Levenshtein Distance Algorithm to Detect Plagiarism in Source Code Files

Stephanie Rusdianto; Ria Chaniago

doi:10.61769/telematika.v13i2.236

Authors

Stephanie Rusdianto Institut Teknologi Harapan Bangsa
Ria Chaniago Institut Teknologi Harapan Bangsa

DOI:

https://doi.org/10.61769/telematika.v13i2.236

Keywords:

source code plagiarism, grammar, lexer rule, parser rule, tree-based, abstract syntax tree, Damerau-Levenshtein distance algorithm

Abstract

Source code plagiarism occurs when a program is derived from another program and shares its syntactic structure. This research detects plagiarism with a tree-based approach: an abstract syntax tree (AST) is built for each of two suspected source code files according to a purpose-designed grammar. The Damerau-Levenshtein Distance algorithm then computes the minimum distance between the resulting tree structures, from which a similarity percentage is derived. Beforehand, the application computes a threshold as the mean similarity of plagiarized file pairs minus their standard deviation; this threshold decides whether the two input files are declared plagiarized. The research analyzes which grammar performs best (lexer rules only, or a combination of lexer and parser rules), which combination of preprocessing steps performs best, and which distance value performs best in the Damerau-Levenshtein Distance algorithm. In the tests performed, the grammar combining lexer and parser rules gave the highest accuracy, 97.435%, with a running time of 118,115 seconds and a threshold of 88.2314%. Among the preprocessing combinations, the highest accuracy of 97.435% was reached either with all available preprocessing steps or with comment preprocessing alone. The best distance value was 4, also with an accuracy of 97.435%.
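The pipeline summarized in the abstract (flatten a syntax tree, measure an edit distance, convert it to a similarity percentage, compare against a threshold of mean minus standard deviation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Python's built-in `ast` module stands in for the grammar-generated parser. The distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, since the abstract does not say which variant was used. The normalization by the longer sequence, the population standard deviation, the `>=` comparison, and all function names are hypothetical choices made for this example.

```python
import ast
from statistics import mean, pstdev


def node_sequence(source):
    """Flatten a Python AST into a pre-order list of node type names.

    Assumption: Python's `ast` module replaces the grammar-based parser
    described in the abstract.
    """
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)
    return list(visit(ast.parse(source)))


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent items. Works on any two sequences of comparable items.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent items.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity_percent(a, b):
    """Similarity in percent; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - osa_distance(a, b) / longest)


def threshold(similarities):
    """Mean similarity of plagiarized pairs minus their standard deviation.

    Assumption: population standard deviation (pstdev).
    """
    return mean(similarities) - pstdev(similarities)


def is_plagiarism(source_a, source_b, limit):
    """Flag a pair when its similarity reaches the threshold (assumed >=)."""
    return similarity_percent(node_sequence(source_a),
                              node_sequence(source_b)) >= limit
```

Because the comparison runs on node types rather than raw text, renaming an identifier leaves the sequence unchanged: `node_sequence("x = 1 + 2")` and `node_sequence("total = 1 + 2")` are identical, so the pair scores 100% similarity in this sketch.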
