JARO WINKLER ALGORITHM FOR MEASURING SIMILARITY ONLINE NEWS

  • Teguh Efriyanto, Program Studi Informatika, Fakultas Ilmu Komputer, Universitas Amikom Yogyakarta, Indonesia
  • Mardhiya Hayaty, Program Studi Informatika, Fakultas Ilmu Komputer, Universitas Amikom Yogyakarta, Indonesia
Keywords: jaro winkler, online news, plagiarism, similarity, text preprocessing

Abstract

Online news is a source of information for the public, and journalists, as news writers, must find news information quickly and accurately every day. A journalist may plagiarise other journalists' work, or take material from other news media sites and publish it without citing the source. An algorithm is therefore needed to measure the similarity of online news. This work proposes the Jaro Winkler algorithm, with the computed value normalised so that 0 means no resemblance and 1 means an exact match. The data were collected from 20 online news media sites in the Central Kalimantan area. The scraping process used the Custom Search JSON API with keywords to retrieve news on the same topic. The Jaro Winkler calculation yielded an average online news similarity of 74.49%, with 43 news items at a severe plagiarism level and 12 news items at a moderate plagiarism level. The Jaro Winkler algorithm showed weaknesses when calculating similarity values on the collected data: some items were misclassified, in that items that should have been rated at a heavy plagiarism level were not detected as such, and vice versa.
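The normalised score described in the abstract can be illustrated with a minimal Python sketch of the standard Jaro Winkler similarity. This is not the authors' implementation: the abstract does not state their preprocessing steps or parameters, so the prefix scale `p = 0.1` and the prefix cap of 4 characters are the conventional Winkler defaults and are assumptions here, as are the function names and sample strings.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 0 means no resemblance, 1 an exact match."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they lie within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix.

    p and max_prefix are the conventional defaults (assumed, not from the paper).
    """
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the study's data.
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Multiplying the returned value by 100 gives a percentage on the same scale as the average similarity reported in the abstract. How a score is mapped to a plagiarism level depends on thresholds that the abstract does not give, so none are assumed here.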

Published
2022-08-20
How to Cite
[1]
T. Efriyanto and M. Hayaty, “JARO WINKLER ALGORITHM FOR MEASURING SIMILARITY ONLINE NEWS”, J. Tek. Inform. (JUTIF), vol. 3, no. 4, pp. 975-982, Aug. 2022.