Vol.13, No.3, August 2024. ISSN: 2217-8309

eISSN: 2217-8333

 

TEM Journal

 

TECHNOLOGY, EDUCATION, MANAGEMENT, INFORMATICS

Association for Information Communication Technology Education and Science


Low-Complexity and Secure Clustering-Based Similarity Detection for Private Files

 

Duaa Fadhel Najem, Nagham Abdulrasool Taha, Zaid Ameen Abduljabbar, Vincent Omollo Nyangaresi, Junchao Ma, Dhafer G. Honi

 

© 2024 Zaid Ameen Abduljabbar, published by UIKTEN. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. (CC BY-NC-ND 4.0)

 

Citation Information: TEM Journal. Volume 13, Issue 3, Pages 2341-2349, ISSN 2217-8309, DOI: 10.18421/TEM133-61, August 2024.

 

Received: 23 January 2024.

Revised: 11 May 2024.
Accepted: 18 May 2024.
Published: 27 August 2024.

 

Abstract:

 

Detecting similarity between files is required in many practical applications, such as copyright protection, file management, plagiarism detection, and detecting duplicate submissions of scientific articles to multiple journals or conferences. Existing methods do not take file privacy into consideration, which prevents their use in many sensitive situations, for example when the files of two intellectual agencies must be compared for similarities while remaining secured. Over the last few years, encryption protocols have been developed with the aim of detecting similar files without compromising privacy. However, existing protocols tend to leak important data and do not achieve low complexity. This paper addresses the problem of computing the similarity between two file collections belonging to two entities that wish to keep their contents private. We propose a clustering-based approach that achieves 90% accuracy while significantly reducing execution time. The protocols presented in this study are much more efficient than other secure protocols, which are slower at similarity detection for large file sets. Our system achieves a high level of security by using a vector space model to convert the files into vectors and by applying Paillier encryption to each element of a vector separately, to protect privacy. The Porter algorithm is applied to the vocabulary set. Using a secure cosine similarity approach, a similarity score is computed for similar files, and the index of the similarity scores is returned to the other party, rather than the similar files themselves. The system is strengthened by clustering the files with the k-means technique, which makes it more efficient for large file sets.
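
The core idea of the abstract (term vectors encrypted element by element with Paillier, then combined into a cosine similarity score) can be illustrated with a minimal sketch. This is not the authors' protocol: the key size is a toy value, Porter stemming and k-means clustering are omitted, the vector norms are assumed to be exchanged in the clear, and all names (`keygen`, `encrypt`, `decrypt`, `tf_vector`, `encrypted_dot`) are hypothetical helpers introduced only for illustration.

```python
import math
import secrets
from collections import Counter

# Toy parameters: two Mersenne primes keep the demo fast.
# They are far too small for real use (assumption for illustration only).
P, Q = 2**31 - 1, 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public n, private (lam, mu)) for Paillier with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam mod n (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m * n.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def tf_vector(text, vocab):
    """Vector space model: term-frequency vector over a shared vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


def encrypted_dot(n, enc_a, b):
    """Party B combines A's ciphertexts with its own plaintext weights."""
    n2 = n * n
    acc = encrypt(n, 0)  # fresh encryption of 0 re-randomises the result
    for c, w in zip(enc_a, b):
        acc = acc * pow(c, w, n2) % n2  # Enc(x)^w = Enc(w * x); product adds
    return acc


doc_a = "secure file similarity detection"
doc_b = "private file similarity protocol"
vocab = sorted(set(doc_a.split()) | set(doc_b.split()))
vec_a, vec_b = tf_vector(doc_a, vocab), tf_vector(doc_b, vocab)

n, priv = keygen()
enc_a = [encrypt(n, x) for x in vec_a]      # A sends ciphertexts to B
enc_dot = encrypted_dot(n, enc_a, vec_b)    # B returns one ciphertext to A
dot = decrypt(n, priv, enc_dot)             # A decrypts the dot product

# Simplification: norms are computed on plaintext vectors here.
norm_a = math.sqrt(sum(x * x for x in vec_a))
norm_b = math.sqrt(sum(x * x for x in vec_b))
score = dot / (norm_a * norm_b)
print(dot, score)  # 2 0.5 for these two toy documents
```

In this sketch party B never sees A's term frequencies, only their ciphertexts, and A sees only the aggregated dot product. A clustering step such as k-means would, under the approach described above, reduce how many such encrypted comparisons are needed for large file sets.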

 

Keywords – File similarity, privacy, similarity detection.

 


 


Copyright © 2024 UIKTEN
Copyright licence: All articles are licensed under the Creative Commons CC BY-NC-ND 4.0 licence