SCALABLE AND ACCURATE HIGH-DIMENSIONAL SIMILARITY SEARCH: FROM DATA SERIES TO DEEP NETWORK EMBEDDINGS

dc.contributor.authorECHIHABI Karima
dc.date.accessioned2023-03-31T14:00:37Z
dc.date.accessioned2025-12-15T12:17:21Z
dc.date.available2023-03-31T14:00:37Z
dc.date.issued2020-07-28
dc.description.abstractThe world is drowning in a big data tsunami of high-dimensional objects that need to be analyzed in order to identify useful patterns and extract new knowledge in domains as varied as agriculture, medicine, cybersecurity, seismology, astrophysics, manufacturing, finance, and others. In response to these needs, it is imperative to build analytical systems that truly support interactive exploration on datasets containing terabytes of high-dimensional objects, with dimensions reaching hundreds to thousands. A fundamental and challenging operation called similarity search is the main bottleneck of many critical data processing tasks such as data cleaning, data integration and big data analytics (e.g., outlier detection, frequent pattern mining, clustering, and classification). A number of exact and approximate approaches have been proposed in the literature to support similarity search over massive data series collections. In this thesis, we unify and formally define the terminology used for the different flavors of the similarity search problem. We present a similarity search taxonomy that classifies methods based on the quality guarantees they provide for the search results, and that unifies the varied nomenclature used in the literature. Following this taxonomy, we include a survey of similarity search approaches supporting exact and approximate search, bringing together works from the data series and multidimensional data research communities. We propose extensions to existing data series indexes that can answer approximate queries with guarantees and that outperform popular state-of-the-art techniques such as LSH, kNN graphs and quantization-based inverted indexes in many scenarios. We also design and conduct the two most exhaustive experimental evaluations in the field covering both exact and approximate techniques. Building upon the deep insights gained from both studies, we propose Hercules, a new algorithm that outperforms the state-of-the-art similarity search approaches in-memory and on-disk. Our work has far-reaching fundamental and practical implications. We demonstrate that it is possible to design efficient high-dimensional vector similarity search algorithms.with theoretical guarantees on the quality of the answers, and we thus offer a more promising alternative to the two current trends in the literature: (i) LSH-based algorithms that support guarantees, but are relatively slow, and (ii) kNN graphs and inverted indexes, which are relatively fast, but do not provide theoretical guarantees. This finding paves the way for very exciting new developments which will lead to efficient solutions that can support critical analytical tasks such as brain seizure detection, cyber-attack prevention, transportation management and data cleaning automationfr_FR
dc.description.collaboratorBounabat, Bouchaib (Président)
dc.description.collaboratorBenbrahim, Houda (Directrice de thèse)
dc.description.collaboratorPalpanas, Themis (Co-Encadrant de thèse)
dc.description.collaboratorRamdan, Mohammed (Rapporteur)
dc.description.collaboratorLazaar, Mohamed (Rapporteur)
dc.description.collaboratorBakkoury, Zohra (Rapportrice)
dc.description.collaboratorZbakh, Mostapha (Examinateur)
dc.identifier.urihttps://toubkalpreprod.imist.ma/handle/123456789/16394
dc.language.isoenfr_FR
dc.publisherUniversité Mohamed V, Ecole Nationale Supérieur d'Informatique et d'Analyse des Systèmes , Rabatfr_FR
dc.relation.ispartofseries57/2023;
dc.subjectInformatiquefr_FR
dc.subjectHigh-dimensionalfr_FR
dc.subjectData seriesfr_FR
dc.subjectMultidimensional data research communitiesfr_FR
dc.subjectAlgorithmfr_FR
dc.titleSCALABLE AND ACCURATE HIGH-DIMENSIONAL SIMILARITY SEARCH: FROM DATA SERIES TO DEEP NETWORK EMBEDDINGSfr_FR

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
THESE_ECHIHABI.pdf
Size:
7.28 MB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
1.71 KB
Format:
Plain Text
Description: