This research aims to determine the percentages associated with each dialect and to look for clusters among the dialects. Terms are weighted with tf-idf after four preprocessing steps: tokenizing, case transformation, stopword filtering, and token filtering. Clusters are then sought with the k-means clustering method in RapidMiner. The tf-idf weighting gives the Ginger dialect 37.5% for the number of word occurrences and 62.5% for the total of all words documented, and the Julu dialect the same values (37.5% and 62.5%). The Singaporean Lau, Teruh Deleng, and Liang Melas dialects each account for 38% of the number of word occurrences and 62% of the total number of words documented. K-means clustering of the 100-item sample produces five clusters: cluster 0 with 68 items, cluster 1 with 3 items, cluster 2 with 15 items, cluster 3 with 10 items, and cluster 4 with 4 items. The conclusion is that the Ginger and Julu dialects are identical to each other, and that the Singaporean Lau, Teruh Deleng, and Liang Melas dialects are likewise identical to one another.
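
The preprocessing, weighting, and clustering steps named above can be illustrated with a minimal Python sketch. It is not the RapidMiner process used in this research: the stopword list, the minimum token length, the function names, and the textbook tf-idf variant (relative term frequency times the natural log of inverse document frequency) are all assumptions made for illustration, and RapidMiner's own operators may normalize the weights differently.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stopword list; the list used in the study is not given.
STOPWORDS = {"the", "a", "and", "of"}

def preprocess(text, min_len=2):
    """Tokenize, lowercase, drop stopwords, then drop very short tokens."""
    tokens = re.findall(r"[^\W\d_]+", text)              # tokenizing (letters only)
    tokens = [t.lower() for t in tokens]                 # transform cases
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword filter
    return [t for t in tokens if len(t) >= min_len]      # token filter (by length)

def tfidf(docs):
    """Return (vocabulary, tf-idf vectors) for a list of raw text documents."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Under these assumptions, the word entries would be passed in as a list of strings, for example `vocab, vectors = tfidf(documents)` followed by `labels = kmeans(vectors, k=5)`, where `documents` is a hypothetical name for the collected data and `k=5` matches the five clusters reported above.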
