Usage

from PolyDeDupe import deduplicate_dataset, display_dataset_entries
from datasets import load_dataset

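dataset = load_dataset("tatsu-lab/alpaca", split="train")
# Group near-duplicate entries using a Jaccard similarity threshold of 0.90
ds_dedup, duplicate_clusters = deduplicate_dataset(dataset, jaccard_threshold=0.90)
# Print the entries of each duplicate cluster, indexed against the original dataset
display_dataset_entries(dataset, duplicate_clusters)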

Output:

Original dataset size: 52002
Number of duplicate clusters: 40
Files in duplicate cluster: 82
Unique files in duplicate cluster: 52
Filtered dataset size: 51972
Cluster:
Base Index: 1482, Data: {'instruction': 'Find the five largest cities in France.', 'input': '', 'output': 'The five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nFind the five largest cities in France.\n\n### Response:\nThe five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
Base Index: 1820, Data: {'instruction': 'Name five cities in France.', 'input': '', 'output': 'The five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nName five cities in France.\n\n### Response:\nThe five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
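
To give a feel for what a Jaccard threshold of 0.90 means, the sketch below computes an exact token-set Jaccard similarity for the two 'output' strings in the cluster above. This is an illustration only: the `token_set` and `jaccard` helpers are hypothetical, and it is an assumption that a simple lowercase word split is a fair stand-in for however PolyDeDupe tokenizes and compares entries internally.

```python
import re


def token_set(text):
    # Illustrative tokenizer (assumption): lowercase, split on non-word runs.
    return {tok for tok in re.split(r"\W+", text.lower()) if tok}


def jaccard(a, b):
    # Jaccard similarity = |intersection| / |union| of the two token sets.
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


out_a = ("The five largest cities in France are Paris, Marseille, "
         "Lyon, Toulouse, and Nice.")
out_b = ("The five cities in France are Paris, Marseille, "
         "Lyon, Toulouse, and Nice.")

score = jaccard(out_a, out_b)
print(round(score, 3))  # 0.923 (12 shared tokens out of 13 distinct)
```

Under these assumptions the two outputs differ by a single token ("largest"), so their similarity lands just above 0.90. Lowering the threshold would group looser paraphrases; raising it would keep only closer matches.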

Supported Languages

  • Western European Languages:

    • French, German, Spanish, Portuguese, Italian, Dutch, etc.

  • Central European and Baltic Languages:

    • Czech, Polish, Slovak, Hungarian, Croatian, Slovenian, Latvian, Lithuanian, etc.

  • Additional European Languages:

    • Additional European languages with special characters.

  • Vietnamese and Some African Languages:

    • Vietnamese and various African languages using extended Latin characters.

  • Slavic Languages Using Cyrillic Script:

    • Russian, Bulgarian, Serbian, Ukrainian, Belarusian, Macedonian, etc.

  • Greek Language:

    • Modern Greek.

  • Languages Using the Arabic Script:

    • Standard Arabic, Persian (Farsi), Urdu, Pashto, Kurdish (Sorani), etc.

  • Languages Using the Devanagari Script:

    • Hindi, Marathi, Sanskrit, Nepali, Konkani, Bodo, etc.

  • Ethiopic Script Languages:

    • Amharic, Tigrinya, and other languages in Ethiopia and Eritrea.

  • Tifinagh Script for Berber Languages:

    • Berber languages in North Africa.

  • Vai Script:

    • Used for the Vai language in West Africa.

  • East Asian Languages:

    • Chinese, Japanese, Korean.

  • Dravidian Languages:

    • Tamil, Telugu, Kannada, Malayalam.

  • Other Indic Languages:

    • Bengali, Punjabi, Gujarati, Odia (Oriya).

  • General Latin, Numerals, and Underscore:

    • Basic Latin characters, numbers, and underscore used globally.
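
The list above reads like a set of Unicode ranges accepted as word characters during tokenization. The sketch below shows one way such script-aware tokenization could look. It is not PolyDeDupe's actual pattern: the `WORD_CHARS` ranges, the `NON_WORD` regex, and the `tokenize` helper are hypothetical and cover only a handful of the scripts listed.

```python
import re

# Hypothetical subset of script ranges (assumption, not the library's pattern).
WORD_CHARS = (
    "A-Za-z0-9_"      # Basic Latin letters, digits, underscore
    "\u00C0-\u024F"   # Latin letters with diacritics (range also covers × and ÷)
    "\u0370-\u03FF"   # Greek
    "\u0400-\u04FF"   # Cyrillic
    "\u0600-\u06FF"   # Arabic
    "\u0900-\u097F"   # Devanagari
    "\u4E00-\u9FFF"   # CJK Unified Ideographs
)

# Any run of characters outside the ranges above acts as a token separator.
NON_WORD = re.compile(f"[^{WORD_CHARS}]+")


def tokenize(text):
    # Split on separators and drop the empty strings left at the edges.
    return [tok for tok in NON_WORD.split(text) if tok]


print(tokenize("Привет, мир!"))
print(tokenize("Name five cities in France."))
```

With a pattern restricted to ASCII word characters, non-Latin text would collapse into empty token lists and could not be compared meaningfully. Extending the character class per script is one simple way to avoid that, at the cost of maintaining the ranges by hand.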