Germany 100k.zip

This dataset typically contains plain-text articles extracted from German Wikipedia. It is widely used by researchers for tasks such as:

- Text summarization: providing a large corpus for both extractive and abstractive summarization techniques.
- Named entity recognition: identifying specific locations, organizations, or names within German-language text.
- Vocabulary building: compiling a set of unique German words or tokens for language modeling (see the counting sketch at the end of this section).
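All of these tasks start from the raw text, and releases mirrored on the Hugging Face Hub can typically be pulled with the `datasets` library. Below is a minimal loading sketch; the repository id "username/german-wikipedia-100k" and the column name "text" are hypothetical placeholders, since the exact release name and schema vary.

```python
from datasets import load_dataset

# Hypothetical repository id; substitute the actual Hub id of the release you use.
dataset = load_dataset("username/german-wikipedia-100k", split="train")

# Inspect one record; the real column names may differ from "text".
example = dataset[0]
print(example.keys())
print(example["text"][:200])  # first 200 characters of the plain-text article
```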
Dataset Composition

While exact versions vary (such as the dataset hosted on Hugging Face), these files generally include:

- Articles: approximately 100,000 documents with titles, tables, and images removed to provide clean, plain text.
- Summaries: many versions include a brief summary for each article, allowing models to be trained on how to condense information.
- Tokens: these datasets often represent millions of individual word tokens, making them suitable for training small-to-medium scale language models.
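Because the documents are clean plain text, vocabulary building reduces to tokenizing each file and counting distinct tokens. The sketch below assumes the archive unpacks to one UTF-8 .txt file per article; the directory name "german_100k/" is a placeholder, not part of any confirmed release layout.

```python
import re
from collections import Counter
from pathlib import Path

token_counts = Counter()
for path in Path("german_100k").glob("*.txt"):  # placeholder directory name
    text = path.read_text(encoding="utf-8")
    # Lowercase and extract word tokens; \w is Unicode-aware in Python 3,
    # so umlauts (ä, ö, ü) and eszett (ß) are handled correctly.
    token_counts.update(re.findall(r"\w+", text.lower()))

print(f"unique tokens: {len(token_counts):,}")
print(f"total tokens:  {sum(token_counts.values()):,}")
print(token_counts.most_common(10))  # likely dominated by German function words
```

The resulting `Counter` gives a frequency-ranked vocabulary that can seed a word-level language model or inform a tokenizer's vocabulary size.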