Researchers from Tsinghua and ModelBest Release Ultra-FineWeb: A Trillion-Token Dataset Enhancing LLM Accuracy Across Benchmarks

The quality of the data used to pretrain LLMs has become increasingly critical to their success. To build information-rich corpora, researchers have moved from heuristic filtering methods, such as rule-based noise removal and…
