
Creating an LLM Dataset for Training and Validation
Building high-quality datasets for training Large Language Models isn’t just about scraping the internet and throwing everything into a blender. It’s a meticulous process that determines whether your model becomes a reliable assistant or generates nonsensical responses. This guide walks you through creating robust LLM datasets, from data collection and preprocessing to validation splits and quality assurance, and covers the infrastructure requirements, common gotchas, and proven techniques that separate production-ready datasets from hobby projects.
Understanding LLM Dataset Architecture
LLM datasets differ significantly from traditional machine learning datasets. Instead of structured features and labels, you’re working with massive text corpora that need careful organization, cleaning, and formatting. The typical pipeline involves collecting raw text from various sources, applying multiple preprocessing steps, implementing quality filters, and structuring the data for efficient training.
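Conceptually, the pipeline reduces to a handful of stages. The sketch below is only an outline; the stage functions are placeholders that the sections in this guide implement in detail:
# Outline of the dataset pipeline; each stage is a placeholder implemented
# in the sections that follow.
def build_dataset(raw_paths, collect, preprocess, passes_quality, structure):
    documents = []
    for path in raw_paths:
        documents.extend(collect(path))                # 1. collect raw text
    documents = [preprocess(d) for d in documents]     # 2. clean and normalize
    documents = [d for d in documents if d is not None and passes_quality(d)]  # 3. filter
    return structure(documents)                        # 4. format for training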
Modern LLM training requires datasets ranging from hundreds of gigabytes to several terabytes. A single Common Crawl snapshot contains roughly 3 billion web pages, while academic corpora like The Pile weigh in at around 800GB of raw text. Your infrastructure needs to handle this scale efficiently; the table below summarizes typical ranges, and a rough size-to-token conversion follows it.
| Dataset Type | Size Range | Use Case | Processing Time |
|---|---|---|---|
| Domain-specific | 1-50 GB | Specialized models | Hours |
| General purpose | 100-500 GB | Mid-size models | Days |
| Large-scale | 1+ TB | Foundation models | Weeks |
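To put those sizes in token terms, a rough rule of thumb is about four bytes of English text per BPE token. That figure is an approximation that varies by tokenizer and domain, but it gives a quick back-of-envelope conversion:
# Back-of-envelope conversion from corpus size to token count.
# Assumes roughly 4 bytes of UTF-8 text per BPE token; adjust for your tokenizer.
def estimate_tokens(corpus_gb, bytes_per_token=4):
    return corpus_gb * 1e9 / bytes_per_token

print(f"{estimate_tokens(800):.3g} tokens")  # an 800 GB corpus is roughly 2e11 tokens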
Setting Up Your Data Processing Infrastructure
Processing large datasets requires substantial compute resources. A typical setup involves multiple high-memory instances for parallel processing, fast SSD storage for intermediate files, and efficient networking for data transfer.
For processing datasets under 100GB, a single VPS with 32GB RAM and fast NVMe storage works well. Larger datasets benefit from dedicated servers with 128GB+ RAM and multiple CPU cores for parallel processing.
# Install essential tools for dataset processing
apt update && apt install -y python3-pip git wget unzip parallel
# Install Python libraries
pip3 install datasets transformers tokenizers pandas numpy scipy
# Set up processing environment
mkdir -p /data/{raw,processed,validated}
export PYTHONPATH=/opt/dataset-tools:$PYTHONPATH
Create a basic processing pipeline structure:
#!/usr/bin/env python3
import os
import json
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer

class DatasetProcessor:
    def __init__(self, config_path):
        with open(config_path, 'r') as f:
            self.config = json.load(f)

        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config['tokenizer_name']
        )

    def load_raw_data(self, data_path):
        """Load raw text data from various formats"""
        if data_path.endswith('.jsonl'):
            return pd.read_json(data_path, lines=True)
        elif data_path.endswith('.csv'):
            return pd.read_csv(data_path)
        else:
            raise ValueError(f"Unsupported format: {data_path}")

    def preprocess_text(self, text):
        """Apply text cleaning and normalization"""
        # Remove excessive whitespace
        text = ' '.join(text.split())

        # Filter out very short or very long sequences
        if len(text) < self.config['min_length'] or \
           len(text) > self.config['max_length']:
            return None

        return text
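A brief usage sketch: the config file path, its keys (tokenizer_name, min_length, max_length), and the 'text' column name are hypothetical placeholders matching what the class above reads.
# Hypothetical usage of DatasetProcessor; config.json and the 'text' column
# are assumptions for illustration.
processor = DatasetProcessor('config.json')
df = processor.load_raw_data('/data/raw/sample.jsonl')

cleaned = []
for raw in df['text']:
    text = processor.preprocess_text(raw)
    if text is not None:
        cleaned.append(text)
print(f"Kept {len(cleaned)} of {len(df)} documents")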
Data Collection and Source Management
Quality datasets combine multiple sources with different characteristics. Web scraping provides broad coverage but requires aggressive filtering. Academic papers offer high-quality text but limited domain coverage. Books and articles provide structured, well-edited content but may have licensing restrictions.
Popular data sources include:
- Common Crawl web archives (free, massive scale, variable quality)
- Wikipedia dumps (high quality, multilingual, structured)
- Project Gutenberg (books, literary content, public domain)
- arXiv papers (academic, technical content, recent)
- Reddit datasets (conversational, diverse topics, informal)
- GitHub repositories (code, documentation, technical)
Implement a source manager to handle different data formats:
class SourceManager:
    def __init__(self):
        self.sources = {}

    def add_source(self, name, url, processor_func):
        self.sources[name] = {
            'url': url,
            'processor': processor_func,
            'downloaded': False
        }

    def download_source(self, name, output_dir):
        source = self.sources[name]
        if not source['downloaded']:
            print(f"Downloading {name}...")
            os.system(f"wget -P {output_dir} {source['url']}")
            source['downloaded'] = True

    def process_wikipedia_dump(self, dump_path):
        """Extract clean text from Wikipedia XML dumps"""
        import xml.etree.ElementTree as ET

        texts = []
        for event, elem in ET.iterparse(dump_path, events=('start', 'end')):
            if event == 'end' and elem.tag.endswith('text'):
                if elem.text:
                    # Remove wiki markup, extract clean text
                    # (clean_wiki_markup is a helper assumed to be defined
                    # elsewhere in your pipeline)
                    clean_text = self.clean_wiki_markup(elem.text)
                    if len(clean_text) > 100:
                        texts.append(clean_text)
                elem.clear()
        return texts
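A short usage sketch follows. The URL shown is the standard English Wikipedia dump location; note that the downloaded archive must be decompressed (bunzip2) before the XML can be parsed, and the output path is illustrative.
# Hypothetical usage: register the English Wikipedia dump, download it, and
# extract article text once the archive has been decompressed.
manager = SourceManager()
manager.add_source(
    'wikipedia_en',
    'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
    manager.process_wikipedia_dump
)
manager.download_source('wikipedia_en', '/data/raw')
# After decompressing the archive:
# texts = manager.process_wikipedia_dump('/data/raw/enwiki-latest-pages-articles.xml')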
Text Processing and Quality Filtering
Raw text from the internet contains significant noise: HTML tags, garbled encoding, duplicate content, spam, and low-quality text. Effective filtering removes this noise while preserving valuable training data.
Implement a multi-stage filtering pipeline:
import re
import langdetect
from collections import Counter

class QualityFilter:
    def __init__(self):
        self.min_words = 10
        self.max_words = 10000
        self.min_avg_word_length = 3
        self.max_duplicate_ratio = 0.8

    def language_filter(self, text):
        """Filter non-English text"""
        try:
            return langdetect.detect(text) == 'en'
        except Exception:  # langdetect raises on empty or undetectable input
            return False

    def quality_score(self, text):
        """Calculate text quality score"""
        words = text.split()

        # Word count check
        if len(words) < self.min_words or len(words) > self.max_words:
            return 0.0

        # Average word length
        avg_word_len = sum(len(w) for w in words) / len(words)
        if avg_word_len < self.min_avg_word_length:
            return 0.0

        # Character diversity
        char_counts = Counter(text.lower())
        most_common_ratio = char_counts.most_common(1)[0][1] / len(text)
        if most_common_ratio > 0.5:  # Too repetitive
            return 0.0

        # Basic spam detection: case-insensitive phrase patterns, plus a
        # separate case-sensitive check for long all-caps runs
        spam_patterns = [
            r'\b(buy now|click here|limited time)\b',
            r'(viagra|casino|lottery)',
        ]

        spam_score = 0
        for pattern in spam_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                spam_score += 1
        if re.search(r'[A-Z]{10,}', text):  # All caps spam
            spam_score += 1

        return max(0.0, 1.0 - (spam_score * 0.3))

    def deduplication(self, texts, threshold=0.85):
        """Remove near-duplicate texts"""
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        vectorizer = TfidfVectorizer(max_features=1000)
        tfidf_matrix = vectorizer.fit_transform(texts)
        similarities = cosine_similarity(tfidf_matrix)

        to_remove = set()
        for i in range(len(texts)):
            if i in to_remove:
                continue
            for j in range(i + 1, len(texts)):
                if similarities[i][j] > threshold:
                    to_remove.add(j)

        return [texts[i] for i in range(len(texts)) if i not in to_remove]
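A short usage sketch ties the filter stages together; raw_texts is a placeholder for the cleaned documents from the previous section, and the 0.5 quality threshold is an arbitrary starting point to tune.
# Hypothetical end-to-end use of the filter: language check, quality score,
# then near-duplicate removal on the surviving texts.
qf = QualityFilter()
kept = [t for t in raw_texts
        if qf.language_filter(t) and qf.quality_score(t) > 0.5]
filtered_texts = qf.deduplication(kept)
print(f"{len(filtered_texts)} of {len(raw_texts)} texts retained")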
Tokenization and Format Preparation
LLM training requires converting text into tokens that the model can process. Different tokenization strategies affect model performance and training efficiency. Byte Pair Encoding (BPE) and SentencePiece are popular choices for modern LLMs.
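If you are training a tokenizer from scratch rather than reusing a pretrained one, the tokenizers library installed earlier can train a byte-level BPE vocabulary directly from your corpus. This is a minimal sketch; the vocabulary size and special token are placeholder choices, and filtered_texts refers to the cleaned corpus from the previous section.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Minimal byte-level BPE training sketch; adjust vocab_size and special
# tokens for your target model.
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["<|endoftext|>"])
bpe_tokenizer.train_from_iterator(filtered_texts, trainer=trainer)
bpe_tokenizer.save("/data/processed/bpe_tokenizer.json")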
Prepare your text data for training with proper tokenization:
def prepare_training_data(texts, tokenizer, max_length=2048):
    """Tokenize and chunk texts for LLM training"""

    def chunk_text(text, chunk_size, overlap=128):
        """Split long texts into overlapping chunks"""
        tokens = tokenizer.encode(text)
        chunks = []
        start = 0

        while start < len(tokens):
            end = min(start + chunk_size, len(tokens))
            chunk_tokens = tokens[start:end]

            # Add special tokens
            if start == 0:
                chunk_tokens = [tokenizer.bos_token_id] + chunk_tokens
            if end == len(tokens):
                chunk_tokens = chunk_tokens + [tokenizer.eos_token_id]

            chunks.append({
                'input_ids': chunk_tokens,
                'attention_mask': [1] * len(chunk_tokens)
            })

            # Stop once the final chunk has been emitted; otherwise the
            # overlap would keep re-generating the tail of the sequence.
            if end == len(tokens):
                break
            start = end - overlap

        return chunks

    processed_data = []
    for text in texts:
        if len(text.strip()) == 0:
            continue
        chunks = chunk_text(text, max_length - 2)  # Reserve space for special tokens
        processed_data.extend(chunks)

    return processed_data

# Usage example
tokenizer = AutoTokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

training_data = prepare_training_data(filtered_texts, tokenizer)
print(f"Generated {len(training_data)} training examples")
Creating Training and Validation Splits
Proper data splitting ensures reliable model evaluation and prevents overfitting. Unlike traditional ML, LLM datasets require careful consideration of data distribution and temporal ordering.
Implement stratified splitting based on content characteristics:
from sklearn.model_selection import train_test_split
import numpy as np

def create_dataset_splits(data, test_size=0.1, val_size=0.1,
                          stratify_by_length=True, random_state=42):
    """Create train/validation/test splits with proper stratification"""

    if stratify_by_length:
        # Create length-based strata from sequence-length quartiles
        lengths = [len(item['input_ids']) for item in data]
        length_percentiles = np.percentile(lengths, [25, 50, 75])

        strata = []
        for length in lengths:
            if length <= length_percentiles[0]:
                strata.append(0)
            elif length <= length_percentiles[1]:
                strata.append(1)
            elif length <= length_percentiles[2]:
                strata.append(2)
            else:
                strata.append(3)
    else:
        strata = None

    # First split: separate test set. Split the strata alongside the data so
    # the remaining examples keep their correct labels after shuffling.
    if strata is not None:
        train_val_data, test_data, train_val_strata, _ = train_test_split(
            data, strata, test_size=test_size,
            stratify=strata, random_state=random_state
        )
    else:
        train_val_data, test_data = train_test_split(
            data, test_size=test_size, random_state=random_state
        )
        train_val_strata = None

    # Second split: separate validation from training
    val_ratio = val_size / (1 - test_size)
    train_data, val_data = train_test_split(
        train_val_data, test_size=val_ratio,
        stratify=train_val_strata, random_state=random_state
    )

    return {
        'train': Dataset.from_list(train_data),
        'validation': Dataset.from_list(val_data),
        'test': Dataset.from_list(test_data)
    }

# Create and save datasets
dataset_dict = create_dataset_splits(training_data)
dataset_dict = DatasetDict(dataset_dict)

# Save to disk
dataset_dict.save_to_disk('/data/processed/llm_dataset')

# Push to Hugging Face Hub (optional)
# dataset_dict.push_to_hub('your-username/your-dataset')
Quality Assurance and Validation
Dataset quality directly impacts model performance. Implement comprehensive validation checks to catch issues before expensive training runs.
class DatasetValidator:
    def __init__(self, dataset_dict):
        self.dataset_dict = dataset_dict

    def validate_splits(self):
        """Check data distribution across splits"""
        print("Dataset Split Validation:")
        total_samples = sum(len(split) for split in self.dataset_dict.values())

        for split_name, split_data in self.dataset_dict.items():
            ratio = len(split_data) / total_samples
            avg_length = np.mean([len(item['input_ids']) for item in split_data])
            print(f"{split_name}: {len(split_data)} samples ({ratio:.1%}), "
                  f"avg length: {avg_length:.0f} tokens")

    def check_token_distribution(self):
        """Analyze token frequency distribution"""
        all_tokens = []
        for split_data in self.dataset_dict.values():
            for item in split_data:
                all_tokens.extend(item['input_ids'])

        token_counts = Counter(all_tokens)
        vocab_size = len(token_counts)
        most_common = token_counts.most_common(10)

        print("\nToken Distribution:")
        print(f"Vocabulary size: {vocab_size}")
        print(f"Total tokens: {len(all_tokens)}")
        print(f"Most common tokens: {most_common}")

    def detect_anomalies(self):
        """Find potential data quality issues"""
        issues = []

        for split_name, split_data in self.dataset_dict.items():
            lengths = [len(item['input_ids']) for item in split_data]

            # Check for extremely short/long sequences
            very_short = sum(1 for l in lengths if l < 10)
            very_long = sum(1 for l in lengths if l > 1000)

            if very_short > len(lengths) * 0.05:
                issues.append(f"{split_name}: {very_short} very short sequences")
            if very_long > len(lengths) * 0.01:
                issues.append(f"{split_name}: {very_long} very long sequences")

        return issues

# Run validation
validator = DatasetValidator(dataset_dict)
validator.validate_splits()
validator.check_token_distribution()

issues = validator.detect_anomalies()
if issues:
    print("\nPotential Issues Found:")
    for issue in issues:
        print(f"- {issue}")
Performance Optimization and Storage
Large datasets require efficient storage and loading mechanisms. The Hugging Face datasets library provides memory mapping and lazy loading, but additional optimizations help with very large datasets.
| Storage Format | Load Speed | Memory Usage | Compression | Best For |
|---|---|---|---|---|
| Arrow/Parquet | Fast | Low | Good | Production training |
| JSON Lines | Medium | High | Poor | Development/debugging |
| HDF5 | Fast | Medium | Good | Numerical data |
| TFRecord | Fast | Low | Good | TensorFlow training |
Optimize dataset loading performance:
def optimize_dataset_storage(dataset_dict, output_path):
    """Optimize dataset for training performance"""

    # Set the output format for PyTorch training
    dataset_dict.set_format(
        type='torch',
        columns=['input_ids', 'attention_mask']
    )

    # Save each split in shards; load_from_disk memory-maps the Arrow files,
    # and sharding enables parallel loading
    for split_name, split_data in dataset_dict.items():
        split_data.save_to_disk(
            f"{output_path}/{split_name}",
            num_shards=8
        )

    # Create data loading configuration
    config = {
        'batch_size': 32,
        'num_workers': 4,
        'pin_memory': True,
        'prefetch_factor': 2
    }

    with open(f"{output_path}/dataloader_config.json", 'w') as f:
        json.dump(config, f, indent=2)

# Optimize and save
optimize_dataset_storage(dataset_dict, '/data/processed/optimized_dataset')
Common Issues and Troubleshooting
Dataset creation involves several common pitfalls that can waste significant time and resources. Memory issues during processing are frequent with large datasets. Implement batch processing and monitor memory usage closely.
Tokenization inconsistencies cause training failures. Always validate that your tokenizer matches the model architecture you plan to use. Special tokens (BOS, EOS, PAD) must be handled consistently across all data splits.
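A quick pre-training sanity check along these lines can catch missing special tokens and out-of-vocabulary ids; this is a minimal sketch against the tokenizer and dataset_dict built above.
# Verify the special tokens the chunking code relies on exist, and that every
# split only contains token ids inside the tokenizer's vocabulary.
def check_special_tokens(tokenizer, dataset_dict):
    assert tokenizer.bos_token_id is not None, "Missing BOS token"
    assert tokenizer.eos_token_id is not None, "Missing EOS token"
    assert tokenizer.pad_token_id is not None, "Missing PAD token"

    vocab_size = len(tokenizer)
    for split_name, split_data in dataset_dict.items():
        max_id = max(int(max(item['input_ids'])) for item in split_data)
        assert max_id < vocab_size, f"{split_name}: token id {max_id} out of range"

check_special_tokens(tokenizer, dataset_dict)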
Data leakage between splits happens when similar or identical content appears in both training and validation sets. This is particularly common with web-scraped data where the same article might appear on multiple sites.
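A cheap last-line check is to fingerprint each example's token sequence and count exact overlaps between splits. This will not catch near-duplicates, which the TF-IDF deduplication step handles before splitting, but it flags verbatim leakage; a minimal sketch:
import hashlib

# Flag examples whose token sequences appear verbatim in both the training
# and validation splits.
def count_split_overlap(dataset_dict):
    def fingerprint(item):
        ids = ','.join(str(int(t)) for t in item['input_ids'])
        return hashlib.sha1(ids.encode('utf-8')).hexdigest()

    train_hashes = {fingerprint(item) for item in dataset_dict['train']}
    leaked = sum(1 for item in dataset_dict['validation']
                 if fingerprint(item) in train_hashes)
    print(f"{leaked} validation examples duplicate training examples exactly")
    return leaked

count_split_overlap(dataset_dict)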
# Monitor memory usage during processing
import psutil
import gc

def process_with_memory_management(data_chunks, process_func, max_memory_gb=16):
    """Process data chunks with memory monitoring"""
    results = []

    for i, chunk in enumerate(data_chunks):
        # Check system memory usage before processing the next chunk
        memory_gb = psutil.virtual_memory().used / (1024**3)
        if memory_gb > max_memory_gb:
            print(f"Memory usage high ({memory_gb:.1f}GB), clearing cache...")
            gc.collect()

        result = process_func(chunk)
        results.extend(result)

        if i % 100 == 0:
            print(f"Processed {i}/{len(data_chunks)} chunks")

    return results
Character encoding issues are common with web data. Always specify UTF-8 encoding and handle decode errors gracefully. Implement robust error handling for corrupted files or network timeouts during downloads.
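A defensive reader along these lines keeps one corrupted file from aborting a long preprocessing run; this is a minimal sketch, and the error-handling choices are one reasonable option rather than the only one.
# Read text files defensively: force UTF-8, replace undecodable bytes,
# and skip files that cannot be opened at all.
def read_text_safely(path):
    try:
        with open(path, 'r', encoding='utf-8', errors='replace') as f:
            return f.read()
    except OSError as e:
        print(f"Skipping {path}: {e}")
        return None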
Real-World Applications and Use Cases
Domain-specific datasets enable specialized models that outperform general-purpose alternatives. Legal document processing benefits from datasets containing court cases, contracts, and legal texts. Medical applications require carefully curated datasets from medical literature and clinical notes.
Code generation models need diverse programming language samples, documentation, and examples. Financial models require news articles, earnings reports, and market analysis data. Each domain has specific preprocessing requirements and quality metrics.
Multilingual datasets present additional challenges with language detection, script normalization, and balanced representation across languages. Consider using language-specific preprocessing pipelines and validation metrics.
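As a small sketch of that idea, documents can be bucketed by detected language before any language-specific cleaning is applied; raw_texts is a placeholder for your unprocessed documents, and langdetect is already used by the quality filter above.
from collections import defaultdict
import langdetect

# Group documents by detected language so each bucket can get its own
# preprocessing pipeline and its share of the final mix can be rebalanced.
def bucket_by_language(texts):
    buckets = defaultdict(list)
    for text in texts:
        try:
            buckets[langdetect.detect(text)].append(text)
        except Exception:  # empty or undetectable input
            buckets['unknown'].append(text)
    return buckets

language_buckets = bucket_by_language(raw_texts)
print({lang: len(docs) for lang, docs in language_buckets.items()})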
For more information on dataset best practices, check the Hugging Face Datasets documentation and the EleutherAI Pile repository for examples of large-scale dataset creation.
Creating high-quality LLM datasets requires patience, computational resources, and attention to detail. Start with smaller, domain-specific datasets to validate your pipeline before scaling to larger collections. The infrastructure investment pays off when your models consistently generate coherent, useful outputs instead of struggling with poor-quality training data.
