1. Script 1: Data Extraction and Preprocessing
Purpose: Extract text from your PDF files and preprocess it (cleaning and removing unnecessary parts) ahead of tokenization.
Functionality:
Extracts text from PDFs using libraries like pdfminer or PyMuPDF.
Cleans the text (removes headers, footers, special characters).
Splits the text into smaller chunks that fit within GPT-2’s context window of 1024 tokens.
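A minimal sketch of Script 1, assuming PyMuPDF (`pip install pymupdf`) as the extraction library; the cleaning rules and file path are illustrative, not the exact ones used:

```python
import re

def clean_text(raw: str) -> str:
    """Strip special characters and collapse whitespace in extracted PDF text."""
    text = re.sub(r"[^\x20-\x7E\n]", " ", raw)  # drop non-printable/special characters
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces and tabs
    text = re.sub(r"\n{2,}", "\n", text)        # collapse blank lines (headers/footers often leave these)
    return text.strip()

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

# Example usage (path is one of the books ingested later in this post):
# cleaned = clean_text(extract_pdf_text("./pdfs/The Stolen White Elephant.pdf"))
```

With pdfminer instead, only `extract_pdf_text` changes; the cleaning step stays the same.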
2. Script 2: Tokenization and Dataset Preparation
Purpose: Tokenize the cleaned text and split it into manageable chunks that the GPT-2 model can process.
Functionality:
Tokenizes the cleaned text using the GPT-2 tokenizer from Hugging Face.
Splits the tokenized data into chunks (1024 tokens per chunk) for training.
Converts it into a format compatible with Hugging Face’s training pipeline.
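Script 2 might look like the following sketch, assuming Hugging Face’s `transformers` and `datasets` packages; the fixed-size chunking drops the final partial block, which is a common simplification:

```python
from typing import List

def chunk_ids(token_ids: List[int], block_size: int = 1024) -> List[List[int]]:
    """Split a flat list of token ids into fixed-size blocks, dropping the remainder."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

def build_dataset(text_path: str, block_size: int = 1024):
    """Tokenize a cleaned text file and wrap the chunks in a Hugging Face Dataset."""
    from transformers import GPT2TokenizerFast
    from datasets import Dataset
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    with open(text_path, encoding="utf-8") as f:
        ids = tokenizer(f.read())["input_ids"]
    blocks = chunk_ids(ids, block_size)
    # For causal language modeling, labels are a copy of the input ids.
    return Dataset.from_dict({"input_ids": blocks, "labels": [b[:] for b in blocks]})
```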
3. Script 3: Fine-tuning GPT-2
Purpose: Fine-tune the pre-trained GPT-2 model on the tokenized book data.
Functionality:
Loads the pre-trained GPT-2 model and tokenizer.
Fine-tunes GPT-2 on the tokenized data using Hugging Face’s Trainer API.
Saves the fine-tuned model to disk for later use.
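Script 3, sketched with the Trainer API; the hyperparameters (epochs, batch size) and output directory are placeholder assumptions, and the dataset argument is the one produced by Script 2:

```python
def num_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps the Trainer will take (no gradient accumulation)."""
    steps_per_epoch = -(-num_examples // batch_size)  # ceiling division
    return steps_per_epoch * epochs

def finetune(train_dataset, output_dir: str = "./gpt2-finetuned"):
    """Fine-tune pre-trained GPT-2 on the tokenized book data and save it to disk."""
    from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                              Trainer, TrainingArguments)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # placeholder value
        per_device_train_batch_size=2,   # placeholder value
        save_strategy="epoch",
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(output_dir)         # fine-tuned weights
    tokenizer.save_pretrained(output_dir)  # keep the tokenizer alongside the model
```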
4. Script 4: Text Generation (Inference)
Purpose: Use the fine-tuned GPT-2 model to generate text based on a prompt.
Functionality:
Loads the fine-tuned GPT-2 model and tokenizer.
Accepts a text prompt (e.g., user input or predefined).
Generates text based on the prompt using the fine-tuned model.
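Script 4 could be as short as this sketch; the model directory matches the placeholder from the previous step, and the sampling parameters (`top_p`, `temperature`) are illustrative defaults, not tuned values:

```python
def generate(prompt: str, model_dir: str = "./gpt2-finetuned",
             max_new_tokens: int = 100) -> str:
    """Generate a continuation of `prompt` with the fine-tuned model."""
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    tokenizer = GPT2TokenizerFast.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

def continuation(full_text: str, prompt: str) -> str:
    """Return only the newly generated text, with the echoed prompt removed."""
    return full_text[len(prompt):] if full_text.startswith(prompt) else full_text
```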
Final Count of Python Scripts:
Script 1: Data Extraction and Preprocessing (Extracts text from PDFs and cleans it).
Script 2: Tokenization and Dataset Preparation (Tokenizes and chunks the text into usable format for training).
Script 3: Fine-tuning GPT-2 (Fine-tunes the GPT-2 model on the processed text).
Script 4: Text Generation (Generates text using the fine-tuned model).
We are ingesting ./pdfs/Adventures Of Huckleberry Finn.pdf and ./pdfs/The Stolen White Elephant.pdf
At this point, Cursor finally reported no issues.
There were, in fact, still issues.