Intermediate · Tutorial 9

Large Language Models: How LLMs Work

NeuronDB Team
2/24/2025
28 min read

Large Language Models Overview

Large language models are transformer-based models trained on massive text corpora. Through pre-training they learn language patterns, which lets them generate coherent text and perform a wide range of NLP tasks. They form the foundation of modern AI applications.

LLMs use transformer architectures that scale to billions of parameters. They learn from unsupervised pre-training, adapt to downstream tasks through fine-tuning, and enable zero-shot and few-shot learning.

Figure: LLM Architecture

The diagram shows the overall LLM structure: input tokens flow through a stack of transformer layers, each layer refines the representation, and the output layer generates the next tokens.

Pre-training Process

Pre-training learns language representations from unlabeled text. Models are trained to predict masked tokens or next tokens, and in doing so they learn syntax, semantics, and world knowledge. Pre-training requires massive amounts of compute and data.

Masked language modeling masks random tokens and trains the model to predict them from the surrounding context, which yields bidirectional representations. Next-token prediction trains the model to predict each following token, which yields autoregressive generation.

# Pre-training Concept: masked language modeling
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
# Masked language modeling: predict the [MASK] token from its context
text = "The cat sat on the [MASK]."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Locate the masked position and decode the highest-scoring prediction
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(-1)
print("Predicted: " + tokenizer.decode(predicted_id))

Pre-training creates general language understanding: models learn from diverse text, capture linguistic patterns, and enable transfer learning.

Figure: Pre-training

The diagram shows the pre-training process: models learn from large text corpora by predicting tokens from context, building general language representations.

Fine-tuning Strategies

Fine-tuning adapts pre-trained models to specific tasks by updating model weights on task data. It requires far less data than training from scratch and significantly improves task performance.

Full fine-tuning updates all parameters; it works well but is expensive. Parameter-efficient fine-tuning (PEFT) updates only a small subset of parameters. LoRA, for example, adds low-rank adapters, reducing memory and compute.

Figure: Parameter-Efficient Fine-tuning

The diagram compares full fine-tuning and PEFT methods. Full fine-tuning updates all parameters, while PEFT methods update only small adapters. LoRA adds low-rank matrices, which reduces memory and compute requirements.

Figure: LoRA Architecture

The diagram shows the LoRA architecture. The original weights are frozen, low-rank matrices A and B are added, and the output combines the original and adapted paths, enabling efficient fine-tuning.
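
A minimal LoRA sketch, assuming the Hugging Face peft library; the rank, scaling factor, and target module names below are illustrative choices rather than prescribed values.

# LoRA Sketch (illustrative configuration)
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model
base_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                # rank of the low-rank matrices A and B
    lora_alpha=16,                      # scaling factor applied to the adapter output
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable

The wrapped model can then be trained with the same Trainer workflow shown in the full fine-tuning example below.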

# Fine-tuning Example
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
)
# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

Fine-tuning adapts general models to specific tasks, leveraging pre-trained knowledge and improving further with task-specific data.

Tokenization Methods

Tokenization converts text into model inputs. Different models use different tokenizers: WordPiece splits words into subwords, BPE merges frequent byte pairs, and SentencePiece handles multiple languages.

Subword tokenization handles out-of-vocabulary words by splitting unknown words into known pieces, maintaining vocabulary coverage and allowing the model to process any text.

# Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Tokens: " + str(tokens))
print("Token IDs: " + str(token_ids))
# Example output (the 'Ġ' prefix marks a leading space in GPT-2's byte-level BPE):
# Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
# Token IDs: the corresponding indices in the GPT-2 vocabulary

Tokenization is critical for model performance: it affects vocabulary coverage, sequence length, and ultimately how well the model understands the input.

Figure: Tokenization Methods

The diagram compares tokenization methods. WordPiece splits words into subwords, while BPE merges frequent pairs. Each method has different characteristics and use cases, as the sketch below illustrates.
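
A small comparison sketch, assuming the bert-base-uncased (WordPiece) and gpt2 (byte-level BPE) tokenizers; the exact splits depend on each tokenizer's learned vocabulary.

# Comparing WordPiece and byte-level BPE on the same word
from transformers import AutoTokenizer
wordpiece = AutoTokenizer.from_pretrained('bert-base-uncased')  # WordPiece
bpe = AutoTokenizer.from_pretrained('gpt2')                     # byte-level BPE
word = "unbelievably"
# WordPiece marks word-internal pieces with '##'; GPT-2's BPE marks leading spaces with 'Ġ'
print("WordPiece: " + str(wordpiece.tokenize(word)))
print("BPE: " + str(bpe.tokenize(word)))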

GPT Architecture

GPT uses decoder-only transformers. It predicts the next token autoregressively, generating text sequentially, which makes it well suited for generation tasks.

GPT stacks transformer decoder layers. Each layer uses masked self-attention, where the mask prevents positions from attending to future tokens, enabling causal generation.
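
A minimal sketch of the causal mask, assuming PyTorch; real implementations apply this mask to the attention scores before the softmax so that blocked positions receive no attention weight.

# Causal mask sketch: position i may attend only to positions <= i
import torch
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# Rows are query positions, columns are key positions; zeros mark blocked (future) positions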

# GPT Generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
prompt = "The future of AI is"
inputs = tokenizer.encode(prompt, return_tensors='pt')
outputs = model.generate(inputs, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated: " + str(generated_text))

GPT generates coherent text, continuing prompts naturally, and works for a wide range of generation tasks.

Figure: GPT Architecture

The diagram shows the GPT architecture: a decoder-only stack processes tokens, masked self-attention prevents looking ahead, and feed-forward layers transform the representations.

BERT Architecture

BERT uses encoder-only transformers. It processes bidirectional context, capturing information from both directions, which makes it well suited for understanding tasks.

BERT has two pre-training objectives: masked language modeling learns token-level representations, and next sentence prediction learns sentence-level relationships. Both improve understanding.

# BERT for Classification
from transformers import BertForSequenceClassification, BertTokenizer
# The classification head is newly initialized; it must be fine-tuned before its
# predictions become meaningful
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "This movie is great"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)
predictions = outputs.logits.argmax(-1)
print("Prediction: " + str(predictions))

BERT understands context bidirectionally, which makes it effective for classification and for capturing sentence relationships.
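
A small sketch of the next sentence prediction objective, assuming the BertForNextSentencePrediction head from transformers; the two sentences are illustrative.

# Next Sentence Prediction Sketch
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
sentence_a = "The cat sat on the mat."
sentence_b = "It fell asleep in the sun."
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
# Index 0 scores "sentence B follows sentence A"; index 1 scores "B is a random sentence"
probs = torch.softmax(logits, dim=-1)
print("IsNext probability: " + str(probs[0, 0].item()))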

Figure: BERT Architecture

The diagram shows the BERT architecture: an encoder-only stack processes tokens, bidirectional self-attention sees both directions, and feed-forward layers transform the representations.

T5 Architecture

T5 uses encoder-decoder transformers and frames every task as text-to-text, unifying task formats and handling diverse tasks with a single model.

T5 converts every task into text generation: classification, translation, and summarization all become generation, distinguished only by a task prefix in the input.

# T5 for Text-to-Text
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')
# Summarization
text = "summarize: The quick brown fox jumps over the lazy dog."
inputs = tokenizer.encode(text, return_tensors='pt')
outputs = model.generate(inputs, max_length=20)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Summary: " + str(summary))

T5 unifies task formats, works across many tasks, and simplifies task handling.

Inference and Generation

Inference uses a trained model to make predictions, and generation creates new text. Different decoding strategies produce different results: greedy decoding always selects the highest-probability token, while sampling adds randomness.

# Generation Strategies
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
prompt = "The future of technology"
inputs = tokenizer.encode(prompt, return_tensors='pt')
# Greedy decoding: always pick the highest-probability token (deterministic)
outputs_greedy = model.generate(inputs, max_length=50, do_sample=False)
text_greedy = tokenizer.decode(outputs_greedy[0], skip_special_tokens=True)
# Sampling: draw from the temperature-scaled distribution (stochastic)
torch.manual_seed(0)  # make the sampled output reproducible
outputs_sample = model.generate(inputs, max_length=50, do_sample=True, temperature=0.7)
text_sample = tokenizer.decode(outputs_sample[0], skip_special_tokens=True)
print("Greedy: " + str(text_greedy))
print("Sampling: " + str(text_sample))

Generation strategies affect output quality: greedy decoding produces deterministic results, sampling produces more diverse text, and temperature controls how random the sampling is.
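
A minimal sketch of temperature scaling, assuming PyTorch and toy logits; lower temperatures sharpen the distribution toward the top token, while higher temperatures flatten it.

# Temperature scaling of a toy next-token distribution
import torch
logits = torch.tensor([2.0, 1.0, 0.5])  # illustrative scores for three candidate tokens
for temperature in [0.5, 1.0, 2.0]:
    probs = torch.softmax(logits / temperature, dim=-1)
    print("temperature=" + str(temperature) + ": " + str(probs.tolist()))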

Summary

Large language models are transformer-based models trained on massive text corpora. Pre-training learns general language patterns, and fine-tuning adapts the model to specific tasks. Tokenization converts text into model inputs. GPT uses a decoder-only architecture, BERT an encoder-only architecture, and T5 an encoder-decoder architecture. Inference produces predictions, and generation strategies create new text. Together these pieces enable the many AI applications built on LLMs.
