Embeddings Overview
Embeddings represent data as dense vectors that capture semantic meaning. Similar items have similar vectors, which enables similarity search and arithmetic on meaning. Word embeddings represent words, sentence embeddings represent sentences, and document embeddings represent documents.
Embeddings transform discrete tokens into continuous vectors while preserving semantic relationships. Words with similar meanings end up with similar vectors, so mathematical operations on the vectors approximate operations on meaning.
The diagram shows embedding space. Similar words cluster together. Relationships appear as vector differences. King - Man + Woman approximates Queen.
Word Embeddings
Word embeddings map words to vectors. Word2Vec learns from context. GloVe learns from co-occurrence statistics. Both capture semantic relationships. Pre-trained embeddings work well for many tasks.
Word2Vec has two architectures. Skip-gram predicts context from word. CBOW predicts word from context. Both learn useful representations. Training uses neural networks on large text corpora.
# Word Embeddings with Word2Vec
from gensim.models import Word2Vec

sentences = [['king', 'queen', 'royal'],
             ['man', 'woman', 'person'],
             ['paris', 'france', 'city'],
             ['london', 'england', 'city']]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
word_vectors = model.wv

# Find similar words
similar = word_vectors.most_similar('king', topn=3)
print("Similar to 'king': " + str(similar))

# Vector arithmetic
result = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
similar_words = word_vectors.similar_by_vector(result, topn=3)
print("King - Man + Woman: " + str(similar_words))
-- NeuronDB: Word Embeddings Storage
CREATE TABLE word_embeddings (
    word VARCHAR(100) PRIMARY KEY,
    embedding vector(300)
);

-- Insert pre-trained embeddings
INSERT INTO word_embeddings (word, embedding) VALUES
    ('king', ARRAY[0.1, 0.2, ...]::vector(300)),
    ('queen', ARRAY[0.15, 0.18, ...]::vector(300));

-- Find similar words using cosine similarity
SELECT word,
       1 - (embedding <=> (SELECT embedding FROM word_embeddings WHERE word = 'king')) AS similarity
FROM word_embeddings
WHERE word != 'king'
ORDER BY similarity DESC
LIMIT 5;
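As a minimal sketch of how trained vectors might be loaded into such a table from Python, assuming a PostgreSQL-compatible connection and a pgvector-style vector literal (the DSN, dimensions, and insert statement are assumptions, not NeuronDB specifics):

# Hedged sketch: load gensim word vectors into a vector-typed table via psycopg2.
# The connection string is an assumption; the vector dimension must match the
# table definition (vector(300) above vs. vector_size used during training).
import psycopg2

def store_word_vectors(word_vectors, dsn="dbname=neurondb user=postgres"):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    for word in word_vectors.index_to_key:
        vec = word_vectors[word].tolist()
        # Format the vector as a pgvector-style literal, e.g. '[0.1,0.2,...]'
        literal = "[" + ",".join(str(x) for x in vec) + "]"
        cur.execute(
            "INSERT INTO word_embeddings (word, embedding) VALUES (%s, %s::vector)",
            (word, literal),
        )
    conn.commit()
    cur.close()
    conn.close()

# Usage (assumes the Word2Vec model trained above):
# store_word_vectors(model.wv)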
Word embeddings capture semantic relationships. They enable similarity search. They support arithmetic operations. They are foundational for NLP.
Detailed Word Embedding Training Methods
Word2Vec uses two architectures. Skip-gram predicts context words from target word. Continuous Bag of Words predicts target word from context. Both learn embeddings by predicting word co-occurrences.
Skip-gram maximizes the probability of the context words given the target word, P(w_{i-k}, ..., w_{i+k} | w_i). It works well for rare words. It requires more training data. It captures multiple contexts per word.
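Spelled out, the skip-gram objective maximizes the average log probability of the words inside a window of radius c around each position t in a corpus of T tokens (the standard Mikolov et al. formulation):

\[
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
\]

where v and v' are the target and context embedding vectors and V is the vocabulary size. The sum over V in the softmax denominator is exactly the cost that negative sampling (below) avoids.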
CBOW averages context word embeddings. It predicts target word from context average. It trains faster than skip-gram. It works well for frequent words. It uses less memory.
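A minimal numpy sketch of a single CBOW forward pass, under the assumption of a full softmax output layer (names and sizes are illustrative, not the gensim implementation):

# Minimal CBOW forward pass: average context embeddings, score all vocabulary words.
# Illustrative sketch only; a real implementation adds gradients and negative sampling.
import numpy as np

vocab_size, embedding_dim = 10, 8
input_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
output_weights = np.random.randn(vocab_size, embedding_dim) * 0.01

def cbow_forward(context_indices):
    # Average the context word embeddings into a single hidden vector
    hidden = input_embeddings[context_indices].mean(axis=0)
    # Score every vocabulary word against the averaged context
    scores = output_weights @ hidden
    # Softmax over the vocabulary gives p(target | context)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

probs = cbow_forward([1, 2, 4, 5])  # context word indices around a target
predicted_target = int(np.argmax(probs))
print("Predicted target index:", predicted_target)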
Training uses negative sampling. Instead of computing all vocabulary probabilities, sample negative examples. Reduces computation from O(V) to O(k) where k is number of negatives. Typical k is 5-20. Speeds up training significantly.
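In the original Word2Vec formulation, negatives are drawn from the unigram distribution raised to the 3/4 power rather than uniformly. A small sketch of that sampling step (the counts and k are illustrative):

# Sample k negative word indices from the smoothed unigram distribution (freq ** 0.75).
import numpy as np

word_counts = np.array([50, 30, 10, 5, 5])   # illustrative corpus frequencies
noise_dist = word_counts ** 0.75
noise_dist = noise_dist / noise_dist.sum()

def sample_negatives(positive_idx, k=5):
    negatives = []
    while len(negatives) < k:
        idx = int(np.random.choice(len(noise_dist), p=noise_dist))
        if idx != positive_idx:              # skip the true context word
            negatives.append(idx)
    return negatives

print(sample_negatives(positive_idx=0, k=5))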
# Detailed Word2Vec Training (skip-gram with negative sampling)
import numpy as np
import random

class Word2VecDetailed:
    def __init__(self, vocab_size, embedding_dim=100, window_size=2, negative_samples=5):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.negative_samples = negative_samples
        # Separate target (input) and context (output) embedding matrices
        self.target_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.context_embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01

    def skip_gram_step(self, target_idx, context_idx, learning_rate=0.01):
        target_emb = self.target_embeddings[target_idx]
        context_emb = self.context_embeddings[context_idx]

        # Positive example: compute score and sigmoid
        score = np.dot(target_emb, context_emb)
        sigmoid_score = 1 / (1 + np.exp(-score))

        # Positive gradients: pull target and true context together
        grad_target_pos = (1 - sigmoid_score) * context_emb
        grad_context_pos = (1 - sigmoid_score) * target_emb

        # Negative sampling: push target away from sampled negatives
        grad_target_neg = np.zeros(self.embedding_dim)
        for _ in range(self.negative_samples):
            neg_idx = self.sample_negative(context_idx)
            neg_emb = self.context_embeddings[neg_idx]
            neg_score = np.dot(target_emb, neg_emb)
            sigmoid_neg = 1 / (1 + np.exp(-neg_score))
            grad_target_neg += -sigmoid_neg * neg_emb
            # Update the sampled negative's context embedding directly
            self.context_embeddings[neg_idx] += learning_rate * (-sigmoid_neg * target_emb)

        # Update target and positive context embeddings
        self.target_embeddings[target_idx] += learning_rate * (grad_target_pos + grad_target_neg)
        self.context_embeddings[context_idx] += learning_rate * grad_context_pos

    def sample_negative(self, positive_idx):
        # Uniform sampling for simplicity; Word2Vec samples from the unigram distribution ** 0.75
        while True:
            neg_idx = random.randint(0, self.vocab_size - 1)
            if neg_idx != positive_idx:
                return neg_idx

    def train(self, corpus, epochs=10, learning_rate=0.01):
        for epoch in range(epochs):
            pairs_trained = 0
            for sentence in corpus:
                for i, target_word in enumerate(sentence):
                    # Get context words inside the window around the target
                    start = max(0, i - self.window_size)
                    end = min(len(sentence), i + self.window_size + 1)
                    context_words = sentence[start:i] + sentence[i + 1:end]
                    for context_word in context_words:
                        self.skip_gram_step(target_word, context_word, learning_rate)
                        pairs_trained += 1
            print(f"Epoch {epoch + 1}, pairs trained: {pairs_trained}")

# Example usage: sentences are lists of word indices
corpus = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
model = Word2VecDetailed(vocab_size=10, embedding_dim=50)
model.train(corpus, epochs=5)
Embedding Quality Evaluation
Evaluate embeddings using intrinsic and extrinsic tasks. Intrinsic tasks test embedding properties directly. Extrinsic tasks test downstream performance.
Intrinsic tasks include word similarity and word analogy. Word similarity compares embedding similarity to human judgments. Word analogy tests relationships like king - man + woman ≈ queen. These tasks measure embedding quality directly.
Extrinsic tasks test embeddings in applications. Text classification uses embeddings as features. Named entity recognition uses embeddings for sequence labeling. Machine translation uses embeddings for alignment. Performance on these tasks measures practical value.
# Embedding Quality Evaluation
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def evaluate_word_similarity(embeddings, word_pairs, human_scores):
    """Evaluate embedding similarity against human judgments"""
    embedding_scores = []
    for word1, word2 in word_pairs:
        if word1 in embeddings and word2 in embeddings:
            sim = cosine_similarity(embeddings[word1].reshape(1, -1),
                                    embeddings[word2].reshape(1, -1))[0][0]
            embedding_scores.append(sim)
        else:
            embedding_scores.append(0)
    # Compute correlation with human judgments
    correlation = np.corrcoef(human_scores, embedding_scores)[0, 1]
    return correlation

def evaluate_word_analogy(embeddings, analogy_tests):
    """Evaluate word analogy tasks"""
    correct = 0
    total = 0
    for a, b, c, expected_d in analogy_tests:
        if all(w in embeddings for w in [a, b, c, expected_d]):
            # Compute: a - b + c should be close to expected_d
            vec = embeddings[a] - embeddings[b] + embeddings[c]
            # Find the closest word, excluding the query words
            similarities = {}
            for word, emb in embeddings.items():
                if word not in [a, b, c]:
                    sim = cosine_similarity(vec.reshape(1, -1), emb.reshape(1, -1))[0][0]
                    similarities[word] = sim
            predicted_d = max(similarities, key=similarities.get)
            if predicted_d == expected_d:
                correct += 1
            total += 1
    accuracy = correct / total if total > 0 else 0
    return accuracy

# Example
embeddings = {
    'king': np.array([0.5, 0.3, 0.2]),
    'queen': np.array([0.4, 0.4, 0.2]),
    'man': np.array([0.6, 0.2, 0.2]),
    'woman': np.array([0.3, 0.5, 0.2])
}

word_pairs = [('king', 'queen'), ('man', 'woman')]
human_scores = [0.8, 0.7]
correlation = evaluate_word_similarity(embeddings, word_pairs, human_scores)
print("Similarity correlation: " + str(correlation))

analogy_tests = [('king', 'man', 'woman', 'queen')]
accuracy = evaluate_word_analogy(embeddings, analogy_tests)
print("Analogy accuracy: " + str(accuracy))
The diagram shows word embedding space. Related words cluster together. Vector differences encode relationships.
Sentence Embeddings
Sentence embeddings represent entire sentences. They capture sentence meaning. They enable sentence similarity search. They work well for semantic search and clustering.
Detailed Sentence Embedding Methods
Averaging word embeddings is simple but limited. It computes mean of word vectors. It loses word order information. It works for short sentences. It fails for complex semantics.
Sentence encoders use neural networks. They process entire sentences. They preserve word order. They capture sentence structure. They work better than averaging.
Transformer-based encoders use BERT or similar models. They process sentences through transformer layers. They use [CLS] token or mean pooling. They capture rich semantic information. They work well for many tasks.
# Detailed Sentence Embedding Methods
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Method 1: Averaging word embeddings
def average_word_embeddings(sentence, word_embeddings):
    words = sentence.lower().split()
    word_vecs = [word_embeddings.get(word, np.zeros(300)) for word in words]
    if len(word_vecs) == 0:
        return np.zeros(300)
    return np.mean(word_vecs, axis=0)

# Method 2: Sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "AI enables computers to learn from data"
]

# Generate embeddings
embeddings_avg = [average_word_embeddings(s, {}) for s in sentences]  # Placeholder: empty word-vector dict
embeddings_transformer = model.encode(sentences)

# Compare similarity
similarity_matrix = cosine_similarity(embeddings_transformer)
print("Sentence similarity matrix:")
print(similarity_matrix)

# Method 3: BERT-based with mean pooling
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModel.from_pretrained('bert-base-uncased')

def get_bert_embeddings(sentences):
    embeddings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors='pt', padding=True,
                           truncation=True, max_length=128)
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Mean pooling over the token dimension
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(embedding)
    return np.array(embeddings)

bert_embeddings = get_bert_embeddings(sentences)
print("BERT embeddings shape: " + str(bert_embeddings.shape))
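For comparison with the mean pooling above, the [CLS]-token variant takes the hidden state of the first token as the sentence vector. A short sketch that reuses the tokenizer and bert_model defined in the previous block:

# [CLS] pooling: use the hidden state of the first ([CLS]) token as the sentence vector.
def get_cls_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    # Position 0 of the sequence is the [CLS] token
    return outputs.last_hidden_state[:, 0, :].squeeze().numpy()

cls_embedding = get_cls_embedding(sentences[0])
print("CLS embedding shape: " + str(cls_embedding.shape))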
Embedding Quality Metrics and Evaluation
Evaluate embeddings using multiple metrics. Intrinsic metrics test embedding properties. Extrinsic metrics test application performance. Both are important for assessment.
Intrinsic metrics include similarity correlation and analogy accuracy. Similarity correlation compares embedding similarity to human judgments. Higher correlation indicates better embeddings. Analogy accuracy tests word relationships. Higher accuracy indicates better structure.
Extrinsic metrics test downstream tasks. Classification accuracy uses embeddings as features. Clustering quality measures grouping performance. Retrieval performance measures search quality. Better embeddings improve task performance.
# Comprehensive Embedding Evaluation
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from scipy.stats import spearmanr

def evaluate_embeddings_comprehensive(embeddings, labels, similarity_pairs=None, human_scores=None):
    results = {}

    # 1. Similarity correlation (if human scores available)
    if similarity_pairs and human_scores:
        embedding_scores = []
        for word1, word2 in similarity_pairs:
            if word1 in embeddings and word2 in embeddings:
                sim = cosine_similarity(embeddings[word1].reshape(1, -1),
                                        embeddings[word2].reshape(1, -1))[0][0]
                embedding_scores.append(sim)
        if len(embedding_scores) == len(human_scores):
            correlation, p_value = spearmanr(embedding_scores, human_scores)
            results['similarity_correlation'] = correlation
            results['similarity_p_value'] = p_value

    # 2. Classification performance (embeddings as features)
    X = np.array([embeddings.get(word, np.zeros(300)) for word in labels.keys()])
    y = list(labels.values())
    clf = LogisticRegression()
    # cv must not exceed the number of samples per class (2 in the toy example below)
    cv_scores = cross_val_score(clf, X, y, cv=2)
    results['classification_accuracy'] = cv_scores.mean()
    results['classification_std'] = cv_scores.std()

    # 3. Clustering quality
    n_clusters = len(set(y))
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    results['adjusted_rand_index'] = adjusted_rand_score(y, cluster_labels)
    results['silhouette_score'] = silhouette_score(X, cluster_labels)

    return results

# Example evaluation with toy random embeddings
embeddings_dict = {
    'cat': np.random.randn(300),
    'dog': np.random.randn(300),
    'car': np.random.randn(300),
    'vehicle': np.random.randn(300)
}
labels_dict = {'cat': 0, 'dog': 0, 'car': 1, 'vehicle': 1}

results = evaluate_embeddings_comprehensive(embeddings_dict, labels_dict)
print("Evaluation results:")
for metric, value in results.items():
    print(f"{metric}: {value:.4f}")
Methods include averaging word embeddings, training sentence encoders, and using transformer models. Averaging is simple but loses word order. Sentence encoders preserve structure. Transformers capture complex relationships.
# Sentence Embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "The cat sits on the mat",
    "A feline is on the rug",
    "The weather is sunny today"
]

# Normalized embeddings make the dot product equal to cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

# Compute similarity
similarity = np.dot(embeddings[0], embeddings[1])
print("Similarity between sentence 1 and 2: " + str(similarity))
# High similarity indicates similar meaning
Sentence embeddings enable semantic search. They find sentences with similar meaning. They work regardless of exact word matches.
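A minimal semantic-search sketch using the sentence_transformers util helpers; the query and corpus are illustrative:

# Semantic search: embed a query and a small corpus, return the closest sentences.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = [
    "The cat sits on the mat",
    "A feline is on the rug",
    "The weather is sunny today"
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("Where is the cat?", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

for hit in hits:
    print(corpus[hit['corpus_id']], "score:", round(hit['score'], 3))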
The diagram shows sentence embedding space. Semantically similar sentences cluster together.
Document Embeddings
Document embeddings represent entire documents. They capture document topics and themes. They enable document similarity and clustering. They work well for information retrieval.
Methods include averaging sentence embeddings, training document encoders, and using transformer models with pooling. Document encoders preserve document structure. Transformers capture long-range dependencies.
# Document Embeddings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Machine learning is a subset of artificial intelligence...",
    "Deep learning uses neural networks with multiple layers...",
    "The weather forecast predicts rain tomorrow..."
]

doc_embeddings = model.encode(documents)

# Find similar documents
similarity_matrix = cosine_similarity(doc_embeddings)
print("Document similarity matrix:")
print(similarity_matrix)
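Transformer encoders truncate long inputs, so one common workaround is to split a long document into chunks, embed each chunk, and average the chunk vectors. A minimal sketch under that assumption (the fixed word-count chunk size is arbitrary):

# Chunk a long document, embed each chunk, and average the chunk embeddings.
# Sketch only: fixed word-count chunks; sentence- or paragraph-level chunking is common too.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_long_document(text, chunk_words=100):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    chunk_embeddings = model.encode(chunks)
    # Mean-pool chunk embeddings into one document vector
    return np.mean(chunk_embeddings, axis=0)

long_doc = "Machine learning is a subset of artificial intelligence. " * 50
doc_vector = embed_long_document(long_doc)
print("Document vector shape:", doc_vector.shape)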
Document embeddings enable semantic document search. They find documents with similar topics. They work for large document collections.
Embedding Similarity and Distance
Similarity measures compare embeddings. Cosine similarity measures angle between vectors. Euclidean distance measures straight-line distance. Dot product measures alignment. Each suits different use cases.
Cosine similarity is cos(θ) = (A·B) / (||A|| × ||B||). It ranges from -1 to 1. Higher values mean more similar. It ignores vector magnitudes. It works well for embeddings.
# Embedding Similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0]
])

# Cosine similarity
cos_sim = cosine_similarity(embeddings)
print("Cosine similarity:")
print(cos_sim)

# Euclidean distance
euc_dist = euclidean_distances(embeddings)
print("Euclidean distance:")
print(euc_dist)
Choose similarity measures based on needs. Cosine similarity works well for embeddings. Euclidean distance works for spatial data.
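One practical difference between the two measures: scaling a vector leaves cosine similarity unchanged but changes Euclidean distance, as the small check below illustrates (the values are illustrative).

# Scaling a vector changes Euclidean distance but not cosine similarity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])   # same direction as a, twice the magnitude

print("Cosine similarity:", cosine_similarity(a, b)[0][0])     # 1.0: identical direction
print("Euclidean distance:", euclidean_distances(a, b)[0][0])  # nonzero: magnitudes differ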
The diagram shows similarity computation. Vectors with small angles have high cosine similarity.
Embedding Arithmetic
Embedding arithmetic performs operations on meaning. King - Man + Woman approximates Queen. It demonstrates captured relationships. It enables analogy solving.
Arithmetic works because embeddings capture relationships. Vector differences encode relationships. Adding differences applies relationships. Results approximate semantic operations.
# Embedding Arithmetic
from gensim.models import KeyedVectors

# Load pre-trained embeddings
word_vectors = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)

# King - Man + Woman ≈ Queen
result = word_vectors['king'] - word_vectors['man'] + word_vectors['woman']
similar_words = word_vectors.similar_by_vector(result, topn=5)
print("King - Man + Woman: " + str(similar_words[0][0]))  # Should be 'queen'

# Paris - France + Italy ≈ Rome
result = word_vectors['paris'] - word_vectors['france'] + word_vectors['italy']
similar_words = word_vectors.similar_by_vector(result, topn=5)
print("Paris - France + Italy: " + str(similar_words[0][0]))  # Should be 'rome'
Embedding arithmetic demonstrates learned relationships. It shows embeddings capture semantic structure. It enables analogy solving.
The diagram shows embedding arithmetic. Vector operations approximate semantic relationships.
Pre-trained Embeddings
Pre-trained embeddings are trained on large corpora. They capture general language patterns. They work well for many tasks. They save training time and data.
Common pre-trained embeddings include Word2Vec, GloVe, FastText, and transformer embeddings. Word2Vec and GloVe are word-level. FastText handles subwords. Transformers provide contextual embeddings.
# Using Pre-trained Embeddings
import gensim.downloader as api

# Load pre-trained Word2Vec
word_vectors = api.load("word2vec-google-news-300")

# Use embeddings
similar = word_vectors.most_similar('computer', topn=5)
print("Similar to 'computer': " + str(similar))
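FastText's subword handling means it can compose a vector for a word it never saw during training. A small sketch using gensim's FastText class on a toy corpus (the sentences are illustrative):

# FastText builds word vectors from character n-grams, so out-of-vocabulary
# words still get an embedding composed from their subwords.
from gensim.models import FastText

sentences = [['machine', 'learning', 'models'],
             ['deep', 'learning', 'networks'],
             ['learning', 'from', 'data']]

ft_model = FastText(sentences, vector_size=50, window=3, min_count=1)

# 'learnings' never appears in the corpus, but shares n-grams with 'learning'
oov_vector = ft_model.wv['learnings']
print("OOV vector shape:", oov_vector.shape)
print("Similarity to 'learning':", ft_model.wv.similarity('learning', 'learnings'))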
Pre-trained embeddings provide strong baselines. They work well without fine-tuning. They enable quick prototyping.
Fine-tuning Embeddings
Fine-tuning adapts pre-trained embeddings to specific tasks. It improves performance on domain data. It requires task-specific training data. It balances general and specific knowledge.
Fine-tuning updates embedding weights. It preserves general knowledge. It learns task-specific patterns. It improves performance on target tasks.
# Fine-tuning Embeddings
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Task-specific examples: CosineSimilarityLoss expects a similarity label per pair
examples = [
    InputExample(texts=['query about machine learning', 'document about ML'], label=0.9),
    InputExample(texts=['query about weather', 'weather forecast document'], label=0.9)
]

dataloader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(train_objectives=[(dataloader, loss)], epochs=1)
Fine-tuning improves task performance. It adapts general embeddings to specific needs. It requires labeled task data.
Summary
Embeddings represent data as dense vectors. Word embeddings capture word meaning. Sentence embeddings capture sentence meaning. Document embeddings capture document topics. Similarity measures compare embeddings. Cosine similarity works well for embeddings. Embedding arithmetic performs operations on meaning. Pre-trained embeddings provide strong baselines. Fine-tuning adapts embeddings to tasks. Embeddings enable semantic search and similarity operations.