WordEmbeddings
Count-based Approach

Term-document matrix: Document vectors

Two ways to extract information from the matrix:

Column-wise: a document is represented by a |V|-dim vector (V: vocabulary)

Widely used in information retrieval:

find similar documents

Two documents that are similar will tend to have similar words

find documents close to a query

Consider a query as a (short) document; see the cosine-similarity sketch below

Row-wise: a word is represented by a |D|-dim vector (D: document set)
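As a minimal illustration (not from the original notes), the sketch below builds a toy term-document count matrix and uses cosine similarity both to compare two document columns and to score a query treated as a short document; all terms and counts are made up.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# The counts are invented for illustration only.
terms = ["battle", "good", "fool", "wit"]
X = np.array([
    [1, 0, 7, 13],      # battle
    [114, 80, 62, 89],  # good
    [36, 58, 1, 4],     # fool
    [20, 15, 2, 3],     # wit
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Column-wise view: each document is a |V|-dimensional vector.
doc0, doc1 = X[:, 0], X[:, 1]
print("doc0 vs doc1:", cosine(doc0, doc1))

# Treat a query as a (very short) document over the same vocabulary.
query = np.array([0, 1, 1, 0], dtype=float)  # "good fool"
scores = [cosine(query, X[:, j]) for j in range(X.shape[1])]
print("query scores per document:", scores)
```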

Term-term matrix

We have seen it before (co-occurrence vectors): count how many times a word u appears with a word v.

Raw frequency is bad:

Not all contextual words are equally important: of, a, … vs. sugar, jam, fruit…

Which words are important, which ones are not?

Infrequent words are more important than frequent ones (examples?)

Correlated words are more important than uncorrelated ones (examples?)

→ weighting schemes (TF-IDF, PMI, …)
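A small sketch of one such weighting scheme, positive PMI (PPMI), applied to a toy term-term co-occurrence matrix; the counts and the `ppmi` helper are illustrative, not from the original notes.

```python
import numpy as np

def ppmi(C, eps=1e-12):
    """Positive PMI for a term-term co-occurrence count matrix C.
    PMI(u, v) = log2( P(u, v) / (P(u) * P(v)) ); negative values are clipped to 0."""
    total = C.sum()
    p_uv = C / total
    p_u = C.sum(axis=1, keepdims=True) / total
    p_v = C.sum(axis=0, keepdims=True) / total
    pmi = np.log2((p_uv + eps) / (p_u * p_v + eps))
    return np.maximum(pmi, 0.0)

# Toy co-occurrence counts (rows/columns: words); values are illustrative.
C = np.array([
    [0, 2, 1],
    [2, 0, 8],
    [1, 8, 0],
], dtype=float)
print(ppmi(C))
```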

Weighting terms: TF-IDF (for the term-document matrix)

tf (frequency count): $tf(t,d) = \log_{10}(1 + \text{count}(t,d))$

idf (inverse document frequency): popular terms (terms that appear in many documents) are down-weighted: $idf(t) = \log_{10}\frac{N}{df(t)}$ (N: number of documents, df(t): number of documents containing t)

TF-IDF: $tf\text{-}idf(t,d) = tf(t,d) \cdot idf(t)$

Many word pairs should have counts > 0, but their matrix entries are 0 because of a lack of data (data sparsity) → Laplace smoothing: add 1 to every entry (pseudocount).
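The following sketch applies the TF-IDF formulas above to a toy term-document count matrix; the `tf_idf` helper and the counts are illustrative, and the comment shows where add-1 (Laplace) smoothing would slip in.

```python
import numpy as np

def tf_idf(counts):
    """TF-IDF weighting of a term-document count matrix (terms x documents),
    following the formulas above: tf = log10(1 + count), idf = log10(N / df)."""
    N = counts.shape[1]                    # number of documents
    tf = np.log10(1.0 + counts)
    df = (counts > 0).sum(axis=1)          # documents containing each term
    idf = np.log10(N / np.maximum(df, 1))  # guard against df = 0
    return tf * idf[:, None]

# Toy counts; add-1 (Laplace) smoothing would simply be `counts + 1`.
counts = np.array([
    [10, 0, 0, 2],
    [5, 4, 3, 6],
    [0, 0, 1, 0],
], dtype=float)
print(tf_idf(counts))
```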

| Pros | Cons |
| --- | --- |
| Simple and intuitive | Word/document vectors are sparse (dims are \|V\|, the vocabulary size, or \|D\|, the number of documents, often from 2k to 10k) → difficult for machine learning algorithms |
| Dimensions are meaningful (e.g., each dim is a document / a contextual word) → easy to debug and interpret (think about Explainable AI) | How to represent word meaning in a specific context? |

From sparse vectors to dense vectors:

Employ dimensionality reduction (e.g., latent semantic analysis - LSA), as sketched below

Use a different approach: prediction (coming up next)
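A minimal sketch of the LSA option via a truncated SVD, assuming the input is an already weighted term-document matrix; the `lsa` helper, the shapes, and the random stand-in data are illustrative.

```python
import numpy as np

def lsa(X, k=2):
    """Latent semantic analysis: truncated SVD of a (weighted) term-document
    matrix X (terms x documents), keeping k latent dimensions."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    word_vectors = U[:, :k] * S[:k]      # dense k-dim vectors for terms
    doc_vectors = Vt[:k, :].T * S[:k]    # dense k-dim vectors for documents
    return word_vectors, doc_vectors

X = np.random.rand(1000, 50)             # 1000 terms, 50 documents (random stand-in)
words_k, docs_k = lsa(X, k=2)
print(words_k.shape, docs_k.shape)       # (1000, 2) (50, 2)
```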

Prediction-based Approach

Introduction to ANNs used to learn word embeddings

Recap: the two major count-based methods are the term-document matrix and the term-term matrix. Raw frequency is bad: use weighting schemes to "correct" the counts, and use smoothing to take "unseen" events into account.

Formalisation

Assumptions:

● each word $w \in V$ is represented by a vector $v \in \mathbb{R}^d$ (d is often smaller than 3k)

● there is a mechanism to compute the probability $\Pr(w \mid u_1, u_2, \ldots, u_l)$ of the event that a target word w appears in a context $(u_1, u_2, \ldots, u_l)$

Task: find a vector v for each word w such that those probabilities are as high as possible for each w and its context $(u_1, u_2, \ldots, u_l)$.

We use a neural network with parameters θ to compute the probability, and train it by minimizing the cross-entropy loss:

$L(\theta) = -\sum_{(w, u_1, \ldots, u_l) \in D_{\text{train}}} \log \Pr(w \mid u_1, \ldots, u_l)$
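A minimal sketch of this loss, assuming the network is abstracted as a function that maps a context to one logit per vocabulary word; the `loss` helper, the fake network, and the toy training pairs are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(logits_fn, train_set):
    """Negative log-likelihood L(theta) summed over (w, context) training pairs.
    `logits_fn(context)` stands in for the network with parameters theta: it
    returns one logit per vocabulary word; `w` is the index of the target word."""
    total = 0.0
    for w, context in train_set:
        probs = softmax(logits_fn(context))
        total -= np.log(probs[w])
    return total

# Tiny usage with a random "network" over a 10-word vocabulary.
rng = np.random.default_rng(0)
fake_net = lambda context: rng.normal(size=10)   # ignores the context on purpose
train = [(3, [1, 2, 4, 5]), (7, [0, 2])]
print(loss(fake_net, train))
```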

Bengio's neural language model

CBOW: how the CBOW model works

Input layer:

Given a target word $w_t$, take the m words before it and the m words after it as its context words. Their word vectors are looked up in the embedding matrix C.

Projection layer:

Average the vectors of the context words: $y = \text{average}(w_{t-1}, \ldots, w_{t-m}, w_{t+1}, \ldots, w_{t+m})$. This step has no non-linearity (e.g., ReLU or tanh); it is just a simple average.

Output layer:

Apply a linear transformation with the output matrix W to the averaged vector y, and use a softmax to predict the centre word $w_t$: $P(w_t \mid w_{t-1}, \ldots, w_{t-m}, w_{t+1}, \ldots, w_{t+m}) = \text{softmax}(Wy)$. The softmax output is a probability distribution over the vocabulary, giving the likelihood of each word being the centre word.

Properties of CBOW:

Context to target: it predicts the centre word from the context words (the opposite of Skip-gram, which predicts the surrounding context words from the centre word).

Computationally efficient: because the context vectors are averaged, CBOW is usually faster to train than Skip-gram, especially on large corpora.

Suited to large corpora: CBOW tends to be more stable on large corpora and is a good fit for learning vectors over large vocabularies.

Impact of CBOW on NLP:

Word vector learning: CBOW provides an efficient way to learn word vectors and influenced later models such as GloVe and FastText.

Semantic computation: the learned vectors can be used to measure semantic similarity between words, e.g., with cosine similarity.

Downstream applications: CBOW-trained vectors are used in text classification, sentiment analysis, machine translation, and other NLP tasks.
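A minimal numpy sketch of the CBOW forward pass described above (embedding lookup, averaging, linear layer, softmax); the vocabulary size, dimension, and random matrices are illustrative, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                              # vocabulary size, embedding dimension
C = rng.normal(scale=0.1, size=(V, d))    # input embedding matrix (lookup table)
W = rng.normal(scale=0.1, size=(V, d))    # output weight matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """P(w_t | context): average the context embeddings (no non-linearity),
    then a linear layer plus softmax over the vocabulary."""
    y = C[context_ids].mean(axis=0)       # projection layer: simple average
    return softmax(W @ y)                 # distribution over all V words

# Context words at positions t-2, t-1, t+1, t+2 (indices are illustrative).
probs = cbow_forward([1, 4, 2, 7])
print(probs.shape, probs.sum())           # (10,) ~1.0
```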

| | CBOW | Skip-gram |
| --- | --- | --- |
| Objective | predict the centre word from its context words | predict the context words from the centre word |
| Speed | fast | slower (each centre word has to predict several context words) |
| Typical use | large datasets, large corpora | small datasets, small corpora |
| Quality | good at learning vectors for frequent words | better at learning vectors for rare words |

word2vec Skip-gram model: "a baby step in Deep Learning but a giant leap towards Natural Language Processing"; it can capture linear relational meanings (i.e., analogy): king - man + woman = queen.
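A hedged sketch of the analogy test using gensim, assuming the small pretrained `glove-wiki-gigaword-50` vectors shipped with gensim-data (downloaded on first use); any pretrained word2vec/GloVe vectors would work the same way.

```python
# Requires: pip install gensim; the vectors are downloaded on first use.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors

# Linear relational meaning (analogy): king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```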

Problems: biases (gender, ethnic, …)

Word embeddings are learned from data → they also capture biases implicitly present in that data.

Gender bias: "computer_programmer" is closer to "man" than to "woman"; "homemaker" is closer to "woman" than to "man".

Ethnic bias: African-American names are associated with unpleasant words (more than European-American names).

→ Debiasing embeddings is a hot (and much needed) research topic.

Dealing with unknown words

Many words are not in dictionaries, and new words are invented every day.

Solution 1: use a special token #UNK# for all unknown words.

Solution 2: use characters/sub-words instead of words: characters (c-o-m-p-u-t-e-r instead of computer) or subwords (com-omp-mpu-put-ute-ter instead of computer); a character n-gram sketch follows the summary below.

Word embeddings in a specific context

The meaning of a word standing alone can differ from its meaning in a specific context:

He lost all of his money when the bank failed.

He stood on the bank of the Amstel river and thought about his future.

Solution: w|c = f(w, c)

Solution 1: f is continuous w.r.t. c (contextual embeddings, e.g., ELMO, BERT - next week)

Solution 2: f is discrete w.r.t. c (e.g., word sense disambiguation - coming up in the next video)

Summary

Prediction-based approaches require neural network models, which are not as intuitive as count-based ones.

Low-dimensional vectors (about 200-400 dimensions); dimensions are not easy to interpret.

Robust performance for NLP tasks.
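A minimal sketch of the character n-gram idea from Solution 2 above; the `char_ngrams` helper is illustrative (FastText-style models also add boundary markers, handled here by an optional flag).

```python
def char_ngrams(word, n=3, boundary=False):
    """Character n-grams of a word, e.g. 'computer' -> com, omp, mpu, put, ute, ter.
    FastText-style models additionally wrap the word in boundary markers '<' and '>'."""
    w = f"<{word}>" if boundary else word
    return [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("computer"))                  # ['com', 'omp', 'mpu', 'put', 'ute', 'ter']
print(char_ngrams("computer", boundary=True))   # includes '<co' and 'er>'
```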

Extension: the evolution of word embeddings

Static embeddings:

Word2Vec, GloVe, FastText. Drawback: each word has a single fixed vector, so its meaning cannot change with context (e.g., the different senses of "bank").

Contextualized embeddings:

ELMo, BERT, GPT. They address polysemy by adjusting word vectors dynamically according to the context.

Contextualised Word Embedding

Static map:

f is trained on a large corpus and is based on the co-occurrence of words.
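A hedged sketch contrasting this static map with contextual embeddings, assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint (downloaded on first use): the word "bank" receives a different vector in each sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "He lost all of his money when the bank failed.",
    "He stood on the bank of the Amstel river.",
]
bank_id = tok.convert_tokens_to_ids("bank")

with torch.no_grad():
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]        # (num_tokens, 768)
        idx = enc.input_ids[0].tolist().index(bank_id)    # position of 'bank'
        # 'bank' gets a different contextual vector in each sentence.
        print(s, "->", hidden[idx][:5])
```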
