Natural Language Processing (NLP)

2025-09-03 11:36:02

NLP


| Concept | Main idea | Typical methods |
| --- | --- | --- |
| Distributional Semantics | The meaning of a word comes from its context | Co-occurrence matrix, PMI |
| Word Embeddings | Map words to a low-dimensional vector space to capture semantic relations | Word2Vec, GloVe, FastText, BERT |
| Word Sense Disambiguation | Determine the correct sense of a word from its context | WordNet, machine learning, BERT |

Introduction to NLP

A research field focussed on creating software systems with knowledge about natural (human) language.

Interdisciplinary: makes use of theories from Linguistics and adopts an Engineering approach.

Aimed at human-like understanding of language (but not yet there)

[!tip] Contributing disciplines
● Linguistics: formal models of language, linguistic knowledge
● Computer Science: representations, efficient processing, state machines, parsing algorithms, probabilistic models, dynamic programming, machine learning
● Mathematics: formal automata theory, computational modelling
● Psychology: psychologically plausible modelling of language use

Many types of ambiguity:

● Phonological: multiple interpretations due to how it sounds (different words can sound the same, so recognition admits several readings)
● Lexical: multiple interpretations due to a word having multiple senses
● Syntactic: due to a word having more than one possible part of speech, or due to prepositional phrase attachment
● Semantic: multiple possible interpretations unless knowledge of the world is available

[!danger] Two major approaches to NLP

Symbolic:
● Rule- and dictionary-based systems
● Captures linguistic knowledge in rules written by experts

Statistical/Machine learning-based:
● Data-driven
● Uses large amounts of (labelled) textual data to train systems and discover patterns

Comparison

| Symbolic | Statistical/Machine learning |
| --- | --- |
| ✅ Expert knowledge yields highly precise results | ✅ Can generalise well on unseen examples |
| ❎ Shortage of experts | ❎ Need people for labelling |
| ❎ Laborious rule writing, dictionary preparation | ❎ Time-consuming and laborious labelling |
| ❎ Domain adaptation problematic | ❎ Must retrain for new domain |
| ✅ Results can be interpreted | ❎ Often cannot inspect/change models |
| ✅ Good when labelled data is hard to obtain | ✅ Good where dictionaries are unavailable |

NLP Pipelines

A ‘complete’ NLP system is usually a pipeline of components


● Sentence Segmentation
● Tokenisation
● Parsing
● Information

● Sentence segmentation (split)
● Tokenisation (split)
● Named entity recognition (combine)

Why is NLP challenging:

● Natural language evolves: new words appear constantly
● Syntactic rules are flexible
● Ambiguity is inherent

Why Machine Learning for NLP

Traditional rule-based artificial intelligence (symbolic AI):
● requires expert knowledge to engineer the rules
● is not flexible enough to adapt to multiple languages, domains and applications

Learning from data (machine learning) adapts:
● to evolution: just learn from new data
● to different applications: just learn with the appropriate target representation

Sentence Segmentation

Sentence detection is done before the text is tokenised.

Task:

● Determination of boundaries between sentences
● The sentences are used in subsequent NLP tasks
● Is it enough to detect the full stop? A full stop could be an end-of-sentence (EOS) marker, an end-of-abbreviation marker, or both

This is usually the first step in text mining, because it splits unstructured text into basic processing units.

Variation in delimiters

Typical: ".", "!", "?"

[!todo] Approaches

● Regular expressions (patterns)
● Dictionaries (e.g., abbreviation lists)
● Hand-crafted rules (e.g., to check whether the word following an EOS delimiter starts with an uppercase character)
● Statistical and ML approaches
● Hybrid approaches

Examples of useful rules or features

First character after a potential EOS character
● Should be uppercase? Problematic for some languages, e.g. German
● Permissible characters after a potential EOS, e.g. lowercase characters?

Abbreviations
● Titles are not likely to occur at EOS (e.g., Dr. Jones)
● Company indicators could occur at EOS (e.g., MySocialMedia Inc.)
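The rules above can be combined into a small hybrid splitter. The following is a minimal sketch rather than a production approach: the function name `split_sentences` and the `TITLES` abbreviation list are hypothetical, and the uppercase check assumes English-like capitalisation.

```python
import re

# Hypothetical abbreviation list, for illustration only:
# titles are unlikely to occur at the end of a sentence.
TITLES = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def split_sentences(text):
    """Split text at '.', '!' or '?' using two hand-crafted rules."""
    sentences, start = [], 0
    # A candidate boundary is a delimiter followed by whitespace or end of text,
    # so the dot inside a decimal such as 3.4 is never a candidate.
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = match.end()
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        rest = text[end:].lstrip()
        # Rule 1: a title such as "Dr." does not end a sentence.
        if last_word in TITLES:
            continue
        # Rule 2: the word after the delimiter should start with an uppercase character.
        if rest and not rest[0].isupper():
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Jones works at MySocialMedia Inc. She joined in 2018. Is that right? Yes!"))
```

Company indicators such as "Inc." are deliberately not in the list, because they can occur at EOS; the sketch relies on the uppercase rule there, which will still split wrongly when a capitalised word follows a company name mid-sentence.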

OpenNLP

The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence.

OpenNLP developer documentation

The output is a sequence of strings, each of which is one detected sentence.

spaCy

for pdf and word docs

spaCy developer documentation

Tokenisation

Why need Tokenisation

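Machine learning models usually use "words" as features, for example in word frequency counts or sentiment analysis. English words are separated by spaces, whereas Chinese has no such separators, so the two need different processing methods.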

Break sentence into tokens

3 main classes of tokens often considered

● Morphosyntactic word
● Punctuation mark or special symbol
● A number

Other cases:
● Endings of contractions, e.g., "'re" in "we're"
● Compounds and multi-words (e.g., daughter-in-law)

Challenges:

● Character encoding: ASCII only? Unicode (UTF-8)
● ASCII-fication or romanisation of texts: transliterations
● Results from OCR may be poor
● Pre-processing to detect/correct errors
● OCR errors may appear as correct text (but not the intended text)
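As an illustration of the ASCII-fication challenge listed above, here is a minimal sketch using Python's standard `unicodedata` module. The function name `asciify` is hypothetical, and the approach is only one rough way to do it: it decomposes accented characters and then discards everything that has no ASCII form.

```python
import unicodedata

def asciify(text):
    """Approximate ASCII-fication: strip diacritics, drop characters with no ASCII form."""
    # NFKD splits an accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently drops every non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(asciify("caf\u00e9 in Z\u00fcrich"))
```

The conversion is lossy: characters without a decomposition are simply removed, which is one reason ASCII-fication is a challenge rather than a routine pre-processing step.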

[!attention] Challenging

● Hyphenation
  ● Manchester-based
  ● Sister-in-law
● Telephone numbers: many different formats with whitespace, dots, slashes, hyphens, parentheses, plus signs
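Several of the cases above (hyphenated compounds, contraction endings, numbers) can be handled by a regular-expression tokeniser. This is a minimal sketch under simplifying assumptions, not a complete tokeniser: `TOKEN_PATTERN` and `tokenise` are hypothetical names, only the straight apostrophe is recognised, and telephone numbers and dates would need further patterns.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d*\.\d+            # decimals such as 0.05, 3.4 or .6
  | \d+                 # whole numbers
  | '\w+                # endings of contractions such as 're
  | \w+(?:-\w+)*        # words, including hyphenated compounds
  | [^\w\s]             # any other single punctuation mark or symbol
    """,
    re.VERBOSE,
)

def tokenise(sentence):
    """Return the list of tokens found in the sentence, in order."""
    return TOKEN_PATTERN.findall(sentence)

print(tokenise("We're Manchester-based and paid 0.05 for .6 kg."))
```

Keeping "Manchester-based" as one token while splitting "We're" into two is a design decision, not a fact about the language; a different application might reasonably choose otherwise.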

[!example]+

● Dates: 04 January 2018; 04-01-2018; Jan 4, 2018
● Decimals: 0.05; 3.4; .6
● Monetary values: a £5-a-dish dinner

Split or not to split

● Sentence segmentation (split)
● Tokenisation (split)
● Named entity recognition (combine)

In other words: tokenisation is knowing when to split (not when to combine).

Annotation Formats

Annotation adds extra information to each token in the text, for example Part-of-Speech Tagging (POS Tagging) or Named Entity Recognition (NER).

Understanding documents

● Documents rarely have a simple structure
● Documents are meant to be human-readable

[!example]

● news article
● research article

Annotations: Enabling machine-readability

The figure shows plain text being turned into a machine-readable form.

Types of annotation Formats

● Boundary notation
● Inline markup language elements
● Stand-off

● delimiter-separated values (DSV)
● JSON

Part-of-Speech Tagging (POS Tagging)

Assigns a part-of-speech tag to each word.

[!example]

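```python
import nltk
from nltk.tokenize import word_tokenize

# Download the tokeniser and tagger models on first use.
# Newer NLTK releases name these resources 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "She runs fast."
tokens = word_tokenize(sentence)   # the final full stop becomes its own token
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
# e.g. [('She', 'PRP'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')]
```

Named Entity Recognition (NER)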

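NER identifies proper names in text, such as the names of people, places and organisations.

Sentence: "Apple is looking at buying U.K. startup for $1 billion."

NER annotation result:

● "Apple" → ORG (organisation)
● "U.K." → GPE (geopolitical entity, i.e. a place name)
● "$1 billion" → MONEY (monetary value)

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
sentence = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(sentence)

# Each recognised entity carries its text span and a label
for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG
# U.K. GPE
# $1 billion MONEY
```

Boundary Notation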

Done at the level of individual tokens.

How do we encode units of interest spanning several tokens?

BIO: B = Begin, I = Inside, O = Outside

[!example]

Strengths:
● simple

Limitations:
● cannot handle hierarchical or structured annotations, e.g., nested entities (NEs), relations, events
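The BIO scheme described above can be sketched in a few lines. This is an illustrative example only: the function `to_bio`, the tokenisation and the span indices are assumptions made for the sketch, and the `B-`/`I-` prefix plus label is one common way of writing the tags.

```python
def to_bio(tokens, spans):
    """Encode entity spans as BIO tags, one tag per token.

    spans: list of (start, end, label) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # One tag per token means overlap and nesting cannot be represented.
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("overlapping or nested spans cannot be encoded in BIO")
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

# Hypothetical tokenisation of an example sentence.
tokens = ["Apple", "is", "buying", "a", "U.K.", "startup", "for", "$", "1", "billion", "."]
spans = [(0, 1, "ORG"), (4, 5, "GPE"), (7, 10, "MONEY")]

# Written out as tab-separated lines, one token per line.
for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

The `ValueError` branch makes the limitation concrete: as soon as two spans share a token, there is no single tag that can describe it.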

[!example] Nested entities

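Shows the result of Named Entity Recognition (NER).

NER is an NLP technique for identifying key entities in text, such as geopolitical entities (GPE), organisations (Org), people (Person) and events (Conflict_Attack).

● GPE: the name of a country or region
● Org: the name of an institution or organisation
● Person: the name of a specific person
● Contact_Meet: marks events involving meetings or high-level meetings
● Conflict_Attack: marks events related to war or attacks
● Relation annotations (arrows) indicate that someone took part in an event
● Target: marks an entity as the target of an action

Inline markup language elements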

By addition of markup tags within text

● HTML
● XML
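To illustrate inline markup, the sketch below parses a small XML-annotated sentence with Python's standard `xml.etree.ElementTree`. The sentence, the tag names `s`, `ORG` and `GPE`, and the helper functions are all hypothetical; the point is only that nested tags can carry nested entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence; a GPE is nested inside an ORG.
marked_up = "<s><ORG>University of <GPE>Manchester</GPE></ORG> is in the <GPE>UK</GPE>.</s>"

def extract_entities(xml_text):
    """Return (label, text) pairs for every element below the root, in document order."""
    root = ET.fromstring(xml_text)
    entities = []
    for element in root.iter():
        if element is root:
            continue
        # itertext() joins the element's own text with that of nested elements.
        entities.append((element.tag, "".join(element.itertext())))
    return entities

def plain_text(xml_text):
    """Recover the raw sentence by dropping all tags."""
    return "".join(ET.fromstring(xml_text).itertext())

print(extract_entities(marked_up))
print(plain_text(marked_up))
```

Note that the raw text has to be recovered by stripping the markup, since the annotations live inside the text itself.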

[!example]

Strengths:

● can handle annotations which are hierarchical (e.g., nested NEs, trees) and structured (e.g., events)

Limitations:
● requires substantial processing with standard XML parsers
● impossible to encode overlapping/intersecting annotations, e.g., "second Iraqi city of Basra"

Stand-off Annotations (JSON format)

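A way of annotating that keeps the annotation information separate from the original text, rather than embedding the annotations directly in the text.

Why use stand-off annotation?

● Avoids contaminating the original data: the original text stays unchanged, and the annotations are stored in an external file or data structure.
● Allows multiple layers of annotation: the same text can carry several kinds of annotation (such as part of speech, syntactic structure, named entities), each managed independently.
● Eases version control: annotation data and original text are stored separately, which helps manage different versions of the annotations.
● Supports long texts: for very large texts, the annotations are stored in an indexed data structure, which improves efficiency.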

[!example]+

UK and US discuss the role of UN.

Stand-off annotation file (character offsets, end exclusive):

```json
{
  "text": "UK and US discuss the role of UN.",
  "annotations": [
    { "id": "T1", "type": "GPE", "start": 0, "end": 2, "text": "UK" },
    { "id": "T2", "type": "GPE", "start": 7, "end": 9, "text": "US" },
    { "id": "T3", "type": "ORG", "start": 30, "end": 32, "text": "UN" }
  ]
}
```

● annotations are stored separately
● requires a way to link between annotations and text
● links annotations to text using indexing based on character offsets (computed over the raw text)

Strengths:
● original raw text is left untouched
● can handle structured and overlapping annotations

Limitations:
● not readily human-readable
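Linking annotations back to the text through character offsets can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `resolve` and `find_mismatches` are hypothetical, and offsets are taken to be zero-based with an exclusive end, computed over the raw text.

```python
import json

document = json.loads("""
{
  "text": "UK and US discuss the role of UN.",
  "annotations": [
    {"id": "T1", "type": "GPE", "start": 0,  "end": 2,  "text": "UK"},
    {"id": "T2", "type": "GPE", "start": 7,  "end": 9,  "text": "US"},
    {"id": "T3", "type": "ORG", "start": 30, "end": 32, "text": "UN"}
  ]
}
""")

def resolve(doc):
    """Look up each annotation's span in the raw text via its offsets."""
    text = doc["text"]
    return [(a["id"], a["type"], text[a["start"]:a["end"]]) for a in doc["annotations"]]

def find_mismatches(doc):
    """Return ids of annotations whose offsets do not select their stored text."""
    text = doc["text"]
    return [a["id"] for a in doc["annotations"] if text[a["start"]:a["end"]] != a["text"]]

print(resolve(document))
print(find_mismatches(document))
```

Because the raw text is never modified, a check like `find_mismatches` is cheap to run and catches offsets that point at the wrong characters, which is easy to get wrong when counting by hand.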
