Natural language processing

NLP is the act of performing computation on human language.

Applications

Search, particularly as it progresses from indexing and retrieving data to understanding meaning and citing specific sections of pages.
Conversation, e.g. chat bots providing customer service.
Classification of input, e.g. news articles.
Context-aware autocompletion.

NLP is hard

Human language is a complex system built on symbols: it is a discrete, symbolic, categorical signalling system that lets us convey the same meaning over multiple channels (speech, gesture, signs). The human brain, by contrast, appears to encode language as a continuous pattern of activation, and the symbols themselves are transmitted as continuous audio and visual signals.

Challenges differ across languages. In Vietnamese, for example, spaces separate syllables rather than words, so word boundaries can't be found by splitting on spaces.
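To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match segmentation over space-separated syllables. The lexicon is a hypothetical three-entry toy and `max_match` is an illustrative name; a real system would rely on a large dictionary or a trained model.

```python
def max_match(syllables, lexicon, max_len=3):
    """Greedy longest-match segmentation over space-separated syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking until the lexicon matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                # A single syllable is accepted as a fallback even if unknown.
                words.append(candidate)
                i += n
                break
    return words

# Hypothetical toy lexicon, assumed purely for illustration.
LEXICON = {"học", "học sinh", "sinh học"}

segmented = max_match("học sinh học sinh học".split(), LEXICON)
print(segmented)  # ['học sinh', 'học sinh', 'học']
```

The same lexicon also admits the segmentation "học sinh", "học", "sinh học", so neither the spaces nor a word list settles the boundaries on their own; the greedy strategy simply commits to the first long match it finds.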

NLP is considered an AI-complete problem, i.e. as hard as the central problem of AI itself. Because words can be arranged into an unbounded number of sequences, no narrow, special-purpose algorithm can account for every possibility, and as such more general intelligence and reasoning ability is required.

Types of analysis

A sentence is determined to be valid based on two factors:

Syntax is the grammatical structure of text. Syntactic analysis (or parsing) is the analysis of natural language based on a grammar, applied to categories and groups of words.
Semantics is the meaning being conveyed. Semantic analysis applies meaning and interpretation.
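As a rough sketch of grammar-based syntactic analysis, the snippet below counts parse trees with a CYK-style chart over a toy grammar in Chomsky normal form. The grammar, the lexicon and the name `count_parses` are all made up for illustration. A count of zero means this grammar rejects the sentence; a count above one means syntax alone leaves the sentence ambiguous, and meaning is needed to choose a reading.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
LEXICAL_RULES = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(sentence, start="S"):
    """Return how many parse trees the toy grammar assigns to the sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][A] = number of ways category A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICAL_RULES.get(word, []):
            chart[(i, i + 1)][category] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between left and right child
                for parent, left, right in BINARY_RULES:
                    ways = chart[(i, k)][left] * chart[(k, j)][right]
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)][start]

print(count_parses("I saw the man with the telescope"))  # 2
print(count_parses("I saw the man"))                     # 1
print(count_parses("saw the I man"))                     # 0
```

The two parses of the first sentence differ in whether "with the telescope" attaches to "saw" or to "the man".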

Techniques

Parsing breaks down language based on syntactic/grammatical structure.
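Stemming reduces words to a root form by stripping affixes, e.g. reducing "organising" to "organis"; the stem need not be a dictionary word. (Reducing to the canonical dictionary form, e.g. "organise", is lemmatisation.)
Text segmentation breaks text down into meaningful units, such as words, sentences or topics.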
Named Entity Recognition seeks the names of people, places, percentages, monetary values, medical codes, etc.
Relationship Extraction is the identification of relationships between words, e.g. adjectives and parent nouns.
Sentiment Analysis examines opinions toward objects.
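
Stemming is often approximated by stripping suffixes. Below is a minimal sketch, assuming a tiny hand-picked suffix list; `SUFFIXES` and `naive_stem` are illustrative names, and this is not a standard algorithm such as Porter's. Note that the result is a truncated stem, which need not be a dictionary word.

```python
# Hypothetical suffix list, far from complete; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["organising", "organised", "organises", "sing"]:
    print(w, "->", naive_stem(w))
# organising -> organis
# organised -> organis
# organises -> organis
# sing -> sing
```

The `min_stem` guard stops short words such as "sing" from being cut down to a single letter.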

Tokenising

A sentence or text fragment is first broken into words (tokens). In English a reasonable first pass is to split on whitespace, although punctuation and contractions then need separate handling.
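
A minimal sketch comparing a plain whitespace split with a small regular-expression tokeniser. The pattern and the name `tokenise` are assumptions made for illustration, not a standard; the pattern only recognises the straight apostrophe in contractions.

```python
import re

# Words (optionally with internal apostrophes), or any single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Don't panic, it's fine."
print(text.split())    # ["Don't", 'panic,', "it's", 'fine.']
print(tokenise(text))  # ["Don't", 'panic', ',', "it's", 'fine', '.']
```

The whitespace split leaves punctuation glued to the neighbouring word, whereas the pattern emits it as a separate token.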

Part-Of-Speech tagging

POS tagging is typically treated as a supervised learning task in which each word within the sentence is annotated with a class.

There are multiple tag sets in use for this process, but at a high level:

Nouns (N) name people, places, things or ideas.
Pronouns (PRO) stand in for nouns, e.g. "she" or "it".
Verbs (V) are actions.
Adverbs (ADV) describe verbs.
Adjectives (ADJ) describe nouns.
Conjunctions (CON) join together clauses.
Prepositions (P) describe relationships between words, e.g. in position or time.
Interjections (INT) are abrupt remarks.
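
To illustrate the supervised setup, here is a sketch of a most-frequent-tag baseline: it learns from hand-tagged sentences which tag each word receives most often, and ignores context entirely. The training data is a hypothetical toy corpus using the coarse tags above, and `train` and `tag` are illustrative names.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences (toy data for illustration).
TRAINING = [
    [("dogs", "N"), ("chase", "V"), ("cats", "N")],
    [("she", "PRO"), ("runs", "V"), ("quickly", "ADV")],
    [("big", "ADJ"), ("dogs", "N"), ("bark", "V")],
    [("runs", "N"), ("matter", "V")],
    [("she", "PRO"), ("runs", "V")],
]

def train(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(tokens, model, default="N"):
    """Tag each token with its most frequent tag; unknown words get the default."""
    return [(token, model.get(token.lower(), default)) for token in tokens]

model = train(TRAINING)
tagged = tag("Big zebras chase cats".split(), model)
print(tagged)  # [('Big', 'ADJ'), ('zebras', 'N'), ('chase', 'V'), ('cats', 'N')]
```

Because "runs" appears twice as a verb and once as a noun in the toy data, this baseline always tags it V, whatever the surrounding words are.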

A tagger can use context, such as the previous and next words and whether a word is capitalised, to disambiguate. Many sentences nonetheless have ambiguous meanings, and so may generate multiple parse trees.
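
The contextual cues mentioned above can be sketched as a per-token feature dictionary that a supervised tagger could consume. The feature names, the `<START>` and `<END>` markers and the function `token_features` are assumptions made for illustration.

```python
def token_features(tokens, i):
    """Build context features for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalised": word[:1].isupper(),
        # Sentence boundaries get placeholder markers instead of a neighbour.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

tokens = ["Time", "flies", "like", "an", "arrow"]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[1])
# {'word': 'flies', 'is_capitalised': False, 'prev_word': 'time', 'next_word': 'like'}
```

Each dictionary would be paired with the token's gold tag during training, so the classifier can learn, for instance, that "flies" preceded by "time" behaves differently from "flies" preceded by "fruit".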