Phonology & Phonological analysis

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.[1]

Using Latin numerical prefixes, an n-gram of size 1 is referred to as a 'unigram'; size 2 is a 'bigram' (or, less commonly, a 'digram'); size 3 is a 'trigram'. English cardinal numbers are sometimes used, e.g., 'four-gram', 'five-gram', and so on. In computational biology, a polymer or oligomer of a known size is called a k-mer instead of an n-gram, with specific names using Greek numerical prefixes such as 'monomer', 'dimer', 'trimer', 'tetramer', 'pentamer', etc., or English cardinal numbers, 'one-mer', 'two-mer', 'three-mer', etc.
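
As a concrete illustration of this terminology, the following minimal sketch (in Python, with an illustrative input sentence) extracts the unigrams, bigrams, and trigrams of a token sequence.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ...
print(ngrams(tokens, 2))  # bigrams:  ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams: ('to', 'be', 'or'), ('be', 'or', 'not'), ...
```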

Applications

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n − 1)-order Markov model.[2] n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
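
To make the space–time tradeoff concrete, the toy sketch below counts how many distinct n-grams a small corpus contains for several values of n; the corpus is made up, and real models store these counts in far larger tables.

```python
corpus = "the dog chased the cat and the cat chased the mouse".split()

# The number of distinct n-grams the model must store grows quickly with n,
# which is the space side of the space-time tradeoff mentioned above.
for n in range(1, 5):
    grams = [tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)]
    print(n, len(set(grams)), "distinct n-grams out of", len(grams), "total")
```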

n-gram models

An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.

This idea can be traced to an experiment in Claude Shannon's work on information theory. Shannon posed the question: given a sequence of letters (for example, the sequence 'for ex'), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, ...; where the probabilities of all possible 'next-letters' sum to 1.0.
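
A minimal sketch of this next-letter experiment, using a toy training string and a fixed history length (neither of which is Shannon's actual data): count which character follows each history and normalize the counts into a distribution.

```python
from collections import Counter, defaultdict

text = "for example, for everyone, for ever"  # toy training data
history_len = 6                                # e.g. the history 'for ex'

# Count which character follows each history of length `history_len`.
counts = defaultdict(Counter)
for i in range(len(text) - history_len):
    history = text[i:i + history_len]
    counts[history][text[i + history_len]] += 1

# Normalize the counts for one history into probabilities summing to 1.0.
hist = "for ex"
total = sum(counts[hist].values())
for ch, c in counts[hist].items():
    print(repr(ch), c / total)
```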

More concisely, an n-gram model predicts x_i based on x_{i−(n−1)}, ..., x_{i−1}. In probability terms, this is P(x_i | x_{i−(n−1)}, ..., x_{i−1}). When used for language modeling, independence assumptions are made so that each word depends only on the last n − 1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of estimating the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together.
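
A common way to estimate these conditional probabilities is by relative frequency (maximum likelihood), dividing the count of an n-gram by the count of its (n − 1)-word history, and mapping out-of-vocabulary words to a single unknown token. The bigram (n = 2) sketch below, with a made-up three-sentence corpus, is only a schematic illustration of that idea.

```python
from collections import Counter

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "sleeps"]]
vocab = {w for s in sentences for w in s}

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def bigram_prob(w, prev):
    """Relative-frequency estimate of P(w | prev) for a bigram model (n = 2)."""
    w = w if w in vocab else "<unk>"          # group unknown words together
    prev = prev if prev in vocab else "<unk>"
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("dog", "the"))    # 2/3: 'the dog' occurs twice, 'the' three times
print(bigram_prob("zebra", "the"))  # unseen word -> 0.0 before smoothing
```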

Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.), can be described as following a categorical distribution (often imprecisely called a 'multinomial distribution').

In practice, the probability distributions are smoothed by assigning non-zero probabilities to unseen words or n-grams; see smoothing techniques.
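
As one simple instance of such smoothing, add-one (Laplace) smoothing adds a pseudo-count to every possible n-gram so that unseen events receive a small non-zero probability. The sketch below uses toy bigram counts and an assumed vocabulary size purely for illustration.

```python
from collections import Counter

# Toy bigram and unigram counts; in practice these come from a training corpus.
bigram_counts = Counter({("the", "dog"): 2, ("the", "cat"): 1})
unigram_counts = Counter({"the": 3, "dog": 2, "cat": 1})
vocab_size = 4  # assumed vocabulary size, including an unknown-word token

def laplace_bigram_prob(w, prev):
    """Add-one smoothed estimate of P(w | prev): every bigram gets count + 1."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)

print(laplace_bigram_prob("dog", "the"))    # seen bigram: probability shrinks slightly
print(laplace_bigram_prob("zebra", "the"))  # unseen bigram: small non-zero probability
```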

Applications and considerations

n-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for different languages.[4] For sequences of characters, the 3-grams (sometimes referred to as 'trigrams') that can be generated from 'good morning' are 'goo', 'ood', 'od ', 'd m', ' mo', 'mor' and so forth, counting the space character as a gram (sometimes the beginning and end of a text are modeled explicitly, adding '__g', '_go', 'ng_', and 'g__'). For sequences of words, the trigrams (shingles) that can be generated from 'the dog smelled like a skunk' are '# the dog', 'the dog smelled', 'dog smelled like', 'smelled like a', 'like a skunk' and 'a skunk #'.
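
The two examples in this paragraph can be reproduced with a short sketch; the padding symbols ('_' for character n-grams, '#' for sentence boundaries) follow the conventions used above.

```python
def char_trigrams(text, pad="_"):
    """Character 3-grams, padded so the beginning and end of the text are modeled."""
    padded = pad * 2 + text + pad * 2
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_trigrams(sentence, boundary="#"):
    """Word 3-grams (shingles), with a boundary marker at each end of the sentence."""
    tokens = [boundary] + sentence.split() + [boundary]
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

print(char_trigrams("good morning"))                  # '__g', '_go', 'goo', ..., 'ng_', 'g__'
print(word_trigrams("the dog smelled like a skunk"))  # '# the dog', ..., 'a skunk #'
```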

Practitioners more interested in multiple-word terms might preprocess strings to remove spaces. Many simply collapse whitespace to a single space while preserving paragraph marks, because whitespace is frequently either an element of writing style or introduces layout or presentation not required by the prediction and deduction methodology. Punctuation is also commonly reduced or removed in preprocessing, and is frequently used to trigger functionality.
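
A rough sketch of the kind of preprocessing described above, collapsing runs of spaces and tabs to a single space while preserving paragraph breaks and stripping most punctuation; the exact rules are application-specific, and the regular expressions here are only one plausible choice.

```python
import re

def preprocess(text):
    """Normalize whitespace and punctuation before n-gram extraction."""
    text = re.sub(r"[^\w\s]", " ", text)     # replace punctuation with spaces
    text = re.sub(r"\n\s*\n", "\n\n", text)  # keep paragraph breaks as blank lines
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs to one space
    return text.strip()

print(preprocess("Hello,   world!\n\nNew    paragraph..."))
```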

n-grams can also be used for sequences of words or almost any type of data. For example, they have been used for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from.[5] They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.[6]
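
For example, applying the same idea to genetic data, the short sketch below counts the k-mers of a DNA string, the kind of representation a first-pass sequence search might index; the sequence and the choice of k are purely illustrative.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count the k-mers (length-k substrings) of a genetic sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

print(kmer_counts("ACGTACGTGACG", 3))  # e.g. 'ACG' occurs three times
```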

n-gram models are often criticized because they lack any explicit representation of long range dependency. This is because the only explicit dependency range is (n − 1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model). For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.

Another criticism that has been made is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.

In practice, n-gram models have been shown to be extremely effective in modeling language data, which is a core component in modern statistical language applications.

Most modern applications that rely on n-gram based models, such as machine translation applications, do not rely exclusively on such models; instead, they typically also incorporate Bayesian inference. Modern statistical models are typically made up of two parts, a prior distribution describing the inherent likelihood of a possible result and a likelihood function used to assess the compatibility of a possible result with observed data. When a language model is used, it is used as part of the prior distribution (e.g. to gauge the inherent 'goodness' of a possible translation), and even then it is often not the only component in this distribution.
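
As a schematic illustration of this combination (not any particular system's implementation), the sketch below scores translation candidates by adding a language-model prior log-probability to a translation-model likelihood log-probability; all sentences and scores are made up.

```python
import math

# Illustrative log-probabilities only; real systems estimate these from data.
def language_model_logprob(candidate):
    """Prior: how 'good' the candidate is as English, e.g. per an n-gram model."""
    scores = {"the house is small": math.log(0.010),
              "small the is house": math.log(0.0001)}
    return scores[candidate]

def translation_model_logprob(source, candidate):
    """Likelihood: compatibility of the candidate with the observed source sentence."""
    return math.log(0.05)  # assumed equal here, to isolate the effect of the prior

source = "das Haus ist klein"
candidates = ["the house is small", "small the is house"]
best = max(candidates,
           key=lambda c: language_model_logprob(c) + translation_model_logprob(source, c))
print(best)  # the fluent candidate wins because the prior prefers it
```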

Handcrafted features of various sorts are also used, for example variables that represent the position of a word in a sentence or the general topic of discourse. In addition, features based on the structure of the potential result, such as syntactic considerations, are often used. Such features are also used as part of the likelihood function, which makes use of the observed data. Conventional linguistic theory can be incorporated in these features (although in practice, it is rare that features specific to generative or other particular theories of grammar are incorporated, as computational linguists tend to be 'agnostic' towards individual theories of grammar).
