Fundamental Aspects of Language Modelling

⚡️Hudson Ⓜ️endes
Nov 15, 2022
"artificial brain book fantasy realm" (DALL-E)

Language Modelling is the process of producing a probabilistic model that estimates how likely an arbitrary sequence of text is, typically in the form of P(word | previous words in the sentence). These probabilities are learned from a training corpus (or a training split); after learning, the model must be evaluated against a test corpus, or against the remainder of the original corpus left over by the split. This article walks through the main aspects of how this is done, as well as the pitfalls language modelling often runs into and their solutions.
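To make that definition concrete, the sketch below multiplies P(word | previous words) along a sequence, following the chain rule of probability. The conditional probabilities here are hand-picked toy values chosen purely for illustration, not estimates learned from any corpus:

```python
# Chain-rule decomposition a language model relies on:
#   P(w1 .. wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1 .. wn-1)

def sentence_probability(words, cond_prob):
    """Multiply P(word | previous words) over the whole sequence."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, tuple(words[:i]))
    return prob

# Hand-picked toy probabilities, purely for illustration.
TOY_PROBS = {
    ("i", ()): 0.20,
    ("drink", ("i",)): 0.05,
    ("strong", ("i", "drink")): 0.10,
    ("tea", ("i", "drink", "strong")): 0.30,
}

def toy_cond_prob(word, history):
    # Tiny floor probability for combinations we never listed.
    return TOY_PROBS.get((word, history), 1e-6)

print(sentence_probability(["i", "drink", "strong", "tea"], toy_cond_prob))
# 0.20 * 0.05 * 0.10 * 0.30 = 3.0e-04
```

In a real model, those conditionals come from counts over a training corpus rather than a hand-written table, which is exactly what the next section turns to.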

The reason “strong tea” makes more sense than “strong engine”, whereas “powerful tea” makes less sense than “powerful engine”, is how these combinations are usually employed in textual communication. Some combinations are not impossible; they are just less likely. Words are frequently generalised as states, and their sequences as transitions, each with its own probability. That definition aligns text sequences with Markov chains, traditionally used to learn and build probabilistic language models, where bi-grams (or longer n-grams) are used to predict how likely it is to move from one state (word) to another. The product of those transition probabilities then gives us an estimate of how likely an entire sequence is.
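As a minimal sketch of that idea, the snippet below builds a bigram model over a hand-made toy corpus (the corpus contents and the <s>/</s> sentence-boundary markers are my own illustrative choices). It estimates each transition probability by maximum likelihood, count(prev, curr) / count(prev), and multiplies the transitions along a sequence:

```python
from collections import Counter

# Hand-made toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> strong tea </s>",
    "<s> strong tea </s>",
    "<s> powerful engine </s>",
    "<s> strong engine </s>",
]

# Count state occurrences and state-to-state transitions (bigrams).
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, curr)] += 1

def transition_prob(prev, curr):
    """Maximum-likelihood estimate: count(prev curr) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

def sequence_prob(tokens):
    """Product of the transition probabilities along the sequence."""
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= transition_prob(prev, curr)
    return prob

print(sequence_prob("<s> strong tea </s>".split()))    # 0.5
print(sequence_prob("<s> powerful tea </s>".split()))  # 0.0 (unseen transition)
```

Note how the unseen transition from “powerful” to “tea” drives the whole product to zero; softening such hard zeroes is one of the pitfalls, and solutions, that language modelling has to deal with.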
