NLP - Language Modeling
date
Sep 23, 2024
type
Post
AI summary
Language modeling in NLP is essential for applications like autocomplete and translation, focusing on text fluency by predicting the next word based on context. It utilizes models such as unigram, bigram, and n-gram, with effective modeling requiring efficient context management through fixed windows. Key challenges include representing history for accurate predictions, while neural language models and techniques like LSTM enhance performance by managing memory and context in sequences.
slug
nlp-language-modeling
status
Published
tags
NLP
summary
Language modeling in NLP is essential for applications like autocomplete and translation, focusing on text fluency by predicting the next word based on context. It utilizes models such as unigram, bigram, and n-gram, with effective modeling requiring efficient context management through fixed windows. Key challenges include representing history for accurate predictions, while neural language models and techniques like LSTM enhance performance by managing memory and context in sequences.
What is Language Modeling?
Applications include: autocomplete, summarization, translation, spell and grammar correction, text generation, and chatbots. Relation to speech recognition and image recognition:
- In speech recognition, language modeling helps predict the next word in a sequence based on context.
- In image recognition, language models are often used for tasks like image captioning, where the model generates text that describes the contents of an image.
A good language model focuses on modeling the fluency of a sequence of words, rather than every aspect of language production. Fluency means that the generated text resembles accurate, natural language.
How to Model Fluent Language?
Vocabulary ($V$) is the set of words that our model recognizes. For example: English has over 600,000 words, but a language model might use only the top 50,000 most frequent words. To handle out-of-vocabulary words and sequence boundaries, we can introduce special tokens:
- UNK: Represents an unknown word.
- SOS: Start of sequence.
- EOS: End of sequence.
Fluency is approximated by probability: we treat each word as a random variable. Each random variable can take on the value of any word in the vocabulary ($V$). Conveniently, we order the random variables according to how the words appear in the sequence (although this ordering isn't strictly required mathematically):

$$P(W_1 = w_1, W_2 = w_2, \ldots, W_n = w_n)$$

Using the chain rule from Bayesian statistics, the equation above becomes:

$$P(w_1, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1})$$

where $w_1, \ldots, w_{t-1}$ is called the history or context of the t-th word. One of the key challenges in language modeling is figuring out how to represent the history or context of previous words in a way that helps the algorithm better predict the next word (i.e., the t-th word).
For arbitrarily long sequences, it becomes computationally intractable to manage a large (or infinite) number of random variables representing the entire history of the sequence. Thus, it’s crucial to limit the history/context by creating a fixed window of previous words.
This fixed window of words provides a practical way to handle context without overwhelming the model with too much information or requiring unbounded resources.
Unigram Models
Taking the equation above and throwing away the context completely, the equation becomes:

$$P(w_1, \ldots, w_n) \approx \prod_{t=1}^{n} P(w_t)$$
Basically, we assume all words are independent of each other. This is obviously wrong, because the words could appear in any order and still be scored as fluent. For example, P(moles, garden, snuck, the, in) would have a high probability.
Bigram Models
Taking the equation above and assuming each word depends only on the immediately previous word, the equation becomes:

$$P(w_1, \ldots, w_n) \approx \prod_{t=1}^{n} P(w_t \mid w_{t-1})$$
N-gram Models
Each word is conditioned on a fixed window of the previous $k$ words:

$$P(w_1, \ldots, w_n) \approx \prod_{t=1}^{n} P(w_t \mid w_{t-k}, \ldots, w_{t-1})$$
However, we're not going to create $k$ random variables and learn a joint probability distribution over all of them, where each variable can take on $|V|$ different values. Instead, we approximate the conditional probability distribution using a neural network.
As you can see, how much context/history we keep matters a lot, and that is controlled by the size of our n-grams.
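To make these conditional probabilities concrete, here is a minimal count-based bigram sketch (the kind of estimate the neural approximation replaces); the toy corpus and function name are illustrative.

```python
from collections import Counter

def bigram_probs(corpus_tokens):
    """Estimate P(w_t | w_{t-1}) from raw counts (no smoothing)."""
    unigram_counts = Counter(corpus_tokens)
    bigram_counts = Counter(zip(corpus_tokens[:-1], corpus_tokens[1:]))
    # P(w | prev) = count(prev, w) / count(prev)
    return {
        (prev, w): count / unigram_counts[prev]
        for (prev, w), count in bigram_counts.items()
    }

tokens = "SOS the moles snuck in the garden EOS".split()
probs = bigram_probs(tokens)
print(probs[("the", "moles")])  # 0.5: "the" is followed by "moles" once and "garden" once
```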
Tokens
One-hot Vector
Recall that in multinomial logistic regression, given a sequence of features, the neural network produces a probability distribution over classes. If each input is a word and each output class is a different word from our vocabulary, we would have 50,000+ output classes!
As such, we introduce the one-hot vector: represent a word as a vector with the same length as the number of words in the vocabulary. Each index i in the vector corresponds to the i-th word in the vocabulary; the entry at the word's own index is 1 and all other entries are 0.

A token is the index of a word in the vocabulary (e.g., king = 2). This way we can easily translate back and forth between words, tokens, and one-hot vectors.
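Here is a minimal sketch of that round trip in PyTorch; the five-word toy vocabulary is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

vocab = ["UNK", "SOS", "king", "regent", "EOS"]          # toy vocabulary
word_to_token = {w: i for i, w in enumerate(vocab)}

token = word_to_token["king"]                            # word -> token (index 2)
one_hot = F.one_hot(torch.tensor(token), num_classes=len(vocab)).float()  # token -> one-hot
word = vocab[one_hot.argmax().item()]                    # one-hot -> token -> word
print(token, one_hot.tolist(), word)                     # 2 [0.0, 0.0, 1.0, 0.0, 0.0] king
```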
Bi-gram Neural Language Model Architecture:


Tri-gram Neural Language Model Architecture:


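The two architectures are easiest to compare as code. Below is a hedged sketch of an n-gram neural language model in PyTorch, where context_size=1 gives the bigram version and context_size=2 the trigram version; the class name and layer sizes are assumptions, not taken from the diagrams.

```python
import torch
import torch.nn as nn

class NGramNeuralLM(nn.Module):
    """Bigram when context_size=1, trigram when context_size=2."""
    def __init__(self, vocab_size, hidden_size, context_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)               # encoder
        self.decode = nn.Linear(hidden_size * context_size, vocab_size)  # decoder

    def forward(self, context_tokens):
        # context_tokens: (batch, context_size) integer token IDs
        hidden = self.embed(context_tokens).flatten(start_dim=1)  # concatenate context embeddings
        return torch.log_softmax(self.decode(hidden), dim=-1)     # log-probs over the next word

bigram_lm = NGramNeuralLM(vocab_size=50_000, hidden_size=256, context_size=1)
trigram_lm = NGramNeuralLM(vocab_size=50_000, hidden_size=256, context_size=2)
log_probs = trigram_lm(torch.tensor([[2, 3]]))  # P(next word | two context tokens)
```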
Sub-Word Tokens
Sub-word tokenization breaks complex words into smaller, recurring units like roots, prefixes, and suffixes (e.g., "tokenize" becomes "token" + "ize"). This approach enables a neural network to manage complex words, while also maintaining a more manageable vocabulary size. By tokenizing sub-words, a vocabulary of about 50,000 tokens can cover most of the English language. This method also helps avoid the need for out-of-vocabulary (UNK) tokens, allowing the model to handle rare words by constructing them from common sub-word components.
Characters as Tokens
Another approach to tokenization is using characters instead of words or sub-words. In this case, the vocabulary could consist of individual letters, numbers, and punctuation marks. This reduces the size of the vocabulary but requires the model to learn to "spell out" words one character at a time. This method can capture detailed information about language structure but introduces added complexity in sequence processing and learning.
Encoders and Decoders
Definitions
Encoder: Send input through a smaller-than-necessary layer to force the neural network to find a small set of parameters that produce intermediate activations approximating the output. A good set of parameters will:
- Represent the input with a minimal amount of corruption.
- Be able to be "uncompressed" to produce a prediction over the full range of outputs.
Decoder: A set of parameters that recovers information to produce the output.

Examples
Consider the following example that uses the encoder and decoder to implement an identity function. The word "King" produces a hidden state activation similar to "regent".

Now consider a bigram model: given a word, we produce the next word's activation as the output.

Now consider a trigram model: given two words, we produce the next word's activation as the output. The hidden state can be reinterpreted as summarizing the history (a context vector). Note that we now have different architectures depending on whether we use bigrams or trigrams. Also, we cannot remember any context outside of these n-grams.

Utilities in PyTorch and TensorFlow
torch.nn.Linear(vocab_size, hidden_size)
- Takes a one-hot vector (float) of length vocab_size.
- Maps it to a vector of length hidden_size.
torch.nn.Embedding(vocab_size, hidden_size)
- Takes a token ID (integer) as input.
- Converts it to a one-hot vector of length vocab_size.
- Maps it to a vector of length hidden_size.
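A small sketch of the point above: a bias-free nn.Linear applied to a one-hot vector and an nn.Embedding lookup on the token ID produce the same hidden vector when their weights are tied; the sizes here are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden_size = 10, 4
linear = nn.Linear(vocab_size, hidden_size, bias=False)
embedding = nn.Embedding(vocab_size, hidden_size)
# Tie the weights so both layers represent the same lookup table.
embedding.weight.data = linear.weight.data.T.clone()

token = torch.tensor([2])
one_hot = F.one_hot(token, num_classes=vocab_size).float()
print(torch.allclose(linear(one_hot), embedding(token)))  # True
```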
Recurrent Neural Networks (RNN)
Hidden State
Consider treating a sequence of text as two-word time-slices. For example, the sentence "The deep blue sea" can be treated as "The deep", "deep blue", "blue sea". This way, the text is processed by what is basically a recurrent bigram neural network. However, the context of previous text beyond the bigram is still missing. We can handle this by feeding the hidden state from the previous time-slice as an input alongside the current input word.

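A hedged sketch of that recurrence using torch.nn.RNNCell: the hidden state from the previous time step is fed back in alongside the embedding of the current word. Sizes and token IDs are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 256
embed = nn.Embedding(vocab_size, hidden_size)
rnn_cell = nn.RNNCell(hidden_size, hidden_size)   # combines the current word with the previous hidden state
decode = nn.Linear(hidden_size, vocab_size)       # hidden state -> distribution over the next word

tokens = torch.tensor([2, 17, 9, 4])              # "The deep blue sea" as token IDs (illustrative)
hidden = torch.zeros(1, hidden_size)              # empty hidden state to start
for token in tokens:
    hidden = rnn_cell(embed(token).unsqueeze(0), hidden)
    next_word_logits = decode(hidden)             # prediction for the next word at each step
```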
Training
To train an RNN, we feed an empty hidden state along with the first word and keep generating, over and over again. Use cross-entropy loss as the loss function and backpropagate to adjust the layers. To generate text, note that the top layer outputs log probabilities; the next word is the argmax of the distribution.
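A minimal training-loop sketch along those lines; the optimizer choice, sizes, and the toy sequence are assumptions. Note that nn.CrossEntropyLoss takes raw logits and applies the log-softmax internally.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 256
embed = nn.Embedding(vocab_size, hidden_size)
rnn_cell = nn.RNNCell(hidden_size, hidden_size)
decode = nn.Linear(hidden_size, vocab_size)

loss_fn = nn.CrossEntropyLoss()               # applies log-softmax internally
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(rnn_cell.parameters()) + list(decode.parameters())
)

sequence = torch.tensor([1, 5, 17, 9, 4, 2])  # SOS ... EOS as token IDs (illustrative)
hidden = torch.zeros(1, hidden_size)          # empty hidden state to start
loss = 0.0
for t in range(len(sequence) - 1):
    hidden = rnn_cell(embed(sequence[t]).unsqueeze(0), hidden)
    logits = decode(hidden)
    loss = loss + loss_fn(logits, sequence[t + 1].unsqueeze(0))  # target: the next token

optimizer.zero_grad()
loss.backward()
optimizer.step()
```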
Generative Sampling
Sometimes the argmax would cause the same sequence of text to always be generated, i.e., the model gets stuck in a local maximum. Instead, we use multinomial sampling, which adds variety to the generated text:

If the distribution does not show a nice peak like above, we need to use temperature to redistribute the probabilities:

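A sketch of multinomial sampling with a temperature applied to the logits before the softmax; the toy logits and the temperature value are illustrative.

```python
import torch

def sample_next_token(logits, temperature=1.0):
    """Sample the next token instead of taking the argmax.

    temperature < 1 sharpens the distribution (closer to argmax);
    temperature > 1 flattens it (more variety).
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])   # toy scores over a 4-word vocabulary
print(sample_next_token(logits, temperature=0.7))
```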
Long Short-Term Memory (LSTM)
Note that the hidden state is limited in size; as we generate deeper into the text, the hidden state will start to lose memory of earlier text. Thus, it would be great if the neural network could learn:
- what is worth forgetting
- what parts of the new word input are worth saving
We introduce a memory cell that replaces part of the RNN encoder, like this:

Here is a visual of the memory cell:

Mathematically, the cell memory update can be expressed as:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

- Forget: $f_t \odot c_{t-1}$ applies the forget gate to the previous cell state ($c_{t-1}$).
- Input: $i_t \odot \tilde{c}_t$ applies the input gate to the new context information.

The hidden state update can be expressed as:

$$h_t = o_t \odot \tanh(c_t)$$

where $f_t$, $i_t$, and $o_t$ are the learned forget, input, and output gates.
To summarize:
- Long Short-Term Memory cells pass along extra cell-state information that doesn't try to encode the text history itself, but rather information about how to process that history (meta-knowledge about the text history).
- Long Short-Term Memory networks were state-of-the-art for a long time and are still useful when handling recurrent data of indeterminate length.
- Holds “short-term” memory longer.
- But LSTM can still get overwhelmed trying to decide what to encode into the context vector as the history gets longer and longer.
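A hedged sketch of the same recurrence with torch.nn.LSTMCell, which carries the extra cell state described above alongside the hidden state; sizes and token IDs are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 256
embed = nn.Embedding(vocab_size, hidden_size)
lstm_cell = nn.LSTMCell(hidden_size, hidden_size)
decode = nn.Linear(hidden_size, vocab_size)

tokens = torch.tensor([2, 17, 9, 4])
hidden = torch.zeros(1, hidden_size)              # h_t: the hidden state
cell = torch.zeros(1, hidden_size)                # c_t: what to remember / forget
for token in tokens:
    hidden, cell = lstm_cell(embed(token).unsqueeze(0), (hidden, cell))
    next_word_logits = decode(hidden)
```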
Sequence-to-Sequence Models
Sometimes data is of the form of an input sequence $x_1, \ldots, x_n$ paired with an output sequence $y_1, \ldots, y_m$, where the two sequences may have different lengths.
For example, if we want to translate English into French, notice that the word order may differ, and French introduces grammatical gender. Using an LSTM alone, we would not be able to take the entire sequence of words into account. Instead, sequence-to-sequence models can:
- Sweep up an arbitrary input sequence and encode it into a hidden state context vector.
- Good for picking up on negations, adjective order, etc., because the whole input sequence is in the context vector.
- Then decode it into an arbitrary-length sequence.
- Stop decoding when EOS (End of Sequence) is generated.
A diagram of such a model is shown below, drawn so that the encoder and decoder are "side by side".

To improve the performance of sequence-to-sequence models, it's important to separate the encoder and decoder components. The decoder is then modified to take in both the hidden state and the previously generated word as inputs.
This architecture enables the use of attention mechanisms, which have become a crucial element in modern neural language models. Attention allows the model to focus on specific parts of the input sequence, improving translation and other tasks. Next steps involve training sequence-to-sequence networks that incorporate these modifications.
Training
Step 1: Initialization
Start with the input sequence $x_1, \ldots, x_n$ and the target sequence $y_1, \ldots, y_m$.
- Initialize the encoder hidden state to a zero vector: $h_0 = \vec{0}$
Step 2: Encoding the Input Sequence
- For each time step $t$, encode the input word $x_t$ and the current hidden state $h_{t-1}$: $h_t = \text{encode}(x_t, h_{t-1})$
- Continue encoding until $t = n$
Step 3: Decoding with Initial Conditions
- Set the initial decoder state: $h'_0 = h_n$, with the SOS token as the first decoder input
Step 4: Decoding the Output Sequence
- While $\hat{y}_t \neq \text{EOS}$ (or until reaching max length):
- Decode using the previous predicted word and hidden state to predict the next word: $(\hat{y}_t, h'_t) = \text{decode}(\hat{y}_{t-1}, h'_{t-1})$
- Set: $t \leftarrow t + 1$
Step 5: Calculate Loss
- For each time step, update the loss: $\mathcal{L} \leftarrow \mathcal{L} + \text{CrossEntropyLoss}(\hat{y}_t, y_t)$
Step 6: Backpropagation
After reaching EOS or max length, backpropagate the total loss to update the model weights.

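A hedged sketch of the training procedure above; GRU cells stand in for the encoder and decoder here, and all names, sizes, and special-token IDs are assumptions rather than the original architecture.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 256
SOS, EOS = 1, 2                                    # assumed special-token IDs

embed = nn.Embedding(vocab_size, hidden_size)
encoder = nn.GRUCell(hidden_size, hidden_size)     # GRU cells stand in for encoder/decoder
decoder = nn.GRUCell(hidden_size, hidden_size)
decode_out = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()
params = (list(embed.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()) + list(decode_out.parameters()))
optimizer = torch.optim.Adam(params)

def train_step(x_tokens, y_tokens, max_len=50):
    # Steps 1-2: encode the input sequence starting from a zero hidden state.
    hidden = torch.zeros(1, hidden_size)
    for x in x_tokens:
        hidden = encoder(embed(x).unsqueeze(0), hidden)

    # Steps 3-5: decode, starting from SOS and the final encoder hidden state.
    prev = torch.tensor(SOS)
    loss = torch.tensor(0.0)
    for t in range(min(len(y_tokens), max_len)):
        hidden = decoder(embed(prev).unsqueeze(0), hidden)
        logits = decode_out(hidden)
        loss = loss + loss_fn(logits, y_tokens[t].unsqueeze(0))
        prev = logits.argmax(dim=-1).squeeze(0)    # feed the model's own prediction back in
        if prev.item() == EOS:
            break

    # Step 6: backpropagate the accumulated loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with toy token sequences:
train_step(torch.tensor([5, 17, 9, EOS]), torch.tensor([12, 7, 3, EOS]))
```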
Teacher Forcing
Teacher forcing is a technique used during the training of sequence-to-sequence models, where instead of using the model's own previous predictions as input for the next time step, the actual target output (the correct word from the training data) is fed into the decoder at each step.
- Speeds up training: By providing the correct target as input during training, the model is more likely to converge faster because it reduces the chance of compounding errors across time steps.
- Stabilizes training: Using the true outputs instead of predictions prevents the model from becoming biased towards its own mistakes, which is important when the model is not yet well-trained.
In the sequence-to-sequence training context, teacher forcing ensures that each prediction during training is grounded in the true data, improving the model's ability to learn the underlying patterns in the sequences.
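Relative to the decoding loop in the previous sketch, teacher forcing changes a single line: the ground-truth token, not the model's prediction, becomes the next decoder input. A function-level sketch (the module arguments are the hypothetical ones from the snippet above):

```python
import torch

def decode_with_teacher_forcing(y_tokens, hidden, embed, decoder, decode_out, loss_fn, sos_id=1):
    """Decoder loop where the true previous token, not the model's prediction, is fed in."""
    prev = torch.tensor(sos_id)
    loss = torch.tensor(0.0)
    for t in range(len(y_tokens)):
        hidden = decoder(embed(prev).unsqueeze(0), hidden)
        logits = decode_out(hidden)
        loss = loss + loss_fn(logits, y_tokens[t].unsqueeze(0))
        prev = y_tokens[t]                         # teacher forcing: ground truth, not argmax(logits)
    return loss
```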
Attention
In sequence-to-sequence models like those used for translation or summarization, the encoder compresses the entire input sequence into a single final hidden state. This final hidden state is then passed to the decoder to generate the output. However, this creates a problem: for long sequences, the hidden state may overwrite or forget important details from earlier in the sequence. This limits the model's ability to remember key parts of the input, especially for long sentences or complex contexts.
Attention solves this issue by allowing the model to access all intermediate hidden states from the encoder, not just the final one. This means the decoder doesn't have to rely on a single compressed hidden state but can selectively attend to different parts of the input sequence at each decoding step.
- Instead of forcing the decoder to rely on one compressed vector, attention provides a way for the decoder to look back at specific parts of the input sequence that are more relevant at a given time.
- For example, when translating a long sentence, the decoder can focus on the relevant words or phrases in the source sentence, making it easier to translate the next word in the target sentence.
How Attention Works:
- Score Calculation: For each decoder step, a score is computed between the current decoder hidden state and all encoder hidden states. This score represents how relevant each encoder hidden state is for the current decoding step.
- Softmax for Attention Weights: A softmax function is applied to these scores to transform them into attention weights—probabilities that sum to 1. These weights determine how much attention should be given to each encoder hidden state.
- Context Vector: The attention weights are used to create a weighted sum of all encoder hidden states, resulting in a context vector. This vector represents the important information from the encoder, specifically tailored for the current decoding step.
- Final Decoder Input: The decoder then uses this context vector, along with its own hidden state, to generate the next word in the sequence.

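A hedged sketch of those four steps using simple dot-product scores; the tensor shapes, names, and the scoring function are assumptions.

```python
import torch

def attention(decoder_hidden, encoder_states):
    """decoder_hidden: (1, hidden); encoder_states: (src_len, hidden)."""
    # 1. Score: how relevant is each encoder hidden state to the current decoder state?
    scores = encoder_states @ decoder_hidden.squeeze(0)     # (src_len,)
    # 2. Softmax: turn the scores into attention weights that sum to 1.
    weights = torch.softmax(scores, dim=0)                  # (src_len,)
    # 3. Context vector: weighted sum of all encoder hidden states.
    context = weights @ encoder_states                      # (hidden,)
    # 4. The decoder combines this context vector with its own hidden state
    #    to generate the next word.
    return context, weights

encoder_states = torch.randn(6, 256)    # one hidden state per source word
decoder_hidden = torch.randn(1, 256)
context, weights = attention(decoder_hidden, encoder_states)
```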