Let’s apply these steps to creating our word2vec skip-gram model.
Phase 1: Assemble the graph
- Define placeholders for input and output
Input is the center word and output is the target (context) word. Instead of using one-hot vectors, we input the index of those words directly. For example, if the center word is the 1001st word in the vocabulary, we input the number 1001.
Each sample input is a scalar, so the placeholder for BATCH_SIZE sample inputs will have shape [BATCH_SIZE].
Similarly, the placeholder for BATCH_SIZE sample outputs will have shape [BATCH_SIZE].
Note that the center_words and target_words being fed in hold scalar indices -- for each word, we feed in its corresponding index in our vocabulary.
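A minimal sketch of what this might look like (assuming TensorFlow 1.x graph mode; the concrete hyperparameter values below are illustrative assumptions, not prescribed by the text):

```python
import tensorflow as tf

# Assumed hyperparameter values, chosen only for illustration.
VOCAB_SIZE = 50000
BATCH_SIZE = 128
EMBED_SIZE = 128

# Each sample is just a word index, so a batch of inputs/outputs is a
# vector of BATCH_SIZE integer indices.
center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='target_words')
```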
- Define the weight (in this case, embedding matrix)
Each row corresponds to the representation vector of one word. If each word is represented with a vector of size EMBED_SIZE, then the embedding matrix will have shape [VOCAB_SIZE, EMBED_SIZE]. We initialize the embedding matrix with values drawn from a random distribution. In this case, let’s choose the uniform distribution.
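Continuing the sketch above, the embedding matrix might be defined like this:

```python
# One row per vocabulary word, initialized uniformly in [-1.0, 1.0).
embed_matrix = tf.Variable(
    tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0),
    name='embed_matrix')
```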
- Inference (compute the forward path of the graph)
Our goal is to get the vector representations of words in our dictionary. Remember that embed_matrix has dimension VOCAB_SIZE x EMBED_SIZE, with each row of the embedding matrix corresponding to the vector representation of the word at that index. So to get the representation of all the center words in the batch, we slice out the corresponding rows of the embedding matrix. TensorFlow provides a convenient method to do so, called tf.nn.embedding_lookup().
This method is really useful when it comes to matrix multiplication with one-hot vectors, because it saves us from doing a bunch of unnecessary computation that will return 0 anyway. See Chris McCormick's illustration of multiplying a one-hot vector with a matrix.
So, to get the embedding (or vector representation) of the input center words, we use this:
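In sketch form, continuing the snippet above (with embed_matrix and center_words as defined earlier):

```python
# Look up the rows of embed_matrix for the center words in the batch.
# The result has shape [BATCH_SIZE, EMBED_SIZE].
embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')
```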
- Define the loss function
While NCE is cumbersome to implement in pure Python, TensorFlow has already implemented it for us.
Note that the order of the inputs and labels arguments is easy to get wrong (and has differed across TensorFlow versions), so it is safest to pass them as keyword arguments. This ambiguity can be quite troubling sometimes, but keep in mind that TensorFlow is still new and growing and therefore might not be perfect. The nce_loss source code can be found here.
For nce_loss, we also need the weights and biases of the output layer (the layer that maps the hidden representation to scores over the vocabulary) to calculate the NCE loss.
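A possible sketch (the truncated-normal standard deviation below is a common initialization choice, not something prescribed by the text):

```python
# Output-layer parameters used by NCE: one weight row and one bias per
# vocabulary word.
nce_weight = tf.Variable(
    tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                        stddev=1.0 / (EMBED_SIZE ** 0.5)),
    name='nce_weight')
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')
```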
Then we define loss:
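One way this might look, continuing the sketch above (NUM_SAMPLED, the number of negative samples, is an assumed hyperparameter):

```python
NUM_SAMPLED = 64  # assumed number of negative samples

# tf.nn.nce_loss expects labels of shape [batch_size, num_true], so the
# target_words vector is reshaped to a column. Keyword arguments avoid
# any confusion about the labels/inputs ordering.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weight,
                   biases=nce_bias,
                   labels=tf.reshape(target_words, [BATCH_SIZE, 1]),
                   inputs=embed,
                   num_sampled=NUM_SAMPLED,
                   num_classes=VOCAB_SIZE),
    name='loss')
```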
- Define optimizer
We will use good old gradient descent.
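For example (LEARNING_RATE is an assumed value):

```python
LEARNING_RATE = 1.0  # assumed learning rate

optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
```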
Phase 2: Execute the computation
We will create a session; then, within the session, we use the good old feed_dict to feed inputs and outputs into the placeholders, run the optimizer to minimize the loss, and fetch the loss value to report back to us.
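A rough sketch of the training loop, continuing the snippets above (NUM_TRAIN_STEPS and SKIP_STEP are assumed values, and batch_gen is an assumed generator that yields batches of center and target word indices):

```python
NUM_TRAIN_STEPS = 10000   # assumed number of training steps
SKIP_STEP = 2000          # assumed: how often to report the average loss

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    total_loss = 0.0
    for index in range(NUM_TRAIN_STEPS):
        # batch_gen is assumed to yield (centers, targets) index arrays,
        # each of length BATCH_SIZE.
        centers, targets = next(batch_gen)
        loss_batch, _ = sess.run([loss, optimizer],
                                 feed_dict={center_words: centers,
                                            target_words: targets})
        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(
                index + 1, total_loss / SKIP_STEP))
            total_loss = 0.0
```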