How about adding a mechanism for storing raw, embedding-dimensional vectors as part of the sequence instead of introducing a set of additional discrete "invisible" tokens? So basically something like checking the final element of each vector in the sequence just before the final linear layer, and if that element is larger than, say, 0, emitting the vector as-is instead of passing it through the de-embedding step. Then, when generating the next token, one could just interleave these thought vectors between the embedded "real" tokens after the embedding. This would allow the "thoughts" of the LLM to be continuous and thus more nuanced - a transformer doesn't need the sequence to be discrete, that's something imposed on LLMs by the nature of natural language. Could be an advantage over traditional CoT!
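Roughly what I mean, as a very rough PyTorch sketch of the generation loop (assuming a decoder-only model that exposes `embed`, `backbone` and `lm_head` - those names are just placeholders, not any real API):

```python
import torch

# Minimal sketch, not a real implementation: `model` is assumed to be a
# decoder-only transformer with `embed` (token ids -> embeddings),
# `backbone` (embeddings -> hidden states) and `lm_head` (hidden -> logits).
@torch.no_grad()
def generate(model, prompt_ids, max_steps=64, max_thoughts_in_a_row=8):
    seq = model.embed(prompt_ids)               # (1, T, d_model), embedded "real" tokens
    out_tokens, thoughts_in_a_row = [], 0
    for _ in range(max_steps):
        h = model.backbone(seq)[:, -1]          # (1, d_model), last hidden state
        if h[0, -1] > 0 and thoughts_in_a_row < max_thoughts_in_a_row:
            # Continuous "thought": skip the de-embedding (lm_head) entirely and
            # feed the raw hidden vector back in as the next sequence element.
            seq = torch.cat([seq, h[:, None, :]], dim=1)
            thoughts_in_a_row += 1
            continue
        thoughts_in_a_row = 0
        logits = model.lm_head(h)               # (1, vocab_size)
        tok = logits.argmax(dim=-1)             # greedy decoding, for simplicity
        out_tokens.append(tok.item())
        seq = torch.cat([seq, model.embed(tok[:, None])], dim=1)
    return out_tokens
```

The cap on consecutive thoughts is the hard constraint mentioned below; everything else is just the gate-on-the-last-element idea written out.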
One other reason why something like this might beat o1's thought document (at least for some tasks) is how the attention mechanism works: it's much more natural to attend to nearby tokens than to far-away ones.
Training thought tokens like this is pretty simple in principle: one could construct a loss for them based on whether they increase the odds of producing the correct next token. You'd probably want to pair that with some minimum-increase threshold (below which thought-token generation is actually penalized) and an increasing penalty for outputting multiple thought tokens in a row (on top of the hard constraint suggested in the OP) - roughly like the sketch below. The training does pose one major challenge, though: it would need to be done autoregressively instead of pushing the whole sequence through at once, since we have no ground truth for these thought tokens. So this would slow things down quite a bit!
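Something like this, very loosely (all names are placeholders; the with/without probabilities and the run-length bookkeeping would have to come out of the autoregressive rollout described above):

```python
import torch

def thought_loss(p_with, p_without, run_lengths, min_gain=0.02, run_penalty=0.1):
    # Sketch only. p_with / p_without: model probability of the correct next real
    # token with vs. without the emitted thought vector(s), shape (batch,).
    # run_lengths: how many thought vectors were emitted in a row, shape (batch,).
    gain = p_with - p_without
    # Below the threshold the term goes positive, i.e. the thought is penalized;
    # above it the term is negative, i.e. the thought is rewarded.
    usefulness = -(gain - min_gain)
    # Linearly growing cost for long runs of consecutive thought vectors.
    run_cost = run_penalty * run_lengths.float()
    return (usefulness + run_cost).mean()
```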