This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/bnqj on 2024-03-29 13:04:29.


Yes, with VAPE - Vector Addition Positional Encoding.

I’ve been exploring a new approach to positional encoding that I’m calling VAPE.

The Method (a minimal sketch follows the list):

  • borrow some number of channels from the queries and keys,
  • run a cumulative (prefix) sum across the sequence length on these borrowed channels (i.e., add the vectors together),
  • normalize: divide each summed vector by the square root of its magnitude,
  • the borrowed channels are now position-aware,
  • so concatenate them back onto the queries and keys.
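
Here's a minimal sketch of these steps in PyTorch. The tensor layout (batch, heads, seq_len, head_dim), the n_pos argument for the number of borrowed channels, and the epsilon clamp are my own illustration choices, not details from the post:

    import torch

    def vape(q, k, n_pos=1):
        # q, k: (batch, heads, seq_len, head_dim)
        # n_pos: channels borrowed per head for positional encoding
        q_main, q_pos = q[..., :-n_pos], q[..., -n_pos:]  # borrow channels
        k_main, k_pos = k[..., :-n_pos], k[..., -n_pos:]

        # cumulative (prefix) sum across the sequence dimension
        q_pos = q_pos.cumsum(dim=-2)
        k_pos = k_pos.cumsum(dim=-2)

        # normalize: divide by the square root of the vector's magnitude
        # (the clamp is a hypothetical guard against division by zero)
        q_pos = q_pos / q_pos.norm(dim=-1, keepdim=True).sqrt().clamp_min(1e-6)
        k_pos = k_pos / k_pos.norm(dim=-1, keepdim=True).sqrt().clamp_min(1e-6)

        # concatenate the now position-aware channels back
        q = torch.cat([q_main, q_pos], dim=-1)
        k = torch.cat([k_main, k_pos], dim=-1)
        return q, k

The returned q and k then pass through attention unchanged, which is also why the Flash Attention point below holds.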

What’s intriguing is that this method can work effectively with just a single channel per head (n_pos=1 in the sketch above): the prefix sum then runs over scalars rather than vectors, and the method still works.

VAPE features:

  • No Extra Parameters: VAPE introduces positional information without adding any new parameters to the model, preserving its simplicity and efficiency.
  • Performance: Early tests indicate that VAPE reaches lower final perplexity than methods like RoPE.
  • Extrapolation: Early tests suggest that VAPE extrapolates beyond the training context length quite well, since no explicit positional signal is injected the way it is in RoPE.
  • Compatibility with Flash Attention: The positional information lives entirely inside the query and key channels, so VAPE is fully compatible with Flash Attention.
  • Efficiency: By using only a small number of channels for positional encoding, VAPE keeps the model efficient.
  • Inference Speed: VAPE caches the last positional states for queries and keys - a bit like an SSM/RNN, you only need the last state to compute the next one (see the sketch below).
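
As a sketch of that caching idea (a hypothetical helper, reusing the same normalization as above, and assuming the cached state is the running unnormalized prefix sum): at inference you keep only that state for the borrowed channels and extend it one token at a time:

    def vape_step(pos_new, state):
        # pos_new: borrowed channels of the newest token, (batch, heads, 1, n_pos)
        # state:   running unnormalized prefix sum over all previous tokens
        state = state + pos_new  # extend the prefix sum by one token
        out = state / state.norm(dim=-1, keepdim=True).sqrt().clamp_min(1e-6)
        return out, state  # out joins q/k for attention; state is cached

This makes the positional update constant work per token, independent of sequence length.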

Seeking Your Insight:

  • What benchmarks or specific comparisons would best demonstrate VAPE's value to you?
  • Do you know of any methods similar to VAPE?

Benchmarks:

I've run some early tests on causal language modeling tasks that look very promising, but I have quite limited resources for benchmarking, so before putting effort into larger runs I think it's better to ask the community how to go about it.
