@oblomov @mathematics @[email protected] @[email protected] No solution, but the problem is related to de Bruijn sequences (https://en.wikipedia.org/wiki/De_Bruijn_sequence), for which there exists a lot of literature.
Mathematics
A community for discussing mathematics and developments in mathematics.
@mrdk @mathematics @[email protected] @[email protected]
oh, interesting. It's definitely related, although we allow different substrings to start at the same place, and this has a huge impact on the lengths (also it's not cyclic in our case, but that probably makes things worse).
@mrdk @mathematics @[email protected] @[email protected] also this might explain why @mau saw some relation to Gray codes in the binary case.
@oblomov well, a Gray code codes all n-bit sequences from 000...0 to 111...1. It's a bit overkill (we don't need the sequence with all 0) but probably the overhead is just 1.
Cc: @mrdk @mathematics @[email protected] @[email protected]
I've mulled a similar problem.
Inspired by the following Fox Trot comic:
Some irrational numbers have the property that they contain every finite substring. (For instance, the full retail version of Windows 98 exists somewhere in the binary expansion of pi.) And they can't really outlaw the number pi. (Or can you?)
So, theoretically if you could find the offset in the binary expansion of pi where the digits happen to be identical to the entirety of some particular copyrighted content, you could share that offset and the length of the original copyrighted work file and a recipient could reproduce the copyrighted content losslessly based just on that offset.
...theoretically.
One of the benefits of this approach is that it'd probably stump the courts at least for a time. If you got a copy of the movie Oppenheimer this way, could you be said to have "copied" it in a way that runs afoul of copyright law in whatever jurisdiction you happen to be in?
It is possible to generate "digits" of pi given an offset without claculating all the digits before, so there's no real problem there. Some of the big issues with this approach would be:
- The offset itself would be a huge number. It would in the vast majority of cases require a file much larger than the copyrighted work just to hold that offset number. So it would be very space inefficient.
- If you wanted to share a file this way, I doubt there's any way to search through pi efficiently for where that particular sequence of bits happens to start. So even if the size of the offset wasn't an issue, it would probably be infeasible to find the offset.
So, maybe pi isn't the best number. What if we could come up with a better number?
At this point, it's probably useful to stop talking about binary expansions of irrational numbers and just talk about algorithms that generate infinite strings of bits with particular properties.
So, is it possible to create any algorithm that generates an infinite string of binary digits with the following properties?:
- Its infinite expansion contains every finite substring.
- Later bits in the output can be calculated without calculating all previous bits.
- It's efficiently packed. That is, we want to ensure that all strings of bits we might want to find in the output have as small/low offsets as possible. Preferably offsets that can be represented using a number of bits comparable to the length of the string of digits. This is the property where it becomes related to OP's post.
- Where the string of bits of the copyrighted work (as well as significantly-sized substrings of the copyrighted work) isn't inherently in the representation of the offset.
You could even tweak some things. You could maybe tune it for strings between 1MB and 1TB in size, for instance. Maybe rather than dealing in bits, you could deal in blocks of 1MB. (Basically, rather than working in base 2, work in base 2^(2^20). Maybe you could get some efficiency gains that way.) Maybe find ways to prioritize bit strings that are more like the kind of data people would want to share this way. (Maybe make one version of the algorithm specifically for ascii text. Others for text in various different languages. Several for video data. Others for video games. Etc. And these would be optimized such that they gave you smaller offsets for the particular kind of data you might want to share via them.)
So, is this possible? I don't know for sure. If I had to wager a guess, I'd bet not. But it's interesting to think about.
@TootSweet this reminds me of https://github.com/philipl/pifs, the filesystem based on the normality of π