A small amount of data that does not belong to any file
What if I told you everything you knew was a lie. You can store any binary as an index of its anagram and eliminate all redundancy.
This algorithm is like the little sister of Arithmetic coding, but with a bit more class and sophistication. No more feeling like a cheap hack, this method is sleek and streamlined. Say goodbye to those annoying end-of-file characters and goodbye to dynamic character frequency space adjustments. Best of all, it's so easy to understand that even my mum could explain it to her friends at the beauty salon.
A data compression scheme based on the index of permutations of a string is a novel approach to reducing the amount of storage space required for large amounts of data. The basic idea behind this scheme is that by storing a binary as an index of its anagram, all redundancy is eliminated.
To illustrate this concept, let's take the word "ABRACADABRA" as an example. If we were to write down all its permutations, this word would have an index of 21519 out of a total of 83160 possible combinations. This means that the word "ABRACADABRA" can be represented by a single number, 21519, instead of the 11 characters that make up the original word. This greatly reduces the amount of storage space needed to store this word, and the same principle applies to larger amounts of data.
0: AAAAABBCDRR
1: AAAAABBCRDR
2: AAAAABBCRRD
3: AAAAABBDCRR
...
21518: ABRACADABAR
21519: ABRACADABRA <-- index 21519
21520: ABRACADARAB
...
83156: RRDCBAAABAA
83157: RRDCBAABAAA
83158: RRDCBABAAAA
83159: RRDCBBAAAAA
One of the main advantages of this compression scheme is that it is relatively easy to implement. The process of finding the permutation index for a given string can be done quickly and efficiently using mathematical algorithms. So all you need to store is a count of all the characters and the index.
A: 5, B: 2, C: 1, D: 1, R: 2
Index: 21519
Tada!
My nephew's favourite word and a fair question. The reasoning goes something like this. The best compression is Run Length Encoding, but only if your data is nicely sorted alphabetically. "ABRACADABRA" doesn't compress very well. "AAAAABBCDRR" does.
So you'll lose the crucial information about the exact order of each character in your file. Let's have a look at the list of possible combinations for the word "WIKI", in alphabetical order:
0: IIKW
1: IIWK
2: IKIW
3: IKWI
4: IWIK
5: IWKI
6: KIIW
7: KIWI
8: KWII
9: WIIK
10: WIKI <-- index 10
11: WKII
11: WKII
See, not so many. So the logical thing to do is to count the position in this list. Together with the number of characters (frequencies), you can create a very simple data compression algorithm.
Now let's raise the level a bit...
Frequencies are a count of all the characters in a string.
To calculate permutations of multi-sets, you can use the following expression:
Where is the frequency of every character. More info on Wikipedia.
If you have trouble seeing the equations, please upgrade your browser.
Word | Permutations |
---|---|
ABRACADABRA | 83160 |
ANAGRAM | 840 |
ARTIC | 360 |
MISSISSIPPI | 34650 |
WIKI | 12 |
Nice. Now that you understand how to determine the number of anagrams that can be made with a given word, the next step is to determine the index of that word within the set of all possible anagrams. This is called the permutation index. Do you see the logic of where this is taking us?
You want to know how many permutations there are starting with the letter A. Let's say is the frequency of the character A. The only thing you need to do is to reduce the frequency of by 1 because you just "used" the character. The equation will be:
So if you want to know the permutation of the 3rd, just add them up. It gets a bit longer:
Luckily Wolfram Alpha is there to save the day and simplify this mofo:
That first multiplier are all elements in the frequency list to the character. So for "ABRACADABRA", if you want to know the sub-index for the 3rd letter "R", your multiplier becomes: because it's the 5th letter in the frequency list.
If you go through the word, calculate all the sub-indices, add them all up, you'll end up with the index.
Sub-permutation | Frequencies | Sub-index |
---|---|---|
A BRACADABRA | A: 5, B: 2, C: 1, D: 1, R: 2 | 0 |
A B RACADABRA | A: 4, B: 2, C: 1, D: 1, R: 2 | 15120 |
AB R ACADABRA | A: 4, B: 1, C: 1, D: 1, R: 2 | 5880 |
ABR A CADABRA | A: 4, B: 1, C: 1, D: 1, R: 1 | 0 |
ABRA C ADABRA | A: 3, B: 1, C: 1, D: 1, R: 1 | 480 |
ABRAC A DABRA | A: 3, B: 1, C: 0, D: 1, R: 1 | 0 |
ABRACA D ABRA | A: 2, B: 1, C: 0, D: 1, R: 1 | 36 |
ABRACAD A BRA | A: 2, B: 1, C: 0, D: 0, R: 1 | 0 |
ABRACADA B RA | A: 1, B: 1, C: 0, D: 0, R: 1 | 2 |
ABRACADAB R A | A: 1, B: 0, C: 0, D: 0, R: 1 | 1 |
ABRACADABR A | A: 1, B: 0, C: 0, D: 0, R: 0 | 0 |
Sum | 21519 |
Word | Index | Permutations |
---|---|---|
ABRACADABRA | 21519 | 83160 |
ANAGRAM | 250 | 840 |
ARCTIC | 41 | 360 |
MISSISSIPPI | 13736 | 34650 |
WIKI | 10 | 12 |
This sentence has a length of 43 characters:
The quick brown fox jumps over the lazy dog
And has an index value of:
(dec) 254668872108764518031939727748830382196379337
(hex) 0b6b748d1dc2610369faa7eaf6160ec0b1cec9
The binary would be 19 bytes.
As I said the binary has a length of 19 bytes. It's below the theoretical limit of Shannon entropy. But then again, most modern compression algorithms are.
This function is an implementation of Shannon entropy, but returns the entropy in total number of bytes instead of bits per symbol:
function entropy($string) {
$length = strlen($string);
$freq = count_chars($string, 1);
$freq_length = array_map(fn($f) => $f / $length, $freq);
$entropy = array_reduce($freq_length, fn($r, $f) => $r - $f * log($f, 256)) * $length;
return($entropy);
}
import math
def entropy(string):
length = len(string)
freq = {}
for char in string:
if char not in freq:
freq[char] = 0
freq[char] += 1
freq_length = {char: count / length for char, count in freq.items()}
entropy = sum([-count * math.log(count, 256) for count in freq_length.values()]) * length
return entropy
"The quick brown fox jumps over the lazy dog" has an entropy of 24 bytes.
As I said before is the number of unique characters and is how many times the character is used. is the maximum possible permutations.
So I'll show you how we walk from Shannon entropy to Anagram coding:
Shannon entropy:
For the entire string, the total entropy is , representing an absolute lower bound.
The number of distinct permutations is governed by the multinomial coefficient:
Using:
we approximate:
Converting to base-2 logs:
Please note:
Substituting back, we find:
Thus, encoding the permutation index requires bits, matching Shannon’s limit.
For large , the symbol frequencies match their probabilities. This aligns with Shannon’s typical set—the set of sequences dominating the probability mass.
Overheads (e.g., transmitting counts ) become insignificant as . The term dominates.
Optimal encoding uses bits, converging to .
Permutation indexing achieves Shannon’s entropy limit because asymptotically equals . This mirrors Shannon’s source coding theorem, showing the interplay between combinatorics and information theory.
Ok, that was heavy. Time for some implementation. Now this is the naive algorithm, based on the simple math. The complexity is O(n2). Where n is the length of the string. So this is for educational purposes. At the end you can find a optimised version.
<?php
function anagram_index($string) {
$index = 0;
$frequencies = count_chars($string, 1);
$string_length = strlen($string) - 1;
for($i = 0; $i < $string_length; $i++) {
$char = ord($string[$i]);
$freq_char_index = array_search($char, array_keys($frequencies));
$multiplier = array_slice($frequencies, 0, $freq_char_index);
$multiplier = array_sum($multiplier);
$numerator = gmp_fact(array_sum($frequencies) - 1);
$denominator = array_map("gmp_fact", $frequencies);
$denominator = array_reduce($denominator, fn($carry, $den) => gmp_mul($carry, $den), "1");
$index += gmp_div_q(gmp_mul($multiplier, $numerator), $denominator);
$frequencies[$char]--;
}
return($index);
}
$string = "The quick brown fox jumps over the lazy dog";
echo anagram_index($string);
?>
import gmpy2
def anagram_index(string: str) -> int:
index = gmpy2.mpz(0)
frequencies = dict()
for char in string:
if char not in frequencies:
frequencies[char] = 1
else:
frequencies[char] += 1
string_length = len(string) - 1
for i in range(string_length):
char = ord(string[i])
freq_char_index = list(frequencies.keys()).index(chr(char))
multiplier = sum(frequencies[key] for key in frequencies.keys() if ord(key) < char)
numerator = gmpy2.fac(sum(frequencies.values())-1)
denominator = gmpy2.fac(frequencies[chr(char)])
for key in frequencies:
if key != chr(char):
denominator *= gmpy2.fac(frequencies[key])
index += gmpy2.f_div(gmpy2.mul(multiplier, numerator), denominator)
frequencies[chr(char)] -= 1
return index
string = "The quick brown fox jumps over the lazy dog"
print(anagram_index(string))
<?php
function anagram_string($index, $frequencies) {
$string = null;
$string_length = array_sum($frequencies);
for($i = 0; $i < $string_length; $i++) {
$frequencies_sum = array_sum($frequencies);
$numerator = gmp_fact($frequencies_sum);
$denominator = array_map("gmp_fact", $frequencies);
$denominator = array_reduce($denominator, fn($carry, $den) => gmp_mul($carry, $den), "1");
$total_permutations = gmp_div_q($numerator, $denominator);
$frequencies_index = gmp_div_q(gmp_mul($index, $frequencies_sum), $total_permutations);
$cum_frequencies = 0;
foreach($frequencies as $char => $char_freq) {
$cum_frequencies += $char_freq;
if($cum_frequencies > $frequencies_index) break;
}
$freq_char_index = array_search($char, array_keys($frequencies));
$multiplier = array_slice($frequencies, 0, $freq_char_index, true);
$multiplier = array_sum($multiplier);
$numerator = gmp_div_q($numerator, $frequencies_sum);
$index -= gmp_div_q(gmp_mul($multiplier, $numerator), $denominator);
$frequencies[$char]--;
$string .= chr($char);
}
return($string);
}
$frequencies = [32=>8, 84=>1, 97=>1, 98=>1, 99=>1, 100=>1, 101=>3, 102=>1, 103=>1, 104=>2, 105=>1, 106=>1, 107=>1, 108=>1, 109=>1, 110=>1, 111=>4, 112=>1, 113=>1, 114=>2, 115=>1, 116=>1, 117=>2, 118=>1, 119=>1, 120=>1, 121=>1, 122=>1];
$index = "254668872108764518031939727748830382196379337";
echo anagram_string($index, $frequencies);
?>
import gmpy2
def anagram_string(index:str, frequencies:dict) -> str:
string = ""
string_length = sum(frequencies.values())
index = gmpy2.mpz(index)
for i in range(string_length):
frequencies_sum = sum(frequencies.values())
numerator = gmpy2.fac(frequencies_sum)
denominator = 1
for key in frequencies:
denominator *= gmpy2.fac(frequencies[key])
total_permutations = gmpy2.f_div(numerator, denominator)
frequencies_index = gmpy2.f_div(gmpy2.mul(index, frequencies_sum), total_permutations)
cum_frequencies = 0
for char in frequencies.keys():
cum_frequencies += frequencies[char]
if cum_frequencies > frequencies_index:
break
freq_char_index = list(frequencies.keys()).index(char)
multiplier = sum(frequencies[key] for key in frequencies.keys() if key < char)
numerator = gmpy2.f_div(numerator, frequencies_sum)
index -= gmpy2.f_div(gmpy2.mul(multiplier, numerator), denominator)
frequencies[char] -= 1
string += chr(char)
return string
frequencies = {32: 8, 84: 1, 97: 1, 98: 1, 99: 1, 100: 1, 101: 3, 102: 1, 103: 1, 104: 2, 105: 1, 106: 1, 107: 1, 108: 1, 109: 1, 110: 1, 111: 4, 112: 1, 113: 1, 114: 2, 115: 1, 116: 1, 117: 2, 118: 1, 119: 1, 120: 1, 121: 1, 122: 1}
index = "254668872108764518031939727748830382196379337"
print(anagram_string(index, frequencies))
So we can squeeze the algo above a lot. We can use Fenwich binary trees to store calculated factorials. Then the first thing I learned when I learning programming on the C64, is take as much as you can out of the loop. We trade memory for processing power, the speed increase is dramatic however.
<?php
class FenwickTree {
private $tree;
private $size;
public function __construct($size) {
$this->size = $size;
$this->tree = array_fill(0, $size + 1, 0);
}
public function update($index, $delta) {
$i = $index + 1; // 1-based indexing
while ($i <= $this->size) {
$this->tree[$i] += $delta;
$i += $i & -$i;
}
}
public function query($index) {
$sum = 0;
$i = $index + 1; // 1-based indexing
while ($i > 0) {
$sum += $this->tree[$i];
$i -= $i & -$i;
}
return $sum;
}
}
function lets_go_faster_anagram_index($s) {
$len = strlen($s);
if ($len == 0) return gmp_init(0);
// Precompute factorials up to the string length
$factorials = array();
$factorials[0] = gmp_init(1);
for ($i = 1; $i <= $len; $i++) $factorials[$i] = gmp_mul($factorials[$i - 1], $i);
// Compute frequency counts and sorted unique characters
$counts = array_count_values(str_split($s));
$sorted_chars = array_keys($counts);
sort($sorted_chars);
$char_to_index = array_flip($sorted_chars);
$k = count($sorted_chars);
// Initialize Fenwick Tree with character counts
$ft = new FenwickTree($k);
foreach ($sorted_chars as $idx => $c) $ft->update($idx, $counts[$c]);
// Compute initial denominator using precomputed factorials
$denominator = gmp_init(1);
foreach ($counts as $c => $cnt) $denominator = gmp_mul($denominator, $factorials[$cnt]);
// Precompute remaining_length_minus_1 divisors as GMP numbers
$remaining_divisors = array();
for ($i = 0; $i < $len; $i++) $remaining_divisors[$i] = gmp_init($len - $i - 1);
// Initialize running term T = (n-1)! / denominator
$T = gmp_div($factorials[$len - 1], $denominator);
$index = gmp_init(0);
for ($i = 0; $i < $len; $i++) {
$current_char = $s[$i];
$current_idx = $char_to_index[$current_char];
$original_count = $counts[$current_char];
// Get sum of counts of characters < current_char
$sum_less = $ft->query($current_idx - 1);
if (gmp_cmp($sum_less, 0) > 0) {
$permutations = gmp_mul($sum_less, $T);
$index = gmp_add($index, $permutations);
}
// Update Fenwick Tree and counts
$counts[$current_char]--;
$ft->update($current_idx, -1);
// Update T for next iteration: T = T * original_count / (remaining_length - 1)
if ($i < $len - 1) $T = gmp_div(gmp_mul($T, $original_count), $remaining_divisors[$i]);
}
return $index;
}
$string = "The quick brown fox jumps over the lazy dog";
echo lets_go_faster_anagram_index($string);
?>
Same strategy as before:
<?php
class FenwickTree {
private $tree;
private $size;
public function __construct($size) {
$this->size = $size;
$this->tree = array_fill(0, $size + 1, 0);
}
public function update($index, $delta) {
$i = $index + 1; // Convert to 1-based indexing
while ($i <= $this->size) {
$this->tree[$i] += $delta;
$i += $i & -$i;
}
}
public function query($index) {
$sum = 0;
$i = $index + 1; // Convert to 1-based indexing
while ($i > 0) {
$sum += $this->tree[$i];
$i -= $i & -$i;
}
return $sum;
}
public function findIndex($target) {
$low = 0;
$high = $this->size - 1;
while ($low < $high) {
$mid = (int)(($low + $high) / 2);
if ($this->query($mid) >= $target) $high = $mid; else $low = $mid + 1;
}
return $low;
}
}
function lets_go_faster_anagram_string($index, $counts) {
$sorted_chars = array_keys($counts);
sort($sorted_chars);
$k = count($sorted_chars);
$n = array_sum($counts);
if ($n == 0) return null;
$ft = new FenwickTree($k);
foreach ($sorted_chars as $i => $char) $ft->update($i, $counts[$char]);
$factorials = array();
$factorials[0] = gmp_init(1);
for ($i = 1; $i <= $n; $i++) $factorials[$i] = gmp_mul($factorials[$i - 1], $i);
$denom = gmp_init(1);
foreach ($counts as $char => $cnt) $denom = gmp_mul($denom, $factorials[$cnt]);
$T = gmp_divexact($factorials[$n - 1], $denom);
$remaining_index = gmp_init(gmp_strval($index));
$result = null;
for ($i = 0; $i < $n; $i++) {
$div = gmp_div_q($remaining_index, $T);
$candidate_threshold = gmp_add($div, gmp_init(1));
$threshold_int = gmp_intval($candidate_threshold);
$r = $ft->findIndex($threshold_int);
$chosen = $sorted_chars[$r];
$count_before = ($r > 0) ? $ft->query($r - 1) : 0;
$remaining_index = gmp_sub($remaining_index, gmp_mul($T, $count_before));
$result .= $chosen;
$old_count = $counts[$chosen];
$counts[$chosen]--;
$ft->update($r, -1);
$remaining = $n - $i - 1;
if ($remaining > 0) $T = gmp_divexact(gmp_mul($T, $old_count), $remaining);
}
return $result;
}
// Example Usage
$index = "254668872108764518031939727748830382196379337";
$frequencies = [" " => 8, "T" => 1, "a" => 1, "b" => 1, "c" => 1, "d" => 1, "e" => 3, "f" => 1, "g" => 1, "h" => 2, "i" => 1, "j" => 1, "k" => 1, "l" => 1, "m" => 1, "n" => 1, "o" => 4, "p" => 1, "q" => 1, "r" => 2, "s" => 1, "t" => 1, "u" => 2, "v" => 1, "w" => 1, "x" => 1, "y" => 1, "z" => 1];
echo lets_go_faster_anagram_string($index, $frequencies);
?>