Lost-Cluster

A small amount of data that does not belong to any file

Anagram Coding

Introduction

What if I told you everything you knew was a lie. You can store any binary as an index of its anagram and eliminate all redundancy.

Yeah, yeah, yeah, I know...

This algorithm is like the little sister of Arithmetic coding, but with a bit more class and sophistication. No more feeling like a cheap hack, this method is sleek and streamlined. Say goodbye to those annoying end-of-file characters and goodbye to dynamic character frequency space adjustments. Best of all, it's so easy to understand that even my mum could explain it to her friends at the beauty salon.

Marko Blazevic - Elevating 3x3 Rubik's Cube on Person's Palm

A data compression scheme based on the index of permutations of a string is a novel approach to reducing the amount of storage space required for large amounts of data. The basic idea behind this scheme is that by storing a binary as an index of its anagram, all redundancy is eliminated.

To illustrate this concept, let's take the word "ABRACADABRA" as an example. If we were to write down all its permutations, this word would have an index of 21519 out of a total of 83160 possible combinations. This means that the word "ABRACADABRA" can be represented by a single number, 21519, instead of the 11 characters that make up the original word. This greatly reduces the amount of storage space needed to store this word, and the same principle applies to larger amounts of data.

0: AAAAABBCDRR
1: AAAAABBCRDR
2: AAAAABBCRRD
3: AAAAABBDCRR
...
21518: ABRACADABAR
21519: ABRACADABRA <-- index 21519
21520: ABRACADARAB
...
83156: RRDCBAAABAA
83157: RRDCBAABAAA
83158: RRDCBABAAAA
83159: RRDCBBAAAAA

One of the main advantages of this compression scheme is that it is relatively easy to implement. The process of finding the permutation index for a given string can be done quickly and efficiently using mathematical algorithms. So all you need to store is a count of all the characters and the index.

A: 5, B: 2, C: 1, D: 1, R: 2
Index: 21519

Tada!

Why?

My nephew's favourite word and a fair question. The reasoning goes something like this. The best compression is Run Length Encoding, but only if your data is nicely sorted alphabetically. "ABRACADABRA" doesn't compress very well. "AAAAABBCDRR" does.

So you'll lose the crucial information about the exact order of each character in your file. Let's have a look at the list of possible combinations for the word "WIKI", in alphabetical order:

0: IIKW
1: IIWK
2: IKIW
3: IKWI
4: IWIK
5: IWKI
6: KIIW
7: KIWI
8: KWII
9: WIIK
10: WIKI <-- index 10
11: WKII
11: WKII

See, not so many. So the logical thing to do is to count the position in this list. Together with the number of characters (frequencies), you can create a very simple data compression algorithm.

Now let's raise the level a bit...

The Math

Frequencies are a count of all the characters in a string.

To calculate permutations of multi-sets, you can use the following expression:

(f1+f2+f3++fn)!f1!f2!f3!fn!=(i=1nfi)!i=1nfi!

Where f is the frequency of every character. More info on Wikipedia.

If you have trouble seeing the equations, please upgrade your browser.

Example time

Word Permutations
ABRACADABRA 83160
ANAGRAM 840
ARTIC 360
MISSISSIPPI 34650
WIKI 12

Nice. Now that you understand how to determine the number of anagrams that can be made with a given word, the next step is to determine the index of that word within the set of all possible anagrams. This is called the permutation index. Do you see the logic of where this is taking us?

You want to know how many permutations there are starting with the letter A. Let's say f1 is the frequency of the character A. The only thing you need to do is to reduce the frequency of f1 by 1 because you just "used" the character. The equation will be:

((f1-1)+f2+f3++fn)!(f1-1)!f2!f3!fn!

So if you want to know the permutation of the 3rd, just add them up. It gets a bit longer:

((f1-1)+f2+f3++fn)!(f1-1)!f2!f3!fn!+(f1+(f2-1)+f3++fn)!f1!(f2-1)!f3!fn!+(f1+f2+(f3-1)++fn)!f1!f2!(f3-1)!fn!

Luckily Wolfram Alpha is there to save the day and simplify this mofo:

(f1+f2+f3)(f1+f2+f3++fn-1)!f1!f2!f3!fn!

That first multiplier (f1+f2+f3) are all elements in the frequency list to the character. So for "ABRACADABRA", if you want to know the sub-index for the 3rd letter "R", your multiplier becomes: (f1+f2+f3+f4+f5) because it's the 5th letter in the frequency list.

If you go through the word, calculate all the sub-indices, add them all up, you'll end up with the index.

Sub-permutation Frequencies Sub-index
A BRACADABRA A: 5, B: 2, C: 1, D: 1, R: 2 0
A B RACADABRA A: 4, B: 2, C: 1, D: 1, R: 2 15120
AB R ACADABRA A: 4, B: 1, C: 1, D: 1, R: 2 5880
ABR A CADABRA A: 4, B: 1, C: 1, D: 1, R: 1 0
ABRA C ADABRA A: 3, B: 1, C: 1, D: 1, R: 1 480
ABRAC A DABRA A: 3, B: 1, C: 0, D: 1, R: 1 0
ABRACA D ABRA A: 2, B: 1, C: 0, D: 1, R: 1 36
ABRACAD A BRA A: 2, B: 1, C: 0, D: 0, R: 1 0
ABRACADA B RA A: 1, B: 1, C: 0, D: 0, R: 1 2
ABRACADAB R A A: 1, B: 0, C: 0, D: 0, R: 1 1
ABRACADABR A A: 1, B: 0, C: 0, D: 0, R: 0 0
Sum 21519

Example time

Word Index Permutations
ABRACADABRA 21519 83160
ANAGRAM 250 840
ARCTIC 41 360
MISSISSIPPI 13736 34650
WIKI 10 12

Another example

This sentence has a length of 43 characters:

The quick brown fox jumps over the lazy dog

And has an index value of:

(dec) 254668872108764518031939727748830382196379337
(hex) 0b6b748d1dc2610369faa7eaf6160ec0b1cec9

The binary would be 19 bytes.

Shannon Entropy

As I said the binary has a length of 19 bytes. It's below the theoretical limit of Shannon entropy. But then again, most modern compression algorithms are.

This function is an implementation of Shannon entropy, but returns the entropy in total number of bytes instead of bits per symbol:

function entropy($string) {
    $length = strlen($string);
    $freq = count_chars($string, 1);
    $freq_length = array_map(fn($f) => $f / $length, $freq);
    $entropy = array_reduce($freq_length, fn($r, $f) => $r - $f * log($f, 256)) * $length;
    return($entropy);
}
import math

def entropy(string):
    length = len(string)
    freq = {}

    for char in string:
        if char not in freq:
            freq[char] = 0
        freq[char] += 1

    freq_length = {char: count / length for char, count in freq.items()}
    entropy = sum([-count * math.log(count, 256) for count in freq_length.values()]) * length

    return entropy

"The quick brown fox jumps over the lazy dog" has an entropy of 24 bytes.

The Proof

As I said before n is the number of unique characters and fi is how many times the character is used. M is the maximum possible permutations.

So I'll show you how we walk from Shannon entropy to Anagram coding:

  1. Shannon entropy:

    H=ifinlog2(fin)

    For the entire string, the total entropy is nH, representing an absolute lower bound.

  2. The number of distinct permutations is governed by the multinomial coefficient:

    M=n!ifi!
  3. Using:

    lnn!nlnnn

    we approximate:

    lnMnlnnni(filnfifi)=nlnnifilnfi

    Converting to base-2 logs:

    log2M1ln 2(nlnnifilnfi)
  4. Please note:

    ifilnfi=ifilnnnHln2

    Substituting back, we find:

    log2MnH

    Thus, encoding the permutation index requires nH bits, matching Shannon’s limit.

For large n, the symbol frequencies fi/n match their probabilities. This aligns with Shannon’s typical set—the set of sequences dominating the probability mass.

Overheads (e.g., transmitting counts {fi}) become insignificant as n. The term nH dominates.

Optimal encoding uses log2M+O(1) bits, converging to nH.

Assumptions

  1. Transmitting counts {fi} costs O(mlogn) bits, negligible as n.
  2. Stirling’s approximation holds rigorously only as n.

So...

Permutation indexing achieves Shannon’s entropy limit because log2M asymptotically equals nH. This mirrors Shannon’s source coding theorem, showing the interplay between combinatorics and information theory.

The Anagrammatist

Ok, that was heavy. Time for some implementation. Now this is the naive algorithm, based on the simple math. The complexity is O(n2). Where n is the length of the string. So this is for educational purposes. At the end you can find a optimised version.

<?php
function anagram_index($string) {
    $index = 0;
    $frequencies = count_chars($string, 1);
    $string_length = strlen($string) - 1;

    for($i = 0; $i < $string_length; $i++) {
        $char = ord($string[$i]);

        $freq_char_index = array_search($char, array_keys($frequencies));

        $multiplier = array_slice($frequencies, 0, $freq_char_index);
        $multiplier = array_sum($multiplier);

        $numerator = gmp_fact(array_sum($frequencies) - 1);

        $denominator = array_map("gmp_fact", $frequencies);
        $denominator = array_reduce($denominator, fn($carry, $den) => gmp_mul($carry, $den), "1");

        $index += gmp_div_q(gmp_mul($multiplier, $numerator), $denominator);

        $frequencies[$char]--;
    }

    return($index);
}

$string = "The quick brown fox jumps over the lazy dog";
echo anagram_index($string);
?>
import gmpy2

def anagram_index(string: str) -> int:
    index = gmpy2.mpz(0)
    frequencies = dict()
    for char in string:
        if char not in frequencies:
            frequencies[char] = 1
        else:
            frequencies[char] += 1
    string_length = len(string) - 1

    for i in range(string_length):
        char = ord(string[i])

        freq_char_index = list(frequencies.keys()).index(chr(char))

        multiplier = sum(frequencies[key] for key in frequencies.keys() if ord(key) < char)
        numerator = gmpy2.fac(sum(frequencies.values())-1)

        denominator = gmpy2.fac(frequencies[chr(char)])
        for key in frequencies:
            if key != chr(char):
                denominator *= gmpy2.fac(frequencies[key])

        index += gmpy2.f_div(gmpy2.mul(multiplier, numerator), denominator)
        frequencies[chr(char)] -= 1
    return index

string = "The quick brown fox jumps over the lazy dog"
print(anagram_index(string))

Decode:

<?php
function anagram_string($index, $frequencies) {
    $string = null;
    $string_length = array_sum($frequencies);

    for($i = 0; $i < $string_length; $i++) {
        $frequencies_sum = array_sum($frequencies);

        $numerator = gmp_fact($frequencies_sum);

        $denominator = array_map("gmp_fact", $frequencies);
        $denominator = array_reduce($denominator, fn($carry, $den) => gmp_mul($carry, $den), "1");

        $total_permutations = gmp_div_q($numerator, $denominator);

        $frequencies_index = gmp_div_q(gmp_mul($index, $frequencies_sum), $total_permutations);

        $cum_frequencies = 0;
        foreach($frequencies as $char => $char_freq) {
            $cum_frequencies += $char_freq;
            if($cum_frequencies > $frequencies_index) break;
        }

        $freq_char_index = array_search($char, array_keys($frequencies));

        $multiplier = array_slice($frequencies, 0, $freq_char_index, true);
        $multiplier = array_sum($multiplier);

        $numerator = gmp_div_q($numerator, $frequencies_sum);

        $index -= gmp_div_q(gmp_mul($multiplier, $numerator), $denominator);

        $frequencies[$char]--;

        $string .= chr($char);
    }

    return($string); 
}

$frequencies = [32=>8, 84=>1, 97=>1, 98=>1, 99=>1, 100=>1, 101=>3, 102=>1, 103=>1, 104=>2, 105=>1, 106=>1, 107=>1, 108=>1, 109=>1, 110=>1, 111=>4, 112=>1, 113=>1, 114=>2, 115=>1, 116=>1, 117=>2, 118=>1, 119=>1, 120=>1, 121=>1, 122=>1];
$index = "254668872108764518031939727748830382196379337";

echo anagram_string($index, $frequencies);
?>
import gmpy2

def anagram_string(index:str, frequencies:dict) -> str:
    string = ""
    string_length = sum(frequencies.values())

    index = gmpy2.mpz(index)

    for i in range(string_length):
        frequencies_sum = sum(frequencies.values())

        numerator = gmpy2.fac(frequencies_sum)
        denominator = 1

        for key in frequencies:
            denominator *= gmpy2.fac(frequencies[key])

        total_permutations = gmpy2.f_div(numerator, denominator)

        frequencies_index = gmpy2.f_div(gmpy2.mul(index, frequencies_sum), total_permutations)

        cum_frequencies = 0
        for char in frequencies.keys():
            cum_frequencies += frequencies[char]
            if cum_frequencies > frequencies_index:
                break

        freq_char_index = list(frequencies.keys()).index(char)

        multiplier = sum(frequencies[key] for key in frequencies.keys() if key < char)
        numerator = gmpy2.f_div(numerator, frequencies_sum)

        index -= gmpy2.f_div(gmpy2.mul(multiplier, numerator), denominator)

        frequencies[char] -= 1

        string += chr(char)
    return string

frequencies = {32: 8, 84: 1, 97: 1, 98: 1, 99: 1, 100: 1, 101: 3, 102: 1, 103: 1, 104: 2, 105: 1, 106: 1, 107: 1, 108: 1, 109: 1, 110: 1, 111: 4, 112: 1, 113: 1, 114: 2, 115: 1, 116: 1, 117: 2, 118: 1, 119: 1, 120: 1, 121: 1, 122: 1}
index = "254668872108764518031939727748830382196379337"

print(anagram_string(index, frequencies))

Go Faster!

So we can squeeze the algo above a lot. We can use Fenwich binary trees to store calculated factorials. Then the first thing I learned when I learning programming on the C64, is take as much as you can out of the loop. We trade memory for processing power, the speed increase is dramatic however.

<?php

class FenwickTree {
    private $tree;
    private $size;

    public function __construct($size) {
        $this->size = $size;
        $this->tree = array_fill(0, $size + 1, 0);
    }

    public function update($index, $delta) {
        $i = $index + 1; // 1-based indexing
        while ($i <= $this->size) {
            $this->tree[$i] += $delta;
            $i += $i & -$i;
        }
    }

    public function query($index) {
        $sum = 0;
        $i = $index + 1; // 1-based indexing
        while ($i > 0) {
            $sum += $this->tree[$i];
            $i -= $i & -$i;
        }
        return $sum;
    }
}

function lets_go_faster_anagram_index($s) {
    $len = strlen($s);
    if ($len == 0) return gmp_init(0);

    // Precompute factorials up to the string length
    $factorials = array();
    $factorials[0] = gmp_init(1);
    for ($i = 1; $i <= $len; $i++) $factorials[$i] = gmp_mul($factorials[$i - 1], $i);

    // Compute frequency counts and sorted unique characters
    $counts = array_count_values(str_split($s));
    $sorted_chars = array_keys($counts);
    sort($sorted_chars);
    $char_to_index = array_flip($sorted_chars);
    $k = count($sorted_chars);

    // Initialize Fenwick Tree with character counts
    $ft = new FenwickTree($k);
    foreach ($sorted_chars as $idx => $c) $ft->update($idx, $counts[$c]);

    // Compute initial denominator using precomputed factorials
    $denominator = gmp_init(1);
    foreach ($counts as $c => $cnt) $denominator = gmp_mul($denominator, $factorials[$cnt]);

    // Precompute remaining_length_minus_1 divisors as GMP numbers
    $remaining_divisors = array();
    for ($i = 0; $i < $len; $i++) $remaining_divisors[$i] = gmp_init($len - $i - 1);

    // Initialize running term T = (n-1)! / denominator
    $T = gmp_div($factorials[$len - 1], $denominator);
    $index = gmp_init(0);

    for ($i = 0; $i < $len; $i++) {
        $current_char = $s[$i];
        $current_idx = $char_to_index[$current_char];
        $original_count = $counts[$current_char];

        // Get sum of counts of characters < current_char
        $sum_less = $ft->query($current_idx - 1);

        if (gmp_cmp($sum_less, 0) > 0) {
            $permutations = gmp_mul($sum_less, $T);
            $index = gmp_add($index, $permutations);
        }

        // Update Fenwick Tree and counts
        $counts[$current_char]--;
        $ft->update($current_idx, -1);

        // Update T for next iteration: T = T * original_count / (remaining_length - 1)
        if ($i < $len - 1) $T = gmp_div(gmp_mul($T, $original_count), $remaining_divisors[$i]);
    }

    return $index;
}

$string = "The quick brown fox jumps over the lazy dog";
echo lets_go_faster_anagram_index($string);
?>

Decode

Same strategy as before:

<?php
class FenwickTree {
    private $tree;
    private $size;

    public function __construct($size) {
        $this->size = $size;
        $this->tree = array_fill(0, $size + 1, 0);
    }

    public function update($index, $delta) {
        $i = $index + 1; // Convert to 1-based indexing
        while ($i <= $this->size) {
            $this->tree[$i] += $delta;
            $i += $i & -$i;
        }
    }

    public function query($index) {
        $sum = 0;
        $i = $index + 1; // Convert to 1-based indexing
        while ($i > 0) {
            $sum += $this->tree[$i];
            $i -= $i & -$i;
        }
        return $sum;
    }

    public function findIndex($target) {
        $low = 0;
        $high = $this->size - 1;
        while ($low < $high) {
            $mid = (int)(($low + $high) / 2);
            if ($this->query($mid) >= $target) $high = $mid; else $low = $mid + 1;
        }
        return $low;
    }
}

function lets_go_faster_anagram_string($index, $counts) {
    $sorted_chars = array_keys($counts);
    sort($sorted_chars);
    $k = count($sorted_chars);
    $n = array_sum($counts);
    if ($n == 0) return null;

    $ft = new FenwickTree($k);
    foreach ($sorted_chars as $i => $char) $ft->update($i, $counts[$char]);

    $factorials = array();
    $factorials[0] = gmp_init(1);
    for ($i = 1; $i <= $n; $i++) $factorials[$i] = gmp_mul($factorials[$i - 1], $i);

    $denom = gmp_init(1);
    foreach ($counts as $char => $cnt) $denom = gmp_mul($denom, $factorials[$cnt]);

    $T = gmp_divexact($factorials[$n - 1], $denom);
    $remaining_index = gmp_init(gmp_strval($index));
    $result = null;

    for ($i = 0; $i < $n; $i++) {
        $div = gmp_div_q($remaining_index, $T);
        $candidate_threshold = gmp_add($div, gmp_init(1));
        $threshold_int = gmp_intval($candidate_threshold);

        $r = $ft->findIndex($threshold_int);
        $chosen = $sorted_chars[$r];
        $count_before = ($r > 0) ? $ft->query($r - 1) : 0;

        $remaining_index = gmp_sub($remaining_index, gmp_mul($T, $count_before));
        $result .= $chosen;

        $old_count = $counts[$chosen];
        $counts[$chosen]--;
        $ft->update($r, -1);

        $remaining = $n - $i - 1;
        if ($remaining > 0) $T = gmp_divexact(gmp_mul($T, $old_count), $remaining);
    }

    return $result;
}

// Example Usage
$index = "254668872108764518031939727748830382196379337";
$frequencies = [" " => 8, "T" => 1, "a" => 1, "b" => 1, "c" => 1, "d" => 1, "e" => 3, "f" => 1, "g" => 1, "h" => 2, "i" => 1, "j" => 1, "k" => 1, "l" => 1, "m" => 1, "n" => 1, "o" => 4, "p" => 1, "q" => 1, "r" => 2, "s" => 1, "t" => 1, "u" => 2, "v" => 1, "w" => 1, "x" => 1, "y" => 1, "z" => 1];

echo lets_go_faster_anagram_string($index, $frequencies);
?>