ID String Length Design, Radix vs Length
ID String Systems
ID is a sequence of numbers or characters. e.g. 8946729361
, or h0fa48f2pt
.
- Length: The total count of characters in a ID.
- Radix (or Base): The total count of possible distinct symbols.
ID String System Calculator
ID System A
=ID System B
=Result
Difference: A - B =
Length required in system B to cover system A:
How many symbols you have to introduce in order to reduce the number of digits?
- plot of
y = Log[10^10]/Log[x]
. - The x-axes (horizontal) is the radix of ID.
- The y-axes (vertical) is the length of ID.
- The red dots are positions where the total possible count is 10^10.
If you look at the plot,
- From x = 8 to x = 10, it takes 2 extra symbols to drop a digit.
- From x 10 to 13, it takes 3.
- From 13 to 18, it takes 5.
- Then it takes 9.
- Let xradix be the number of possible symbols in a ID string system.
- Let xlen be the number of distinct symbols.
- Let xtotal be the total possible ID.
We have:
xradix ^ xlen = xtotal
For example, if xradix is 10, xlen is 3, then xtotal is 10^3
or 1000. Examples of such ID string: {000, 001, 002, …, 999}.
For example, if xradix is 2, xlen is 3, then xtotal is 2^3
or 8.
If we use the symbol “0” and “1”, then the complete set of ID is: {000, 001, 010, 011, 100, 101, 110, 111}.
If we use the symbol “a” and “b”, then the complete set of ID is: {aaa, aab, aba, abb, baa, bab, bba, bbb}.
Number of Digits vs Number of Symbols
Let's say we want a ID system for 10 billion IDs. Let's say we use ten symbols e.g. our decimal number system. So we have xradix = 10; xlen = 10; xtotal = 10^10
. That covers “10,000,000,000” items.
What if we increase the possible symbols to 20? Let's use these symbols: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, a, b, c, d, e, f, g, h, i, j}. If we use 7 such digits, then we have 20^7 = 1,280,000,000
. Not quite enough. If we use 8 digits, then 20^8 = 25,600,000,000
, that does it.
We increased our symbols by 10 yet we only save 2 digits!
In general, increasing the number of symbols does not help saving digits much. From our formula xradix ^ xlen = xtotal
, solving for xradix or xlen, we get:
Solve[xtotal == xradix^xlen, xlen] (* {{xlen -> Log[xtotal]/Log[xradix]}} *)
Look at the formula xlen = Log[xtotal]/Log[xradix]
.
That means, xlen is a function of xtotal and xradix.
If the total xtotal is fixed, and the possible symbols xradix varies, then the total number of digits xlen is a function of the form C/Log[x]
for some constant C.
This means, it gets smaller very very slowly.
We have to increase a lot of symbols just to reduce one digit.
Now, let's look at our original question, about the design of a ID system, on what's the optimal number of symbols and digits.
As we know now, that increasing number of symbols does not reduce digits much.
Let's say we use 0 to 9 plus a to z.
That's 36 symbols, and it requires 7 digits to cover 10^10.
If we add 96 Japanese phonetic alphabets, we have a total of 36+96 = 132 symbols.
We still need 5 digits (Log[10^10]/Log[96]
).
If we go to the extreme, using 3 thousand commonly used Chinese characters, we need 3 digits (Log[10^10]/Log[3000] ≈ 2.87
).
If we use 60 thousand symbols (that's about all Chinese characters and all alphabets of the world's languages), we still need 2 digits.
We'll need 10^10 symbols to cover 10^10 space with just one single symbol.
At this extreme, there's a problem.
ID string in general needs to be at least lots of digits (say, 10) to be recognized as ID string, otherwise we need to add some other way to identify it as a ID string.
For example, if we use 60 thousand symbols with just 2 digits, then, of
might be a ID string, but how do you know it's a English word or a ID string?
There's a standard ID system called Universally Unique Identifier (aka UUID or CLSID).
It uses 32 digits of 16 symbols (conventional hexadecimal representation, with symbols 0 to 9 and a to f), covering possible total of 16^32, or 340282366920938463463374607431768211456.
To cover that much with 65536 (2^16
) symbols, we need 8 digits.