skip to content
Gusty's Oasis

Computing a lovely OEIS sequence

- 4480 words

What’s OEIS?

I’m going to take a wild guess and assume the reader doesn’t have a clue what the OEIS is, stands for, or why it’s so great.

Let’s imagine a situation: I’ll share some numbers with you and ask you what you think the next number is within the sequence. Sounds easy, right? Okay, here’s the first sequence: 1, 2, 5. So, what’s the next number? If your guess was 7, your guess was wrong. I was maybe a bit too harsh, so let me add an extra number to the sequence: 1, 2, 5, 13. I’m pretty sure it will now be harder to see a pattern, but the answer was 89. I think you get the idea of this frustrating game, so let me tell you what the sequence was and move on. It was the fibonacci number of the nth prime number.

Now you may ask yourself, “How was I supposed to know the next number in the sequence?” That’s where the OEIS (On-Line Encyclopedia of Integer Sequences) comes into play. Neil Sloane created it1, and it was basically made for this task. Given a couple of terms in a sequence, it will return you a list of sequences with their respective explanations where these terms show up. So if you search “1, 2, 5, 13, 89” in the OEIS, you indeed get the sequence I was using, A030426, along with other sequences where these terms are showing up.

Every day, people use the OEIS to determine if a given pattern of numbers corresponds to a sequence that is well-known in the academic literature. I’m willing to bet that this simple trick can save a lot of hours of research by simply searching your pattern of numbers in the OEIS.

The lovely sequence

In this post, I want to talk about a specific sequence in the OEIS, A058183 which is titled “Number of digits in concatenation of the first n positive integers” and if you don’t grasp what that title means or implies, don’t worry; it didn’t make much sense to me either in the beginning, so let’s go through it together. We’re lucky because this sequence has two parts that we can discuss separately from each other. We will start with the easy part.

Concatenation of integers

Everyone can add two integers, for example, 2+7=92 + 7 = 9, that’s elementary math. However, what if we want to concatenate two integers? For humans this is trivial. The concatenation of two elements is denoted by the \concat symbol, as in 27=272 \concat 7 = 27, 4531=453145 \concat 31 = 4531 or 123456789=12345676891234567 \concat 89 = 1234567689. You see what I mean. I can even do this for more than two integers, for example, 27139=271392 \concat 71 \concat 3 \concat 9 = 27139.

The sequence was speaking about “concatenation of the first n positive integers”. We already know what concatenation is, but now we have to concatenate 123n1 \concat 2 \concat 3 \concat \ldots \concat n, which still sounds easy. Let’s say we have a magic function f(n)f(n) that returns this number, such that f(1)=1f(1) = 1, f(8)=12345678f(8) = 12345678, and f(15)=123456789101112131415f(15) = 123456789101112131415. That’s already the first part of generating the sequence; nothing too complicated. Just before we move on, we will dip our toes into some high-school math2 and lets try to construct this magic f(n)f(n) function.

We will take a small, but handy, shortcut here3 and simply look at the formulas of A007908, which is the sequence we are generating. Every day, people use the OEIS to determine if a given pattern of numbers corresponds to a series that is well-known in the academic literature. There are three formulas listed; however, only the first one looks promising:

a(n)=a(n1)×10log(10×n)+n\begin{equation} a(n) = a(n-1) \times 10^{\lfloor \log(10 \times n)\rfloor} + n \end{equation}

Sidetrack: If you looked really closely and learned logarithm operations, you might’ve noticed you can simplify log(10×n)\lfloor \log(10 \times n)\rfloor to log(n)+1\lfloor \log(n)\rfloor + 1, because of the following identity: logn(a)+logn(b)=logn(a×b)log_n(a) + log_n(b) = log_n(a \times b).

So let’s go through it! And I will first have to point out that this formula in its current form is unclear. If you look closely, you can see there’s an infinite recursion because, at no point, it will stop calling the function again. So let’s improve it by adding a condition to the function4:

a(n)={a(n1)×10log(10×n)+nif n11if n=1a(n) = \begin{cases} a(n-1) \times 10^{\lfloor \log(10 \times n)\rfloor} + n &\text{if } n \neq 1 \\ 1 &\text{if } n = 1 \end{cases}

Now you can see that it will call itself recursively until it hits n=1n = 1. If that still doesn’t make sense to you, here’s the formula but in Python:

import math # Access to centuries of mathematical knowledge by simply importing it.

def a(n):
    if n == 1:
        return 1
    return a(n - 1) * pow(10, math.floor(math.log10(10 * n))) + n

Alright, now let’s really start examining the function. We can see by induction that the last digit(s) of the return value will be the digits of the input value nn, so the formula simply adds that to the end result, which is the +n\mathrel{+}n part. Now we have the other part that magically constructs the rest of the number; it even looks scary with so many operations that somehow have a relationship to each other. I’m procrastinating again, sigh.

So let’s simply start with the first operation that’s computed, given the order of operations: log(10×n)\lfloor \log(10 \times n)\rfloor. This is actually a bit of foreshadowing for the next section, because in this scenario, the log\log, with the floor operation and nn multiplied by 1010, it will compute the number of digits that a given number has in base 10. The number of digits is now the exponent of 1010, which is the most significant aspect of this function since it will be the multiplication for a(n1)a(n-1), which must be added to number on the right “spot” in the result number, so it will be added “left” of nn, in which case you need to multiply it by 1010 to the power of the amount of digits that nn had, and oh wow what a surprise, that’s exactly what the function does. Do not worry if this is still confusing; let me show you some examples which hopefuly helps to grasph what’s goin on here. If you had a(2)a(2), it would have evaluated to 1×10+2=121 \times 10 + 2 = 12 whereas a(3)a(3) would have evaluated to (1×10+2)×10+3(1 \times 10 + 2) \times 10 + 3.

Did I mention that this was the easy part? Oops, my bad. Either way, we now have a simple mathematical function that can gives us 123n1 \concat 2 \concat 3 \concat \ldots \concat n.

Number of digits

Let’s say we have another magical function, b(x)b(x), that returns the number of digits that a number has; for example, b(5)=1b(5) = 1, b(8)=1b(8) = 1, b(73)=2b(73) = 2, and b(54312)=5b(54312) = 5. For a normal person, they can easily do this themselves because they can simply count the digits they are seeing. However, converting this to a mathematical function is harder than it seems, because . Before we move on, let me point out why we’re doing this, that is because the sequence we are exploring asks “Number of digits in […]“.

Sidetrack: If you know a byte or two about programming, you might imagine that constructing such a function is quite easy! Because you can convert an integer into a string and then get the length of that string, that is a valid, unoptimized choice for constructing such a function; however, that kind of programming concept cannot be transformed into a mathematical function.

Let’s start by examing the range of numbers when a given output is expected:

b(n)={1if 1n92if 10n993if 100n9994if 1000n9999b(n) = \begin{cases} 1 &\text{if } 1 \leq n \leq 9 \\ 2 &\text{if } 10 \leq n \leq 99 \\ 3 &\text{if } 100 \leq n \leq 999 \\ 4 &\text{if } 1000 \leq n \leq 9999 \\ \ldots \\ \end{cases}

You might’ve have expected it, there’s a pattern in the range of numbers for a given output. We can generalize this pattern: for a given output Ω\Omega, the range of input numbers that give this output is: 10Ω1n10Ω110^{\Omega-1} \leq n \leq 10^{\Omega}-1. There happen to be a simple function that also changes it output for every 10n10^n, which is b(n)=log(n)b(n) = \log(n), because it’s the inverse of exponentiation, in this case the 10n10^n. Well there we have it our function for getting the number of digits, let’s try it out: b(1)=0b(1) = 0, b(5)=0.69897000433601880479b(5) = 0.69897000433601880479\ldots, and b(12)=1.07918124604762482772b(12) = 1.07918124604762482772\ldots Okay, that’s not the correct output.

We see two issues that we need to fix: Get rid of the decimal and add 1 to the output. The latter issue is really easy to fix, b(n)=log(n)+1b(n) = \log(n) + 1, but harder to grasp why it’s needed. So let’s do a quick crash course om what a logarithm is. As said before, a logarithm is the inverse of exponentiation, which implies that when you know the base and the output, you can compute what the exponent of the base was; for example, 23=82^3 = 8 implies log2(8)=3log_2(8) = 3. So when we use the base number of 1010,5 and give it a number less than 1010, we will not get 1.xxx1.xxx\ldots back but rather 0.xxx0.xxx\ldots, same goes for every other number. That’s why we simply have to add 1 to fix this issue.

We now still have to get rid of the decimals, but we’re lucky and don’t have to do any rounding because the number is already correct, so we simply apply the floor operation to the log operation to remove the decimal and we have our function:

b(n)=log(n)+1\begin{equation} b(n) = \lfloor \log(n)\rfloor + 1 \end{equation}

Let’s test it: b(5)=1b(5) = 1, b(10)=2b(10) = 2, and b(51423)=5b(51423) = 5. That looks correct! That was quite a ride to accomplish.

Sidetrack: I don’t suppose this would’ve been noticed, but one might curiously ask why not use the ceiling operation instead of the floor operation and avoid the +1\mathrel{+}1 fix. Well, that does actually work! However, there is exactly one scenario where it would return the incorrect number: any 10n10^n input, because it doesn’t benefit from the invisible +1\mathrel{+}1 that the ceil operation gives.

Putting it together

So now we have two nice formulas that, when combined, can give us a lovely sequence, so let’s just do that!

P(n)=b(a(n))\begin{equation} \Rho(n) = b(a(n)) \end{equation}

So given the two previous sections, this equation should be able to generate A058183. So let’s give it some input from the list: P(5)=5\Rho(5) = 5, P(18)=27\Rho(18) = 27 and P(58)=107\Rho(58) = 107; it indeed works! That means we now have a mathematical function to generate the lovely sequence and you might have noticed that this exact formula is already listed in the OEIS: a(n)=A055642(A007908(n))a(n) = A055642(A007908(n)), whereby these two sequences are used: A007908 and A055642.

That was quite a trip and lots of math to just get to this point and discover this can be optimized. Oh, am I spoiling the next section again? No worries; if you understand these two parts, it will only make it easier to follow along into the computation section, because up until this point we’d only cared to get a correct mathematical representation of generating the sequence, which we indeed got.

Computing it

When I ask you to compute the fibonacci sequence up until 25 terms, you might ask WolframAlpha or actually try to do this by hand; either way, something has to go through a mathematical function and crunch the numbers. However, you might’ve noticed that, over the past decades, we’ve gotten pretty handy with CPUs, especially in efficiently telling them what to do, and in turn, CPUs have become fast at doing math operations. It has reached a point where sometimes a mathematical representation of a function does not directly correspond to what the CPU is being told to do because there’s a likelihood that there are certain optimizations that a CPU can take advantage of, either by merely storing numbers in base 2, by doing fast approximations of expensive math functions like square root, or by simply caching expensive results that otherwise would’ve been computed multiple times. My point is that we do have a nice formula, but let’s try to compute it and see if it’s an optimal solution (spoiler: it isn’t) and if we can take advantage of any optimization.

Just to get a feeling for what we’re coding, we will use Python, so that most readers will be able to follow along. So let’s get a prototype going!

import math
import sys

# This will return a concenation of 1, 2, 3, ..., n.
def concenate(n):
	if n == 1:
		return 1
	return concenate(n - 1) * pow(10, math.floor(math.log10(10 * n))) + n

# Returns the amount of digits a number has.
def countDigits(n):
	return math.floor(math.log10(n)) + 1

# Returns the amounts of digits the concenation of 1, 2, 3, ..., n has.
def P(n):
	return countDigits(concenate(n))

P(int(sys.argv[1]))

In this case, we have a perfect representation of the mathematical function, and we can even benchmark it now. I will use the Hyperfine tool for benchmarking. So let’s give it a try and see if the code has good performance.

InputAverage execution time
513.2 ms ± 1.1 ms
8013.4 ms ± 0.9 ms
90013.6 ms ± 0.7 ms

It seems like it’s pretty good for small numbers, so let’s try a bigger number. Oh wait, that will be a problem. As you might notice in the concenate function, there’s recursion happening. For good reason, Python has a recursion limit of 1000 calls, which means any input greater than 1000 will upset Python. I will ruin the fun if we already have to rewrite that part of the code, so we will instead temporarily increase the recursion limit6 to not upset Python. Let’s give it another try with bigger numbers this time.

InputAverage execution time
100013.7 ms ± 1.0 ms
10.00058.8 ms ± 1.6 ms
50.0001.371 s ± 0.002 s

As you can see, the average execution time is increasing fast for a relative low number; the execution time becomes vastly inpractical to use this in a normal application. I’ve already spoiled a bit of why this happening, the concenate function is called quite often, the running time of this function is Θ(n)\Theta(n)7, where nn is obviously the input number. Fortunately, time increases linearly; however, given a big enough number, it will be well over a second before there’s an output.

In order to decrease the execution time, we can do two effective things: fundamentally change the algorithm to have a better upper-bound than O(n)\mathcal{O}(n) or apply optimizations to the code. Obviously, it’s easier to apply optimizations than to come up with a new algorithm; however, in this case, we will not try to optimize the code and cut straight to trying to design a new algorithm. A little spoiler, there is an algorithm running in O(logn)\mathcal{O}(\log n).

Revisting math

So, now that you have a good understanding of what we are trying to calculate, let’s try to come up with a better algorithm, rather, you will read what I was trying to come up with. Of course, the first place to look is at the OEIS again, and surprisingly, there’s a Θ(1)\Theta(1) formula there.8 Quite frankly, I cannot derive this formula from anything, but it seems to look correct. So we will not use that formula and instead come up with our own one, which is quite easy to grasp and something I’ve seen being used in the wild, so once again, this is nothing wildly new. So let’s first come up with the math function of our new algorithm before implementing it, because who doesn’t love math?

Alright, we actually already know the pattern of which range of numbers has nn number of digits, and we will use that knowledge to our advantage because, if you can remember, the first number that has n number of digits is 10n110^{n-1} and the last number of that range was 10n110^n-1. If you use 131 as an input number, you know that [1,9][1, 9] will have one digit and [10,99][10, 99] has two digits, however a simply but important observation to make is, if you now use 132 as an input number, the first two ranges stay constant obviously. What this means is that we can calculate those ranges by simply multiplying the amount of numbers in that range by the number of digits of that range, so in the case of [1,9][1, 9] that would be 9×19 \times 1 and for [10,99][10, 99] it would be 89×289 \times 2 and then calculate the rest of the three digits range by doing (132100)×3(132 - 100) \times 3.

So how do we map this observation to a mathematical function? Well, first I have to do a crash course on the summation notation. Prepare yourself; this is the summation symbol:

\sum

For some people, this looks scary but is actually easy to understand! If this explanation doesn’t fit you, feel free to look online, as there are many explanations. Simply put, it’s a loop that will sum all the results that were calculated inside the loop. You have to specify the start and end range of the iteration as well, so for example, if we want to put the following sequence: 50+51+52++5n5^0 + 5^1 + 5^2 + \ldots + 5^n into a mathematical function, it would look like:

i=0n5i\sum^n_{i=0} 5^i

It starts at i=0i=0 and ii would increase until it hits nn, and for every iteration it will calculate 5i5^i with the respective ii of that iteration and add it to the sum.

Alright, one more detour before we can actually put the new function together. We need to make another observation: how many numbers are there in every range? For the first one, there are 9 numbers; for the second one, there are 90 numbers; and for the third one, there are 900 numbers. Well, this has to be the easiest observation to make so far, and to come up with a formula to compute it: r(n)=9×10n1r(n) = 9 \times 10^{n-1}.

Great! Now we have all the knowledge to put together a mathematical function. So let’s start simply and figure out our iteration range. We can only apply this observation to ranges where every number in that range is <n<n, where nn is the input number. For example, if we have 56, we can apply the observation to [1,9][1,9], but not to [10,99][10,99], because not every number in that range fulfills the <n<n condition. However, if we have 99, then it does. So once again, we have to use log\log somehow. Luckily, it’s simple, and we only have to apply the floor operation: logn\lfloor \log n \rfloor, because we don’t need the decimal part. We can figure out the iteration start; let’s say we have 10 as an input number, the end range would be 1, and we want to run the iteration once, so the start range has to be 1. Yes, I am aware that this approach of determining the iteration start is quite scientific. So we now have the following:

i=1logn\sum^{\lfloor \log n \rfloor}_{i=1}

We still need to determine two things: the formula that must be used for each iteration and the formula to calculate the remaining number, which we did not calculate through the summation. We already know how many numbers are in each range, which is r(n)=9×10n1r(n) = 9 \times 10^{n-1}, so we just multiply that by the number of digits in that range. So to put it into perspective:

i=1logn9×10i1×i\sum^{\lfloor \log n \rfloor}_{i=1} 9 \times 10^{i-1} \times i

Sidetrack: Yes, ii starting at zero is more preferable, but it makes understanding what is going on more difficult. As it will add a term to the summation’s upper condition and parentheses inside the iteration statement.

It hopefully goes without saying that ii happens to correspond to the number of digit a number has in that range. That was actually fairly simple to figure out, entirely by pure induction of recognizing patterns.

So now we have to figure out the rest of the number, so let’s give it a thought about what we need to know and how we can calculate that. There seem to be two essential things that we need to know: the amount of numbers in the partial range and the number of digits that range has. If that doesn’t make sense, let’s once again run through an example: if we have as input number 15, 0–9 will be covered through the summation expression; however, we still need to calculate 10-15, which are 5 numbers (the amount of numbers in the partial range of 10-99) that all have two digits (the number of digits the number has in this partial range), and then simply multiply that by each other and then add it to the result of the summation.

We can fairly easily tick off one box: the number of digits that a number has in the partial range is logn+1\lfloor \log n \rfloor +1. Now we need to figure out the number of numbers there are in the partial range; for that, we need to know the cumulative number of numbers that have been in the other ranges. Once again, let’s run through an example: if we have 119119 as an input number, we would need to calculate 119990119 - 9 - 90, because there are 9 numbers in the [1,9][1,9] range and 90 in the [10,99][10,99] range. Let’s make a observation, we can simplify 990-9 - 90 to 99-99. That’s funny; that likely means that the subtracted amount is something along the lines of 10n110^n−1, we just need to figure out what nn is. Let’s go with the most scientific route and simply make it work for 119119 and then check if it works for other numbers, so in this case, n=2n=2, because that will compute to 9999. Because we’re still working with ranges, we will need to involve log, so let’s start with the simplest variant, logn\lfloor \log n \rfloor, and what a surprise! If we give it 119119, it spits out 22. That was simple to figure out, now let’s test it with some other numbers to make sure it’s actually correct: log5=0\lfloor \log 5 \rfloor = 0, log99=1\lfloor \log 99 \rfloor = 1 and log100=2\lfloor \log 100 \rfloor = 2. That looks correct.

So, using all we now know, let’s finally put it all together:

P(n)=(n(10logn1))×(logn+1)+i=1logn9×10i1×i\begin{equation} \Rho(n) = (n - (10^{\lfloor \log n \rfloor} - 1)) \times (\lfloor \log n \rfloor + 1) + \sum^{\lfloor \log n \rfloor}_{i=1} 9 \times 10^{i-1} \times i \end{equation}

That looks beautiful, I will for the last time use Python and convert this math expression into some readable Python code.

import math
import sys

def countDigits(num):
	return math.floor(math.log10(num)) + 1

def ref(num):
	inputDigits = countDigits(num)
	# The amount of numbers that the other ranges has.
	rangesNumber = pow(10, math.floor(math.log10(num))) - 1
	res = (num - rangesNumber) * inputDigits

	# Summation includes the upper bound value, while Python range doesn't, so we need to simply add 1 to make it work as intended.
	for i in range(1, inputDigits):
		res += pow(10, i-1) * 9 * i
	return res

ref(int(sys.argv[1]))

Crunching the numbers

Before we continue, let’s quickly benchmark the Python code of the new algorithm in comparison to the old algorithm:

InputOld AlgorithmNew Algorithm
1007.6 ms ± 1.3 ms7.3 ms ± 0.7 ms
10.00039.4 ms ± 1.0 ms7.5 ms ± 0.9 ms
50.000906.6 ms ± 9.6 ms7.3 ms ± 0.6 ms

Sidetrack: Python doesn’t seem to like my high amount of recursion, so that might explain why this run of the old algorithm is a bit faster than the previous run. Or Linux was just more nice to the python process, no clue.

Why is this useful?

You might’ve scratched your head by now and asked yourself why you’ve come so far into reading this post and why this is actually useful to compute. Uh, yeah, great question, actually. Well, I do have a personal reason for computing this lovely sequence, pre-allocating memory. Let’s imagine a scenario whereby you have to pre-allocate memory for a given task. Well, this is usually do-able. Your input data might contain 0...n0...n integers, and you want to know how much memory they are going to need to be stored (in their string representation). In such a case, being able to compute this sequence for a given nn will be able to give you that answer.

On the other hand, this is useful to realize that a small and innocuous-looking formula might have a complex history behind it that you cannot derive from purely looking at the formula. Just to have a simple loop code, there was a lot of math code to really grasp what it was doing and why it was an efficient method of doing it. Making observations can go a long way toward coming up with formulas, but do note that it usually doesn’t work with more complex math as there are just too many factors to take into account.

Just to wrap things up, feel free to get in touch with me nicely if you discovered a mistake in my lovely math or just a language mistake.


Footnotes

  1. Even if you do not like math, I can recommend you to watch Numberphile’s Eureka Sequences to understand why he enjoys sequences and why he developed the OEIS.

  2. This is an estimate. I really have no clue if this is covered in high school and will likely differ from country to country. Either way, it requires a bit more thinking on the reader’s part.

  3. This is not a research paper or anything; otherwise, I might’ve tried to come up with my own formula, which would’ve likely resulted in a formula that’s already known.

  4. For what it’s worth, I’m quite nitpicking here, because in most cases, a normal math-savvy person will understand that this function stops at n=1n=1 and not try to compute for an infinite amount of time. In my defense, making it clearer will make it more understandable for the average person.

  5. It’s useful to note that you cannot avoid the +1\mathrel{+}1 fix by using a base of 1, as one may suggest, because no matter how many times you multiply 1 with itself, it will always remain 1.

  6. This is achieved by calling sys.setrecursionlimit(...).

  7. If you just saw and thought, oh that’s big-O notation, I have to disappoint you. This is related to big-O notation, but doesn’t technically mean the same thing. So a little crash course, because this is not a technical computer science blog. The big-O notation means that it runs at worst, but it could run faster than the time described, so for example binary search runs in O(log2n)\mathcal{O}(\log_2n), but this simply means that it runs in worst case log2n\log_2n, while it can also run in O(1)\mathcal{O}(1) because the item was found in the first iteration. So the big-theta notation, which we use now, has the description you might have had of the big-O notation, it always runs in Θ(n)\Theta(n) time, which is essentially the upper and lower bounds of the function.

  8. a(n)=b(n)×(n+1)10b(n)1992a(n) = b(n) \times (n + 1) - \frac{10^{b(n)} - 19}{9} - 2, where b(n)b(n) is A055642, contributed to the OEIS by Lorenzo Sauras Altuzarra.