Functions and Regular Expressions
Regular expressions are one of those things that bring joy and frustration at the same time to most software developers. At one point or another, we’ve all written functions to match strings against some regular expression, as part of a bigger function, class, component etc. In this post, I want to show you one of the counterintuitive results from Computability Theory, that involves regular expressions. The code samples are all in Scala, mainly because, in my opinion, it is concise in many respects, but, at the same time, reasonably familiar to a wide audience of developers.
Terminology
Let’s start with some basic Computer Science terminology. If you are familiar with these terms, feel free to skip to the next section.
A language is any set of strings. A regular language is a set of strings that match some regular expression. Conversely, a non-regular language is a language for which there is no regular expression that matches all the strings in it.
A function accepts a string if it returns true
for that string1. And we say that a function recognizes a language if it returns true
for all strings in that language and false
for all others. Some functions, however, can go into an infinite loop and therefore never return either true
or false
for some strings. These functions are obviously undesirable. We are usually more interested in functions that eventually terminate. We call them deciders. When a decider recognizes a language, we say that it decides that language.
Ok, that’s all we need to define for this article.
Different classes of functions
So let’s say we want to implement a function to recognize the regular language 01*0. That is, the set of all strings that start and end with a 0, with any number of 1s (including none) in the middle. Pretty straightforward to do, using Scala’s built-in regular expression support:
object zerosAtTheEdges extends (String => Boolean) {
override def apply(s: String): Boolean = s.matches("01*0")
}
Another (trivial) example of a regular language is the set of all strings:
object anything extends (String => Boolean) {
override def apply(s: String): Boolean = s.matches("[01]*")
}
We can also implement a function to recognize non-regular languages, such as 0n1n. This language is composed of all the strings that start with a certain number of 0s and end with the same number of 1s. How do we know that this language is non-regular? The intuition is basically this: in order to tell whether a string has an equal number of 0s and 1s, you need to count their occurrences. So you need to store this information in memory, somehow. Since the exponent n can be arbitrarily large, we need an arbitrarily large memory. But, when it comes to regular expressions, we can only rely on a finite, pre-determined memory (for example, we can write a regular expression for 0313 or 0414 or any other fixed number, but not an arbitrarily large number).
A possible implementation of a function that recognizes 0n1n is:
object zerosAndOnes extends (String => Boolean) {
override def apply(s: String): Boolean = {
val (zeros, rest) = s.span(_ == '0')
zeros.length == rest.length && rest.forall(_ == '1')
}
}
Some functions may accept or reject certain strings, but go into an infinite loop for all other strings:
object loop extends (String => Boolean) {
override def apply(s: String): Boolean = {
if ("0000" equals s) true
else {
while (true) {}
false
}
}
}
Wouldn’t it be nice if we could implement some function f
that received another function g
as input (after all, Scala makes it easier to deal with higher-order functions) and decided whether g
recognizes a regular language? This doesn’t seem like an easy task, but let’s sketch out something to get a feel of what it would look like:
/**
* Given a function as input, tells whether
* that function recognizes a regular language.
*/
object regularLanguageDecider extends ((String => Boolean) => Boolean) {
override def apply(g: String => Boolean): Boolean =
g == zerosAtTheEdges || g == anything
}
Obviously we’re cheating here. We’re assuming that there are only two regular languages in our universe and, for each of these languages, there is only one function that recognizes it. So all we need to do is check the identity of the function to reach a decision.
If you’re uncomfortable with this cheap trick, you can think of more sophisticated ways of inspecting the function g
. You could consider, for example, using more advanced language features such as reflection or make the decider work on the source code of g
, so you could do some static analysis and so on.
Regardless of how we think we might implement it, let’s just assume that such a decider already exists and is available in some open source library. We can just import that function in our own code and use it as we see fit. So, from now on, we will not look at its implementation anymore and just treat it as a black box that can be used wherever it’s needed.
A Universal Decider
Let’s now consider a different problem: we want to tell whether a function — any function — accepts a given string as input. For example, we know that zerosAtTheEdges
accepts the strings “00”, “010”, “0110” and so on; and it doesn’t accept strings like “11” and “000”. In the first case, the function returns true
and, in the second case, it returns false
. Similarly, loop
accepts the string “0000”, but doesn’t accept any other string (because, in that case, it goes into an infinite loop).
Can we implement a function to solve this problem? Let’s approach this with a Software Engineering mindset. Let’s think of the possible cases, write one test for each case at a time, see each test fail and then implement the code to make it pass, doing refactorings along the way, as necessary (a.k.a, Test Driven Development). So our first test would look like this:
test("Function accepts its input") {
val decider = new UniversalDecider()
assert(decider.apply(zerosAtTheEdges, "01110"))
}
In this first test, we pass two parameters to our system under test: zerosAtTheEdges
and "01110"
. Since we know beforehand that zerosAtTheEdges("01110") == true
, our test expects that decider.apply()
returns true
. An obvious way to make this test pass is:
class UniversalDecider() extends ((String => Boolean, String) => Boolean) {
override def apply(fn: (String => Boolean), input: String): Boolean =
fn(input)
}
We simply pass the input to the function and return the result! Cool, let’s write our second test, then:
test("Function does not accept its input") {
val decider = new UniversalDecider()
assert(!decider.apply(zerosAtTheEdges, "11"))
}
Without any modification to UniversalDecider
, this test already passes. Things are looking good! Let’s move on to our third and final test:
test("Infinite loop") {
val decider = new UniversalDecider()
assert(!decider.apply(loop, "11"))
}
We know that loop
will never terminate when called with the string "11"
, in which case decider
should terminate with the value false
. However, decider
itself goes into an infinite loop, so our test will eventually fail with a timeout. But all is not lost. We can add a new level of indirection to our code, based on the Factory design pattern:
trait Factory extends ((String => Boolean, String) => (String => Boolean))
This factory receives two parameters, a function and a string, and returns another function. Implementations of this trait should fulfill the following contract:
- If the function in the first parameter accepts the string in the second parameter, the resulting function (the factory’s product, so to speak) recognizes some regular language. It doesn’t matter which regular language is recognized by this function.
- Otherwise, the resulting function does not accept any regular language.
Using the idea of Design by Contract, we can rely on this guarantee without having to know anything about any particular implementations of this factory.
OK, so how does this help us? Remember the regular language decider we’ve built before? It will come in handy now. Let’s refactor UniversalDecider
to this:
class UniversalDecider(factory: Factory) extends ((String => Boolean, String) => Boolean) {
override def apply(fn: (String => Boolean), input: String): Boolean =
regularLanguageDecider(factory(fn, input))
}
We are now injecting the factory into UniversalDecider
’s constructor. It’s easier to understand what’s going on here by looking at it graphically. We can group our three test cases into two scenarios:
In the first scenario, fn
accepts input
. The first step is to feed these two parameters to the factory. Given the contract of the factory, it will produce a function that recognizes some regular language. The next step is to feed this function to regularLanguageDecider
, which, according to its specification, returns true
in this case. And this is the final result of UniversalDecider
.
In the second scenario, fn
doesn’t accept input
, which makes the factory return a function that doesn’t accept a regular language, which makes regularLanguageDecider
return false
, which is the final result.
Now we need to refactor our tests accordingly, passing different mock implementations of the factory, depending on the test case:
test("Function accepts its input") {
val decider = new UniversalDecider((_, _) => zerosAtTheEdges)
assert(decider.apply(zerosAtTheEdges, "01110"))
}
test("Function does not accept its input") {
val decider = new UniversalDecider((_, _) => zerosAndOnes)
assert(!decider.apply(zerosAtTheEdges, "11"))
}
test("Function loops forever") {
val decider = new UniversalDecider((_, _) => zerosAndOnes)
assert(!decider.apply(loop, "11"))
}
In the first test, we are injecting into UniversalDecider
a mock factory that returns zerosAtTheEdges
, which is a function that accepts the regular language, 01*0, which makes decider
return true
. In the other two cases, we are injecting zerosAndOnes
, which accepts a non-regular language, 0n1n. This makes decider
return false
. Run the tests again and everything is green! Fantastic!
Except that… well, what have we actually accomplished with this? It seems that we just moved the problem somewhere else. We are delegating the hard work to this factory, that we have mocked for the unit tests, but we have no idea how to implement. But notice how the problem is simpler, now. The contract says that, if the input function accepts the string, the output function should accept a regular language. But if the input function does not accept the string, the output function can do whatever it wants, as long as it does not accept a regular language. So consider this implementation of Factory
:
object nonRegularOrDelegate extends Factory {
override def apply(f: String => Boolean, w: String): (String => Boolean) =
(s: String) => zerosAndOnes(s) || f(w)
}
The output function can behave in four different ways, depending on the pair (f
, w
) used to create it and the input string s
passed to it at runtime:
f accepts w |
f does not accept w |
|
---|---|---|
s matches 0n1n |
Accept | Accept |
s does not match 0n1n |
Accept | Reject or Loop |
Recognizes a regular language | Yes | No |
If you focus on the columns of this table, you’ll see that the last two correspond to the two different types of functions that can be generated. When f
accepts w
, the generated function accepts all strings passed to it. Since the set of all strings is a regular language, the generated function recognizes a regular language. When f
does not accept w
, the function accepts all strings that match the pattern 0n1n, but for all other strings it will either reject or loop. In other words, it recognizes a non-regular language. Therefore, our implementation fulfills the contract we specified earlier. If we come back to our unit tests and replace the mocks with nonRegularOrDelegate
in all cases, they should all pass.
Contradiction
Let’s recap what we’ve done so far: we are assuming that there is a decider for regular languages out there in the world. We imported that decider and, with just a few lines of simple code around it, we managed to implement a function that can tell wether another function accepts a given string.
But here’s the contradiction: a function like UniversalDecider
does not, and cannot, exist (to understand why, wait for the second part of this post). How is this possible? We have certainly built this function. Its implementation is clear and simple. There are even unit tests to prove that it works, in case there were any doubts. So there remains only one possibility: our assumption that there exists a decider for regular languages is false: it’s impossible to implement a function that can tell whether another function recognizes a regular language!
This result can be disheartening, because it shows some computational limits involving things that are as familiar as regular expressions. On the other hand, there is a sort of logical beauty to this construction that fascinates me every time I think about it. The complete source code is here.
-
Note about convention: we are only interested in boolean functions in this article. So, every time you see a mention to a function, I implictly mean a function that returns
Boolean
. Also, for simplicity, strings can only contain the characters'0'
or'1'
. ↩