The usual syntax for regular expressions is not a regular language because it requires parentheses. The usual syntax for context free grammars/expressions is not context-free because it uses recursion using variable names.

Is there any fundamental reason for this, i.e., some kind of diagonalization/fixed point argument or is it just a coincidence? I think the Goedelian argument doesn't apply because they are weaker than Peano arithmetic?

Regular Expressions Cont'd 

There's an answer on cs.theory stackexchange but it seems like a cheat to me, but how do I formalize the sense in which it is a cheat?:

@maxsnew I think there's a limit to this, because parsing most programming languages is decidable - though Perl 5 and C++ are exceptions! See and

@maxsnew Feels like a coincidence. One could devise an alternative syntax for regex to eliminate parantheses, something like reverse Polish notation.


Recognizing a sequence of tokens is regular, but is *well-formed* RPN a regular language? All of the explanations of RPN use a stack...

@maxsnew Maybe this sounds like cheating, but you can certainly formulate a syntax for regular expressions that's regular, by imposing an interpretation that essentially ignores extra close parens and closes any that are open at the end of the expression. It's true that there are then many ways to specify a given regular expression, but that was already true, no?

@maxsnew ... and yes, you said "the usual syntax". apologies if this was already obvious.


I think this is cheating because there's no rule you could apply to unambiguously recover the regular expression. But I do feel my question is underspecified. I feel like coming up with the precise formulation of this question would as often happens lead quickly to the proof.

Regular Expressions Cont'd 

@maxsnew Sorry, I don't quite get their method. Do they assume all the bracket can be pushed to the beginning or the end of the expression? What about `b(aa)*b?

Regular Expressions Cont'd 


They're basically saying you can recognize well formed regexp by not counting the parens and instead considering open paren and closed parents as arbitrary tokens and then considering any mismatched parens to be implicitly left out either all the way on the left or right. Seems against the spirit because you still have to do a traversal afterwards to turn it into a tree.

Regular Expressions Cont'd 

@maxsnew Oh, yes, but they no longer identify the original regular languages? For example, if I leave the parenthesis of `b(aa)*b` to be `baa)*b`, it would be reconstructed as `(baa)*b`, not the original language. Am I understanding this correctly?

Regular Expressions Cont'd 

@maxsnew Oh yeah, I see what they mean. So you just create a regular expression that ignores bracket matching, and you then rematch all the bracket of all the expression with unmatched bracket trivially (at the back or at the front), so that everything in there would be corresponding to a regular language.

Regular Expressions Cont'd 

@maxsnew I think a more interesting problem might be: Does there exist a regular expression 𝑅 s.t. there is a bijection betweent he language of 𝑅: 𝐿(𝑅), and the set of all regular languages \(\{L(e) \mid e \in \mathrm{REG}\}\)

I think this is true, because all regular language is countable, and its equality is decidable, so we can simply enumerate all the unique regular expressions, and take 𝑅 to be 𝑎*, and map 𝑎ⁿ∈𝐿(𝑅)≜𝐿(𝑎*) to the 𝑛th regular expression.

Sign in to participate in the conversation

A Mastodon instance for programming language theorists and mathematicians. Or just anyone who wants to hang out.