Hacker News
Why you should not use (f)lex, yacc and bison (tomassetti.me)
37 points by quickthrower2 on June 16, 2021 | 57 comments


> flex uses a BSD license, while Bison uses the GPL. Bison waive the license for most generated parsers, but it can be a problem

Now, this is one of the most atrocious, handwavy kinds of FUD that I've had the displeasure of seeing. If the parsers generated by bison are GPL-free, where is the problem? Can you even name the problem? Or do you just see the word "GPL" and it scares you somehow?


I am guessing that the author is referencing organisations that are pathologically allergic to GPL, where the legal department might not accept the exception.

https://www.google.com/search?q=bison+gpl+exception


The problem is that you have to guess what the problem is, exactly.

The author might just as well put up a comparison chart:

    bison: PROBLEMATIC AND GPL
    ANTLR: does not have a problem*
    ----
    * With yearly consulting contract from our company
The message would be exactly the same.


"Incompetent legal dept" is not a bug in the product.


A quick search for "reentrant flex bison" gets me this:

https://stanislaw.github.io/2016/03/29/reentrant-parser-usin...

which in turn links this:

https://github.com/blynn/symple/tree/75aaea79141a18a234c94dc...

The magic flags are:

  %option reentrant
and

  %lex-param   { yyscan_t scanner }
  %parse-param { yyscan_t scanner }
  %parse-param { val_callback_t callback }
... so ... it seems a reentrant flex/bison parser is easily possible?


Glad to hear the ANTLR vs Yacc holy war (a.k.a. LL vs LR) is still going strong. This one holy war may actually be older than emacs vs vi.

Have the ANTLR guys dropped their Java credo, or are they still thinking they are going to convince any C developer with a .jar?


This is actually the first time I've even heard of ANTLR, which came about in the early 90's. But yacc/lex/vi/emacs, et al have been around since the 70's. The emacs vs. vi holy war was pretty much instant, but doomed every time one spoke of headless devices such as routers.


Antlr and LL(k) parsers in general are great when they fit the grammar you want to make well. I remember discovering it when it was called JavaCC, a nod to yacc.


Heh. At my grad school we had a pccts fan (antlr before java even existed) and he actually tried to paint the LR vs LL war as an American vs European thing (this was in France).


I wrote a parser for a DSL using ANTLR about ten-ish years ago, and it was an incredibly painful experience. There's very little freely-available documentation for ANTLR; most of it is just "buy my book". ANTLR doesn't know anything about the language it's targeting (not even Java), so if you name a rule, e.g., `while`, the generated code will be filled with errors, because it names the parser functions after the rules. The graphical tools for building parsers are cool, except that many advanced things can only be done by adding actual code to the parser rules, which the graphical tool does not support (and how could it?), so you're back at square one.


Both ANTLR and Yacc are painful to work with. I ended up writing my own parser generator instead. So much better!


Does that qualify as Yacc shaving?


Hahaha good one :-)


Is it open source?


Unfortunately not (I did it while working for a company).


I would have upvoted this article except for the fact that it popped up not one but two pop-up windows asking for my email address.


Both of which were ads for ANTLR services, which is what they provide.


On other articles (like this [1]), they put a box at the bottom where you can submit your email address. It's a far more respectful way to do it.

[1]: https://tomassetti.me/working-with-excel-in-python/


The title should have been 'Why you should use ANTLR and not use (f)lex, yacc and bison' as if ANTLR is the only alternative.


Previous discussion: https://news.ycombinator.com/item?id=22491536 (89 comments)


Oh, that was pretty recent. I’d have held off submitting if I had seen that.


I've found re2c + lemon to be a pretty good combination.

re2c has good unicode support -- takes a while to compile if attempting TR31 "Unicode Identifier And Pattern Syntax" though.

Haven't used lemon on anything too complicated, I basically just play with grammars for fun, but it works really well for my MiniScheme fork I like to poke at every once in a while.

My new shiny is Coco/R which is a dream to use (once you get up the learning curve) but hasn't really been maintained for ages.


For c/c++ just use lemon: https://en.m.wikipedia.org/wiki/Lemon_(parser_generator)

I think that lemon comes from the same author as sqlite. I used it like 15 years ago ;) but I still remember lemon as a high-quality piece of software (much better compared with flex, yacc, bison...)


This reads more like an ad for ANTLR than a criticism of flex&bison. Very few criticisms are offered, and most of them are quite handwavy.


The re-entrancy complaint seems legit if actually well founded, although I'd think that it probably says somewhere in Flex/bison docs that the generated parsers expect only to be entered from a single C process. It sounds to me like they were using Flex/Bison generated parsers inside of some huge messy corporate web codebase, probably for webscraping or some kind of semi-structured data cleaning, and the code clammed up in some way.


flex & bison use globals to report things per-rule. Of course they are not re-entrant. This is documented by virtue of the fact that those are globals.


And the fix is to not use globals. cf. my toplevel post about "%option reentrant", "%lex-param" and "%parse-param".

Globals are just the 80's setup, and flex/bison are supposed to be POSIX lex/yacc compatible, so I'm gonna count this as "shitty defaults due to compatibility with stone tablets".
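
To make the "don't use globals" point concrete, here is a minimal hand-written sketch in C. It is not flex output: `scanner_t`, `scanner_init` and `scanner_next` are hypothetical names invented for illustration. The idea is the same one the reentrant options rely on, namely that all scanner state lives in a struct the caller passes in, so two scanners can be interleaved freely.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical scanner state: everything a lex-style scanner would
   otherwise keep in globals (current position, line counter). */
typedef struct {
    const char *cursor;  /* next unread character */
    int line;            /* current line number, starting at 1 */
} scanner_t;

typedef enum { TOK_EOF, TOK_NUMBER, TOK_WORD, TOK_OTHER } token_kind;

void scanner_init(scanner_t *s, const char *input) {
    assert(s != NULL && input != NULL);
    s->cursor = input;
    s->line = 1;
}

/* Return the kind of the next token. Only *s is read or written,
   so any number of scanners can be active at the same time. */
token_kind scanner_next(scanner_t *s) {
    assert(s != NULL);
    /* Skip whitespace, counting newlines as we go. */
    while (isspace((unsigned char)*s->cursor)) {
        if (*s->cursor == '\n') s->line++;
        s->cursor++;
    }
    unsigned char c = (unsigned char)*s->cursor;
    if (c == '\0') return TOK_EOF;
    if (isdigit(c)) {
        while (isdigit((unsigned char)*s->cursor)) s->cursor++;
        return TOK_NUMBER;
    }
    if (isalpha(c)) {
        while (isalnum((unsigned char)*s->cursor)) s->cursor++;
        return TOK_WORD;
    }
    s->cursor++;  /* any other single character */
    return TOK_OTHER;
}
```

Two `scanner_t` values initialised on different inputs can be advanced alternately, or from different threads, without disturbing each other, because nothing is shared between them.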


Correct. It’s the default but it’s no longer necessary.

Flex has had some work done to push it towards being able to generate Go and Rust code too, though this work isn’t finished. In principle it is now possible to generate any Algol-style language.


I learned about the existence of Lex, Yacc, and Bison a bit more than 20 years ago while discussing the difficulties of debugging handwritten parsers with a friend.

But I was quickly discouraged by the complexity of using these tools; they are more like a new language that you have to learn.

Today I prefer writing parsers by hand, in C, like Fabrice Bellard, it is not super easy but manageable and I never encountered major issues.

I think that Lex/Yacc and similar tools are a good illustration of the power and shortcomings of metaprogramming.

There are some obvious use cases, but it is not always the best choice.


Same experience here; recursive descent parser advantages:

- It's just normal code; you don't have to go learn a whole 'nother thing

- Easy and fun to write (ymmv)

- Does not add extra dependencies to your build system (the time I've wasted getting the right version of yacc installed just to handle software that parses some trivial grammar...)

- You can easily deliver really good error messages to your users (they love this!)

Disadvantages:

- If you're not careful, you can accidentally end up w/ some sort of weird frankenstein grammar
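
For readers who have not seen one, here is a small sketch of a recursive descent parser in C, with one function per grammar rule and an error message that names the column. The grammar, `parser_t`, `evaluate` and the message wording are all made up for illustration, and integer overflow is ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical grammar for this sketch:
     expr   := term   (('+' | '-') term)*
     term   := factor (('*' | '/') factor)*
     factor := NUMBER | '(' expr ')'          */
typedef struct {
    const char *start;  /* beginning of the input, for column numbers */
    const char *p;      /* next unread character */
    char error[80];     /* empty string means "no error so far" */
} parser_t;

static void skip_spaces(parser_t *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* Record only the first error, with a 1-based column. */
static void fail(parser_t *ps, const char *expected) {
    if (ps->error[0] != '\0') return;
    snprintf(ps->error, sizeof ps->error, "column %d: expected %s",
             (int)(ps->p - ps->start) + 1, expected);
}

static long parse_expr(parser_t *ps);

static long parse_factor(parser_t *ps) {
    skip_spaces(ps);
    if (*ps->p == '(') {
        ps->p++;
        long v = parse_expr(ps);
        skip_spaces(ps);
        if (*ps->p == ')') ps->p++;
        else fail(ps, "')'");
        return v;
    }
    if (isdigit((unsigned char)*ps->p)) {
        long v = 0;
        while (isdigit((unsigned char)*ps->p)) v = v * 10 + (*ps->p++ - '0');
        return v;
    }
    fail(ps, "a number or '('");
    return 0;
}

static long parse_term(parser_t *ps) {
    long v = parse_factor(ps);
    for (;;) {
        skip_spaces(ps);
        char op = *ps->p;
        if (op != '*' && op != '/') return v;
        ps->p++;
        long rhs = parse_factor(ps);
        if (op == '*') v *= rhs;
        else if (rhs != 0) v /= rhs;
        else fail(ps, "a non-zero divisor");
    }
}

static long parse_expr(parser_t *ps) {
    long v = parse_term(ps);
    for (;;) {
        skip_spaces(ps);
        char op = *ps->p;
        if (op != '+' && op != '-') return v;
        ps->p++;
        long rhs = parse_term(ps);
        v = (op == '+') ? v + rhs : v - rhs;
    }
}

/* Returns 1 and stores the value in *out on success; returns 0 on
   failure, leaving a message in ps->error. */
int evaluate(parser_t *ps, const char *text, long *out) {
    assert(ps != NULL && text != NULL && out != NULL);
    ps->start = ps->p = text;
    ps->error[0] = '\0';
    *out = parse_expr(ps);
    skip_spaces(ps);
    if (*ps->p != '\0') fail(ps, "end of input");
    return ps->error[0] == '\0';
}
```

With this sketch, `evaluate(&ps, "1 + 2 * (3 + 4)", &v)` succeeds with `v == 15`, while the truncated input `"2 * (3 + "` fails and leaves a message pointing at column 10. Tailoring that message further is ordinary C, which is the "good error messages" advantage listed above.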


How easy is it to make quick changes to hand-written parsers?

Imagine, for example, that the input being parsed is not under the author's control and it suddenly changes. The parser must be changed.


The input changing shouldn't matter. Maybe you could see a performance issue if you had to parse a really large file/files. The issue would be if the grammar of what you are parsing changes. If that changes, then no parser will work correctly until it supports the new grammar.

As a practical matter the grammar can act like a contract between the two systems. You don't have control of the input, but the input must follow the same grammar.


Does the separation of the grammar into its own file, like a .y file for yacc, make it easier to modify the grammar at a later point? Similarly, is it easier to modify a parser later on, if the input changes, if the patterns to be searched in the input and the state machine are stored in a separate file, like an .l file for flex?


I would think so.

I don't work on parsers and grammar day to day, so I might not be the best person to ask about this. I am interested in the subject though. I just don't get to work on this in my current day job.


It is super easy to change. It is just normal code.


Why then do people use programs like ANTLR, Ragel, Lemon/re2c, etc.?


Probably because they don’t know how to write parsers by hand and assume it is hard?


Yep same here. It is really easy to write parsers by hand if you know what you are doing.


I used ANTLR to build a query language parser and interpreter for a past employer. They had a specialized use case. ANTLR was pretty pleasant to use. The solution ended up being a pretty small amount of code and enabled much more powerful searching in their internal data viewing tool. I still think fondly about the project from time to time, partly because it was the one project where my fellow devs were just as happy with the tool as the non-dev users were (this was a very common tool for both devs and non-devs to use).


The IDE and tree parser modes are pretty nice!


It's 2021, why not use parsing expression grammar?


Perhaps because you want LALR or other techniques?

Can't remember why but my Compilers teacher (Adrian Johnstone, Royal Holloway) doesn't like PEGs - I think due to the parse forest not being generated?


I use PEGs all the time. Works great.


(I maintain Bison.)

While I definitely agree that ANTLR is a great tool, I feel that many of the criticisms of Bison are somewhat unfair.

First of all, many people like to tie Flex and Bison together, and that's wrong. You don't have to use Flex to feed tokens to your Bison parser. And my personal opinion is that Flex is badly maintained. Releases are infrequent, and issues keep on stacking. So please, stop putting Flex and Bison in the same bucket: they _can_ be used together, but they are not maintained by the same people.

Second, many people seem to not know that Bison is way more than YACC. To name just a few features of Bison

- it goes way beyond LALR(1): IELR(1), canonical LR(1), and GLR are supported

- it generates parsers in C, C++, Java and D

- it supports push and pull parsing

- it supports customized error message generation

- it can generate explicit counterexamples about conflicts (see https://en.wikipedia.org/wiki/GNU_Bison#Counterexample_gener... to see examples)

- it perfectly supports reentrant parsers and it comes with several examples of such parsers (see https://github.com/akimd/bison/tree/master/examples/c)

- and way more.
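
For anyone unfamiliar with the push/pull distinction in the list above, here is a hand-rolled sketch in C. It is not Bison-generated code and does not use Bison's interface (see the Bison manual for that); `sum_parser`, `sum_parser_push`, `sum_parser_pull` and `token_array` are hypothetical names. In push mode the caller owns the loop and hands over one token at a time; in pull mode the parser owns the loop and asks a callback for tokens.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOK_NUMBER, TOK_PLUS, TOK_END } token_kind;
typedef enum { PUSH_MORE, PUSH_DONE, PUSH_ERROR } push_status;

/* Hypothetical parser for:  sum := NUMBER ('+' NUMBER)* END
   All state between calls is kept in this struct. */
typedef struct {
    int expect_number;  /* 1 if the next token must be a NUMBER */
    long total;         /* running sum of the numbers seen so far */
} sum_parser;

void sum_parser_init(sum_parser *ps) {
    assert(ps != NULL);
    ps->expect_number = 1;
    ps->total = 0;
}

/* Push interface: the caller feeds one token per call and may do
   other work (e.g. wait for more network data) between calls. */
push_status sum_parser_push(sum_parser *ps, token_kind kind, long value) {
    assert(ps != NULL);
    if (ps->expect_number) {
        if (kind != TOK_NUMBER) return PUSH_ERROR;
        ps->total += value;
        ps->expect_number = 0;
        return PUSH_MORE;
    }
    if (kind == TOK_PLUS) {
        ps->expect_number = 1;
        return PUSH_MORE;
    }
    return kind == TOK_END ? PUSH_DONE : PUSH_ERROR;
}

/* Pull interface built on top of push: the parser drives the loop
   and calls next() whenever it wants another token. */
typedef token_kind (*next_token_fn)(void *source, long *value);

push_status sum_parser_pull(sum_parser *ps, next_token_fn next, void *source) {
    push_status st;
    do {
        long value = 0;
        token_kind kind = next(source, &value);
        st = sum_parser_push(ps, kind, value);
    } while (st == PUSH_MORE);
    return st;
}

/* Example token source for pull mode: a pre-lexed array that the
   caller must terminate with TOK_END. */
typedef struct {
    const token_kind *kinds;
    const long *values;
    size_t pos;
} token_array;

token_kind token_array_next(void *source, long *value) {
    token_array *a = source;
    *value = a->values[a->pos];
    return a->kinds[a->pos++];
}
```

Push mode tends to suit event-driven code where tokens arrive in pieces, while pull mode is the classic yacc-style arrangement in which the parser calls the lexer.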

The original article says:

> Flex and Bison are very much stable software. They are maintained, but development of new features is limited, if not absent.

Before emitting such strong statements about Bison, please at least look at its current state and at the NEWS file (https://git.savannah.gnu.org/cgit/bison.git/tree/NEWS). There were two major releases in 2018, three in 2019 and two in 2020. That's 29 releases in four years if you also count the bug fix releases. Therefore "development of new features is limited, if not absent" is pretty much false. Did the author actually study Bison before writing all this?


If you push out so many grammars that writing lexers and parsers by hand is more work than learning tools for that, you will soon have all kinds of other problems.


The piece is written by a consulting company who by design are probably pushing out a lot of grammars but to different systems and companies.


Reentrancy is a problem for yacc. Bison can generate reentrant parsers

https://www.gnu.org/software/bison/manual/bison.html#Pure-De...


Seems odd that the author uses grandiose titles like "The Story of Lex" and "The Story of Yacc" for a few paragraphs of unsubstantiated history. This type of exaggeration makes me skeptical about the accuracy of their claims.


Every person who has compiled a Linux kernel by first typing "make menuconfig" has used a program created with flex and bison.


ANTLR doesn’t support C or C++. But they’ll use an obscure C#.


Obscure? C#?

Lots of people have opinions on the language (or Microsoft themselves) and it may not be up there with C++, Java, and Javascript in terms of adoption.

However it's 20+ years old, massively used in enterprise, powers sites like StackOverflow, and is most definitely (by language rankings and surveys) a top tier ecosystem - often in the top 5.


Many embedded systems could not use C++ namely because of lack of real-time capability caused by too-long pauses made by C++ garbage memory collection overhead.


(I presume you mean C#)

You're right about embedded systems. Most ecosystems that use a GC would struggle there, including DotNet.


Okay, so yes it is good to note that the GNU lexical generator toolkit does not generate unicode parsers and isn't re-entrant. In general, parsing the same text from multiple threads seems like it could probably lead to more problems than just the fact that the parser generator toolkit for compiler writers isn't re-entrant, but that sounds more like a problem for the 'design' of these folks' clients' codebases than anything else to me. It would be nice to see a simple LL(1) lexer/parser generator in C that handles unicode and re-entrancy, although again, the latter one seems like something of a complex concept to me. Having the library generate parsers which are stable across multiple threads using generated parsers to parse different things at once seems like the only stable notion that I'd care about, so I will assume that is what this fellow is talking about.


Consider any server side query system. Parsing those queries in only one thread can be prohibitive.


Too many pop-ups; I left the site.


Here's Outline's version: https://www.outline.com/gEuVje



