Monday, February 28, 2011

lex_ing and regex_ing

During engineering, this topic came and went without bothering me much.. just a small assignment on parsers and done! But.. no.. the practical exam threw a problem in my face: design a parser in C by hand (not with lex/yacc or any other generator). And then I realized - well.. I had not understood it!

This was 6+ years back.. and now here I am.. writing regexes and grammars to generate parsers with lex-yacc! That's life! :D ... And... it's very good!

Just came across a nice tutorial on lex-yacc. Find it here. Even if you set the technical details aside and just glance through the tutorial, you will get a new outlook on lex-yacc - especially on regexes - writing them and debugging them when they don't behave the way you expect them to..!

A very good and important thing I realized during this exercise: the work of a parser can easily be split between lex and yacc when you know your application well. Taking an example from the tutorial - a simplified regex serves your purpose when you only want to capture occurrences of a pattern, and you need a somewhat more complex regex when you want to validate input.

(Again from the tutorial,) if you want to list the html tags in crawled web pages, all you need is a simple pattern that captures anything between a "<" and a ">". You need not __validate__ that __anything__. The basic assumption here - that you expect mostly positive cases (meaning input that correctly matches the pattern) - comes from the application itself, i.e. capturing tags from a web crawl. In an application like, say, writing a SQL kind of language, negative cases are as likely as positive ones, and just capturing __anything__ would not work.

Again, I would say it depends on how you look at it. You could always write a lightweight regex and comparatively heavy code in the back end to validate the loosely matched input. It's up to you! But understanding the internals of how a regex engine works (covered in the tutorial for many cases), and mainly understanding the complexity of the regex and contrasting it with the complexity you would introduce by writing the validation in code - that is definitely important.
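To make the capture-vs-validate contrast concrete, here is a minimal sketch in Python (not lex, and not taken from the tutorial). The names `CAPTURE_TAG`, `VALID_TAG`, `capture_tags` and `is_valid_tag` are made up for illustration, and the "valid tag" rule is a deliberately simplified assumption, nowhere near real html validation.

```python
import re

# Loose "capture" pattern: anything between "<" and ">", no validation at all.
CAPTURE_TAG = re.compile(r"<[^>]*>")

# Stricter "validate" pattern (simplified assumption, not real html rules):
# a tag name, optional attributes with double-quoted values, optional "/".
VALID_TAG = re.compile(
    r'</?[A-Za-z][A-Za-z0-9]*(?:\s+[A-Za-z-]+="[^"]*")*\s*/?>'
)


def capture_tags(text):
    """Return every '<...>' span found in text, valid or not."""
    return CAPTURE_TAG.findall(text)


def is_valid_tag(tag):
    """Check one captured span against the stricter pattern."""
    return VALID_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    page = '<p>Hi <b>there</b> <a href="x">link</a> <3 bad>'
    for tag in capture_tags(page):
        print(tag, "valid" if is_valid_tag(tag) else "not valid")
```

The loose pattern is short and easy to debug, but it happily picks up `<3 bad>`. The stricter one rejects it, at the price of a regex that is already harder to read. Whether that second step lives in a regex or in plain code is exactly the trade-off described above.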

Indeed, a nice bunch of information!
