In 2007, I created an event registration web application for my student association in PHP. It included a custom markup language to create HTML forms. Since that component turned out to be highly reusable, I’m going to publish an improved version thereof.

Problem

The existing event registration script I replaced required the administrator to create a new MySQL table for each event, modify a PHP template and write some HTML. I was the person responsible for IT and since only a few people were able to figure out how the system worked, all requests to create a new registration form ended up with me. Being an ambitious programmer, I decided eliminate that mundane process once and for all, by writing a replacement which would of course take me more time than it can ever save me.

Flexibility

The ground work was straight forward: define a reusable database schema, implement email verification, create a way to change and delete registrations etc. But there is a reason the old process was so complicated. For many events, additional information like food preferences or existing public transportation tickets have to be collected. Of course those additional attributes don’t all require their own table column (like done in the old solution) but can be serialized and stored as one single string. However, how can an administrator define those attributes without writing any HTML or knowing the inner workings of the registration system?

The Form Markup Language

I decided to create my own markup language which can be used to generate HTML forms.

type name(arguments) {
	option name(arguments);
	...
};

The syntax looks similar to a C function definition. Let’s call it a form input definition, or short FID. type defines what kind of html input to use. name will represent the name of the attribute inside the database. The arguments are optional and comma separated. The curly brackets are optional also. Note that the content inside the brackets are themselves FIDs (with the type “option”). Let’s define a simple form.

text name("Name", "Bob");

checkbox skills("Coding Skills") {
	option c("C");
	option php("PHP", true);
	option java("Java", true);
};

radio department("Department") {
	option other("other", true);
	option itet("D-ITET");
	option infk("D-INFK");
};

textarea comments("Comments");

The example defines attributes of type text, checkbox, radio and textarea. The first argument defines the name as shown on the rendered form. The second argument is usually an optional default value.

Note how the corresponding input is selected when clicking on the labels in the rendered form. This is achieved by correctly defining the form labels which is often neglected even on big websites. Using the form markup language, you can get this functionality for free.

Usage

The SJMForm class not only compiles the form markup language to HTML, it also provides some help with handling form submissions and putting data into the forms.

// allocate new form generator with form markup language
$form = new SJMForm($fml);
// get sanitized entered form data
$input = $form->preprocess($_POST);
// fill form with data from database
$form->set($data);
// compile form
echo $form;

After creating a SJMForm object, the preprocess member function can be used to get all the values with corresponding form keys from a POST or GET request. Checkboxes get special treatment, since unchecked boxes are ignored in the request. If the example form gets submitted without any additional user input, $input would look as follows:

{
	"name": "Bob",
	"skills_c": false,
	"skills_php": true,
	"skills_java": true,
	"department": "other",
	"comments": ""
}

The preprocessed data can then be saved in a database or put back into the form with the set method, e.g. to allow a user to edit a registration. The __toString member function handles the HTML conversion automatically when the object is converted to a string, which happens by calling echo.

In addition to the input types in the example, select produces a drop down menu and hidden an invisible input.

Implementation

The main design goal of SJMForm was for it to work and not take too much time from the main project. Hence it produces unpredictable output for invalid markup and doesn’t provide any helpful error messages. While this can be fine when used in a single small project, it should be addressed as soon as more developer start using it. Otherwise people start to mistake bugs for features and future versions have to conform to an unreasonable amount of special cases.

Instead of a proper parser, I use regular expressions. Unfortunately, they can’t be used to parse a non-regular language. But using a couple of dirty tricks, I still got what I wanted. I’m using the expression (\w+)\s+(\w+)\s*$(.*?)$\s*(\{(.*?)\}\s*)?; to match type name (random_string) {random_string}; where random_string is non-greedy (as short as possible) and the last part with the curly brackets is optional. The part inside the brackets is then again searched with the same expression to find the options. Random content between two expressions is simply ignored.

My first implementation did pretty much just that. Unfortunately, it had problems with (argument) strings containing ", ) or }. I solved that problem by extracting all string literals in advance and replacing them with simple keywords (think str4). To find all strings, I use the expression "(.*?)(?<!\\)" which uses a “negative lookbehind” to allow using " within a string when escaped with \". (I skipped the feature to allow using the actual sequence \" in a string by escaping the escape sign.) The value of those keywords can then be substituted back when needed. That way I regain control over all special characters in my markup.

The definitions inside the curly brackets are currently not allowed to have their own curly brackets. This could probably be achieved by using recursive patterns. But since our options don’t need them, why even bother?

{
	"type": "text",
	"name": "name",
	"args": ["Name", "Bob"],
	"childNodes": []
}

The parser produces a syntax tree in form of an array of dictionaries. The optional key childNodes contains the same structures for the option definitions. The parser is relatively short (around 50 lines) and could potentially be used for similar projects.

At this point I want to note that I took a compiler design class and know that this is not how parsing should be done.

To compile the form to HTML, the syntax tree is traversed and the HTML producer for the corresponding type gets called for each node. The set function simply writes the provided values into the syntax tree at the right place.

The source code is available on GitHub. It contains some more sample code to demonstrate all different input types. There is also an interactive demo to play around with. Let me know if and how you use my class so I know whether I should put some more work into the code and documentation.