Antranig Basman antranig.basman at colorado.edu
Wed Jun 1 01:02:09 UTC 2011

Hi there - Justin has asked me to review the pull branch at
https://github.com/fluid-project/infusion/pull/22 - in particular the regexp which on line 67 of 
InlineEditIntegrations.js has been changed from /\<(\S+)[^\>\s]*\>/g to /\<([a-z0-9A-Z\/]+)\>/g to head off 
a JSLint error.

 From what I can see, the error JSLint is reporting is

Problem at line 67 character 38: Insecure '^'.

This seems like a pretty subtle issue... hunting around, I find this fairly clear explanation at 
stackOverflow of what Crockford's intent might have been when issuing this error.


I can think of a few ways of approaching this issue...

Firstly, I think an important angle is that this regular expression is being used for canonicalization, 
rather than for validation. The purpose of the "normalizeHTML" function is to fold different expressions of 
essentially the same markup into a canonical string form, so that it is possible to tell whether, by 
operating the rich text control, the user really made a substantial change to the markup or not, by simply 
comparing the mapped string for equality. Or more accurately, the user experience we want is NOT for the 
user to open the rich text control, issue a "save" without making any changes whatsoever, but for the 
control to have considered there has been a change.

So - it is not necessary for the canonicalizer to be particularly "aggressive", or accurate - but we need to 
make sure that there are no false positives of the kind just described.

Carrying on, the particular regexp that was put there was not terribly "well-attested" - this falls into 
that grey area of functionality that is very hard to address with unit tests since it involves actually 
taking the particular rich text control in question through its lifecycle and seeing what happens. That 
said, it is not certainly impossible to construct a test for this function as it is for focus/blur issues, 
just probably rather difficult. The "acid test" for this regexp consists of taking our "demo" page for the 
rich text control through the cycle of triggering edit, saving immediately, and seeing if the "undo" widget 
appears or not.

Some further things we could say about the regexp - if its domain were intended to be all of XML, it would 
be no good - http://www.w3.org/TR/2004/REC-xml11-20040204/ - there are numerous characters in "NameChar" and 
"NameStartChar" that aren't covered. However, given it is only intended to operate on the standard dialect 
of HTML tags, it is probably good enough - http://www.w3.org/TR/html401/sgml/dtd.html .

Another thing to say about the regexp.... it looks like it is trying to "do that thing which should never be 
done", which is, "to parse HTML using a regexp" -

All we can say here is to fall back on the role of the regexp precisely - just as it is not intended for 
"validation", it is also not intended for "parsing" - all we need is "good enough canonicalisation for this 
purpose" which I think that Cindy's regexp achieves - so long as it passes the manual test. This concept of 
"good enough" does depend on the exact rich text controls that we are supporting - the task may have got 
easier since we ditched FCKEditor.

The final thing we could say about the regexp - this dates from an era where our community standards for 
these kinds of things were lower than they are now, otherwise the author (in this instance, me) would have 
put a comment linking to a JIRA explaining at least some of these issues which are listed above :)

The bottom line - since the regexp is used for canonicalisation rather than validation, I don't think 
Crockford's warning needs to be taken seriously - however, we may as well get rid of it by rewriting the 
expression if we can, as long as we can verify that the expression is still good for its original purpose by 
running manual tests on our integrations demo for the rich text widgets we are intending to support.

More information about the fluid-work mailing list