Bug in SgmlReader

Chris Lovett of Microsoft wrote SgmlReader 1.7 and has kindly shared it with the world. What does it do? In his own words:

An XmlReader implementation for loading SGML (including HTML) converting it to well formed XML, by adding missing quotes, empty attribute values, ignoring duplicate attributes, case folding on tag names, adding missing closing tags based on SGML DTD information, and so on.

It derives from XmlReader so you could pass an SgmlReader object to any class that consumes XmlReader, e.g., XPathDocument.

For my current project at the office, the SgmlReader is used to parse HTML, and an XSL template is then applied to the generated XML. Everything was ok during smoke testing but we got thrown an intermittent System.NullReferenceException ("Object reference not set to an instance of an object") exception in QA. After some hours I found out that it's due to boolean HTML attributes in minimized form, e.g. noshade in the following:

HTML:
  1. <HR width="100%" color=#000000 noShade SIZE=1>

Original code in SgmlReader.cs:

C#:
  1. public override string Value {
  2.     get {
  3.         if (this.state == State.Attr || this.state == State.AttrValue) {
  4.             return this.a.Value;
  5.         }
  6.         return this.node.Value;
  7.     }
  8. }

My solution (explanation in comments):

C#:
  1. public override string Value {
  2.     get {
  3.         if (this.state == State.Attr || this.state == State.AttrValue) {
  4.             // if this.a.Value is null, a NullReferenceException will be thrown
  5.             // this.a.Value will be null if HTML boolean attributes appear in minimized form
  6.             // full list of HTML boolean attributes: compact, nowrap, ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer
  7.             // so now we set the value to be the name if the value is null
  8.             // in other words, we transform the minimized form to the non-minimized form.
  9.             return (this.a.Value != null ? this.a.Value : this.a.Name);
  10.         }
  11.         return this.node.Value;
  12.     }
  13. }

26 August 2008 | .NET, C#, ASP.NET | No Comments

Movica: WMV Editor

I wanted to help my friend upload a video to YouTube, but his video was too long (more than YouTube's 10-minute limit). So I was searching for something that could extract sections from a WMV file and found Movica. I don't know whether it's the best but it worked for me. What I like about it is the small size of its installer, and that it's open source. Interestingly it was developed using VB.NET.

21 August 2008 | Applications | No Comments

C# Hidden Danger #2: GetHashCode()

Suppose you have an address book saved as a CSV text file or an XML file. You would like to create a program that loads the address book into memory, and return the contact details of a person, given the name.

There are a number of data structures from which to choose: array, ArrayList, List<T>, Hashtable, etc. If there are only a few hundred entries in your address book, then you could choose any data structure that suits your fancy. But if the number of entries range in the hundreds of thousands, then you need to choose the best data structure and algorithm for fast retrieval. You certainly don't want to create a foreach loop to search through your list.

(Side note: List<T> has the Find(Predicate<T>) method, so you don't need to create an explicit foreach loop, but behind the scenes, your predicate is being called multiple times. To verify this, you can put a breakpoint on your predicate method in Visual Studio. My friend DiRN calls this "looping without looping").

Which is why hash tables were invented. Hash tables take advantage of the fact that array lookup is very fast, if you supply the index: by using the square bracket notation, the program would be able to go straight to the given memory location.

Going back to the address book example, if we have a function to transform the key (which is the name of the person) into an integer, then we would be able to use an array as our data structure. To store and retrieve the contact details of John Smith, we would simply apply this function to the string "John Smith", and use the integer returned by our function as the index of the array. In computer science, this type of function is called a hash function.

The problem with hash functions is hash collision, where the function returns the same integer for two or more different inputs. What do we do if our function above returns the same index for both John Smith and Ramasamy Chandran? There are two solutions to this: chaining and open addressing. By chaining, each element in the array is a linked list, so that a single array location can contain more than one piece of data. Open addressing involves using alternate locations in the array. All in all, it's quite messy.

Fortunately the Hashtable class does all the work for us. Straight out of the box, Hashtable is very simple to use:

C#:
  1. using System.Collections;
  2.  
  3. // insert items into the Hashtable:
  4. Hashtable ht = new Hashtable;
  5. ht.Add(key1, obj1);
  6. ht.Add(key2, obj2);
  7. // etc.
  8.  
  9. // retrieval:
  10. object obj42 = ht[key42];

Here's the catch: for best performance, you need to override the default implementation of GetHashCode(), i.e., the hash function. The default implementation of GetHashCode() for reference types is correct but not necessarily efficient; for value types, it is not necessarily efficient and might even be incorrect. (Even then, overriding Object.operator==() can break Object.GetHashCode(). ValueType.GetHashCode() is correct only for structs whose first field is read-only.)

So, what's the best hash function? Well, if one were to exist, surely it would already be included in the default implementation of GetHashCode(). Constructing a correct and efficient hash function requires extensive knowledge of the type. The root object does not have this advantage, which is why the default implementation is naive and simplistic.

In conclusion, even without overriding the GetHashCode() method of your key object, the Hashtable gives you the fastest retrieval [1], as long as the following conditions are met:

1. If your key is a reference type, ensure that you did not override the == operator.
2. If your key is a value type, ensure that the first field in your struct is read-only.

Otherwise you need to override the default implementation of GetHashCode().

Notes

[1] "The Hashtable and any related blog entries is provided 'as is' without warranty of any kind, either express or implied, including, without limitation, the implied warranties or merchantability, or fitness for a particular purpose. The entire risk arising out of use or performance of the Hashtable remains with you."

15 August 2008 | .NET, Software engineering, C# | 1 Comment

C# Hidden Danger #1: The const Keyword

Declaring a constant with the const keyword makes it a compile-time constant. In the generated IL, all references to a compile-time constant will be replaced by its actual value. Now suppose that a compile-time constant is declared in one assembly and referenced in other assemblies. If the definition of the constant ever needs to be changed, all affected assemblies need to be recompiled. Rebuilding and redistributing just the first assembly will not suffice.

Therefore, use the readonly keyword where possible, unless you are absolutely sure that the value will never change. The slight increase in performance afforded by the const keyword is not worth the inflexibility.

11 August 2008 | .NET, Software engineering, C# | 4 Comments

Just Produce

I've been seeing this advice over and over again: you want to come up with something good? Just produce.

Quantity Always Trumps Quality - [an experiment with two groups in a ceramics class] "It seems that while the 'quantity' group was busily churning out piles of work - and learning from their mistakes - the 'quality' group had sat theorizing about perfection, and in the end had little more to show for their efforts than grandiose theories and a pile of dead clay."

Wearing Out My Delete Key - "Trying to perfect an implementation in one's mind is a form of speculation ... [t]he elegant implementation you see when you read a great programmer's code is often the third or fourth try. The great programmer is often more effective because they can implement several solutions in the same amount of time it takes the average programmer to implement one."

Why I'm Not Working on My Startup (Yet) - comment - "It’s a numbers game. The more times you step up to the plate, the more home runs you will hit."

A corollary of just producing is that you absolutely cannot be a perfectionist, nor can you be too proud. Inevitably some of what you produce will simply suck. But just produce, and your occasional hits will more than make up for your lemons.

5 August 2008 | Uncategorized | 1 Comment