Not goodbye, I’ll be seeing you guys around

How important is a great environment, friends-wise and work-wise, to one’s general happiness? Very important, actually. For the past year, and especially the last six months, I’ve been fortunate enough to be in such an environment. It’s been intellectually stimulating, it’s been fun, it’s how it should be, really. But somehow all good things come to an end.

So, to Zeff, Fadzli (merci beaucoup for the tribute), Ady, Haidir, Lan, Mahzan, Ras, Jet, Hazman: thanks to the amazing instrument known as the blog, I’ll be seeing you guys around. To Janice, Shoba, Adeline: God bless you girls. To Ming Hai and Kenson: all the best with Redux! To Tang WK: you’re the best manager + tech lead, ever; thanks for everything. To J: we hope you stay in Malaysia forever! To all the rest of you in MTM: sweet dreams, and, er, don’t forget your evacuation routes during the next fire drill.

29 August 2008 | Uncategorized | 8 Comments

To Kill A Mockingbird by Harper Lee

To Kill A Mockingbird is one of the great classics of American literature. It’s about justice and standing up for what’s right in the face of overwhelming opposition, told from the point of view of a little girl. As you might have guessed, it’s set in the early 1940’s when segregation was still the norm.

So what does it mean to kill a mockingbird?

When he gave us our air-rifles Atticus wouldn’t teach us to shoot. Uncle Jack instructed us in the rudiments thereof; he said Atticus wasn’t interested in guns. Atticus said to Jem one day, “I’d rather you shot at tin cans in the back yard, but I know you’ll go after birds. Shoot all the bluejays you want, if you can hit ‘em, but remember it’s a sin to kill a mockingbird.”

That was the only time I ever heard Atticus say it was a sin to do something, and I asked Miss Maudie about it.

“Your father’s right,” she said. “Mockingbirds don’t do one thing but make music for us to enjoy. They don’t eat up people’s gardens, don’t nest in corncribs, they don’t do one thing but sing their hearts out for us. That’s why it’s a sin to kill a mockingbird.”

My favourite passage:

“Well, most folks seem to think they’re right and you’re wrong …”

“They’re certainly entitled to think that, and they’re entitled to full respect for their opinions,” said Atticus, “but before I can live with other folks I’ve got to live with myself. The one thing that doesn’t abide by majority rule is a person’s conscience.”

28 August 2008 | Book review | 1 Comment

Bug in SgmlReader

Chris Lovett of Microsoft wrote SgmlReader 1.7 and has kindly shared it with the world. What does it do? In his own words:

An XmlReader implementation for loading SGML (including HTML) converting it to well formed XML, by adding missing quotes, empty attribute values, ignoring duplicate attributes, case folding on tag names, adding missing closing tags based on SGML DTD information, and so on.

It derives from XmlReader so you could pass an SgmlReader object to any class that consumes XmlReader, e.g., XPathDocument.

For my current project at the office, the SgmlReader is used to parse HTML, and an XSL template is then applied to the generated XML. Everything was ok during smoke testing but we got thrown an intermittent System.NullReferenceException ("Object reference not set to an instance of an object") exception in QA. After some hours I found out that it's due to boolean HTML attributes in minimized form, e.g. noshade in the following:

HTML:
  1. <HR width="100%" color=#000000 noShade SIZE=1>

Original code in SgmlReader.cs:

C#:
  1. public override string Value {
  2.     get {
  3.         if (this.state == State.Attr || this.state == State.AttrValue) {
  4.             return this.a.Value;
  5.         }
  6.         return this.node.Value;
  7.     }
  8. }

My solution (explanation in comments):

C#:
  1. public override string Value {
  2.     get {
  3.         if (this.state == State.Attr || this.state == State.AttrValue) {
  4.             // if this.a.Value is null, a NullReferenceException will be thrown
  5.             // this.a.Value will be null if HTML boolean attributes appear in minimized form
  6.             // full list of HTML boolean attributes: compact, nowrap, ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer
  7.             // so now we set the value to be the name if the value is null
  8.             // in other words, we transform the minimized form to the non-minimized form.
  9.             return (this.a.Value != null ? this.a.Value : this.a.Name);
  10.         }
  11.         return this.node.Value;
  12.     }
  13. }

26 August 2008 | .NET, C#, ASP.NET | No Comments

Movica: WMV Editor

I wanted to help my friend upload a video to YouTube, but his video was too long (more than YouTube's 10-minute limit). So I was searching for something that could extract sections from a WMV file and found Movica. I don't know whether it's the best but it worked for me. What I like about it is the small size of its installer, and that it's open source. Interestingly it was developed using VB.NET.

21 August 2008 | Applications | No Comments

C# Hidden Danger #2: GetHashCode()

Suppose you have an address book saved as a CSV text file or an XML file. You would like to create a program that loads the address book into memory, and return the contact details of a person, given the name.

There are a number of data structures from which to choose: array, ArrayList, List<T>, Hashtable, etc. If there are only a few hundred entries in your address book, then you could choose any data structure that suits your fancy. But if the number of entries range in the hundreds of thousands, then you need to choose the best data structure and algorithm for fast retrieval. You certainly don't want to create a foreach loop to search through your list.

(Side note: List<T> has the Find(Predicate<T>) method, so you don't need to create an explicit foreach loop, but behind the scenes, your predicate is being called multiple times. To verify this, you can put a breakpoint on your predicate method in Visual Studio. My friend DiRN calls this "looping without looping").

Which is why hash tables were invented. Hash tables take advantage of the fact that array lookup is very fast, if you supply the index: by using the square bracket notation, the program would be able to go straight to the given memory location.

Going back to the address book example, if we have a function to transform the key (which is the name of the person) into an integer, then we would be able to use an array as our data structure. To store and retrieve the contact details of John Smith, we would simply apply this function to the string "John Smith", and use the integer returned by our function as the index of the array. In computer science, this type of function is called a hash function.

The problem with hash functions is hash collision, where the function returns the same integer for two or more different inputs. What do we do if our function above returns the same index for both John Smith and Ramasamy Chandran? There are two solutions to this: chaining and open addressing. By chaining, each element in the array is a linked list, so that a single array location can contain more than one piece of data. Open addressing involves using alternate locations in the array. All in all, it's quite messy.

Fortunately the Hashtable class does all the work for us. Straight out of the box, Hashtable is very simple to use:

C#:
  1. using System.Collections;
  2.  
  3. // insert items into the Hashtable:
  4. Hashtable ht = new Hashtable;
  5. ht.Add(key1, obj1);
  6. ht.Add(key2, obj2);
  7. // etc.
  8.  
  9. // retrieval:
  10. object obj42 = ht[key42];

Here's the catch: for best performance, you need to override the default implementation of GetHashCode(), i.e., the hash function. The default implementation of GetHashCode() for reference types is correct but not necessarily efficient; for value types, it is not necessarily efficient and might even be incorrect. (Even then, overriding Object.operator==() can break Object.GetHashCode(). ValueType.GetHashCode() is correct only for structs whose first field is read-only.)

So, what's the best hash function? Well, if one were to exist, surely it would already be included in the default implementation of GetHashCode(). Constructing a correct and efficient hash function requires extensive knowledge of the type. The root object does not have this advantage, which is why the default implementation is naive and simplistic.

In conclusion, even without overriding the GetHashCode() method of your key object, the Hashtable gives you the fastest retrieval [1], as long as the following conditions are met:

1. If your key is a reference type, ensure that you did not override the == operator.
2. If your key is a value type, ensure that the first field in your struct is read-only.

Otherwise you need to override the default implementation of GetHashCode().

Notes

[1] "The Hashtable and any related blog entries is provided 'as is' without warranty of any kind, either express or implied, including, without limitation, the implied warranties or merchantability, or fitness for a particular purpose. The entire risk arising out of use or performance of the Hashtable remains with you."

15 August 2008 | .NET, Software engineering, C# | 1 Comment