gnegg programming with passion

2Jul/101

stabilizing tempalias

While the maintenance last weekend brought quite a bit of stabilization to the tempalias service, I quickly noticed that it was still dying sooner or later. Before updating node, it died because it could not allocate more memory; this time, it died by simply not answering any requests any more.

A look at the error log quickly revealed quite a lot of exceptions complaining about a certain request type not being allowed to have a body and, finally, one complaining about not being able to open a file because the process had run out of file handles.

So I quickly improved error logging and restarted the daemon in order to get a stacktrace leading to these tons of exceptions.

Picture by L.G.Mills

This quickly pointed to paperboy, which was sending the file even if the request was a HEAD request. http.js in node checks for this and throws whenever you send a body when you should not. That exception then led to paperboy never closing the file (have I already complained how incredibly difficult it is to do proper exception handling the moment continuations get involved? I think not and I also think it's a good topic for another diary entry). With the help of lsof I quickly saw that my suspicions were true: the node process serving tempalias had tons of open handles to public/index.html.
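The shape of the fix can be sketched roughly like this. It is not paperboy's actual code nor the patch mentioned below; serveFile and openStream are hypothetical names, and openStream is assumed to return a readable stream such as the one fs.createReadStream() gives you.

```javascript
// Hypothetical static-file responder. `openStream` is an assumed helper that
// returns a readable stream (for example fs.createReadStream(path)).
function serveFile(req, res, headers, openStream) {
  res.writeHead(200, headers);

  // A HEAD response carries headers only, so the file is never opened.
  if (req.method === 'HEAD') {
    res.end();
    return null;
  }

  var stream = openStream();
  // If reading fails, release the file handle and finish the response.
  stream.on('error', function () {
    stream.destroy();
    res.end();
  });
  // If the connection closes (also mid-transfer), release the file handle.
  res.on('close', function () {
    stream.destroy();
  });
  stream.pipe(res);
  return stream;
}
```

The important part is that every path out of the function either never opens the file or makes sure the stream gets destroyed, so an exception or an aborted transfer cannot leave a handle behind.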

I sent a patch for this behavior to @felixge which was quickly applied, so that's fixed now. I hope it's of some use for other people too.

Now knowing that having a look at lsof now and then might be a good idea, I quickly found another problem: while the file handles were gone, I noticed tons and tons of SMTP sockets staying open in CLOSE_WAIT state. Not good, as that too will lead to handle starvation sooner or later.

On a hunch, I found out that connecting to the SMTP daemon and then disconnecting without sending QUIT (which would let the server disconnect) was what caused the lingering sockets. Clients disconnecting like that is very common when the server sends a 5xx response, which is what the tempalias daemon was designed to do.
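A socket sits in CLOSE_WAIT when the peer has closed its side and the local process has not closed its own yet. A minimal sketch of the remedy, assuming an object that behaves like node's net.Socket (closeWhenClientLeaves is a hypothetical name, not a function from the smtp library):

```javascript
// Attach to every accepted connection. `socket` is expected to behave like a
// net.Socket: it emits 'end' when the peer closes its side and has end().
function closeWhenClientLeaves(socket) {
  socket.on('end', function () {
    // The client closed the connection without sending QUIT; close our side
    // too so the descriptor does not linger in CLOSE_WAIT.
    socket.end();
  });
  socket.on('error', function () {
    // On errors such as ECONNRESET, drop the socket entirely.
    socket.destroy();
  });
}
```

As far as I know, current node versions end the socket automatically in this situation unless the server was created with allowHalfOpen set to true, so the explicit handler mostly matters for older versions or half-open setups.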

So I had to fix that in my fork of the node smtp daemon (the original upstream isn't interested in daemon functionality and the owner of the repository I forked the daemon from doesn't respond to my pull requests. Hence I'm maintaining my own fork for now).

Further looks at lsof prove that we are now quite stable in resource consumption: No lingering connections, no unclosed file handles.

But the error log was still filling up. This time it was something about removeListener needing a function. Thanks to the callstack I now had in my error log, I quickly hunted that one down and fixed it - that was a very stupid mistake. Thankfully, it rarely triggered, because the mails I usually deliver are small enough that socket draining usually wasn't required.

Onwards to the next issue filling the error log: «This deferred has already been resolved».

This comes from the Promise.js library if you emit*() multiple times on the same promise. This time, of course, the callstack was useless (... at <anonymous> - why, thank you), but I was very lucky again in that I tested from home and my mail relay didn't trust my home IP address and thus denied relaying with a 500 which immediately led to the exception.

Now, this one is crazy: When you call .addErrback() on a Promise before calling addCallback(), your callback will be executed no matter if the errback was executed first.

Promise.js does some really interesting things to simulate polymorphism in JavaScript and I really didn't want to fix up that library, as lately node.js itself seems to be moving to a simpler continuation style using a callback parameter. So sooner or later, I'll have to patch up the smtp server library anyways to remove Promise.js if I want to adhere to current node style.

So I took the workaround route by just using addCallback() before addErrback() even though the other order feels more natural to me. In addition, I reported an issue with the author as this is clearly unexpected behavior.

Now the error log is pretty much silent (minus some ECONNRESET exceptions due to clients sending RST packets in mid-transfer, but I think they are uncritical to resource consumption), so I hope the overall stability of the site has improved a bunch - I'd love not having to restart the daemon for more than a day :-)

19Apr/100

tempalias.com – rewrites

This is yet another installment in my series of posts about building a web service in node.js. The previous post is here.

Between the last post and current trunk of tempalias, there lie two substantial rewrites of core components of the service. One thing is that I completely misused Object.create() which takes an object to be the prototype of the object you are creating. I was of the wrong opinion that it works like Crockford's object.create() which is creating a clone of the object you are passing.

Also, I learned that only Function objects actually have a prototype.

Not knowing these two things made it impossible to actually deserialize the JSON representation of an alias that was previously stored in redis. This led to the first rewrite - this time of lib/tempalias.js. Now aliases work more like standard JS objects and need to be instantiated using the new operator. On the plus side though, they work as expected now.
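To illustrate the difference, here is a small sketch. It is not the code from lib/tempalias.js; the Alias constructor and its fields are made up for the example.

```javascript
// Object.create(proto) does not copy anything: it returns a new, empty object
// whose prototype is `proto`.
var template = { target: 'someone@example.com' };
var viaCreate = Object.create(template);

var ownKeys = Object.keys(viaCreate);        // no own properties
var inherited = viaCreate.target;            // found on the prototype
var serialized = JSON.stringify(viaCreate);  // serializes own properties only

// Constructor-based style: methods live on Alias.prototype and instances are
// rebuilt from stored JSON with `new`. `Alias` and its fields are hypothetical.
function Alias(data) {
  this.target = data.target;
  this.maxUsage = data.maxUsage;
}
Alias.prototype.isUsable = function () {
  return this.maxUsage > 0;
};
Alias.fromJSON = function (json) {
  return new Alias(JSON.parse(json));
};

var stored = JSON.stringify(new Alias({ target: 'a@example.com', maxUsage: 3 }));
var restored = Alias.fromJSON(stored);
```

Because the data of viaCreate lives on its prototype, serializing it yields an empty object, which is exactly the kind of surprise that breaks a store-and-restore round trip. The restored object, on the other hand, is a real Alias again and has its methods back.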

Speaking of serialization. I learned that in V8 (and Safari)

isNaN(Date.parse( (new Date()).toJSON() )) === true

which, according to the ES5 spec, is a bug. The spec states that Date.parse() should be able to parse a string created by Date.prototype.toISOString(), which is what toJSON uses.

This ended up with me doing an ugly hack (string replacement) and reporting a bug in Chrome (where the bug happens too).
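The actual hack isn't shown here, so the following is only one possible workaround and not the string replacement mentioned above: a small fallback parser (parseJsonDate is a hypothetical name) that understands exactly the UTC format toJSON() emits and nothing else.

```javascript
// Fallback parser for the output of Date.prototype.toJSON(), for example
// "2010-04-19T12:34:56.789Z". Only this UTC form is handled.
function parseJsonDate(text) {
  var m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(?:\.(\d{1,3}))?Z$/.exec(text);
  if (!m) {
    return null;
  }
  // Right-pad the fraction so that ".5" means 500 milliseconds.
  var ms = m[7] ? Number((m[7] + '00').slice(0, 3)) : 0;
  // Months are zero-based in Date.UTC.
  return new Date(Date.UTC(+m[1], +m[2] - 1, +m[3], +m[4], +m[5], +m[6], ms));
}
```

Going through Date.UTC sidesteps Date.parse() entirely, so the result does not depend on which date formats the engine's parser happens to accept.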

Anyhow. Friday and Saturday I took off the project, but today I was on it again. This time, I was looking into serving static content. This is how we are going to serve the web site after all.

Express does provide a Static plugin, but it's fairly limited in that it doesn't do any client side caching which, even though Node.js is crazy fast, seems imperative to me. Also while allowing you to configure the file system path it should serve static content from, it insists on the static content's URL being /public/whatever, where I would much rather have kept the URL-Space together.

I tried to add If-Modified-Since support to express' static plugin, but I hit some strange interaction in how express handles the HTTP request that caused some connections to never close - not what I want.
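The check behind If-Modified-Since itself is small. A minimal sketch, independent of express and paperboy (isNotModified is a hypothetical helper, not part of either library):

```javascript
// Returns true when the client's cached copy is still current.
// `headerValue` is the raw If-Modified-Since header (or undefined) and
// `mtime` is the file's modification time as a Date.
function isNotModified(headerValue, mtime) {
  if (!headerValue) {
    return false;
  }
  var since = Date.parse(headerValue);
  if (isNaN(since)) {
    return false; // unparsable header: serve the file normally
  }
  // HTTP dates have one-second resolution, so drop the milliseconds.
  var modified = Math.floor(mtime.getTime() / 1000) * 1000;
  return modified <= since;
}
```

When this returns true, the server can answer with a 304 and no body; otherwise it sends the file along with a Last-Modified header so the next request can be answered from the client's cache.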

After two hours of investigating, I was looking at a different solution, which leads us to rewrite two:

tempalias trunk now doesn't depend on express any more. Instead, it serves the web service part of the URL space manually and for all the static requests, it uses node-paperboy. paperboy doesn't try to convert node into Rails and it provides nothing but a simple static file handler for your web server which also works completely inside node's standard method for handling web requests.

I much prefer this solution because express was doing too much in some cases and too little in others: Express tries to somewhat imitate rails or any other web framework in that it not only provides request routing but also template rendering (in HAML and friends). It also abstracts away node's HTTP server module and it does so badly, as evidenced by this strange connection not-quite-ending problem.

On the other hand, it doesn't provide any help if you want to write something that doesn't return text/html.

Personally, if I'm doing a RESTful service anyways, I see no point in doing any server-side HTML generation. I'd much rather write a service that exposes an API at some URL endpoints and then also a static page that uses JavaScript / AJAX to consume said API. This is where express provides next to no help at all.

So if the question is whether to have a huge dependency which fails at some key points and doesn't provide any help with other key points, or to have a smaller dependency that handles the stuff I'm not interested in but otherwise doesn't interfere, I much prefer the second option.

This is why I went with this second rewrite.

Because I was already using a clean MVC separation (the "view" being the JSON I emit in the API - there's no view in the traditional sense yet), the rewrite was quite hassle-free and basically nothing but syntax work.

After completing that, I felt like removing the known issues from my blog post where I was writing about persistence: Alias generation is now race-free and alias length is stored in redis too. The architecture can still be improved in that I'm currently doing two requests to Redis per ALIAS I'm creating (SETNX and SET). By moving stuff around a little bit, I can get away with just the SETNX.
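The idea of getting away with a single SETNX can be sketched as follows. This is not the tempalias code: FakeStore is an in-memory stand-in that mimics the semantics of Redis SETNX (store only if the key does not exist, report 1 or 0), and createAlias, nextId and the aliases: key prefix are hypothetical.

```javascript
// In-memory stand-in with the same semantics as Redis SETNX: the value is
// stored only if the key does not exist yet; the callback receives 1 or 0.
function FakeStore() {
  this.data = {};
}
FakeStore.prototype.setnx = function (key, value, callback) {
  if (Object.prototype.hasOwnProperty.call(this.data, key)) {
    callback(null, 0);
    return;
  }
  this.data[key] = value;
  callback(null, 1);
};

// Try candidate IDs until one is claimed. Check and write are one atomic
// step, so two concurrent requests can never both claim the same ID.
function createAlias(store, nextId, alias, callback) {
  var id = nextId();
  store.setnx('aliases:' + id, JSON.stringify(alias), function (err, created) {
    if (err) {
      callback(err);
    } else if (created === 1) {
      callback(null, id);
    } else {
      createAlias(store, nextId, alias, callback); // collision: try again
    }
  });
}
```

Writing the serialized alias directly with SETNX makes the reservation and the storing the same operation, which is what removes the need for a second request.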

On the other hand, let me show you this picture here:

Screenshot of ab running in a terminal

Considering that the current solution is already creating 1546 aliases per second at a concurrency of 100 requests, I can probably get away without changing the alias creation code any more.

And in case you ask: The static content is served with 3000 requests per second - again with a concurrency of 100.

Node is fast.

Really.

Tomorrow: Philip learns CSS - I'm already dreading this final step to enlightenment: Creating the HTML/CSS front-end UI according to the awesome design provided by Richard.

16Apr/100

tempalias.com – the cake is a lie

This is another installment of my development diary for tempalias.com, a web service that will allow you to create self-destructing email aliases. You can read the previous one here.

This was a triumph.
I'm making a note here: HUGE SUCCESS.
It's hard to overstate my satisfaction.

I didn't post an update on Wednesday evening because it got very late and I just wanted to sleep. Today, it's late yet again, but I can gladly report that the backend service is now feature complete.

We are still missing the UI, but with a bit of curl on the command line, you can use the restful web service interface to create aliases and you can use the generated aliases to send email via the now completed SMTP proxy - including time and usage based expiration.

As a reminder: All the code (i.e. the completed backend) is available on my github repository, though keep in mind that there is no documentation whatsoever. That I will save for later when this is really going public. If you are brave, feel free to clone it.

You will need the trunk versions for both redis and node.

Screenshot of a terminal showing three consumptions of an alias and a fourth failing.

The screenshot is showing me consuming an alias four times in a row. Three times, I get the data back, the fourth time, it's gone.

The website itself is still in the process of being designed and I can promise you, it will be awesome. Richard's last design was simply mind-blowing. Unfortunately I can't show it here yet, because he used a non-free picture. Besides, we obviously can't use non-free artwork for a Free Software project.

So this update concerns itself with two days of work. What was going on?

On Wednesday, I wanted to complete the SMTP server, but before I went ahead doing so, I revised the server's design. At the end of the last posting here, we had a design where the SMTP proxy would connect to the smarthost the moment a client connects. It would then proceed to proxy through command by command, returning error messages as they are returned by the smarthost.

The issue with this design lies in the fact that tempalias.com is, by definition, not about sending mail, but about rejecting mail. This means that once it's up and running, the majority of mail deliveries will simply fail at the RCPT state.

From this perspective, it doesn't make sense to connect to the smarthost when a client connects. Instead, we should do the handshake up to and including the RCPT TO command, at which time we do the alias expansion. If that fails (which is the more likely case), we don't need to bother to connect to upstream but we can simply deny the recipient.

The consequence of course is that our RCPT TO can now return errors that happened during MAIL FROM on the upstream server. But as MAIL FROM usually only fails with a 5xx error, this isn't terribly wrong anyways - the saved resources far outweigh the not-so-perfect error messages.
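A rough sketch of that revised flow, under the assumption of two helpers that do not exist in any real library: lookupAlias(address, cb) resolves an alias to its real target (or null), and connectUpstream(cb) opens the smarthost connection and hands back an object with mailFrom() and rcptTo(). The reply codes are merely plausible choices.

```javascript
// Hypothetical session handler; all names here are assumptions made for
// illustration and not the tempalias or node-smtp API.
function createSession(lookupAlias, connectUpstream) {
  var sender = null;
  var upstream = null;

  return {
    // MAIL FROM is only remembered; nothing is sent upstream yet.
    mailFrom: function (address, reply) {
      sender = address;
      reply(250, 'OK');
    },
    // RCPT TO expands the alias first and connects only on success.
    rcptTo: function (address, reply) {
      lookupAlias(address, function (target) {
        if (!target) {
          reply(550, 'No such alias'); // the common case: no upstream needed
          return;
        }
        if (upstream) {
          upstream.rcptTo(target, reply);
          return;
        }
        connectUpstream(function (err, connection) {
          if (err) {
            reply(451, 'Upstream unavailable');
            return;
          }
          upstream = connection;
          // Replay the buffered MAIL FROM, then relay the rewritten RCPT TO.
          upstream.mailFrom(sender, function (code, text) {
            if (code >= 400) {
              reply(code, text); // a MAIL FROM error surfaces at RCPT TO
              return;
            }
            upstream.rcptTo(target, reply);
          });
        });
      });
    }
  };
}
```

The branch that forwards the upstream MAIL FROM failure as the answer to RCPT TO is the not-so-perfect error reporting described above; everything before it is the saving, because a rejected alias never causes a connection.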

Once I completed that design change, the next roadblock I ran into was the fact that both the smtp server and the smtp client libraries weren't quite as asynchronous as I would have wanted: The server was reading the complete mail from the client into memory and the client wanted the complete mail as a parameter to its data method.

That felt impractical to me as in the majority of cases, we won't get the whole mail at once, but we can certainly already begin to push it through to the smarthost, keeping memory usage of our smtp server as low as possible.

So my clone of the node SMTP library now contains support for asynchronous handling for DATA. The server fires data, data_available and data_end and the client provides startData(), sendData() and endData(). Of course the old functionality is still available, but the tempalias.com SMTP server is using the new interface.
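Wiring the two sides together could look something like the sketch below. The event and method names are the ones listed above, but their exact signatures are an assumption on my part: I'm assuming data fires once when the DATA phase begins, data_available fires per chunk with the chunk as its argument, and data_end fires once at the end. relayData itself is a hypothetical helper.

```javascript
// Pass mail content through chunk by chunk instead of buffering all of it.
// `server` is assumed to expose on(eventName, handler); `client` is assumed
// to expose startData(), sendData(chunk) and endData().
function relayData(server, client) {
  server.on('data', function () {
    client.startData();
  });
  server.on('data_available', function (chunk) {
    client.sendData(chunk);
  });
  server.on('data_end', function () {
    client.endData();
  });
}
```

With this shape, the proxy never holds more than one chunk of the message at a time, no matter how large the mail is.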

So, that was Wednesday's work:

  • connect to the smarthost only when it can no longer be avoided
  • completed the smtp server node library
  • made the smtp server and client libraries fully asynchronous
  • completed the SMTP proxy (but without alias expansion yet)

Before I went to bed, the SMTP server was accepting mail and sending it using the smarthost. It didn't do alias expansion yet but just rewrote the recipient to my private email address.

This is where I picked up Thursday night: The plan was to hook the alias model classes into the SMTP server to complete the functionality.

While doing that, I had one more architectural thing to clear: How to make sure that I can decrement the usage-counter race-free? Once that was settled, the rest was pure grunt work by just writing the needed code.
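One way to get a race-free decrement is to let the store do the arithmetic atomically and decide based on the returned value. The sketch below only illustrates that idea and is not necessarily how tempalias does it: FakeCounters is an in-memory stand-in mimicking Redis DECR, and consumeAlias and the usage: key prefix are hypothetical.

```javascript
// In-memory stand-in mimicking Redis DECR: decrement the value (a missing
// key counts as 0) and return the new value in one step.
function FakeCounters(initial) {
  this.values = initial;
}
FakeCounters.prototype.decr = function (key, callback) {
  this.values[key] = (this.values[key] || 0) - 1;
  callback(null, this.values[key]);
};

// Decrement first, decide afterwards: whoever sees a negative result knows
// the alias was already used up. There is no separate read-then-write step
// in which another delivery could sneak in.
function consumeAlias(store, id, callback) {
  store.decr('usage:' + id, function (err, remaining) {
    if (err) {
      callback(err);
      return;
    }
    callback(null, remaining >= 0);
  });
}
```

An alias created with three allowed uses would accept three deliveries and reject the fourth, which matches what the screenshot above shows.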

As this is getting long and as it's quite late again, I'm saving the post-mortem of this last task for tomorrow. You'll get a chance to learn about bugs in node, about redis' DECR command and finally you will get a chance to laugh at me for totally screwing up the usage of Object.create().

Stay tuned.

14Apr/102

tempalias.com – config file, SMTP cleanup, beginnings of a server

Welcome to the next installment of a series of blog posts about the creation of a new web service in node.js. The posts serve as a diary of how the development of the service proceeds and should give you some insight in how working with node.js feels right now. You can read the previous episode here.

Yesterday, I unfortunately didn't have a lot of time to commit to the project, so I chose a really small task to complete: create a configuration file, a configuration file parser and use both to actually configure how the application should behave.

The task in general was made a lot easier by the fact that (current) node contains a really simple parser for INI style configuration files. For the simple type of configuration data I have to handle, the INI format felt perfect and as I got a free parser with node itself, that's what I went with. So as of Monday it's possible to configure listening addresses and ports for both HTTP and SMTP daemons and additional settings for the SMTP part of the service.

Today I had more time.

The idea was to seriously look into the SMTP transmission. The general idea is that email sent to the tempalias.com domain will have to end up on the node server where the alias expansion is done and the email is prepared for final delivery.

While I strive to keep the service as self-contained as possible, I opted into forcing a smarthost to be present to do the actual mail delivery.

You see, mail delivery is a complicated task in general as you must deliver the mail and if you can't, you have to notify the sender. Reasons for a failing delivery can be permanent (that's easy - you just tell the sending server that there was a problem and you are done) or temporary. In case of many temporary errors you end up with the responsibility of needing to handle them.

Handling in case of temporary errors usually means: Keep the original email around in a queue and retry after the initial client has long disconnected. If you don't succeed for a reasonably large amount of delivery attempts or if a permanent problem creeps up, then you  have to bounce the message back to the initial sender.

If you want to do the final email delivery, so that your app runs without any other dependencies, then you will end up not only writing an SMTP server but also a queueing system, something that's way beyond the scope of simple alias resolution.

Even if I wanted to go through that hassle, it still wouldn't help much as aside of the purely technical hurdles, there are also others on a more meta level:

If you intend to do the final delivery nowadays, you practically need to have a valid PTR record, you need to be in good standing with the various RBL's, you need to handle SSL - the list goes on and on. Much of this is administrative in nature and might even create additional cost and is completely pointless considering the fact that you do usually have a dedicated smarthost around that takes your mail and does the final delivery. And even if you don't: Installing a local MTA for the queue handling is easily done and whatever you install, it'll be way more mature than what I could write in any reasonable amount of time.

So it's decided: The tempalias codebase will require a smarthost to be configured. As mine doesn't require authentication from a certain IP range, I can even get away without writing any SMTP authentication support.

Once that was clear, the next design decision was clear too: the tempalias smtp daemon should be a really, really thin layer around the smarthost. When a client connects to tempalias, we will connect to the smarthost (500ing (or maybe 400ing) out if we can't - remember: immediate and permanent errors are easy to handle). When a client sends MAIL FROM, just relay it to the smarthost, returning back to the client whatever we got - you get the idea: the tempalias mail daemon is an SMTP proxy.

This keeps the complexity low while still providing all the functionality we need (i.e. rewriting RCPT TO).

Once all of this was clear, I sat down and had a look at the node-smtp servers and clients and it was immediately clear that both need a lot of work to even do the simple thing I had in mind.

This means that most of today's work went into my fork of node-smtp:

  • made the hostname in the banner configurable
  • made the smtp client library work with node trunk
  • fire additional events (on close, on mail from, on rcpt to)
  • fixed various smaller bugs

Of course I have notified upstream of my changes - we'll see what they think about them.

On the other hand, the SMTP server part of tempalias (incidentally the first SMTP server I'm writing. ever) also took shape somewhat. It now correctly handles proxying from initial connection up until DATA. It doesn't do real alias expansion yet, but that's just a matter of hooking it into the backend model class I already have - for now I'm happy with it rewriting all passed recipients to my own email address for testing.

I already had a look at how node-smtp's daemon handles the DATA command and I have to say that gobbling up data into memory until the client stops sending data or we run out of heap isn't quite what I need, so tomorrow I will have to change node-smtp even more in that it fires events for every bit of data that was received. That way a consumer of the API can do some validation on the various chunks  (mostly size validation) and I can pass the data directly to the smarthost as it arrives.

This keeps memory usage of the node server small.

So that's what I'm going to do tomorrow.

On a different note, I had some thought going into actual deployment, which probably will end up with me setting up a reverse proxy after all, but this is a topic for another discussion.

12Apr/100

tempalias.com – SMTP and design

After being sick the end of last week, only today I found time and willpower to continue working on this little project of mine.

For people just coming to the series with this article: This is a development diary about the creation of a web service for autodestructing email addresses. Read the previous installment here.

The funny thing about the project is that people all around me seem to like the general idea behind the service. I even got some approval from Ebi (who generally dislikes everything that's new) and this evening I was having dinner with a former coworker of mine whom I know for doing kick-ass web design.

He too liked the idea of the project and I could con him into creating the screen design of tempalias.com. This is a really good thing as whatever Richard touches comes out beautiful and usable.

For example, he told me that it makes way more sense to just expose the valid-until date in the form of "Valid for x days" instead of asking the user to provide a real date. This is not only much clearer and easier to use, it also fixes a brewing timezone problem I had with my previous design:

Valid for "3 days from now" is 3 days from now wherever in the world you are. But valid until 2010-04-16 is different depending on where you are.
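In code, the relative form boils down to adding a duration to the current instant, which involves no timezone at all. A minimal sketch (expiresAt and isExpired are hypothetical names):

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Turn "valid for N days" into an absolute expiry instant. `now` is a Date
// or a millisecond timestamp; the result is a millisecond timestamp.
function expiresAt(now, days) {
  return Number(now) + days * MS_PER_DAY;
}

// An alias is expired once the current instant has reached its expiry.
function isExpired(expiry, now) {
  return Number(now) >= expiry;
}
```

Since both values are plain millisecond counts since the epoch, the comparison gives the same answer regardless of where the server or the user happens to be.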

This is a rare case of where adding usability also keeps the code simpler.

So, this is what Richard came up with so far:

Mockup of the tempalias website design

It's not finalized yet, but in the spirit of publishing here early and often, I'm posting this now. It's actually the third iteration already and Richard is still working on making it even nicer. But it's already 2124 times better than what I could ever come up with.

On the code-front, I was looking into the SMTP server, where I found @kennethkalmer's node-smtp project which provides a very rough implementation of an SMTP daemon.

Unfortunately, it doesn't run under node trunk (or even 0.1.30), but with the power of github, I was able to create my own fork at

http://github.com/pilif/node-smtp

My fork contains a bit of additional code compared to the source:

  • Runs under node trunk (where trunk is defined as "node as it was last tuesday")
  • Enforces proper SMTP protocol sequence (first: HELO, then MAIL FROM, then RCPT TO and finally DATA)
  • Supports multiple recipients (by handling multiple RCPT TO)
  • Does some email address validation (which is way too strict for being RFC compliant)

Tomorrow, I'm going to use this fork to build an SMTP server that we'll be using for alias processing, where I will have to put some thought into actual mail delivery: Do I deliver the mail myself? Am I offloading it to a mail relay (I really want to do this. But read more tomorrow)? If so, how is this done with the most memory efficiency?

We'll see.

7Apr/100

tempalias.com – persistence

(This is the third installment of a development diary about the creation of a self destructing email alias service. Read the previous episode here.)

After arriving at a clear idea of how to handle alias identity, the next question I needed to tackle was the question of persistence: How do I want to store these aliases? Do I want them to survive a server restart? How would I access them?

On the positive side remains the fact that the data structure for this service is practically non-existent: Each alias has its identity and some data associated with it, mainly a target address and the validity information. And lookup will always happen using that identity (with the exception of garbage collection - something I will tackle later).

So this is a clear candidate for a very simple key/value store. As I hope to gain at least some traction though (wait until I have coded the bookmarklet), I would want this to be at least somewhat robust, hence writing flat-files seemed like a bad idea.

Ironically, if you want a really simple, built-in solution for data persistence in node.js, you have two options: Either write your own (which is where I don't want to go) or use SQLite, which is total overkill for the current solution.

So I had the option of just keeping stuff in memory (as plain JS objects or using memcache)  or to use any of the supported key/value storage services.

Aliases going away on server restart felt like a bad thing, so I looked into the various key/value stores.

While looking at the available libraries, I went for the one that was most recently updated, which is redis-node-client. Of course, this meant that I had to use both redis trunk and node trunk as the library is really tracking the bleeding edge. I don't mind that much though because both redis and node are very self-contained and compile easily on both linux (deployment) and mac os (development) while requiring next to no configuration.

So with a decision made for both persistence and identity, I went ahead and wrote more code.

On the project page, you will see few commits completing the full functionality I wanted a POST to /aliases to have - including persistence using redis and identity using the previously described method of brute-forcing the issue.

I still have two issues at the moment that will need tackling:

  1. The initial length of the pseudo-uuid isn't persisted. This means that once enough aliases are created that we are increasing the length and I'm restarting the server, I will get needless collisions or even a too heavily-used keyspace.
  2. The current method of checking for ID availability and later usage is totally non-race-proof and needs some serious looking-into.

Stuff I learned:

  • node is extremely work-in-progress. While it runs flawlessly and never surprises me with irreproducible or even just seemingly illogical behavior, features appear and disappear at will.
  • This state of flux in node makes it really hard to work with external dependencies. In this case, multipart.js vanished from node trunk (without change log entry either), but express still depends upon that. On the other hand, I'm forced to use node trunk otherwise redis client won't work.
  • Date("<timestamp>") in node is dependent on the local timezone and changing process.env.TZ post-startup doesn't have any effect. This means that I'm going to have to set TZ=UTC in my start script.
  • Working with an asynchronous API seems strange sometimes, but the power of closures usually comes to the rescue. I certainly wouldn't want to have to write software like this if I didn't have closures at my disposal (and, NO, global variables are NOT a viable alternative...)

6Apr/100

tempalias.com – another day

This is the second installment of an article series about creating a web service for self-destructing email aliases. Read part 1 here.

Today, I spent a lot of thought and experimentation with two issues:

  1. How would I name and identify the temporary aliases?
  2. How would I store the temporary aliases?

Naming and identifying sounds easy. One is inclined to just use an incrementing integer or something alike. But that won't work for security reasons. If the address you got is 12@tempalias.net, in all likelihood, there will be an 11@ and a 13@.

Using that information, you could easily bring the whole service down (and endlessly annoy its users) by requesting an address to get the current ID and then sending a lot of mail to the neighboring IDs. If those were created without a mail count limitation, then you could spam the recipient for the whole validity period and if they were created with a count limitation, you could use up all allowed mails.

So the aliases need to be random.

Which leads to the question of how to ensure uniqueness.

Unique random numbers you ask? Isn't this what UUIDs were invented for?

True. But considering the length of a UUID, would you really want to have an alias in the form e8ea98ce-dabc-42f8-8fcd-c50d20b1f2c5@tempalias.net? That address is so long that it might even hit some length limitation of the target site, which of course is true even if you apply cheap tricks like removing the dashes.

Of course, using base16 to encode a UUID (basically a 128 bit integer) is hopelessly inefficient. By increasing the number of distinct characters we use, we might be able to decrease the number of characters we need.

Keep in mind though, that the string in question is to be a local part of an email address and those tend to be case insensitive with not much guarantees that case is preserved over the process of delivering the message.

That, of course, limits the amount of characters we can use to basically 0-9 and A-Z (plus a few special characters like + . - and _).

This is what Base32 was invented for, but unfortunately, a base32 encoded UUID would still be around 26 characters in length. While that's a bit better, I still wouldn't want the email address scheme to be eda3u3rzcfer3fztdvvd6xnd3i@tempalias.com

So in the end, we need something way smaller (adding + . - and _ to the character space wouldn't help much - what comes out is about 20 characters in length).

In the end, I would probably have to create an elaborate scheme doing something like this:

  • pick a UUID. Use the first n bytes.
  • base32 encode.
  • Check whether that ID is free. If not, add 1 to n and try again.
  • Keep n around so that in the future, we can already start with taking bigger chunks.

So the moment we reach the first collision, we add one more byte (eight bits), which increases the keyspace 256-fold. That feels sufficiently safe from collisions to me, but of course it increases the maintenance burden somewhat.

The next question was how to get UUIDs and how to base32 encode them from JavaScript.

I tried different approaches, one of which even included using uuidjs and doing the b32 encoding/decoding in C. The good part about that: I now have a general idea of how to extend nodejs with C++ code (yeah. it has to be C++ and my b32 code was C, so I had to do a bit of trickery there too).

In the end though, considering that I can't use UUIDs anyways, we can go forward using Math.uuid.js and use their call using both len and radix (with the additional change of only using lowercase to encode the data), increasing the length as we hit collisions.
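The scheme of growing the ID length on collision can be sketched without Math.uuid.js like this. All names are hypothetical, isTaken(id) stands for whatever lookup the storage layer ends up providing, and Math.random() is used only to keep the sketch short (it is not a cryptographically strong source of randomness).

```javascript
// Lowercase letters and digits only, since local parts of email addresses
// can't be relied upon to preserve case.
var ALPHABET = 'abcdefghijklmnopqrstuvwxyz0123456789';

function randomId(length) {
  var id = '';
  for (var i = 0; i < length; i++) {
    id += ALPHABET.charAt(Math.floor(Math.random() * ALPHABET.length));
  }
  return id;
}

// Generate an ID of the given length; on every collision, make the ID one
// character longer and try again. The returned length should be kept around
// so that later calls start with the bigger keyspace right away.
function newAliasId(length, isTaken, generate) {
  generate = generate || randomId;
  var id = generate(length);
  while (isTaken(id)) {
    length += 1;
    id = generate(length);
  }
  return { id: id, length: length };
}
```

Note that this check-then-use loop has the same race as any other lookup followed by a write; it only shows how the length grows, not how the ID is reserved.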

So the next issue is storage: How to store the alias data? How to access it?

This will be part of the next posting here.

3Mar/100

No. It’s not «just» strings

On Hacker News, I came across this rant about strings in Ruby 1.9 where a developer was complaining about the new string handling in Ruby. Now, I'm no Ruby developer by even a long shot, but I am really interested in strings and string encoding which is why I posted the following comment which I reprint here as it's too big to just be a comment:

Rants about strings and character sets that contain words of the following spirit are usually neither correct nor worthy of any further thought:

It's a +String+ for crying out loud! What other language requires you to understand this
level of complexity just to work with strings?!

Clearly the author lives in his ivory tower of English language environments where he is able to use the word "just" right next to "strings" and he probably also can say that he "switched to UTF-8" without actually really having done so because the parts of UTF-8 he uses work exactly the same as the ASCII he used before.

But the rest of the world works differently.

Data can appear in all kinds of encodings and can be required to be in different other kinds of encodings. Some of those can be converted into each other, others can't.

Some Japanese encodings (Ruby's creator is Japanese) can't be converted to a Unicode representation for example.

Nowadays, as a programming language, you have three options of handling strings:

1) pretend they are bytes.

This is what older languages have done and what Ruby 1.8 does. This of course means that your application has to keep track of encodings. Basically for every string you keep in your application, you need to also keep track of what it is encoded in. When concatenating a string in encoding a to another string you already have that is in encoding b, you must do the conversion manually.

Additionally, because strings are bytes and the programming language doesn't care about encoding, you basically can't use any of the built-in string handling routines because they assume that each byte represents one character.
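A small Java sketch of why "one byte is one character" breaks down as soon as the text leaves ASCII. The class and method names are made up for this example; the library calls are plain JDK.

```java
import java.nio.charset.StandardCharsets;

public class ByteVersusCharacterCount {
    // Number of bytes the text occupies once it is encoded as UTF-8.
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in the text.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String ascii = "Zurich";
        String umlaut = "Z\u00fcrich"; // "Zürich", escaped to avoid source-encoding trouble
        System.out.println(utf8ByteLength(ascii) + " bytes, " + codePointLength(ascii) + " code points");
        System.out.println(utf8ByteLength(umlaut) + " bytes, " + codePointLength(umlaut) + " code points");
    }
}
```

For the pure ASCII string both numbers agree, which is exactly why method 1 feels fine for English text. For the second string a byte-oriented length function reports 7 where a reader counts 6 characters.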

Of course, if you are one of these lucky English UTF-8 users, getting data in ASCII and English text in UTF-8, you can easily "switch" your application to UTF-8 by still pretending strings to be bytes because, well, they are. For all intents and purposes, your UTF-8 is just ASCII called UTF-8.

This is what the author of the linked post wanted.

2) use an internal Unicode representation

This is what Python 3 does and what I feel to be a very elegant solution if it works for you: A String is just a collection of Unicode code points. Strings don't worry about encoding. String operations don't worry about it. Only I/O worries about encoding. So whenever you get data from the outside, you need to know what encoding it is in and then you decode it to convert it to a string. Conversely, whenever you want to actually output one of these strings, you need to know in what encoding you need the data and then encode that sequence of Unicode code points to any of these encodings.

You will never be able to convert a bunch of bytes into a string or vice versa without going through some explicit encoding/decoding.
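The same discipline can be sketched in Java, which also keeps String and byte[] apart and only converts between them when given a charset. This is an illustration of the decode-on-input, encode-on-output idea, not of Python 3 itself; Java strings are sequences of UTF-16 code units rather than code points, and the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeEncodeBoundary {
    // Input boundary: bytes plus a known encoding become a String.
    static String decode(byte[] raw, Charset encoding) {
        return new String(raw, encoding);
    }

    // Output boundary: a String plus a requested encoding become bytes.
    static byte[] encode(String text, Charset encoding) {
        return text.getBytes(encoding);
    }

    public static void main(String[] args) {
        // "Zürich" as it would arrive in ISO-8859-1 (Latin1): one byte per character.
        byte[] latin1Input = { 0x5A, (byte) 0xFC, 0x72, 0x69, 0x63, 0x68 };
        String text = decode(latin1Input, StandardCharsets.ISO_8859_1);
        byte[] utf8Output = encode(text, StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars in, " + utf8Output.length + " UTF-8 bytes out");
    }
}
```

Everything between decode() and encode() works on text and never needs to know which encoding the data arrived in.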

This of course has some overhead associated with it, as you always have to do the encoding and because operations on that internal collection of Unicode code points might be slower than the simple array-of-byte-based approach, especially if you are using some kind of variable-length encoding (which you probably are to save memory).

Interestingly, whenever you receive data in an encoding that cannot be represented with Unicode code points and whenever you need to send out data in that encoding, then you are screwed.

This is a deficiency in the Unicode standard. Unicode was specifically made so that it can be used to represent every encoding, but it turns out that it can't correctly represent some Japanese encodings.

3) The third option is to store an encoding with each string and expose both the string's contents and the encoding to your users

This is what Ruby 1.9 does. It combines methods 1 and 2: It allows you to choose whatever internal encoding you need, it allows you to convert from one encoding to the other and it removes the need to externally keep track of every string's encoding because it does that for you. It also makes sure that you don't intermix encodings, but I'm getting ahead of myself.

You can still use the language's string library functions because they are aware of the encoding and usually do the right thing (minus, of course, bugs).

As this method is independent of the (broken?) Unicode standard, you would never get into the situation where just reading data in some encoding makes you unable to write the same data back in the same encoding as in this case, you would just create a string using this problematic encoding and do your stuff on that.

Nothing prevents the author of the linked post from using Ruby 1.9's facilities to do exactly what Python 3 does (of course, again, ignoring the Unicode issue) by internally keeping all strings in, say, UTF-16 (you can't keep strings in "Unicode" - Unicode is no encoding - but that's for another post). You would transcode all incoming and outgoing data to and from that encoding. You would do all string operations on that application-internal representation.

A language throwing an exception when you concatenate a Latin1 string to a UTF-8 string is a good thing! You see: Once that concatenation has happened by accident, it's really hard to detect and fix.

At least it's fixable though because not every Latin1 string is also a UTF-8 string. But if it so happens that you concatenate, say, Latin1 and Latin8 by accident, then you are really screwed and there's no way to find out where Latin1 ends and Latin8 begins as every valid Latin1 string is also a valid Latin8 string. Both are arrays of bytes with values between 0 and 255 (minus some holes).
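Here is a hedged Java sketch of that difference. It glues UTF-8 bytes and Latin1 bytes together the way byte-based string code would, then asks a strict decoder whether the result is well-formed. The class and method names are invented for the example, and it only shows the Latin1 side of the single-byte case, not Latin8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MixedEncodingCheck {
    // Strictly decodes the bytes and reports whether they are well-formed in the given charset.
    static boolean decodesCleanly(byte[] raw, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Byte-level concatenation, the way "strings are just bytes" code would do it.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] utf8 = "gr\u00fcezi ".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "Z\u00fcrich".getBytes(StandardCharsets.ISO_8859_1);
        byte[] mixed = concat(utf8, latin1);
        System.out.println("valid UTF-8:  " + decodesCleanly(mixed, StandardCharsets.UTF_8));
        System.out.println("valid Latin1: " + decodesCleanly(mixed, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 check fails on the lone Latin1 byte for "ü", so the accident is at least detectable. The Latin1 check accepts the whole buffer, because Java's ISO-8859-1 decoder maps every byte value to some character, and that is the reason a mix-up between two single-byte encodings goes unnoticed.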

In today's small world, you want that exception to be thrown.

In conclusion, what I find really amazing about this complicated problem of character encoding is the fact that nobody feels it's complicated because it usually just works - especially method 1 described above that has constantly been used in years past and also is very convenient to work with.

Also, it still works.

Until your application leaves your country and gets used in countries where people don't speak ASCII (or Latin1). Then all these interesting problems arise.

Until then, you are annoyed by every one of the methods I described except method 1.

Then, you will understand what great service Python 3 has done for you and you'll switch to Python 3 which has very clear rules and seems to work for you.

And then you'll have to deal with the Japanese encoding problem and you'll have to use binary bytes all over the place and have to stop using strings altogether because just reading input data destroys it.

And then you might finally see the light and begin to care for the seemingly complicated method 3.

22May/090

(Unicode-)String handling done right

Today, I found myself reading the chapter about strings on diveintopython3.org.

Now, I'm no Python programmer by any means. Sure. I know my share of Python and I really like many of the concepts behind the language. I have even written some smaller scripts in Python, but it's not my day-to-day language.

That chapter about string handling really really impressed me though.

In my opinion, handling Unicode strings the way Python 3 does is exactly how it should be done in every development environment: Keep strings and collections of bytes completely separate and provide explicit conversion functions to convert from one to the other.

And hide the actual implementation from the user of the language! A string is a collection of characters. I don't have to care how these characters are stored in memory and how they are accessed. When I need that information, I will have to convert that string to a collection of bytes, giving an explicit encoding how I want that to be done.

This is exactly how it should work, but implementation details leaking into the language are messing this up in every other environment I know of, making it a real pain to deal with multibyte character sets.

Features like this are what convince me to look into new stuff. Maybe it IS time to do more Python after all.

29Apr/090

JavaScript and Applet interaction

As I said earlier this month: While Java applets are dead for games and animations and whatever else they were used for back in the nineties, they still have their use when you have to access the local machine from your web application in some way.

There are other possibilities of course, but they all are limited:

  • Flash loads quickly and is available in most browsers, but you can only access the hardware Adobe has created an API for. That's upload of files the user has to manually select, webcams and microphones.
  • ActiveX works only in IE, not in any other browser.
  • .NET ditto.
  • Silverlight is neither commonly installed on your users' machines, nor does it provide native hardware access.

So if you need to, say, access a bar code scanner. Or access a specific file on the user's computer - maybe stored in a place that is inconvenient for the user to get to (%Localappdata% for example is hidden in Explorer). In this case, a signed Java applet is the only way to go.

You might tell me that a website has no business accessing that kind of data and generally, I would agree, but what if your requirements are to read data from a bar code scanner without altering the target machine at all and without requiring the user to perform any steps but to plug in the scanner and click a button?

But Java applets have that certain 1996 look to them, so even if you access the data somehow, the applet still feels foreign to your cool Web 2.0 application: It doesn't quite fit the tight coupling between browser and server that AJAX gets us and even if you use Swing, the GUI will never look as good (and customized) as something you could do in HTML and CSS.

But did you know that Java Applets are fully scriptable?

By default, any JavaScript function on a page can call any public method of any applet on that page. So let's say your applet implements

public String sayHello(String name){
    return "Hello "+name;
}

Then you can use JavaScript to call that method (using jQuery here):

$('#some-div').html(
    $('#id_of_the_applet').get(0).sayHello(
        $('#some-form-field').val())
);

If you do that, you have to remember though that any applet method called this way will run inside the sandbox regardless of whether the applet is signed or not.

So how do you access the hardware then?

Simple: Tell the JRE that you are sure (you are, aren't you?) that it's ok for a script to call a certain method. To do that, you use AccessController.doPrivileged(). Say, for example, you want to check whether some specific file is on the user's machine, and let's further assume that you have a singleton RuntimeSettings that provides a method to check for the existence of the file and then return its name. Then you could do something like this:

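    // Needs: import java.security.AccessController;
    //        import java.security.PrivilegedAction;
    public String getInterfaceDirectory() {
        // Everything inside run() executes with the signed applet's own
        // permissions, even when the caller is JavaScript on the page.
        return AccessController.doPrivileged(
            new PrivilegedAction<String>() {
                public String run() {
                    return RuntimeSettings.getInstance().getInterfaceDirectory();
                }
            }
        );
    }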

Now it's safe to call this method from JavaScript despite the fact that RuntimeSettings.getInterfaceDirectory() directly accesses the underlying system. Whatever is in PrivilegedAction.run() will have full hardware access (provided the applet in question is signed and the user has given permission).

Just keep one thing in mind: Your applet is fully scriptable and if you are not very careful where that script comes from, your applet may be abused and thus the security of the client browser might be at risk.

Keeping this in mind, try to:

  • Make these elevated methods do one and only one thing.
  • Keep the interface between the page and the applet as simple as possible.
  • In elevated methods, do not call into JavaScript (see below) and certainly do not eval() any code coming from the outside.
  • Make sure that your pages are sufficiently secured against XSS: Don't allow any user generated content to reach the page unescaped.

The explicit and cumbersome declaration of elevated actions was put in place to make sure that the developer keeps the associated security risk in mind. So be a good developer and do so.

Using this technology, you can even pass around Java objects from the Applet to the page.

Also, if you need your applet to call into the page, you can do that too, of course, but you'll need a bit of additional work.

  1. You need to import JSObject from netscape.javascript (yes - that's what it's called. It works in all browsers though), so to compile the applet, you'll have to add plugin.jar (or netscape.jar - depending on the version of the JRE) from somewhere below your JRE/JDK installation to the build classpath. On a Mac, you'll find it below /System/Library/Frameworks/JavaVM.framework/Versions/<your version>/Home/lib.
  2. You need to tell the Java plugin that you want the applet to be able to call into the page. Use the mayscript attribute of the applet tag for that (interestingly, it's just mayscript - without value, thus making your nice XHTML page invalid the moment you add it - mayscript="true" or the correct mayscript="mayscript" don't work consistently on all browsers).
  3. In your applet, call the static JSObject.getWindow() and pass it a reference to your applet to acquire a reference to the current page's window object.
  4. On that reference you can call eval() or getMember() or just call() to call into the JavaScript on the page.

This tool set allows you to add the applet to the page at a size of 1 pixel, placed somewhere way outside the viewport and with visibility: hidden, while writing the actual GUI code in HTML and CSS, using normal JS/AJAX calls to communicate with the server.

If you need access to specific system components, this (together with JNA and applet-launcher) is the way to go, IMHO, as it solves the anachronism that is Java GUIs in applets.

There is still the long launch time of the JRE, but that's getting better and better with every JRE release.

I was having so much fun last week discovering all that stuff.