gnegg programming with passion

4Feb/084

Failing silently is bad

Today, I've experienced the perfect example of why I prefer PostgreSQL (congratulations for a successful 8.3 release today, guys!) to MySQL.

Let me first give you some code, before we discuss it (assume that the data which gets placed in the database is - wrongly so - in ISO-8859-1):

This is what PostgreSQL does:

bench ~ > createdb -Upilif -E utf-8 pilif
CREATE DATABASE
bench ~ > psql -Upilif
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

pilif=> create table test (blah varchar(20) not null default '');
CREATE TABLE
pilif=> insert into test values ('gnügg');
ERROR:  invalid byte sequence for encoding "UTF8": 0xfc676727293b
pilif=>

and this is what MySQL does:

bench ~ > mysql test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 97
Server version: 5.0.44-log Gentoo Linux mysql-5.0.44-r2

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table test( blah varchar(20) not null default '')
    -> charset=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into test values ('gnügg');
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from test;
+------+
| blah |
+------+
| gn   |
+------+
1 row in set (0.00 sec)

mysql>

Obvisouly it is wrong to try and place latin1 encoded data in an utf-8 formatted data store: While every valid utf-8 byte sequence is a valid latin1 byte sequence (latin1 does not restrict the validity of bytes, though some positions may be undefined), the reverse certainly is not true. The character ü from my example is 0xfc in latin1 and U+00fc in unicode which must be encoed as 0xc3 0xbc in utf-8. 0xfc alone is no valid utf-8 byte sequence.

So if you pass this invalid sequence to any entity accepting an utf-8 encoded byte stream, it will not be clear what to do with that data. It's not utf-8, that's for sure. But assuming that no character set is specified with the stream, it's impossible to guess what to translate the byte sequence into.

So PostgreSQL sees the error and bails out (if both the server and the client are set to utf-8 encoding and data is sent in non-utf8-format - otherwise it knows how to convert the data - conversion from any character set to utf-8 is possible all the time). MySQL on the other hand decides to fail silently and to try to fix up the invalid input.

Now while I could maybe live with the default of assuming latin1 encoding, just stopping to process the data without warning what so ever leads to undetected loss of data!

What if I'm not just entering one word? What if it's a blog-entry like this one? What if the entry is done by a non tech-savvy user? Remember: This mistake can easily be produced: Wrong Content-Type headers, old browsers, broken browsers... it's very easy to get Latin1 when you want utf-8.

While I agree that sanitization must be done in the application tier (preferably on the model), it's inacceptable for a database application to store different data than what it was ordered to store without warning the user in any way. This easily leads to data loss or data corruption.

There are many more little things like this where MySQL decides to silently fail where PostgreSQL (and any other database) bail out correctly. As a novice this can feel tedious for you. It can feel like PostgreSQL is pedantic and like you are faster with MySQL. But let's be honest: What do you prefer? An error message or lost data with no way of knowing that it's lost?

This, by the way, is the outcome of a lengthy debugging session on a Typo3 installation, which also, but not ultimately is to blame here. In a perfect world, MySQL would bail out, but Typo3 would either

  • Not specify charset=utf8 when creating the table unless specifically asked to.
  • Send a charset=utf-8 http-header, knowing that the database has been created as containing utf-8
  • Sanitize user input before handing it over to the mysql-backend which is obviously broken in this instance.

Now back to debugging real software on real databases *grin*

Comments (4) Trackbacks (0)
  1. Woo, sorry … my markup resulted in having my text blown away! Take #2

    Hear hear!

    *X-posted from Joseph Scott’s blog – http://snipurl.com/1z2zm -*

    Coincidentally I was talking with someone about ?what? something like rigour … not always fun to be PITA but sometimes little quibbles are actually fundamental principles.

    I was telling about how it was such a strain to create a brand-spanking new airport landing system … tin cans full of squishy customers. Anyhow, what I describe as the prime mandate, ghastly to implement, is actually dead simple: thou shalt not issue false data. TINA … no alternative … none. If the alternative is permissible then you can be sure what you’re doing is trivial.

    cheers

  2. Philip: How do you have your mysql setting configure. By default MySQL 5.0.x does not come wit utf8 enable at any level.

    Run “show variables like ‘%char%’;” and compare to result below:

    mysql> show variables like ‘%char%’;
    +————————–+—————————-+
    | Variable_name | Value |
    +————————–+—————————-+
    | character_set_client | latin1 |
    | character_set_connection | latin1 |
    | character_set_database | latin1 |
    | character_set_filesystem | binary |
    | character_set_results | latin1 |
    | character_set_server | latin1 |
    | character_set_system | utf8 |
    | character_sets_dir | /usr/share/mysql/charsets/ |
    +————————–+—————————-+
    8 rows in set (0.00 sec)

    Could be you have character_set_client set to latin1 even do the DB,tables,columns are utf8.

  3. You to add more:

    Add the following at /etc/my.cnf:

    [client]
    default-character-set=utf8

    [mysqld]
    default-character-set=utf8
    default-collation=utf8_general_ci

  4. no matter if it’s a configuration problem or not: It should never fail silently, truncating the data in the process. As I said: I could live with an error or even a implicit conversion (where possible), but never ever with a truncation of data.

    Philip


Leave a comment


No trackbacks yet.