I've been making progress on a comprehensive approach to character set management, and have learned a lot. In a nutshell, all the layers have to match: html, xml, php, MySQL (database and connection) and, in the case of FPDF, the fonts.
Through SynApp2 1.8.0 (beta 4), there are some discontinuities. The html pages and xml use utf-8, but the database connection and data manipulation assume latin-1. A conversion layer, implemented by encode_entities() and decode_entites(), compensates. This explicit conversion layer turns out to be unnecessary if all the pieces agree on encoding and character set.
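To make the mismatch concrete, here is a minimal sketch of what happens when one layer writes utf-8 and another reads the same bytes as latin-1. It is written in Python rather than SynApp2's PHP purely to keep it short, and the sample word is my own, not taken from SynApp2.

```python
# Text as a utf-8 html page would submit it.
text = "café"
utf8_bytes = text.encode("utf-8")

# A layer that wrongly assumes latin-1 treats every byte as one character,
# so the two-byte sequence for 'é' turns into two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# When both ends agree on utf-8, the text survives with no conversion layer.
roundtrip = utf8_bytes.decode("utf-8")
print(roundtrip == text)
```

The garbled result in the first case is the kind of damage a compensating conversion layer has to paper over.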
See MySQL and UTF-8 Notes: http://www.phpwact.org/php/i18n/utf-8/mysql
There's a significant difference between utf-8 and many, if not all, of the ISO character sets like latin-1, latin-2, etc. - the number of bytes per character. UTF-8 is an encoding of Unicode in which each character may take multiple bytes, whereas the latin-N character sets, for example, represent every character with a single byte: the ASCII range is shared, and the values above it are assigned differently by each set.
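A small sketch of the bytes-per-character difference, again in Python for brevity. The sample characters are arbitrary choices of mine.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes utf-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# Plain ASCII stays at one byte; accented and other characters need more.
for ch in ["A", "é", "Ą", "€"]:
    print(ch, utf8_length(ch))

# latin-1 always uses exactly one byte, but only for the characters it contains.
latin1_e = "é".encode("latin-1")
print(len(latin1_e))

# ASCII characters are byte-for-byte identical in both encodings.
ascii_same = "A".encode("latin-1") == "A".encode("utf-8")
print(ascii_same)
```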
In order for a client (browser) or database to understand what character is represented by a specific single-byte code (e.g., does \xA1 represent '¡' or 'Ą'?), the character set must be known. The same code represents a different character depending on the character set. This also implies that the single-byte character sets can't really be mixed. You can have latin-1 or latin-2, but not both. With utf-8, each character has its own distinct encoding, so characters (languages) can be mixed in the same text.
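The \xA1 example can be checked directly. This is only an illustration in Python, using its codec names for latin-1 (iso-8859-1) and latin-2 (iso-8859-2).

```python
code = b"\xa1"

# The same single byte decodes to a different character in each set.
as_latin1 = code.decode("iso-8859-1")
as_latin2 = code.decode("iso-8859-2")
print(as_latin1, as_latin2)

# Both characters can sit in one utf-8 string, each with its own byte sequence.
mixed = as_latin1 + as_latin2
mixed_utf8 = mixed.encode("utf-8")
print(mixed_utf8)

# A single-byte set cannot hold both: latin-1 has no code for 'Ą'.
try:
    mixed.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)
```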
Handling multi-byte characters correctly requires some additional care and attention to detail. Functions that handle data at the application level must be implemented appropriately. In the case of PHP particularly, string functions that count, search, or manipulate multi-byte character data must be suited to the task.
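Here is a rough sketch of why byte-oriented string handling goes wrong with utf-8 data. It is in Python, where the byte/character split is explicit; in PHP the corresponding distinction is, by default, between strlen() (which counts bytes) and mb_strlen() (which counts characters for a given encoding). The helper truncate_utf8() is a hypothetical name for illustration, not a SynApp2 function.

```python
text = "naïve café"
data = text.encode("utf-8")

# Counting bytes and counting characters give different answers.
char_count = len(text)
byte_count = len(data)
print(char_count, byte_count)

# Cutting at an arbitrary byte offset can split a character in half.
broken = data[:3]
try:
    broken.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False
print(cut_is_valid)

def truncate_utf8(raw: bytes, max_bytes: int) -> str:
    """Truncate to at most max_bytes, dropping a trailing partial character.

    Assumes raw is valid utf-8 to begin with; errors="ignore" would also
    silently drop invalid bytes elsewhere in the input.
    """
    return raw[:max_bytes].decode("utf-8", errors="ignore")

safe = truncate_utf8(data, 3)
print(safe)
```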
The changes needed for character set management fall into several distinct areas:
- configuration: designate the character set/encoding
- generate html markup with the correct meta tag charset
- return the correct xml response encoding
- use the correct data processing/manipulation functions
- implement a utf-8/unicode multilingual character set and fonts for FPDF (with tFPDF)
- translate database results (as needed) to utf-8 for FPDF reports
This should all be reasonably straightforward and contained.
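As a rough sketch of how a single configured charset could drive several of the areas listed above, the snippet below builds the html meta tag and xml declaration from one setting and transcodes a database result to utf-8. It is Python for brevity, not the SynApp2 implementation; the names CHARSET, DB_CHARSET, html_meta(), xml_declaration() and to_utf8() are all hypothetical.

```python
# Hypothetical configuration values, for illustration only.
CHARSET = "utf-8"        # encoding used for html pages and xml responses
DB_CHARSET = "latin-1"   # encoding the database connection delivers

def html_meta(charset: str) -> str:
    """Meta tag announcing the page encoding to the browser."""
    return f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'

def xml_declaration(charset: str) -> str:
    """XML declaration announcing the response encoding."""
    return f'<?xml version="1.0" encoding="{charset}"?>'

def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Transcode a raw database value to utf-8 (a no-op if it already is utf-8)."""
    return raw.decode(source_charset).encode("utf-8")

print(html_meta(CHARSET))
print(xml_declaration(CHARSET))

# A latin-1 encoded 'café' from the database becomes utf-8 for the report.
report_value = to_utf8(b"caf\xe9", DB_CHARSET)
print(report_value)
```

If the database connection is itself switched to utf-8, the transcoding step drops out, which is the point of making all the layers agree.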
The ball is rolling.
-Richard