PHP: UTF-8

UTF-8 (UCS Transformation Format, 8-bit) is a multibyte character encoding for Unicode. Like UTF-16 and UTF-32, UTF-8 can represent every character in the Unicode character set. Unlike them, it is backward-compatible with ASCII and avoids the complications of endianness and byte order marks (BOM). UTF-8 encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed ‘octets’ in the Unicode Standard). Code points with lower numerical values (i.e., earlier code positions in the Unicode character set, which tend to occur more frequently in practice) are encoded using fewer bytes, making the encoding scheme reasonably efficient. In particular, the first 128 characters of the Unicode character set, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as the corresponding ASCII character, so valid ASCII text is also valid UTF-8-encoded Unicode text. The official IANA name for the UTF-8 character encoding is UTF-8.
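The variable-length scheme is easy to see from PHP itself. The following is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" escape); the helper name utf8_hex() and the sample characters are illustrative only.

```php
<?php
// Hypothetical helper: hex dump of the UTF-8 bytes that make up a string.
function utf8_hex(string $char): string
{
    return bin2hex($char);
}

// "\u{...}" escapes need PHP 7.0 or later.
$samples = [
    'U+0041'  => "\u{0041}",  // Latin capital A (ASCII range)
    'U+00E9'  => "\u{00E9}",  // e with acute accent
    'U+20AC'  => "\u{20AC}",  // euro sign
    'U+1F600' => "\u{1F600}", // emoji, outside the Basic Multilingual Plane
];

foreach ($samples as $codePoint => $char) {
    // strlen() counts bytes, not characters, so it shows the encoded length.
    printf("%-8s %d byte(s): %s\n", $codePoint, strlen($char), utf8_hex($char));
}
```

The four samples occupy 1, 2, 3 and 4 bytes respectively (41, c3a9, e282ac and f09f9880), and the ASCII character keeps its ASCII byte value.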

PHP has an optional extension for handling multibyte strings, known as mbstring (short for multibyte string). This extension makes working with UTF-8 much easier. First, set the HTTP header so that the browser interprets the page as UTF-8. The call must be made before any output is sent:

header('Content-Type: text/html; charset=UTF-8');
 

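Older versions of PHP (before 5.6) use ‘ISO-8859-1’ as the default internal encoding; since PHP 5.6 the default follows the default_charset setting, which is UTF-8. Setting the internal encoding to UTF-8 explicitly makes the mbstring functions treat strings as UTF-8 whenever no encoding argument is given. The ordinary string functions remain byte-oriented. With input, output and the internal encoding all in UTF-8, PHP has no reason to attempt character set conversion: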

mb_internal_encoding('UTF-8');
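
The effect of the setting can be checked with a short script. This is a sketch only, assuming the mbstring extension is loaded and PHP 7.0 or later; the sample word is arbitrary.

```php
<?php
mb_internal_encoding('UTF-8');

$word = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

echo mb_internal_encoding(), "\n";         // the encoding mbstring now assumes
echo strlen($word), "\n";                  // byte count
echo mb_strlen($word), "\n";               // character count, using the internal encoding
echo mb_strlen($word, 'ISO-8859-1'), "\n"; // every byte counted as one character
```

With the internal encoding set to UTF-8, mb_strlen() reports 5 where strlen() reports 6. Passing ‘ISO-8859-1’ explicitly gives 6 again, which is what the older default would have produced.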

Overlong UTF-8 sequences and UTF-16 surrogates are a serious security threat, because they can smuggle characters past filters that only inspect raw bytes. Validation of input data is therefore very important. In the algorithm below, the first preg_replace() only allows well-formed Unicode: it rejects overlong 2-byte sequences as well as characters at U+10000 and above. The second preg_replace() removes overlong 3-byte sequences and UTF-16 surrogates. Each rejected sequence is replaced with ‘?’.

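// First pass: replace malformed or disallowed byte sequences with '?'.
$body = preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]'.       // selected control characters (tab, LF, CR are kept)
                     '|[\x00-\x7F][\x80-\xBF]+'.                     // ASCII byte followed by stray continuation bytes
                     '|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*'.        // overlong 2-byte leads; 4-byte and invalid leads
                     '|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})'. // 2-byte lead with zero or too many continuation bytes
                     '|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/S', // 3-byte lead with the wrong count
                     '?', $body);
 
// Second pass: overlong 3-byte sequences and UTF-16 surrogates.
$body = preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]'.
                     '|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);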
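
Where it is acceptable to reject bad input outright instead of substituting ‘?’, mbstring's mb_check_encoding() offers a simpler test. The sketch below assumes the mbstring extension is loaded; is_valid_utf8() is a hypothetical wrapper name, not part of PHP.

```php
<?php
// Hypothetical helper: true only if $text is well-formed UTF-8.
function is_valid_utf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9"));  // well-formed "café"
var_dump(is_valid_utf8("\xC0\xAF"));     // overlong 2-byte encoding of "/"
var_dump(is_valid_utf8("\xE0\x80\xAF")); // overlong 3-byte encoding of "/"
var_dump(is_valid_utf8("\xED\xA0\x80")); // UTF-16 surrogate U+D800
```

On current PHP versions only the first call should return true. Note that this check is not equivalent to the regular expressions above: it accepts well-formed 4-byte sequences and ASCII control characters, which the regular expressions replace.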

Some common text-handling functions are byte-oriented and do not work correctly on UTF-8 text; mbstring provides multibyte equivalents. Some of the more common equivalents are listed below:

  • mail() – mb_send_mail()
  • strlen() – mb_strlen()
  • strpos() – mb_strpos()
  • strrpos() – mb_strrpos()
  • substr() – mb_substr()
  • strtolower() – mb_strtolower()
  • strtoupper() – mb_strtoupper()
  • substr_count() – mb_substr_count()
  • split() (removed in PHP 7) – mb_split()
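
The difference between the byte-oriented functions and their multibyte equivalents shows up as soon as a string contains non-ASCII characters. A minimal sketch, assuming the mbstring extension is loaded and PHP 7.0 or later; the sample string is arbitrary.

```php
<?php
mb_internal_encoding('UTF-8');

$text = "\u{00C5}ngstr\u{00F6}m"; // "Ångström": 8 characters, 10 bytes

// Length: bytes versus characters.
echo strlen($text), ' vs ', mb_strlen($text), "\n";

// Position of "m": byte offset versus character offset.
echo strpos($text, 'm'), ' vs ', mb_strpos($text, 'm'), "\n";

// substr() cuts the two-byte first character in half; mb_substr() does not.
echo bin2hex(substr($text, 0, 1)), ' vs ', bin2hex(mb_substr($text, 0, 1)), "\n";

// Case conversion that also handles the non-ASCII letters.
echo mb_strtoupper($text), "\n";
```

The script reports 10 vs 8 for the length, 9 vs 7 for the position of "m", and c3 vs c385 for the first "character", where c3 alone is an incomplete UTF-8 sequence.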
