There are many languages in which every character can be expressed
in a single byte. Multi-byte character codes are used to express
the many characters of other languages. mbstring
was developed to handle Japanese characters. However, many
mbstring functions can also handle
character encodings other than Japanese.
A multi-byte character encoding represents a single character with
consecutive bytes. Some character encodings have shift (escape)
sequences to start and end multi-byte character strings. Therefore, a
multi-byte character string may be destroyed when it is divided
and/or counted unless a multi-byte-safe method
is used. This module provides multi-byte-safe string
functions and other utility functions such as conversion
functions.
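The difference between byte-based and multi-byte-safe handling can be sketched as follows. This is a minimal illustration, assuming UTF-8 input and that function overloading is not enabled; the variable names are chosen for the example only.

```php
<?php
// "日本語" as UTF-8 bytes: three characters, nine bytes.
$text = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$byteCount = strlen($text);              // counts bytes
$charCount = mb_strlen($text, 'UTF-8');  // counts characters

// Byte-based cut: stops in the middle of the second character.
$broken = substr($text, 0, 4);
// Character-based cut: keeps the first two characters whole.
$intact = mb_substr($text, 0, 2, 'UTF-8');

// Check whether each result is still well-formed UTF-8.
$brokenIsValid = mb_check_encoding($broken, 'UTF-8');
$intactIsValid = mb_check_encoding($intact, 'UTF-8');
```

The byte-based cut leaves a dangling lead byte, which is the kind of damage the paragraph above describes.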
Since PHP is basically designed for ISO-8859-1, some multi-byte
character encodings do not work well with PHP. Therefore, it is
important to set
mbstring.language to the appropriate language
(for example, "Japanese" for Japanese) and
mbstring.internal_encoding to a character
encoding that works with PHP.
PHP4 Character Encoding Requirements
Per byte encoding
Single-byte characters in the range 00h-7fh, which is compatible with ASCII
Multi-byte characters without 00h-7fh
These are examples of internal character encodings that work with
PHP and of ones that do NOT work with PHP.
Character encodings that work with PHP:
ISO-8859-*, EUC-JP, UTF-8
Character encodings that do NOT work with PHP:
JIS, SJIS
Character encodings that do not work with PHP may be converted
with mbstring's HTTP input/output conversion
feature/function.
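The requirement that multi-byte characters contain no bytes in 00h-7fh can be checked by hand. The sketch below is an illustration only: has_ascii_byte() is a hypothetical helper, and the character 表 was picked because its SJIS form is known to end in 5Ch, the ASCII backslash.

```php
<?php
// 表 (U+8868) as UTF-8 bytes.
$utf8 = "\xE8\xA1\xA8";

$sjis  = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Hypothetical helper: true if any byte falls in the ASCII range 00h-7fh.
function has_ascii_byte(string $bytes): bool
{
    for ($i = 0; $i < strlen($bytes); $i++) {
        if (ord($bytes[$i]) <= 0x7F) {
            return true;
        }
    }
    return false;
}

$sjisHex  = bin2hex($sjis);   // second byte is 5c, the backslash
$eucjpHex = bin2hex($eucjp);  // both bytes are above 7fh

$sjisClashes  = has_ascii_byte($sjis);
$eucjpClashes = has_ascii_byte($eucjp);
```

A parser that scans bytes for backslashes or quotes can misread the SJIS form, which is one reason SJIS appears in the "does NOT work" list.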
Note:
SJIS should not be used as the internal encoding unless the reader
is familiar with parsers/compilers, character encodings and
character encoding issues.
Note:
If you use databases with PHP, it is recommended that you use the
same character encoding for both the database and the internal
encoding, for ease of use and better performance.
If you are using PostgreSQL, it supports a client character
encoding that is different from the backend character encoding. See
the PostgreSQL manual for details.
Installation
mbstring is a non-default extension. You must
enable the module with the configure script.
Refer to the Install section for
details.
The following configure options are related to the
mbstring module.
--enable-mbstring=LANG: Enable
mbstring functions. This option is
required to use mbstring functions.
As of PHP 4.3.0, the mbstring extension provides
enhanced support for Simplified Chinese, Traditional Chinese,
Korean, and Russian in addition to Japanese.
To enable that feature, supply one of the
following values for the LANG parameter:
--enable-mbstring=cn for Simplified Chinese support,
--enable-mbstring=tw for Traditional Chinese support,
--enable-mbstring=kr for Korean support,
--enable-mbstring=ru for Russian support, and
--enable-mbstring=ja for Japanese support.
--enable-mbstring=all is a convenient way
to enable all the supported languages listed above.
Note:
Japanese language support is also enabled by
--enable-mbstring without any options,
for the sake of backwards compatibility.
--enable-mbstr-enc-trans:
Enable HTTP input character encoding conversion using the
mbstring conversion engine. If this
feature is enabled, HTTP input character encoding may be
converted to mbstring.internal_encoding
automatically.
Note:
As of PHP 4.3.0, the option
--enable-mbstr-enc-trans
is eliminated and replaced with the runtime setting
mbstring.encoding_translation.
HTTP input character encoding conversion is enabled
when this is set to On
(the default is Off).
--enable-mbregex: Enable
regular expression functions with multibyte character support.
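Whether a given PHP build was configured with these options can be checked at run time. The following is a small sketch, not an official check; it assumes that the presence of mb_ereg() is an adequate indicator of multibyte regex support.

```php
<?php
// True when the mbstring extension is compiled in or loaded.
$hasMbstring = extension_loaded('mbstring');

// Assumption: mb_ereg() is only defined when multibyte regex support is built.
$hasMbregex = function_exists('mb_ereg');

if (!$hasMbstring) {
    echo "mbstring is not available in this build\n";
} elseif (!$hasMbregex) {
    echo "mbstring is available, but without multibyte regex support\n";
}
```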
Runtime Configuration
The behaviour of these functions is affected by settings in
php.ini.
Table 1. Multi-Byte String configuration options
Name                            Default   Changeable
mbstring.language               NULL      PHP_INI_ALL
mbstring.detect_order           NULL      PHP_INI_ALL
mbstring.http_input             NULL      PHP_INI_ALL
mbstring.http_output            NULL      PHP_INI_ALL
mbstring.internal_encoding      NULL      PHP_INI_ALL
mbstring.script_encoding        NULL      PHP_INI_ALL
mbstring.substitute_character   NULL      PHP_INI_ALL
mbstring.func_overload          "0"       PHP_INI_SYSTEM
mbstring.encoding_translation   "0"       PHP_INI_ALL
For further details and definitions of the PHP_INI_* constants, see
ini_set().
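The settings and their "Changeable" level can also be inspected from a script. This sketch uses ini_get_all(); the set of directives it reports depends on the PHP version in use, so it may not match the table above exactly.

```php
<?php
// Collect every mbstring.* directive known to this PHP build.
$settings = extension_loaded('mbstring') ? ini_get_all('mbstring') : [];

$rows = [];
foreach ($settings as $name => $info) {
    // 'access' is a bit mask; INI_ALL means the value may be set anywhere.
    $rows[$name] = ($info['access'] === INI_ALL) ? 'PHP_INI_ALL' : 'restricted';
}

foreach ($rows as $name => $level) {
    printf("%-35s %s\n", $name, $level);
}
```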
Here is a short explanation of the configuration directives.
mbstring.language defines the
default language used in mbstring.
Note that this option also defines
mbstring.internal_encoding, so
mbstring.internal_encoding
should be placed after mbstring.language
in php.ini.
mbstring.encoding_translation enables
HTTP input character encoding detection and translation into the
internal character encoding.
mbstring.internal_encoding defines the default
internal character encoding.
mbstring.http_input defines the default HTTP
input character encoding.
mbstring.http_output defines the default HTTP
output character encoding.
mbstring.detect_order defines the default
character code detection order. See also
mb_detect_order().
mbstring.substitute_character defines the
character to substitute for invalid character encoding.
mbstring.func_overload overloads (replaces) single-byte
functions with mbstring functions. mail(),
ereg(), etc. are overloaded by
mb_send_mail(), mb_ereg(), etc.
Possible values are 0, 1, 2, 4 or a combination of them.
For example, 7 overloads everything.
0: No overload, 1: Overload mail() function,
2: Overload str*() functions, 4: Overload ereg*() functions.
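Because the mbstring.func_overload values are bit flags, a combined value can be decoded with bitwise AND. The sketch below only illustrates the arithmetic; the constant names and describe_overload() are hypothetical and not part of mbstring.

```php
<?php
// Flag values as listed above; the names are made up for this example.
const OVERLOAD_MAIL   = 1;
const OVERLOAD_STRING = 2;
const OVERLOAD_REGEX  = 4;

// Hypothetical helper: list the function groups a value would overload.
function describe_overload(int $value): array
{
    $groups = [];
    if ($value & OVERLOAD_MAIL) {
        $groups[] = 'mail()';
    }
    if ($value & OVERLOAD_STRING) {
        $groups[] = 'str*()';
    }
    if ($value & OVERLOAD_REGEX) {
        $groups[] = 'ereg*()';
    }
    return $groups;
}

$everything   = describe_overload(7);                             // 1 + 2 + 4
$mailAndRegex = describe_overload(OVERLOAD_MAIL | OVERLOAD_REGEX); // 5
$nothing      = describe_overload(0);
```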
Web browsers are supposed to use the same character encoding as
the page when submitting a form. However, browsers may not do
so. See mb_http_input() to
detect the character encoding used by browsers.
If enctype is set to
multipart/form-data in HTML forms,
mbstring does not convert the character encoding
of POST data. The user must convert it in the script, if
conversion is needed.
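Converting such data in the script could look like the following sketch. convert_input() is a hypothetical helper, and the source encoding (SJIS) is assumed to be known in advance; in a real form handler it would have to be determined first.

```php
<?php
// Hypothetical helper: recursively convert request values
// from a known source encoding to the desired target encoding.
function convert_input(array $values, string $to, string $from): array
{
    $out = [];
    foreach ($values as $key => $value) {
        $out[$key] = is_array($value)
            ? convert_input($value, $to, $from)
            : mb_convert_encoding((string) $value, $to, $from);
    }
    return $out;
}

// Simulated form data in SJIS (the bytes 95h 5Ch are the character 表).
$post = ['title' => "\x95\x5C", 'tags' => ["\x95\x5C", 'abc']];

$converted = convert_input($post, 'UTF-8', 'SJIS');
```

In a real script the array passed in would be $_POST; the simulated array keeps the example self-contained.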
Although browsers are smart enough to detect the character encoding
of HTML, it is better to set the charset in the HTTP
header. Change default_charset according to the
character encoding.
Example 1. php.ini setting example
;; Set default language
mbstring.language = Neutral ; Set default language to Neutral (UTF-8) (default)
mbstring.language = English ; Set default language to English
mbstring.language = Japanese ; Set default language to Japanese
;; Set default internal encoding
;; Note: Make sure to use a character encoding that works with PHP
mbstring.internal_encoding = UTF-8 ; Set internal encoding to UTF-8
;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On
;; Set default HTTP input character encoding
;; Note: Scripts cannot change the http_input setting.
mbstring.http_input = pass ; No conversion.
mbstring.http_input = auto ; Set HTTP input to auto
; "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS"
mbstring.http_input = SJIS ; Set HTTP input to SJIS
mbstring.http_input = UTF-8,SJIS,EUC-JP ; Specify order
;; Set default HTTP output character encoding
mbstring.http_output = pass ; No conversion
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
;; Set default character encoding detection order
mbstring.detect_order = auto ; Set detect order to auto
mbstring.detect_order = ASCII,JIS,UTF-8,SJIS,EUC-JP ; Specify order
;; Set default substitute character
mbstring.substitute_character = 12307 ; Specify Unicode value
mbstring.substitute_character = none ; Do not print character
mbstring.substitute_character = long ; Long Example: U+3000,JIS+7E7E
Example 2. php.ini setting for EUC-JP users
;; Disable Output Buffering
output_buffering = Off
;; Set HTTP header charset
default_charset = EUC-JP
;; Set default language to Japanese
mbstring.language = Japanese
;; HTTP input encoding translation is enabled.
mbstring.encoding_translation = On
;; Set HTTP input encoding conversion to auto
mbstring.http_input = auto
;; Convert HTTP output to EUC-JP
mbstring.http_output = EUC-JP
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none
Example 3. php.ini setting for SJIS users
;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
;; Set HTTP header charset
default_charset = Shift_JIS
;; Set default language to Japanese
mbstring.language = Japanese
;; Set HTTP input encoding conversion to auto
mbstring.http_input = auto
;; Convert to SJIS
mbstring.http_output = SJIS
;; Set internal encoding to EUC-JP
mbstring.internal_encoding = EUC-JP
;; Do not print invalid characters
mbstring.substitute_character = none
Resource Types
This extension has no resource types defined.
Predefined Constants
The constants below are defined by this extension, and will only
be available when the extension has either been compiled into
PHP or dynamically loaded at runtime.
HTTP input/output character encoding conversion may also convert
binary data. Users are expected to control character
encoding conversion if binary data is used for HTTP
input/output.
Note:
In PHP 4.3.2 or earlier,
if enctype for an HTML form is set to
multipart/form-data,
mbstring does not convert the character encoding
of POST data. In that case, strings need to be
converted to the internal character encoding in the script.
Note:
Since PHP 4.3.3,
if enctype for an HTML form is set to
multipart/form-data and
mbstring.encoding_translation is set to
On in php.ini,
POST variables and uploaded filenames are converted to the
internal character encoding.
However, characters specified in the 'name' attribute of HTML
form fields are not converted.
HTTP Input
There is no way to control HTTP input character
conversion from a PHP script. To disable HTTP input character
conversion, it has to be done in php.ini.
Example 4.
Disable HTTP input conversion in php.ini
;; Disable HTTP Input conversion
mbstring.http_input = pass
;; Disable HTTP Input conversion (PHP 4.3.0 or higher)
mbstring.encoding_translation = Off
When using PHP as an Apache module, it is possible to
override PHP ini settings per virtual host in
httpd.conf or per directory with
.htaccess. Refer to the Configuration section and the
Apache manual for details.
HTTP Output
There are several ways to enable output character encoding
conversion. One is using php.ini, another
is using ob_start() with
mb_output_handler() as the
ob_start callback function.
Note:
For PHP3-i18n users, mbstring's output
conversion differs from PHP3-i18n. Character encoding is
converted using an output buffer.
Example 5. php.ini setting example
;; Enable output character encoding conversion for all PHP pages
;; Enable Output Buffering
output_buffering = On
;; Set mb_output_handler to enable output conversion
output_handler = mb_output_handler
Example 6. Script example
<?php
// Enable output character encoding conversion only for this page
// Set HTTP output character encoding to SJIS
mb_http_output('SJIS');
// Start buffering and specify "mb_output_handler" as
// callback function
ob_start('mb_output_handler');
?>
Supported Character Encodings
Currently, the following character encodings are supported by the
mbstring module. A character encoding may
be specified for the encoding parameter of
mbstring functions.
The following character encodings are supported in this PHP
extension:
As of PHP 4.3.0, support for the following character encodings is
added experimentally:
EUC-CN, CP936, HZ,
EUC-TW, CP950, BIG-5,
EUC-KR, UHC (CP949),
ISO-2022-KR,
Windows-1251 (CP1251),
Windows-1252 (CP1252),
CP866,
KOI8-R.
php.ini entries that accept an encoding name
also accept "auto" and
"pass".
mbstring functions that accept an encoding
name also accept "auto".
If "pass" is set, no character
encoding conversion is performed.
If "auto" is set, it is expanded to
"ASCII,JIS,UTF-8,EUC-JP,SJIS".
Note:
"Supported character encoding" does not mean that it
works as an internal character code.
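An explicit candidate list, similar in spirit to the "auto" expansion, can be passed to mb_detect_encoding(). The sketch below is an assumption-laden illustration: it spells out its own list, uses strict mode, and relies on the sample bytes being valid in only one of the candidates. Detection is heuristic in general and can give different answers for ambiguous input.

```php
<?php
// "日本語" as UTF-8 bytes.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Candidate encodings, tried with strict validation.
$candidates = ['ASCII', 'UTF-8', 'EUC-JP', 'SJIS'];
$detected = mb_detect_encoding($utf8, $candidates, true);

// Use the detected name as the source encoding for a conversion,
// then convert back to confirm nothing was lost.
$eucjp     = mb_convert_encoding($utf8, 'EUC-JP', $detected);
$roundTrip = mb_convert_encoding($eucjp, 'UTF-8', 'EUC-JP');
```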
Overloading PHP string functions with multi-byte string functions
Because almost all PHP applications are written for languages that
use single-byte character encodings, handling multi-byte strings,
including Japanese, raises some difficulties. Most PHP string
functions, such as substr(), do not support
multi-byte strings.
The multi-byte extension (mbstring) provides multi-byte
counterparts of some PHP string functions (e.g. mb_substr() for
substr()).
The multi-byte extension (mbstring) also supports "function
overloading" to add multi-byte string functionality without
code modification. With function overloading, some PHP string
functions are overloaded by multi-byte string functions.
For example, mb_substr() is called
instead of substr() if function overloading
is enabled. Function overloading makes it easy to port an application
that supports only single-byte encodings to a multi-byte application.
mbstring.func_overload in php.ini must be
set to a positive value to use function overloading.
The value specifies the categories of functions to overload:
set 1 to enable mail function overloading, 2 for
string functions, and 4 for regular expression functions. For
example, if it is set to 7, the mail, string and regex functions
are all overloaded. The list of overloaded functions is shown
below.
Most Japanese characters need more than 1 byte per character. In
addition, several character encoding schemes are used in a
Japanese environment: EUC-JP, Shift_JIS (SJIS) and
ISO-2022-JP (JIS). As Unicode becomes popular,
UTF-8 is used as well. To develop Web applications for a Japanese
environment, it is important to use the right character set for the
task at hand, whether HTTP input/output, RDBMS or e-mail.
Storage for a character can be up to six
bytes.
A multi-byte character is usually twice the width of
a single-byte character. Wider characters are called
"zen-kaku", meaning full width; narrower characters are
called "han-kaku", meaning half width. "zen-kaku" characters
are usually fixed width.
Some character encodings define shift (escape) sequences for
entering/exiting multi-byte character strings.
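Both the width difference and the shift sequences can be observed directly. The following sketch assumes UTF-8 source data and uses ISO-2022-JP (JIS) as the example of an encoding with escape sequences; the variable names are illustrative.

```php
<?php
// "日本語" as UTF-8 bytes.
$utf8 = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Convert to ISO-2022-JP, a 7-bit encoding that uses escape sequences.
$jis = mb_convert_encoding($utf8, 'ISO-2022-JP', 'UTF-8');

// The multi-byte run is entered with ESC $ B and left with ESC ( B.
$startsWithShiftIn = (substr($jis, 0, 3) === "\x1B" . '$B');
$endsWithShiftOut  = (substr($jis, -3) === "\x1B" . '(B');

// Display width: each zen-kaku character counts as two columns.
$fullWidth = mb_strwidth($utf8, 'UTF-8');

// Han-kaku katakana (U+FF71) counts as one column.
$halfWidth = mb_strwidth("\xEF\xBD\xB1", 'UTF-8');
```

Cutting the JIS string by bytes can strip one of the escape sequences, which leaves the remaining bytes to be read as ASCII.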
ISO-2022-JP must be used for SMTP/NNTP.
"i-mode" web sites are supposed to use SJIS.
References
Multi-byte character encoding and its related issues are very
complex. It is impossible to cover them in sufficient detail
here. Please refer to the following URLs and other resources for
further reading.