mysql实现原理-字符集和排序规则
- 人工智能
- 2025-08-27 06:57:01

介绍 字符集
在计算机中字符是以二进制的格式存储的,每个字符在字符编码系统中都有一个唯一的二进制代码,这个过程确保了字符在存储和传输时不会丢失或混淆。将一个字符映射成一个二进制数据的过程就称为字符编码,而将一个二进制数据映射到一个字符的过程就称为字符解码。而字符集就是一套字符和二进制数据映射关系的集合。
一些重要的字符集 ASCII 字符集共收录128个字符,包括空格、标点符号、数字、大小写字母和一些不可见字符。由于总共才128个字符,所以可以使用1个字节来进行编码。
ISO 8859-1 字符集共收录256个字符,是在 ASCII 字符集的基础上又扩充了128个西欧常用字符(包括德法两国的字母),也可以使用1个字节来进行编码。这个字符集也有一个别名 latin1 。
GB2312 字符集收录了汉字以及拉丁字母、希腊字母、日文平假名及片假名字母、俄语西里尔字母。其中收录汉字6763个,其他文字符号682个。
GBK 字符集GBK 字符集只是在收录字符范围上对 GB2312 字符集作了扩充,编码方式上兼容 GB2312 。
utf8 字符集收录地球上能想到的所有字符,而且还在不断扩充。这种字符集兼容 ASCII 字符集,采用变长编码方式,编码一个字符需要使用1~4个字节。
实准确的说,utf8只是Unicode字符集的一种编码方案,Unicode字符集可以采用utf8、utf16、utf32这几种编码方案,utf8使用1~4个字节编码一个字符,utf16使用2个或4个字节编码一个字符,utf32使用4个字节编码一个字符
比较规则比较规则就是我们查询数据库中的数据,数据以什么样的顺序返回,不同的比较规则,返回的顺序也是区别很大的,可能直接影响到我们对数据的使用
比如:A的ascii码是65,B的ascii码是66,a的ascii码是97,如果以二进制比较规则,则是直接比较字符的ascii码,结果就是 A < B < a ,但是如果二进制比较基础上再加上一个不区分大小写,则又是另外一个顺序了A = a < B 。
所以了解一下数据库的字符集和比较规则对我们正确使用数据库是很重要的,接下来我们展开看看。
数据库中的字符集和比较规则 说明 演示使用的数据库版本为[8.0.13] 查看 查看字符集 show character set ; 或 show charset ; 字符集描述默认比较规则最大长度(字节)armscii8ARMSCII-8 Armenianarmscii8_general_ci1asciiUS ASCIIascii_general_ci1big5Big5 Traditional Chinesebig5_chinese_ci2binaryBinary pseudo charsetbinary1cp1250Windows Central Europeancp1250_general_ci1cp1251Windows Cyrilliccp1251_general_ci1cp1256Windows Arabiccp1256_general_ci1cp1257Windows Balticcp1257_general_ci1cp850DOS West Europeancp850_general_ci1cp852DOS Central Europeancp852_general_ci1cp866DOS Russiancp866_general_ci1cp932SJIS for Windows Japanesecp932_japanese_ci2dec8DEC West Europeandec8_swedish_ci1eucjpmsUJIS for Windows Japaneseeucjpms_japanese_ci3euckrEUC-KR Koreaneuckr_korean_ci2gb18030China National Standard GB18030gb18030_chinese_ci4gb2312GB2312 Simplified Chinesegb2312_chinese_ci2gbkGBK Simplified Chinesegbk_chinese_ci2geostd8GEOSTD8 Georgiangeostd8_general_ci1greekISO 8859-7 Greekgreek_general_ci1hebrewISO 8859-8 Hebrewhebrew_general_ci1hp8HP West Europeanhp8_english_ci1keybcs2DOS Kamenicky Czech-Slovakkeybcs2_general_ci1koi8rKOI8-R Relcom Russiankoi8r_general_ci1koi8uKOI8-U Ukrainiankoi8u_general_ci1latin1cp1252 West Europeanlatin1_swedish_ci1latin2ISO 8859-2 Central Europeanlatin2_general_ci1latin5ISO 8859-9 Turkishlatin5_turkish_ci1latin7ISO 8859-13 Balticlatin7_general_ci1macceMac Central Europeanmacce_general_ci1macromanMac West Europeanmacroman_general_ci1sjisShift-JIS Japanesesjis_japanese_ci2swe77bit Swedishswe7_swedish_ci1tis620TIS620 Thaitis620_thai_ci1ucs2UCS-2 Unicodeucs2_general_ci2ujisEUC-JP Japaneseujis_japanese_ci3utf16UTF-16 Unicodeutf16_general_ci4utf16leUTF-16LE Unicodeutf16le_general_ci4utf32UTF-32 Unicodeutf32_general_ci4utf8UTF-8 Unicodeutf8_general_ci3utf8mb4UTF-8 Unicodeutf8mb4_general_ci4当前版本支持的字符集为41个。
utf8与utf8mb4正常来说,utf8字符集表示一个字符需要1~4个字节,在 MySQL 5.5.3 及之前的版本中,utf8 字符集实际上是一个1~3的 UTF-8 变种。支持常规字符,并不包括所有 Unicode 字符,特别是那些位于扩展区(如表情符号、一些罕见汉字等)的字符。后来在MySQL 5.5.3 之后,utf8mb4 被引入以支持完整的四字节 UTF-8 编码。MySQL 官方也推荐在需要存储全 Unicode 字符集时使用 utf8mb4。考虑到兼容性,utf8字符也保留了下来。
查看比较规则 -- 查看utf8mb4支持的比较规则 show collation like "utf8mb4%"; 比较规则字符集IDutf8mb4_0900_ai_ciutf8mb4255utf8mb4_0900_as_ciutf8mb4305utf8mb4_0900_as_csutf8mb4278utf8mb4_0900_binutf8mb4309utf8mb4_binutf8mb446utf8mb4_croatian_ciutf8mb4245utf8mb4_cs_0900_ai_ciutf8mb4266utf8mb4_cs_0900_as_csutf8mb4289utf8mb4_czech_ciutf8mb4234utf8mb4_danish_ciutf8mb4235utf8mb4_da_0900_ai_ciutf8mb4267utf8mb4_da_0900_as_csutf8mb4290utf8mb4_de_pb_0900_ai_ciutf8mb4256utf8mb4_de_pb_0900_as_csutf8mb4279utf8mb4_eo_0900_ai_ciutf8mb4273utf8mb4_eo_0900_as_csutf8mb4296utf8mb4_esperanto_ciutf8mb4241utf8mb4_estonian_ciutf8mb4230utf8mb4_es_0900_ai_ciutf8mb4263utf8mb4_es_0900_as_csutf8mb4286utf8mb4_es_trad_0900_ai_ciutf8mb4270utf8mb4_es_trad_0900_as_csutf8mb4293utf8mb4_et_0900_ai_ciutf8mb4262utf8mb4_et_0900_as_csutf8mb4285utf8mb4_general_ciutf8mb445utf8mb4_german2_ciutf8mb4244utf8mb4_hr_0900_ai_ciutf8mb4275utf8mb4_hr_0900_as_csutf8mb4298utf8mb4_hungarian_ciutf8mb4242utf8mb4_hu_0900_ai_ciutf8mb4274utf8mb4_hu_0900_as_csutf8mb4297utf8mb4_icelandic_ciutf8mb4225utf8mb4_is_0900_ai_ciutf8mb4257utf8mb4_is_0900_as_csutf8mb4280utf8mb4_ja_0900_as_csutf8mb4303utf8mb4_ja_0900_as_cs_ksutf8mb4304utf8mb4_latvian_ciutf8mb4226utf8mb4_la_0900_ai_ciutf8mb4271utf8mb4_la_0900_as_csutf8mb4294utf8mb4_lithuanian_ciutf8mb4236utf8mb4_lt_0900_ai_ciutf8mb4268utf8mb4_lt_0900_as_csutf8mb4291utf8mb4_lv_0900_ai_ciutf8mb4258utf8mb4_lv_0900_as_csutf8mb4281utf8mb4_persian_ciutf8mb4240utf8mb4_pl_0900_ai_ciutf8mb4261utf8mb4_pl_0900_as_csutf8mb4284utf8mb4_polish_ciutf8mb4229utf8mb4_romanian_ciutf8mb4227utf8mb4_roman_ciutf8mb4239utf8mb4_ro_0900_ai_ciutf8mb4259utf8mb4_ro_0900_as_csutf8mb4282utf8mb4_ru_0900_ai_ciutf8mb4306utf8mb4_ru_0900_as_csutf8mb4307utf8mb4_sinhala_ciutf8mb4243utf8mb4_sk_0900_ai_ciutf8mb4269utf8mb4_sk_0900_as_csutf8mb4292utf8mb4_slovak_ciutf8mb4237utf8mb4_slovenian_ciutf8mb4228utf8mb4_sl_0900_ai_ciutf8mb4260utf8mb4_sl_0900_as_csutf8mb4283utf8mb4_spanish2_ciutf8mb4238utf8mb4_spanish_ciutf8mb4231utf8mb4_sv_0900_ai_ciutf8mb4264utf8mb4_sv_0900_as_csutf8mb4287utf8mb4_swedish_ciutf8mb4232utf8mb4_tr_0900_ai_ciutf8mb4265utf8mb4_tr_0900_as_csutf8mb4288utf8mb4_turkish_ciutf8mb4233utf8mb4_unicode_520_ciutf8mb4246utf8mb4_unicode_ciutf8mb4224utf8mb4_vietnamese_ciutf8mb4247utf8mb4_vi_0900_ai_ciutf8mb4277utf8mb4_vi_0900_as_csutf8mb4300这些比较规则的命名还挺有规律的:
比较规则名称以与其关联的字符集的名称开头后边紧跟着该比较规则主要作用于哪种语言,比如 utf8_polish_ci 表示以波兰语的规则比较, utf8_spanish_ci 是以西班牙语的规则比较, utf8_general_ci 是一种通用的比较规则名称后缀意味着该比较规则是否区分语言中的重音、大小写啥的,具体可以用的值如下: 后缀英文释义描述_aiaccent insensitive不区分重音_asaccent sensitive区分重音_cicase insensitive不区分大小写_cscase sensitive区分大小写_binbinary以二进制方式比较比如 utf8_general_ci 这个比较规则是以 ci 结尾的,说明不区分大小写
字符集和比较规则级别MySQL 有4个级别的字符集和比较规则,分别是:
服务器级别数据库级别表级别列级别 服务器级别 character_set_server 服务器级别的字符集 collation_server 服务器级别的比较规则查看
mysql> SHOW VARIABLES LIKE 'character_set_server'; +----------------------+---------+ | Variable_name | Value | +----------------------+---------+ | character_set_server | utf8mb4 | +----------------------+---------+ 1 row in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'collation_server'; +------------------+--------------------+ | Variable_name | Value | +------------------+--------------------+ | collation_server | utf8mb4_0900_ai_ci | +------------------+--------------------+ 1 row in set (0.02 sec)可以看到这台mysql服务器默认的字符集为utf8mb4,默认的比较规则为utf8mb4_0900_ai_ci。
修改
通过配置文件修改 [server] character_set_server=gbk collation_server=gbk_chinese_ci再次查看服务器的字符集和比较规则
mysql> SHOW VARIABLES LIKE 'character_set_server'; +----------------------+-------+ | Variable_name | Value | +----------------------+-------+ | character_set_server | gbk | +----------------------+-------+ 1 row in set (0.01 sec) mysql> SHOW VARIABLES LIKE 'collation_server'; +------------------+----------------+ | Variable_name | Value | +------------------+----------------+ | collation_server | gbk_chinese_ci | +------------------+----------------+ 1 row in set (0.02 sec)已经修改成我们的预期值了。
数据库级别创建数据库时,可以指定字符集和比较规则,如果不指定,则继承服务器级别字符集和比较规则
语法:
-- 创建数据库 CREATE DATABASE 数据库名 [[DEFAULT] CHARACTER SET 字符集名称] [[DEFAULT] COLLATE 比较规则名称]; -- 修改数据库 ALTER DATABASE 数据库名 [[DEFAULT] CHARACTER SET 字符集名称] [[DEFAULT] COLLATE 比较规则名称];示例:
创建数据库时,不指定字符集和比较规则 -- 不指定字符集和比较规则 mysql> create database db1; Query OK, 1 row affected (0.01 sec) mysql> use db1; Database changed mysql> SHOW VARIABLES LIKE 'character_set_database'; +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | character_set_database | gbk | +------------------------+-------+ 1 row in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'collation_database'; +------------------+----------------+ | Variable_name | Value | +------------------+----------------+ | collation_server | gbk_chinese_ci | +------------------+----------------+ 1 row in set (0.01 sec) mysql> 创建数据库时,指定字符集和比较规则 -- 指定字符集和比较规则 mysql> create database db2 character set utf8mb4 collate utf8mb4_general_ci; Query OK, 1 row affected (0.01 sec) mysql> use db2 Database changed mysql> SHOW VARIABLES LIKE 'character_set_database'; +------------------------+---------+ | Variable_name | Value | +------------------------+---------+ | character_set_database | utf8mb4 | +------------------------+---------+ 1 row in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'collation_database'; +--------------------+--------------------+ | Variable_name | Value | +--------------------+--------------------+ | collation_database | utf8mb4_general_ci | +--------------------+--------------------+ 1 row in set (0.00 sec) mysql> 修改数据库字符集和比较规则 mysql> alter database db2 character set utf8mb3 collate utf8mb3_general_ci; Query OK, 1 row affected, 2 warnings (0.01 sec) mysql> use db2; Database changed mysql> SHOW VARIABLES LIKE 'character_set_database'; +------------------------+---------+ | Variable_name | Value | +------------------------+---------+ | character_set_database | utf8mb3 | +------------------------+---------+ 1 row in set (0.00 sec) mysql> SHOW VARIABLES LIKE 'collation_database'; +--------------------+--------------------+ | Variable_name | Value | +--------------------+--------------------+ | collation_database | utf8mb3_general_ci | +--------------------+--------------------+ 1 row in set (0.00 sec) 如果字符集和比较规则不匹配,会报错 mysql> alter database db2 character set utf8mb3 collate utf8mb4_general_ci; ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3' mysql> 表级别创建数据表时,可以指定字符集和比较规则,如果不指定,则继承数据库级别字符集和比较规则
语法:
CREATE TABLE 表名 (列的信息) [[DEFAULT] CHARACTER SET 字符集名称] [COLLATE 比较规则名称]] ALTER TABLE 表名 [[DEFAULT] CHARACTER SET 字符集名称] [COLLATE 比较规则名称]示例
创建数据表时,不指定字符集和比较规则 mysql> create table t1 (col VARCHAR(10)); Query OK, 0 rows affected (0.02 sec) mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't1' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+-----------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+-----------------+--------------------+ | db1 | t1 | gbk_chinese_ci | gbk | +--------------+------------+-----------------+--------------------+ 1 row in set (0.00 sec) 创建数据表时,指定字符集和比较规则 mysql> create table t2 (col VARCHAR(10)) character set utf8mb4 collate utf8mb4_general_ci; Query OK, 0 rows affected (0.02 sec) mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't2' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t2 | utf8mb4_general_ci | utf8mb4 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.01 sec) 修改字符集和比较规则 mysql> alter table t2 character set utf8mb3 collate utf8mb3_general_ci; Query OK, 0 rows affected, 2 warnings (0.01 sec) Records: 0 Duplicates: 0 Warnings: 2 mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't2' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t2 | utf8mb3_general_ci | utf8mb3 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.00 sec) 如果字符集和比较规则不匹配,会报错 mysql> alter table t2 character set utf8mb3 collate utf8mb4_general_ci; ERROR 1253 (42000): COLLATION 'utf8mb4_general_ci' is not valid for CHARACTER SET 'utf8mb3' 列级别对于存储字符串的列,同一个表中的不同的列也可以有不同的字符集和比较规则。我们在创建和修改列定义的时候可以指定该列的字符集和比较规则。
语法:
CREATE TABLE 表名( 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称], 其他列... ); ALTER TABLE 表名 MODIFY 列名 字符串类型 [CHARACTER SET 字符集名称] [COLLATE 比较规则名称];示例:
同一个表中的不同的列指定不同的字符集和比较规则 mysql> create table t3 ( -> c1 varchar(100) character set utf8mb3 collate utf8mb3_general_ci, -> c2 varchar(100) character set gbk collate gbk_chinese_ci, -> c3 varchar(100) -> ) character set utf8mb4 collate utf8mb4_general_ci; Query OK, 0 rows affected, 2 warnings (0.02 sec) mysql> show create table t3\G; *************************** 1. row *************************** Table: t3 Create Table: CREATE TABLE `t3` ( `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL, `c2` varchar(100) CHARACTER SET gbk COLLATE gbk_chinese_ci DEFAULT NULL, `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci 1 row in set (0.00 sec)可以看到 c1和 c2 都是自定义字符集和比较规则,c3则是继承了表t3的字符集和比较规则
修改列的字符集和比较规则 -- 修改c2列的字符集和比较规则,使之继承表t3的字符集和比较规则 mysql> alter table t3 modify column `c2` varchar(100); Query OK, 0 rows affected (0.03 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> show create table t3\G; *************************** 1. row *************************** Table: t3 Create Table: CREATE TABLE `t3` ( `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL, `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL, `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci 1 row in set (0.00 sec) 补充说明 在转换列的字符集时需要注意,如果转换前列中存储的数据不能用转换后的字符集进行表示会发生错误。比方说原先列使用的字符集是utf8,列中存储了一些汉字,现在把列的字符集转换为ascii的话就会出错,因为ascii字符集并不能表示汉字字符。例如:
-- t3.c1的当前字符集为utf8mb3,比较规则为utf8mb3_general_ci mysql> insert into t3(c1) values("列1"); Query OK, 1 row affected (0.00 sec) -- 修改t3.c1列的字符集会报错 mysql> alter table db1.t3 modify column c1 varchar(100) CHARACTER SET `ascii` COLLATE ascii_general_ci; ERROR 1366 (HY000): Incorrect string value: '\xE5\x88\x971' for column 'c1' at row 1 如果只修改字符集,则比较规则会自动修改为字符集支持的默认比较规则 -- 查看表t3的当前字符集和比较规则 mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't3' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t3 | utf8mb4_general_ci | utf8mb4 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.00 sec) -- 修改表t3的字符集 mysql> alter table t3 CHARACTER SET utf8mb3; Query OK, 0 rows affected, 1 warning (0.01 sec) Records: 0 Duplicates: 0 Warnings: 1 -- 查看表t3的修改后的字符集和比较规则 mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't3' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t3 | utf8mb3_general_ci | utf8mb3 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.00 sec)比较规则确实自动随着字符集的修改而发生了修改
如果只修改比较规则,则字符集也会自动修改为跟比较规则对应的字符集 -- 查看表t3的当前字符集和比较规则 mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't3' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t3 | utf8mb3_general_ci | utf8mb3 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.01 sec) -- 修改表t3的比较规则 mysql> alter table t3 COLLATE utf8mb4_general_ci; Query OK, 0 rows affected (0.02 sec) Records: 0 Duplicates: 0 Warnings: 0 -- 查看表t3的修改后的字符集和比较规则 mysql> SELECT -> TABLE_SCHEMA, -> TABLE_NAME, -> TABLE_COLLATION, -> CCSA.character_set_name AS CHARACTER_SET_NAME -> FROM -> information_schema.TABLES T, -> information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA -> WHERE -> T.TABLE_SCHEMA = 'db1' -> AND T.TABLE_NAME = 't3' -> AND CCSA.collation_name = T.TABLE_COLLATION; +--------------+------------+--------------------+--------------------+ | TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION | CHARACTER_SET_NAME | +--------------+------------+--------------------+--------------------+ | db1 | t3 | utf8mb4_general_ci | utf8mb4 | +--------------+------------+--------------------+--------------------+ 1 row in set (0.01 sec) 我们假设一种级别的父子级关系服务器级别>数据库级别>表级别>列级别,父级的字符集和比较规则发生变化,不影响子级及以后级别的字符集和比较规则,或者说子级的字符集和比较规则在创建时就已经指定好了,不会受其他级别字符集和比较规则修改的影响 -- 查看服务器的字符集和比较规则 mysql> SHOW VARIABLES LIKE 'character_set_server'; +----------------------+-------+ | Variable_name | Value | +----------------------+-------+ | character_set_server | gbk | +----------------------+-------+ 1 row in set (0.01 sec) mysql> SHOW VARIABLES LIKE 'collation_server'; +------------------+----------------+ | Variable_name | Value | +------------------+----------------+ | collation_server | gbk_chinese_ci | +------------------+----------------+ 1 row in set (0.00 sec) -- 创建数据库,继承服务器级别的字符集和比较规则 mysql> create database db5; Query OK, 1 row affected (0.01 sec) -- 查看数据库的字符集和比较规则 mysql> use db5; Database changed mysql> SHOW VARIABLES LIKE 'character_set_database'; +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | character_set_database | gbk | +------------------------+-------+ 1 row in set (0.01 sec) mysql> SHOW VARIABLES LIKE 'collation_database'; +--------------------+----------------+ | Variable_name | Value | +--------------------+----------------+ | collation_database | gbk_chinese_ci | +--------------------+----------------+ 1 row in set (0.01 sec) ------------------------ -- 修改服务器字符集和比较规则配置 并重启服务器 [server] character_set_server=utf8mb4 collation_server=utf8mb4_general_ci ------------------------ -- 查看服务器的字符集和比较规则 mysql> SHOW VARIABLES LIKE 'character_set_server'; +----------------------+---------+ | Variable_name | Value | +----------------------+---------+ | character_set_server | utf8mb4 | +----------------------+---------+ 1 row in set (0.01 sec) mysql> SHOW VARIABLES LIKE 'collation_server'; +------------------+--------------------+ | Variable_name | Value | +------------------+--------------------+ | collation_server | utf8mb4_general_ci | +------------------+--------------------+ 1 row in set (0.00 sec) -- 查看数据库的字符集和比较规则 mysql> use db5; Database changed mysql> SHOW VARIABLES LIKE 'character_set_database'; +------------------------+-------+ | Variable_name | Value | +------------------------+-------+ | character_set_database | gbk | +------------------------+-------+ 1 row in set (0.01 sec) mysql> SHOW VARIABLES LIKE 'collation_database'; +--------------------+----------------+ | Variable_name | Value | +--------------------+----------------+ | collation_database | gbk_chinese_ci | +--------------------+----------------+ 1 row in set (0.00 sec) 字符集的应用上边我们介绍的是在数据存储方面的编码和解码,接下来我们介绍在数据传输方便的编码和解码。
准备 mysql服务器启动配置设置 [server] character_set_server=utf8mb4 常用的mysql客户端连接到mysql服务器后,都会设置客户端字符集,以便数据可以正确传输,但是这样也给我们查看底层的系统变量造成了干扰,为了避免这个情况,这里采用php连接mysql服务器 关键系统变量客户端连接mysql服务器,查询获取返回数据,会有多次的字符集编码转换,这个过程中会用到以下三个系统变量:
character_set_client: 服务器解码请求时使用的字符集character_set_connection: 服务器处理请求时会把请求字符串从 character_set_client 转为 character_set_connectioncharacter_set_results: 服务器向客户端返回数据时使用的字符集查看这几个变量的默认值
<?php // 创建连接 $conn = new mysqli("127.0.0.1", "root", "123456", "db1"); // 检测连接 if ($conn->connect_error) { die("连接失败: " . $conn->connect_error); } echo "连接成功\n"; $ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_client';"); $res = $ret->fetch_all()[0]; echo $res[0]." : ".$res[1].PHP_EOL; $ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_connection';"); $res = $ret->fetch_all()[0]; echo $res[0]." : ".$res[1].PHP_EOL; $ret = $conn->execute_query("SHOW VARIABLES LIKE 'character_set_results';"); $res = $ret->fetch_all()[0]; echo $res[0]." : ".$res[1].PHP_EOL; ?>查询结果:
连接成功 character_set_client : utf8mb4 character_set_connection : utf8mb4 character_set_results : utf8mb4可以看到,这几个变量的默认值继承了mysql服务器级别的字符集
客户端连接成功后,可以指定字符集来修改这三个系统变量的值
$conn->set_charset("gbk");查询结果:
连接成功 character_set_client : gbk character_set_connection : gbk character_set_results : gbk也可以通过客户端单个设置系统变量的值
这是设置是非持久的
$conn->set_charset("gbk"); $ret = $conn->execute_query("SET character_set_client = utf8;");查询结果:
连接成功 character_set_client : utf8mb3 character_set_connection : gbk character_set_results : gbk 编码和解码使用不同字符集的后果正确的查询:
<?php // 创建连接 $conn = new mysqli("127.0.0.1", "root", "123456", "db1"); // 检测连接 if ($conn->connect_error) { die("连接失败: " . $conn->connect_error); } echo "连接成功\n"; $ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'"); print_r($ret->fetch_all());查询结果:
连接成功 Array ( [0] => Array ( [0] => 列1 [1] => [2] => ) ) 客户端单独设置了character_set_client,之后: // 创建连接 $conn = new mysqli("127.0.0.1", "root", "123456", "db1"); // 检测连接 if ($conn->connect_error) { die("连接失败: " . $conn->connect_error); } echo "连接成功\n"; // 设置 character_set_client $ret = $conn->execute_query("SET character_set_client = gbk"); // 查询 $ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'"); print_r($ret->fetch_all()); ?>查询结果:
连接成功 Array ( )结果查询为空,说明了mysql服务端通过gbk解码请求中的%列%时,中文字符不能正确解析出来,导致查询不到正确的结果
客户端设置了跟服务端不兼容的字符集后 <?php // 创建连接 $conn = new mysqli("127.0.0.1", "root", "123456", "db1"); // 检测连接 if ($conn->connect_error) { die("连接失败: " . $conn->connect_error); } echo "连接成功\n"; // 设置字符集为gbk $conn->set_charset("gbk"); $ret = $conn->execute_query("select * from db1.t3 where c1 like'%列%'"); print_r($ret->fetch_all());查询报错:
连接成功 PHP Fatal error: Uncaught mysqli_sql_exception: Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (gbk_chinese_ci,COERCIBLE) for operation 'like' in /Users/xxx/code/php/test/test.php:18 Stack trace: #0 /Users/xxx/code/php/test/test.php(18): mysqli->execute_query('select * from d...') #1 {main} thrown in /Users/xxx/code/php/test/test.php on line 18还有很多种情况,这里就不一一列举,但是字符集不一致导致的编码和解码出错的情况,导致的后果还是很严重的,我们需要特别关注
参考下面这张图,我们能更清晰的感觉到一次request,底层需要做很多次的字符串编码解码和字符集的转换,有一个环节出问题,都可能会返回一个错误的结果。 这样绕来绕去看着就很晕,为了简单起见,我们通常都把 character_set_client 、character_set_connection、character_set_results 这三个系统变量设置成和客户端使用的字符集一致的情况,这样减少了很多无谓的字符集转换 为了方便设置,mysql提供了一条非常简单的语句:
SET NAMES 字符集名;这样一次性就可以把这三条设置成统一的字符集
比较规则的应用比较规则的作用通常体现比较字符串大小的表达式以及对某个字符串列进行排序中,所以有时候也称为排序规则,影响的是结果集的顺序
示例:
首先看一下db1.t3.c1列的排序规则:
CREATE TABLE `t3` ( `c1` varchar(100) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL, `c2` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL, `c3` varchar(100) COLLATE utf8mb4_general_ci DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_cic1列目前的排序规则是 utf8mb3_general_ci,不区分大小写 插入几条数据:
mysql> insert into db1.t3(c1) values('A'),('B'),('a'),('b'); Query OK, 4 rows affected (0.04 sec) Records: 4 Duplicates: 0 Warnings: 0查询一下看看返回结果:
mysql> select * from db1.t3 order by c1 asc; +------+------+------+ | c1 | c2 | c3 | +------+------+------+ | 1 | NULL | NULL | | A | NULL | NULL | | a | NULL | NULL | | B | NULL | NULL | | b | NULL | NULL | | 列1 | NULL | NULL | +------+------+------+ 6 rows in set (0.00 sec)我们可以看到 第二行到第五行,是大小写交叉返回的,顺序是26个英文字母的先后顺序,跟我们预期的是一样的
接下来我们修改一下排序规则
utf8mb3_bin 排序规则是直接比较字符编码,是区分大小写的
mysql> alter table db1.t3 modify column c1 varchar(200) collate utf8mb3_bin; Query OK, 0 rows affected, 1 warning (0.03 sec) Records: 0 Duplicates: 0 Warnings: 1再次执行同样的查询语句:
mysql> select * from db1.t3 order by c1 asc; +------+------+------+ | c1 | c2 | c3 | +------+------+------+ | 1 | NULL | NULL | | A | NULL | NULL | | B | NULL | NULL | | a | NULL | NULL | | b | NULL | NULL | | 列1 | NULL | NULL | +------+------+------+ 6 rows in set (0.00 sec)第二行到第五行的返回排序果然发生了调整,跟我们预期的也是一直的。
总结对字符集和排序规则的深入了解,有利于我们正确使用mysql,以及针对相关问题进行快速定位。
本篇时间和篇幅都有限,先整理到这里,后续进行了更深入的整理,会通过版本迭代式地补充到这里。
mysql实现原理-字符集和排序规则由讯客互联人工智能栏目发布,感谢您对讯客互联的认可,以及对我们原创作品以及文章的青睐,非常欢迎各位朋友分享到个人网站或者朋友圈,但转载请说明文章出处“mysql实现原理-字符集和排序规则”