PostgreSQL 按拼音排序 – convert to GBK/EUC_CN coding

发表于编程 2016-12-13 11:11 阅读量: 2,870

国内的应用，在文本排序上基本都是按照拼音来进行排序的。

在不同的字符集中，汉字的编码可能不一样，比如UTF8和GBK，其中GBK是按拼音的顺序进行编码的，而UTF8则不是。

所以如果你的数据库使用了UTF8编码，对中文字段进行排序时，可能得到的并不是按拼音排序的结果。

在PostgreSQL中，中文按拼音排序的编码包括GB18030, EUC_CN, GBK, BIG5, EUC_TW 等。

为了得到拼音排序，可以使用编码转换后的值来排序，索引也可以使用编码转换的表达式索引。

编码

PostgreSQL支持的编码如下

https://www.postgresql.org/docs/9.6/static/multibyte.html

PostgreSQL Character Sets

与中文编码排序相关的包括 GB18030, EUC_CN, GBK, BIG5, EUC_TW

简体常用的包括GBK, EUC_CN。

编码转换

在PostgreSQL中，如果要将字符从一个编码转换为另一个编码，需要告诉数据库（create conversion）怎么转换（使用什么C函数），PG内置了一些转换的C函数和转换方法。

https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html

pg_conversion

查看内置的转换方法

可以看到utf8转换为中文编码的都支持了

postgres=> select * from pg_conversion where conname ~* 'gbk|gb18|euc_cn|euc_tw|big5' order by 1;

conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault

-----------------+--------------+----------+----------------+---------------+-----------------+------------

big5_to_euc_tw | 11 | 10 | 36 | 4 | big5_to_euc_tw | t

big5_to_mic | 11 | 10 | 36 | 7 | big5_to_mic | t

big5_to_utf8 | 11 | 10 | 36 | 6 | big5_to_utf8 | t

euc_cn_to_mic | 11 | 10 | 2 | 7 | euc_cn_to_mic | t

euc_cn_to_utf8 | 11 | 10 | 2 | 6 | euc_cn_to_utf8 | t

euc_tw_to_big5 | 11 | 10 | 4 | 36 | euc_tw_to_big5 | t

euc_tw_to_mic | 11 | 10 | 4 | 7 | euc_tw_to_mic | t

euc_tw_to_utf8 | 11 | 10 | 4 | 6 | euc_tw_to_utf8 | t

gb18030_to_utf8 | 11 | 10 | 39 | 6 | gb18030_to_utf8 | t

gbk_to_utf8 | 11 | 10 | 37 | 6 | gbk_to_utf8 | t

mic_to_big5 | 11 | 10 | 7 | 36 | mic_to_big5 | t

mic_to_euc_cn | 11 | 10 | 7 | 2 | mic_to_euc_cn | t

mic_to_euc_tw | 11 | 10 | 7 | 4 | mic_to_euc_tw | t

utf8_to_big5 | 11 | 10 | 6 | 36 | utf8_to_big5 | t

utf8_to_euc_cn | 11 | 10 | 6 | 2 | utf8_to_euc_cn | t

utf8_to_euc_tw | 11 | 10 | 6 | 4 | utf8_to_euc_tw | t

utf8_to_gb18030 | 11 | 10 | 6 | 39 | utf8_to_gb18030 | t

utf8_to_gbk | 11 | 10 | 6 | 37 | utf8_to_gbk | t

(18 rows)

编码转换函数

注意数据库版本

PostgreSQL 8.x(如Greenplum), 将字符串从原编码转换为指定编码的字符串返回。

可能存在显示的问题。

List of functions

Schema | Name | Result data type | Argument data types | Type

------------+---------------+------------------+---------------------+--------

pg_catalog | convert | text | text, name | normal

pg_catalog | convert | text | text, name, name | normal

PostgreSQL 9.x, 将源编码字符串的字节流转换为指定编码的字符串的字节流返回。

避免了显示的问题。

List of functions

Schema | Name | Result data type | Argument data types | Type

------------+--------------+------------------+---------------------+--------

pg_catalog | convert | bytea | bytea, name, name | normal

pg_catalog | convert_from | text | bytea, name | normal

pg_catalog | convert_to | bytea | text, name | normal

如果8.x需要避免显示问题，返回字节流，可以这样使用，推荐使用。

byteain(textout(convert(字符,'源编码','目标编码')))

例子

当前数据库编码为UTF-8，中文排序未按拼音排序。

postgres=# l

List of databases

Name | Owner | Encoding | Collate | Ctype | Access privileges

-----------+----------+----------+---------+-------+-----------------------

db0 | postgres | UTF8 | C | C |

postgres | postgres | UTF8 | C | C |

template0 | postgres | UTF8 | C | C | =c/postgres +

| | | | | postgres=CTc/postgres

template1 | postgres | UTF8 | C | C | =c/postgres +

| | | | | postgres=CTc/postgres

(4 rows)

postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by id;

id

--------

刘少奇

刘德华

(2 rows)

按拼音排序方法(目标可以改成EUC_CN)

8.x

postgres=> select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK')));

id

--------

刘德华

刘少奇

(2 rows)

9.x

postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by convert(id::bytea,'UTF-8','GBK');

id

--------

刘德华

刘少奇

(2 rows)

注意多音字

中文有一些多音字，比如重庆(chongqing), 但是编码时它可能是按zhong编码的，所以看这个例子。

postgres=> select * from (values ('中山'), ('重庆')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK')));

id

------

中山

重庆

(2 rows)

索引

表达式索引即可，使用immutable function.

代码

8.x

postgres=> df+ convert

List of functions

Schema | Name | Result data type | Argument data types | Type | Data access | Volatility | Owner | Language | Source code | Description

------------+---------+------------------+---------------------+--------+-------------+------------+-----------+----------+-------------+---------------------------------------------------------

pg_catalog | convert | text | text, name | normal | no sql | stable | xxx | internal | pg_convert | convert string with specified destination encoding name

pg_catalog | convert | text | text, name, name | normal | no sql | stable | xxx | internal | pg_convert2 | convert string with specified encoding names

(2 rows)

9.x

postgres=# df+ convert

List of functions

Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code | Description

------------+---------+------------------+---------------------+--------+------------+----------+----------+----------+-------------------+----------+-------------+----------------------------------------------

pg_catalog | convert | bytea | bytea, name, name | normal | stable | safe | postgres | invoker | | internal | pg_convert | convert string with specified encoding names

(1 row)

src/backend/utils/mb/mbutils.c

/*

* Convert string between two arbitrary encodings.

*

* BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)

*/

Datum

pg_convert(PG_FUNCTION_ARGS)

{

bytea *string = PG_GETARG_BYTEA_PP(0);

char *src_encoding_name = NameStr(*PG_GETARG_NAME(1));

int src_encoding = pg_char_to_encoding(src_encoding_name);

char *dest_encoding_name = NameStr(*PG_GETARG_NAME(2));

int dest_encoding = pg_char_to_encoding(dest_encoding_name);

const char *src_str;

char *dest_str;

bytea *retval;

int len;

if (src_encoding < 0)

ereport(ERROR,

(errcode(ERRCODE_INVALID_PARAMETER_VALUE),

errmsg("invalid source encoding name "%s"",

src_encoding_name)));

if (dest_encoding < 0)

ereport(ERROR,

(errcode(ERRCODE_INVALID_PARAMETER_VALUE),

errmsg("invalid destination encoding name "%s"",

dest_encoding_name)));

/* make sure that source string is valid */

len = VARSIZE_ANY_EXHDR(string);

src_str = VARDATA_ANY(string);

pg_verify_mbstr_len(src_encoding, src_str, len, false);

/* perform conversion */

dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str,

len,

src_encoding,

dest_encoding);

/* update len if conversion actually happened */

if (dest_str != src_str)

len = strlen(dest_str);

/*

* build bytea data type structure.

*/

retval = (bytea *) palloc(len + VARHDRSZ);

SET_VARSIZE(retval, len + VARHDRSZ);

memcpy(VARDATA(retval), dest_str, len);

if (dest_str != src_str)

pfree(dest_str);

/* free memory if allocated by the toaster */

PG_FREE_IF_COPY(string, 0);

PG_RETURN_BYTEA_P(retval);

}

-END-

云栖社区

ID：yunqiinsight

云计算丨互联网架构丨大数据丨机器学习丨运维

博客地址：http://blog.yoqi.me/?p=1349

喜欢 0

PostgreSQL 按拼音排序 – convert to GBK/EUC_CN coding

这篇文章还没有评论

发表评论点击取消回复