GB2312 Character Set and Encoding Reference Tables (Herong's Notes)

Artvine posted on 2005-3-16 13:33:53

Version 3.04
Herong Yang

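Contents

Preface
GB2312 Character Set and Encoding
GB2312 Character Set
GB2312 Encoding
Relationship Between GB2312 and Unicode
Program for Generating the GB2312 to Unicode Conversion Table
Symbol Rows 01-09

Level 1 Hanzi Rows 16-55
Level 2 Hanzi Rows 56-87

Program for Generating the Unicode to GB2312 Conversion Table

Unicode to GB2312 Conversion Table

References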
########################################
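Preface

This handbook tabulates every character in GB2312, the Chinese national standard character set for hanzi, together with its GB2312 code and the corresponding Unicode code. It also includes the reverse table, from Unicode to GB2312. The programs that generated these tables are included as well.

Revision history:

Version 3.04, 2004: minor revisions.
Version 3.00, 2003: reformatted for print.
Version 2.00, 1999: reformatted as web pages.
Version 1.00, 1997: first draft completed.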
#################################
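GB2312 Character Set and Encoding

GB2312 Character Set

GB2312 is the designation of a Chinese character set and encoding. Its full Chinese title is "信息交换用汉字编码字符集" (Coded Chinese Character Set for Information Interchange). It was issued by the National Bureau of Standards of the People's Republic of China and took effect on May 1, 1981. GB abbreviates the Hanyu Pinyin of "国标" (guobiao, "national standard").

The GB2312 character set contains only simplified Chinese characters plus commonly used letters and symbols. It is used mainly in mainland China and Singapore.

GB2312 contains 7445 characters in total: 6763 simplified hanzi and 682 letters and symbols.

GB2312 arranges its characters in 94 rows (区, qu), numbered 01 to 94; each row has 94 positions (位, wei), numbered 01 to 94. Every GB2312 character is identified by a unique row number and position number. For example, the hanzi "啊" is at row 16, position 01.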

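Layout of the GB2312 rows:

Row    Count  Category

01        94  General symbols
02        72  Ordinal numbers
03        94  Latin letters
04        83  Japanese hiragana
05        86  Japanese katakana
06        48  Greek letters
07        66  Russian (Cyrillic) letters
08        63  Hanyu Pinyin symbols
09        76  Graphic symbols
10-15         Reserved
16-55   3755  Level 1 hanzi, ordered by pinyin
56-87   3008  Level 2 hanzi, ordered by radical and stroke count
88-94         Reserved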
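
As a quick arithmetic check, the row counts in the table can be summed to confirm the totals quoted earlier (682 letters and symbols, 6763 hanzi, 7445 characters). The following is only a small sketch; the class and method names are made up for this illustration.

```java
class Gb2312RowCounts {
    // Character counts for symbol rows 01..09, copied from the table.
    static final int[] SYMBOL_ROWS = {94, 72, 94, 83, 86, 48, 66, 63, 76};
    static final int LEVEL1_HANZI = 3755; // rows 16-55
    static final int LEVEL2_HANZI = 3008; // rows 56-87

    // Sum of the nine symbol rows.
    static int symbolTotal() {
        int sum = 0;
        for (int n : SYMBOL_ROWS) sum += n;
        return sum;
    }

    static int hanziTotal() {
        return LEVEL1_HANZI + LEVEL2_HANZI;
    }

    static int grandTotal() {
        return symbolTotal() + hanziTotal();
    }

    public static void main(String[] args) {
        // prints: 682 + 6763 = 7445
        System.out.println(symbolTotal() + " + " + hanziTotal() + " = " + grandTotal());
    }
}
```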

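This handbook lists every GB2312 character with its row and position numbers.

GB2312 Encoding

The original GB2312 encoding represents every character with two bytes. The first, or "high", byte is the character's row number plus 32; the second, or "low", byte is its position number plus 32. For example, the hanzi "啊" is at row 16, position 01. Its high byte is 16 + 32 = 48 (0x30) and its low byte is 01 + 32 = 33 (0x21), giving the code 0x3021.

The reason for adding 32 to the row and position numbers is presumably to stay clear of the low byte values 0x00 to 0x20, which ASCII uses for control characters and the space.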
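
The add-32 rule can be written out as a few lines of Java. This is a minimal sketch of the arithmetic only; the class name QuWeiOriginal and the method originalCode are hypothetical and do not appear in the handbook's programs.

```java
class QuWeiOriginal {
    // Packs a row (qu) and a position (wei), each in 1..94, into the
    // two-byte original GB2312 code: high byte = qu + 32, low byte = wei + 32.
    static int originalCode(int qu, int wei) {
        if (qu < 1 || qu > 94 || wei < 1 || wei > 94) {
            throw new IllegalArgumentException("qu and wei must be in 1..94");
        }
        int hi = qu + 32;
        int lo = wei + 32;
        return (hi << 8) | lo;
    }

    public static void main(String[] args) {
        // Row 16, position 01 is the hanzi from the example; prints 0x3021
        System.out.printf("0x%04X%n", originalCode(16, 1));
    }
}
```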

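Because the bytes of the original GB2312 encoding overlap with ASCII, the GB2312 encoding in common use today adds 128 to each of the two original bytes. For example, the hanzi "啊" is at row 16, position 01. Its original code is 0x3021 and its common code is 0xB0A1.

Unless stated otherwise, "GB2312" usually refers to this modified encoding.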
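
Since 32 + 128 = 160 = 0xA0, the common encoding amounts to adding 0xA0 to the row and position numbers, which is also how the programs later in this handbook build their bytes. The sketch below shows the conversion in both directions; the class QuWeiEuc and its methods are illustrative names, not part of the original programs.

```java
class QuWeiEuc {
    // Row/position to the common two-byte code: each byte is the number + 0xA0.
    static int toCommonCode(int qu, int wei) {
        if (qu < 1 || qu > 94 || wei < 1 || wei > 94) {
            throw new IllegalArgumentException("qu and wei must be in 1..94");
        }
        return ((qu + 0xA0) << 8) | (wei + 0xA0);
    }

    // Common two-byte code back to {qu, wei}.
    static int[] toQuWei(int code) {
        int qu = ((code >> 8) & 0xFF) - 0xA0;
        int wei = (code & 0xFF) - 0xA0;
        return new int[] {qu, wei};
    }

    public static void main(String[] args) {
        // prints 0xB0A1
        System.out.printf("0x%04X%n", toCommonCode(16, 1));
        int[] qw = toQuWei(0xB0A1);
        // prints 16 1
        System.out.println(qw[0] + " " + qw[1]);
    }
}
```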

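This handbook lists every GB2312 character with its code.

Relationship Between GB2312 and Unicode

The GB2312 character set is a subset of the Unicode character set. In other words, every character in GB2312 is also in Unicode.

The GB2312 encoding and the Unicode encoding, however, have nothing in common. The same hanzi has entirely different codes in the two. For example, the GB2312 code of the hanzi "啊" is 0xB0A1, whereas its Unicode code is 0x554A.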
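
The example can be reproduced with the JDK's charset support. The sketch below decodes the two bytes 0xB0 0xA1 and returns the resulting UTF-16 code unit. It assumes the running JDK ships the "GBK" charset (the same charset the handbook's programs use); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class GbToUnicodeDemo {
    // Decodes one two-byte GB code via the JDK's GBK charset and returns
    // the first UTF-16 code unit of the result.
    static int decodeGb(int hi, int lo) {
        byte[] bytes = {(byte) hi, (byte) lo};
        String s = new String(bytes, Charset.forName("GBK"));
        return s.charAt(0);
    }

    public static void main(String[] args) {
        // prints 0x554A
        System.out.printf("0x%04X%n", decodeGb(0xB0, 0xA1));
    }
}
```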

For every GB2312 character, this handbook lists the corresponding Unicode code and UTF-8 (Unicode Transformation Format - 8-bit) encoding.
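
To illustrate how a Unicode code turns into UTF-8 bytes, the sketch below hand-encodes a code in the three-byte range (0x0800 to 0xFFFF, where the hanzi fall) and can be compared with the JDK's own encoder. The class Utf8Demo and its helpers are illustrative names only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Demo {
    // Encodes a code in 0x0800..0xFFFF (surrogates excluded) as three
    // UTF-8 bytes: 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] utf8ThreeBytes(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a three-byte UTF-8 code");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    // Formats bytes as upper-case hex digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = utf8ThreeBytes(0x554A);
        byte[] jdk = "\u554A".getBytes(StandardCharsets.UTF_8);
        // prints E5958A true
        System.out.println(hex(manual) + " " + Arrays.equals(manual, jdk));
    }
}
```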

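Artvine posted on 2005-3-16 13:35:47

Program for Generating the GB2312 to Unicode Conversion Table

The character and hanzi code tables in this handbook were generated by the following program.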

/**
* GB2312Unicde.java
* Copyright (c) 1997-2003 by Dr. Herong Yang
*/
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class GB2312Unicde {
static OutputStream out = null;
static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                        '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
static int b_out[] = {201,267,279,293,484,587,625,657,734,782,827,
   874,901,980,5590};
static int e_out[] = {216,268,280,294,494,594,632,694,748,794,836,
   894,903,994,5594};
public static void main(String[] args) {
   try {
      out = new FileOutputStream("gb2312.gb");
      writeCode();
      out.close();
   } catch (IOException e) {
      System.out.println(e.toString());
   }
}
public static void writeCode() throws IOException {
   boolean reserved = false;
   String name = null;
   // GB2312 is not supported by JDK. So I am using GBK.
   CharsetDecoder gbdc = Charset.forName("GBK").newDecoder();
   CharsetEncoder uxec = Charset.forName("UTF-16BE").newEncoder();
   CharsetEncoder u8ec = Charset.forName("UTF-8").newEncoder();
   ByteBuffer gbbb = null;
   ByteBuffer uxbb = null;
   ByteBuffer u8bb = null;
   CharBuffer cb = null;
   int count = 0;
   for (int i=1; i<=94; i++) {
      // Defining row settings
      if (i>=1 && i<=9) {
         reserved = false;
         name = "Graphic symbols";
      } else if (i>=10 && i<=15) {
         reserved = true;
         name = "Reserved";
      } else if (i>=16 && i<=55) {
         reserved = false;
         name = "Level 1 characters";
      } else if (i>=56 && i<=87) {
         reserved = false;
         name = "Level 2 characters";
      } else if (i>=88 && i<=94) {
         reserved = true;
         name = "Reserved";
      }
      // writing row title
      writeln();
      writeString("<p>");
      writeNumber(i);
      writeString(" Row: "+name);
      writeln();
      writeString("</p>");
      writeln();
      if (!reserved) {
         writeln();
         writeHeader();
      // looping through all characters in one row
         for (int j=1; j<=94; j++) {
            byte hi = (byte)(0xA0 + i);
            byte lo = (byte)(0xA0 + j);
            if (validGB(i,j)) {
               // getting GB, UTF-16BE, UTF-8 codes
               gbbb = ByteBuffer.wrap(new byte[]{hi,lo});
               try {
                  cb = gbdc.decode(gbbb);
                  uxbb = uxec.encode(cb);
                  cb.rewind();
                  u8bb = u8ec.encode(cb);
               } catch (CharacterCodingException e) {
                  cb = null;
                  uxbb = null;
                  u8bb = null;
               }
            } else {
               cb = null;
               uxbb = null;
               u8bb = null;
            }
            writeNumber(i);
            writeNumber(j);
            writeString(" ");
            if (cb!=null) {
               writeByte(hi);
               writeByte(lo);
               writeString(" ");
               writeHex(hi);
               writeHex(lo);
               count++;
            } else {
               writeGBSpace();
               writeString(" null");
            }
            writeString(" ");
            writeByteBuffer(uxbb,2);
            writeString(" ");
            writeByteBuffer(u8bb,3);
            if (j%2 == 0) {
               writeln();
            } else {
               writeString(" ");
            }
         }
         writeFooter();
      }
   }
   System.out.println("Number of GB characters wrote: "+count);
}
public static void writeln() throws IOException {
   out.write(0x0D);
   out.write(0x0A);
}
public static void writeByte(byte b) throws IOException {
   out.write(b & 0xFF);
}
public static void writeByteBuffer(ByteBuffer b, int l)
   throws IOException {
   int i = 0;
   if (b==null) {
      writeString("null");
      i = 2;
   } else {
      for (i=0; i<b.limit(); i++) writeHex(b.get(i));
   }
   // pad each missing byte with two spaces so the columns stay aligned
   for (int j=i; j<l; j++) writeString("  ");
}
public static void writeGBSpace() throws IOException {
   out.write(0xA1);
   out.write(0xA1);
}
public static void writeString(String s) throws IOException {
   if (s!=null) {
      for (int i=0; i<s.length(); i++) {
         out.write((int) (s.charAt(i) & 0xFF));
      }
   }
}
public static void writeNumber(int i) throws IOException {
   String s = "00" + String.valueOf(i);
   writeString(s.substring(s.length()-2,s.length()));
}
public static void writeHex(byte b) throws IOException {
   out.write((int) hexDigit[(b >> 4) & 0x0F]);
   out.write((int) hexDigit[b & 0x0F]);
}
public static void writeHeader() throws IOException {
   writeString("<pre>");
   writeln();
   writeString("Q.W. ");
   writeGBSpace();
   writeString(" GB   Uni. UTF-8 ");
   writeString(" ");
   writeString("Q.W. ");
   writeGBSpace();
   writeString(" GB   Uni. UTF-8 ");
   writeln();
   writeln();
}
public static void writeFooter() throws IOException {
   writeString("</pre>");
   writeln();
}
public static boolean validGB(int i,int j) {
   for (int l=0; l<b_out.length; l++) {
      // b_out[l]..e_out[l] is one unassigned range, written as row*100+position
      if (i*100+j>=b_out[l] && i*100+j<=e_out[l]) return false;
   }
   return true;
}
}

Artvine posted on 2005-3-16 13:40:11

Symbol Rows 01-09
http://www.geocities.com/herong_yang/gb2312_gb/symbol.html
Level 1 Hanzi Rows 16-55
http://www.geocities.com/herong_yang/gb2312_gb/pinyin.html
Level 2 Hanzi Rows 56-87
http://www.geocities.com/herong_yang/gb2312_gb/bihua.html

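Program for Generating the Unicode to GB2312 Conversion Table

After I published the GB2312 to Unicode conversion table, readers wrote in asking for a table in the opposite direction, from Unicode to GB2312.

The following program produces such a table. Its output is included in the next chapter.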

/**
* UnicodeGB2312.java
* Copyright (c) 1997-2003 by Dr. Herong Yang
*/
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class UnicodeGB2312 {
static OutputStream out = null;
static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                        '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
static int b_out[] = {201,267,279,293,484,587,625,657,734,782,827,
   874,901,980,1001,5590,8801};
static int e_out[] = {216,268,280,294,494,594,632,694,748,794,836,
   894,903,994,1594,5594,9494};
public static void main(String[] a) {
   try {
      out = new FileOutputStream("unicode_gb2312.gb");
      writeCode();
      out.close();
   } catch (IOException e) {
      System.out.println(e.toString());
   }
}
public static void writeCode() throws IOException {
   CharsetEncoder gbec = Charset.forName("GBK").newEncoder();
   char[] ca = new char[1];
   CharBuffer cb = null;
   ByteBuffer gbbb = null;
   writeHeader();
   int count = 0;
   // walk every 16-bit code; codes the encoder rejects are skipped
   for (int i=0; i<0x010000; i++) {
      ca[0] = (char) i;
      cb = CharBuffer.wrap(ca);
      try {
         gbbb = gbec.encode(cb);
      } catch (CharacterCodingException e) {
         gbbb = null;
      }
      if (validGB(gbbb)) {
         count++;
         writeHex((byte) (ca[0] >>> 8));
         writeHex((byte) (ca[0] & 0xff));
         writeString(" ");
         writeByteBuffer(gbbb,2);
         writeString(" ");
         writeByte(gbbb.get(0));
         writeByte(gbbb.get(1));
         if (count%5 == 0) writeln();
         else writeString(" ");
      }
   }
   if (count%5 != 0) writeln();
   writeFooter();
   System.out.println("Number of GB characters wrote: "+count);
}
public static boolean validGB(ByteBuffer gbbb) {
   if (gbbb==null) return false;
   else if (gbbb.limit()!=2) return false;
   else {
      byte hi = gbbb.get(0);
      byte lo = gbbb.get(1);
      if ((hi&0xFF)<0xA0) return false;
      if ((lo&0xFF)<0xA0) return false;
      int i = (hi&0xFF) - 0xA0;
      int j = (lo&0xFF) - 0xA0;
      if (i<1 || i>94) return false;
      if (j<1 || j>94) return false;
      for (int l=0; l<b_out.length; l++) {
         // b_out[l]..e_out[l] is one unassigned range, written as row*100+position
         if (i*100+j>=b_out[l] && i*100+j<=e_out[l]) return false;
      }
   }
   return true;
}
public static void writeHeader() throws IOException {
   writeString("<pre>");
   writeln();
   writeString("Uni. GB   ");
   writeGBSpace();
   writeString(" ");
   writeString("Uni. GB   ");
   writeGBSpace();
   writeString(" ");
   writeString("Uni. GB   ");
   writeGBSpace();
   writeString(" ");
   writeString("Uni. GB   ");
   writeGBSpace();
   writeString(" ");
   writeString("Uni. GB   ");
   writeGBSpace();
   writeln();
   writeln();
}
public static void writeFooter() throws IOException {
   writeString("</pre>");
   writeln();
}
public static void writeln() throws IOException {
   out.write(0x0D);
   out.write(0x0A);
}
public static void writeGBSpace() throws IOException {
   out.write(0xA1);
   out.write(0xA1);
}
public static void writeByteBuffer(ByteBuffer b, int l)
   throws IOException {
   int i = 0;
   if (b==null) {
      writeString("null");
      i = 2;
   } else {
      for (i=0; i<b.limit(); i++) writeHex(b.get(i));
   }
   // pad each missing byte with two spaces so the columns stay aligned
   for (int j=i; j<l; j++) writeString("  ");
}
public static void writeString(String s) throws IOException {
   if (s!=null) {
      for (int i=0; i<s.length(); i++) {
         out.write((int) (s.charAt(i) & 0xFF));
      }
   }
}
public static void writeHex(byte b) throws IOException {
   out.write((int) hexDigit[(b >> 4) & 0x0F]);
   out.write((int) hexDigit[b & 0x0F]);
}
public static void writeByte(byte b) throws IOException {
   out.write(b & 0xFF);
}
}

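After the program above was published, readers wrote again asking for an explanation of how it works. The logic is in fact simple; only the following points need attention:

1. All codes of the Unicode character set lie between 0x0000 and 0xFFFF, so the subroutine writeCode() uses a loop in which the variable i runs through every possible Unicode code.

2. The key statement that converts a single Unicode code to a GB2312 code is gbec.encode(cb), which uses the Chinese encoding support of CharsetEncoder in the JDK. Note that GBK is an extension of GB2312, and the JDK provides only the GBK encoder.

3. Because the Unicode character set is larger than GB2312, many of the results of gbec.encode(cb) are either invalid or GBK extension codes, so the subroutine validGB() is used to screen them.

4. The rest of the program mainly formats the output table.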
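
The core of these points can be condensed into a short sketch: encode one character with the GBK encoder, then keep the result only if it is a two-byte code whose bytes both fall in 0xA1 to 0xFE. This is a simplified stand-in, not the program above: it omits the b_out/e_out gap tables, so GBK extension characters that sit inside the 94 x 94 grid are not filtered out. The class name and method are hypothetical, and the sketch assumes the running JDK provides the "GBK" charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class UnicodeToGbSketch {
    static final CharsetEncoder GBK = Charset.forName("GBK").newEncoder();

    // Returns the two-byte code for c, or -1 if c has no code inside the
    // 94 x 94 grid. Simplified check: gap tables are NOT applied here.
    static int toGb(char c) {
        ByteBuffer bb;
        try {
            bb = GBK.encode(CharBuffer.wrap(new char[] {c}));
        } catch (CharacterCodingException e) {
            return -1; // unmappable character or lone surrogate
        }
        if (bb.limit() != 2) return -1; // single-byte (ASCII) result
        int hi = bb.get(0) & 0xFF;
        int lo = bb.get(1) & 0xFF;
        if (hi < 0xA1 || hi > 0xFE || lo < 0xA1 || lo > 0xFE) return -1;
        return (hi << 8) | lo;
    }

    public static void main(String[] args) {
        // prints 0xB0A1
        System.out.printf("0x%04X%n", toGb('\u554A'));
        // prints -1 (ASCII letters encode to a single byte)
        System.out.println(toGb('A'));
    }
}
```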

Artvine posted on 2005-3-16 13:45:24

Unicode to GB2312 Conversion Table
http://www.geocities.com/herong_yang/gb2312_gb/ug_map.html

References

Code Pages Supported by Windows -- Windows codepages
http://www.microsoft.com/globaldev/reference/wincp.mspx

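getright posted on 2005-3-16 15:29:53

『1. All codes of the Unicode character set lie between 0x0000 and 0xFFFF, so the subroutine writeCode() uses a loop in which the variable i runs through every possible Unicode code.』

This statement is imprecise.
Only Unicode 1.0 was limited to the range 0 to 0xFFFF; the code space has been extended since Unicode 2. The current version, Unicode 4, covers the range 0 to 0x10FFFF.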

[ Last edited by getright on 2005-3-16 15:31 ]
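
The point about the code range can be checked against the JDK's own constants. The sketch below shows that codes above 0xFFFF need two 16-bit char units in Java, so a loop over single char values never visits them. As far as the conversion table is concerned this should be harmless, since the GB2312 characters all have codes at or below 0xFFFF. The class and method names are made up for this illustration.

```java
class CodePointRange {
    // Number of 16-bit char units Java needs to hold one Unicode code.
    static int unitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // True if a loop over single char values (0x0000..0xFFFF) visits this code.
    static boolean reachedByCharLoop(int codePoint) {
        return codePoint >= 0 && codePoint <= 0xFFFF;
    }

    public static void main(String[] args) {
        // prints 0x10FFFF
        System.out.printf("0x%X%n", Character.MAX_CODE_POINT);
        // prints 1 true
        System.out.println(unitsFor(0x554A) + " " + reachedByCharLoop(0x554A));
        // prints 2 false
        System.out.println(unitsFor(0x20000) + " " + reachedByCharLoop(0x20000));
    }
}
```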


湘里妹子学术网's Archiver
