Don’t use Python for Big5-HKSCS encoding

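Common characters are fine, but anything slightly more obscure that is still a Big5 character, e.g. 苽, cannot be output...

All you get is the ISO Unicode code point.

I spent ages assuming I had written something wrong, but Java and VB.NET handle it fine, so Python's codec is probably buggy or incomplete.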
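To see which characters the codec rejects, one rough check is to encode each character on its own and collect the failures. This is only a sketch: `find_unencodable` is a name made up here, not something from the script below, and it uses `big5hkscs`, the standard codec name that `hkscs` is an alias for.

```python
def find_unencodable(text, codec='big5hkscs'):
    """Return the characters in text that the given codec cannot encode.

    Hypothetical helper for illustration; not part of the original script.
    """
    failed = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            # The codec has no mapping for this character
            failed.append(ch)
    return failed


if __name__ == '__main__':
    # An emoji stands in for a character that is certainly outside Big5-HKSCS
    for ch in find_unencodable('中文😀'):
        print('cannot encode U+%04X' % ord(ch))
```

Running it over the characters in your own data file gives a list to compare against what Java or VB.NET produce.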

Don't use Python for Chinese Big5 processing.

You will only torture yourself.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sqlite3
import re
import logging
import binascii

logger = logging.getLogger('catch_all')
conn = sqlite3.connect('table.db')

def main():
    filepath = 'mapped_char.txt'
    cnt = 0
    with open(filepath, encoding='utf-8') as fp:
        # Each line is expected to look like "code,text"
        for line in fp:
            result = [x.strip() for x in line.split(',')]
            code = result[0]
            for text in result[1]:
                # Only Private Use Area characters reach this branch
                if not isCharacterBig5(text):
                    # Raises UnicodeEncodeError ("illegal multibyte sequence")
                    # when the codec has no mapping for the character
                    a = text.encode('hkscs')
                    print(a.hex())
                # insertRecord(code, text)
            cnt = cnt + 1
    conn.commit()
    conn.close()

def insertRecord(code, text):
    cursor = conn.cursor()
    cursor.execute("INSERT INTO dict (code, text) VALUES (?, ?)", (code, text))

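def isCharacterBig5(text):
    # True when the character lies outside the Unicode Private Use Area
    # (U+E000-U+F8FF), i.e. it has a regular, non-PUA code point
    pattern = '[^\uE000-\uF8FF]'
    return re.match(pattern, text) is not None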

#def convertToUnicode():


#def getBig5HexCode(text):


main()

Halfway through writing this, encode('hkscs') kept complaining about an illegal multibyte sequence.
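While debugging, it can help to stop the exception from aborting the whole loop. A minimal sketch using the standard `errors` argument of `str.encode`: with `backslashreplace`, anything the codec cannot map is written as an escape like `\uXXXX`, so the offending code points are easy to spot. The helper name `encode_lenient` is made up for illustration.

```python
def encode_lenient(text, codec='big5hkscs'):
    """Encode text, writing a backslash escape for anything unmappable.

    Hypothetical helper; useful for diagnosis only, since the escapes
    are not valid Big5 data.
    """
    return text.encode(codec, errors='backslashreplace')


if __name__ == '__main__':
    # The emoji is certainly outside Big5-HKSCS, so it comes out escaped
    print(encode_lenient('中😀'))
```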

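Damn. It turns out the codec simply does not have this character (even though the real HKSCS does). All you get back is the Unicode code point \ueb51. You can look it up in the ISO column on the government's site, but that is not the Big5 code.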
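If you are stuck with Python anyway, one possible workaround is to keep your own table for the code points the codec refuses and consult it only on failure. This is a sketch under assumptions: `encode_with_fallback` and `extra_map` are hypothetical names, the table would have to be filled in by hand from the government mapping file, and the bytes in the usage example are placeholders, not real Big5 codes.

```python
def encode_with_fallback(text, extra_map, codec='big5hkscs'):
    """Encode text one character at a time, using extra_map on failure.

    extra_map is a hypothetical dict of {character: bytes} supplied by the
    caller for characters the codec has no mapping for.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in extra_map:
                # Nothing known about this character; keep the original error
                raise
            out += extra_map[ch]
    return bytes(out)


if __name__ == '__main__':
    # b'??' is a PLACEHOLDER, not the real Big5 code for U+EB51;
    # the real bytes must be looked up in the mapping table
    extra = {'\ueb51': b'??'}
    print(encode_with_fallback('中\ueb51', extra).hex())
```

Encoding one character at a time keeps the sketch short, but it treats every code point separately, so it is not suitable for text that relies on combining sequences.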

ref: https://bugs.python.org/issue28693 http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt