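Common characters are all fine… but anything slightly more obscure that is still a Big5 character, e.g. 苽, won't come out…
All I get is the ISO Unicode code point.
I spent ages thinking I had written something wrong, but Java / VB.NET handle it fine, so it looks like Python's codec is broken or incomplete.
Don't use Python for Chinese Big5 processing.
You'll only torture yourself.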
#!/usr/bin/python
# -*- coding: utf-8 -*-
import logging
import re
import sqlite3

logger = logging.getLogger('catch_all')
conn = sqlite3.connect('table.db')


def main():
    filepath = 'mapped_char.txt'
    cnt = 0
    with open(filepath, encoding='utf-8') as fp:
        for line in fp:
            # Each line is "code,text"; skip lines without both fields
            result = [x.strip() for x in line.split(',')]
            if len(result) < 2:
                continue
            code = result[0]
            for text in result[1]:
                # Only characters in the Private Use Area are of interest here
                if not isCharacterBig5(text):
                    # Raises UnicodeEncodeError ("illegal multibyte sequence")
                    # when the codec has no mapping for the character
                    a = text.encode('hkscs')
                    print(a.hex())
                # insertRecord(code, text)
            cnt = cnt + 1
    conn.commit()
    conn.close()


def insertRecord(code, text):
    cursor = conn.cursor()
    cursor.execute("INSERT INTO dict (code, text) VALUES (?, ?)", (code, text))


def isCharacterBig5(text):
    # True when the character is outside the Private Use Area (U+E000-U+F8FF)
    pattern = '[^\uE000-\uF8FF]'
    return re.match(pattern, text) is not None


# def convertToUnicode():
# def getBig5HexCode(text):

if __name__ == '__main__':
    main()
Halfway through writing this, encode('hkscs') kept complaining about an illegal multibyte sequence.
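To see which characters trip the codec without the whole script dying on the first one, the call can be wrapped so the failure is caught. This is only a minimal sketch: `try_encode` and `is_private_use` are made-up helper names, and `big5hkscs` is the codec name that `hkscs` is an alias for in Python.

```python
def try_encode(ch, codec='big5hkscs'):
    """Return the encoded bytes as a hex string, or None if the codec has no mapping."""
    try:
        return ch.encode(codec).hex()
    except UnicodeEncodeError:
        # CPython's CJK codecs report this case as "illegal multibyte sequence"
        return None


def is_private_use(ch):
    """True for a single character in the BMP Private Use Area (U+E000-U+F8FF)."""
    return '\ue000' <= ch <= '\uf8ff'


print(try_encode('中'))           # a common character encodes fine
print(try_encode('\ueb51'))       # the code point mentioned in this post
print(is_private_use('\ueb51'))   # shows whether it sits in the Private Use Area
```

Anything that comes back as `None` is a character the codec cannot map, which is a quicker way to list the problem cases than reading tracebacks one at a time.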
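Damn it. It turns out the codec simply has no mapping for this character… (even though HKSCS really does include it). All you get is the Unicode code point \ueb51. You can find that value on the government site too, in the ISO column, but it just isn't the BIG-5 code.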
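Since the government table lists the Big5 code next to the ISO code points, one possible workaround is to build your own lookup from it. A rough sketch only, assuming each data row is a set of whitespace-separated hex columns with the Big5 code first and the ISO 10646 code points after it (that layout is an assumption, so check it against the real file). `load_pua_to_big5` is a hypothetical helper, and the sample rows are made up for illustration, not real entries from the table.

```python
import re

HEX = re.compile(r'[0-9A-Fa-f]{4,5}')


def load_pua_to_big5(lines):
    """Map Private Use Area characters to the Big5 code found in the same row.

    Assumed row layout: <big5-hex> <iso-hex> [<iso-hex> ...], whitespace separated.
    Rows whose first column is not plain hex (headers, comments) are skipped.
    """
    table = {}
    for line in lines:
        cols = line.split()
        if len(cols) < 2 or not HEX.fullmatch(cols[0]):
            continue
        big5 = int(cols[0], 16)
        for col in cols[1:]:
            if not HEX.fullmatch(col):
                continue  # e.g. a column that is not a single code point
            cp = int(col, 16)
            if 0xE000 <= cp <= 0xF8FF:
                table[chr(cp)] = big5
    return table


# Made-up rows in the assumed layout, NOT real values from the table
sample = [
    'Big5   ISO-A  ISO-B',
    '8840   E001   1234',
    '8841   5678   5678',
]
mapping = load_pua_to_big5(sample)
print({hex(ord(k)): hex(v) for k, v in mapping.items()})
```

With the real file saved locally, the open file object could be passed as `lines`, and a looked-up value turned into bytes with `value.to_bytes(2, 'big')`.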
ref: https://bugs.python.org/issue28693 http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt