Implementing an automatic file-encoding conversion module in Python

I have already fallen into the pits of handling text across platforms in Python, so here is what I found.

Background

Default encodings by operating system

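Operating system	Default encoding
Windows	GBK (Simplified Chinese locale)
macOS	UTF-8 (the "Unicode" files discussed below are UTF-16 with a BOM)
Linux	UTF-8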
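
The table is only a rule of thumb, since the real default depends on the locale of the machine. It can be queried at run time. A minimal sketch, assuming Python 3 and only the standard library (the function name describe_default_encodings is made up for illustration):

```python
import locale
import sys


def describe_default_encodings():
    """Return the encodings Python falls back to on this machine."""
    return {
        # Locale encoding, used by open() in text mode when none is given
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes
        "filesystem": sys.getfilesystemencoding(),
        # Default for str.encode() and bytes.decode(); 'utf-8' on Python 3
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_default_encodings().items():
        print(name, value)
```

Running it on each platform you care about is a quick way to check what the table claims for your own setup.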

How Python converts between encodings

source encoding --decode--> Unicode --encode--> target encoding

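When Python processes a string, decode turns the bytes in the source encoding into Unicode text, and encode then turns that text into bytes in the target encoding. Chaining the two calls converts text from one encoding to another.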
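
The two-step conversion can be sketched as a small helper. This assumes Python 3, where the input is a bytes object; the name transcode and its parameters are illustrative and not part of the module further down:

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Re-encode a byte string: source bytes -> Unicode text -> target bytes."""
    text = data.decode(source_encoding)    # step 1: bytes to Unicode text
    return text.encode(target_encoding)    # step 2: Unicode text to bytes


if __name__ == "__main__":
    # Two Chinese characters, written as escapes so the source stays ASCII
    gbk_bytes = "\u4e2d\u6587".encode("gbk")
    utf8_bytes = transcode(gbk_bytes, "gbk")
    print(len(gbk_bytes), len(utf8_bytes))  # prints "4 6"
```

GBK uses two bytes per Chinese character and UTF-8 uses three here, which is why the byte length changes even though the text does not.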

Using chardet

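# Get the file's encoding (f is a file object opened in binary mode)
code_style = chardet.detect(f.read()).get('encoding')
# chardet.detect() returns a dict that holds the detected encoding and a confidence value.
# The longer the data, the higher the confidence, and the more likely the decode is correct.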
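
One detail the snippet above skips is that the 'encoding' entry can be None when chardet cannot decide. A minimal sketch of a wrapper that handles this, assuming Python 3; guess_encoding and its fallback parameter are hypothetical names, and the import guard is only there so the sketch still loads when the third-party chardet package is missing:

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # assumption: fall back quietly if it is not installed
    chardet = None


def guess_encoding(data, fallback="utf-8"):
    """Return (encoding, confidence) for a byte string."""
    if chardet is None or not data:
        # Nothing to detect with, or nothing to detect
        return fallback, 0.0
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    # 'encoding' is None when chardet cannot make a guess
    encoding = result.get("encoding") or fallback
    confidence = result.get("confidence") or 0.0
    return encoding, confidence


if __name__ == "__main__":
    print(guess_encoding(b"plain ASCII text"))
```

Checking the confidence before decoding is a cheap guard: a low value is a hint to read more of the file or to ask the user.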

Pitfalls, big ones

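The files I got from macOS were saved as "Unicode", but chardet reports them as UTF-16LE, not as the "Unicode" you might expect.
Mac editors write a BOM at the start of the file, and Python does not skip it when decoding, so decoding the content directly raises an error.
I could not find any material on this.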
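
Part of the explanation is that the codec name decides what happens to the BOM: in Python 3, 'utf-16' reads the BOM and drops it, while 'utf-16-le' keeps it as the character U+FEFF. A minimal sketch of BOM-aware decoding using only the standard library; decode_with_bom and the _BOMS table are illustrative names, not something the module below uses:

```python
import codecs

# BOM -> codec that consumes it. UTF-32 comes first because its
# little-endian BOM begins with the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(data, default_encoding):
    """Decode bytes, letting a leading BOM override default_encoding."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data.decode(codec)
    return data.decode(default_encoding)


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    print(repr(sample.decode("utf-16-le")))           # BOM kept as U+FEFF
    print(repr(decode_with_bom(sample, "utf-16-le")))  # BOM removed
```

This is the same idea as the special case in the module below, which decodes with 'utf16' whenever chardet answers UTF-16LE.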

A home-made file transcoding module

# coding: utf-8
import os

import chardet
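# Write the content of an open file object to disk
def save_file_to_disk(f, fp):
    # Rewind the file object before reading it
    f.seek(0, os.SEEK_SET)
    try:
        # f must yield bytes; the with block closes the file even on error
        with open(fp, 'wb') as new_file:
            new_file.write(f.read())
        return True
    except Exception as e:
        print(e)
        return False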
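# Convert every file in directory fp to UTF-8, in place
def change_file_to_utf8(fp):
    for name in os.listdir(fp):
        path = os.path.join(fp, name)
        # Skip subdirectories, which cannot be opened as files
        if not os.path.isfile(path):
            continue
        # Read the raw bytes once and close the file right away
        with open(path, 'rb') as f_reader:
            raw = f_reader.read()
        # Detect the encoding; chardet gives None when it cannot guess
        code_style = chardet.detect(raw).get('encoding')
        # Decode from the detected encoding, then encode as UTF-8
        try:
            # Special case for files from macOS: 'utf16' consumes the BOM
            if code_style == 'UTF-16LE':
                content_change = raw.decode('utf16', 'ignore').encode('utf8')
            else:
                content_change = raw.decode(code_style).encode('utf8')
        except Exception as e:
            print("Transcoding failed")
            print("Detected encoding: %s" % code_style)
            print(str(e))
            # Leave the original file untouched instead of truncating it
            continue
        # Binary mode, because content_change is a byte string
        with open(path, 'wb') as new_file:
            new_file.write(content_change)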