I really don’t want to know the encoding. I only want the data. In other words, I don’t want to think. I don’t want to open Notepad++ and convert between encodings.
My old standby doesn’t work on files whose encoding isn’t ANSI (ASCII, cp1252, whatever):
f = open("poo.txt", "r")
lines = f.readlines()
f.close()
for line in lines:
    dosomething(line)
I have had enough. (I am also venturing into Python 3 as I have been on Python 2 forever but that is a different story.)
The following code reads a file in any of several encodings and splits its contents into lines:
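def DecodeBytes(byteArray, codecs=('utf-8', 'utf-16')):
    # Try each codec in order and return the first successful decode.
    for codec in codecs:
        try:
            return byteArray.decode(codec)
        except UnicodeDecodeError:
            pass
    # No codec worked.
    return None

def ReadLinesFromFile(filename):
    # Read raw bytes so that no encoding is assumed up front.
    with open(filename, "rb") as file:
        rawbytes = file.read()
    content = DecodeBytes(rawbytes)
    if content is None:
        # Nothing could be decoded, so there are no lines to return.
        return []
    # splitlines() handles \n, \r\n and \r (and other Unicode line
    # boundaries), whereas split(os.linesep) only matches the line
    # ending of the machine running the script.
    return content.splitlines()

lines = ReadLinesFromFile("poo.txt")
for line in lines:
    dosomething(line)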
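If you need more encodings, add them to the default value of the codecs parameter (or make it more elegant as you see fit). Order matters: the first codec that decodes without error wins, so put the strictest ones first.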
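One catch with trying codecs in turn: 'utf-16' will happily decode many byte strings that were never UTF-16, so a short cp1252 file can come back as garbage instead of raising an error. Here is a rough sketch of one way to make it "more elegant": look for a byte order mark first, then try strict UTF-8, then fall back to a single-byte codec. The name decode_bytes and the cp1252 fallback are my own assumptions for illustration, not part of the code above, and this is still a guess at the encoding, not real detection.

```python
import codecs

def decode_bytes(raw, fallback="cp1252"):
    """Decode raw bytes: BOM first, then strict UTF-8, then a fallback codec."""
    # A byte order mark, when present, is a strong hint about the encoding.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")  # utf-8-sig strips the BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")  # the codec reads the BOM for byte order
    try:
        # Strict UTF-8 rarely accepts text written in another encoding.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; errors="replace" keeps this call from raising.
        return raw.decode(fallback, errors="replace")

# Example: the same word stored four different ways decodes identically.
for name in ("utf-8", "utf-8-sig", "utf-16", "cp1252"):
    print(name, repr(decode_bytes("caf\u00e9".encode(name))))
```

To get lines out of it, read the file in "rb" mode as before and call splitlines() on the result of decode_bytes.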