Getting lines of a file of any encoding type in Python

I really don’t want to know the encoding. I only want the data. In other words, I don’t want to think. I don’t want to open Notepad++ and convert between encodings.

My old standby fails on files that aren’t in an ANSI-style encoding (ASCII, cp1252, whatever):

f = open("poo.txt", "r")
lines = f.readlines()
f.close()
for line in lines:
  dosomething(line)
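
To see the failure concretely, here is a minimal sketch. It assumes a throwaway UTF-16 file (the name sample_utf16.txt and the helper try_text_read are made up for the illustration) and forces the ASCII codec explicitly, because the encoding that open() picks by default in text mode depends on the platform and locale.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write("first line\nsecond line\n")

def try_text_read(filename, encoding):
    """Return the lines, or the name of the decode error that was raised."""
    try:
        with open(filename, "r", encoding=encoding) as f:
            return f.readlines()
    except UnicodeError as err:
        return type(err).__name__

failed = try_text_read(path, "ascii")    # wrong codec for this file
worked = try_text_read(path, "utf-16")   # right codec for this file
print(failed)
print(worked)
```

The first call reports a decode error instead of lines; the second only works because the encoding was known in advance, which is exactly the thinking I want to avoid.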

I have had enough.  (I am also venturing into Python 3 as I have been on Python 2 forever but that is a different story.)

The following code reads a file in any of several encodings and splits its contents into lines:

import os

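def DecodeBytes(byteArray, codecs=('utf-8', 'utf-16')):
  # Try each codec in order and return the first successful decode.
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except UnicodeError:
      pass  # wrong guess; try the next codec
  return None  # no codec in the list could decode the bytes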

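def ReadLinesFromFile(filename):
  # Read raw bytes so no encoding is assumed up front.
  with open(filename, "rb") as file:
    rawbytes = file.read()
  content = DecodeBytes(rawbytes)
  if content is None:
    return []  # nothing decoded; an empty list keeps the caller's loop safe
  # splitlines() handles \n, \r\n and \r regardless of the current OS.
  return content.splitlines()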

lines = ReadLinesFromFile("poo.txt")
for line in lines:
  dosomething(line)

If you need more encodings, add them to the default value of the codecs argument (or pass your own list, or make it more elegant as you see fit).
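
One caveat when extending the list: order matters. Here is a rough sketch, not a definitive recipe. The function name decode_with_fallbacks, the particular codec order, and the sample strings are my own assumptions for illustration. It returns the codec that worked alongside the text so you can see which guess won.

```python
def decode_with_fallbacks(raw, codecs=("utf-8-sig", "utf-16", "cp1252", "latin-1")):
    """Try each codec in order; return (text, codec) for the first that works."""
    for codec in codecs:
        try:
            return raw.decode(codec), codec
        except UnicodeError:
            continue  # wrong guess; try the next codec
    return None, None  # unreachable with latin-1 in the list, kept for custom lists

text = "café\nnaïve\n"
samples = {
    "utf-8": text.encode("utf-8"),
    "utf-16": text.encode("utf-16"),   # starts with a byte order mark
    # 11 bytes: the odd length is what makes the utf-16 guess fail here.
    "cp1252": text.encode("cp1252"),
}
results = {name: decode_with_fallbacks(raw) for name, raw in samples.items()}
for name, (decoded, codec) in results.items():
    print(name, "->", codec, decoded.splitlines())
```

Strict codecs go first and permissive ones last: latin-1 accepts any byte sequence, so nothing listed after it is ever tried, and cp1252 rejects only a handful of byte values. Also note that utf-16 will happily decode many even-length byte strings into garbage, so this kind of guessing is best-effort rather than reliable detection.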

 
