Getting lines of a file of any encoding type in Python

I really don’t want to know the encoding. I only want the data. In other words, I don’t want to think. I don’t want to open notepad++ and convert between types of encoding.

My old standby doesn’t work on various file encodings that aren’t ansi (ascii, cp1252, whatever):

f = open("poo.txt", "r")
lines = f.readlines()
f.close()
for line in lines:
  dosomething(line)

I have had enough. (I am also venturing into Python 3 as I have been on Python 2 forever but that is a different story.)

The following code will read a file of different encoding and split them into lines:

import os

def DecodeBytes(byteArray, codecs=['utf-8', 'utf-16']):
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except:
      pass

def ReadLinesFromFile(filename):
  file = open(filename, "rb")
  rawbytes = file.read()
  file.close()
  content = DecodeBytes(rawbytes)
  if content is not None:
    return content.split(os.linesep)

lines = ReadLinesFromFile("poo.txt")
for line in lines:
  dosomething(line)

If you need to add encodings, simply add them to the codecs default assignment (or make it more elegant as you deem).

Geek Droppings

Getting lines of a file of any encoding type in Python

Leave a Reply Cancel reply

The poop on being a geek