Comparing huge files using Python

I had the need to compare two huge files.

None of the file compare tools that I tried could handle large files, so I decided to write my own compare utility.

Comparing large files is not a common case. I could have solved my issue by loading each file into a database, then using the excellent RedGate DataCompare against the two tables. Heck for that matter – load both in one one table and do a GROUP BY. But I had several files and loading the database would have been tedious.

Use cases could include replacing an ETL process with a new process or upgrading a web service that generates large datasets. For me, it was a system migration of an ETL process.

The script was done ‘quick and dirty’, but really came in handy.

What this does is opens the two files, and starts reading them with a configurable line buffer (set to 10 currently). It looks to see if lines from the second file are in the buffer of the first file. It records misses and increases the buffer. It outputs misses for later review and will stop processing if there are too many misses (configurable).

That is it – easy

### Program to compare two large files

### fails fast – order of rows is important

### smart enough to look ahead for matching rows

fs2 = “C:/Temp/File2.txt”

fs1 = “C:/Temp/File1.txt”

ofs = “C:/Temp/diffb.txt”

maxerrors = 500

Lq1 = []

Lq2 = []

rcount1 = 0

rcount2 = 0

isFound = False

#setup – read first n rows

n = 10

f1pointer = n

f2pointer = n

f1 = open(fs1)

f2 = open(fs2)

of = open(ofs, ‘w’)

for x in range(0,n):

Lq2.append(f2.readline())

# important stuff

errorsFound = 0

rowcounter = 0

goodcounter = 0

row = f1.readline()

while errorsFound < maxerrors and len(row) > 10:

isFound = False

if (rowcounter%10000 == 0) :

print “.”,

rowcounter = rowcounter + 1

for y in range(0,len(Lq2)):

if (row == Lq2[y]):

isFound = True

Lq2.pop(y)

Lq2.append(f2.readline())

break

if not isFound:

errorsFound = errorsFound + 1

of.write(row)

for x in range(0,n):

Lq2.append(f2.readline())

row = f1.readline()

print “”, “done – rows processed: “, rowcounter

f1.close()

f2.close()

of.close()

print len(Lq2)

Geek Droppings

Comparing huge files using Python

Leave a Reply Cancel reply

The poop on being a geek