Comparing huge files using Python

I had the need to compare two huge files.

None of the file compare tools that I tried could handle large files, so I decided to write my own compare utility.

Comparing large files is not a common case. I could have solved my issue by loading each file into a database, then using the excellent RedGate DataCompare  against the two tables. Heck for that matter – load both in one one table and do a GROUP BY. But I had several files and loading the database would have been tedious.

Use cases could include replacing an ETL process with a new process or upgrading a web service that generates large datasets. For me, it was a system migration of an ETL process.

The script was done ‘quick and dirty’, but really came in handy.

What this does is opens the two files, and starts reading them with a configurable line buffer (set to 10 currently). It looks to see if lines from the second file are in the buffer of the first file. It records misses and increases the buffer. It outputs misses for later review and will stop processing if there are too many misses (configurable).

That is it – easy

### Program to compare two large files
### fails fast – order of rows is important
### smart enough to look ahead for matching rows
fs2 = “C:/Temp/File2.txt”
fs1 = “C:/Temp/File1.txt”
ofs = “C:/Temp/diffb.txt”
maxerrors = 500
Lq1 = []
Lq2 = []
rcount1 = 0
rcount2 = 0
isFound = False
#setup – read first n rows
n = 10
f1pointer = n
f2pointer = n
f1 = open(fs1)
f2 = open(fs2)
of = open(ofs, ‘w’)
for x in range(0,n):
    Lq2.append(f2.readline())
# important stuff
errorsFound = 0
rowcounter = 0
goodcounter = 0
row = f1.readline()
while errorsFound < maxerrors and len(row) > 10:
        isFound = False
        if (rowcounter%10000 == 0) :
            print “.”,
        rowcounter = rowcounter + 1
        for y in range(0,len(Lq2)):
            if (row == Lq2[y]):
                isFound = True
                Lq2.pop(y)
                Lq2.append(f2.readline())
                break
        if  not isFound:
            errorsFound = errorsFound + 1
            of.write(row)
            for x in range(0,n):
                Lq2.append(f2.readline())
        row = f1.readline()
print “”, “done – rows processed: “, rowcounter
f1.close()
f2.close()
of.close()
print len(Lq2)

 

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *