Category Archives: Python

Determine differences in lists in Python

Sometimes, you just want to take two lists like a previous one and a new one and get back information about what is new, what is old, and what is the same.  Here is one way to do it.

def compareLists(old, new):
  oldset = set(old)
  addlist = [x for x in new if x not in oldset]
  newset = set(new)
  dellist = [x for x in old if x not in newset]
  samelist = [x for x in old if x in newset]
  return (addlist, dellist, samelist)
old = ['One', 'Two', 'Three', 'Four']
new = ['One', 'Two', 'Five']

(addlist, dellist, samelist) = compareLists(old, new)
print("addlist=", addlist)
print("dellist=", dellist)
print("samelist=", samelist)

what you get is:

('addlist=', ['Five'])
('dellist=', ['Three', 'Four'])
('samelist=', ['One', 'Two'])

With this, you can take “instantiation” action with the “add” list, take “removal” action with the “del” list, and simply update data with the “same” list.

Starveless Priority Queue in Python

The python priority queue describes messages in absolute terms.  The most important element will always be popped.  The least important element may never be popped.

Sometimes when a message based data path is serialized to something (like a serial port based piece of hardware), all messages have some importance.  Unlike the priority queue, we would rather not use such absolute terms.  We would rather say that most of the time, the important messages are sent.  However, we guarantee that some of the time, the less important messages are sent.

The language is more gray and the queue needs to be a bit more…  “cooperative”.  (We could say “round robin”.)

Let’s implement our own.  But first, a quick back story.  I have some knowledge of Motorola’s (at the time) Time Processor Unit (TPU) on some of their embedded processors.  The TPU had an event driven model where certain sections of microcode would execute based on priority but lower priority microcode would still operated periodically.  Let’s base our design on this philosophy.  (See section 3 for more information on the TPU scheduler.)

Python has a pretty good queue framework that you derive your queue from and implement several methods.  If you ever look at the priority queue code in python, it is pretty simplistic as it relies on the heapq.  There is much work done in the Queue class including thread synchronization.

Here is our proposed round robin priority queue.

#! /usr/bin/python

import Queue

class RRPriorityQueue(Queue.Queue):
  '''Variant of Queue that retrieves open entries in priority order (lowest first)
  but allows highest priorities to be skipped.

  '''

  def _init(self, priorityOrder):
    if priorityOrder is None:
      raise TypeError

    self._queue = []

    self._priorityOrder = priorityOrder
    self._orderIdx = len(self._priorityOrder) + 1

  def _qsize(self, len=len):
    return len(self._queue)

  def _put(self, item):
    self._queue.append(item)

  def _get(self):
    if (len(self._queue) == 0):
      raise IndexError

    retVal = None

    # change the order index for the priority order list
    self._orderIdx = self._orderIdx + 1
    if self._orderIdx >= len(self._priorityOrder):
      self._orderIdx = 0

    # the priority we are interested in and the highest value
    # in case we don't find one at our priority
    filterPriority = self._priorityOrder[self._orderIdx]
    highest = self._queue[0]

    # look for one of our priority but saving highest priority (lowest number)
    for item in self._queue:
      priority = item.priority
      if priority == filterPriority:
        # we found our item
        retVal = item
        break
      if highest.priority > item.priority:
        highest = item

    # if we didn't find our item, the return the highest
    if retVal is None:
      retVal = highest

    self._queue.remove(retVal)
    return retVal

The _qsize and _put methods are basic.  The _init method requires a priority order.  This specifies the priority guaranteed to pop from the queue for the corresponding _get call.  The _orderIdx is used to know which step of the priority order the algorithm is on.  It increments or resets on every _get call.

The _get call is where the magic happens.  The first step is to increment or reset the _orderIdx value.  Then we are going to get our filterPriority from the _priorityOrder list based on the _orderIdx.  Then we iterate through the queue.  If we find an entity that matches our priority, return it.

However, what happens if we don’t find a message with our priority?  That is ok, we simply return the highest priority item first found in the queue.  The highest variable is a reference to the first item in the queue with the highest priority and is returned if we don’t find a match.

The only thing we require from the entities in the queue is that they implement a priority member (or method) that returns the priority number.  This deviates from the priority queue in that the queue wasn’t concerned about the priority number, just that an entity in the queue could compare to others.  In our case, we need to know the priority to filter the entities the way we want to.

Like the priority queue, the lower the priority number, the more important it is.

We can create our round robin queue like this:

self.q = RRPriorityQueue([1,2,1,3,1,2,1])

In this case, priority 1 messages are guaranteed to pop 4 out of 7 times.  Priority 2 message are guaranteed 2 out of 7.  Priority 3 – 1 out of 7.

You can make whatever pattern with whatever priorities you want.  Just understand that if a priority of an entity is not in the priority order, it can be starved.  It will be popped if it is the most important message when a message of the current filter order isn’t found.

 

Two Important SQLite and Python Lessons

When using SQLite3 and Python (2.x), there are two important lessons that are not obvious (at least not to me).

1. Dictionaries and TimeStamps

Ideally, I would like to do two things.  First, access data from a dictionary and not a list.  It is far more intuitive to access by column name (or query name substitution) than by list index.  Second, the datetime values should be correctly coerced even though SQLite has no implicit timestamp type.  I believe these two simple requirements are expected by most people, especially those family with Microsoft SQL and ADO.NET.

Good news, Python supports this!  However, there are some switches to set so to speak.

After importing sqlite3, the following connect statement will suffice for both needs:

conn = sqlite3.connect(dbname, detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)

The PARSE_COLNAMES returns a dictionary for each row fetched instead of a list.

The PARSE_DECLTYPES empowers python to do a bit more type conversion based on the queries provided.  For example, if you do this:

cur.execute('select mytime from sometable')

You will get mytime as a string.  However, if you do this:

cur.execute('select mytime as "[timestamp]" from sometable')

You will get mytime as a datetime type in python.  This is very useful.

One step further; let’s say you want a timestamp but substitute the column name in the query.  Do this:

cur.execute('select mytime as "mytime [timestamp]" from sometable')

Not only with the data be returned as a datetime object, the dictionary will contain column name substitution provided in the query.  Beware, if you don’t do this on aggregate functions, the python sqlite library will attempt to add to the dictionary with an empty string key.  (Not sure why this is but beware.)

The adapters and converters live in the file dpapi2.py in your python library installation directory for sqlite3.

Refer to the documentation here.

2. Execute and Tuples

This is really a lesson on declaring implicit tuples.

The execute method on a cursor is great because it will safely convert strings for use in the database (reducing the possibility of SQL injection).  It takes a query string and a substitution tuple.

However, I got caught again on the example below as will many who follow me.

This works:

cur.execute('insert into mytable values(?, ?, ?, ?)', (1, 2, 3, 4))

This doesn’t work and throws an exception:

cur.execute('insert into mytable values(?)', (1))

This works:

cur.execute('insert into mytable values(?)', (1,))

The comma after the one declares a tuple of one element.  Argh!!!

I hope this helps.

 

Simple Python JSON server based on jsonrpclib

I needed a simple python JSON server executing in its own thread but that was easily extensible.   Let’s get right to the base class code (or super class for those who build down).

#! /usr/bin/python

import threading

import jsonrpclib
import jsonrpclib.SimpleJSONRPCServer

class JsonServerThread (threading.Thread):
  def __init__(self, host, port):
    threading.Thread.__init__(self)
    self.daemon = True
    self.stopServer = False
    self.host = host
    self.port = port
    self.server = None

  def _ping(self):
    pass

  def stop(self):
    self.stopServer = True
    jsonrpclib.Server("http://" + str(self.host) + ":" + str(self.port))._ping()
    self.join()

  def run(self):
    self.server = jsonrpclib.SimpleJSONRPCServer.SimpleJSONRPCServer((self.host, self.port))
    self.server.logRequests = False
    self.server.register_function(self._ping)

    self.addMethods()

    while not self.stopServer:
      self.server.handle_request()
    self.server = None

  # defined class definitions

  def addMethods(self):
    pass

So the idea is simple,   Derive a new class from this and implement the addMethods method and the methods themselves.

#! /usr/bin/python

import jsonServer

class JsonInterface(jsonServer.JsonServerThread):
  def __init__(self, host, port):
    jsonServer.JsonServerThread.__init__(self, host, port)
    self.directory = directory

  def addMethods(self):
    self.server.register_function(self.doOneThing)
    self.server.register_function(self.doAnother)

  def doOneThing(self, obj):
    return obj

  def doAnother(self):
    return "why am I doing something else?"

In the derived class, implement the methods and register them in addMethods.  That is all.  Now we can worry simply about implementation.  Be aware of any threading synchronization of exception handling.  Jsonrpclib takes care of exception handling as well and converts it into a JSON exception.

One last item of note.  In the base class, the stop method is interesting.  Since handle_request() is a blocking call in the thread, we need to set the “stop” flag and make a simple request.  The _ping method does this for us.  Then we join on the thread waiting for it to end gracefully.

The jsonrpclib is a very useful library and well done.  By the way, this example is for Python 2.7.  On Ubuntu 14.04, you can install this using “apt-get install python-jsonrpclib”.

Pulling Documents for Searching

In a prior post, I noted how to set up elasticsearch with apache2.  In this post, we will look at how to cache a set of files on your web server  from a windows share and index them.

To do this, we need to do the following steps:

  1. Initialize the index the first time.
  2. Mount a share.
  3. Rsync the data between the machines.
  4. Get the files that exist on the SMB share.
  5. Read what has been indexed.
  6. Diff the lists from steps 3 and 4.
  7. Index the new files on the share.
  8. Delete (the index and file) the files that no longer exist on the share.

By the way, there was a lot done in python 2.7 (as opposed to python 3x in some other posts I have).

Initialize the Index

The following script will “reset” the index and create it new.

#! /usr/bin/python

import httplib 
import binascii
import os
import glob
import socket

import hostinfo

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

socket.setdefaulttimeout(15)
conn = httplib.HTTPConnection(hostinfo.HOST)
connInitialize(conn)
connRefresh(conn)

Mount a SMB share

On Ubuntu, you will need to install cif-utils:  “sudo apt-get install cifs-utils”.

Once done, you can mount it by using the following command.  Choose your own mount point obviously and be prepared with your domain password.

sudo mount -t cifs //10.0.4.240/General /mnt/cifs -ousername=maksym.shyte,ro

Rsync Between Server and SMB Share

The easiest way to do this is to create a file list that you want to search for.  Then use that list to rsync with.  This leaves you with copied files with efficiency and a text file list of the files on the SMB share.

function addToList {
  find "$1" -name \*.pdf -o -name \*.doc -o -name \*.docx -o -name \*.xls -o -name \*.xlsx -o -name \*.ppt -o -name \*.pptx -o -name \*.txt | grep -v ".AppleDouble" | grep -v "~$" >> "$2"
}

cd /mnt/cifs

addToList . $currentPath/rsynclist.txt
#addToList ./Some\ Directory $currentPath/rsynclist.txt

rsync -av --files-from=rsynclist.txt /mnt/cifs /var/www/search/data

Read the Index

To read the index, the following script will pull the indexes out and write them to a file.  This will include the name of the document and the key.  You will need to take the step of revolving the path from the previous file list with this index as they are related by the source and destination directory passed to rsync.

#! /usr/bin/python

import httplib 
import json
import sys
import os
import codecs

import hostinfo

argc = len(sys.argv)
if argc != 2:
    print os.path.basename(sys.argv[0]), ""
    sys.exit(-1)

indexFileName = sys.argv[1]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

conn = httplib.HTTPConnection(hostinfo.HOST)
data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))

print data
total = data["hits"]["total"]

#scroll session id, used to request the next batch of data
scrollId = data["_scroll_id"]
counter = 0; 

data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))

#print data

f = codecs.open(indexFileName, "w", "utf8")

while len(data["hits"]["hits"]) > 0:
    for item in data["hits"]["hits"]: 
        f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
        f.flush()

    counter = counter + len(data["hits"]["hits"])
    print "Reading Index:", counter, "of", total

    scrollId = data["_scroll_id"]
    resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
    #print resp
    data = json.loads(resp)

f.close()

Diff the File List and the Index List

Next we need to diff the two.  We want to know the files we need to index and the files we want to delete.  The following script does that (presuming that the lists have been modified to point at the same directory – i.e. /var/www/search/data).  Out comes an “add” text file and a “delete” text file.

#! /usr/bin/python

import sys
import os

argc = len(sys.argv)
if argc != 5:
    print os.path.basename(sys.argv[0]), "   "
    sys.exit(-1)

def createMap(filename):
    ret = {}
    f = open(filename)
    lines = f.readlines()
    f.close()
    for line in lines:
        line = line.replace('\n','')
        split = line.split(',', 1)
        key = split[0]
        ret[key] = line
    return ret

fileMap = createMap(sys.argv[1])
indexMap = createMap(sys.argv[2])

# if the entry is in fileMap but not indexMap, it goes into the add file
# if the entry is in indexMap but not fileMap, it goes into the delete file
add = {}

for key in fileMap:
    if indexMap.has_key(key):
        del indexMap[key]
    else:
        add[key] = fileMap[key]

f = open(sys.argv[3], "w")
for key in add:
    f.write(add[key] + '\n');
f.close()

f = open(sys.argv[4], "w")
for key in indexMap:
    f.write(indexMap[key] + '\n');
f.close()

Add to the Index

Next we iterate through all the files in the “add” list.

#! /usr/bin/python

import httplib 
import binascii
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), " "
    sys.exit(-1)

rootFsDir = sys.argv[2] 

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connAddFile(conn, filename, rootFsDir):
    title = os.path.basename(filename)
    location = filename[len(rootFsDir):]

    with open(filename, 'rb') as f:
        data = f.read()

    if len(data) > hostinfo.LARGEST_BASE64_ATTACHMENT:
        print 'Not indexing because the file is too large', len(data)
    else:
        print 'Indexing file size', len(data)
        base64Data = binascii.b2a_base64(data)[:-1]
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', hostinfo.INDEX + '/attachment/', attachment)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)
#connInitialize(conn)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

rootFsDir = rootFsDir + '/'

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    filename = rootFsDir + line
    print idx, filename
    try:
        connAddFile(conn, filename, rootFsDir)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

connRefresh(conn)

Delete the Files Not Needed

Finally, we delete the index and physical files no longer needed.

#! /usr/bin/python

import httplib 
import binascii
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), " "
    sys.exit(-1)

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connDeleteFile(conn, index):
    print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    split = line.split(',')
    filename = split[0]
    index = split[1]
    print "Delete:", idx, filename, index
    try:
        connDeleteFile(conn, index)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

    try:
    	os.remove(sys.argv[2] + '/' + filename)    
    except:
        pass

connRefresh(conn)

There it is.  I have all these steps including resolving the path between the file list and the index list.  One further thing to note is that the hostinfo file referenced by the python scripts look like this:

#! /usr/bin/python

HOST = '127.0.0.1:9200'
SITE = ''

INDEX = SITE + '/basic'

LARGEST_BASE64_ATTACHMENT = 50000000

 

A Search Engine for Office Documents

Have you ever worked at a place where there was a mass of files and documents on  a share and even old timers forget where important documents are?

Search by file name stinks and SharePoint has been another excuse to dump stuff that gets lost.

So I decided to figure out an easy way to get a content search engine up looking through the files on a share.    I found a solution.  It isn’t pristine for these reasons.

  1. Browsers can’t link to files on a share for obvious security reasons.
  2. For reason one, the decision was made to copy searchable documents onto the web server.  This is time consuming to transfer and duplicates information but the documents are served successfully.
  3. For reason two, it would be possible to add an server plugin that reads and delivers a file on a share.  Just haven’t done that yet.

So we will start with what we have and consider changing it later.

The basis for this will be Ubuntu 12.04 LTS.  Why?  Because I have such a machine handy and it is 9 years old.  This will be based on all the wonderful work of elasticsearch and Lucene.

So, here are the steps.  Remember, this is a bit hacky.

  1. Install apache2.  (In the case of Ubuntu, it is “sudo apt-get install apache2”.)
  2. Install openjdk-7-jre-headless.  (“sudo apt-get isntall openjdk-7-jre-headless”).
  3. Download elasticsearch (from elasticsearch.org – the .com site takes you to pay-for products).  Because I am using Ubuntu, I thought I would use the apt repository.
  4. Follow the steps to start elasticsearch – in my case listed on the web site.  Be advised that elasticsearch binds to all interfaces tp a free port between 9200 and 9300.  We will assume that the port is 9200 as it is in my case.  However, it probably should only bind to a port on localhost or at least, the security should be evaluated to make sure it complies with what you need.
  5. We will need two plugins.  You can install them from you elasticsearch/bin location.  In my case it was /usr/share/elasticsearch/bin/plugin.
    bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.0.0
    bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/1.0.1-2.0.0

    Restart elasticsearch. (“sudo service elasticsearch restart”).  You will also need to verify the versions of these plugins.

  6. For apache2, make sure to enable the proxy, proxy_http, and ssl modules.  On Ubuntu, the “a2enmod” is an easy utility to do this.
  7. In my Apache setup, I added a new file called “elasticsearch” inside /etc/apache2/conf.d.  (Note the 13.10 doesn’t use a conf.d directory.   It could be added to the bottom of apach2.conf although I am sure there is a more “pristine” location.)  The contents are below.
    <IfModule proxy_module>
    <IfModule proxy_http_module>
    
    <Proxy *>
    <Limit GET > 
        allow from all 
    </Limit>
    
    <Limit POST PUT DELETE>
        order deny,allow 
        deny from all 
    </Limit>
    </Proxy>
    
    ProxyPreserveHost On
    ProxyRequests Off
    LogLevel debug
    ProxyPass /es http://localhost:9200/
    ProxyPassReverse /es http://localhost:9200/
    
    </IfModule>
    </IfModule>

    The application depends on the /es directory under web root. This can be changed along with the web pages that use it.

  8. Restart apache2.  (“sudo service apache2 restart”)
  9. Download the HTML and Javascript for the search pages from here:  Search HTML and Javascript.  It uses jQuery and jQueryUI and AJAX to perform the searching and suggestions.  Unzip and place in the web directory where you want it.  For me, I wanted a search subdirectory so I placed my in /var/www/search.
  10. So, the last thing is show how to index the files.  I am a fan of python so this is python code making http requests to elasticsearch adding the information.  The script below deletes the index, recreates, and starts adding content to it – from files in a directory.
    #! /usr/bin/python
    
    import httplib 
    import binascii
    import os
    
    HOST = 'localhost:9200'
    INDEX = '/basic'
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    def connAddFile(conn, filename, rootFsDir, httpPrefix):
        with open(filename, 'rb') as f:
            base64Data = binascii.b2a_base64(f.read())[:-1]
    
        title = os.path.basename(filename)
        location = httpPrefix + filename[len(rootFsDir):]
    
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', INDEX + '/attachment/', attachment)
    
    conn = httplib.HTTPConnection(HOST)
    
    print connRequest(conn, 'DELETE', INDEX)
    
    print connRequest(conn, 'PUT', INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    
    print connRequest(conn, 'PUT', INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )
    
    # Add files here repeatedly
    rootFsDir = '/var/www/search/data/'
    searchDir = ''          # This is for recursion through the directories
    httpPrefix = 'data/'
    # Make this recursive some day
    for file in os.listdir(rootFsDir + searchDir):
        connAddFile(conn, rootFsDir + searchDir + file, rootFsDir, httpPrefix)
    
    print connRequest(conn, 'POST', '/_refresh')
  11. If you decide to get more creative and add only new files and delete the old ones, we need to understand how to get the list of existing files that are indexed.  Then you just have to correlate the current state of the files on disk with the index list.  This script gets the indexes and the files associated with them.
    #! /usr/bin/python
    
    import httplib 
    import json
    import sys
    import os
    
    import hostinfo
    
    argc = len(sys.argv)
    if argc != 2:
        print os.path.basename(sys.argv[0]), ""
        sys.exit(-1)
    
    indexFileName = sys.argv[1]
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    conn = httplib.HTTPConnection(hostinfo.HOST)
    data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))
    
    total = data["hits"]["total"]
    
    #scroll session id, used to request the next batch of data
    scrollId = data["_scroll_id"]
    counter = 0; 
    
    data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
    
    f = open(indexFileName, 'w')
    
    while len(data["hits"]["hits"]) > 0:
        for item in data["hits"]["hits"]:
            f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
            f.flush()
    
        counter = counter + len(data["hits"]["hits"])
        print "Reading Index:", counter, "of", total
    
        scrollId = data["_scroll_id"]
        resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
        #print resp
        data = json.loads(resp)
    
    f.close()
  12. To delete files, the python snippet looks like this where index is the id for the file we want indexing deleted for.
    def connDeleteFile(conn, index):
        print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

So there we have it.  All we have to do figure out where we are getting our data from and copy it to the “data” directory.  One particular way I have done this is with rsync across an SMB share.

This by no means is meant to be a lesson on elasticsearch.  There can be some improvement here.

However, this is a quick way to set up searching documents for information you never knew existed.  (Side note:  I have had 10 ms search times across 2500 documents.)

 

Grepping any type of file encoding in Python

Let’s take handling any encoding of files one step further.

We need to look for specific text in in files in a directory regardless of encoding.  Here is one way in Python.

#! /usr/bin/python
import sys
import os.path
import os
import re
import fnmatch

def DecodeBytes(byteArray, codecs=['utf-8', 'utf-16']):
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except:
      pass

def ReadLinesFromFile(filename):
  file = open(filename, "rb")
  rawbytes = file.read()
  file.close()
  content = DecodeBytes(rawbytes)
  if content is not None:
    return content.split(os.linesep)

# this came from http://stackoverflow.com/questions/1863236/grep-r-in-python
# with a substitution of ReadLinesFromFile and a file name match filter
def RecursiveGrep(pattern, dir, match):
  r = re.compile(pattern)
  for parent, dnames, fnames in os.walk(dir):
    fnames = fnmatch.filter(fnames, match)
    for fname in fnames:
      filename = os.path.join(parent, fname)
      if os.path.isfile(filename):
        lines = ReadLinesFromFile(filename)
        if lines is not None:
          idx = 0
          for line in lines:
            if r.search(line):
              yield filename + "|" + str(idx) + "|" + line.strip()	
              idx += 1

lines = RecursiveGrep("needle", "\yourpath", "*.cs")

The will recurse all subdirectories, looking in all .cs files to find needed returning the data in this format (pipe separated):

full file path|line number|line content

Very useful on Windows with multilingual files.

Getting lines of a file of any encoding type in Python

I really don’t want to know the encoding.  I only want the data.  In other words, I don’t want to think.  I don’t want to open notepad++ and convert between types of encoding.

My old standby doesn’t work on various file encodings that aren’t ansi (ascii, cp1252, whatever):

f = open("poo.txt", "r")
lines = f.readlines()
f.close()
for line in lines:
  dosomething(line)

I have had enough.  (I am also venturing into Python 3 as I have been on Python 2 forever but that is a different story.)

The following code will read a file of different encoding and split them into lines:

import os

def DecodeBytes(byteArray, codecs=['utf-8', 'utf-16']):
  for codec in codecs:
    try:
      return byteArray.decode(codec)
    except:
      pass

def ReadLinesFromFile(filename):
  file = open(filename, "rb")
  rawbytes = file.read()
  file.close()
  content = DecodeBytes(rawbytes)
  if content is not None:
    return content.split(os.linesep)

lines = ReadLinesFromFile("poo.txt")
for line in lines:
  dosomething(line)

If you need to add encodings, simply add them to the codecs default assignment (or make it more elegant as you deem).

 

Hysteria with PYROmania (special URIs)

I had a need to modify code written using Pyro so that objects on localhost could be exposed remotely.  I have never worked with Pyro before so I was in some hysteria.  I was ready to try though.

There is no definitive guide anywhere using the special names in the URI (actually I found something here and here later).  Love bites.

If you look at Pryo servers, the URI from the daemon is a very long string like the following:

PYRO://127.0.0.1:62100/c0a8006516bc7752e7526becdb059ce9

That is a rather long URI and the number changes on every service start up (obviously like a GUID or time based id).  So on the client side, is this my URI?  Is it too late for love?

Well, no. Here is a quick guide for how to use the special URI strings and you can be a Pyro Animal.

For name servers, you can use:

PYRONAME://<hostname>[:<port>]/<objectname>

For straight remote access (otherwise called the regular method):

PYROLOC://<hostname>[:<port>]/<objectname>

So, it isn’t as bad as it initially seems. No Foolin’.  Finally, Armageddon It.

Comparing huge files using Python

I had the need to compare two huge files.

None of the file compare tools that I tried could handle large files, so I decided to write my own compare utility.

Comparing large files is not a common case. I could have solved my issue by loading each file into a database, then using the excellent RedGate DataCompare  against the two tables. Heck for that matter – load both in one one table and do a GROUP BY. But I had several files and loading the database would have been tedious.

Use cases could include replacing an ETL process with a new process or upgrading a web service that generates large datasets. For me, it was a system migration of an ETL process.

The script was done ‘quick and dirty’, but really came in handy.

What this does is opens the two files, and starts reading them with a configurable line buffer (set to 10 currently). It looks to see if lines from the second file are in the buffer of the first file. It records misses and increases the buffer. It outputs misses for later review and will stop processing if there are too many misses (configurable).

That is it – easy

### Program to compare two large files
### fails fast – order of rows is important
### smart enough to look ahead for matching rows
fs2 = “C:/Temp/File2.txt”
fs1 = “C:/Temp/File1.txt”
ofs = “C:/Temp/diffb.txt”
maxerrors = 500
Lq1 = []
Lq2 = []
rcount1 = 0
rcount2 = 0
isFound = False
#setup – read first n rows
n = 10
f1pointer = n
f2pointer = n
f1 = open(fs1)
f2 = open(fs2)
of = open(ofs, ‘w’)
for x in range(0,n):
    Lq2.append(f2.readline())
# important stuff
errorsFound = 0
rowcounter = 0
goodcounter = 0
row = f1.readline()
while errorsFound < maxerrors and len(row) > 10:
        isFound = False
        if (rowcounter%10000 == 0) :
            print “.”,
        rowcounter = rowcounter + 1
        for y in range(0,len(Lq2)):
            if (row == Lq2[y]):
                isFound = True
                Lq2.pop(y)
                Lq2.append(f2.readline())
                break
        if  not isFound:
            errorsFound = errorsFound + 1
            of.write(row)
            for x in range(0,n):
                Lq2.append(f2.readline())
        row = f1.readline()
print “”, “done – rows processed: “, rowcounter
f1.close()
f2.close()
of.close()
print len(Lq2)