Pulling Documents for Searching

In a prior post, I noted how to set up elasticsearch with apache2.  In this post, we will look at how to cache a set of files from a Windows share onto your web server and index them.

To do this, we need to do the following steps:

  1. Initialize the index the first time.
  2. Mount a share.
  3. Rsync the data between the machines.
  4. Get the files that exist on the SMB share.
  5. Read what has been indexed.
  6. Diff the lists from steps 4 and 5.
  7. Index the new files on the share.
  8. Delete (the index and file) the files that no longer exist on the share.

By the way, all of this was done in Python 2.7 (as opposed to the Python 3.x used in some of my other posts).

Initialize the Index

The following script will “reset” the index by deleting it and creating it anew.

#! /usr/bin/python

import httplib
import socket

import hostinfo

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

socket.setdefaulttimeout(15)
conn = httplib.HTTPConnection(hostinfo.HOST)
connInitialize(conn)
connRefresh(conn)
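
If you want to sanity-check the result, a quick settings query can be appended to the script above (a minimal sketch; /_settings is a standard elasticsearch endpoint and should echo back the shard and replica counts):

print connRequest(conn, 'GET', hostinfo.INDEX + '/_settings?pretty=1')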

Mount a SMB share

On Ubuntu, you will need to install cifs-utils:  “sudo apt-get install cifs-utils”.

Once done, you can mount the share using the following command.  Choose your own mount point, obviously, and be prepared with your domain password.

sudo mount -t cifs //10.0.4.240/General /mnt/cifs -ousername=maksym.shyte,ro

Rsync Between Server and SMB Share

The easiest way to do this is to create a list of the files that you want to be searchable.  Then use that list to rsync with.  This efficiently copies the files and leaves you with a text file listing the files on the SMB share.

#!/bin/bash

# Directory where the file list is written.
currentPath=$(pwd)

function addToList {
  find "$1" -name \*.pdf -o -name \*.doc -o -name \*.docx -o -name \*.xls -o -name \*.xlsx -o -name \*.ppt -o -name \*.pptx -o -name \*.txt | grep -v ".AppleDouble" | grep -v "~$" >> "$2"
}

# Start with an empty list since addToList appends.
> "$currentPath/rsynclist.txt"

cd /mnt/cifs

addToList . "$currentPath/rsynclist.txt"
#addToList ./Some\ Directory "$currentPath/rsynclist.txt"

rsync -av --files-from="$currentPath/rsynclist.txt" /mnt/cifs /var/www/search/data

Read the Index

To read the index, the following script will pull the index entries out and write them to a file.  This will include the name of the document and its key.  You will need to take the step of resolving the path from the previous file list against this index, as they are related by the source and destination directories passed to rsync.

#! /usr/bin/python

import httplib 
import json
import sys
import os
import codecs

import hostinfo

argc = len(sys.argv)
if argc != 2:
    print os.path.basename(sys.argv[0]), "<index-file>"
    sys.exit(-1)

indexFileName = sys.argv[1]

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

conn = httplib.HTTPConnection(hostinfo.HOST)
data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))

print data
total = data["hits"]["total"]

#scroll session id, used to request the next batch of data
scrollId = data["_scroll_id"]
counter = 0

data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))

#print data

f = codecs.open(indexFileName, "w", "utf8")

while len(data["hits"]["hits"]) > 0:
    for item in data["hits"]["hits"]: 
        f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
        f.flush()

    counter = counter + len(data["hits"]["hits"])
    print "Reading Index:", counter, "of", total

    scrollId = data["_scroll_id"]
    resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
    #print resp
    data = json.loads(resp)

f.close()
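
Each line of the resulting file pairs the stored location with the elasticsearch document id, separated by a comma.  A couple of illustrative lines (the ids here are made up):

Some Folder/Important Document.pdf,AUq3x2QxYFa7jZkP9wQ1
Specs/widget-design.docx,AUq3x2RfYFa7jZkP9wQ2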

Diff the File List and the Index List

Next we need to diff the two.  We want to know which files we need to index and which files we want to delete.  The following script does that (presuming that the lists have been modified to point at the same directory – i.e. /var/www/search/data).  Out come an “add” text file and a “delete” text file.

#! /usr/bin/python

import sys
import os

argc = len(sys.argv)
if argc != 5:
    print os.path.basename(sys.argv[0]), "<file-list> <index-list> <add-file> <delete-file>"
    sys.exit(-1)

def createMap(filename):
    ret = {}
    f = open(filename)
    lines = f.readlines()
    f.close()
    for line in lines:
        line = line.replace('\n','')
        split = line.split(',', 1)
        key = split[0]
        ret[key] = line
    return ret

fileMap = createMap(sys.argv[1])
indexMap = createMap(sys.argv[2])

# if the entry is in fileMap but not indexMap, it goes into the add file
# if the entry is in indexMap but not fileMap, it goes into the delete file
add = {}

for key in fileMap:
    if key in indexMap:
        del indexMap[key]
    else:
        add[key] = fileMap[key]

f = open(sys.argv[3], "w")
for key in add:
    f.write(add[key] + '\n')
f.close()

f = open(sys.argv[4], "w")
for key in indexMap:
    f.write(indexMap[key] + '\n')
f.close()
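
Invocation looks like this (the script name is just an example):

./difflists.py filelist.txt indexlist.txt add.txt delete.txt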

Add to the Index

Next we iterate through all the files in the “add” list and index each one.

#! /usr/bin/python

import httplib 
import binascii
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<add-list> <root-fs-dir>"
    sys.exit(-1)

rootFsDir = sys.argv[2] 

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connInitialize(conn):
    print connRequest(conn, 'DELETE', hostinfo.INDEX)
    print connRequest(conn, 'PUT', hostinfo.INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', hostinfo.INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connAddFile(conn, filename, rootFsDir):
    title = os.path.basename(filename)
    location = filename[len(rootFsDir):]

    with open(filename, 'rb') as f:
        data = f.read()

    if len(data) > hostinfo.LARGEST_BASE64_ATTACHMENT:
        print 'Not indexing because the file is too large', len(data)
    else:
        print 'Indexing file size', len(data)
        base64Data = binascii.b2a_base64(data)[:-1]
        # NOTE: naive string concatenation; a title or location containing
        # quotes or backslashes will produce invalid JSON (see the json.dumps
        # sketch below).
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', hostinfo.INDEX + '/attachment/', attachment)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)
#connInitialize(conn)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

rootFsDir = rootFsDir + '/'

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    filename = rootFsDir + line
    print idx, filename
    try:
        connAddFile(conn, filename, rootFsDir)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

connRefresh(conn)
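
As noted in the comment above, building the attachment JSON by hand breaks on titles or paths containing quotes or backslashes.  A safer way to build the body, sketched with the standard json module (same fields as above):

import json

def buildAttachment(base64Data, title, location):
    # json.dumps escapes quotes, backslashes, and control characters for us.
    return json.dumps({"file": base64Data, "title": title, "location": location})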

Delete the Files Not Needed

Finally, we delete the index and physical files no longer needed.

#! /usr/bin/python

import httplib
import sys
import os
import socket

import hostinfo

argc = len(sys.argv)
if argc != 3:
    print os.path.basename(sys.argv[0]), "<delete-list> <root-fs-dir>"
    sys.exit(-1)

def connRequest(conn, verb, url, body = None):
    if body == None:
        conn.request(verb, url)
    else:
        conn.request(verb, url, body)
    return conn.getresponse().read()

def connRefresh(conn):
    print connRequest(conn, 'POST', '/_refresh')

def connDeleteFile(conn, index):
    print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)

socket.setdefaulttimeout(30)
conn = httplib.HTTPConnection(hostinfo.HOST)

f = open(sys.argv[1])
lines = f.readlines()
f.close()

idx = 0

for line in lines:
    line = line.replace('\n', '')
    idx = idx + 1
    split = line.rsplit(',', 1)  # split on the last comma; the location may itself contain commas
    filename = split[0]
    index = split[1]
    print "Delete:", idx, filename, index
    try:
        connDeleteFile(conn, index)
    except Exception, e:
        print str(e)
        conn = httplib.HTTPConnection(hostinfo.HOST)  

    try:
        os.remove(sys.argv[2] + '/' + filename)
    except OSError:
        pass

connRefresh(conn)
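
Invocation mirrors the add script (again, the script name is just an example); the first argument is the delete list from the diff, the second is the data directory:

./deletefiles.py delete.txt /var/www/search/data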

There it is.  I have covered all these steps, including resolving the path between the file list and the index list.  One further thing to note is that the hostinfo file referenced by the python scripts looks like this:

#! /usr/bin/python

HOST = '127.0.0.1:9200'
SITE = ''

INDEX = SITE + '/basic'

LARGEST_BASE64_ATTACHMENT = 50000000
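
To tie the steps together, the whole pipeline can be driven from a single script.  A minimal sketch (the script and list file names are hypothetical placeholders for the scripts above, and the path resolution between the lists is glossed over):

#! /usr/bin/python
# Hypothetical driver for the whole pipeline; adjust names and paths to taste.

import subprocess

DATA_DIR = '/var/www/search/data'

subprocess.check_call(['rsync', '-av', '--files-from=rsynclist.txt',
                       '/mnt/cifs', DATA_DIR])
subprocess.check_call(['./readindex.py', 'indexlist.txt'])
subprocess.check_call(['./difflists.py', 'rsynclist.txt', 'indexlist.txt',
                       'add.txt', 'delete.txt'])
subprocess.check_call(['./addtoindex.py', 'add.txt', DATA_DIR])
subprocess.check_call(['./deletefiles.py', 'delete.txt', DATA_DIR])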


A Search Engine for Office Documents

Have you ever worked at a place where there was a mass of files and documents on a share, and even the old-timers forget where important documents are?

Search by file name stinks and SharePoint has been another excuse to dump stuff that gets lost.

So I decided to figure out an easy way to get a content search engine up and looking through the files on a share.  I found a solution.  It isn’t pristine, for these reasons:

  1. Browsers can’t link to files on a share for obvious security reasons.
  2. For reason one, the decision was made to copy the searchable documents onto the web server.  This is time consuming to transfer and duplicates information, but the documents are served successfully.
  3. For reason two, it would be possible to add a server plugin that reads and delivers a file on a share.  I just haven’t done that yet.

So we will start with what we have and consider changing it later.

The basis for this will be Ubuntu 12.04 LTS.  Why?  Because I have such a machine handy and it is 9 years old.  This will be based on all the wonderful work of elasticsearch and Lucene.

So, here are the steps.  Remember, this is a bit hacky.

  1. Install apache2.  (In the case of Ubuntu, it is “sudo apt-get install apache2”.)
  2. Install openjdk-7-jre-headless.  (“sudo apt-get install openjdk-7-jre-headless”).
  3. Download elasticsearch (from elasticsearch.org – the .com site takes you to pay-for products).  Because I am using Ubuntu, I thought I would use the apt repository.
  4. Follow the steps to start elasticsearch – in my case listed on the web site.  Be advised that elasticsearch binds to all interfaces, to a free port between 9200 and 9300.  We will assume that the port is 9200, as it is in my case.  However, it should probably only bind to a port on localhost (e.g. via the network.host setting in elasticsearch.yml), or at least the security should be evaluated to make sure it complies with what you need.
  5. We will need two plugins.  You can install them from your elasticsearch/bin location.  In my case it was /usr/share/elasticsearch/bin/plugin.
    bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.0.0
    bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/1.0.1-2.0.0

    Restart elasticsearch (“sudo service elasticsearch restart”).  You will also need to verify that the versions of these plugins match your elasticsearch version.

  6. For apache2, make sure to enable the proxy, proxy_http, and ssl modules.  On Ubuntu, the “a2enmod” is an easy utility to do this.
  7. In my Apache setup, I added a new file called “elasticsearch” inside /etc/apache2/conf.d.  (Note that Ubuntu 13.10 doesn’t use a conf.d directory.  The contents could be added to the bottom of apache2.conf, although I am sure there is a more “pristine” location.)  The contents are below.
    <IfModule proxy_module>
    <IfModule proxy_http_module>
    
    <Proxy *>
    <Limit GET > 
        allow from all 
    </Limit>
    
    <Limit POST PUT DELETE>
        order deny,allow 
        deny from all 
    </Limit>
    </Proxy>
    
    ProxyPreserveHost On
    ProxyRequests Off
    LogLevel debug
    ProxyPass /es http://localhost:9200/
    ProxyPassReverse /es http://localhost:9200/
    
    </IfModule>
    </IfModule>

    The application depends on the /es directory under web root. This can be changed along with the web pages that use it.

  8. Restart apache2.  (“sudo service apache2 restart”)
  9. Download the HTML and Javascript for the search pages from here:  Search HTML and Javascript.  It uses jQuery, jQueryUI, and AJAX to perform the searching and suggestions.  Unzip and place it in the web directory where you want it.  For me, I wanted a search subdirectory, so I placed mine in /var/www/search.
  10. So, the last thing is to show how to index the files.  I am a fan of python, so this is python code making http requests to elasticsearch adding the information.  The script below deletes the index, recreates it, and starts adding content to it – from files in a directory.  (A recursive variant of the indexing loop is sketched just after this list.)
    #! /usr/bin/python
    
    import httplib 
    import binascii
    import os
    
    HOST = 'localhost:9200'
    INDEX = '/basic'
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    def connAddFile(conn, filename, rootFsDir, httpPrefix):
        with open(filename, 'rb') as f:
            base64Data = binascii.b2a_base64(f.read())[:-1]
    
        title = os.path.basename(filename)
        location = httpPrefix + filename[len(rootFsDir):]
    
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', INDEX + '/attachment/', attachment)
    
    conn = httplib.HTTPConnection(HOST)
    
    print connRequest(conn, 'DELETE', INDEX)
    
    print connRequest(conn, 'PUT', INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    
    print connRequest(conn, 'PUT', INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )
    
    # Add files here repeatedly
    rootFsDir = '/var/www/search/data/'
    searchDir = ''          # This is for recursion through the directories
    httpPrefix = 'data/'
    # Make this recursive some day
    for file in os.listdir(rootFsDir + searchDir):
        connAddFile(conn, rootFsDir + searchDir + file, rootFsDir, httpPrefix)
    
    print connRequest(conn, 'POST', '/_refresh')
  11. If you decide to get more creative and add only new files and delete the old ones, you need to understand how to get the list of files that have already been indexed.  Then you just have to correlate the current state of the files on disk with the index list.  This script gets the indexes and the files associated with them.
    #! /usr/bin/python
    
    import httplib 
    import json
    import sys
    import os
    
    import hostinfo
    
    argc = len(sys.argv)
    if argc != 2:
        print os.path.basename(sys.argv[0]), "<index-file>"
        sys.exit(-1)
    
    indexFileName = sys.argv[1]
    
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    
    conn = httplib.HTTPConnection(hostinfo.HOST)
    data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))
    
    total = data["hits"]["total"]
    
    #scroll session id, used to request the next batch of data
    scrollId = data["_scroll_id"]
    counter = 0
    
    data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
    
    f = open(indexFileName, 'w')
    
    while len(data["hits"]["hits"]) > 0:
        for item in data["hits"]["hits"]:
            f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
            f.flush()
    
        counter = counter + len(data["hits"]["hits"])
        print "Reading Index:", counter, "of", total
    
        scrollId = data["_scroll_id"]
        resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
        #print resp
        data = json.loads(resp)
    
    f.close()
  12. To delete files, the python snippet looks like this, where index is the id of the attachment whose index entry we want deleted.
    def connDeleteFile(conn, index):
        print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)
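
As promised in step 10, here is a sketch of a recursive variant of its indexing loop using os.walk (same connAddFile, conn, rootFsDir, and httpPrefix assumptions as the script in step 10):

# Walk the whole tree under rootFsDir and index every file found.
for dirpath, dirnames, filenames in os.walk(rootFsDir):
    for name in filenames:
        connAddFile(conn, os.path.join(dirpath, name), rootFsDir, httpPrefix)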

So there we have it.  All we have to do is figure out where we are getting our data from and copy it to the “data” directory.  One particular way I have done this is with rsync across an SMB share.

This is by no means meant to be a lesson on elasticsearch.  There is room for improvement here.

However, this is a quick way to set up searching documents for information you never knew existed.  (Side note:  I have had 10 ms search times across 2500 documents.)


Recycling a Third Party Application with System Tray Icon

I had a need to recycle a third party application that had a system tray icon.  The application controlled hardware and would get into a funky state.

The application was titled the “user mode driver” but I’m not totally sure if it was the user mode driver framework that Microsoft touted with Vista.  The user mode driver (UMD) was really a bridge process between the Ethernet port and a COM (a.k.a. the old-time Component Object Model) in-process DLL that resided in your program memory space.

The UMD also had a system tray component that needed a little cleanup when the application was killed.  The system tray icon was left behind.

This post recycles others’ work, which we will reference.  This post is about bringing it all together in C#.

There are three parts to this:

  1. Stop the process.
  2. Restart the process.
  3. Clean up the system tray.

For this example though, we will assume that we know the full path to the process and that the process name is the base file name without extension.

Stop the Process

C# has a handy way to stop processes.

private void StopUserModeDriver(string userModeDriverPath)
{
  Process[] procs = null;

  try
  {
    procs = Process.GetProcessesByName(Path.GetFileNameWithoutExtension(userModeDriverPath));

    foreach (Process proc in procs)
    {
      proc.Kill();
      proc.WaitForExit(5000);
    }
  }
  finally
  {
    if (procs != null)
      foreach (Process proc in procs)
        proc.Dispose();
  }
}

Restart the Process

This one is simple.

private void StartUserModeDriver(string userModeDriverPath)
{
  Process.Start(userModeDriverPath);
}

Clean Up the System Tray

This code comes from here, and we will show it again in this post.

[StructLayout(LayoutKind.Sequential)]
public struct RECT
{
  public int left;
  public int top;
  public int right;
  public int bottom;
}
[DllImport("user32.dll")]
public static extern IntPtr FindWindow(string lpClassName, string lpWindowName);
[DllImport("user32.dll")]
public static extern IntPtr FindWindowEx(IntPtr hwndParent, IntPtr hwndChildAfter, string lpszClass, string lpszWindow);
[DllImport("user32.dll")]
public static extern bool GetClientRect(IntPtr hWnd, out RECT lpRect);
[DllImport("user32.dll")]
public static extern IntPtr SendMessage(IntPtr hWnd, uint msg, int wParam, int lParam);

private void RemoveOrphanedIconsFromSystemTray()
{
  IntPtr systemTrayContainerHandle = FindWindow("Shell_TrayWnd", null);
  IntPtr systemTrayHandle = FindWindowEx(systemTrayContainerHandle, IntPtr.Zero, "TrayNotifyWnd", null);
  IntPtr sysPagerHandle = FindWindowEx(systemTrayHandle, IntPtr.Zero, "SysPager", null);
  IntPtr notificationAreaHandle = FindWindowEx(sysPagerHandle, IntPtr.Zero, "ToolbarWindow32", "Notification Area");
  if (notificationAreaHandle == IntPtr.Zero)
  {
    notificationAreaHandle = FindWindowEx(sysPagerHandle, IntPtr.Zero, "ToolbarWindow32", "User Promoted Notification Area");
    IntPtr notifyIconOverflowWindowHandle = FindWindow("NotifyIconOverflowWindow", null);
    IntPtr overflowNotificationAreaHandle = FindWindowEx(notifyIconOverflowWindowHandle, IntPtr.Zero, "ToolbarWindow32", "Overflow Notification Area");
    RefreshSystemTrayArea(overflowNotificationAreaHandle);
  }
  RefreshSystemTrayArea(notificationAreaHandle);
}

private static void RefreshSystemTrayArea(IntPtr windowHandle)
{
  const uint wmMousemove = 0x0200;
  RECT rect;
  GetClientRect(windowHandle, out rect);
  for (var x = 0; x < rect.right; x += 5)
    for (var y = 0; y < rect.bottom; y += 5)
      SendMessage(windowHandle, wmMousemove, 0, (y << 16) + x);
}

Essentially, we are getting window handles to the notification area that is on your system tray, and also to the overflow area introduced in Windows 7 (don’t know about Vista – does anyone remember Vista?).  That is the little arrow icon in the system tray that opens a little popup where all the pestering but insignificant applications’ system tray icons live.

Do you remember how, when you have an orphaned system tray icon, you move your mouse over it only to find it magically disappears?  That is exactly what this code does.  With the window handles to the system tray and overflow areas, we simply move the mouse repeatedly up and down and left to right.  We don’t actually move the cursor; we just send the windows messages.

There was another solution presented somewhere (on Code Project, but I can’t find it now) that inspected the private bytes of the window allocations to determine whether a process was still operating.  That approach was more pristine but did some memory allocation tricks in C# that made me nervous.  Sending mouse messages was certainly safer, although not elegant.

C++ Speed Test with FPU and ints

I wanted to test the difference on modern hardware between floating point math and integer math.  Here is my code (which is similar to C# code previously written).

// CSpeedTest.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include 
#include 
#include 

using namespace std;

#define FP_MULT
//#define FP_NEG
//#define INT_MULT
//#define INT_NEG

class StopWatch
{
    LARGE_INTEGER m_freq;
    LARGE_INTEGER m_startTime;
    LONGLONG m_totalTime;
public: 
    StopWatch() : m_totalTime(0L)
    {
        QueryPerformanceFrequency(&m_freq);
    }

    void Start()
    {
        QueryPerformanceCounter(&m_startTime);
    }

    void Stop()
    {
        LARGE_INTEGER stopTime;
        QueryPerformanceCounter(&stopTime);
        m_totalTime += (stopTime.QuadPart - m_startTime.QuadPart);
    }

    void Reset()
    {
        m_totalTime = 0L;
    }

    double ElapsedTime()
    {
        return (double)(m_totalTime) / (double)(m_freq.QuadPart);
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    #if defined(FP_MULT) || defined(FP_NEG)
    volatile double poo = 0.0;
    #endif
    #if defined(INT_MULT) || defined(INT_NEG)
    volatile int poo = 0;
    #endif

    StopWatch stopWatch;
    for (int idx = 0; idx < 1000000000; idx++)
    {
      stopWatch.Start();
      #if defined(FP_MULT)
        poo = -1.0 * poo;
      #endif
      #if defined(FP_NEG) || defined(INT_NEG)
        poo = -poo;
      #endif
      #if defined(INT_MULT)
        poo = -1 * poo;
      #endif
      stopWatch.Stop();
    }

    double elapsedTime = stopWatch.ElapsedTime();

    int minutes = elapsedTime / 60;
    int seconds = (int) (elapsedTime) % 60;
    int ms10 = (elapsedTime - int(elapsedTime)) * 100;

    cout << setfill('0') << setw(2) << minutes << ':' << seconds << ':' << ms10 << endl;

    return 0;
}

The code was compiled as a console application for Win32 Debug so the variables would not get “registered” (optimized into CPU registers).

The test machine is a Dell Precision M4800.  The processor is an Intel Core i7-4800MQ CPU at 2.70 GHz with 16GB RAM.  The OS is Windows 7 Professional 64 bit with SP1.

Here are the results.  I have also included the assembler for the operation under test.

Define    Time   Assembly
FP_MULT   7.32s  fld qword ptr [__real@bff0000000000000 (0BE7938h)]; fmul qword ptr [poo]; fstp qword ptr [poo]
FP_NEG    7.56s  fld qword ptr [poo]; fchs; fstp qword ptr [poo]
INT_MULT  7.58s  mov eax,dword ptr [poo]; imul eax,eax,0FFFFFFFFh; mov dword ptr [poo],eax
INT_NEG   7.59s  mov eax,dword ptr [poo]; neg eax; mov dword ptr [poo],eax

I actually don’t believe I have accomplished too much, as the setup to call the timing functions takes many, many more opcodes than the operation under test.  However, this was an interesting experiment, and I now have a cool C++ stopwatch on Windows for more extensive testing of much larger blocks of test code.

Intelligent Agent #2
(ninja turtles)

As promised, here is a super simple example of an Intelligent Agent program using NetLogo.

The goal is to find the boundaries within an image.  There are powerful algorithms for locating boundaries, but I wanted to do it another way: I wanted to see if Intelligent Agents could be used.  As I said in my earlier post, NetLogo provides a rapid prototyping means for creating Agents and offers a visual development and execution environment.  Below is the image I started with.  Note the low resolution.  NetLogo could handle a much higher resolution image, but the image would be too large to fit in this post, so I had to lower the resolution so I could display it here.  (Now, someone who is an expert in NetLogo might point me to a setting to make the ‘patches’ smaller; that would help.)

Screen shot before the Agents run.  [image: Contour1]

Screen shot after the Agents have run a bit.  [image: Contour2]

As you can see, the Agents did a fair job of locating the color contours, and thus the edges.  My approach is bone-headed simple: Agents wander around randomly, keeping track of the color of recently visited patches.  If the new patch is different enough from the prior patch, then the Agent marks the patch white to indicate a boundary.

The approach has a serious flaw and a weakness.  The flaw is that the marked boundary depends on the direction the Agent is traveling when it encounters the boundary.  For example, an Agent moving from Blue to Red will mark the Red patch, while a neighboring Agent moving from Red to Blue will mark the Blue patch.  This causes the edge to appear much more ragged than it really is.  It could be solved by adding a rule such as “always mark the lighter colored patch.”

The weakness is that the algorithm could be much more efficient.  Rather than wander aimlessly, the Agent could try to determine the direction of the edge and follow it.  If two Agents collide, one could jump to a new location to let the other complete the boundary.

I am sure you are eager to see some code. OK, here goes:
extensions [bitmap]

turtles-own [last-color second-last-color hunt-color]

to setup
  let img bitmap:import "C:\\Temp\\Desert.jpg"
  ;; set img bitmap:to-grayscale img
  bitmap:copy-to-pcolors img false

  create-turtles 10 [fd 10]
  ask turtles [
    set last-color pcolor
    set second-last-color pcolor
    set hunt-color pcolor
  ]
end

to hunt
  ask turtles [
    rt random 50
    lt random 50
    fd 1
    if pcolor = 0
      [jump-elsewhere]
    set second-last-color last-color
    set last-color pcolor
    if isContour second-last-color last-color
      [set pcolor [255 255 255]]
  ]
  hunt
end

to-report isContour [color1 color2]
  let retVal false
  foreach [0 1 2]
  [
    ;; show item ?1 color1
    if abs (item ?1 color1 - item ?1 color2) > 60 ;; or item ?1 color2 - item ?1 color1 > 60
      [set retVal true]
  ]
  if approximate-rgb item 0 color1 item 1 color1 item 2 color1 = white or approximate-rgb item 0 color2 item 1 color2 item 2 color2 = white
    [set retVal false]
  report retVal
end

to jump-elsewhere
  set xcor random 40
  set ycor random 40
  set last-color pcolor
  set second-last-color pcolor
  set hunt-color pcolor
end

Intelligent Agents
(ninja Turtles)

Most of us in IT are interested in Artificial Intelligence and leveraging AI techniques to solve current problems.
Intelligent Agents is one branch of AI and although not as sexy as Neural Nets and other branches, it is perhaps the most widely used.
Agents are everywhere, from viruses to web crawlers; there are many bits of code with a degree of autonomy.
Definitions of Agents can be found on Wikipedia.
In a nutshell, an Intelligent Agent receives data from its environment and makes decisions based on that data.
Agents work for someone, so there is usually a communication and reporting aspect.
If this sounds like the proxy pattern, then I have done a poor job describing it; please see the Wikipedia article, which goes into much more detail.

A classic simplified example of Intelligent Agents is called Sheep and Wolves.  In its simplest form, you have Sheep agents, Wolf agents, and an environment made up of grass.  Sheep eat grass; Wolves eat Sheep.  Sheep wander around aimlessly, as do Wolves.  When a Wolf bumps into a Sheep, it eats it.  If all the Sheep are eaten, the Wolves all die.  If the Sheep overpopulate, they die of starvation.
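
For the flavor of it, the core loop of such a model fits in a few lines of Python (a toy sketch, not NetLogo; the grid size and population numbers are arbitrary):

import random

GRID = 20
sheep = [(random.randrange(GRID), random.randrange(GRID)) for _ in range(30)]
wolves = [(random.randrange(GRID), random.randrange(GRID)) for _ in range(5)]

def wander(pos):
    # Move one step in a random direction, wrapping around the grid edges.
    x, y = pos
    return ((x + random.choice([-1, 0, 1])) % GRID,
            (y + random.choice([-1, 0, 1])) % GRID)

for tick in range(100):
    sheep = [wander(s) for s in sheep]
    wolves = [wander(w) for w in wolves]
    # A wolf that lands on a sheep eats it.
    sheep = [s for s in sheep if s not in wolves]
    if not sheep:
        print "All the sheep have been eaten; the wolves will starve."
        break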

Even this simple model can be expanded to make the problem more ‘real’ and interesting.
For example – the sheep could learn when wolves hunt. The wolves could learn when sheep graze. There could be different species of grass that grow at different rates and have differing nutritional value. Food is not the only resource that animals need, so you could introduce water and shelter. Sheep don’t clone themselves, they must find a mate etc. Sheep have a notoriously low learning rate, so if one sheep discovers a new resource, the others are slow to learn about the new resource, but some do eventually learn.

You can see how fun this could get even for a contrived example.

Northwestern University has provided NetLogo, a great, easy to use and visual way to develop models such as described above.
Go here: NetLogo for more info and to get started.
MIT has a similar system: StarLogo

Before you go all Donatello on me, download NetLogo and explore some of the 100 or so included models.

Soon I will post a simple model which can locate edges in an image. I call it color contouring.

Screen shot of the Sheep and Wolves model running in NetLogo.  [image: sheepwolves]

C# Speed Tests with FPU and ints

So I saw this sort of code inside a loop (doing inversion of data) in C# the other day; assume dblValue is a double:

dblValue = dblValue * -1.0d;

I wondered what the speed comparison was compared to this:

dblValue = -dblValue;

I expected the second to be faster.  I decided to find out. After testing floating point numbers, I also decided to try the same thing with integers just to see.  Here is my code set.

#define FP_MULT
//#define FP_NEG
//#define INT_MULT
//#define INT_NEG

using System;
using System.Diagnostics;

class Script
{
  [STAThread]
  static public void Main(string[] args)
  {
    #if FP_MULT || FP_NEG
    double poo = 0d;
    #endif
    #if INT_MULT || INT_NEG
    int poo = 0;
    #endif

    Stopwatch stopWatch = new Stopwatch();
    for (int idx = 0; idx < 1000000000; idx++)
    {
      stopWatch.Start();
      #if FP_MULT
        poo = -1.0d * poo;
      #endif
      #if FP_NEG || INT_NEG
        poo = -poo;
      #endif
      #if INT_MULT
        poo = -1 * poo;
      #endif
      stopWatch.Stop();
    }

    TimeSpan ts = stopWatch.Elapsed;

    string elapsedTime = String.Format("{0:00}:{1:00}.{2:00}",
                                       ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
    Console.WriteLine(elapsedTime);
  }
}

The program was compiled using the .NET 4.0 64 bit framework.  The exe was compiled to ILOnly, verified using CorFlags.exe.  This means the exe was running in 64 bit mode.

The test machine is a Dell Precision M4800.  The processor is an Intel Core i7-4800MQ CPU at 2.70 GHz with 16GB RAM.  The OS is Windows 7 Professional 64 bit with SP1.

Here are the results.  I didn’t really average anything, but every time I ran each of these tests, the values were always similar.  I have also included the IL for the operation under test.  (I used ILSpy.)

Define    Time    IL
FP_MULT   15.49s  ldc.r8 -1; ldloc.0; mul; stloc.0
FP_NEG    15.35s  ldloc.0; neg; stloc.0
INT_MULT  15.35s  ldc.i4.m1; ldloc.0; mul; stloc.0
INT_NEG   15.43s  ldloc.0; neg; stloc.0

It has been a while since I have evaluated floating point and integer math, but I am impressed that the timings are very similar.

I think I may try this on the same machine using a simple C++ program and performance counters to see the results and dive deeper into this.

Follow up note:  I now don’t believe I accurately measured anything, as the stopwatch opcodes were likely more plentiful than the code under test.  However, it was an interesting experiment, and we learned about the Stopwatch in .NET.

Using crosstool-ng and Cygwin

My goal is to cross compile on Cygwin (on Winderz) for a Linux target – both 64 bit Ubuntu 13.10 and ARM (such as a BeagleBone).  I masochistically thought that this could be done in MinGW.  Two words: Um, oops.

The real purpose is to take a Windows GUI that generates C code and compile that code for a different platform (hence cross compiling).  These are my steps, which are based on this guy’s post.

Since information on the web quickly goes out of date, note that this was written at the end of March 2014.

  1. Before you begin, it is imperative to set your file system to be case sensitive in Windows.  Both the kernel headers and the C library contain file names that are identical when compared case-insensitively but distinct when compared case-sensitively.  Open regedit.exe and set the following to 0.
    HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\obcaseinsensitive

    Then reboot.

  2. Download and install Cygwin from here.  We will assume that you installed it at C:\cygwin.  I am using the Cygwin 2.844 32 bit version because the compiler being built will then run on either 32 bit or 64 bit Windows.
  3. When you run setup, you will get a nice GUI to choose your packages.  If you ever want to add or remove a package, you run setup again (which seems counter-intuitive on Windows).  Take the defaults and add the following packages (not all may be required, but it didn’t hurt).
      • Devel/gperf
      • Devel/bison
      • Devel/flex
      • Devel/patch
      • Devel/make: The GNU version…
      • Devel/automake
      • Devel/libtool
      • Devel/subversion
      • Devel/gcc-core
      • Devel/gcc-g++
      • Devel/catgets
      • Web/wget
      • Libs/libncursesw-devel
      • Libs/libncurses-devel
      • Libs/gettext
      • Libs/libexpat-devel
  4. Open up your cygwin terminal.  If you have a shortcut on your desktop or in your start menu, use that.  If not, the shortcut contains the following target:
    c:\cygwin\bin\mintty.exe -i /Cygwin-Terminal.ico -
  5. In the terminal, download and build crosstool-ng following the steps here.  Substitute your version.  The steps are listed below with my version and some other things I did.
    wget http://crosstool-ng.org/download/crosstool-ng/crosstool-ng-1.19.0.tar.bz2
    tar xjf crosstool-ng-1.19.0.tar.bz2
    cd crosstool-ng-1.19.0
    ./bootstrap
    ./configure --prefix=/home/maks/crosstool
  6. There is one issue with curses.  In crosstool-ng-1.19.0/kconfig/nconf.c, there is the line “ESCDELAY = 1;”.  Replace this line with “set_escdelay(1);”.  (A patch for this is listed here.  I did not apply the other two patches and had success building.)
  7. After making the previous correction, we can make and install.
    make
    make install
  8. To make life easier, export your path to include /home/maks/crosstool/bin, substituting your home directory.  I added this to my .bashrc so I wouldn’t have to think about it again.
    export PATH="${PATH}:/home/maks/crosstool/bin"
  9. This is where the patching begins.  Do the following.
    mkdir /usr/include/linux
    cp /usr/include/asm/types.h /usr/include/linux

    Then edit /usr/include/linux/types.h and include the following:

    typedef __signed__ long long __s64;
    typedef unsigned long long __u64;
  10. Make a new directory.  Since I wanted a 64 bit compiler for Linux, I did the following.  Adding the src directory seemingly allows the tarballs to be saved.  (This is not otherwise used and seems like a bug in the scripts.)
    mkdir ~/src
    mkdir ~/linux64
    cd ~/linux64
    ct-ng i686-nptl-linux-gnu
    ct-ng menuconfig
  11. In menuconfig, I updated everything to the latest compiler, libraries, 64 bit, eglibc (which Ubuntu uses), etc.  If you want to cheat with menuconfig, use my config.  Simply copy the text into a .config file in the linux64 directory.
  12. You can start building with the following.  I recommend that you read the remainder of the post first.  There are tips below that may help out.
    ct-ng build
  13. While building the kernel headers, I came to realize that Cygwin doesn’t have enough elf headers to be successful.  I applied the patch found here.  Make sure the patch applies correctly; I had some issues.  Also, I had to edit /usr/include/sys/elf_common.h so that R_X86_64_JMP_SLOT was spelled R_X86_64_JUMP_SLOT (only that define).
  14. It turns out that for the latest kernel, a newer version of make is needed than the one that comes with Cygwin.  In menuconfig, add make to the list of companion tools.
  15. Make sure you are not trying to build anything statically.  The final build of the compiler will not succeed.
  16. When you get past installation of the first pass of gcc, you are probably well on your way.  It will take around 2 to 3 hours to fully compile.  You may want to turn off anti-virus protection during this time.
  17. If you come across errors, you can restart ct-ng building where it left off by selecting “Paths and misc options / Debug crosstool-NG / Save intermediate steps”.  Then to restart, run
    ct-ng list-steps
    ct-ng <last successful step name>+
  18. I never got D.U.M.A. to actually build.  I lost patience trying to figure it out.  However, I never really needed a memory overrun checker in my case.

So that is that.  I compiled a C and C++ program on Windows and ran the binaries on Linux 64 bit Ubuntu 13.10.

CS-Script – C# scripting

A quick shout out is in order to Oleg Shilo.  CS-Script is fantastic for basic scripting needs or for testing behaviors of C# and .NET.  Oleg has also included a plugin for Notepad++.

It also can be used to generate executable “scripts” to make useful utilities without bringing up an entire development environment and the overhead of projects and solutions.

For me personally, it will never really replace Python but it is good to know that there are alternatives for Windows based development when Python is not an option.  (Yes, those times do actually exist for some of us.)

More poo in the toolbox.

Notes on XML Serialization in C#

The poo crew was having some trouble understanding the behavior of XML serialization on .NET.  So we will add some clarity.

We wanted a serializer where default values could be set in code when reading older serialized XML and all tags were written regardless of “default value”.  This way, a human could inspect the XML and know that the properties they see are all of them, at the time of serialization.

XML Serializer

XmlSerializer has been there since the beginning of .NET time (or at least 1.1).  Here are some characteristics of XmlSerializer.

  1. All public properties from a public object are written using XmlSerializer as long as the [System.ComponentModel.DefaultValueAttribute(x)] is not attributed to the property.  This also can look like [DefaultValue(x)] and it is the same attribute.
  2. For strings, include [XmlElement(IsNullable = true)] as a decorator if you want the tag specifically in the XML and the string is not assigned.
  3. Default construction is performed when deserializing, so default values take effect whether they are set in the constructor or assigned at declaration.

It is important to note that [DefaultValue()] has a second purpose.  Property grids use this to know whether or not to bold a value in the UI.  If the value = default value, the text is not bold.  If value != default value, the text is bold. That is all it does.  It absolutely does not change the class member no matter what your friends say.

Data Contract Serializer

This was added to the framework around .NET 3.0 (if we can believe Microsoft).  Here are the high points:

  1. All properties with [DataMember] appear to be written regardless of whether default values are set or not.  Why it didn’t work with SpiralToPolarPersistedData is something to look into.
  2. Constructors are not called when deserializing.  This is true whether values are set in constructors or assigned at declaration.
  3. The only way to guarantee a default value is to assign the values in a method decorated with [OnDeserializing].  A common pattern is to call a default-assigning method from both the constructor and the OnDeserializing method (assuming the default method isn’t overridable).
  4. If you do not include OnDeserializing, the values in the class are the type’s default values, regardless of construction or declaration.

XML Serializer Example Code

using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

public class Script
{ 

  public class Record
  {
    private double n1;
    private double n2 = 100;
    private string operation;
    private double result;

    internal Record() 
    { 
      //n2 = 100;
    }

    internal Record(double n1, double n2, string operation, double result)
    {
      this.n1 = n1;
      this.n2 = n2;
      this.operation = operation;
      this.result = result;
    }

    public double OperandNumberOne
    {
      get { return n1; }
      set { n1 = value; }
    }

    public double OperandNumberTwo
    {
      get { return n2; }
      set { n2 = value; }
    }

    [XmlElement(IsNullable = true)]
    public string Operation
    {
      get { return operation; }
      set { operation = value; }
    }

    public double Result
    {
      get { return result; }
      set { result = value; }
    }

    public override string ToString()
    {
      return string.Format("Record: {0} {1} {2} = {3}", n1, operation, n2, result);
    }
  }

  static public void Main(string[] args)
  {
    Record record0 = new Record();
    Console.WriteLine(record0.ToString());

    Record record1 = new Record(1, 2, "+", 3);

    XmlSerializer serializer = new XmlSerializer(typeof(Record));

    using (FileStream stream = File.Open("test.xml", FileMode.Create))
    {
      serializer.Serialize(stream, record1);
    }

    Console.WriteLine("Press any key...");
    Console.ReadKey(false);

    using (FileStream stream = File.Open("test.xml", FileMode.Open))
    {
      Record record2 = (Record) serializer.Deserialize(stream);
      Console.WriteLine(record2.ToString());
    }
  }   
}
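
For reference, the file written by the XmlSerializer example looks roughly like this (namespace declarations and formatting may vary):

<?xml version="1.0"?>
<Record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <OperandNumberOne>1</OperandNumberOne>
  <OperandNumberTwo>2</OperandNumberTwo>
  <Operation>+</Operation>
  <Result>3</Result>
</Record>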

Data Contract Serializer Example Code

using System;
using System.Runtime.Serialization;
using System.IO;
using System.Xml;

public class Script
{ 

  [DataContract]
  internal class Record
  {
    private double n1;
    private double n2; // = 100;
    private string operation;
    private double result;

    internal Record() 
    { 
      // n2 = 100;
      SetDefaults();
    }

    [OnDeserializing]
    private void OnDeserializing(StreamingContext context)
    {
      SetDefaults();
    }

    private void SetDefaults()
    {
      n2 = 100;
    }

    internal Record(double n1, double n2, string operation, double result)
    {
      this.n1 = n1;
      this.n2 = n2;
      this.operation = operation;
      this.result = result;
    }

    [DataMember]
    internal double OperandNumberOne
    {
      get { return n1; }
      set { n1 = value; }
    }

    [DataMember]
    internal double OperandNumberTwo
    {
      get { return n2; }
      set { n2 = value; }
    }

    [DataMember]
    internal string Operation
    {
      get { return operation; }
      set { operation = value; }
    }

    [DataMember]
    internal double Result
    {
      get { return result; }
      set { result = value; }
    }

    public override string ToString()
    {
      return string.Format("Record: {0} {1} {2} = {3}", n1, operation, n2, result);
    }
  }

  static public void Main(string[] args)
  {
    Record record0 = new Record();
    Console.WriteLine(record0.ToString());

    Record record1 = new Record(1, 2, "+", 3);

    DataContractSerializer serializer = new DataContractSerializer(typeof(Record));

    using (FileStream stream = File.Open("test.xml", FileMode.Create))
    {
      serializer.WriteObject(stream, record1);
    }

    Console.WriteLine("Press any key...");
    Console.ReadKey(false);

    using (FileStream stream = File.Open("test.xml", FileMode.Open))
    {
      XmlDictionaryReader reader = XmlDictionaryReader.CreateTextReader(stream, new XmlDictionaryReaderQuotas());
      Record record2 = (Record) serializer.ReadObject(reader, true);
      Console.WriteLine(record2.ToString());
    }
  }
}