A Search Engine for Office Documents

Have you ever worked at a place where there was a mass of files and documents on a share, and even the old-timers forgot where the important documents were?

Search by file name stinks and SharePoint has been another excuse to dump stuff that gets lost.

So I decided to figure out an easy way to get a content search engine up and running over the files on a share. I found a solution, though it isn’t pristine, for these reasons:

  1. Browsers can’t link to files on a share for obvious security reasons.
  2. Because of reason one, I decided to copy the searchable documents onto the web server.  The transfer is time consuming and duplicates information, but the documents are served successfully.
  3. As an alternative to reason two, it would be possible to add a server plugin that reads and delivers a file from the share.  I just haven’t done that yet.

So we will start with what we have and consider changing it later.

The basis for this will be Ubuntu 12.04 LTS.  Why?  Because I have such a machine handy and it is 9 years old.  This will be based on all the wonderful work of elasticsearch and Lucene.

So, here are the steps.  Remember, this is a bit hacky.

  1. Install apache2.  (In the case of Ubuntu, it is “sudo apt-get install apache2”.)
  2. Install openjdk-7-jre-headless.  (“sudo apt-get install openjdk-7-jre-headless”).
  3. Download elasticsearch (from – the .com site takes you to pay-for products).  Because I am using Ubuntu, I thought I would use the apt repository.
  4. Follow the steps to start elasticsearch (in my case, the ones listed on the web site).  Be advised that elasticsearch binds to all interfaces on a free port between 9200 and 9300.  We will assume that the port is 9200, as it is in my case.  However, it probably should bind only to localhost, or at the very least the security should be evaluated to make sure it complies with what you need.
  5. We will need two plugins.  You can install them from your elasticsearch/bin location.  In my case it was /usr/share/elasticsearch/bin/plugin.
    bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.0.0
    bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/1.0.1-2.0.0

    Restart elasticsearch. (“sudo service elasticsearch restart”).  You will also need to check that these plugin versions are the right ones for your elasticsearch version.

  6. For apache2, make sure to enable the proxy, proxy_http, and ssl modules.  On Ubuntu, “a2enmod” is an easy utility for this.
  7. In my Apache setup, I added a new file called “elasticsearch” inside /etc/apache2/conf.d.  (Note that Ubuntu 13.10 doesn’t use a conf.d directory.  The contents could be added to the bottom of apache2.conf instead, although I am sure there is a more “pristine” location.)  The contents are below.
    <IfModule proxy_module>
    <IfModule proxy_http_module>
    <Proxy *>
    <Limit GET>
        allow from all
        order deny,allow
        deny from all
    </Limit>
    </Proxy>
    ProxyPreserveHost On
    ProxyRequests Off
    LogLevel debug
    ProxyPass /es http://localhost:9200/
    ProxyPassReverse /es http://localhost:9200/
    </IfModule>
    </IfModule>

    The application depends on the /es directory under web root. This can be changed along with the web pages that use it.

  8. Restart apache2.  (“sudo service apache2 restart”)
  9. Download the HTML and JavaScript for the search pages from here:  Search HTML and Javascript.  It uses jQuery, jQueryUI, and AJAX to perform the searching and suggestions.  Unzip and place in the web directory where you want it.  For me, I wanted a search subdirectory so I placed mine in /var/www/search.
  10. So, the last thing is to show how to index the files.  I am a fan of python, so this is python code making http requests to elasticsearch to add the information.  The script below deletes the index, recreates it, and starts adding content to it from the files in a directory.
    #! /usr/bin/python
    import httplib 
    import binascii
    import os
    HOST = 'localhost:9200'
    INDEX = '/basic'
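    def connRequest(conn, verb, url, body = None):
        # Send one request and return the response body as a string.
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()
    def connAddFile(conn, filename, rootFsDir, httpPrefix):
        # Base64-encode the file; [:-1] strips the newline b2a_base64 appends.
        with open(filename, 'rb') as f:
            base64Data = binascii.b2a_base64(f.read())[:-1]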
        title = os.path.basename(filename)
        location = httpPrefix + filename[len(rootFsDir):]
        attachment = '{ "file":"' + base64Data + '", "title" : "' + title + '", "location" : "' + location + '" }'
        print connRequest(conn, 'POST', INDEX + '/attachment/', attachment)
    conn = httplib.HTTPConnection(HOST)
    print connRequest(conn, 'DELETE', INDEX)
    print connRequest(conn, 'PUT', INDEX, '{  "settings" : { "index" : { "number_of_shards" : 1, "number_of_replicas" : 0 }}}') 
    print connRequest(conn, 'GET', '/_cluster/health?wait_for_status=green&pretty=1&timeout=5s' )
    print connRequest(conn, 'PUT', INDEX + '/attachment/_mapping', '{  "attachment" : {   "properties" : {      "file" : {        "type" : "attachment",        "fields" : {          "title" : { "store" : "yes" },          "file" : { "term_vector":"with_positions_offsets", "store":"yes" }        }      }    }  }}' )
    # Add files here repeatedly
    rootFsDir = '/var/www/search/data/'
    searchDir = ''          # This is for recursion through the directories
    httpPrefix = 'data/'
    # Make this recursive some day
    for file in os.listdir(rootFsDir + searchDir):
        connAddFile(conn, rootFsDir + searchDir + file, rootFsDir, httpPrefix)
    print connRequest(conn, 'POST', '/_refresh')
  11. If you decide to get more creative and add only new files and delete the old ones, you need a list of the files that are already indexed.  Then you just have to correlate the current state of the files on disk with that list.  This script writes the location and index id of every indexed file to the file named on the command line.
    #! /usr/bin/python
    import httplib 
    import json
    import sys
    import os
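    import hostinfo   # local settings module (not shown) providing HOST, INDEX and SITE
    argc = len(sys.argv)
    if argc != 2:
        # Exactly one argument is expected: the file to write the listing to.
        print "Usage:", os.path.basename(sys.argv[0]), "indexFileName"
        sys.exit(1)
    indexFileName = sys.argv[1]
    def connRequest(conn, verb, url, body = None):
        if body == None:
            conn.request(verb, url)
        else:
            conn.request(verb, url, body)
        return conn.getresponse().read()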
    conn = httplib.HTTPConnection(hostinfo.HOST)
    data = json.loads(connRequest(conn, 'GET', hostinfo.INDEX + '/_search?search_type=scan&scroll=10m&size=10', '{"query":{"match_all" :{}}, "fields":["location"]}' ))
    total = data["hits"]["total"]
    #scroll session id, used to request the next batch of data
    scrollId = data["_scroll_id"]
    counter = 0
    data = json.loads(connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId))
    f = open(indexFileName, 'w')
    while len(data["hits"]["hits"]) > 0:
        for item in data["hits"]["hits"]:
            f.write(item["fields"]["location"][0] + ',' + item["_id"] + '\n')
        counter = counter + len(data["hits"]["hits"])
        print "Reading Index:", counter, "of", total
        scrollId = data["_scroll_id"]
        resp = connRequest(conn, 'GET', hostinfo.SITE + '/_search/scroll?scroll=10m', scrollId)
        #print resp
        data = json.loads(resp)
    f.close()
  12. To delete a file from the index, the python snippet looks like this, where index is the id of the indexed document you want removed.
    def connDeleteFile(conn, index):
        print connRequest(conn, 'DELETE', hostinfo.INDEX + '/attachment/' + index)
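
One way the correlation from step 11 could be done is sketched below.  It is only a sketch under assumptions: it takes the listing in the “location,id” format written by the step 11 script, it assumes the ids contain no commas, and the function name plan_sync and its arguments are made up for this example.

```python
def plan_sync(index_lines, disk_locations):
    """Compare the index listing with the files currently on disk.

    index_lines    -- iterable of 'location,id' strings
    disk_locations -- iterable of location strings for files on disk
    Returns (to_add, to_delete): locations missing from the index, and
    ids of index entries whose file no longer exists.
    """
    indexed = {}
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside the location survives.
        location, doc_id = line.rsplit(',', 1)
        indexed.setdefault(location, []).append(doc_id)

    on_disk = set(disk_locations)
    to_add = sorted(on_disk - set(indexed))
    to_delete = sorted(
        doc_id
        for location, ids in indexed.items()
        if location not in on_disk
        for doc_id in ids
    )
    return to_add, to_delete
```

The ids in to_delete could then be handed to connDeleteFile from step 12, and the locations in to_add (mapped back to paths on disk) to connAddFile from step 10.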

So there we have it.  All we have to do is figure out where we are getting our data from and copy it to the “data” directory.  One particular way I have done this is with rsync across an SMB share.
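
If the copied data contains subdirectories, the indexing script in step 10 will miss them, since it only lists one directory (“Make this recursive some day”).  A possible way to cover the whole tree is os.walk.  This is a sketch, not the script I run; the function name list_locations is made up, and its two arguments mirror rootFsDir and httpPrefix from step 10.

```python
import os

def list_locations(root_fs_dir, http_prefix):
    """Yield (path_on_disk, location) for every file under root_fs_dir."""
    # Make sure the root ends with a separator so slicing it off is safe.
    root_fs_dir = os.path.join(root_fs_dir, '')
    for dirpath, dirnames, filenames in os.walk(root_fs_dir):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Location is the path relative to the root, with URL-style slashes.
            relative = path[len(root_fs_dir):].replace(os.sep, '/')
            yield path, http_prefix + relative
```

Each yielded path could be passed to connAddFile in place of the os.listdir loop.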

This by no means is meant to be a lesson on elasticsearch.  There can be some improvement here.
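
One example of such an improvement: the indexing script in step 10 builds its JSON by string concatenation, so a file name containing a double quote or a backslash would produce invalid JSON.  A sketch of building the same three fields with the json module follows.  The function name build_attachment_body is made up for illustration; the field names are the ones used in step 10.

```python
import base64
import json
import os

def build_attachment_body(data, filename, root_fs_dir, http_prefix):
    """Return the JSON document for one file as a string.

    data is the raw file content (bytes); the other arguments mirror
    connAddFile in the indexing script.
    """
    document = {
        # b64encode returns bytes; decode so json.dumps gets text.
        'file': base64.b64encode(data).decode('ascii'),
        'title': os.path.basename(filename),
        'location': http_prefix + filename[len(root_fs_dir):],
    }
    # json.dumps takes care of escaping quotes and backslashes.
    return json.dumps(document)
```

The returned string could be sent as the request body in place of the hand-built attachment string.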

However, this is a quick way to set up searching documents for information you never knew existed.  (Side note:  I have had 10 ms search times across 2500 documents.)
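
Timings like that can be read from the search response itself, which reports the time spent in a “took” field (in milliseconds).  A small sketch of pulling it out, assuming hits.total is a plain number as it is in the listing script above; the function name summarize_search is made up.

```python
import json

def summarize_search(response_text):
    """Return (took_ms, total_hits) from a raw _search response body."""
    data = json.loads(response_text)
    return data['took'], data['hits']['total']
```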


C# WebRequests without Proxy or Delay

WebRequest objects in C# are useful creations when you want to script interaction with a web server in your application.  However, there are some common gotchas that I always have to look up to rectify.  So here are three of the most common in one handy location.

Nagle Algorithm

First, turn off the Nagle algorithm.  The Nagle algorithm is great for protocols like telnet, where lots of tiny writes would otherwise each go out as their own packet.  It is not so good for a request/response protocol where the messages are intentionally small.  It will kill performance by holding small writes back, queuing up bytes of data while earlier data is still unacknowledged.  Use the following:

System.Net.ServicePointManager.UseNagleAlgorithm = false;
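
For comparison, here is roughly what that setting amounts to at the socket level, sketched in Python to match the indexing scripts earlier on this page.  This is an illustration of the TCP_NODELAY socket option, not what .NET does internally line for line, and the function name make_nodelay_socket is made up.

```python
import socket

def make_nodelay_socket():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 tells the stack to send small writes immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```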

Expect 100 Continue

Expect 100 Continue is an HTTP/1.1 addition that lets a client check that the server is ready to accept a request before sending a large body in a POST.  By default, the WebRequest object sends an Expect: 100-continue header with such requests.  Not all web servers handle this (e.g. lighttpd).  I suppose there is value to the 100 status code when posting large bodies, but for most data transfers (e.g. SOAP, REST, XMLRPC, etc.) it doesn’t seem to be very useful.  Use the following to disable it.

System.Net.ServicePointManager.Expect100Continue = false;

WebRequest Proxy

By default, .NET will use the proxy settings configured in Windows.  If you know your network is local, allowing the .NET framework to evaluate the default proxy settings can take unnecessary time.  You can tell .NET not to use or look for any proxy with the following code:

// For a single request:
WebRequest request = WebRequest.Create(resource);
request.Proxy = null;

// Or for every request in the application:
WebRequest.DefaultWebProxy = null;

Please remember to flush.