Integrating Big Data search tool Elasticsearch into the Arches geospatial web application

When we started developing the Arches Geospatial Cultural Asset Management code base (archesproject.org), we decided on a graph-based data structure. Graphs are particularly nice for the semantic information built into the structure (i.e., node x is related to node y by relationship z). But in choosing this structure, you sacrifice a little speed during data access. For example, relating two nodes that share a parent-child relationship effectively requires 2 database joins. If they share a parent-grandchild relationship, it’s essentially 4 joins. Every additional level of nesting requires navigating 2 more database joins. You can see how deeply nested data can really tax a database.
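
The join arithmetic above can be sketched in a few lines. This is only an illustration of the "2 joins per level" rule of thumb stated here; the function name is hypothetical and nothing in it reflects the actual Arches schema.

```python
def joins_needed(levels: int) -> int:
    """Joins needed to relate a node to a descendant `levels` below it.

    Assumes the rule of thumb from the text: 2 database joins per level.
    """
    if levels < 0:
        raise ValueError("levels must be non-negative")
    return 2 * levels


for name, levels in [("parent-child", 1), ("parent-grandchild", 2), ("five levels deep", 5)]:
    print(f"{name}: {joins_needed(levels)} joins")
```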

For create/update/delete operations this sacrifice in performance is probably acceptable. People are fine waiting a few seconds while a “saving…” spinner is displayed. But for the bulk reads that happen when searching or displaying data on a map, speed is critical. So to achieve the best of both worlds (semantically rich data AND high performance) we chose to use Elasticsearch to speed up data access for searches and map display.

Elasticsearch (ES) is an enterprise-scale search engine built on top of Apache Lucene. We chose ES because of its distributed nature, its well-documented API, its ease of integration with our Django/Python-based app, and most importantly, because the documents that ES indexes are all in JSON format, which matches nicely with how Arches data is transferred between the client and server. Although ES provides client libraries for several different languages, the best part in my mind is the ability to access it completely over HTTP.

After setting up ES (which took all of 5 minutes), we were able to start indexing our resource graphs by POSTing them to a URL of the form host:port/{index name}/{index type}/{document id}. Here’s a very simple sample:

{
    "entityid": "dcdd41b8-c624-4a55-a4ef-a5036a82be93", 
    "property": "", 
    "entitytypeid": "FILES.E73", 
    "value": "", 
    "relatedentities": [ 
        {
            "entityid": "5e5a892e-652a-401a-bb8b-44f84272ab34", 
            "property": "-P94", 
            "entitytypeid": "CREATION EVENT.E65", 
            "value": "", 
            "relatedentities": [
                {
                    "entityid": "c963f2cd-5986-47c5-8a9e-37bf48414e89", 
                    "property": "P4", 
                    "entitytypeid": "TIME-SPAN_CREATION EVENT.E52", 
                    "value": "", 
                    "relatedentities": [
                        {
                            "entityid": "dda65ae5-05dd-41f5-93c1-ce182c401036", 
                            "property": "P78", 
                            "entitytypeid": "DATE OF COMPILATION.E50", 
                            "value": "2014-05-06", 
                            "relatedentities": []
                        }
                    ]
                }
            ]
        },
        {
            "entityid": "7ad89e88-0664-4b92-ad88-b49a775f3155", 
            "property": "P3", 
            "entitytypeid": "FILE PATH.E62", 
            "value": "http://localhost/arches_uploaded_file/files/ff_27.jpg", 
            "relatedentities": []
        },
        {
            "entityid": "dda65ae5-05dd-41f5-93c1-ce182c401036", 
            "property": "P1", 
            "entitytypeid": "TITLE.E41", 
            "value": "Stonehenge", 
            "relatedentities": []
        }
    ]
}
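
As a rough sketch, a request for a document like the one above could be assembled with Python's standard library alone, since ES is reachable over plain HTTP. The helper name `build_index_request` is hypothetical, and the sample document is trimmed for brevity. The host:port/{index name}/{index type}/{document id} URL shape is the one described in this post; later Elasticsearch releases dropped index types, so it may not apply to newer versions.

```python
import json
import urllib.parse
import urllib.request


def build_index_request(host, index, doc_type, doc_id, document):
    """Build (but do not send) a POST request that would index `document`."""
    # Percent-encode each path segment, since entity type ids can contain spaces.
    path = "/".join(
        urllib.parse.quote(part, safe="") for part in (index, doc_type, doc_id)
    )
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}/{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


document = {
    "entityid": "dcdd41b8-c624-4a55-a4ef-a5036a82be93",
    "property": "",
    "entitytypeid": "FILES.E73",
    "value": "",
    "relatedentities": [],  # nested entities omitted in this sketch
}

request = build_index_request(
    "localhost:9200", "entity", document["entitytypeid"], document["entityid"], document
)
print(request.get_method(), request.full_url)

# Sending it requires a running ES node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```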

Without any configuration, ES will index the above document when you put it in the POST body and submit it to a URL like this: localhost:9200/entity/FILES.E73/dcdd41b8-c624-4a55-a4ef-a5036a82be93. Unfortunately, because of the deeply nested nature of our data, even a simple example like this will generate a fairly complex mapping scheme (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping.html). This can make structuring queries against the data more difficult, so to simplify the query DSL that we have to generate, we decided to reduce and flatten what we index to just a few key items related to each resource. So, for example, the above document might get translated to something like this:

{
    "DATE OF COMPILATION.E50": "2014-05-06",
    "TITLE.E41": "Stonehenge",
    "FILE PATH.E62": "http://localhost/arches_uploaded_file/files/ff_27.jpg",
    "entitytypeid": "FILES.E73",
    "entityid": "be74b995-f3ee-4b93-b8c2-f5b3eda5d2e3"
}
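
One way such a flattening step might look is sketched below. This is not the Arches implementation. It assumes every node with a non-empty "value" becomes a top-level key named after its "entitytypeid", and that a later node silently overwrites an earlier one of the same type. The function name `flatten_entity` is hypothetical, and the sample graph omits the "property" keys for brevity.

```python
def flatten_entity(entity):
    """Collapse a nested entity graph into one flat dict of type -> value."""
    flat = {}

    def visit(node):
        # Keep only nodes that actually carry a value.
        if node.get("value"):
            flat[node["entitytypeid"]] = node["value"]
        for child in node.get("relatedentities", []):
            visit(child)

    visit(entity)
    # Identify the resource itself using the root node's ids.
    flat["entitytypeid"] = entity["entitytypeid"]
    flat["entityid"] = entity["entityid"]
    return flat


resource = {
    "entityid": "dcdd41b8-c624-4a55-a4ef-a5036a82be93",
    "entitytypeid": "FILES.E73",
    "value": "",
    "relatedentities": [
        {
            "entitytypeid": "CREATION EVENT.E65",
            "value": "",
            "relatedentities": [
                {
                    "entitytypeid": "TIME-SPAN_CREATION EVENT.E52",
                    "value": "",
                    "relatedentities": [
                        {
                            "entitytypeid": "DATE OF COMPILATION.E50",
                            "value": "2014-05-06",
                            "relatedentities": [],
                        }
                    ],
                }
            ],
        },
        {
            "entitytypeid": "FILE PATH.E62",
            "value": "http://localhost/arches_uploaded_file/files/ff_27.jpg",
            "relatedentities": [],
        },
        {"entitytypeid": "TITLE.E41", "value": "Stonehenge", "relatedentities": []},
    ],
}

flat = flatten_entity(resource)
print(flat)
```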

The mappings that ES generates for something like the above are then much easier to understand. The query DSL to find images whose title contains “Stonehenge” might look like this:

{
    "sort": [],
    "query": {
        "bool": {
            "should": [],
            "must_not": [],
            "must": [
                {
                    "match_phrase": {
                        "TITLE.E41": {
                            "query": "stonehenge"
                        }
                    }
                }
            ]
        }
    },
    "facets": {},
    "from": 0,
    "size": 50
}
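
Because the query DSL is just JSON, a query like the one above can be built as a plain Python dictionary. The sketch below uses a hypothetical helper, `build_phrase_query`, and leaves out the empty "sort", "should", "must_not" and "facets" entries, on the assumption that ES treats missing keys the same as empty ones.

```python
import json


def build_phrase_query(field, phrase, start=0, size=50):
    """Return a bool/must match_phrase query body for a single field."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {field: {"query": phrase}}},
                ]
            }
        },
        "from": start,
        "size": size,
    }


query = build_phrase_query("TITLE.E41", "stonehenge")
print(json.dumps(query, indent=4))
```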

In the simplified document, you can see that we now associate the TITLE.E41 node directly with its value (“TITLE.E41”: “Stonehenge”). Had we used our original document, the query DSL would have to query on the “value” key, which could return results we may not want. For example, it may return a resource that has this entity as one of its sub-nodes:

{
    "entityid": "7ad89e88-0664-4b92-ad88-b49a775f3155", 
    "property": "P3", 
    "entitytypeid": "FILE PATH.E62", 
    "value": "http://localhost/arches_uploaded_file/files/stonehenge_shot_glass.jpg", 
    "relatedentities": []
}
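
The ambiguity can be illustrated with a crude in-memory stand-in for a search on the generic "value" key. It uses a case-insensitive substring test rather than Elasticsearch's analyzed text matching, so it only shows why the node type matters, not how ES would actually score or tokenize the text. The function name and the "Souvenir shot glass" title are hypothetical.

```python
def values_containing(entity, term):
    """Return (entitytypeid, value) pairs anywhere in the graph whose value contains `term`."""
    hits = []
    if term.lower() in entity.get("value", "").lower():
        hits.append((entity["entitytypeid"], entity["value"]))
    for child in entity.get("relatedentities", []):
        hits.extend(values_containing(child, term))
    return hits


# A resource that is not about Stonehenge, but whose file name mentions it.
shot_glass = {
    "entitytypeid": "FILES.E73",
    "value": "",
    "relatedentities": [
        {
            "entitytypeid": "FILE PATH.E62",
            "value": "http://localhost/arches_uploaded_file/files/stonehenge_shot_glass.jpg",
            "relatedentities": [],
        },
        {
            "entitytypeid": "TITLE.E41",
            "value": "Souvenir shot glass",  # hypothetical title
            "relatedentities": [],
        },
    ],
}

# Searching every "value" finds the file path, an unwanted hit.
any_value_hits = values_containing(shot_glass, "stonehenge")
# Restricting to the title node, as the flattened document allows, finds nothing.
title_hits = [hit for hit in any_value_hits if hit[0] == "TITLE.E41"]
print(any_value_hits)
print(title_hits)
```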

ES is a very powerful and complex piece of software, and in the Arches project we’ve really only scratched the surface of what it can do. Like most things, it’s good to start simply and slowly build up complexity as needed.
