Archon
The Simple Archival Information System

Archon Technical Overview

This document was written in January 2007 by the original Archon developer, Chris Rishel, and edited by Chris Prom. It provides a very high-level overview of the system.

Background

To function, Archon requires only a web server with PHP 5 and a database server. MSSQL and MySQL are supported, and more can be implemented by adding a module to the database platform; Archon uses MDB2 as its database abstraction layer. The program is distributed with an automated installer to simplify installation, and can easily be configured and running live within five minutes. These characteristics are crucial: the platforms the system runs on are all available free of charge, and the technical details of installation are handled automatically, which makes Archon accessible to most institutions.

Development

A great deal of effort went into making the system simple to use from the developer’s perspective. The current version is built upon an object-oriented API that provides an extensive interface to the database and abstracts much of the error checking away from the casual developer.

For the purposes of development, the Archon code base can be logically subdivided into four groupings: public output modules, administrative interface modules, the Archon API, and the database platform.

Public output modules are responsible for the output made available to researchers, such as a finding aid, a MARC record, a list of subjects, or a response to a search query.

Administrative interface modules are components of the input aspect of Archon. They are accessible only to authenticated users (typically employees of the archival institution). Examples include modules that handle the input of collection data or the management of user accounts. One important distinction is that the administrative interface modules do not execute any database manipulation themselves; they generate the interface that gathers the data needed to do so.

The Archon API executes the instructions that manipulate data. For the most part, the API methods and procedures are expressed as refinements to subclasses of the core ARCHON object, which is loaded into memory every time a script is run. The API performs all necessary error checking to ensure that data standards are preserved, and operates on a security model that allows administrators to fine-tune levels of access for each user or group of users.

The database platform is the portion of the system that actually communicates with the database server. It also provides functionality to retrieve data about the database, its structure, and its contents.

One important concept in the system is the multiple levels of abstraction at work. Because the system has grown into a massive application, it would be nearly impossible to develop if every detail had to be known and handled at every level. In Archon, the hierarchy of abstraction from lowest to highest level is: the database platform; the API, expressed in the master ARCHON object class and its various subclasses (e.g. collections, digital library, accession, security, etc., which are technically extensions of the ARCHON object and inherit its properties and methods); and the output and administrative interfaces, which are equivalent in that they both simply generate views and handle data that is processed by the methods available on the various objects loaded when a script is executed.

Because the system should support whatever web technologies the end user has available, it was necessary to support the most commonly used database servers. Unfortunately, each of these database servers has its own commands to create tables and to insert, update, and remove data. The database platform in Archon abstracts these details from the developer and handles any necessary manipulation of queries automatically, so the next level of the system does not need to be concerned with them.
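
To illustrate the kind of difference the database platform hides, here is a minimal sketch. It is not Archon's actual code: the SqlDialect interface, the two dialect classes, and the selectFirstRows() and firstTenContentRows() functions are hypothetical names chosen for this example. It only shows how a single call can yield server-specific SQL for limiting a result set.

```php
<?php
// Hypothetical sketch: none of these class or function names come from Archon.

interface SqlDialect
{
    // Build a query returning at most $limit rows from $table.
    public function selectFirstRows(string $table, int $limit): string;
}

class MySqlDialect implements SqlDialect
{
    public function selectFirstRows(string $table, int $limit): string
    {
        // MySQL limits rows with a trailing LIMIT clause.
        return "SELECT * FROM `{$table}` LIMIT {$limit}";
    }
}

class MsSqlDialect implements SqlDialect
{
    public function selectFirstRows(string $table, int $limit): string
    {
        // SQL Server limits rows with TOP directly after SELECT.
        return "SELECT TOP {$limit} * FROM [{$table}]";
    }
}

// Higher layers ask for "the first ten rows" and never see the dialect.
function firstTenContentRows(SqlDialect $dialect): string
{
    return $dialect->selectFirstRows('tblCol_Content', 10);
}

echo firstTenContentRows(new MySqlDialect()), "\n";
echo firstTenContentRows(new MsSqlDialect()), "\n";
```

The calling code is identical for both servers; only the object passed in changes.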

The Archon API is a powerful set of over 300 functions that handle the management of the data in Archon. The API abstracts the specifics of data retrieval, searching, input/output, and validation, as well as security and protection from malicious attacks, and it generalizes functions to optimize efficiency in terms of both speed and storage requirements. Executive functions in the API return either true or false (if false, the API makes a detailed error message available to the calling function). This allows complex database operations to be executed with one statement and minimal error checking.
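
The true/false convention might look roughly like the following sketch. The SubjectRecord class, its dbStore() method, and its Error property are illustrative assumptions, not Archon's real API, and the database write is simulated rather than performed.

```php
<?php
// Hypothetical sketch of the "return true/false, expose an error" convention.
// SubjectRecord, dbStore() and $Error are illustrative names, not Archon's API.

class SubjectRecord
{
    public $ID = 0;
    public $Subject = '';
    public $Error = '';   // Detailed message from the last failed call.

    public function __construct(string $subject)
    {
        $this->Subject = $subject;
    }

    // Validate and "store" the record. A real implementation would write
    // to the database; here the write is simulated with a fixed ID.
    public function dbStore(): bool
    {
        if (trim($this->Subject) === '')
        {
            $this->Error = 'Could not store Subject: Subject is required.';
            return false;
        }

        $this->ID = 1;
        $this->Error = '';
        return true;
    }
}

// The caller needs a single statement and a single check.
$objSubject = new SubjectRecord('');
if (!$objSubject->dbStore())
{
    echo $objSubject->Error, "\n";
}
```

Validation lives inside the method, so calling code only has to test the return value and, on failure, read the message.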

Effectively, this means that developers working at a higher level (such as the output or administrative interface) do not need any knowledge of the structure or implementation of the database; they can just call the appropriate function or method, and the required data will be loaded into an array or posted back to the constructor as specified in the ______update function. All they need to do is read the API documentation, or find the appropriate method, to learn how to call the necessary function(s). Furthermore, the API was carefully crafted to be consistent, so function and variable names are predictable and the API can be learned easily.

Examples of Non-Trivial API Optimizations

The generation of finding aids is one of the key components of Archon. These finding aids can range in size from describing fewer than 100 items to describing thousands. As one might imagine, the process of loading and preparing these finding aids is somewhat computationally expensive. Before analyzing the algorithms that handle the generation of finding aids, it is necessary to consider the data structures involved. To maintain a sound logical organization, the “collection content” uses a recursive data structure, where one piece of content contains its immediate children as well as a reference to its parent. A snippet of the relevant class definitions follows:

class Collection 
{
    …

    /** 
     * @var CollectionContent[] 
     */ 
    public $Content = array(); 

    …
}

class CollectionContent 
{
    …

    /** 
     * @var Collection 
     */ 
    public $Collection = NULL;

    /** 
     * @var CollectionContent[] 
     */ 
    public $Content = array(); 

    /** 
     * @var CollectionContent 
     */ 
    public $Parent = NULL;

    …
}

When a finding aid is requested, a Collection object is loaded. Its Content class variable is an array of the “root-level” CollectionContent objects. The root-level content (usually series) have their Parent variables set to NULL (because they have no parent), and their Content variables contain an array of all the content they directly contain.
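
As a rough illustration of how such a structure can be walked, the sketch below builds a three-level tree and prints it as an indented outline. It uses a simplified stand-in for CollectionContent; the Title property and the addChild() and outline() helpers are assumptions made for this example and are not taken from Archon.

```php
<?php
// Simplified stand-in for the CollectionContent class described above.
// The Title property, addChild() and outline() are illustrative assumptions.

class CollectionContent
{
    public $Title = '';
    public $Content = array();   // Direct children.
    public $Parent = NULL;       // NULL for root-level content.

    public function __construct(string $title)
    {
        $this->Title = $title;
    }

    // Attach a child and set its Parent reference.
    public function addChild(CollectionContent $child): void
    {
        $child->Parent = $this;
        $this->Content[] = $child;
    }
}

// Return one indented line per node, visiting children depth-first.
function outline(array $content, int $depth = 0): array
{
    $lines = array();
    foreach ($content as $objContent)
    {
        $lines[] = str_repeat('  ', $depth) . $objContent->Title;
        $lines = array_merge($lines, outline($objContent->Content, $depth + 1));
    }
    return $lines;
}

$series = new CollectionContent('Series 1: Correspondence');
$box = new CollectionContent('Box 1');
$box->addChild(new CollectionContent('Folder 1'));
$series->addChild($box);

echo implode("\n", outline(array($series))), "\n";
```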

Because the CollectionContent data is recursive and somewhat similar to a tree in structure, the initial algorithm (shown below) used to load a collection’s content was a slightly modified depth-first search (DFS) traversal: all of the root-level content would be loaded, and then a DFS traversal would be performed for each piece of root-level content. Note that the code and database query have been simplified substantially to improve readability.

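function traversal_DisplayCollection($id) 
{ 
    $objCollection = new Collection($id); 

    // Walk and display the content tree, starting from the root level
    // (root-level content has a ContainedByID of 0).
    traversal_RecurseContent($id, 0); 
} 

function traversal_RecurseContent($collectionid, $containedbyid) 
{ 
    // Simplified: assumes the database handle is available as a global.
    global $db;

    // One query per piece of content: fetch only its direct children.
    $query = " 
      SELECT * 
      FROM 
         tblCol_Content 
      WHERE 
         CollectionID = '$collectionid' 
       AND 
         ContainedByID = '$containedbyid'"; 

    $result = $db->query($query); 
    while($row = $db->fetch_array($result)) 
    { 
        // Display this piece of content, then descend into its children.
        traversal_DisplayContent($row); 
        traversal_RecurseContent($collectionid, $row['ID']); 
    } 
}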

For one test collection, which contained over 10,000 content entries, preparing a finding aid took over two minutes. After some investigation, it became clear that the latency of making thousands of queries to the database server was the cause of the problem. The current version of Archon uses a “dump-and-sort” algorithm, which makes one query to the database to select all the content for a given collection and then sorts the data into the proper structure. This also avoids the overhead of many recursive calls. This method of loading takes approximately three seconds for the same collection. The following code is a simplified form of the dbLoadContent function of the Collection class.

public function dbLoadContent() 
{ 
    // Simplified: assumes the database handle is available as a global.
    global $db;

    $query = " 
      SELECT * 
      FROM 
         tblCol_Content 
      WHERE 
         tblCol_Content.CollectionID = '$this->ID'"; 

    $result = $db->query($query); 

    while($row = $db->fetch_array($result)) 
    { 
        // If Content[$row['ID']] is already a CollectionContent, for example, in the
        // case where a child was found before the parent, we don't want a new instance,
        // but we do want to run the constructor. 
        if(($this->Content[$row['ID']] instanceof CollectionContent)) 
        { 
            $this->Content[$row['ID']]->CollectionContent($row); 
        } 
        else 
        { 
            $this->Content[$row['ID']] = New CollectionContent($row); 
        } 

        $this->Content[$row['ID']]->Collection = $this; 

        // If the current CollectionContent has a parent, add it to Content[] of the
        // parent. 
        if($row['ParentID']) 
        { 
            // If the parent has not been found yet, make a new CollectionContent
            // instance for it. 
            if(!($this->Content[$row['ParentID']] instanceof CollectionContent)) 
            { 
                $this->Content[$row['ParentID']] =
                   New CollectionContent($row['ParentID']); 
            } 

            $this->Content[$row['ParentID']]->Content[$row['ID']] =
               $this->Content[$row['ID']]; 
            $this->Content[$row['ID']]->Parent = $this->Content[$row['ParentID']]; 
        } 
    } 
}
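
The same idea can be tried without a database. The sketch below is a self-contained approximation, not the Archon implementation: plain arrays stand in for the rows of the single query, and the Node class, the Title column, and the buildTree() helper are assumptions introduced for illustration. It shows the case the comments above describe, where a child row arrives before its parent.

```php
<?php
// Self-contained sketch of the dump-and-sort idea using plain arrays.
// Node, the Title column and buildTree() are assumptions, not part of Archon.

class Node
{
    public $ID = 0;
    public $Title = '';
    public $Parent = NULL;
    public $Content = array();
}

function buildTree(array $rows): array
{
    $nodes = array();   // Every node, keyed by ID.
    $roots = array();   // Root-level nodes only.

    foreach ($rows as $row)
    {
        // Reuse the placeholder if a child already created this node.
        if (!isset($nodes[$row['ID']]))
        {
            $nodes[$row['ID']] = new Node();
        }
        $node = $nodes[$row['ID']];
        $node->ID = $row['ID'];
        $node->Title = $row['Title'];

        if ($row['ParentID'])
        {
            // Parent not seen yet: create an empty placeholder for it.
            if (!isset($nodes[$row['ParentID']]))
            {
                $nodes[$row['ParentID']] = new Node();
            }
            $node->Parent = $nodes[$row['ParentID']];
            $node->Parent->Content[$node->ID] = $node;
        }
        else
        {
            $roots[$node->ID] = $node;
        }
    }

    return $roots;
}

// The child row (ID 2) deliberately arrives before its parent (ID 1).
$rows = array(
    array('ID' => 2, 'ParentID' => 1, 'Title' => 'Box 1'),
    array('ID' => 1, 'ParentID' => 0, 'Title' => 'Series 1'),
);
$roots = buildTree($rows);
echo $roots[1]->Title, ' > ', $roots[1]->Content[2]->Title, "\n";
```

Each row is visited exactly once, so the work grows linearly with the number of rows regardless of the order in which the server returns them.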

Another optimization that stemmed from this improvement is the caching of data whenever possible. There are a great many functions in the Archon API of the form getAllX(), where X is some set of data (for example, LevelContainers, Subjects, or even Collections). Throughout the execution of a script, many functions are called, and some may call the same getAllX() that has already been called elsewhere. Archon has an abstracted system to cache the data retrieved from these calls so that, as long as the database table from which the data was loaded has not changed, any subsequent calls return the cached version of the data, saving a great deal of time.
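
A minimal sketch of this kind of per-table cache follows. The TableCache class, its getAll() and invalidate() methods, the Loads counter, and the tblSubjects table name are hypothetical; Archon's actual caching layer may be organized quite differently.

```php
<?php
// Hypothetical sketch of per-table caching; TableCache, its methods and
// the tblSubjects table name are illustrative, not Archon's caching layer.

class TableCache
{
    private $cache = array();   // Cached results, keyed by table name.
    public $Loads = 0;          // Counts simulated database reads.

    // Return all rows for $table, calling $loader only on a cache miss.
    public function getAll(string $table, callable $loader): array
    {
        if (!array_key_exists($table, $this->cache))
        {
            $this->cache[$table] = $loader();
            $this->Loads++;
        }
        return $this->cache[$table];
    }

    // Call after any write to $table so the next read reloads it.
    public function invalidate(string $table): void
    {
        unset($this->cache[$table]);
    }
}

$cache = new TableCache();
$loadSubjects = function ()
{
    return array('Maps', 'Photographs');   // Stands in for a real query.
};

$cache->getAll('tblSubjects', $loadSubjects);   // Miss: runs the loader.
$cache->getAll('tblSubjects', $loadSubjects);   // Hit: served from memory.
$cache->invalidate('tblSubjects');              // Table changed.
$cache->getAll('tblSubjects', $loadSubjects);   // Miss again.

echo $cache->Loads, "\n";
```

Keying the cache by table name keeps invalidation simple: any write to a table discards only the data loaded from that table.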

