Applications

Introduction

Most of the applications discussed in this chapter are built regularly, at least once a day, from the latest sources. If you are at NCBI, you can find the latest version in the directory $NCBI/c++/Release/bin/ (or $NCBI/c++/Debug/bin/).

Chapter Outline

The following is an outline of the topics presented in this chapter:

DATATOOL: code generation and data serialization utility
Load Balancing
NCBI Genome Workbench
- Design goals
- Design
NCBI NetCache Service

DATATOOL: Code Generation and Data Serialization Utility

DATATOOL source code is located at c++/src/serial/datatool; this application can perform the following:

Generate C++ data storage classes based on ASN.1, DTD, XML Schema or JSON Schema specification to be used with NCBI data serialization streams.
Convert ASN.1 specification into a DTD, XML Schema or JSON Schema specification and vice versa.
Convert data between ASN.1, XML and JSON formats.
Generate SOAP client code based on WSDL specification.

Note: Because ASN.1, XML and JSON are, in general, incompatible, the specification and data conversion functions are supported only partially.

The following topics are discussed in subsections:

Invocation
Data specification conversion
Definition file
Module file
Generated code
Class diagrams

Invocation

The following topics are discussed in this section:

Main arguments
Code generation arguments

Main Arguments

See Table 1.

Table 1. Main arguments

Argument	Effect	Comments
-h	Display the DATATOOL arguments	Ignores other arguments
-m <file>	Module specification file(s) - ASN.1, DTD, XSD or JSON	Required argument
-M <file>	External module file(s)	Is used for IMPORT type resolution
-i	Ignore unresolved types	Is used for IMPORT type resolution
-f <file>	Write ASN.1 module file
-fx <file>	Write DTD module file	“-fx m” writes modular DTD file
-fxs <file>	Write XML Schema file
-fjs <file>	Write JSON Schema file
-fd <file>	Write specification dump file in datatool internal format
-ms <string>	Suffix of modular DTD or XML Schema file name
-dn <string>	DTD module name in XML header	No extension. If empty, omit DOCTYPE declaration.
-v <file>	Read value in ASN.1 text format
-vx <file>	Read value in XML format
-vj <file>	Read value in JSON format
-F	Read value completely into memory
-p <file>	Write value in ASN.1 text format
-px <file>	Write value in XML format
-pj <file>	Write value in JSON format
-d <file>	Read value in ASN.1 binary format	-t argument required
-t <type>	Binary value type name	See -d argument
-e <file>	Write value in ASN.1 binary format
-xmlns	XML namespace name	When specified, also makes XML output file reference Schema instead of DTD
-sxo	No scope prefixes in XML output
-sxi	No scope prefixes in XML input
-logfile <File_Out>	File to which the program log should be redirected
-conffile <File_In>	Program’s configuration (registry) data file
-version	Print version number	Ignores other arguments

Code Generation Arguments

See Table 2.

Table 2. Code generation arguments

Argument	Effect	Comments
-od <file>	C++ code definition file	See Definition file
-ods	Generate an example definition file (e.g. `MyModuleName._sample_def`)	Must be used with another option that generates code such as -oA.
-odi	Ignore absent code definition file
-odw	Issue a warning about absent code definition file
-oA	Generate C++ files for all types	Only types from the main module are used (see -m and -mx arguments).
-ot <types>	Generate C++ files for listed types	Only types from the main module are used (see -m and -mx arguments).
-ox <types>	Exclude types from generation
-oX	Turn off recursive type generation
-of <file>	Write the list of generated C++ files
-oc <file>	Write combining C++ files
-on <string>	Default namespace	The value “-” in the Definition file means don’t use a namespace at all and overrides the -on option specified elsewhere.
-opm <dir>	Directory for searching source modules
-oph <dir>	Directory for generated *.hpp files
-opc <dir>	Directory for generated *.cpp files
-or <prefix>	Add prefix to generated file names
-orq	Use quoted syntax form for generated include files
-ors	Add source file dir to generated file names
-orm	Add module name to generated file names
-orA	Combine all -or* prefixes
-ocvs	Create “.cvsignore” files
-oR <dir>	Set -op* and -or* arguments for NCBI directory tree
-oDc	Turn ON generation of Doxygen-style comments	The value “-” in the Definition file means don’t generate Doxygen comments and overrides the -oDc option specified elsewhere.
-odx <string>	URL of documentation root folder	For Doxygen
-lax_syntax	Allow non-standard ASN.1 syntax accepted by asntool	The value “-” in the Definition file means don’t allow non-standard syntax and overrides the -lax_syntax option specified elsewhere.
-pch <string>	Name of the precompiled header file to include in all *.cpp files
-oex <export>	Add storage-class modifier to generated classes	Can be overridden by [-]._export in the definition file.

Data Specification Conversion

When parsing a data specification, DATATOOL identifies the specification format based on the source file extension - ASN, DTD, XSD, JSD or WSDL.
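
As a rough illustration of extension-based detection, the following Python sketch maps the extensions listed above to format names. It is not DATATOOL code: the function name, the format labels, and the case-insensitive comparison are assumptions made for the example.

```python
from pathlib import Path

# Extensions named in the text above; the format labels are illustrative.
SPEC_FORMATS = {
    ".asn": "ASN.1",
    ".dtd": "DTD",
    ".xsd": "XML Schema",
    ".jsd": "JSON Schema",
    ".wsdl": "WSDL",
}

def guess_spec_format(filename):
    """Return the specification format implied by the file extension."""
    ext = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if ext not in SPEC_FORMATS:
        raise ValueError(f"unrecognized specification extension: {ext!r}")
    return SPEC_FORMATS[ext]
```

For instance, guess_spec_format("catalogentry.xsd") returns "XML Schema".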

Scope Prefixes

Initially, DATATOOL and the serial library supported serialization in ASN.1 and XML format, and conversion of ASN.1 specification into DTD. Compared to ASN.1, DTD is a very sketchy specification in the sense that there is only one primitive type - string, and all elements are defined globally. The latter feature of DTD led to a decision to use ‘scope prefixes’ in XML output to avoid potential name conflicts. For example, consider the following ASN.1 specification:

Date ::= CHOICE {
    str VisibleString, 
    std Date-std
}
Time ::= CHOICE {
    str VisibleString, 
    std Time-std
}

Here, accidentally, element str is defined identically both in Date and Time productions; while the meaning of element std depends on the context. To avoid ambiguity, this specification translates into the following DTD:

<!ELEMENT Date (Date_str | Date_std)>
<!ELEMENT Date_str (#PCDATA)>
<!ELEMENT Date_std (Date-std)>
<!ELEMENT Time (Time_str | Time_std)>
<!ELEMENT Time_str (#PCDATA)>
<!ELEMENT Time_std (Time-std)>

Accordingly, these scope prefixes made their way into XML output.
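
The naming rule can be sketched in a few lines of Python. This is only an illustration of how the prefixed names in the DTD above are formed, not the DATATOOL implementation; the function name and the simplified type handling are assumptions.

```python
def choice_to_dtd(type_name, members):
    """Emit DTD declarations for an ASN.1 CHOICE, prefixing each member
    name with the enclosing type name (the 'scope prefix').

    members: list of (member_name, asn_type) pairs.
    """
    prefixed = [f"{type_name}_{name}" for name, _ in members]
    lines = [f"<!ELEMENT {type_name} ({' | '.join(prefixed)})>"]
    for elem, (_, asn_type) in zip(prefixed, members):
        # Simplifying assumption: VisibleString becomes character data and
        # anything else is a reference to another named type.
        content = "#PCDATA" if asn_type == "VisibleString" else asn_type
        lines.append(f"<!ELEMENT {elem} ({content})>")
    return lines

for line in choice_to_dtd("Date", [("str", "VisibleString"), ("std", "Date-std")]):
    print(line)
```

Calling the function for Time with the same member names yields Time_str and Time_std, so the two str elements no longer collide in the global DTD namespace.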

Later, DTD parsing was added to DATATOOL. Here, scope prefixes were not needed. Also, since these prefixes considerably increase the size of the XML output, they can be omitted when it is known in advance that there can be no ambiguity. DATATOOL therefore provides command line flags for this purpose (-sxo and -sxi in Table 1).

With the addition of the XML Schema parser and generator, elements can be declared locally in the Schema when converting an ASN.1 specification, so scope prefixes make almost no sense. Still, they are preserved for compatibility.

Modular DTD and Schemata

Here, ‘module’ means an ASN.1 module. A single ASN.1 specification file may contain several modules. When converting it into a DTD or XML Schema, it might be convenient to put each module’s definitions into a separate file. To do so, specify a special file name in the -fx or -fxs command line parameter (for example, “-fx m”, as noted in Table 1). The names of the output DTD or Schema files will then be chosen automatically: they will be named after the ASN.1 modules defined in the source. ‘Modular’ output does not make much sense when the source specification is a DTD or Schema.

You can find a number of DTDs and Schema converted by DATATOOL from NCBI public ASN.1 specifications here.

Converting XML Schema into ASN.1

There are two major problems in converting XML schema into ASN.1 specification: how to define XML attributes and how to convert complex content models. The solution was greatly affected by the underlying implementation of data storage classes (classes which DATATOOL generates based on a specification). So, for example the following Schema

<xs:element name="Author">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="LastName" type="xs:string"/>
      <xs:choice minOccurs="0">
        <xs:element name="ForeName" type="xs:string"/>
        <xs:sequence>
          <xs:element name="FirstName" type="xs:string"/>
          <xs:element name="MiddleName" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:choice>
      <xs:element name="Initials" type="xs:string" minOccurs="0"/>
      <xs:element name="Suffix" type="xs:string" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute name="gender" use="optional">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="male"/>
          <xs:enumeration value="female"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

translates into this ASN.1:

Author ::= SEQUENCE {
  attlist SET {
    gender ENUMERATED {
      male (1),
      female (2)
    } OPTIONAL
  },
  lastName VisibleString,
  fF CHOICE {
    foreName VisibleString,
    fM SEQUENCE {
      firstName VisibleString,
      middleName VisibleString OPTIONAL
    }
  } OPTIONAL,
  initials VisibleString OPTIONAL,
  suffix VisibleString OPTIONAL
}

Each unnamed local element gets a name. When generating C++ data storage classes from Schema, DATATOOL marks such data types as anonymous.

It is possible to convert the source Schema into ASN.1 and then use DATATOOL to generate C++ classes from the latter. In this case, DATATOOL and the serial library keep the ASN.1 output compatible. Suppose you generate data storage classes from a Schema and use them to write data in ASN.1 format (binary or text). If you then convert that Schema into ASN.1, generate classes from it, and write the same data in ASN.1 format using this new set of classes, the two files will be identical.

Definition File

It is possible to fine-tune the C++ code generation by using a definition file, which is specified with the -od argument. The definition file uses the generic NCBI configuration format also used in the configuration (*.ini) files found in NCBI’s applications.

DATATOOL looks for code generation parameters in several sections of the file in the following order:

[ModuleName.TypeName]
[TypeName]
[ModuleName]
[-]

Parameter definitions follow a “name = value” format. The “name” part of the definition serves two functions: (1) selecting the specific element to which the definition applies, and (2) selecting the code generation parameter (such as _class) that will be fine-tuned for that element.

To modify a top-level element, use a definition line where the name part is simply the desired code generation parameter (such as _class). To modify a nested element, use a definition where the code generation parameter is prefixed by a dot-separated “path” of the successive container element names from the data format specification. For path elements of type SET OF or SEQUENCE OF, use an “E” as the element name (which would otherwise be anonymous). Note: Element names will depend on whether you are using ASN.1, DTD, or Schema.

For example, consider the following ASN.1 specification:

MyType ::= SEQUENCE {
    label VisibleString ,
    points SEQUENCE OF
        SEQUENCE {
            x INTEGER ,
            y INTEGER
        }
}

Code generation for the various elements can be fine-tuned as illustrated by the following sample definition file:

[MyModule.MyType]
; modify the top-level element (MyType)
_class = CMyTypeX

; modify a contained element
label._class = CTitle

; modify a "SEQUENCE OF" container type
points._type = vector

; modify members of an anonymous SEQUENCE contained in a "SEQUENCE OF"
points.E.x._type = double
points.E.y._type = double

; modify a DATATOOL-assigned class name
points.E._class = CPoint

Note: DATATOOL assigns arbitrary names to otherwise anonymous containers. In the example above, the SEQUENCE containing x and y has no name in the specification, so DATATOOL assigned the name E. To change a DATATOOL-assigned name, create a definition file and rename the class using the appropriate _class entry as shown above. To find out what the DATATOOL-assigned name will be, create a sample definition file using the DATATOOL -ods option. This approach works regardless of the data specification format (ASN.1, DTD, or XSD).
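
To make the path convention concrete, here is a small Python sketch that walks a nested description of MyType and lists the dot-separated paths usable as name prefixes in a definition file. The tuple-based representation and the function name are invented for this example; only the use of “E” for SEQUENCE OF and SET OF elements comes from the text.

```python
# Hypothetical in-memory form of the MyType specification above:
# ("SEQUENCE", [(name, type), ...]), ("SEQUENCE OF", element_type), or a
# primitive type name given as a string.
MY_TYPE = ("SEQUENCE", [
    ("label", "VisibleString"),
    ("points", ("SEQUENCE OF", ("SEQUENCE", [
        ("x", "INTEGER"),
        ("y", "INTEGER"),
    ]))),
])

def element_paths(node, prefix=""):
    """Yield the dot-separated path of every nested element."""
    if isinstance(node, str):
        return  # primitive type: nothing nested
    kind, body = node
    if kind in ("SEQUENCE OF", "SET OF"):
        path = f"{prefix}.E" if prefix else "E"
        yield path
        yield from element_paths(body, path)
    else:  # container with named members
        for name, member_type in body:
            path = f"{prefix}.{name}" if prefix else name
            yield path
            yield from element_paths(member_type, path)

print(list(element_paths(MY_TYPE)))
```

The resulting paths (label, points, points.E, points.E.x, points.E.y) are the prefixes used in the sample definition file, for example points.E._class.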

The following additional topics are discussed in this section:

Common definitions
Definitions that affect specific types
The Special [-] Section
Examples

Common Definitions

Some definitions refer to the generated class as a whole.

_file Defines the base filename for the generated or referenced C++ class.

For example, the following definitions:

[ModuleName.TypeName]
_file=AnotherName

Or

[TypeName]
_file=AnotherName

would put the class CTypeName in files with the base name AnotherName, whereas these two:

[ModuleName]
_file=AnotherName

Or

[-]
_file=AnotherName

put all the generated classes into a single file with the base name AnotherName.

_extra_headers Specify additional header files to include.

For example, the following definition:

[-]
_extra_headers=name1 name2 \"name3\"

would put the following lines into all generated headers:

#include <name1>
#include <name2>
#include "name3"

Note the name3 clause. Putting name3 in quotes instructs DATATOOL to use the quoted syntax in generated files. Also, the quotes must be escaped with backslashes.
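
The mapping from an _extra_headers value to include directives can be sketched as follows. This Python function is an illustration only; it assumes that names are separated by whitespace and that the backslashes serve purely to escape the quotes.

```python
def include_lines(extra_headers):
    """Turn an _extra_headers value into #include directives.

    A name wrapped in (backslash-escaped) double quotes gets the quoted
    form; any other name gets the angle-bracket form.
    """
    lines = []
    for token in extra_headers.split():
        name = token.replace("\\", "")  # drop the escaping backslashes
        if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
            lines.append(f"#include {name}")
        else:
            lines.append(f"#include <{name}>")
    return lines
```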

_dir Subdirectory in which the generated C++ files will be stored (if _file is not specified) or a subdirectory in which the referenced class from an external module can be found. The subdirectory is added to include directives.

_class The name of the generated class (if _class=- is specified, then no code is generated for this type).

For example, the following definitions:

[ModuleName.TypeName]
_class=CAnotherName

Or

[TypeName]
_class=CAnotherName

would cause the class generated for the type TypeName to be named CAnotherName, whereas these two:

[ModuleName]
_class=CAnotherName

Or

[-]
_class=CAnotherName

would result in all the generated classes having the same name CAnotherName (which is probably not what you want).

_namespace The namespace in which the generated class (or classes) will be placed.

_parent_class The name of the base class from which the generated C++ class is derived.

_parent_type Derive the generated C++ class from the class that corresponds to the specified type (if _parent_class is not specified).

It is also possible to specify a storage-class modifier, which is required on Microsoft Windows to export/import generated classes from/to a DLL. This setting affects all generated classes in a module. An appropriate section of the definition file should look like this:

[-]
_export = EXPORT_SPECIFIER

Because this modifier could also be specified in the command line, the DATATOOL code generator uses the following rules to choose the proper one:

If no -oex flag is given in the command line, no modifier is added at all.
If -oex "" (that is, an empty modifier) is specified in the command line, then the modifier from the definition file will be used.
The command-line parameter in the form -oex FOOBAR will cause the generated classes to have a FOOBAR storage-class modifier, unless another one is specified in the definition file. The modifier from the definition file always takes precedence.
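
These rules can be condensed into a small decision function. The Python sketch below is one reading of the rules, not DATATOOL code; the function and parameter names are hypothetical, and the case of an empty -oex with no _export entry is assumed to yield no modifier.

```python
def export_modifier(oex, def_file_export):
    """Pick the storage-class modifier for generated classes.

    oex: value of the -oex argument, or None when -oex is absent.
    def_file_export: the _export value from the definition file, or None.
    """
    if oex is None:
        return None             # no -oex flag: no modifier at all
    if def_file_export:
        return def_file_export  # the definition file takes precedence
    return oex or None          # otherwise use the -oex value, if any
```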

Definitions That Affect Specific Types

The following additional topics are discussed in this section:

INTEGER, REAL, BOOLEAN, NULL
ENUMERATED
OCTET STRING
SEQUENCE OF, SET OF
SEQUENCE, SET
CHOICE

INTEGER, REAL, BOOLEAN, NULL

_type C++ type: int, short, unsigned, long, etc.

ENUMERATED

_type C++ type: int, short, unsigned, long, etc.

_prefix Prefix for names of enum values. The default is “e”.

OCTET STRING

_char Vector element type: char, unsigned char, or signed char.

SEQUENCE OF, SET OF

_type STL container type: list, vector, set, or multiset.

SEQUENCE, SET

memberName._delay Mark the specified member for delayed reading.

CHOICE

_virtual_choice If not empty, do not generate a special class for the choice. Instead, make the choice class the parent class of all its variants.

variantName._delay Mark the specified variant for delayed reading.

The Special [-] Section

There is a special section [-] allowed in the definition file which can contain definitions related to code generation. This is a good place to define a namespace or identify additional headers. It is a “top level” section, so entries placed here will override entries with the same name in other sections or on the command-line. For example, the following entries set the proper parameters for placing header files alongside source files:

[-]
; Do not use a namespace at all:
-on  = -

; Use the current directory for generated .cpp files:
-opc = .

; Use the current directory for generated .hpp files:
-oph = .

; Do not add a prefix to generated file names:
-or  = -

; Generate #include directives with quotes rather than angle brackets:
-orq = 1

Any of the code generation arguments in Table 2 (except -od, -odi, and -odw which are related to specifying the definition file) can be placed in the [-] section.

In some cases, the special value "-" causes special processing as noted in Table 2.

Examples

If we have the following ASN.1 specification (this is not a “real” specification - it is only for illustration):

Date ::= CHOICE {
    str VisibleString,
    std Date-std
}
Date-std ::= SEQUENCE {
    year INTEGER,
    month INTEGER OPTIONAL
}
Dates ::= SEQUENCE OF Date
Int-fuzz ::= CHOICE {
    p-m INTEGER,
    range SEQUENCE {
        max INTEGER,
        min INTEGER
    },
    pct INTEGER,
    lim ENUMERATED {
        unk (0),
        gt (1),
        lt (2),
        tr (3),
        tl (4),
        circle (5),
        other (255)
    },
    alt SET OF INTEGER
}

Then the following definitions will affect the generation of objects:

Definition	Affected Objects
`[Date]` `str._type = string`	the `str` member of the `Date` structure
`[Dates]` `E._pointer = true`	elements of the `Dates` container
`[Int-fuzz]` `range.min._type = long`	the `min` member of the `range` member of the `Int-fuzz` structure
`[Int-fuzz]` `alt.E._type = long`	elements of the `alt` member of the `Int-fuzz` structure

As another example, suppose you have a CatalogEntry type comprised of a Summary element and either a RecordA element or a RecordB element, as defined by the following XSD specification:

<?xml version="1.0" encoding="UTF-8"?>

<schema
    xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:tns="https://ncbi.nlm.nih.gov/some/unique/path"
    targetNamespace="https://ncbi.nlm.nih.gov/some/unique/path"
    elementFormDefault="qualified"
>

    <element name="CatalogEntry" type="tns:CatalogEntryType" />

    <complexType name="CatalogEntryType">
        <sequence>
            <element name="Summary" type="string" />
            <choice>
                <element name="RecordA" type="int" />
                <element name="RecordB" type="int" />
            </choice>
        </sequence>
    </complexType>

</schema>

In this specification, the <choice> element in CatalogEntryType is anonymous, so DATATOOL will assign an arbitrary name to it. The assigned name will not be descriptive, but fortunately you can use a definition file to change the assigned name.

First find the DATATOOL-assigned name by creating a sample definition file using the -ods option:

datatool -ods -oA -m catalogentry.xsd

The sample definition file (catalogentry._sample_def) shows RR as the class name:

[CatalogEntry]
RR._class = 
Summary._class = 

Then edit the module definition file (catalogentry.def) and change RR to a more descriptive class name, for example:

[CatalogEntry]
RR._class=CRecordChoice

The new name will be used the next time the module is built.

Module File

Module files are not used directly by DATATOOL, but they are read by new_module.sh and project_tree_builder and therefore determine what DATATOOL’s command line will be when DATATOOL is invoked from the NCBI build system.

Module files simply consist of lines of the form “KEY = VALUE”. Only the key MODULE_IMPORT is currently used (and is the only key ever recognized by project_tree_builder). Other keys used to be recognized by module.sh and still harmlessly remain in some files. The possible keys are:

MODULE_IMPORT These definitions contain a space-delimited list of other modules to import. The paths should be relative to .../src and should not include extensions.

For example, a valid entry could be:

MODULE_IMPORT = objects/general/general objects/seq/seq

MODULE_ASN, MODULE_DTD, MODULE_XSD These definitions explicitly set the specification filename (normally foo.asn, foo.dtd, or foo.xsd for foo.module). Almost no module files contain this definition. It is no longer used by the project_tree_builder and is therefore not necessary.

MODULE_PATH Specifies the directory containing the current module, again relative to .../src. Almost all module files contain this definition; however, it is no longer used by either new_module.sh or the project_tree_builder and is therefore not necessary.

Generated Code

The following additional topics are discussed in this section:

Normalized name
ENUMERATED types

Normalized Name

By default, DATATOOL generates “normalized” C++ class names from ASN.1 type names using two rules:

Convert any hyphens (“-”) into underscores (“_”), because hyphens are not legal characters in C++ class names.
Prepend a ‘C’ character.

For example, the default normalized C++ class name for the ASN.1 type name “Seq-data” is “CSeq_data”.
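
The two rules amount to a one-line transformation, shown here as a Python sketch (the function name is made up for the illustration):

```python
def normalized_class_name(asn_type_name):
    """Apply the two default rules: hyphens to underscores, then prefix 'C'."""
    return "C" + asn_type_name.replace("-", "_")

print(normalized_class_name("Seq-data"))
```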

The default C++ class name can be overridden by explicitly specifying in the definition file a name for a given ASN.1 type name. For example:

[MyModule.Seq-data]
_class=CMySeqData

ENUMERATED Types

By default, for every ENUMERATED ASN.1 type, DATATOOL will produce a C++ enum type with the name ENormalizedName.

Class Diagrams

The following topics are discussed in this section:

Specification analysis
Data types
Data values
Code generation

Specification Analysis

The following topics are discussed in this section:

ASN.1 specification analysis
DTD specification analysis

ASN.1 Specification Analysis

See Figure 1.

ASN.1 specification analysis.

DTD Specification Analysis

See Figure 2.

DTD specification analysis.

Data Types

See CDataType.

Data Values

See Figure 3.

Data values.

Code Generation

See Figure 4.

Code generation.

Load Balancing

Overview
Load Balancing Service Mapping Daemon (LBSMD)
Database Load Balancing
DISPD Network Dispatcher
NCBID Server Launcher
Firewall Daemon (FWDaemon)
Launcherd Utility
Monitoring Tools
Quality Assurance Domain

Note: For security reasons, not all links in the public version of this document are accessible to users outside NCBI.

The section covers the following topics:

The purpose of load balancing
All the separate components’ purpose, internal details, configuration
Communications between the components
Monitoring facilities

Overview

The purpose of load balancing is to distribute the load among the service providers available on the NCBI network based on certain rules. The load is generated by both locally-connected and Internet-connected users. The figures below show the most typical usage scenarios.

Figure 5. Local Clients

Please note that the figure is slightly simplified to remove unnecessary details for the time being.

For local access to NCBI resources, two NCBI-developed components are involved in the interactions: the LBSMD daemon (Load Balancing Service Mapping Daemon) and mod_caf (Cookie/Argument Affinity module), an Apache web server module.

The LBSMD daemon runs on each host in the NCBI network. The daemon reads its configuration file, which describes all the services available on the host. It then regularly broadcasts the available services and the current host load to the adjacent LBSMD daemons. The data received from the other LBSMD daemons are stored in a special table. Eventually, the LBSMD daemon on each host has a full description of the services available on the network as well as the current load of the hosts.

The mod_caf Apache module analyzes special cookies and query line arguments, and reads data from the table populated by the LBSMD daemon. Based on the best match, it decides where to forward the request.

Suppose that a local NCBI client runs a web browser, points it to an NCBI web page, and initiates a DB request via the web interface. At this stage mod_caf analyzes the request line and decides where to pass the request. The request is passed to the ServiceProviderN host, which performs the corresponding database query. The query results are then delivered to the client. The data exchange path is shown in the figure above using solid lines.

Another typical scenario for local NCBI clients is client code running on a user workstation. That client code might require a long-term connection to a certain service, for example to a database. A browser cannot provide this kind of connection, so a direct connection is used instead. The data exchange path is shown in the figure above using dashed lines.

The communication scenarios become more complicated when clients are located outside the NCBI network. The figure below shows the interactions between modules when the user requests a service that does not require a long-term connection.

Figure 6. Internet Clients. Short Term Connection

The clients cannot connect to the front-end Apache web servers directly. The connection is made via a router located in the DMZ (Demilitarized Zone). The router selects one of the available front-end servers and passes the request to that web server. The web server then processes the request much as it processes requests from a local client.

The next figure explains the interactions when an Internet client requests a service that requires a long-term connection.

Figure 7. Internet Clients. Long Term Connection

Unlike local clients, Internet clients cannot connect to the required service directly because of the DMZ. This is where DISPD, FWDaemon, and a proxy help resolve the problem.

The data flow in this scenario is as follows. A request from the client reaches a front-end Apache server as discussed above. The front-end server passes the request to the DISPD dispatcher. The DISPD dispatcher communicates with FWDaemon (Firewall Daemon) to provide the required service facilities. The FWDaemon answers with a special ticket for the requested service. The ticket is sent to the client via the front-end web server and the router. The client then connects to the NAT service in the DMZ, providing the received ticket. The NAT service establishes a connection to the FWDaemon and passes on the ticket it received. The FWDaemon, in turn, provides the connection to the required service. Note that the FWDaemon runs on the same host as the DISPD dispatcher, and neither DISPD nor FWDaemon can work without the other.

The most complicated scenario arises when an arbitrary Unix filter program is used as a service provided to users outside NCBI. The figure below shows all the components involved in the scenario.

Figure 8. NCBID at Work

The data flow in this scenario is as follows. A request from the client reaches a front-end Apache server as discussed above. The front-end server passes the request to the DISPD dispatcher. The DISPD communicates with both the FWDaemon and the NCBID utility on (possibly) another host and requests that the Unix filter program (Service X in the figure) be daemonized. The daemonized service starts listening on a certain port for a network connection. The connection attributes are delivered to the FWDaemon and to the client via the web front end and the router. The client connects to the NAT service, and the NAT service passes the request on to the FWDaemon. The FWDaemon passes the request to the daemonized Service X on the Service Provider K host. From that moment the client can exchange data with the service. This scenario is intended for tasks that rely on long-term connections.

Further sections describe all the components in more detail.

Load Balancing Service Mapping Daemon (LBSMD)

Overview

As mentioned earlier, the LBSMD daemon runs on almost every host that carries either public or private servers which, in turn, implement NCBI services. The services include CGI programs or standalone servers to access NCBI data.

Each service has a unique name assigned to it; “TaxService” is an example of such a name. The name not only identifies a service but also implies the protocol used for data exchange with that service. For example, any client that connects to the “TaxService” service knows how to communicate with that service regardless of how the service is implemented. In other words, the service could be implemented as a standalone server on host X and as a CGI program on the same host or on another host Y (note, however, that there are exceptions, and for some service types it is forbidden to have more than one service type on the same host).

A host can advertise many services. For example, one service (such as “Entrez2”) can operate with binary data only while another one (such as “Entrez2Text”) can operate with text data only. The distinction between those two services could be made by using a content type specifier in the LBSMD daemon configuration file.

The main purpose of the LBSMD daemon is to maintain a table of all services available at NCBI at the moment. In addition the LBSMD daemon keeps track of servers that are found to be dysfunctional (dead servers). The daemon is also responsible for propagating trouble reports, obtained from applications. The application trouble reports are based on their experience with advertised servers (e.g., an advertised server is not technically marked dead but generates some sort of garbage). Further in this document, the latter kind of feedback is called a penalty.

The principle of load balancing is simple: each server which implements a service is assigned a (calculated) rate. The higher the rate, the better the chance for that server to be chosen when a request for the service comes up. Note that load balancing is thus almost never deterministic.
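
A rate-weighted random choice is one simple way to picture this principle. The Python sketch below assumes that the selection probability is directly proportional to the rate, which is a simplification made for illustration and not the documented LBSMD algorithm; the server names and rates are hypothetical.

```python
import random

def pick_server(rates, rng=random):
    """Choose a server at random, with probability proportional to its rate.

    rates: mapping of server name to a positive rate.
    """
    servers = list(rates)
    weights = [rates[s] for s in servers]
    return rng.choices(servers, weights=weights, k=1)[0]

# Hypothetical servers and rates, for illustration only.
rates = {"hostA:5001": 900.0, "hostB:5001": 100.0}
rng = random.Random(42)
counts = {name: 0 for name in rates}
for _ in range(10_000):
    counts[pick_server(rates, rng)] += 1
print(counts)
```

Over many requests the higher-rated server receives most of the traffic, yet any single request may still go to the lower-rated one, which is why the outcome is not deterministic.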

The LBSMD daemon calculates two parameters for the host on which it is running. The parameters are a normal host status and a BLAST host status (based on the instant load of the system). These parameters are then used to calculate the rate of all (non static) servers on the host. The rates of all other hosts are not calculated but received and stored in the LBSMD table.

The LBSMD daemon can be restarted from a crontab every few minutes on all the production hosts to ensure that the daemon is always running. This technique is safe because no more than one instance of the daemon is permitted on a given host, and any attempt to start more than one is ignored. Normally, though, a running daemon instance is kept alive by monitoring software, such as “puppet” or “monit”, which makes the crontab approach unnecessary.

The main loop of the LBSMD daemon:

periodically checks the configuration file and reloads the configuration when necessary;
checks for and processes incoming messages from neighbor LBSMD daemons running on other hosts; and
generates and broadcasts the messages to the other hosts about the load of the system and configured services.

The LBSMD daemon can also periodically check whether the configured servers are alive: either by trying to establish a connection to them (and then disconnecting immediately, without sending/receiving any data) and / or by using a special plugin script that can do more intelligent, thorough, and server-specific diagnostics, and report the result back to LBSMD via an exit code.
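
The connection-based check can be approximated by the Python sketch below: it opens a TCP connection and closes it immediately without exchanging data. It is only an illustration of the idea, not LBSMD code; the function name and the timeout value are assumptions.

```python
import socket

def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection can be established.

    The connection is closed immediately; no data is sent or received.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A check of this kind only shows that something is listening on the port, which is why a server-specific plugin script can give a more thorough diagnosis.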

Lastly, LBSMD can pull port load information as posted by the running servers. This is done via a simple API (see https://intranet.ncbi.nlm.nih.gov/ieb/ToolBox/CPP_DOC/lxr/source/include/connect/daemons/lbsmdapi.h). The information is then used to calculate the final server rates at run time.

Although clients can redirect services, LBSMD does not distinguish between direct and redirected services.

Configuration

The LBSMD daemon is configured via command line options and via a configuration file. The full list of command line options can be retrieved by issuing the following command:

/opt/machine/lbsm/sbin/lbsmd --help

The local NCBI users can also visit the following link:

https://intranet.ncbi.nlm.nih.gov/ieb/ToolBox/NETWORK/lbsmd.cgi

The default name of the LBSMD daemon configuration file is /etc/lbsmd/servrc.cfg. Each line can be one of the following:

an include directive
site / zone designation
host authority information
a monitored port designation
a part of the host environment
a service definition
an empty line (entirely blank or containing a comment only)

Empty lines are ignored in the file. Any single configuration line can be split into several physical lines by inserting backslash symbols (\) before the line breaks. A comment is introduced by the pound/hash symbol (#).
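
The line-level rules (continuation, comments, empty lines) can be sketched in Python as follows. This is not the LBSMD parser: it assumes that a pound/hash symbol always starts a comment and that a backslash continues a line only when it is the very last character. The sample words are placeholders, not real configuration directives.

```python
def logical_lines(text):
    """Join backslash-continued lines, strip comments, drop empty lines."""
    joined = []
    pending = ""
    for physical in text.splitlines():
        if physical.endswith("\\"):
            pending += physical[:-1]  # continuation: keep accumulating
        else:
            joined.append(pending + physical)
            pending = ""
    if pending:
        joined.append(pending)  # trailing continuation without a next line
    # Assumption: '#' always introduces a comment that runs to end of line.
    stripped = (line.split("#", 1)[0].strip() for line in joined)
    return [line for line in stripped if line]

sample = "# comment only\n\nalpha \\\nbeta\ngamma  # trailing comment\n"
print(logical_lines(sample))
```

Here the two physical lines "alpha \" and "beta" form a single logical line, while the comment-only and blank lines are dropped.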