How to prepare data files

From SpatoWiki
Revision as of 10:04, 20 June 2011 by Christian (talk | contribs) (Example data files)

SPaTo documents are mainly a collection of XML files, but can be supplemented by binary data to improve loading time for large networks. All files are stored in a container, which is either a directory or a compressed (ZIP) archive (on Mac OS X, uncompressed documents will still appear as a single file, but the directory contents can be viewed by right-clicking the file and selecting “Show Package Contents”). Every SPaTo document can be compressed or uncompressed from within the application (right-click on the document switcher and select “Save compressed” or “Save uncompressed”).

Example data files

Check out the full list of examples to see all example data files. In this guide, however, we will use excerpts from the Everglades food web:

  • Everglades_xml.zip is an uncompressed SPaTo document using plain XML files to store all data
  • Everglades_blobs.spato is a compressed SPaTo document that stores some of the data in binary format (to view the contents of this document, first load it in SPaTo, right-click the network selector in the upper left corner and select “Save uncompressed”)

Exporting networks from MATLAB

If you are using MATLAB, you might be able to use or adapt our save_spato m-file to write SPaTo documents.

Writing all data into text (XML) files

The easiest way to create a new document is to create a directory with a name that ends in .spato, e.g., My_Network.spato. Then create a text file called document.xml within that directory. Here is an example:

<?xml version="1.0" ?>
<document>
  <title>Everglades Food Web</title>
  <description>
    From the Pajek website: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/foodweb.htm
    Originally described by Ulanowicz, R.E., J.J. Heymans, and M.S. Egnotovich. (2000)
    Reduced to largest connected component
  </description>
  <nodes src="nodes.xml" />
  <links src="links.xml" />
  <slices name="SPT" src="spt.xml" />
  <dataset src="nodeprops.xml" selected="true" />
  <dataset src="dist.xml" />
</document>

Each document has to have exactly one nodes tag and may have a links tag in which the weight matrix of the network is defined. The slices tag defines the shortest-path tree (or any other tree) for each node. If no slices element is found in the document, then SPaTo will automatically compute shortest-path trees using Dijkstra's algorithm. The dataset tags define collections of node properties which can be used to color the nodes. As you can guess, the actual content is stored in additional XML files (nodes.xml, links.xml, etc.), but this is optional.
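
As a rough illustration of the layout described above, the following Python sketch generates a minimal document.xml and writes it into a new .spato directory. The helper names build_document_xml and write_document are made up for this example and are not part of SPaTo; the referenced file names simply mirror the Everglades example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_document_xml(title, description):
    """Return the text of a document.xml that points to external XML files."""
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "description").text = description
    ET.SubElement(doc, "nodes", {"src": "nodes.xml"})  # exactly one nodes tag
    ET.SubElement(doc, "links", {"src": "links.xml"})  # optional
    ET.SubElement(doc, "slices", {"name": "SPT", "src": "spt.xml"})  # optional
    ET.SubElement(doc, "dataset", {"src": "nodeprops.xml", "selected": "true"})
    return '<?xml version="1.0" ?>\n' + ET.tostring(doc, encoding="unicode") + "\n"


def write_document(directory, xml_text):
    """Create the document directory (hypothetical helper) and write document.xml."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    target = path / "document.xml"
    target.write_text(xml_text, encoding="utf-8")
    return target
```

A call such as write_document("My_Network.spato", build_document_xml("My Network", "Test data")) would create the directory and the document.xml file; the files named in the src attributes still have to be created separately.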

In the following, we will go through the individual tags and what their content should look like.

Nodes

Here are the contents of nodes.xml from the Everglades food web example:

<?xml version="1.0" ?>
<nodes>
  <projection name="LonLat" />
  <node id="Living Sediments" name="Living Sediments" location="-5.95118,-6.82493" strength="1.72566" />
  <node id="Living POC" name="Living POC" location="-4.33367,-8.01817" strength="0.105276" />
  <node id="Periphyton" name="Periphyton" location="4.39522,-2.45945" strength="5.54369" />
  <!-- ... many nodes omitted for brevity ... -->
  <node id="Passerines" name="Passerines" location="-1.42228,8.97679" strength="0.00031404" />
</nodes>

Each node is defined by one node tag that usually has four attributes:

  • name is the name of the node, which is sometimes displayed in the bottom right corner of the application window and can be searched using the search input field
  • id is a string that will be displayed as the “label” next to the node in the network visualization and can also be searched using the search input field
  • location is a comma-separated pair of coordinates that defines the node position in the map view
  • strength is a float value that should represent the node strength (node flux), which is currently used to show only the labels of the strongest nodes in the network

The order of the node tags is significant and determines the node index by which it can be referred to in the links and slices tags.

There is one more tag named projection that is used to determine how to project the coordinates in the map view. For historical reasons, the cartesian projection is called “LonLat”. Other currently recognized values are “Albers” (projection parameters will be automatically determined) and “LonLat Roll” (which is a cartesian, or equirectangular, projection with periodic boundary conditions).
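
If the node list already exists in a program, a file like the one above could be generated along these lines. This is only a sketch: the function name build_nodes_xml and the dictionary keys (id, name, x, y, strength) are assumptions chosen for the example, while the tag and attribute names are those described above.

```python
import xml.etree.ElementTree as ET


def build_nodes_xml(nodes, projection="LonLat"):
    """Return nodes.xml text; the order of `nodes` determines the node index."""
    root = ET.Element("nodes")
    ET.SubElement(root, "projection", {"name": projection})
    for node in nodes:
        ET.SubElement(root, "node", {
            "id": node["id"],      # label shown next to the node
            "name": node["name"],  # full name of the node
            "location": f"{node['x']},{node['y']}",
            "strength": str(node["strength"]),
        })
    return '<?xml version="1.0" ?>\n' + ET.tostring(root, encoding="unicode") + "\n"


# Two made-up nodes, purely for illustration.
example = build_nodes_xml([
    {"id": "A", "name": "Node A", "x": 1.5, "y": -2.0, "strength": 0.25},
    {"id": "B", "name": "Node B", "x": 0.0, "y": 3.0, "strength": 4.0},
])
```

Because the position in the list becomes the node index, the same ordering has to be used when writing links, slices, and datasets.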

Links

Again, starting with links.xml from the Everglades food web example:

<?xml version="1.0" ?>
<links inverse="true">
  <source index="1">
    <target index="7" weight="0.0167681" />
    <target index="8" weight="0.0150874" />
    <target index="9" weight="0.0271651" />
    <!-- ... omitted ... -->
    <target index="41" weight="3.7967e-05" />
  </source>
  <source index="2">
    <target index="8" weight="0.000810104" />
    <!-- ... omitted ... -->
    <target index="41" weight="1.81713e-07" />
  </source>
  <!-- ... omitted ... -->
  <source index="63">
    <target index="4" weight="3.59066e-05" />
    <!-- ... omitted ... -->
    <target index="63" weight="1.3815e-09" />
  </source>
</links>

For each node defined in the nodes tag, there should be one source tag. Its index attribute refers to the index-th node (index numbers start at 1). Within each source tag, there is one target tag for each link that originates from the source node; it specifies the index of the linked node and the weight of that link.

The boolean attribute inverse is used by the built-in shortest-path algorithm if no trees are explicitly given in the document (see below). If true, the inverse link weight will be used to calculate the path length.
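
The structure of links.xml can be produced from a simple weight table, as in the following sketch. The function names and the dictionary layout are assumptions made for this example. The helper link_length reflects one plausible reading of the inverse attribute (length = 1/weight); how SPaTo computes path lengths internally is not confirmed here.

```python
import xml.etree.ElementTree as ET


def build_links_xml(num_nodes, weights, inverse=True):
    """Return links.xml text.

    `weights` maps (source, target) pairs of 1-based node indices to link weights.
    """
    by_source = {}
    for (src, tgt), w in weights.items():
        by_source.setdefault(src, []).append((tgt, w))
    root = ET.Element("links", {"inverse": "true" if inverse else "false"})
    for s in range(1, num_nodes + 1):  # one source tag per node
        source = ET.SubElement(root, "source", {"index": str(s)})
        for tgt, w in sorted(by_source.get(s, [])):
            ET.SubElement(source, "target", {"index": str(tgt), "weight": repr(w)})
    return '<?xml version="1.0" ?>\n' + ET.tostring(root, encoding="unicode") + "\n"


def link_length(weight, inverse):
    """Assumed interpretation: with inverse=true a link contributes 1/weight."""
    return 1.0 / weight if inverse else weight
```

Under this reading, strong links are short: a link of weight 0.5 would contribute a length of 2.0 when inverse is true.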

Slices (Trees)

This is spt.xml from the Everglades food web example:

<?xml version="1.0" ?>
<slices name="SPT">
    <slice root="1">0 21 10 8 10 46 ... 3 7 13 6 4 10</slice>
    <slice root="2">21 0 21 8 10 46 ... 3 7 13 6 4 10</slice>
    <!-- ... omitted ... -->
    <slice root="63">10 21 10 8 10 46 ... 3 7 13 6 4 0</slice>
</slices>

For each node in the network, one tree has to be defined using a slice tag, with the root attribute specifying the (1-based) index of the node. The content of the tag is a space-separated list of N integers (where N is the number of nodes). The n-th integer in that list states the index of the parent of node n in the tree. The index 0 is used for the root node (which, by definition, has no parent) and for any node that is disconnected from the root node.
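
To make the parent-list encoding concrete, here is a small sketch that parses the content of one slice tag and follows the parents from a node back towards the root. The function names and the four-node tree are invented for this example.

```python
def parse_slice(text):
    """Turn the content of a slice tag into a list of 1-based parent indices."""
    return [int(token) for token in text.split()]


def path_to_root(parents, node):
    """Follow parent links from `node` (1-based) until a node with parent 0 is reached."""
    path = [node]
    while parents[path[-1] - 1] != 0:
        path.append(parents[path[-1] - 1])
        if len(path) > len(parents):
            raise ValueError("parent list contains a cycle")
    return path


# Hypothetical four-node tree rooted at node 1: nodes 2 and 3 hang off 1, node 4 hangs off 3.
parents = parse_slice("0 1 1 3")
print(path_to_root(parents, 4))  # [4, 3, 1]
```

Since 0 marks both the root and disconnected nodes, the last entry of the returned path should be compared with the root attribute of the slice to tell the two cases apart.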

Datasets

Each dataset can contain multiple node properties (“quantities”), each defined in a data tag. Each data tag must have a values tag, the contents of which are a space-separated list of float values specifying the value of this quantity for each node, in the order in which the nodes are defined inside the nodes tag.

An optional colormap tag indicates which colormap to use when coloring nodes according to this quantity, with the (optional) name attribute giving the name of the colormap and the boolean log attribute indicating whether to use a log-scale when mapping values to colors. You can also specify the limits of the colormap using minval and maxval attributes. Values outside these limits will be mapped to the color associated with the lower (or higher) limit.
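
The effect of minval, maxval and log can be pictured with the following sketch, which maps a value to a position between 0 and 1 along a colormap. This is an illustration of the clamping behavior described above, not SPaTo's actual implementation; the function name colormap_position and the exact form of the logarithmic scaling are assumptions.

```python
import math


def colormap_position(value, minval, maxval, log=False):
    """Map `value` to a position in [0, 1]; values outside the limits are clamped."""
    clamped = min(max(value, minval), maxval)
    if log:
        # Logarithmic scaling (assumed form); requires positive limits.
        return math.log(clamped / minval) / math.log(maxval / minval)
    return (clamped - minval) / (maxval - minval)
```

With limits 0 and 10 on a linear scale, the value 5 lands in the middle of the colormap, while -3 and 50 are mapped to the two ends.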

In the Everglades food web example, nodeprops.xml defines some general node properties:

<?xml version="1.0" ?>
<dataset name="Node Properties">
  <data name="Node Degree">
    <colormap log="true" />
    <values>22 21 17 18 9 11 ... 29 8 5 31 9 37</values>
  </data>
  <data name="Node Strength" selected="true">
    <colormap log="true" />
    <values>1.7257 0.10528 5.5437 ... 0.00025997 2.4328e-05 0.00031404</values>
  </data>
  <!-- ... omitted ... -->
</dataset>

All the quantities in the example above are static properties of each node. It is also possible to define quantities in which the value associated with a node also depends on the currently selected root node, i.e., properties that depend on pairs of nodes. The shortest-path distance is such a property; here are the contents of dist.xml from the Everglades food web:

<?xml version="1.0" ?>
<dataset name="Distance Measures">
  <data name="SPD" distmat="true">
    <colormap log="true" minval="0.459869" maxval="2.09185e+08" />
    <values root="1">0 11.4585 3.40721 ... 8732.1 82394.8 14601.1</values>
    <values root="2">11.4585 0 12.5688 ... 8741.26 82403.9 14611.1</values>
    <!-- ... omitted ... -->
    <values root="63">14601.1 14611.1 14598.6 ... 23327.3 96989.9 0</values>
  </data>
</dataset>

All of these two-dimensional quantities can be used as the radial distance measure in the tomogram view and will be listed in the distance matrix selector in the upper right corner of the application window.

Both dataset and data tags can have name attributes, which will be used in the graphical user interface. Boolean selected attributes mark the dataset and the quantity that are used to color the nodes when the document is loaded. Similarly, the boolean distmat attribute marks the quantity used as the radial distance in the tomogram view.
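
A dataset file for a root-dependent quantity such as dist.xml could be generated as sketched below. The function name build_distance_dataset and the nested-list input are assumptions for this example; the tags and attributes are the ones described in this section.

```python
import xml.etree.ElementTree as ET


def build_distance_dataset(dataset_name, data_name, matrix, log=True):
    """Return dataset XML text for a root-dependent quantity.

    `matrix[r][n]` is the value of node n+1 when node r+1 is the selected root.
    """
    dataset = ET.Element("dataset", {"name": dataset_name})
    data = ET.SubElement(dataset, "data", {"name": data_name, "distmat": "true"})
    ET.SubElement(data, "colormap", {"log": "true" if log else "false"})
    for r, row in enumerate(matrix, start=1):  # root indices are 1-based in XML
        values = ET.SubElement(data, "values", {"root": str(r)})
        values.text = " ".join(repr(float(v)) for v in row)
    return '<?xml version="1.0" ?>\n' + ET.tostring(dataset, encoding="unicode") + "\n"
```

For a static quantity, a single values tag without a root attribute takes the place of the per-root tags.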

Providing binary data

Binary data is usually more compact and can be loaded much faster by the application, which is especially noticeable in the case of large networks. Any element of the document, except the node definitions, can be supplied as binary data. However, it is most useful for two-dimensional data, i.e., 2-D node property data, the slices/tree definitions, and the link weight matrix.

Contrary to the convention in the XML files, node indices are assumed to start at zero when handling binary data.

SPaTo documents with binary data still have to provide a document.xml file. Here is the Everglades example again:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <title>Everglades Food Web</title>
  <description>
    From the Pajek website: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/foodweb.htm
    Originally described by Ulanowicz, R.E., J.J. Heymans, and M.S. Egnotovich. (2000)
    Reduced to largest connected component
  </description>
  <nodes src="nodes.xml" />
  <links blob="links" />
  <slices name="SPT" blob="spt" />
  <dataset src="nodeprops.xml" />
  <dataset src="dist.xml" />
</document>

Here, the src attributes of the links and slices tags have been replaced with blob attributes. Similarly in dist.xml:

<?xml version="1.0" encoding="UTF-8"?>
<dataset name="Distance Measures">
  <data name="SPD" blob="dist_spd" distmat="true">
    <colormap log="true" minval="0.459869" maxval="2.09185e+08"/>
  </data>
</dataset>

The values tags are omitted and replaced by a blob attribute. The value of this attribute is the name of the file that contains the binary data; all binary data files have to be in a subdirectory called blobs/.

Binary data for a slices tag must be an N-by-N integer matrix (with N being the number of nodes in the network), in which the n-th column is the predecessor vector of the n-th node (i.e., the content of <slice root="{n+1}">...</slice>).

Blob files for data tags must contain either a vector of length N or an N-by-N matrix of floating point numbers. In the case of two-dimensional data, the n-th column provides the node coloring values if the n-th node is the currently selected root node.

Weight matrices (in the links tag) are specified using binary sparse matrices, as described below.

Binary data format

All data is stored in Big-Endian format (most significant byte first) and all integers are signed. Each binary file starts with a header that specifies type and size of the data. The first four bytes are a 32-bit integer specifying the data type:

  • 0 – array of integers (32-bit)
  • 1 – array of floating point numbers (32-bit)
  • 2 – square sparse matrix (32-bit float)

If the data is a sparse matrix, the next four bytes are a 32-bit integer specifying the size of the matrix (i.e., the number of nodes in the network if the matrix is a weight matrix). Otherwise, a list of 32-bit integers specifies the size of each dimension of the array, terminated by the value –1 (i.e., a vector will have one integer stating its length, followed by –1, while a matrix will have an integer stating the number of rows, followed by the number of columns, followed by –1).

The actual data follows after the header. Arrays are stored by filling the last dimension first, i.e., for an N-by-M matrix, the first M values are those of the first column, the next M values are those of the second column, and so on.
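
The header and data layout for dense arrays can be sketched with Python's struct module as follows. The function names are invented for this example, and it assumes that each per-root vector (what the text calls a column) is written as one contiguous run of values.

```python
import struct

TYPE_INT = 0    # array of 32-bit signed integers
TYPE_FLOAT = 1  # array of 32-bit floats


def pack_array(type_code, dims, flat_values):
    """Return header plus data for a dense array, all big-endian.

    `flat_values` must already be in storage order (last dimension varying fastest).
    """
    count = 1
    for d in dims:
        count *= d
    if count != len(flat_values):
        raise ValueError("dims do not match the number of values")
    # Header: type code, one size per dimension, then the terminator -1.
    header = struct.pack(f">{len(dims) + 2}i", type_code, *dims, -1)
    fmt = "i" if type_code == TYPE_INT else "f"
    return header + struct.pack(f">{count}{fmt}", *flat_values)


def pack_float_matrix(vectors):
    """Store each inner list contiguously, one after the other (assumed layout)."""
    flat = [v for vector in vectors for v in vector]
    return pack_array(TYPE_FLOAT, [len(vectors), len(vectors[0])], flat)
```

The resulting bytes would then be written in binary mode to a file in the blobs/ subdirectory whose name matches the corresponding blob attribute, e.g., blobs/dist_spd.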

Sparse matrices are encoded as a list of N blocks where the n-th block states all links originating from node n. Each block begins with a 32-bit integer stating the number of links in that block, followed by a list of 32-bit integers stating the node index of the target node of each link, followed by a list of 32-bit floating point values stating the weight of each link. Note that node indices are assumed to start at zero in binary data.
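
The block structure of a sparse matrix file can be sketched in the same way. The names pack_sparse and unpack_sparse and the dictionary-based input are assumptions for this example; the byte layout follows the description above.

```python
import struct

TYPE_SPARSE = 2  # square sparse matrix of 32-bit floats


def pack_sparse(num_nodes, links):
    """`links[n]` is a list of (target, weight) pairs for node n; indices are zero-based."""
    chunks = [struct.pack(">ii", TYPE_SPARSE, num_nodes)]
    for n in range(num_nodes):
        block = links.get(n, [])
        chunks.append(struct.pack(">i", len(block)))                           # number of links
        chunks.append(struct.pack(f">{len(block)}i", *(t for t, _ in block)))  # target indices
        chunks.append(struct.pack(f">{len(block)}f", *(w for _, w in block)))  # link weights
    return b"".join(chunks)


def unpack_sparse(blob):
    """Inverse of pack_sparse; returns (num_nodes, links)."""
    type_code, num_nodes = struct.unpack_from(">ii", blob, 0)
    if type_code != TYPE_SPARSE:
        raise ValueError("not a sparse matrix blob")
    offset, links = 8, {}
    for n in range(num_nodes):
        (count,) = struct.unpack_from(">i", blob, offset)
        offset += 4
        targets = struct.unpack_from(f">{count}i", blob, offset)
        offset += 4 * count
        weights = struct.unpack_from(f">{count}f", blob, offset)
        offset += 4 * count
        links[n] = list(zip(targets, weights))
    return num_nodes, links
```

Reading the file back with unpack_sparse is a convenient way to check that the block sizes and the byte order are as intended before loading the document in SPaTo.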