formatting.txt

An effort has been made to render most of the datasets and knowledge in this
repository easy to import automtically, with consistent formatting and naming
conventions, as follows:

1. All tabular data files are tab-delimited, with the names of the variables in
the first row, and with one row per case after that. Names of tabular files will
end with ".txt". Columns in tabular files may be either continuous, discrete, or
mixed. Continuous data will have the word "continuous" in the filename, thus:

mydata.continuous.txt

Discrete data will have the word "discrete" in the filename, thus:

mydata.discrete.txt

For mixed data, it will be assumed that columns with less than or equal to a
specified maximum number of distinct values c will be discrete, other columns
being continuous. This maximum number is specified in the filename, thus, for
example:

mydata.mixed.maximum.3.txt

for c = 3. In some cases, this advice is insufficient; for these, a JSON file is
available to use for importing the data, in a format to be specified.

2. Covariance matrices are rendered in lower triangle, with the sample size in
the first row, the list of variables in the second row, tab-delimited, and the
lower triangle of the covariance matrix to follow, in tab-delimited rows of
increasing length up to the number of variables. Names of covariance matrix
files will end with ".cov.txt", thus:

mydata.cov.txt

3. Missing values are marked uniformly with an asterisk, thus: *.

4. All ground truth knowledge for specific forbidden and required edges will be
represented in the following format, for example:

/knowledge
addtemporal

1*  x y z
2  w r

forbiddirect
w z

requiredirect
a b
x b

...

Here, edges from later tiers to earlier tiers are forbidden; additional directed
edges may be specified in the "forbiddirect" and "requiredirect" sections. For
example, "alcohol density" in the forbiddirect section means that the edge
alcohol-->density is forbidden; similar comment applies to the specifications in
the requiredirect section. If a tier number is followed by an asterisk, this
means that edges within that tier are all forbidden. Ground truth files
specified in this way will have the suffix ".ground.truth.txt", thus:

mydata.knowledge.txt

5. In some cases, a specific ground truth DAG is known, as in simulation. In
such cases, a graph may be specified as follows, for example:

Graph Nodes:
raf;mek;plc;pip2;pip3;erk;akt;pka;pkc;p38;jnk

Graph Edges:
1. erk --> akt
2. mek --> erk
3. pip2 --> pkc
4. pip3 --> akt
5. pip3 --> pip2
6. pip3 --> plc
...

Here, the section "Graph Nodes:" is followed by a comma or semicolon separated
list of variable names, and the "Graph Edges" section is followed by an
optionally numbered list of edges in form X --> Y (for a DAG). Such ground truth
graphs will be in files with the suffix ".ground.truth.graph.txt", thus:

mydata.ground.truth.graph.txt

6. A certain directory structure has been used which seems fortuitous. First,
the highest level directories have been sorted into 'real' and 'simulated';
'real' contains all real datasets; 'simulated' all simulated datasets. One
important difference is that for simulated data, ground truth is known for
certain, whereas for real data, there is some rationale needed to infer ground
truth, either time order, or expertise, or (probably the best) basis in actual
experimental results.

Inside each directory, example datasets are presented in each subdirectory, with
subdirectories of those being 'data', 'ground.truth', and 'images'. Ground truth
is not always known, so this directory may be empty. If a data set

mydata.xxx.txt

is given in the data directory, and

mydata.knowledge.txt

or

mydata.ground.truth.graph.txt

is given, these are ground truth files for that dataset. If ground truth is
given in some other format, it will have a different filename from these.

The 'images' dataset contains, for instance, plot matrices of the data or other
images.

Also, for each example, a 'readme.txt' file is given with a reference back to
the original source of the data, and possibly, where hard to easily retrieve
from that reference, information about the variables in the dataset. As this
repository is not meant to be a replacement for the original data sources but
rather a consistent formatting of the data from those original data sources,
extensive notes are not usually included here, and the user is encouraged to
explore the original sources.