Table of Contents
- 1. Introduction
- 2. File parsing
- 3. Tools for table handling
- 3.1. pm_append_column
- 3.2. pm_append_table
- 3.3. pm_apply_to_column
- 3.4. pm_combine_columns
- 3.5. pm_conv2dat_lines
- 3.6. pm_enum_rows
- 3.7. pm_histogram
- 3.8. pm_interval_histogram
- 3.9. pm_print_firstcolval
- 3.10. pm_rename_column
- 3.11. pm_select_columns
- 3.12. pm_select_rows
- 3.13. pm_sort
- 3.14. pm_sql_join_tables
- 3.15. pm_sql_select
- 3.16. pm_stats_for_column
- 3.17. pm_export_r
The pm_tools can be used to parse arbitrary files and create data sets for values described by regular expressions. The main purpose is to parse output from scientific programs and transform it into a table for further processing.
The pm_tools toolkit is a collection of Python and Perl scripts, so a working Python and Perl installation is required. Additionally, a shell with pipe support is highly recommended since the tools are designed to read from stdin and write the transformed data to stdout. You should be familiar with pipes and file redirections.
The first phase is to bring the data from an arbitrary log format into a tabular format for further processing. The pm_parse_ext tool reads the log file from stdin and uses a template description to parse it and create tabular entries.
The data is transformed by using a template description of the data. The template consists of
- State descriptions,
- Column names,
- State order,
- (optional) List of states without newline.
Let’s start with an example. Consider the following data file:
x=1.5
y=3
y=4
y=6
info:finished
x=3.6
y=2
info:aborted
Each data set consists of an initial x value, several y values and a final info output. Such a set can occur an arbitrary number of times (twice in this example). A valid description for the x line is:
1:x=([\d\.]+)
The x= matches the literal text, while the parentheses mark the information we want to gather.
Accordingly, the y values can be parsed by:
2:y=([\d\.]+)
and the final info line can be parsed by:
3:info:(\w+)
Now we can assign names to the groups in the expressions:
1,1:x
2,1:y
3,1:information
Finally, after the states are described, we can give the order of these states:
states:1 2+ 3
which means that after a single occurrence of state 1 there can be one or more occurrences of state 2 and a final occurrence of state 3.
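Putting these pieces together, a plausible complete template file for this example (used as mytemplate below, assuming one template entry per line) looks like this:

1:x=([\d\.]+)
2:y=([\d\.]+)
3:info:(\w+)
1,1:x
2,1:y
3,1:information
states:1 2+ 3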
When using the pm_parse_ext tool with this template:
cat datafile | pm_parse_ext -t mytemplate
the tool outputs the following table (in ASCII format of course):
x   | y | information
----|---|-------------
1.5 | 3 | finished
1.5 | 4 | finished
1.5 | 6 | finished
3.6 | 2 | aborted
Each data element in the data file needs to be described by a regular expression. A state describes a line of data. Later on, you can specify in which order these states can appear in the data file.
The format of such a state is as follows:
<State number>:<regular expression>
You can use as many states as you need to describe every important data line.
Usually you are only interested in part of a line, for example some numbers. Therefore, you can mark groups in the regular expressions which will be put into the final table. Groups are marked by enclosing parentheses. So if you have a data line like
var1=5.4
and you are interested in the value, you can use the following description:
1:var1=([\d\.]+)
This is not a strictly correct expression for floating point numbers, but the program output is usually well defined, so it works in practice.
The groups defined in the regular expressions need to be named so the resulting data in the table can be identified. After the state description a list of column names follows. For each state you can name each group of the regular expression by using the following format:
<state number>,<group number>:<column name>
The group number starts at 1 and groups are counted from left to right. In the previous example it could be
1,1:var_value
After describing the states and giving the columns reasonable names, it is time to describe the order of the states. This is just a list of states with some modifiers similar to regular expressions. The line begins with
states:
followed by a space separated list of states, each of which may be followed by ? for an optional occurrence of this state or + for a repeated occurrence. A description line like
states:1 2? 3+ 4
means that after a single occurrence of state 1 (i.e. a data line matching the corresponding regular expression for state 1), state 2 is optional: it may occur in the input but does not have to. Then, one or more lines described by state 3 must occur, and at the end state 4 must match a data line. Lines not matching any state are completely ignored.
After all states have been matched, another round is started, beginning with the first state, to represent a new data set.
Although it is possible to use several + modifiers, the result is usually not what you expect. The parser creates the table by building the cross product of the matched lines of all states. States that occur only once contribute a single line, but repeated states contribute several lines, and for each of those lines the columns from the other states are added to the table. If a second repeated state is used, each of its lines is duplicated for every line of the other repeated state, which is usually not the intended result, but the parser cannot predict what the user wants in this case.
Sometimes it is necessary to use different states for data on the same line. For this, it is possible to declare some states as "no newline" states: after such a state has matched, the same input line is parsed again with the following state. These states are given by:
nnlstates:<state number> <state number> ...
which is just a space separated list of states.
As an example, reconsider the data from above. If the data file looks like the following:
x=2;y=3
y=4
y=5
info:finished
x=3;y=6
y=2
info:finished
the template file from above can’t be used. By declaring state 1 as a no newline state it will work again:
nnlstates:1
The description of state 2 also needs to be updated since the y value no longer starts at the beginning of the line:
2:.*y=([\d\.]+)
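Assuming again one template entry per line, the complete template for this data file could look like this (only the state 2 expression and the added nnlstates line differ from the previous example):

1:x=([\d\.]+)
2:.*y=([\d\.]+)
3:info:(\w+)
1,1:x
2,1:y
3,1:information
states:1 2+ 3
nnlstates:1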
After writing the template file you can easily parse the data file by
piping it into the pm_parse_ext tool giving the template file as argument
-t
:
cat <datafile> | pm_parse_ext -t <templatefile>
The output is a data file which we call a "pmd" file. The preferred file extension is ".pmd" but that actually doesn’t matter as they are plain text files.
The output of the pm_parse_ext tool is a table which can be post-processed by any of the following tools. Just pipe the output of the pm_parse_ext tool (or any other tool from the pm_tools suite) into a new instance:
cat <datafile> |\
  pm_parse_ext -t <templatefile> |\
  pm_??? --some_option
Every tool provides a short help text describing its options. Use the help flag (-h) for more information.
The pm_append_column tool adds a new column filled with a given value.
Options:
- --name (Short: -n)
Name of the new column
REQUIRED
- --value (Short: -v)
Value for column
REQUIRED
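For example, to tag every row with the (hypothetical) machine name the data was measured on:

cat results.pmd | pm_append_column -n host -v node01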
The pm_append_table tool concatenates several tables that are piped together into it. All tables need to have identical columns, including the column order.
Options: None
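For example, assuming two hypothetical result files with identical columns, they can be concatenated like this:

cat results_run1.pmd results_run2.pmd | pm_append_table > results_all.pmd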
The pm_apply_to_column tool combines several lines into a single one by applying a mathematical operation to a given column. This can be the average, sum, minimum or maximum, optionally complemented by the standard deviation or coefficient of variation. The operation considers all data lines that are identical in all columns except the given one.
Options:
- --column (Short: -c)
Name of the column to modify
REQUIRED
- --operation (Short: -o)
Operation to apply (avg,sum,min,max)
REQUIRED
- --deviation (Short: -d)
Calculate specified deviation. Available options are:
- std: Standard deviation
- var: Coefficient of variation
OPTIONAL
- --deviation_colname
Column name for the deviation column. Defaults to "dev"
OPTIONAL
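For example, assuming a hypothetical column named time that was measured several times for otherwise identical parameter combinations, the repetitions can be averaged and a standard deviation column added like this:

cat <datafile> | pm_apply_to_column -c time -o avg -d std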
The pm_combine_columns tool combines several columns into a new column by using the values of all selected columns for each line. A given lambda function is called with these values to create the value for the new column. By default, the function has to accept the column names as parameters, but parameters can be renamed to get shorter names.
Options:
- --selection (Short: -s)
Selection of columns to combine.
It’s a comma separated list of column names. A column name can optionally be followed by a colon and a parameter name; the function will be called with this name as an argument.
REQUIRED
- Example: -s x,y,var_value:v
- In this selection the function has to accept 3 parameters x,y and v (var_value is renamed to v).
- --name (Short: -n)
Name of the new column
REQUIRED
- --function (Short: -f)
Function to apply
Example:
lambda x,y,v : float(x) * float(y) / float(v)
REQUIRED
- --include (Short: -i)
Python file to include
If the lambda function is not enough to implement the desired function it is possible to use a separate file with function definitions. If this option is used the function argument -f is just the name of the function to call (default name is combinefunc). The values of the selected column entries will be given to the arguments with the corresponding name or alias (just like the lambda function in the preceding example).
OPTIONAL
Example: Consider two columns "time" and "events", from which you want to compute the number of events per time unit.
cat <datafile> | pm_combine_columns -n events_per_time \
  -s time,events -f "lambda time,events:float(events)/float(time)"
or in a shorter form by using parameter renaming:
cat <datafile> | pm_combine_columns -n events_per_time \
  -s time:t,events:e -f "lambda t,e:float(e)/float(t)"
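If the lambda function gets too complex, the same computation could also be placed in an include file. A minimal sketch, assuming a hypothetical file myfuncs.py:

def combinefunc(t, e):
    # t and e are the (renamed) column values, passed as strings
    return float(e) / float(t)

cat <datafile> | pm_combine_columns -n events_per_time \
  -s time:t,events:e -i myfuncs.py -f combinefunc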
The pm_conv2dat_lines tool transforms the data file into a format recognized by the plotting tool "dat2eps".
Options:
- --x-axis (Short: -x)
Column to use for x axis values
REQUIRED
- --y-axis (Short: -y)
Column to use for y axis values
REQUIRED
- --groupby (Short: -g)
Output is grouped based on multiple values in given column
REQUIRED
- --group-order
Comma separated list of group values. There will only be a group for each given value, in the given order. The symbolic value REST may be used to append all groups not in this list at the end.
OPTIONAL
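For example, assuming hypothetical columns named size, time and host, a dat2eps input with one group per host value could be created like this:

cat <datafile> | pm_conv2dat_lines -x size -y time -g host > plot.dat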
The pm_enum_rows tool enumerates all rows by adding a new column with a running row number.
Options:
- --name (Short: -n)
New column name
REQUIRED
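For example, to add a new column named index with the row numbers:

cat <datafile> | pm_enum_rows -n index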
The pm_histogram tool counts identical lines in the data file. The new table contains only one entry for each distinct line, plus a new column with the number of occurrences of this line.
Options:
- --name (Short: -n)
Name of the histogram column
OPTIONAL
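For example, to count identical lines and store the counts in a column named count:

cat <datafile> | pm_histogram -n count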
The pm_interval_histogram tool works similarly to the histogram tool, but uses logarithmic ranges for counting the data lines that fall into these ranges. A column needs to be selected; all data lines that are identical in all columns except this one are counted towards the range matching the value of the selected column.
Options:
- --column (Short: -c)
Name of the column to check range
REQUIRED
- --name (Short: -n)
Name of the histogram column
OPTIONAL
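For example, assuming a hypothetical column named latency, the lines can be counted per logarithmic latency range like this:

cat <datafile> | pm_interval_histogram -c latency -n count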
The pm_print_firstcolval tool just prints the first value of a given column without any other information. The output can easily be used in scripts, for example.
Options:
- --column (Short: -c)
Name of the column
REQUIRED
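For example, a shell script could capture an average computed by pm_apply_to_column (assuming a hypothetical time column):

avg_time=$(cat <datafile> | pm_apply_to_column -c time -o avg | pm_print_firstcolval -c time)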
The pm_rename_column tool renames a given column.
Options:
- --column (Short: -c)
Name of the column to modify
REQUIRED
- --name (Short: -n)
New name for the selected column
OPTIONAL
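For example, to rename a hypothetical column time to runtime:

cat <datafile> | pm_rename_column -c time -n runtime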
The pm_select_columns tool selects a subset of all available columns. The only argument is a comma separated list of columns to include in the new table. The order matters, and it is also possible to duplicate a column by repeating its name.
Options:
- --selection (Short: -s)
Comma separated list of column names.
REQUIRED
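For example, to keep only the hypothetical columns time and x, in that order:

cat <datafile> | pm_select_columns -s time,x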
The pm_select_rows tool selects certain lines by comparing the value of a selected column to a given value. You need to specify a column and a value to compare all lines against; the compare operation can be chosen.
Options:
- --column (Short: -c)
Name of the column to compare values
REQUIRED
- --value (Short: -v)
Value to compare to
REQUIRED
- --operation (Short: -o)
Compare operation:
- lt: Lower than
- le: Lower equal
- eq: Equal (default)
- ge: Greater equal
- gt: Greater than
- ne: Not equal
OPTIONAL
- --string_compare (Short: -s)
Compare values as strings not as floats
OPTIONAL
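For example, assuming a hypothetical column named size, all lines with a size of at most 1024 can be selected like this:

cat <datafile> | pm_select_rows -c size -v 1024 -o le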
The pm_sort tool sorts the table by the values of a given column, either numerically (float compare) or by string compare.
Options:
- --column (Short: -c)
Name of the column to compare
REQUIRED
- --string_compare (Short: -s)
Compare values as strings not as floats
OPTIONAL
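For example, to sort the table numerically by a hypothetical time column:

cat <datafile> | pm_sort -c time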
The pm_sql_join_tables script can be used to merge two data files into a new one. Consider the following scenario: you have two tables from two different data sources, both with columns x, y and z, where x and y are common parameters and z holds different results. You can use this script to create a new table with columns x, y, z1 and z2.
Currently it is not possible to pipe the tables into this script so both tables have to be given as arguments:
pm_sql_join_tables [options] <table_name1:file1> <table_name2:file2>
Each table must have a (short) name given before the colon. This name is used to distinguish both tables and their columns.
Options:
- --selection (Short: -s)
Comma separated list of column names used for comparison in the join (each given column must exist in every table)
REQUIRED
- --additional (Short: -a)
Comma separated list of additional column names to include in output (table name followed by a dot and the column of this table, example: table_name1.time)
OPTIONAL
Example: Consider two tables, res_comp1.pmd and res_comp2.pmd, each with the columns x, y and result, which might represent some experimental results for two different computers. Now we want a single table containing the results from both computers in a single line for each pair of x and y arguments:
pm_sql_join_tables -s x,y -a comp1.result,comp2.result \
  comp1:res_comp1.pmd comp2:res_comp2.pmd
In the output, the columns given by the "-s" argument are used to match the rows of both tables, and the columns given by "-a" are appended to the final output as the results.
The pm_sql_select script is an extended version of pm_sql_join_tables. It uses an internal SQL database to combine several tables (not just two) into a new one. Some basic SQL knowledge may be required to use this script.
All given data files are read and converted into SQL tables, and a select statement is issued to build the new table. Data sources and constraints can be freely chosen.
Currently it is not possible to pipe the tables into this script so all tables have to be given as arguments:
pm_sql_select [options] <table_name1:file1> <table_name2:file2> ...
Each table must have a (short) name given before the colon. This name is used to distinguish the tables and their columns.
Options:
- --selection (Short: -s)
Comma separated list of column names to include in the output table (table name followed by a dot and the column of this table, example: table_name1.time)
REQUIRED
- --from
SQL from clause. This selects the data sources.
Example:
table_name1 inner join table_name2
REQUIRED
- --on
SQL on clause. This defines the select constraints.
Example:
table_name1.x=table_name2.x and table_name2.y=2
REQUIRED
Example: Consider again two tables, res_comp1.pmd and res_comp2.pmd, which might represent some experimental results for two different computers. Now we want to add the result for x=1 from table comp2 as another column of table comp1, for instance in order to calculate the ratio between the results:
pm_sql_select -s comp1.x,comp1.result,comp2.result \
  --from='comp1 inner join comp2' \
  --on='comp2.x="1"' \
  comp1:res_comp1.pmd comp2:res_comp2.pmd
In the output, each line of comp1 is then extended by the comp2 result for x=1.
Note: Since all values are stored as strings no matter whether they describe numbers or not, you need to include double quotes in the on statement if you compare specific values!
The pm_stats_for_column tool calculates some statistics for a given column: the number of values, the sum of all values, the average, and the minimum and maximum value. Additionally, the standard deviation and coefficient of variation are printed.
Options:
- --column (Short: -c)
Name of the column
REQUIRED
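For example, to print the statistics for a hypothetical time column:

cat <datafile> | pm_stats_for_column -c time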