Table of Contents
- 1. Introduction
- 2. File parsing
- 3. Tools for table handling
- 3.1. pm_append_column
- 3.2. pm_append_table
- 3.3. pm_apply_to_column
- 3.4. pm_combine_columns
- 3.5. pm_conv2dat_lines
- 3.6. pm_enum_rows
- 3.7. pm_histogram
- 3.8. pm_interval_histogram
- 3.9. pm_print_firstcolval
- 3.10. pm_rename_column
- 3.11. pm_select_columns
- 3.12. pm_select_rows
- 3.13. pm_sort
- 3.14. pm_sql_join_tables
- 3.15. pm_sql_select
- 3.16. pm_stats_for_column
- 3.17. pm_export_r
The pm_tools can be used to parse arbitrary files and create data sets for values described by regular expressions. The main purpose is to parse output from scientific programs and transform it into a table for further processing.
The pm_tools toolkit is a collection of Python and Perl scripts, so a working Python and Perl installation is required. Additionally, a shell with pipe support is highly recommended since the tools are designed to read from stdin and write the transformed data to stdout. You should be familiar with pipes and file redirections.
The first phase is to bring the data from an arbitrary log format into a tabular format for further processing. The pm_parse_ext tool reads the log file from stdin and uses a template description to parse it and create tabular entries.
The data is transformed by using a template description of the data. The template consists of
- State descriptions,
- Column names,
- State order,
- (optional) List of states without newline.
Let’s start with an example. Consider the following data file:
x=1.5
y=3
y=4
y=6
info:finished
x=3.6
y=2
info:aborted
Each data set consists of an initial x value, several y values and a final info output. Such a set can occur an arbitrary number of times (twice in this example). A valid description for the x line is:
1:x=([\d\.]+)
The x= matches the literal text, while the parentheses mark the information we want to gather.
Accordingly, the y values can be parsed by:
2:y=([\d\.]+)
and the final info line can be parsed by:
3:info:(\w+)
Now we can assign names to the groups in the expressions:
1,1:x
2,1:y
3,1:information
Finally, after the states are described, we can give the order of these states:
states:1 2+ 3
which means that after a single occurrence of state 1 there can be one or more occurrences of state 2 and a final occurrence of state 3.
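Putting these pieces together, a plausible complete template file for this example (used as mytemplate below, assuming one template entry per line) looks like this:

1:x=([\d\.]+)
2:y=([\d\.]+)
3:info:(\w+)
1,1:x
2,1:y
3,1:information
states:1 2+ 3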
When using the pm_parse_ext tool with this template:
cat datafile | pm_parse_ext -t mytemplate
the tool outputs the following table (in ASCII format of course):
x   | y | information
----|---|-------------
1.5 | 3 | finished
1.5 | 4 | finished
1.5 | 6 | finished
3.6 | 2 | aborted
Each data element in the data file needs to be described by a regular expression. A state describes a line of data. Later on, you can specify in which order these states can appear in the data file.
The format of such a state is as follows:
<State number>:<regular expression>
You can use as many states as you need to describe every important data line.
Usually you are only interested in part of a line, for example some numbers. Therefore, you can mark groups in the regular expressions which will be put into the final table. Groups are marked by enclosing parentheses. So if you have a data line like
var1=5.4
and you are interested in the value, you can use the following description:
1:var1=([\d\.]+)
This is not a strictly correct expression for floating point numbers, but the program output is usually well defined, so it works in practice.
The groups defined in the regular expressions need to be named so the resulting data in the table can be identified. After the state description a list of column names follows. For each state you can name each group of the regular expression by using the following format:
<state number>,<group number>:<column name>
The group number starts at 1 and groups are counted from left to right. In the previous example it could be
1,1:var_value
After describing the states and giving the columns reasonable names, it is time to describe the order of the states. This is just a list of states with some modifiers similar to regular expressions. The line begins with
states:
followed by a space separated list of states, each of which may be followed by ? for an optional occurrence of this state or + for a repeated occurrence. A description line like
states:1 2? 3+ 4
means that after a single occurrence of state 1 (i.e. a data line matching the corresponding regular expression for state 1), state 2 is optional: it may occur in the input but does not have to. Then, one or more lines described by state 3 must occur, and at the end state 4 must match a data line. Lines not matching any state are completely ignored.
After all states have been matched, another round is started, beginning with the first state, to represent a new data set.
Although it is possible to use several + modifiers, the result is usually not what you expect. The parser creates the table by building the cross product of the matched lines of all states. States that occur only once contribute a single line, but repeated states contribute several lines, and for each of those lines the columns from the other states are added to the table. If a second repeated state is used, each of its lines is duplicated for every line of the other repeated state, which is usually not the intended result, but the parser cannot predict what the user wants in this case.
Sometimes it is necessary to use different states for data on the same line. For this, it is possible to declare some states as "no newline" states: after such a state has matched, the same input line is parsed again with the following state. These states are given by:
nnlstates:<state number> <state number> ...
which is just a space separated list of states.
As an example, reconsider the data from above. If the data file looks like the following:
x=2;y=3
y=4
y=5
info:finished
x=3;y=6
y=2
info:finished
the template file from above can’t be used. By declaring state 1 as a no newline state it will work again:
nnlstates:1
The description of state 2 also needs to be updated since the y value no longer starts at the beginning of the line:
2:.*y=([\d\.]+)
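Assuming again one template entry per line, the complete template for this data file could look like this (only the state 2 expression and the added nnlstates line differ from the previous example):

1:x=([\d\.]+)
2:.*y=([\d\.]+)
3:info:(\w+)
1,1:x
2,1:y
3,1:information
states:1 2+ 3
nnlstates:1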
After writing the template file you can easily parse the data file by
piping it into the pm_parse_ext tool giving the template file as argument
-t
:
cat <datafile> | pm_parse_ext -t <templatefile>
The output is a data file which we call a "pmd" file. The preferred file extension is ".pmd" but that actually doesn’t matter as they are plain text files.
The output of the pm_parse_ext tool is a table which can be post-processed by any of the following tools. Just pipe the output of the pm_parse_ext tool (or any other tool from the pm_tools suite) into a new instance:
cat <datafile> |\
  pm_parse_ext -t <templatefile> |\
  pm_??? --some_option
Every tool provides a short help text describing its options. Use the help flag (-h) for more information.
The pm_append_column tool adds a new column filled with a given value.
Options:
- --name (Short: -n)
Name of the new column
REQUIRED
- --value (Short: -v)
Value for column
REQUIRED
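For example, to tag every row with the (hypothetical) machine name the data was measured on:

cat results.pmd | pm_append_column -n host -v node01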
The pm_append_table tool concatenates several tables that are piped together into it. All tables need to have identical columns, including the column order.
Options: None
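For example, assuming two hypothetical result files with identical columns, they can be concatenated like this:

cat results_run1.pmd results_run2.pmd | pm_append_table > results_all.pmd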
The pm_apply_to_column tool combines several lines into a single one by applying a mathematical operation to a given column. This can be the average, sum, minimum or maximum, optionally complemented by the standard deviation or coefficient of variation. The operation considers all data lines that are identical in all columns except the given one.
Options:
- --column (Short: -c)
Name of the column to modify
REQUIRED
- --operation (Short: -o)
Operation to apply (avg,sum,min,max)
REQUIRED
- --deviation (Short: -d)
Calculate specified deviation. Available options are:
- std: Standard deviation
- var: Coefficient of variation
OPTIONAL
- --deviation_colname
Column name for the deviation column. Defaults to "dev"
OPTIONAL
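For example, assuming a hypothetical column named time that was measured several times for otherwise identical parameter combinations, the repetitions can be averaged and a standard deviation column added like this:

cat <datafile> | pm_apply_to_column -c time -o avg -d std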
The pm_combine_columns tool combines several columns into a new column by using the values of all selected columns for each line. A given lambda function is called with these values to create the value for the new column. By default, the function has to accept the column names as parameters, but parameters can be renamed to get shorter names.
Options:
- --selection (Short: -s)
Selection of columns to combine.
It’s a comma separated list of column names. A column name can optionally be followed by a colon and a parameter name; the function will be called with this name as an argument.
REQUIRED
- Example: -s x,y,var_value:v
- In this selection the function has to accept 3 parameters x,y and v (var_value is renamed to v).
- --name (Short: -n)
Name of the new column
REQUIRED
- --function (Short: -f)
Function to apply
Example:
lambda x,y,v : float(x) * float(y) / float(v)
REQUIRED
- --include (Short: -i)
Python file to include
If the lambda function is not enough to implement the desired function it is possible to use a separate file with function definitions. If this option is used the function argument -f is just the name of the function to call (default name is combinefunc). The values of the selected column entries will be given to the arguments with the corresponding name or alias (just like the lambda function in the preceding example).
OPTIONAL
Example: Consider two columns "time" and "events", from which you want to compute the number of events per time unit.
cat <datafile> | pm_combine_columns -n events_per_time \
  -s time,events -f "lambda time,events:float(events)/float(time)"
or in a shorter form by using parameter renaming:
cat <datafile> | pm_combine_columns -n events_per_time \
  -s time:t,events:e -f "lambda t,e:float(e)/float(t)"
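If the lambda function gets too complex, the same computation could also be placed in an include file. A minimal sketch, assuming a hypothetical file myfuncs.py:

def combinefunc(t, e):
    # t and e are the (renamed) column values, passed as strings
    return float(e) / float(t)

cat <datafile> | pm_combine_columns -n events_per_time \
  -s time:t,events:e -i myfuncs.py -f combinefunc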
The pm_conv2dat_lines tool transforms the data file into a format recognized by the plotting tool "dat2eps".
Options:
- --x-axis (Short: -x)
Column to use for x axis values
REQUIRED
- --y-axis (Short: -y)
Column to use for y axis values
REQUIRED
- --groupby (Short: -g)
Output is grouped based on multiple values in given column
REQUIRED
- --group-order
Comma separated list of group values. There will only be a group for each given value, in the given order. The symbolic value REST may be used to append all groups not in this list at the end.
OPTIONAL
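For example, assuming hypothetical columns named size, time and host, a dat2eps input with one group per host value could be created like this:

cat <datafile> | pm_conv2dat_lines -x size -y time -g host > plot.dat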
The pm_enum_rows tool enumerates all rows by adding a new column with a running row number.
Options:
- --name (Short: -n)
New column name
REQUIRED
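For example, to add a new column named index with the row numbers:

cat <datafile> | pm_enum_rows -n index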
The pm_histogram tool counts identical lines in the data file. The new table contains only one entry for each distinct line, plus a new column with the number of occurrences of this line.
Options:
- --name (Short: -n)
Name of the histogram column
OPTIONAL
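For example, to count identical lines and store the counts in a column named count:

cat <datafile> | pm_histogram -n count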
The pm_interval_histogram tool works similarly to the histogram tool, but uses logarithmic ranges for counting the data lines that fall into these ranges. A column needs to be selected; all data lines that are identical in all columns except this one are counted towards the range matching the value of the selected column.
Options:
- --column (Short: -c)
Name of the column to check range
REQUIRED
- --name (Short: -n)
Name of the histogram column
OPTIONAL
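For example, assuming a hypothetical column named latency, the lines can be counted per logarithmic latency range like this:

cat <datafile> | pm_interval_histogram -c latency -n count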
The pm_print_firstcolval tool just prints the first value of a given column without any other information. The output can easily be used in scripts, for example.
Options:
- --column (Short: -c)
Name of the column
REQUIRED
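For example, a shell script could capture an average computed by pm_apply_to_column (assuming a hypothetical time column):

avg_time=$(cat <datafile> | pm_apply_to_column -c time -o avg | pm_print_firstcolval -c time)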
The pm_rename_column tool renames a given column.
Options:
- --column (Short: -c)
Name of the column to modify
REQUIRED
- --name (Short: -n)
New name for the selected column
OPTIONAL
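For example, to rename a hypothetical column time to runtime:

cat <datafile> | pm_rename_column -c time -n runtime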
The pm_select_columns tool selects a subset of all available columns. The only argument is a comma separated list of columns to include in the new table. The order matters, and it is also possible to duplicate a column by repeating its name.
Options:
- --selection (Short: -s)
Comma separated list of column names.
REQUIRED
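For example, to keep only the hypothetical columns time and x, in that order:

cat <datafile> | pm_select_columns -s time,x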
The pm_select_rows tool selects certain lines by comparing the value of a selected column to a given value. You need to specify a column and a value to compare all lines against; the compare operation can be chosen.
Options:
- --column (Short: -c)
Name of the column to compare values
REQUIRED
- --value (Short: -v)
Value to compare to
REQUIRED
- --operation (Short: -o)
Compare operation:
- lt: Lower than
- le: Lower equal
- eq: Equal (default)
- ge: Greater equal
- gt: Greater than
- ne: Not equal
OPTIONAL
- --string_compare (Short: -s)
Compare values as strings not as floats
OPTIONAL
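For example, assuming a hypothetical column named size, all lines with a size of at most 1024 can be selected like this:

cat <datafile> | pm_select_rows -c size -v 1024 -o le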
The pm_sort tool sorts the table by the values of a given column, either numerically (float compare) or by string compare.
Options:
- --column (Short: -c)
Name of the column to compare
REQUIRED
- --string_compare (Short: -s)
Compare values as strings not as floats
OPTIONAL
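For example, to sort the table numerically by a hypothetical time column:

cat <datafile> | pm_sort -c time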
The pm_sql_join_tables script can be used to merge two data files into a new one. Consider the following scenario: you have two tables from two different data sources, both with columns x, y and z, where x and y are common parameters and z holds different results. You can use this script to create a new table with columns x, y, z1 and z2.
Currently it is not possible to pipe the tables into this script so both tables have to be given as arguments:
pm_sql_join_tables [options] <table_name1:file1> <table_name2:file2>
Each table must have a (short) name given before the colon. This name is used to distinguish both tables and their columns.
Options:
- --selection (Short: -s)
Comma separated list of column names used for comparison in the join (each given column must exist in every table)
REQUIRED
- --additional (Short: -a)
Comma separated list of additional column names to include in output (table name followed by a dot and the column of this table, example: table_name1.time)
OPTIONAL
Example: Consider two tables, res_comp1.pmd and res_comp2.pmd, each with the columns x, y and result, which might represent some experimental results for two different computers. Now we want a single table containing the results from both computers in a single line for each pair of x and y arguments:
pm_sql_join_tables -s x,y -a comp1.result,comp2.result \
  comp1:res_comp1.pmd comp2:res_comp2.pmd
In the output, the columns given by the "-s" argument are used to match the rows of both tables, and the columns given by "-a" are appended to the final output as the results.
The pm_sql_select script is an extended version of pm_sql_join_tables. It uses an internal SQL database to combine several tables (not just two) into a new one. Some basic SQL knowledge may be required to use this script.
All given data files are read and converted into SQL tables, and a select statement is issued to build the new table. Data sources and constraints can be freely chosen.
Currently it is not possible to pipe the tables into this script so all tables have to be given as arguments:
pm_sql_select [options] <table_name1:file1> <table_name2:file2> ...
Each table must have a (short) name given before the colon. This name is used to distinguish the tables and their columns.
Options:
- --selection (Short: -s)
Comma separated list of column names to include in the output table (table name followed by a dot and the column of this table, example: table_name1.time)
REQUIRED
- --from
SQL from clause. This selects the data sources.
Example:
table_name1 inner join table_name2
REQUIRED
- --on
SQL on clause. This defines the select constraints.
Example:
table_name1.x=table_name2.x and table_name2.y=2
REQUIRED
Example: Consider again two tables, res_comp1.pmd and res_comp2.pmd, which might represent some experimental results for two different computers. Now we want to add the result for x=1 from table comp2 as another column of table comp1, for instance in order to calculate the ratio between the results:
pm_sql_select -s comp1.x,comp1.result,comp2.result \
  --from='comp1 inner join comp2' \
  --on='comp2.x="1"' \
  comp1:res_comp1.pmd comp2:res_comp2.pmd
In the output, each line of comp1 is then extended by the comp2 result for x=1.
Note: Since all values are stored as strings no matter whether they describe numbers or not, you need to include double quotes in the on statement if you compare specific values!
The pm_stats_for_column tool calculates some statistics for a given column: the number of values, the sum of all values, the average, and the minimum and maximum value. Additionally, the standard deviation and coefficient of variation are printed.
Options:
- --column (Short: -c)
Name of the column
REQUIRED
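For example, to print the statistics for a hypothetical time column:

cat <datafile> | pm_stats_for_column -c time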