= Genetic Programming Framework
Ricardo Nieto <nifr91@gmail.com>
1.0, Oct 21, 2020
:toc:
:stem: latexmath
:prewrap!:
:source-highlighter: rouge

Implementation of different tree-based genetic programming meta-heuristics.

This version is the snapshot of the project for the published article:

The paper proposes a novel GP algorithm named GMD-GP and compares it to 9 algorithms.

Experimental validation showed that GMD-GP improves the fitness ("quality") of solutions compared to standard (vanilla) GP, and is competitive with state-of-the-art algorithms on the Symbolic Regression (SR) benchmark problem.

This repository contains the source code of the implemented methods and the datasets used in training and validation.

== Install

=== Prerequisites

* A GNU/Linux OS like:
** https://manjaro.org/[Manjaro]
* The following programming languages:
** https://crystal-lang.org/[Crystal Lang v1.0+]
** https://www.ruby-lang.org/en/[Ruby V3.0+]
** https://www.python.org/[Python V3.9+]

=== Source Code

To install from source, clone the repository and build the application:

[source,sh]

git clone https://gitlab.com/nifr91/genetic-programming gp

Then `cd` into the folder and compile the apps:

[source,sh]

cd gp
shards build --release gpdmd.app

This generates an executable `gpdmd.app` in the `bin` folder.

== Usage

The program solves a Symbolic Regression problem by training an s-expression on data points. It expects the data in two files: one for the input variables (`xtrain`) and another for the expected values (`ytrain`).

.Basic usage
[source,sh]

bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y

=== Example

Suppose we have data from the following model:

[stem]
++++
y(x) = 2*x + 0.5
++++

in the following two files, `data.x` and `data.y`:

[options="header"]
[cols="3*a"]
|===
^| File ^| File ^| Graph stem:[y(x)]

|.data.x
[source,txt]

include::example/data.x[]

|.data.y
[source,txt]

include::example/data.y[]

.^|image::docs/readme/data.svg[]
|===

Solve the regression problem with different options:

.Default values
[source,sh]

bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y

((train-fit 0.0) (size 13) (best-sol (+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))))

.Train for 500 evaluations
[source,sh]

bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --max-evals 500

((train-fit 0.0) (size 17) (best-sol (+ (+ (/ 1 2) x0) (+ (* (+ x0 (- 1 (- 3 x0))) 0) x0))))

.Use only term set [x,int] and function set [+]
[source,sh]

bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --terms=int --funs=+

((train-fit 0.25) (size 3) (best-sol (+ x0 x0)))
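
As a sanity check, the three solutions above can be evaluated against the model stem:[y(x) = 2*x + 0.5]. The following Python sketch is not part of the framework: the helpers `parse`, `evaluate` and `size` are hypothetical names, plain (unprotected) division is assumed, and the sample points are arbitrary rather than taken from `example/data.x`.

```python
from fractions import Fraction


def parse(text):
    """Parse one s-expression into nested lists of string tokens."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree


OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,  # assumption: plain division, no zero protection
}


def evaluate(tree, x0):
    """Evaluate a binary expression tree exactly, using rational arithmetic."""
    if isinstance(tree, list):
        op, left, right = tree
        return OPS[op](evaluate(left, x0), evaluate(right, x0))
    return x0 if tree == "x0" else Fraction(tree)


def size(tree):
    """Count the nodes (functions and terminals) of the tree."""
    if isinstance(tree, list):
        return 1 + sum(size(child) for child in tree[1:])
    return 1


SOLUTIONS = [
    "(+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))",
    "(+ (+ (/ 1 2) x0) (+ (* (+ x0 (- 1 (- 3 x0))) 0) x0))",
    "(+ x0 x0)",
]

for text in SOLUTIONS:
    tree = parse(text)
    xs = [Fraction(x) for x in range(-3, 4)]  # arbitrary sample points
    errors = [evaluate(tree, x) - (2 * x + Fraction(1, 2)) for x in xs]
    mse = sum(e * e for e in errors) / len(errors)
    print(size(tree), float(mse))
# prints:
# 13 0.0
# 17 0.0
# 3 0.25
```

The node counts match the reported `size` fields, and the first two solutions simplify exactly to the model. The error of `(+ x0 x0)` is consistent with `train-fit` being a mean squared error, although this README does not state which error measure the framework uses.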

== Naming convention

The scripts expect the following naming conventions:

.Output file
[source,txt]

data--gpdmd--example.out--00
|     |      |       |    |___> Independent execution sequential id
|     |      |       |________> Extension for output data
|     |      |________________> Experiment id
|     |_______________________> Method name (id)
|_____________________________> Dataset (problem) id

.Population Evolution Metrics
[source,txt]

data--gpdmd--example--entropy.evo--00
|     |      |        |       |    |___> Independent execution sequential id
|     |      |        |       |________> Extension for evolution data
|     |      |        |________________> Evolution metric of the population
|     |      |_________________________> Experiment id
|     |________________________________> Method name (id)
|______________________________________> Dataset (problem) id

.Merged independent executions values
[source,txt]

data--gpdmd--example--train-fit.merge
|     |      |        |         |___> Extension for merged data
|     |      |        |_____________> Merged metric
|     |      |______________________> Id for the experiment
|     |_____________________________> Method name
|___________________________________> Problem/dataset name

.Population Evolution Statistics
[source,txt]

data--gpdmd--example--entropy.median
|     |      |        |       |___> Extension for the statistic
|     |      |        |___________> Merged metric
|     |      |____________________> Id for the experiment
|     |___________________________> Method name
|_________________________________> Problem/dataset name

.SVG
[source,txt]

data--example--entropy-median--lineplot.svg
|     |        |               |        |___> Extension for svg
|     |        |               |____________> Plot-type
|     |        |____________________________> Merged metric
|     |_____________________________________> Id for the experiment
|___________________________________________> Problem/dataset name
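
The conventions above can be read mechanically by splitting on `--` and on the extension dot. The following Python sketch is only an illustration: `parse_name` is a hypothetical helper that is not part of the repository scripts, and it assumes that no field contains a dot or a double dash.

```python
def parse_name(filename):
    """Split a file name that follows the conventions above into its fields."""
    stem, _, tail = filename.partition(".")  # fields before the extension
    ext, _, run = tail.partition("--")       # extension and optional run id
    parts = stem.split("--")
    if ext == "svg":
        # Plot files carry no method name.
        keys = ["dataset", "experiment", "metric", "plot"]
    else:
        keys = ["dataset", "method", "experiment", "metric"]
    info = dict(zip(keys, parts))  # zip drops "metric" when it is absent
    info["ext"] = ext
    if run:
        info["run"] = run
    return info


print(parse_name("data--gpdmd--example.out--00"))
print(parse_name("data--gpdmd--example--entropy.evo--00"))
print(parse_name("data--example--entropy-median--lineplot.svg"))
```

For the first name this yields the dataset `data`, the method `gpdmd`, the experiment `example`, the extension `out` and the run id `00`.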

== Results Comparison

The comparison between methods is carried out by applying a set of statistical tests to at least 30 independent executions with the same datasets as input.

The script `compare-dist.py` applies the following tests, assuming a significance level of 5%. First, a Shapiro-Wilk test is applied to verify whether the results follow a Gaussian distribution. If so, the Levene test is used to check the homogeneity of the variances. When the variances are equal, an ANOVA test is done; if they differ, a Welch test is performed. For non-Gaussian distributions, the non-parametric Kruskal-Wallis test is used.

Therefore the statement "algorithm A is superior to algorithm B" means that the differences between them are statistically significant and that the median obtained by A is higher than the median achieved by B.
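
The decision procedure described above can be sketched with SciPy as follows. This is not the code of `compare-dist.py`: the function `compare` is a hypothetical name, normality is assumed to be checked on each sample separately, the Welch test is taken to be Welch's t-test for two samples, and the data are synthetic.

```python
import random

from scipy import stats


def compare(a, b, alpha=0.05):
    """Pick the test as described above and return (test_name, p_value)."""
    # Shapiro-Wilk on each sample (assumption: both must look Gaussian).
    gaussian = all(stats.shapiro(s).pvalue > alpha for s in (a, b))
    if not gaussian:
        return "kruskal-wallis", stats.kruskal(a, b).pvalue
    # Levene: a large p-value gives no evidence against equal variances.
    if stats.levene(a, b).pvalue > alpha:
        return "anova", stats.f_oneway(a, b).pvalue
    # Unequal variances: Welch's t-test.
    return "welch", stats.ttest_ind(a, b, equal_var=False).pvalue


# Synthetic results of 30 independent executions per method.
rng = random.Random(1)
method_a = [rng.gauss(0.0, 1.0) for _ in range(30)]
method_b = [rng.gauss(5.0, 1.0) for _ in range(30)]

name, p_value = compare(method_a, method_b)
print(name, "significant" if p_value < 0.05 else "not significant")
```

When the difference is significant, the medians of the two samples decide which method is reported as superior.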

. Compile the reference method with the `--release` flag.
+
[source,sh]

shards build --release gpdmd.app

. Compile the new method

. Create an output directory `out` and a directory `best`

. Then run the reference method at least 30 times
+
[source,sh]

parallel --eta --joblog gpdmd.log \
"bin/gpdmd.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
> out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)

. Then run the new method the same number of times as the reference method
+
[source,sh]

parallel --eta --joblog newmethod.log \
"./newmethod \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
> out/data--newmethod--t1.out--{}" \
::: $(seq -w 30)

The last line of `stdout` must be a Lisp-like association list with the fields `(train-fit numeric-value)`, `(val-fit numeric-value)` and `(best-sol (s-expression of the solution))`.
+
[source,scheme]

...
((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))

. Merge the independent values into a single file using the script `best-merge-metrics.rb`
+
[source,sh]

ls -d out/data--gpdmd--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best
ls -d out/data--stdgp--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best

This creates a folder `best` where the merged files for each field (`train`, `val` and `size`) are written.

. Compare the methods with the script `compare-methods.rb`
+
[source,sh]

ls -d best/data--*--t1--train-fit.merge | bin/compare-methods.rb --if=- > best/train-fit.adoc
ls -d best/data--*--t1--val-fit.merge | bin/compare-methods.rb --if=- > best/val-fit.adoc
ls -d best/data--*--t1--size.merge | bin/compare-methods.rb --if=- > best/size.adoc

This generates an asciidoc file inside the folder `best` with the pairwise comparisons and the medians comparison like the following.
+
.Medians of each method for the problems
[width="90%"]
[.center]
|===
| |gpdmd |stdgp
|data|[red]#0.018689#|0.025253
|===

.Methods pairwise comparison (stem:[\uparrow] better, = equal , stem:[\downarrow] worse)
[width="90%"]
[cols="<.,^.,^.,^."]
[.center]
|===
||gpdmd|stdgp .2+^.^|stem:[\sum\Delta\uparrow\downarrow]
||stem:[\uparrow]:==:stem:[\downarrow]|stem:[\uparrow]:==:stem:[\downarrow]
|gpdmd| - |01:00:00|1
|stdgp|00:00:01| - |-1
|===
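
A new method only has to honor the final-line format required in the steps above. As a minimal sketch of how such a line can be read, the following Python helper extracts the numeric fields and the solution. `read_result` is a hypothetical name, not one of the repository scripts, and it assumes that `best-sol` is the last field of the line, as in the examples of this README.

```python
import re


def read_result(stdout_text):
    """Extract the fields of the last line of a run's standard output."""
    last = stdout_text.strip().splitlines()[-1]
    fields = {}
    # Numeric fields such as (train-fit 0.1), (val-fit 0.1) or (size 3).
    pattern = r"\((train-fit|val-fit|size)\s+([^\s()]+)\)"
    for key, value in re.findall(pattern, last):
        fields[key] = float(value)
    # Assumption: best-sol closes the line, so its value ends before "))".
    match = re.search(r"\(best-sol\s+(.*)\)\)\s*$", last)
    if match:
        fields["best-sol"] = match.group(1)
    return fields


output = "...\n((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))\n"
print(read_result(output))
# prints: {'train-fit': 0.1, 'val-fit': 0.1, 'best-sol': '(+ x0 x0)'}
```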

== Population Dynamics

For the population dynamics analysis, the population is printed as a Lisp-like expression on a single line.

. Compile the methods with the `--release` flag.
+
[source,sh]

shards build --release gpdmd.app
shards build --release lexicase.app

. Create an output directory `out` and a directory `pop`

. Then run the reference method at least 30 times, and remember to enable the population log.
+
[source,sh]

parallel --eta --joblog gpdmd.log \
"bin/gpdmd.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
--log-conf \
--log-evo \
> out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)

parallel --eta --joblog lexicase.log \
"bin/lexicase.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
--log-conf \
--log-evo \
> out/data--lexicase--t1.out--{}" \
::: $(seq -w 30)

The configuration and population logs must be enabled. The first line must be the configuration, followed by one line per iteration containing a Lisp-like expression with at least the fields `(generation integer)`, `(time seconds)`, `(evaluations integer)` and `(population (indv0 indv1 indv2))`.
+
[source,txt]

((configuration "v1.0") (run-name "run-name") ... )
((generation 0) (time 0.0) (evaluations 200)(population ((...) (...) ...)) ...)
((generation 1) (time 1.2) (evaluations 400)(population ((...) (...) ...)) ...)

. Run the script `population-metrics` to evaluate the population for each file
+
[source,sh]

parallel --eta --joblog popmet.log \
"bin/population-metrics.app \
--out-dir=pop out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)

parallel --eta --joblog popmet.log \
"bin/population-metrics.app \
--out-dir=pop out/data--lexicase--t1.out--{}" \
::: $(seq -w 30)

For each metric and each independent execution, it outputs a file in which each line holds the metric of the population at one generation.

. For each metric merge the evolutions of the independent executions
+
[source,sh]

ls -d pop/*gpdmd*avg-dcn-edit2.evo* pop/*gpdmd*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/*gpdmd*max-mark-den.evo* pop/*gpdmd*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-

ls -d pop/*lexicase*avg-dcn-edit2.evo* pop/*lexicase*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/*lexicase*max-mark-den.evo* pop/*lexicase*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-

. For each merged file extract the statistics of interest, files with extensions `.median`, `.mean`, `.var`, `.ptp`, `.min` and `.max`.
+

bin/population-stats.rb --if=pop/data--gpdmd--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--gpdmd--t1--avg-dcn-edit2.merge --out-dir=pop

bin/population-stats.rb --if=pop/data--lexicase--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--lexicase--t1--avg-dcn-edit2.merge --out-dir=pop

. Finally, plot how the metric changes with the rng-metric (time); each generated file holds, per line, the statistic over the 30 independent executions.
+
[source,sh]

gem install svg-lib
mkdir -p svg
eplot pop/data--*--avg-dcn-edit2.median > svg/data--avg-dcn-edit2--lineplot.svg

[width=50%]
[align="center"]
image::docs/readme/data--avg-dcn-edit2--lineplot.svg[]
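
The statistics extracted in the steps above summarize, for every generation, the values of the independent executions. The following Python sketch shows the idea on an in-memory table. It is not the code of `population-stats.rb`: the layout of `rows` (one row per generation, one value per run) and the use of the population variance for `.var` are assumptions, and `summarize` is a hypothetical name.

```python
import statistics

# One reducer per output extension named in the steps above.
STATS = {
    "median": statistics.median,
    "mean": statistics.fmean,
    "var": statistics.pvariance,  # assumption: population variance
    "ptp": lambda values: max(values) - min(values),  # peak to peak
    "min": min,
    "max": max,
}


def summarize(rows):
    """rows[g] holds the metric of every independent run at generation g."""
    return {name: [reduce(row) for row in rows] for name, reduce in STATS.items()}


# Hypothetical metric values: 2 generations, 3 independent executions.
rows = [
    [1.0, 3.0, 2.0],
    [4.0, 4.0, 10.0],
]

for name, series in summarize(rows).items():
    print(name, series)
```

Each resulting series has one value per generation, which is the shape that a line plot of a metric over the generations needs.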

== Implemented Algorithms

* AFPO
* ALPS
* DPSBR
* FOCUS
* GMDGP
* KNOVELTY
* LEXICASE
* GPDMD
* SIS
* STDGP

== Options

[source,scheme]

include::example/config.ss[]

== Updates

* V0.3
** Better configuration object
* V0.2
** Evolution through Evolution object
* V0.1
** Implementation of 10 algorithms

== Contact Information

nifr91@gmail.com

== Contributing

The GPCR repository is hosted at nifr91/genetic-programming on GitLab.

Read the general Contributing guide, and then:

. Fork it
. Create your method branch (`git checkout -b new-method`)
. Commit your changes (`git commit -am 'Add a new method'`)
. Push to the branch (`git push origin new-method`)
. Create a new Pull Request


== License

MIT License
