genetic-programming
= Genetic Programming Framework
Ricardo Nieto nifr91@gmail
1.0, Oct 21, 2020
:toc:
:stem: latexmath
:prewrap!:
:source-highlighter: rouge

Implementation of different tree-based genetic programming meta-heuristics.
This version is the snapshot of the project for the published article:
[options="noheader"]
[.center]
[cols="10,<~"]
|===
|Title | DMD-GP: A genetic programming variant with Dynamic Management of Diversity
|Authors| R. Nieto-Fuentes, Carlos Segura
|Year | 2021
|DOI |
|===

In the paper, a novel GP algorithm named GMD-GP is proposed and compared to 9 algorithms.
Experimental validation showed that GMD-GP improves the fitness "quality" of solutions when compared to standard (vanilla) GP, and is competitive with state-of-the-art algorithms on the Symbolic Regression (SR) benchmark problem.
This repository contains the source code of the implemented methods and the datasets used for training and validation.
== Install
=== Prerequisites
- A GNU/Linux OS like:
** https://manjaro.org/[Manjaro]
- The following programming languages:
** https://crystal-lang.org/[Crystal Lang v1.0+]
** https://www.ruby-lang.org/en/[Ruby V3.0+]
** https://www.python.org/[Python V3.9+]
=== Source Code
To install from source code, clone the repository and build the application:

[source,sh]
----
git clone https://gitlab.com/nifr91/genetic-programming gp
----

Then `cd` into the folder and compile the apps:

[source,sh]
----
cd gp
shards build --release gpdmd.app
----

This generates an executable `gpdmd.app` in the `bin` folder.
== Usage
The program solves a Symbolic Regression problem by fitting an s-expression to data points. It expects the data in two files: one with the input variables (`xtrain`) and another with the expected values (`ytrain`).
.Basic usage
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y
----
=== Example
Suppose we have data from the following model:
[stem]
++++
y(x) = 2x + 0.5
++++

stored in the following two files, `data.x` and `data.y`:
[options="header"]
[cols="3*a"]
|===
^| File ^| File ^| Graph stem:[y(x)]

|.data.x
[source,txt]
----
include::example/data.x[]
----

|.data.y
[source,txt]
----
include::example/data.y[]
----

.^|image::docs/readme/data.svg[]
|===
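
As an illustration, data for the model above could be produced with a short Python script. This is only a sketch: the contents of `example/data.x` and `example/data.y` are not shown here, so the layout used below (one sample per line, plain decimal numbers) is an assumption, and the helper names `model`, `format_dataset` and `write_dataset` are hypothetical.

```python
from pathlib import Path


def model(x: float) -> float:
    """Target model from the example: y(x) = 2x + 0.5."""
    return 2.0 * x + 0.5


def format_dataset(xs):
    """Return (x_text, y_text) with one sample per line (assumed layout)."""
    x_text = "".join(f"{x}\n" for x in xs)
    y_text = "".join(f"{model(x)}\n" for x in xs)
    return x_text, y_text


def write_dataset(directory, xs):
    """Write data.x and data.y into `directory` and return both paths."""
    x_text, y_text = format_dataset(xs)
    x_path = Path(directory) / "data.x"
    y_path = Path(directory) / "data.y"
    x_path.write_text(x_text)
    y_path.write_text(y_text)
    return x_path, y_path


# Five sample points: 0.0, 0.5, 1.0, 1.5, 2.0
xs = [i / 2 for i in range(5)]
x_text, y_text = format_dataset(xs)
print(y_text, end="")
```

Calling `write_dataset("example", xs)` would then create the two files in the `example` folder; check the real files in the repository before relying on this layout.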
Solve the regression problem with different options:

.Default values
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y
((train-fit 0.0) (size 13) (best-sol (+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))))
----

.Train for 500 evaluations
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --max-evals 500
((train-fit 0.0) (size 17) (best-sol (+ (+ (/ 1 2) x0) (+ (* (+ x0 (- 1 (- 3 x0))) 0) x0))))
----

.Use only term set [x,int] and function set [+]
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --terms=int --funs=+
((train-fit 0.25) (size 3) (best-sol (+ x0 x0)))
----
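
The reported solutions can be checked by hand. The following Python sketch parses a `best-sol` s-expression and evaluates it against the model stem:[y(x) = 2x + 0.5]. It assumes that `train-fit` is the mean squared error and that `/` is plain (unprotected) division; neither is stated above, although the value 0.25 for `(+ x0 x0)` is consistent with a constant error of 0.5 under MSE. The function names are illustrative and not part of the framework.

```python
import operator

# Binary operators appearing in the example solutions.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def tokenize(text):
    """Split an s-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build nested lists from a token list (consumes the tokens)."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return token


def evaluate(node, env):
    """Evaluate a parsed expression; `env` maps variable names to values."""
    if isinstance(node, list):
        op, left, right = node
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if node in env:
        return env[node]
    return float(node)


def mse(expr, xs, ys):
    """Mean squared error of `expr` over the samples (assumed fitness)."""
    tree = parse(tokenize(expr))
    errors = [(evaluate(tree, {"x0": x}) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 0.5 for x in xs]
print(mse("(+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))", xs, ys))  # close to 0
print(mse("(+ x0 x0)", xs, ys))  # prints 0.25
```

The first expression simplifies to stem:[2x_0 + 1/3 + 1/6], which is the target model up to floating-point rounding.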
== Naming convention
The scripts expect the following naming conventions:
.Output file
[source,txt]
----
data--gpdmd--example.out--00
|     |      |       |    |_> Independent execution sequential id
|     |      |       |______> Extension for output data
|     |      |______________> Experiment id
|     |_____________________> Method name (id)
|___________________________> Dataset (problem) id
----

.Population Evolution Metrics
[source,txt]
----
data--gpdmd--example--entropy.evo--00
|     |      |        |       |    |_> Independent execution sequential id
|     |      |        |       |______> Extension for evolution data
|     |      |        |______________> Evolution metric of the population
|     |      |_______________________> Experiment id
|     |______________________________> Method name (id)
|____________________________________> Dataset (problem) id
----

.Merged independent executions values
[source,txt]
----
data--gpdmd--example--train-fit.merge
|     |      |        |         |_____> Extension for merged data
|     |      |        |_______________> Merged metric
|     |      |________________________> Id for the experiment
|     |_______________________________> Method name
|_____________________________________> Problem/dataset name
----

.Population Evolution Statistics
[source,txt]
----
data--gpdmd--example--entropy.median
|     |      |        |       |______> Extension for the statistic
|     |      |        |______________> Merged metric
|     |      |_______________________> Id for the experiment
|     |______________________________> Method name
|____________________________________> Problem/dataset name
----

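.SVG
[source,txt]
----
data--example--entropy-median--lineplot.svg
|     |        |               |        |___> Extension for svg
|     |        |               |____________> Plot-type
|     |        |____________________________> Merged metric
|     |_____________________________________> Id for the experiment
|___________________________________________> Problem/dataset name
----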
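
To make the convention concrete, the following Python sketch splits such a file name into its fields. It is not one of the repository scripts: `parse_name` is a hypothetical helper, it covers only the names that begin with dataset, method and experiment (not the SVG layout), and it assumes that no id contains `.` or `--`.

```python
def parse_name(name):
    """Split a result file name into its fields (assumed layout).

    Handles names such as:
      data--gpdmd--example.out--00
      data--gpdmd--example--entropy.evo--00
      data--gpdmd--example--train-fit.merge
    """
    stem, _, rest = name.partition(".")
    extension, _, run = rest.partition("--")
    keys = ["dataset", "method", "experiment", "metric"]
    info = dict(zip(keys, stem.split("--")))
    info["extension"] = extension
    if run:
        # Independent execution sequential id, kept as text ("00").
        info["run"] = run
    return info


print(parse_name("data--gpdmd--example--entropy.evo--00"))
```

The run id is kept as a string so that the zero padding produced by `seq -w` is preserved.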
== Results Comparison
The comparison between methods is carried out by applying a set of statistical tests to at least 30 independent executions with the same datasets as input.
The script `compare-dist.py` applies the following tests, assuming a significance level of 5%. First, a Shapiro-Wilk test is applied to verify whether the results follow a Gaussian distribution. If so, the Levene test is used to check for homogeneity of the variances. When the variances are equal, an ANOVA test is done; if they differ, a Welch test is performed. For non-Gaussian distributions, the non-parametric Kruskal-Wallis test is used.
Therefore, the statement "algorithm A is superior to algorithm B" means that the differences between them are statistically significant and that the median obtained by A is higher than the median achieved by B.
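
The decision flow described above can be sketched in Python as follows. This is not the code of `compare-dist.py`; the function names are hypothetical, and the p-values are taken as inputs so that the sketch stays independent of any statistics library. In practice they could come, for instance, from `scipy.stats.shapiro`, `scipy.stats.levene`, `scipy.stats.f_oneway` and `scipy.stats.kruskal`.

```python
from statistics import median


def choose_test(normality_pvalues, levene_pvalue, alpha=0.05):
    """Pick the comparison test from the preliminary checks.

    normality_pvalues: one Shapiro-Wilk p-value per sample.
    levene_pvalue: p-value of the Levene test on the samples.
    """
    # Shapiro-Wilk: the null hypothesis is that the sample is Gaussian.
    all_gaussian = all(p >= alpha for p in normality_pvalues)
    if not all_gaussian:
        return "kruskal-wallis"
    # Levene: the null hypothesis is that the variances are equal.
    if levene_pvalue >= alpha:
        return "anova"
    return "welch"


def is_superior(a, b, pvalue, alpha=0.05):
    """A is superior to B: significant difference and higher median."""
    return pvalue < alpha and median(a) > median(b)


print(choose_test([0.40, 0.62], levene_pvalue=0.30))  # anova
print(choose_test([0.40, 0.62], levene_pvalue=0.01))  # welch
print(choose_test([0.40, 0.01], levene_pvalue=0.30))  # kruskal-wallis
```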
. Compile the reference method with the `--release` flag.
+
[source,sh]
----
shards build --release gpdmd.app
----
. Compile the new method
. Create an output directory `out` and a directory `best`.
. Then run the reference method at least 30 times.
+
[source,sh]
----
parallel --eta --joblog gpdmd.log \
  "bin/gpdmd.app \
    --config-file=example/config.ss \
    --xtrain-file=example/data.x --ytrain-file=example/data.y \
    --xval-file=example/data.x --yval-file=example/data.y \
    > out/data--gpdmd--t1.out--{}" \
  ::: $(seq -w 30)
----
. Then run the new method the same number of times as the reference method.
+
[source,sh]
----
parallel --eta --joblog newmethod.log \
  "./newmethod \
    --config-file=example/config.ss \
    --xtrain-file=example/data.x --ytrain-file=example/data.y \
    --xval-file=example/data.x --yval-file=example/data.y \
    > out/data--newmethod--t1.out--{}" \
  ::: $(seq -w 30)
----
+
The last line of the stdout must be a Lisp-like association list with the fields `(train-fit numeric-value)`, `(val-fit numeric-value)` and `(best-sol (s-expression of solution))`.
+
[source,scheme]
----
...
((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))
----
. Merge the independent values into a single file using the script `best-merge-metrics.rb`.
+
[source,sh]
----
ls -d out/data--gpdmd--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best
ls -d out/data--stdgp--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best
----
+
This creates a folder `best` where the merged files of each field (`train`, `val` and `size`) are written.
. Compare the methods with the script `compare-methods.rb`.
+
[source,sh]
----
ls -d best/data--*--t1--train-fit.merge | bin/compare-methods.rb --if=- > best/train-fit.adoc
ls -d best/data--*--t1--val-fit.merge | bin/compare-methods.rb --if=- > best/val-fit.adoc
ls -d best/data--*--t1--size.merge | bin/compare-methods.rb --if=- > best/size.adoc
----
+
This generates an AsciiDoc file inside the folder `best` with the pairwise comparisons and the comparison of medians, like the following.
+
.Medians of each method for the problems
[width="90%"]
[.center]
|===
| |gpdmd |stdgp
|data |[red]#0.018689# |0.025253
|===

.Methods pairwise comparison (stem:[\uparrow] better, = equal, stem:[\downarrow] worse)
[width="90%"]
[cols="<.,^.,^.,^."]
[.center]
|===
| |gpdmd |stdgp .2+^.^|stem:[\sum\Delta\uparrow\downarrow]
| |stem:[\uparrow]:==:stem:[\downarrow] |stem:[\uparrow]:==:stem:[\downarrow]
|gpdmd | - |01:00:00 |1
|stdgp |00:00:01 | - |-1
|===
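
A new method only has to respect the format of the last stdout line described in the steps above. The following Python sketch shows one way to read the numeric fields of that line, for example to check the output of a new method before merging. It is not how `best-merge-metrics.rb` works internally; `read_result` and the regular expression are assumptions made for illustration.

```python
import re

# Matches numeric fields such as (train-fit 0.1), (val-fit 1e-3) or (size 13).
FIELD = re.compile(
    r"\((train-fit|val-fit|size)\s+([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\)"
)


def read_result(stdout_text):
    """Return the numeric fields found in the last non-empty stdout line."""
    last_line = stdout_text.strip().splitlines()[-1]
    return {name: float(value) for name, value in FIELD.findall(last_line)}


output = """some earlier log line
((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))
"""
print(read_result(output))  # {'train-fit': 0.1, 'val-fit': 0.1}
```

An empty dictionary signals that the last line does not contain any of the expected fields.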
== Population Dynamics
For the population dynamics analysis, the population is printed as a Lisp-like expression on a single line.
. Compile the methods with the `--release` flag.
+
[source,sh]
----
shards build --release gpdmd.app
shards build --release lexicase.app
----
. Create an output directory `out` and a directory `pop`.
. Then run the methods at least 30 times, with the population log enabled.
+
[source,sh]
----
parallel --eta --joblog gpdmd.log \
  "bin/gpdmd.app \
    --config-file=example/config.ss \
    --xtrain-file=example/data.x --ytrain-file=example/data.y \
    --xval-file=example/data.x --yval-file=example/data.y \
    --log-conf \
    --log-evo \
    > out/data--gpdmd--t1.out--{}" \
  ::: $(seq -w 30)

parallel --eta --joblog lexicase.log \
  "bin/lexicase.app \
    --config-file=example/config.ss \
    --xtrain-file=example/data.x --ytrain-file=example/data.y \
    --xval-file=example/data.x --yval-file=example/data.y \
    --log-conf \
    --log-evo \
    > out/data--lexicase--t1.out--{}" \
  ::: $(seq -w 30)
----
+
The logs for configuration and population must be enabled. The first line must be the configuration, followed by one line per iteration containing a Lisp-like expression with at least the fields `(generation integer)`, `(time seconds)`, `(evaluations integer)` and `(population (indv0 indv1 indv2))`.
+
[source,txt]
----
((configuration "v1.0") (run-name "run-name") ... )
((generation 0) (time 0.0) (evaluations 200)(population ((...) (...) ...)) ...)
((generation 1) (time 1.2) (evaluations 400)(population ((...) (...) ...)) ...)
----
. Run the script `population-metrics` to evaluate the population for each file.
+
[source,sh]
----
parallel --eta --joblog popmet.log \
  "bin/population-metrics.app \
    --out-dir=pop out/data--gpdmd--t1.out--{}" \
  ::: $(seq -w 30)

parallel --eta --joblog popmet.log \
  "bin/population-metrics.app \
    --out-dir=pop out/data--lexicase--t1.out--{}" \
  ::: $(seq -w 30)
----
+
For each metric and each independent execution, it writes a file in which each line holds the metric of the population at one generation.
. For each metric, merge the evolutions of the independent executions.
+
[source,sh]
----
ls -d pop/gpdmdavg-dcn-edit2.evo* pop/gpdmdgens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/gpdmdmax-mark-den.evo* pop/gpdmdgens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-

ls -d pop/lexicaseavg-dcn-edit2.evo* pop/lexicasegens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/lexicasemax-mark-den.evo* pop/lexicasegens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
----
. For each merged file, extract the statistics of interest. They are written to files with the extensions `*.median`, `*.mean`, `*.var`, `*.ptp`, `*.min` and `*.max`.
+
[source,sh]
----
bin/population-stats.rb --if=pop/data--gpdmd--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--gpdmd--t1--avg-dcn-edit2.merge --out-dir=pop

bin/population-stats.rb --if=pop/data--lexicase--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--lexicase--t1--avg-dcn-edit2.merge --out-dir=pop
----
. Finally, the change of each metric can be plotted against the rng-metric (time), since each generated file holds, per line, the statistic over the 30 independent executions.
+
[source,sh]
----
gem install svg-lib
mkdir -p svg
eplot pop/data--*--avg-dcn-edit2.median > svg/data--avg-dcn-edit2--lineplot.svg
----

[width=50%]
[align="center"]
image::docs/readme/data--avg-dcn-edit2--lineplot.svg[]
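
The statistics extracted in the steps above can be illustrated with a small Python sketch. It does not reproduce `population-stats.rb`: the layout of the merged data (one row per generation, one value per independent execution) and the use of the population variance for `var` are assumptions, and the function names are hypothetical. Here `ptp` is taken to mean peak to peak, that is, maximum minus minimum.

```python
from statistics import mean, median, pvariance


def generation_stats(values):
    """Statistics of one metric over the independent executions."""
    return {
        "median": median(values),
        "mean": mean(values),
        "var": pvariance(values),  # assumption: population variance
        "ptp": max(values) - min(values),  # peak to peak
        "min": min(values),
        "max": max(values),
    }


def merged_stats(rows):
    """Apply generation_stats to every row (one row per generation)."""
    return [generation_stats(row) for row in rows]


# Two generations, three independent executions each (made-up values).
rows = [[4.0, 2.0, 6.0], [1.0, 3.0, 2.0]]
for generation, stats in enumerate(merged_stats(rows)):
    print(generation, stats["median"], stats["ptp"])
```

Writing the `median` column of every row to a file, one value per line, would give a series that can be plotted against the generation number.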
== Implemented Algorithms
- AFPO
- ALPS
- DPSBR
- FOCUS
- GMDGP
- KNOVELTY
- LEXICASE
- GPDMD
- SIS
- STDGP
== Options
[source,scheme]
include::example/config.ss[]
== Updates
- V0.3
** Better configuration object
- V0.2
** Evolution through `Evolution` object
- V0.1
** Implementation of 10 algorithms
== Contact Information
== Contributing
The GPCR repository is hosted at nifr91/genetic-programming on GitLab.
Read the general Contributing guide, and then:
. Fork it
. Create your method branch (`git checkout -b new-method`)
. Commit your changes (`git commit -am 'Add a new method'`)
. Push to the branch (`git push origin new-method`)
. Create a new Pull Request
MIT License