genetic-programming

A basic Genetic Programming framework

= Genetic Programming Framework
Ricardo Nieto <nifr91@gmail.com>
1.0, Oct 21, 2020
:toc:
:stem: latexmath
:prewrap!:
:source-highlighter: rouge

Implementation of different tree-based genetic programming meta-heuristics.

This version is the snapshot of the project for the published article:

[options="noheader"]
[.center]
[cols="10,<~"]
|===
|Title   | DMD-GP: A genetic programming variant with Dynamic Management of Diversity
|Authors | R. Nieto-Fuentes, Carlos Segura
|Year    | 2021
|DOI     |
|===

The paper proposes a novel GP algorithm named GMD-GP and compares it to 9 algorithms.

Experimental validation showed that GMD-GP improves the fitness (quality) of the solutions when compared to standard (vanilla) GP, and is competitive with state-of-the-art algorithms on the benchmark problem of Symbolic Regression (SR).

This repository contains the source code of the implemented methods and the datasets used in training and validation.

== Install

=== Prerequisites

=== Source Code

To install from source code, clone the repository and build the application:

[source,sh]
----
git clone https://gitlab.com/nifr91/genetic-programming gp
----

Then change into the folder and compile the apps:

[source,sh]
----
cd gp
shards build --release gpdmd.app
----

This generates an executable gpdmd.app in the bin folder.

== Usage

The program solves a Symbolic Regression problem by fitting an s-expression to data points. It expects the data in two files: one with the input variables (xtrain) and one with the expected values (ytrain).

.Basic usage
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y
----

=== Example

Suppose we have data from the following model:

[stem]
++++
y(x) = 2x + 0.5
++++

stored in the following two files, data.x and data.y:

[options="header"]
[cols="3*a"]
|===
^| File ^| File ^| Graph stem:[y(x)]

|.data.x
[source,txt]
----
include::example/data.x[]
----

|.data.y
[source,txt]
----
include::example/data.y[]
----

.^|image::docs/readme/data.svg[]
|===

Solve the regression problem with different options:

.Default values
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y

((train-fit 0.0) (size 13) (best-sol (+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))))
----


.Train for 500 evaluations
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --max-evals 500

((train-fit 0.0) (size 17) (best-sol (+ (+ (/ 1 2) x0) (+ (* (+ x0 (- 1 (- 3 x0))) 0) x0))))
----


.Use only term set [x,int] and function set [+]
[source,sh]
----
bin/gpdmd.app --config-file=example/config.ss --xtrain-file=example/data.x --ytrain-file=example/data.y --terms=int --funs=+

((train-fit 0.25) (size 3) (best-sol (+ x0 x0)))
----
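The three reported solutions can be checked against the model stem:[y(x) = 2x + 0.5]. The Python sketch below is not part of the framework: it is a minimal s-expression evaluator that assumes the printed operators are plain binary arithmetic (the framework may, for example, protect division differently) and that x0 is the only input variable. The names `tokenize`, `parse`, `evaluate` and `model` are illustrative.

```python
from fractions import Fraction


def tokenize(text):
    """Split an s-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Build a nested list from the token list (tokens are consumed in place)."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return node
    return token


# Assumption: ordinary arithmetic, no protected division.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate(node, x0):
    """Evaluate a parsed expression for a given value of x0."""
    if isinstance(node, list):
        op, left, right = node
        return OPS[op](evaluate(left, x0), evaluate(right, x0))
    if node == "x0":
        return x0
    return Fraction(node)  # numeric constant, kept exact


def model(x):
    """The model used to generate the example data."""
    return 2 * x + Fraction(1, 2)


default_sol = parse(tokenize("(+ (+ (+ x0 (/ 3 9)) x0) (/ (/ 6 9) 4))"))
evals_sol = parse(tokenize("(+ (+ (/ 1 2) x0) (+ (* (+ x0 (- 1 (- 3 x0))) 0) x0))"))
plus_only_sol = parse(tokenize("(+ x0 x0)"))

for x in map(Fraction, range(-3, 4)):
    assert evaluate(default_sol, x) == model(x)
    assert evaluate(evals_sol, x) == model(x)
    # With only [x,int] and [+] the solution misses the constant term.
    assert model(x) - evaluate(plus_only_sol, x) == Fraction(1, 2)
```

The first two solutions reproduce the model exactly, which agrees with their train-fit of 0.0. The third one is off by a constant 0.5, so its squared error is 0.25 at every point; this is consistent with (but does not prove) a mean-squared-error fitness.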


== Naming convention

The scripts expect the following naming conventions:

.Output file
[source,txt]
----
data--gpdmd--example.out--00
|     |      |       |    |___> Independent execution sequential id
|     |      |       |________> Extension for output data
|     |      |________________> Experiment id
|     |_______________________> Method name (id)
|_____________________________> Dataset (problem) id
----

.Population Evolution Metrics
[source,txt]
----
data--gpdmd--example--entropy.evo--00
|     |      |        |       |    |___> Independent execution sequential id
|     |      |        |       |________> Extension for evolution data
|     |      |        |________________> Evolution metric of the population
|     |      |_________________________> Experiment id
|     |________________________________> Method name (id)
|______________________________________> Dataset (problem) id
----

.Merged independent executions values
[source,txt]
----
data--gpdmd--example--train-fit.merge
|     |      |        |         |___> Extension for merged data
|     |      |        |_____________> Merged metric
|     |      |______________________> Id for the experiment
|     |_____________________________> Method name
|___________________________________> Problem/dataset name
----

.Population Evolution Statistics
[source,txt]
----
data--gpdmd--example--entropy.median
|     |      |        |       |___> Extension for the statistic
|     |      |        |___________> Merged metric
|     |      |____________________> Id for the experiment
|     |___________________________> Method name
|_________________________________> Problem/dataset name
----

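.SVG
[source,txt]
----
data--example--entropy-median--lineplot.svg
|     |        |               |        |___> Extension for svg
|     |        |               |____________> Plot-type
|     |        |____________________________> Merged metric
|     |_____________________________________> Id for the experiment
|___________________________________________> Problem/dataset name
----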
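As an illustration of the convention, the following Python sketch splits a file name into its fields. It is a hypothetical helper, not one of the repository scripts, and it assumes that fields are separated by a double hyphen, that no field itself contains a double hyphen, and that a trailing all-digit field is the execution id. It covers the output, evolution, merged and statistic names; the SVG names omit the method field and are not handled.

```python
def parse_name(name):
    """Split a result file name into its fields (illustrative helper)."""
    parts = name.split("--")
    dataset, method, rest = parts[0], parts[1], parts[2:]
    fields = {"dataset": dataset, "method": method}

    # A trailing all-digit field is the independent execution id.
    if rest and rest[-1].isdigit():
        fields["run"] = rest.pop()

    # The last remaining field carries the extension after its final dot.
    stem, _, extension = rest[-1].rpartition(".")
    fields["extension"] = extension

    if len(rest) == 1:
        # data--gpdmd--example.out--00
        fields["experiment"] = stem
    else:
        # data--gpdmd--example--entropy.evo--00
        fields["experiment"] = rest[0]
        fields["metric"] = stem
    return fields


print(parse_name("data--gpdmd--example.out--00"))
print(parse_name("data--gpdmd--example--entropy.evo--00"))
print(parse_name("data--gpdmd--example--train-fit.merge"))
```

Because metrics such as train-fit contain a single hyphen, splitting on the double hyphen keeps them intact.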

== Results Comparison

The comparison between methods is carried out by applying a set of statistical tests to at least 30 independent executions with the same datasets as input.

The script compare-dist.py applies the following tests, assuming a significance level of 5%. First, a Shapiro-Wilk test is applied to verify whether the results follow a Gaussian distribution. If so, the Levene test is used to check for homogeneity of the variances. When the variances are equal, an ANOVA test is done; if they differ, a Welch test is performed. For non-Gaussian distributions, the non-parametric Kruskal-Wallis test is used.

Therefore the statement "algorithm A is superior to algorithm B" means that the differences between them are statistically significant and that the median obtained by A is higher than the median achieved by B.
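The decision flow described above can be sketched in Python as follows. This is not the code of compare-dist.py; it only encodes which test is selected, given p-values that are assumed to have been computed elsewhere (for instance with SciPy's `scipy.stats.shapiro` and `scipy.stats.levene`). The function names `choose_test` and `is_superior` are hypothetical.

```python
def choose_test(normality_pvalues, levene_pvalue, alpha=0.05):
    """Return the name of the comparison test selected by the flow.

    normality_pvalues: one Shapiro-Wilk p-value per sample (assumed input).
    levene_pvalue: p-value of the Levene test on the same samples.
    """
    # A p-value below alpha rejects normality for that sample.
    if any(p < alpha for p in normality_pvalues):
        return "kruskal-wallis"
    # A p-value below alpha rejects the hypothesis of equal variances.
    if levene_pvalue < alpha:
        return "welch"
    return "anova"


def is_superior(median_a, median_b, comparison_pvalue, alpha=0.05):
    """A is superior to B as stated in the text: a statistically
    significant difference and a higher median for A."""
    return comparison_pvalue < alpha and median_a > median_b


print(choose_test([0.40, 0.31], levene_pvalue=0.62))   # anova
print(choose_test([0.40, 0.31], levene_pvalue=0.01))   # welch
print(choose_test([0.40, 0.002], levene_pvalue=0.62))  # kruskal-wallis
```

Keeping the selection separate from the computation of the p-values makes the 5% threshold explicit and easy to change.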

. Compile the reference method with the --release flag.
+
[source,sh]
----
shards build --release gpdmd.app
----

. Compile the new method

. Create an output directory out and directory best

. Then run the reference method at least 30 times
+
[source,sh]
----
parallel --eta --joblog gpdmd.log \
"bin/gpdmd.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
> out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)
----


. Then run the new method the same number of times as the reference method
+
[source,sh]
----
parallel --eta --joblog newmethod.log \
"./newmethod \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
> out/data--newmethod--t1.out--{}" \
::: $(seq -w 30)
----
+
The last line of the stdout must be a Lisp-like association list with the fields (train-fit numeric-value), (val-fit numeric-value) and (best-sol (s-expression of solution)).
+
[source,scheme]
----
...
((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))
----

. Merge the independent values into a single file using the script best-merge-metrics.rb
+
[source,sh]
----
ls -d out/data--gpdmd--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best
ls -d out/data--stdgp--t1.out* | bin/best-merge-metrics.rb --if=- --out-dir=best
----
+
This creates a folder best where the merged files of each field (train, val and size) are written.

. Compare the methods with the script compare-methods.rb
+
[source,sh]
----
ls -d best/data--*--t1--train-fit.merge | bin/compare-methods.rb --if=- > best/train-fit.adoc
ls -d best/data--*--t1--val-fit.merge | bin/compare-methods.rb --if=- > best/val-fit.adoc
ls -d best/data--*--t1--size.merge | bin/compare-methods.rb --if=- > best/size.adoc
----
+
This generates an asciidoc file inside the folder best with the pairwise comparisons and the comparison of medians, like the following.
+
.Medians of each method for the problems
[width="90%"]
[.center]
|===
|     |gpdmd |stdgp
|data |[red]#0.018689# |0.025253
|===
+
.Methods pairwise comparison (stem:[\uparrow] better, = equal, stem:[\downarrow] worse)
[width="90%"]
[cols="<.,^.,^.,^."]
[.center]
|===
| |gpdmd |stdgp .2+^.^|stem:[\sum\Delta\uparrow\downarrow]
| |stem:[\uparrow]:==:stem:[\downarrow] |stem:[\uparrow]:==:stem:[\downarrow]
|gpdmd | - |01:00:00 |1
|stdgp |00:00:01 | - |-1
|===
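The final stdout line required from each method in the steps above can be read with a few lines of Python. The sketch below is not the parser used by best-merge-metrics.rb; it is a hypothetical helper (`parse_result_line`) that assumes the numeric fields are plain decimal literals and that the best-sol field has balanced parentheses.

```python
import re


def parse_result_line(line):
    """Extract train-fit, val-fit and best-sol from the final stdout line."""
    result = {}

    # Numeric fields look like "(train-fit 0.1)".
    for field in ("train-fit", "val-fit"):
        match = re.search(r"\(" + re.escape(field) + r"\s+([^\s()]+)\)", line)
        if match:
            result[field] = float(match.group(1))

    # best-sol holds a nested s-expression, so match parentheses by depth.
    prefix = "(best-sol"
    start = line.find(prefix)
    if start != -1:
        depth = 0
        for index in range(start, len(line)):
            if line[index] == "(":
                depth += 1
            elif line[index] == ")":
                depth -= 1
                if depth == 0:
                    # Keep only the text between the prefix and its closing ")".
                    result["best-sol"] = line[start + len(prefix):index].strip()
                    break
    return result


line = "((train-fit 0.1) (val-fit 0.1) (best-sol (+ x0 x0)))"
print(parse_result_line(line))
```

A new method only needs to print such a line last for the merge and comparison scripts to be applicable, according to the requirement stated above.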

== Population Dynamics

For the population dynamics analysis, the population is printed as a Lisp-like expression on a single line.

. Compile the methods with the --release flag.
+
[source,sh]
----
shards build --release gpdmd.app
shards build --release lexicase.app
----

. Create an output directory out and directory pop

. Then run the reference method at least 30 times; remember to enable the population log.
+
[source,sh]
----
parallel --eta --joblog gpdmd.log \
"bin/gpdmd.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
--log-conf \
--log-evo \
> out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)

parallel --eta --joblog lexicase.log \
"bin/lexicase.app \
--config-file=example/config.ss \
--xtrain-file=example/data.x --ytrain-file=example/data.y \
--xval-file=example/data.x --yval-file=example/data.y \
--log-conf \
--log-evo \
> out/data--lexicase--t1.out--{}" \
::: $(seq -w 30)
----
+
The logs for configuration and population must be enabled. The first line must be the configuration; each following line corresponds to one iteration and holds a Lisp-like expression with at least the fields (generation integer), (time seconds), (evaluations integer) and (population (indv0 indv1 indv2)).
+
[source,txt]
----
((configuration "v1.0") (run-name "run-name") ... )
((generation 0) (time 0.0) (evaluations 200)(population ((...) (...) ...)) ...)
((generation 1) (time 1.2) (evaluations 400)(population ((...) (...) ...)) ...)
----

. Run the script population-metrics to evaluate the population for each file
+
[source,sh]
----
parallel --eta --joblog popmet.log \
"bin/population-metrics.app \
--out-dir=pop out/data--gpdmd--t1.out--{}" \
::: $(seq -w 30)

parallel --eta --joblog popmet.log \
"bin/population-metrics.app \
--out-dir=pop out/data--lexicase--t1.out--{}" \
::: $(seq -w 30)
----
+
For each metric and each independent execution, it outputs a file in which each line holds the metric of the population at one generation.

. For each metric, merge the evolutions of the independent executions
+
[source,sh]
----
ls -d pop/*gpdmd*avg-dcn-edit2.evo* pop/*gpdmd*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/*gpdmd*max-mark-den.evo* pop/*gpdmd*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-

ls -d pop/*lexicase*avg-dcn-edit2.evo* pop/*lexicase*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
ls -d pop/*lexicase*max-mark-den.evo* pop/*lexicase*gens.evo* | bin/population-merge-metrics.rb --out-dir=pop --rng-metric=gens --begin=0 --end=10 --if=-
----


. For each merged file, extract the statistics of interest; this produces files with the extensions *.median, *.mean, *.var, *.ptp, *.min and *.max.
+
[source,sh]
----
bin/population-stats.rb --if=pop/data--gpdmd--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--gpdmd--t1--avg-dcn-edit2.merge --out-dir=pop

bin/population-stats.rb --if=pop/data--lexicase--t1--max-mark-den.merge --out-dir=pop
bin/population-stats.rb --if=pop/data--lexicase--t1--avg-dcn-edit2.merge --out-dir=pop
----

. Finally, plot how the metric changes with the rng-metric (time); each generated file holds, per line, the statistic over the 30 independent executions.
+
[source,sh]
----
gem install svg-lib
mkdir -p svg
eplot pop/data--*--avg-dcn-edit2.median > svg/data--avg-dcn-edit2--lineplot.svg
----

[width=50%]
[align="center"]
image::docs/readme/data--avg-dcn-edit2--lineplot.svg[]
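The statistics step above reduces the values of the independent executions to a single number per line. The Python sketch below illustrates that reduction; it is not the code of population-stats.rb. It assumes that each line of a merged file holds whitespace-separated values, one per execution, and that the variance is the population variance; both are assumptions about a format this README does not spell out. The names `row_stats` and `merged_stats` are hypothetical.

```python
import statistics


def row_stats(values):
    """Summarise the values of all independent executions for one line."""
    return {
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "var": statistics.pvariance(values),  # assumption: population variance
        "ptp": max(values) - min(values),     # peak to peak
        "min": min(values),
        "max": max(values),
    }


def merged_stats(lines):
    """Apply row_stats to every non-empty line of a merged file."""
    return [
        row_stats([float(token) for token in line.split()])
        for line in lines
        if line.strip()
    ]


# Toy input: two lines with four executions each (made-up numbers).
example = ["1.0 2.0 3.0 6.0", "0.5 0.5 1.5 1.5"]
for stats in merged_stats(example):
    print(stats)
```

Each key of the returned dictionary corresponds to one of the extensions listed in the statistics step.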

== Implemented Algorithms

  • AFPO
  • ALPS
  • DPSBR
  • FOCUS
  • GMDGP
  • KNOVELTY
  • LEXICASE
  • GPDMD
  • SIS
  • STDGP

== Options

[source,scheme]
----
include::example/config.ss[]
----

== Updates

* V0.3
** Better configuration object
* V0.2
** Evolution through Evolution object
* V0.1
** Implementation of 10 algorithms

== Contact Information

nifr91@gmail.com

== Contributing

The GPCR repository is hosted at nifr91/genetic-programming on GitLab.

Read the general Contributing guide, and then:

. Fork it
. Create your method branch (git checkout -b new-method)
. Commit your changes (git commit -am 'Add a new method')
. Push to the branch (git push origin new-method)
. Create a new Pull Request

License

MIT License
