Example for full simulation loop using a table-based lookup mechanism with initial data

This example simulates a direct arylation reaction for which all parameter combinations have been measured. It also demonstrates how to supply initial data in combination with a lookup mechanism, which gives access to previously conducted experiments stored in .xlsx files.

This example assumes some basic familiarity with BayBE and the lookup mechanism. See campaign for a basic example and full_lookup for details on the lookup mechanism.

Necessary imports for this example

import os
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from baybe import Campaign
from baybe.objectives import SingleTargetObjective
from baybe.parameters import NumericalDiscreteParameter, SubstanceParameter
from baybe.recommenders import RandomRecommender, TwoPhaseMetaRecommender
from baybe.searchspace import SearchSpace
from baybe.simulation import simulate_scenarios
from baybe.targets import NumericalTarget

Parameters for a full simulation loop

For the full simulation, we need to define some additional parameters. Since this example uses initial data, we only need to define the number of iterations per run and the batch size. The number of runs is determined by the number of initial data sets provided.

SMOKE_TEST = "SMOKE_TEST" in os.environ
N_DOE_ITERATIONS = 2 if SMOKE_TEST else 5
BATCH_SIZE = 1 if SMOKE_TEST else 3

Lookup functionality and data creation

See full_lookup for details.

try:
    lookup = pd.read_excel("./lookup.xlsx")
except FileNotFoundError:
    try:
        lookup = pd.read_excel("examples/Backtesting/lookup.xlsx")
    except FileNotFoundError as e:
        # Without the lookup table the simulation cannot run: report and re-raise.
        print(e)
        raise
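To illustrate the idea behind a table-based lookup, here is a minimal sketch using only pandas. It is not BayBE's internal implementation. The miniature table, the recommended batch and the names lookup_demo, recommended and measured are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical miniature lookup table; the real one is read from lookup.xlsx.
lookup_demo = pd.DataFrame(
    {
        "Solvent": ["DMAc", "DMAc", "p-Xylene"],
        "Temp_C": [90, 105, 90],
        "yield": [5.5, 12.0, 0.7],
    }
)

# Hypothetical batch of recommended parameter settings (no target column yet).
recommended = pd.DataFrame({"Solvent": ["p-Xylene", "DMAc"], "Temp_C": [90, 105]})

# Attach the measured target by joining on the shared parameter columns.
# A left join keeps the order of the recommended rows.
measured = recommended.merge(lookup_demo, on=["Solvent", "Temp_C"], how="left")
print(measured)
```

Each recommended row picks up the yield recorded for the same parameter combination, which is what lets a simulation stand in for a real experiment.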

Inclusion of initial data

To include initial data, we sample some rows from the lookup table. Note that initial_data needs to be a list of pd.DataFrame objects. One simulation run will be performed per provided initial data set.

initial_data = [lookup.sample(n=5), lookup.sample(n=5), lookup.sample(n=5)]
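The sampling above is random, so each execution draws different initial data. If repeatable runs are desired, one option is to pass a seed to pandas' sample method. The following sketch assumes a small stand-in table, and the names lookup_demo, N_INITIAL_SETS, N_ROWS and initial_data_demo are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the lookup table loaded from lookup.xlsx.
lookup_demo = pd.DataFrame({"Temp_C": [90, 105, 120] * 4, "yield": range(12)})

N_INITIAL_SETS = 3  # hypothetical name: number of initial data sets (= runs)
N_ROWS = 5  # rows per initial data set, as in the example

# One seed per data set keeps the sets distinct from each other but repeatable.
initial_data_demo = [
    lookup_demo.sample(n=N_ROWS, random_state=seed) for seed in range(N_INITIAL_SETS)
]
```

Using a different seed per data set avoids drawing the same rows three times, while rerunning the script still reproduces the same sets.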

As usual, we set up the experiment. Note that the parameter names and labels must now match those used in the provided .xlsx file!

dict_solvent = {
    "DMAc": r"CC(N(C)C)=O",
    "Butyornitrile": r"CCCC#N",
    "Butyl Ester": r"CCCCOC(C)=O",
    "p-Xylene": r"CC1=CC=C(C)C=C1",
}
dict_base = {
    "Potassium acetate": r"O=C([O-])C.[K+]",
    "Potassium pivalate": r"O=C([O-])C(C)(C)C.[K+]",
    "Cesium acetate": r"O=C([O-])C.[Cs+]",
    "Cesium pivalate": r"O=C([O-])C(C)(C)C.[Cs+]",
}
dict_ligand = {
    "BrettPhos": r"CC(C)C1=CC(C(C)C)=C(C(C(C)C)=C1)C2=C(P(C3CCCCC3)C4CCCCC4)C(OC)="
    "CC=C2OC",
    "Di-tert-butylphenylphosphine": r"CC(C)(C)P(C1=CC=CC=C1)C(C)(C)C",
    "(t-Bu)PhCPhos": r"CN(C)C1=CC=CC(N(C)C)=C1C2=CC=CC=C2P(C(C)(C)C)C3=CC=CC=C3",
    "Tricyclohexylphosphine": r"P(C1CCCCC1)(C2CCCCC2)C3CCCCC3",
    "PPh3": r"P(C1=CC=CC=C1)(C2=CC=CC=C2)C3=CC=CC=C3",
    "XPhos": r"CC(C1=C(C2=CC=CC=C2P(C3CCCCC3)C4CCCCC4)C(C(C)C)=CC(C(C)C)=C1)C",
    "P(2-furyl)3": r"P(C1=CC=CO1)(C2=CC=CO2)C3=CC=CO3",
    "Methyldiphenylphosphine": r"CP(C1=CC=CC=C1)C2=CC=CC=C2",
    "1268824-69-6": r"CC(OC1=C(P(C2CCCCC2)C3CCCCC3)C(OC(C)C)=CC=C1)C",
    "JackiePhos": r"FC(F)(F)C1=CC(P(C2=C(C3=C(C(C)C)C=C(C(C)C)C=C3C(C)C)C(OC)=CC=C2OC)"
    r"C4=CC(C(F)(F)F)=CC(C(F)(F)F)=C4)=CC(C(F)(F)F)=C1",
    "SCHEMBL15068049": r"C[C@]1(O2)O[C@](C[C@]2(C)P3C4=CC=CC=C4)(C)O[C@]3(C)C1",
    "Me2PPh": r"CP(C)C1=CC=CC=C1",
}
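Because the labels have to match the lookup table exactly, a quick consistency check can catch mismatches early. The sketch below uses a made-up table excerpt and dictionary. The names lookup_demo, dict_solvent_demo and unknown_labels, as well as the "Toluene" entry, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical excerpt of a lookup table and of a label-to-SMILES dictionary.
lookup_demo = pd.DataFrame({"Solvent": ["DMAc", "p-Xylene", "Toluene"]})
dict_solvent_demo = {"DMAc": "CC(N(C)C)=O", "p-Xylene": "CC1=CC=C(C)C=C1"}

# Labels that occur in the table but have no entry in the dictionary.
unknown_labels = sorted(set(lookup_demo["Solvent"]) - set(dict_solvent_demo))
print(unknown_labels)
```

An empty list means every label in the table column is covered by the dictionary. The same check can be repeated for each substance column.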

Creating the searchspace and the objective

Here, we create the parameter objects, the searchspace and the objective.

base = SubstanceParameter(name="Base", data=dict_base, encoding="MORDRED")
solvent = SubstanceParameter(name="Solvent", data=dict_solvent, encoding="MORDRED")
ligand = SubstanceParameter(name="Ligand", data=dict_ligand, encoding="MORDRED")
temperature = NumericalDiscreteParameter(
    name="Temp_C", values=[90, 105, 120], tolerance=2
)
concentration = NumericalDiscreteParameter(
    name="Concentration", values=[0.057, 0.1, 0.153], tolerance=0.005
)
parameters = [solvent, base, ligand, temperature, concentration]
searchspace = SearchSpace.from_product(parameters=parameters)
objective = SingleTargetObjective(target=NumericalTarget(name="yield", mode="MAX"))
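As a sanity check on the size of the search space, the number of candidates can be worked out by hand. This sketch assumes that the product search space contains every combination of the parameter values defined above, with no constraints applied. The names n_values and n_candidates are hypothetical.

```python
import math

# Number of allowed values per parameter, counted from the definitions above.
n_values = {"Solvent": 4, "Base": 4, "Ligand": 12, "Temp_C": 3, "Concentration": 3}

# A full Cartesian product contains one row per combination of values.
n_candidates = math.prod(n_values.values())
print(n_candidates)  # 1728
```

If all combinations have been measured, as stated at the beginning of this example, the lookup table would be expected to cover the same number of distinct parameter combinations.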

Constructing campaigns for the simulation loop

In this example, we create two campaigns. One uses the default recommender and the other one makes random recommendations.

campaign = Campaign(searchspace=searchspace, objective=objective)
campaign_rand = Campaign(
    searchspace=searchspace,
    recommender=TwoPhaseMetaRecommender(recommender=RandomRecommender()),
    objective=objective,
)

Performing the simulation loop

We can now use the simulate_scenarios function to simulate a full experiment. This function is where we provide initial_data, the list of dataframes created above. Note that this function can run multiple scenarios in a single call. For this, it is necessary to define a dictionary mapping scenario names to campaigns.

scenarios = {"Test_Scenario": campaign, "Random": campaign_rand}
results = simulate_scenarios(
    scenarios,
    lookup,
    batch_size=BATCH_SIZE,
    n_doe_iterations=N_DOE_ITERATIONS,
    initial_data=initial_data,
)
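Before plotting, it can help to condense the results into a small table. The following sketch averages the best-so-far yield over runs on a made-up result table. The column names Scenario, Num_Experiments and yield_CumBest are taken from the plotting code in this example, whereas results_demo, summary, the "Run" column and all numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature result table; "Run" is an illustrative column name.
results_demo = pd.DataFrame(
    {
        "Scenario": ["Test_Scenario"] * 4 + ["Random"] * 4,
        "Run": [0, 0, 1, 1] * 2,
        "Num_Experiments": [3, 6] * 4,
        "yield_CumBest": [40.0, 80.0, 60.0, 90.0, 20.0, 30.0, 40.0, 50.0],
    }
)

# Average the best-so-far yield over runs, per scenario and experiment count.
summary = (
    results_demo.groupby(["Scenario", "Num_Experiments"])["yield_CumBest"]
    .mean()
    .unstack("Scenario")
)
print(summary)
```

The resulting table has one row per experiment count and one column per scenario, which makes the gap between the scenarios easy to read off.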

The following lines plot the results and save the plot to run_full_initial_data.png.

max_yield = lookup["yield"].max()
sns.lineplot(
    data=results, x="Num_Experiments", y="yield_CumBest", hue="Scenario", marker="x"
)
# Dashed reference line at the best yield in the lookup table.
plt.plot([BATCH_SIZE, BATCH_SIZE * N_DOE_ITERATIONS], [max_yield, max_yield], "--r")
plt.legend(loc="lower right")
plt.gcf().set_size_inches(20, 8)
plt.savefig("./run_full_initial_data.png")