foundry_dev_tools.utils.caches.spark_caches module

DataFrame cache implementations.

File-based Spark cache

class foundry_dev_tools.utils.caches.spark_caches.DiskPersistenceBackedSparkCache

Bases: MutableMapping[DatasetIdentity, pyspark.sql.DataFrame]

A cache that stores Spark DataFrames inside a directory.
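
Because the class derives from MutableMapping, entries are read and written with subscript syntax. The toy sketch below illustrates that pattern with plain text files instead of Spark DataFrames. The class name `DirectoryBackedCache`, the `.txt` suffix and the directory layout are illustrative assumptions and not the library's implementation.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class DirectoryBackedCache(MutableMapping):
    """Hypothetical stand-in: stores text payloads, not Spark DataFrames."""

    def __init__(self, cache_dir):
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _file_for(self, identity):
        # Assumed layout: <cache_dir>/<dataset_rid>/<last_transaction_rid>.txt
        return self._dir / identity["dataset_rid"] / (identity["last_transaction_rid"] + ".txt")

    def __setitem__(self, identity, payload):
        target = self._file_for(identity)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(payload)

    def __getitem__(self, identity):
        target = self._file_for(identity)
        if not target.is_file():
            raise KeyError(identity["dataset_rid"])
        return target.read_text()

    def __delitem__(self, identity):
        target = self._file_for(identity)
        if not target.is_file():
            raise KeyError(identity["dataset_rid"])
        target.unlink()

    def __iter__(self):
        # Rebuild a minimal identity from the file location.
        for entry in sorted(self._dir.glob("*/*.txt")):
            yield {"dataset_rid": entry.parent.name, "last_transaction_rid": entry.stem}

    def __len__(self):
        return sum(1 for _ in self._dir.glob("*/*.txt"))


with tempfile.TemporaryDirectory() as tmp:
    cache = DirectoryBackedCache(tmp)
    identity = {"dataset_rid": "dataset-rid", "last_transaction_rid": "transaction1"}
    cache[identity] = "col_a,col_b\n1,2\n"   # write through the mapping interface
    stored = cache[identity]                  # read it back
    count_after_write = len(cache)
    del cache[identity]
    count_after_delete = len(cache)
```

The keys here are plain dicts, which works because the mapping never hashes them; it only reads their fields to compute a file location.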

__init__(ctx)
Parameters:

ctx (FoundryContext)

set_item_metadata(path, dataset_identity, schema)

Writes the schema and the metadata.json entry for a cached dataset.

Use this when files are added to the cache directly, without calling cache[entry] = df.

Parameters:
  • path (Path) – direct path to transaction folder, e.g. /…/dss-rid/transaction1.parquet

  • dataset_identity (api_types.DatasetIdentity) – dataset_identity with keys dataset_rid, last_transaction_rid, dataset_path

  • schema (api_types.FoundrySchema | None) – Spark schema, Foundry schema, or None

Return type:

None
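
As a rough illustration of what writing a schema and a metadata.json entry could involve, the sketch below uses only the standard library. The helper name `write_item_metadata`, the location of metadata.json in the cache root, the placement of `_schema.json` inside the transaction folder, and keying the metadata by dataset_rid are all assumptions made for this example. The library's actual file layout may differ.

```python
import json
import tempfile
from pathlib import Path


def write_item_metadata(cache_dir, transaction_dir, dataset_identity, schema):
    """Hypothetical helper: persist a schema file and a metadata entry."""
    transaction_dir = Path(transaction_dir)
    if schema is not None:
        # Assumption: the schema sits next to the data files.
        (transaction_dir / "_schema.json").write_text(json.dumps(schema))

    # Assumption: one metadata.json per cache, keyed by dataset_rid.
    metadata_file = Path(cache_dir) / "metadata.json"
    metadata = json.loads(metadata_file.read_text()) if metadata_file.is_file() else {}
    metadata[dataset_identity["dataset_rid"]] = dataset_identity
    metadata_file.write_text(json.dumps(metadata, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    identity = {
        "dataset_rid": "dataset-rid",
        "last_transaction_rid": "transaction1",
        "dataset_path": "/example/path/to/dataset",  # placeholder value
    }
    transaction_dir = cache_dir / identity["dataset_rid"] / (identity["last_transaction_rid"] + ".parquet")
    transaction_dir.mkdir(parents=True)  # files were copied here by some other means

    write_item_metadata(cache_dir, transaction_dir, identity, schema={"fieldSchemaList": []})

    schema_written = (transaction_dir / "_schema.json").is_file()
    metadata = json.loads((cache_dir / "metadata.json").read_text())
```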

get_path_to_local_dataset(dataset_identity)

Returns local path to dataset.

Parameters:

dataset_identity (dict) – dataset identity with keys dataset_rid and last_transaction_rid

Returns:

path to dataset in cache

Return type:

str

get_cache_dir()

Returns cache directory.

Returns:

path to temporary cache directory

Return type:

str

get_dataset_identity_not_branch_aware(dataset_path_or_rid)

Returns the complete identity if the dataset is in the cache; otherwise raises KeyError.

Parameters:

dataset_path_or_rid (str) – path to dataset in Foundry Compass or dataset_rid

Returns:

dataset identity

Return type:

dict

Raises:

KeyError – if no dataset matching dataset_path_or_rid is in the cache
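
The lookup contract described above (match on either the Compass path or the dataset RID, raise KeyError on a miss) can be sketched as follows. `find_identity` and the in-memory list of identities are hypothetical. How the library actually stores and searches its cached identities is not shown in this reference.

```python
def find_identity(cached_identities, dataset_path_or_rid):
    """Hypothetical lookup that ignores branches and matches path or RID."""
    for identity in cached_identities:
        if dataset_path_or_rid in (identity["dataset_rid"], identity["dataset_path"]):
            return identity
    raise KeyError(dataset_path_or_rid)


# Placeholder identities for illustration only.
cached = [
    {
        "dataset_rid": "dataset-rid",
        "last_transaction_rid": "transaction1",
        "dataset_path": "/example/path/to/dataset",
    }
]

by_rid = find_identity(cached, "dataset-rid")
by_path = find_identity(cached, "/example/path/to/dataset")

try:
    find_identity(cached, "/not/in/cache")
    missing_raised = False
except KeyError:
    missing_raised = True
```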

dataset_has_schema(dataset_identity)

Checks if dataset stored in cache folder has _schema.json attached.

Parameters:

dataset_identity (DatasetIdentity) – dataset identity

Returns:

True if the dataset has a schema, False otherwise

Return type:

bool
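
A check of this kind can be reduced to a file-existence test. The sketch below assumes that `_schema.json` sits directly inside the dataset's transaction folder, and `has_schema_file` is a hypothetical helper rather than part of the library.

```python
import tempfile
from pathlib import Path


def has_schema_file(transaction_dir):
    """Hypothetical check: is there a _schema.json in the transaction folder?"""
    return (Path(transaction_dir) / "_schema.json").is_file()


with tempfile.TemporaryDirectory() as tmp:
    transaction_dir = Path(tmp) / "dataset-rid" / "transaction1.parquet"
    transaction_dir.mkdir(parents=True)

    before = has_schema_file(transaction_dir)   # no schema written yet
    (transaction_dir / "_schema.json").write_text("{}")
    after = has_schema_file(transaction_dir)    # schema file now present
```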

foundry_dev_tools.utils.caches.spark_caches.get_dataset_path(cache_dir, dataset_identity)

Returns the local directory that holds the dataset's last transaction.

Parameters:
  • cache_dir (str) – the base directory of the cache

  • dataset_identity (dict) – the identity of the dataset, containing dataset_rid and last_transaction_rid

Returns:

local path of the last transaction of the dataset

Return type:

str
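
The path example given for set_item_metadata (`/…/dss-rid/transaction1.parquet`) suggests a layout of cache directory, then dataset RID, then the transaction RID with a `.parquet` suffix. A minimal sketch under that assumption follows. `dataset_path` is a hypothetical re-implementation for illustration, and it uses POSIX separators so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def dataset_path(cache_dir, dataset_identity):
    """Hypothetical: <cache_dir>/<dataset_rid>/<last_transaction_rid>.parquet"""
    return str(
        PurePosixPath(
            cache_dir,
            dataset_identity["dataset_rid"],
            dataset_identity["last_transaction_rid"] + ".parquet",
        )
    )


local_path = dataset_path(
    "/tmp/cache",
    {"dataset_rid": "dataset-rid", "last_transaction_rid": "transaction1"},
)
# local_path == "/tmp/cache/dataset-rid/transaction1.parquet"
```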