foundry_dev_tools.utils.caches.spark_caches module

DataFrame cache implementations.

File-based Spark cache.

class foundry_dev_tools.utils.caches.spark_caches.DiskPersistenceBackedSparkCache

A cache that stores Spark DataFrames inside a directory.

__init__(ctx)

set_item_metadata(path, dataset_identity, schema)

Writes the schema and the metadata.json entry.

Use this when files are added to the cache without calling cache[entry] = df.

Parameters:

path (Path) – direct path to transaction folder, e.g. /…/dss-rid/transaction1.parquet
dataset_identity (api_types.DatasetIdentity) – dataset_identity with keys dataset_rid, last_transaction_rid, dataset_path
schema (api_types.FoundrySchema | None) – Spark schema, FoundrySchema, or None

Return type:

None

get_path_to_local_dataset(dataset_identity)

Returns the local path to the dataset.

Parameters:

dataset_identity (dict) – with dataset_rid and last_transaction_rid

Returns:

path to dataset in cache

Return type:

str

get_cache_dir()

Returns the cache directory.

get_dataset_identity_not_branch_aware(dataset_path_or_rid)

Returns the complete identity if the dataset is in the cache; otherwise raises KeyError.

Parameters:

dataset_path_or_rid (str) – path to dataset in Foundry Compass or dataset_rid

Returns:

dataset identity

Return type:

dict

Raises:

KeyError – if dataset with dataset_path_or_rid not in cache
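
Because a missing dataset is signalled with KeyError rather than a None return value, callers that want a soft lookup have to catch the exception themselves. The following is a minimal sketch of that pattern. InMemoryIdentityLookup and find_cached_identity are hypothetical names introduced only so the example runs without a real cache; the stand-in mimics only the contract documented above, not the library's implementation.

```python
class InMemoryIdentityLookup:
    """Hypothetical stand-in that mimics only the documented contract."""

    def __init__(self, identities):
        self._identities = list(identities)

    def get_dataset_identity_not_branch_aware(self, dataset_path_or_rid):
        # Match on either the dataset RID or the Compass path.
        for identity in self._identities:
            if dataset_path_or_rid in (
                identity["dataset_rid"],
                identity["dataset_path"],
            ):
                return identity
        raise KeyError(dataset_path_or_rid)


def find_cached_identity(cache, dataset_path_or_rid):
    """Return the cached identity, or None if the dataset is not cached."""
    try:
        return cache.get_dataset_identity_not_branch_aware(dataset_path_or_rid)
    except KeyError:
        return None


# Placeholder values, not real RIDs or paths.
example_identity = {
    "dataset_rid": "example-dataset-rid",
    "last_transaction_rid": "example-transaction-rid",
    "dataset_path": "/example/path/to/dataset",
}
lookup = InMemoryIdentityLookup([example_identity])

hit = find_cached_identity(lookup, "/example/path/to/dataset")
miss = find_cached_identity(lookup, "not-in-cache")
```

Any object that exposes get_dataset_identity_not_branch_aware with the documented behaviour could be passed to find_cached_identity in place of the stand-in.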

dataset_has_schema(dataset_identity)

Checks whether the dataset stored in the cache folder has a _schema.json attached.
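
A check of this kind can be sketched as a plain file-existence test. The example below assumes that _schema.json sits directly inside the dataset's transaction folder; that location is an assumption, as is the helper name sketch_has_schema, and neither is taken from the library's source.

```python
import tempfile
from pathlib import Path


def sketch_has_schema(transaction_dir):
    """Hypothetical helper: True if a _schema.json file exists in the folder."""
    # Assumption: the schema file lives directly inside the transaction folder.
    return (Path(transaction_dir) / "_schema.json").is_file()


with tempfile.TemporaryDirectory() as tmp:
    transaction_dir = Path(tmp)
    before = sketch_has_schema(transaction_dir)  # no schema file written yet
    (transaction_dir / "_schema.json").write_text("{}")
    after = sketch_has_schema(transaction_dir)
```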

foundry_dev_tools.utils.caches.spark_caches.get_dataset_path(cache_dir, dataset_identity)

Returns the local directory of the dataset of the last transaction.

Parameters:

cache_dir (str) – the base directory of the cache
dataset_identity (dict) – the identity of the dataset, containing dataset_rid and last_transaction_rid

Returns:

local path of the last transaction of the dataset

Return type:

str
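
To make the relationship between cache_dir, dataset_rid and last_transaction_rid concrete, here is a minimal sketch of how such a path might be composed. The layout (cache directory, then dataset RID, then the transaction RID with a .parquet suffix) is an assumption inferred from the example path in the set_item_metadata parameters, and sketch_dataset_path is a hypothetical stand-in, not the library function.

```python
from pathlib import Path


def sketch_dataset_path(cache_dir, dataset_identity):
    """Hypothetical stand-in for get_dataset_path; the layout is assumed."""
    # Assumed layout: <cache_dir>/<dataset_rid>/<last_transaction_rid>.parquet
    return str(
        Path(cache_dir)
        / dataset_identity["dataset_rid"]
        / f"{dataset_identity['last_transaction_rid']}.parquet"
    )


# Placeholder values, not real RIDs.
identity = {
    "dataset_rid": "example-dataset-rid",
    "last_transaction_rid": "example-transaction-rid",
}
local_path = sketch_dataset_path("example-cache", identity)
```

Only dataset_rid and last_transaction_rid are read from the identity, which matches the keys the parameter description lists.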
