foundry_dev_tools.utils.caches.spark_caches module#
DataFrame cache implementations.
File-based Spark cache.
- class foundry_dev_tools.utils.caches.spark_caches.DiskPersistenceBackedSparkCache[source]#
Bases:
MutableMapping[DatasetIdentity, pyspark.sql.DataFrame]
A cache that stores Spark DataFrames inside a directory.
- __init__(ctx)[source]#
- Parameters:
ctx (FoundryContext)
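Because the class implements MutableMapping keyed by dataset identity, it can be used with the standard dict protocol. Below is a minimal, self-contained sketch of that idea using plain strings and JSON files in place of DatasetIdentity and Spark DataFrames; it illustrates the mapping-over-a-directory pattern only, not the actual Parquet-based implementation:

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class DiskCacheSketch(MutableMapping):
    """Toy stand-in for DiskPersistenceBackedSparkCache: one JSON file per key."""

    def __init__(self, cache_dir: Path):
        self._dir = Path(cache_dir)
        self._dir.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self._dir / f"{key}.json"

    def __setitem__(self, key: str, value) -> None:
        # cache[key] = value persists the entry to disk
        self._path(key).write_text(json.dumps(value))

    def __getitem__(self, key: str):
        try:
            return json.loads(self._path(key).read_text())
        except FileNotFoundError:
            # Mapping contract: a miss raises KeyError
            raise KeyError(key) from None

    def __delitem__(self, key: str) -> None:
        try:
            self._path(key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        # Path.stem drops only the ".json" suffix, so dotted keys survive
        return (p.stem for p in self._dir.glob("*.json"))

    def __len__(self) -> int:
        return sum(1 for _ in self._dir.glob("*.json"))


cache = DiskCacheSketch(Path(tempfile.mkdtemp()))
cache["ri.foundry.main.dataset.x"] = {"rows": 3}
print(cache["ri.foundry.main.dataset.x"])
```

Entries written through `__setitem__` survive process restarts because the backing store is the directory itself, which is the point of a disk-persistence-backed cache.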
- set_item_metadata(path, dataset_identity, schema)[source]#
Writes the schema and the metadata.json entry.
Use this when files are added to the cache without calling cache[entry] = df.
- Parameters:
path (Path) – direct path to the transaction folder, e.g. /…/dss-rid/transaction1.parquet
dataset_identity (api_types.DatasetIdentity) – dataset identity with the keys dataset_rid, last_transaction_rid and dataset_path
schema (api_types.FoundrySchema | None) – a Spark schema, a FoundrySchema, or None
- Return type:
None
- get_cache_dir()[source]#
Returns the cache directory.
- Returns:
path to the temporary cache directory
- Return type:
Path
- get_dataset_identity_not_branch_aware(dataset_path_or_rid)[source]#
If the dataset is in the cache, returns its complete identity; otherwise raises a KeyError.
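Since a miss raises KeyError, callers typically guard the lookup with try/except. A sketch of the documented contract, with the cache stood in by a plain dict of hypothetical identities:

```python
# Hypothetical cached identities, keyed by dataset path or rid (illustrative data).
cached_identities = {
    "/demo/dataset": {
        "dataset_rid": "ri.foundry.main.dataset.x",
        "last_transaction_rid": "ri.foundry.main.transaction.y",
        "dataset_path": "/demo/dataset",
    }
}


def get_dataset_identity_not_branch_aware(dataset_path_or_rid: str) -> dict:
    # Mirrors the documented contract: complete identity on a hit, KeyError on a miss.
    return cached_identities[dataset_path_or_rid]


try:
    identity = get_dataset_identity_not_branch_aware("/demo/dataset")
except KeyError:
    identity = None  # fall back, e.g. fetch the identity from Foundry instead
print(identity)
```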
- dataset_has_schema(dataset_identity)[source]#
Checks if dataset stored in cache folder has _schema.json attached.
- Parameters:
dataset_identity (DatasetIdentity) – dataset identity
- Returns:
True if the dataset has a schema, otherwise False
- Return type:
bool
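The check reduces to a file-existence test on the _schema.json mentioned above. A minimal sketch, assuming (for illustration only) that each dataset's cache folder is named after its dataset_rid:

```python
import tempfile
from pathlib import Path


def dataset_has_schema_sketch(cache_dir: Path, dataset_identity: dict) -> bool:
    # Assumed layout: a _schema.json file inside the dataset's cache folder.
    return (cache_dir / dataset_identity["dataset_rid"] / "_schema.json").exists()


cache_dir = Path(tempfile.mkdtemp())
identity = {"dataset_rid": "dss-rid", "last_transaction_rid": "t1", "dataset_path": "/demo"}
dataset_dir = cache_dir / "dss-rid"
dataset_dir.mkdir(parents=True)

before = dataset_has_schema_sketch(cache_dir, identity)  # schema not yet attached
(dataset_dir / "_schema.json").write_text("{}")
after = dataset_has_schema_sketch(cache_dir, identity)
print(before, after)
```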