foundry_dev_tools.resources.dataset module#
Dataset helper class.
- class foundry_dev_tools.resources.dataset.Dataset[source]#
Bases:
Resource
Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.
- __init__(*args, **kwargs)[source]#
Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.
- Return type:
None
- property branch: api_types.Branch#
The branch of the dataset.
If self._branch is not set, and the default “master” branch does not exist, it will be created.
- classmethod from_rid(context, rid, /, *, branch='master', create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)[source]#
Returns the dataset with the given rid.
- Parameters:
context (FoundryContext) – the foundry context for the dataset
rid (api_types.Rid) – the rid of the dataset
branch (api_types.Ref | api_types.Branch) – the branch of the dataset
create_branch_if_not_exists (bool) – create branch if branch does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)
**kwargs – passed to
foundry_dev_tools.resources.resource.Resource.from_rid()
- Return type:
- classmethod from_path(context, path, /, *, branch='master', create_if_not_exist=False, create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)[source]#
Returns dataset at path.
- Parameters:
context (FoundryContext) – the foundry context for the dataset
path (api_types.FoundryPath) – the path where the dataset is located on foundry
branch (api_types.Ref) – the branch of the dataset
create_if_not_exist (bool) – if the dataset does not exist, create it and the branch
create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)
**kwargs – passed to
foundry_dev_tools.resources.resource.Resource.from_path()
- Return type:
- classmethod create(context, path, branch, parent_ref=None, parent_branch_id=None)[source]#
Create a foundry dataset.
See api_create_dataset().
- Parameters:
context (FoundryContext)
path (api_types.FoundryPath)
branch (api_types.Ref)
parent_ref (api_types.Ref | None)
parent_branch_id (api_types.Ref | None)
- Return type:
- create_branch(branch, parent_ref=None, parent_branch_id=None)[source]#
Creates a branch on a dataset and switches to it.
- Parameters:
branch (api_types.DatasetBranch) – the branch to create
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch
- Return type:
Self
- switch_branch(branch, create_branch_if_not_exists=False, parent_ref=None, parent_branch_id=None)[source]#
Switch to another branch.
- Parameters:
branch (api_types.DatasetBranch) – the name of the branch to switch to
create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)
- Return type:
Self
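A minimal sketch of how the branch parameters combine: the helper below switches a dataset to a feature branch and allows it to be created from a parent branch if it is missing. The helper name `checkout_feature_branch` and the default parent branch are hypothetical; `ds` is assumed to be a Dataset obtained via from_rid() or from_path().

```python
def checkout_feature_branch(ds, branch, parent_branch="master"):
    """Switch ``ds`` to ``branch``, creating it off ``parent_branch`` if needed.

    Illustrative helper, not part of foundry_dev_tools. ``ds`` is assumed
    to be a Dataset instance.
    """
    # parent_branch_id is only used if the branch has to be created.
    return ds.switch_branch(
        branch,
        create_branch_if_not_exists=True,
        parent_branch_id=parent_branch,
    )
```

Because switch_branch() returns Self, the result can be chained with further calls on the dataset.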
- get_branch(branch)[source]#
Returns the branch resource.
- Parameters:
branch (api_types.DatasetBranch)
- Return type:
- get_transactions(page_size, end_transaction_rid=None, include_open_exclusive_transaction=False, allow_deleted_dataset=None, **kwargs)[source]#
Get the transactions of the dataset in reverse order (most recent first).
- Parameters:
page_size (int) – response page entry size
end_transaction_rid (api_types.TransactionRid | None) – at what transaction to stop listing
include_open_exclusive_transaction (bool | None) – include open exclusive transaction
allow_deleted_dataset (bool | None) – respond even if dataset was deleted
**kwargs – gets passed to
APIClient.api_request()
- Return type:
- get_last_transaction()[source]#
Returns the last transaction or None if there are no transactions.
- Return type:
api_types.Transaction | None
- get_open_transaction()[source]#
Gets the open transaction or None.
- Return type:
api_types.Transaction | None
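Both get_last_transaction() and get_open_transaction() may return None, so callers need to check for it. The sketch below shows one way to do that; the helper name `transaction_summary` and the returned strings are hypothetical, and `ds` is assumed to be a Dataset.

```python
def transaction_summary(ds):
    """Return a short, human-readable description of the transaction state.

    Illustrative helper, not part of foundry_dev_tools.
    """
    if ds.get_open_transaction() is not None:
        return "open transaction"
    if ds.get_last_transaction() is None:
        return "no transactions"
    return "no open transaction"
```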
- start_transaction(start_transaction_type=None)[source]#
Start a transaction on the dataset.
- Parameters:
start_transaction_type (api_types.FoundryTransaction | None)
- Return type:
Self
- property transaction: api_types.Transaction#
Get the current transaction or raise Error if there is no open transaction.
- upload_files(path_file_dict, max_workers=None, **kwargs)[source]#
Uploads multiple local files to a foundry dataset.
- Parameters:
max_workers (int | None) – Set number of threads for upload
path_file_dict (dict[api_types.PathInDataset, Path]) – A dictionary which maps the path in the dataset -> local file path
**kwargs – get passed to
foundry_dev_tools.resources.dataset.Dataset.transaction_context()
- Return type:
Self
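The path_file_dict argument maps each path in the dataset to a local file. As a sketch, such a mapping could be built from a local folder as follows; the helper name `build_path_file_dict` is hypothetical, and using the POSIX-style relative path as the path in the dataset is an assumption made for illustration.

```python
from pathlib import Path


def build_path_file_dict(folder):
    """Map dataset-relative paths to local files for every file under ``folder``."""
    folder = Path(folder)
    return {
        # The relative POSIX-style path is used as the path inside the dataset.
        local.relative_to(folder).as_posix(): local
        for local in sorted(folder.rglob("*"))
        if local.is_file()
    }


# Usage, assuming ``ds`` is a Dataset:
# ds.upload_files(build_path_file_dict("export"), max_workers=4)
```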
- upload_file(file_path, path_in_foundry_dataset, **kwargs)[source]#
Upload a file to the dataset.
- Parameters:
file_path (Path) – local file path to upload
path_in_foundry_dataset (api_types.PathInDataset) – file path inside the foundry dataset
transaction_type – if this dataset does not have an open transaction, opens a transaction with the specified type.
**kwargs – get passed to
foundry_dev_tools.resources.dataset.Dataset.transaction_context()
- Return type:
Self
- upload_folder(folder_path, max_workers=None, **kwargs)[source]#
Uploads all files contained in the folder to the dataset.
The default transaction type is UPDATE.
- Parameters:
folder_path (Path) – the folder to upload
max_workers (int | None) – Set number of threads for upload
**kwargs – get passed to
foundry_dev_tools.resources.dataset.Dataset.transaction_context()
- Return type:
Self
- delete_files(logical_paths, **kwargs)[source]#
Adds files in an open DELETE transaction.
Files added to DELETE transactions affect the dataset view by removing files from the view.
- Parameters:
logical_paths (list[api_types.PathInDataset]) – files in the dataset to delete
**kwargs – get passed to
foundry_dev_tools.resources.dataset.Dataset.transaction_context()
(transaction_type is forced to DELETE)
- Return type:
Self
- remove_file(logical_path, recursive=False)[source]#
Removes the given file from an open transaction.
If the logical path matches a file exactly then only that file will be removed, regardless of the value of recursive. If the logical path represents a directory, then all files prefixed with the logical path followed by ‘/’ will be removed when recursive is true and no files will be removed when recursive is false. If the given logical path does not match a file or directory then this call is ignored and does not throw an exception.
- Parameters:
logical_path (api_types.FoundryPath) – logical path in the backing filesystem
recursive (bool) – recurse into subdirectories
- Return type:
Self
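The matching rules described above can be modelled in plain Python. This is only a sketch of the documented behaviour, not the service's implementation; the function name `files_removed` and the list-of-paths representation are hypothetical.

```python
def files_removed(existing_files, logical_path, recursive=False):
    """Return the files that the documented remove_file rules would remove."""
    # An exact file match removes only that file, regardless of ``recursive``.
    if logical_path in existing_files:
        return [logical_path]
    # A directory only matches when ``recursive`` is true.
    if not recursive:
        return []
    prefix = logical_path + "/"
    # No match at all yields an empty list; the call is simply ignored.
    return [f for f in existing_files if f.startswith(prefix)]
```

For example, with the files `a/b.csv`, `a/c/d.csv` and `ab/e.csv`, removing `a` recursively affects only the first two, because the prefix match includes the trailing slash.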
- put_file(logical_path, file_data, overwrite=None, **kwargs)[source]#
Opens, writes, and closes a file under the specified dataset and transaction.
- Parameters:
dataset_rid – dataset rid
transaction_rid – transaction rid
logical_path (api_types.PathInDataset) – file path in dataset
overwrite (bool | None) – defaults to false; if true, overwrite the file if it already exists in the transaction.
**kwargs – get passed to
foundry_dev_tools.resources.dataset.Dataset.transaction_context()
- Return type:
Self
- download_files(output_directory, paths_in_dataset=None, max_workers=None)[source]#
Downloads multiple files from a dataset and saves them in the output directory.
- Parameters:
- Return type:
list[Path]
- get_file(path_in_dataset, start_transaction_rid=None, range_header=None)[source]#
Get bytes of a file in a Dataset.
- download_file(output_directory, path_in_dataset)[source]#
Downloads the file to the output directory.
- Parameters:
path_in_dataset (api_types.PathInDataset) – the file to download
output_directory (Path) – the directory where the file will be saved
- Return type:
Path
- download_files_temporary(paths_in_dataset=None, max_workers=None)[source]#
Downloads dataset files to a temporary directory and cleans it up afterwards.
A wrapper around foundry_dev_tools.resources.dataset.Dataset.download_files() together with tempfile.TemporaryDirectory as a context manager.
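The pattern of combining a download function with tempfile.TemporaryDirectory in a context manager can be sketched as below. This is not the library's implementation: the function name `download_temporary` is hypothetical, and yielding the directory path is an assumption, since what the library's context manager yields is not stated here.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def download_temporary(download_files, paths_in_dataset=None, max_workers=None):
    """Yield a temporary directory filled by ``download_files``.

    ``download_files`` is any callable with the same parameters as
    Dataset.download_files(). The directory is deleted on exit.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        download_files(
            tmp_path,
            paths_in_dataset=paths_in_dataset,
            max_workers=max_workers,
        )
        yield tmp_path


# Usage, assuming ``ds`` is a Dataset:
# with download_temporary(ds.download_files, ["data.csv"]) as tmp_dir:
#     print(sorted(tmp_dir.iterdir()))
```

Files must be read or copied inside the `with` block, because the directory no longer exists afterwards.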
- save_dataframe(df, transaction_type='SNAPSHOT', foundry_schema=None)[source]#
Saves a dataframe to Foundry. If the dataset in Foundry does not exist it is created.
If the branch does not exist, it is created. If the dataset exists, an exception is thrown. If exists_ok=True is passed, the dataset is overwritten. Creates SNAPSHOT transactions by default.
- Parameters:
df (pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | pyspark.sql.DataFrame) – A pyspark, pandas or polars DataFrame to upload
dataset_path_or_rid – path or rid of the dataset in which the object should be stored.
branch – Branch of the dataset in which the object should be stored
exists_ok – By default, this method creates a new dataset. Pass exists_ok=True to overwrite according to strategy from parameter ‘mode’
transaction_type (api_types.FoundryTransaction) – Foundry Transaction type, see
foundry_dev_tools.utils.api_types.FoundryTransaction
foundry_schema (api_types.FoundrySchema | None) – use a custom foundry schema instead of the inferred one
- Return type:
Self
- upload_schema(transaction_rid, schema)[source]#
Uploads the foundry dataset schema for a dataset, transaction, branch combination.
- Parameters:
transaction_rid (api_types.TransactionRid) – The rid of the transaction
schema (api_types.FoundrySchema) – The foundry schema
- Return type:
Self
- list_files(end_ref=None, page_size=1000, logical_path=None, page_start_logical_path=None, include_open_exclusive_transaction=False, exclude_hidden_files=False, temporary_credentials_auth_token=None)[source]#
Wraps foundry_dev_tools.clients.CatalogClient.list_dataset_files().
- Parameters:
end_ref (api_types.View | None) – branch or transaction rid of the dataset, defaults to the current branch
page_size (int) – the maximum page size returned
logical_path (api_types.PathInDataset | None) – If logical_path is absent, returns all files in the view. If logical_path matches a file exactly, returns just that file. Otherwise, returns all files in the “directory” of logical_path: (a slash is added to the end of logicalPath if necessary and a prefix-match is performed)
page_start_logical_path (api_types.PathInDataset | None) – if specified page starts at the given path, otherwise at the beginning of the file list
include_open_exclusive_transaction (bool) – if files added in open transaction should be returned as well in the response
exclude_hidden_files (bool) – if hidden files should be excluded (e.g. _log files)
temporary_credentials_auth_token (str | None) – to generate temporary credentials for presigned URLs
- Returns:
[
    {
        "logicalPath": "..",
        "pageStartLogicalPath": "..",
        "includeOpenExclusiveTransaction": "..",
        "excludeHiddenFiles": "..",
    },
]
- Return type:
list[FileResourcesPage]
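The three cases described for logical_path (absent, exact file match, directory prefix match) can be modelled in plain Python. This is a sketch of the documented selection rules only; the function name `select_files` and the list-of-paths representation are hypothetical.

```python
def select_files(all_files, logical_path=None):
    """Return the files that the documented logical_path rules would select."""
    # Absent logical_path: all files in the view.
    if logical_path is None:
        return list(all_files)
    # Exact match: just that file.
    if logical_path in all_files:
        return [logical_path]
    # Otherwise treat it as a directory: add a trailing slash if necessary
    # and perform a prefix match.
    prefix = logical_path if logical_path.endswith("/") else logical_path + "/"
    return [f for f in all_files if f.startswith(prefix)]
```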
- query_foundry_sql(query: str, return_type: Literal['pandas'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pd.core.frame.DataFrame [source]#
- query_foundry_sql(query: str, return_type: Literal['spark'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pyspark.sql.DataFrame
- query_foundry_sql(query: str, return_type: Literal['arrow'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pa.Table
- query_foundry_sql(query: str, return_type: Literal['raw'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]]
- query_foundry_sql(query: str, return_type: api_types.SQLReturnType = 'pandas', sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]] | pd.core.frame.DataFrame | pa.Table | pyspark.sql.DataFrame
Wrapper around foundry_dev_tools.clients.foundry_sql_server.FoundrySqlServerClient.query_foundry_sql().
But it automatically prepends the dataset location, so instead of:
>>> ctx.foundry_sql_server.query_foundry_sql("SELECT * FROM /path/to/dataset WHERE a=1")
You can just use:
>>> ds = ctx.get_dataset_by_path("/path/to/dataset")
>>> ds.query_foundry_sql("SELECT * WHERE a=1")
- to_spark()[source]#
Get dataset as a pyspark.sql.DataFrame.
Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()
- Return type:
- to_arrow()[source]#
Get dataset as a pyarrow.Table.
Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()
- Return type:
pa.Table
- to_pandas()[source]#
Get dataset as a pandas.DataFrame.
Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()
- Return type:
- to_polars()[source]#
Get dataset as a polars.DataFrame.
Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()
- Return type:
polars.dataframe.frame.DataFrame
- transaction_context(transaction_type=None, abort_on_error=True)[source]#
Handles transactions for dataset functions.
If there is no open transaction it will start one. If there is already an open transaction it will check if the transaction_type is correct. If this context manager started the transaction it will also commit or abort it (if abort_on_error is True).
- Parameters:
abort_on_error (bool) – if an error happens while in transaction_context and this is set to true it will abort the transaction instead of committing it. Only takes effect if this context manager also started the transaction.
transaction_type (api_types.FoundryTransaction | None) – if there is no open transaction, a transaction with this type is opened; if there is already an open transaction and it does not match this transaction_type, an Error is raised
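The start, check, and commit-or-abort behaviour described above can be modelled with a small in-memory stand-in. This is a sketch under stated assumptions, not the library's implementation: the class `FakeDataset` is hypothetical, the error type (ValueError) is an assumption, re-raising the original exception after aborting is an assumption, and a transaction_type of None is assumed to match any open transaction.

```python
from contextlib import contextmanager


class FakeDataset:
    """Hypothetical in-memory stand-in that only records transaction events."""

    def __init__(self):
        self.is_open = False
        self.open_type = None  # type of the open transaction, if any
        self.events = []

    @contextmanager
    def transaction_context(self, transaction_type=None, abort_on_error=True):
        # Only the context that started the transaction finishes it.
        started_here = not self.is_open
        if started_here:
            self.is_open, self.open_type = True, transaction_type
            self.events.append("start")
        elif transaction_type is not None and transaction_type != self.open_type:
            # Assumed error type; the documentation only says "an Error".
            raise ValueError("open transaction has a different type")
        try:
            yield self
        except Exception:
            if started_here:
                self._finish("abort" if abort_on_error else "commit")
            raise
        else:
            if started_here:
                self._finish("commit")

    def _finish(self, event):
        self.is_open, self.open_type = False, None
        self.events.append(event)
```

In this model, nesting a second `transaction_context("UPDATE")` inside an open UPDATE transaction records no extra events; only the outer context commits or aborts.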