foundry_dev_tools.resources.dataset module

Dataset helper class.

class foundry_dev_tools.resources.dataset.Dataset

Bases: Resource

Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.

rid_start: ClassVar[str] = 'ri.foundry.main.dataset'

__init__(*args, **kwargs)

Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.

Return type: None

property branch: api_types.Branch

The branch of the dataset.

If self._branch is not set, and the default “master” branch does not exist, it will be created.

classmethod from_rid(context, rid, /, *, branch='master', create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)

Returns the dataset with the given rid.

Parameters:

context (FoundryContext) – the foundry context for the dataset
rid (api_types.Rid) – the rid of the dataset
branch (api_types.Ref | api_types.Branch) – the branch of the dataset
create_branch_if_not_exists (bool) – create branch if branch does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)
**kwargs – passed to foundry_dev_tools.resources.resource.Resource.from_rid()

Return type:

Dataset

classmethod from_path(context, path, /, *, branch='master', create_if_not_exist=False, create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)

Returns dataset at path.

Parameters:

context (FoundryContext) – the foundry context for the dataset
path (api_types.FoundryPath) – the path where the dataset is located on foundry
branch (api_types.Ref) – the branch of the dataset
create_if_not_exist (bool) – if the dataset does not exist, create it and the branch
create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)
**kwargs – passed to foundry_dev_tools.resources.resource.Resource.from_path()

Return type:

Dataset
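As a minimal sketch of how the two constructors relate, the helper below picks from_rid() or from_path() based on whether the string starts with the rid_start prefix. The helper name open_dataset, the constant DATASET_RID_PREFIX, and the dataset_cls parameter are hypothetical and only exist for this illustration; ctx is assumed to be a FoundryContext.

```python
DATASET_RID_PREFIX = "ri.foundry.main.dataset"  # same value as Dataset.rid_start


def open_dataset(ctx, path_or_rid, branch="master", dataset_cls=None):
    """Open a dataset by rid or by path, depending on what the string looks like."""
    if dataset_cls is None:
        # Imported lazily so the helper can also be tried with a stand-in class.
        from foundry_dev_tools.resources.dataset import Dataset as dataset_cls

    if path_or_rid.startswith(DATASET_RID_PREFIX):
        # from_rid creates a missing branch by default
        # (create_branch_if_not_exists=True).
        return dataset_cls.from_rid(ctx, path_or_rid, branch=branch)

    # from_path only creates a missing dataset when create_if_not_exist=True;
    # it is kept at False here so a mistyped path does not create a dataset.
    return dataset_cls.from_path(
        ctx, path_or_rid, branch=branch, create_if_not_exist=False
    )
```

With a real context this would be called as open_dataset(ctx, "/path/to/dataset"); passing create_if_not_exist=True instead is a deliberate choice that should be made by the caller.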

classmethod create(context, path, branch, parent_ref=None, parent_branch_id=None)

Create a foundry dataset.

See api_create_dataset().

Parameters:

context (FoundryContext)
path (api_types.FoundryPath)
branch (api_types.Ref)
parent_ref (api_types.Ref | None)
parent_branch_id (api_types.Ref | None)

Return type:

Dataset

create_branch(branch, parent_ref=None, parent_branch_id=None)

Creates a branch on a dataset and switches to it.

Parameters:

branch (api_types.DatasetBranch) – the branch to create
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch

Return type:

Self

switch_branch(branch, create_branch_if_not_exists=False, parent_ref=None, parent_branch_id=None)

Switch to another branch.

Parameters:

branch (api_types.DatasetBranch) – the name of the branch to switch to
create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist
parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)
parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)

Return type:

Self
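A short sketch of switching to a working branch, assuming dataset is an existing Dataset instance. The helper name checkout_feature_branch and the base_branch parameter are hypothetical; the keyword arguments are the ones listed above.

```python
def checkout_feature_branch(dataset, branch, base_branch="master"):
    """Switch to `branch`, creating it from `base_branch` if it is missing."""
    return dataset.switch_branch(
        branch,
        create_branch_if_not_exists=True,
        # Only used when the branch actually has to be created.
        parent_branch_id=base_branch,
    )
```

Because switch_branch() returns Self, the result can be chained directly into further calls on the same dataset.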

get_branch(branch)

Returns the branch resource.

Parameters: branch (api_types.DatasetBranch)
Return type: api_types.Branch

get_transactions(page_size, end_transaction_rid=None, include_open_exclusive_transaction=False, allow_deleted_dataset=None, **kwargs)

Gets the dataset's transactions in reverse order, starting with the most recent.

Parameters:

page_size (int) – response page entry size
end_transaction_rid (api_types.TransactionRid | None) – at what transaction to stop listing
include_open_exclusive_transaction (bool | None) – include open exclusive transaction
allow_deleted_dataset (bool | None) – respond even if dataset was deleted
**kwargs – gets passed to APIClient.api_request()

Return type:

list[api_types.Transaction]

get_last_transaction()

Returns the last transaction or None if there are no transactions.

Return type: api_types.Transaction | None

get_open_transaction()

Gets the open transaction or None.

Return type: api_types.Transaction | None

start_transaction(start_transaction_type=None)

Start a transaction on the dataset.

Parameters: start_transaction_type (api_types.FoundryTransaction | None)
Return type: Self

property transaction: api_types.Transaction

Gets the current transaction, or raises an error if there is no open transaction.

commit_transaction()

Commit the transaction on the dataset.

Return type: Self

abort_transaction()

Abort the transaction on the dataset.

Return type: Self
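The three transaction methods can be combined by hand as sketched below, assuming dataset is a Dataset instance with no open transaction. The helper run_in_transaction and its work callback are hypothetical names; transaction_context() provides the same pattern as a context manager.

```python
def run_in_transaction(dataset, work, transaction_type="UPDATE"):
    """Run `work(dataset)` inside a transaction; commit on success, abort on error."""
    dataset.start_transaction(start_transaction_type=transaction_type)
    try:
        result = work(dataset)
    except Exception:
        # Leave no half-written transaction open, then let the caller see the error.
        dataset.abort_transaction()
        raise
    dataset.commit_transaction()
    return result
```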

upload_files(path_file_dict, max_workers=None, **kwargs)

Uploads multiple local files to a foundry dataset.

Parameters:

max_workers (int | None) – Set number of threads for upload
path_file_dict (dict[api_types.PathInDataset, Path]) – A dictionary which maps the path in the dataset -> local file path
**kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self
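One way to build path_file_dict is sketched here: it collects the files under a local folder that match a glob pattern and maps each relative path to its local file. The helper names build_upload_mapping and upload_matching, the pattern and prefix parameters, and the choice of four workers are all assumptions made for the example.

```python
from pathlib import Path


def build_upload_mapping(folder, pattern="*.csv", prefix=""):
    """Map dataset paths to local files for every file under `folder` matching `pattern`."""
    folder = Path(folder)
    mapping = {}
    for local_file in sorted(folder.rglob(pattern)):
        if local_file.is_file():
            # Keep the folder layout and use forward slashes for the dataset path.
            relative = local_file.relative_to(folder).as_posix()
            mapping[prefix + relative] = local_file
    return mapping


def upload_matching(dataset, folder, pattern="*.csv", max_workers=4):
    """Upload only the matching files; skip the call when nothing matches."""
    mapping = build_upload_mapping(folder, pattern)
    if mapping:
        dataset.upload_files(mapping, max_workers=max_workers)
    return mapping
```

When every file in the folder should be uploaded, upload_folder() covers that case without a hand-built mapping.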

upload_file(file_path, path_in_foundry_dataset, **kwargs)

Upload a file to the dataset.

Parameters:

file_path (Path) – local file path to upload
path_in_foundry_dataset (api_types.PathInDataset) – file path inside the foundry dataset
transaction_type – if this dataset does not have an open transaction, opens a transaction with the specified type.
**kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self

upload_folder(folder_path, max_workers=None, **kwargs)

Uploads all files contained in the folder to the dataset.

The default transaction type is UPDATE.

Parameters:

folder_path (Path) – the folder to upload
max_workers (int | None) – Set number of threads for upload
**kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self

delete_files(logical_paths, **kwargs)

Adds files in an open DELETE transaction.

Files added to DELETE transactions affect the dataset view by removing files from the view.

Parameters:

logical_paths (list[api_types.PathInDataset]) – files in the dataset to delete
**kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context() (transaction_type is forced to DELETE)

Return type:

Self

remove_file(logical_path, recursive=False)

Removes the given file from an open transaction.

If the logical path matches a file exactly then only that file will be removed, regardless of the value of recursive. If the logical path represents a directory, then all files prefixed with the logical path followed by ‘/’ will be removed when recursive is true and no files will be removed when recursive is false. If the given logical path does not match a file or directory then this call is ignored and does not throw an exception.

Parameters:

logical_path (api_types.FoundryPath) – logical path in the backing filesystem
recursive (bool) – recurse into subdirectories

Return type:

Self
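The matching rule described above can be modelled locally to predict which files a call would affect. This is only an illustration of the documented rule, not the library's or the server's implementation; the function name paths_removed is hypothetical.

```python
def paths_removed(existing_paths, logical_path, recursive=False):
    """Return the set of paths the documented remove_file rule would remove."""
    existing = set(existing_paths)

    # An exact file match wins, regardless of `recursive`.
    if logical_path in existing:
        return {logical_path}

    # A directory is only emptied when `recursive` is true.
    if not recursive:
        return set()

    # Directory match: everything prefixed with the logical path followed by '/'.
    prefix = logical_path + "/"
    return {path for path in existing if path.startswith(prefix)}
```

For the files data/a.csv, data/sub/b.csv and data.csv, the path "data" removes nothing with recursive=False and removes both files under data/ with recursive=True, while data.csv is untouched in either case.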

put_file(logical_path, file_data, overwrite=None, **kwargs)

Opens, writes, and closes a file under the specified dataset and transaction.

Parameters:

logical_path (api_types.PathInDataset) – file path in dataset
file_data (str | bytes | IO[AnyStr]) – content of the file
overwrite (bool | None) – if true, overwrite the file if it already exists in the transaction; defaults to false
**kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self

download_files(output_directory, paths_in_dataset=None, max_workers=None)

Downloads multiple files from a dataset and saves them in the output directory.

Parameters:

output_directory (Path) – the directory where the files will be saved
paths_in_dataset (set[api_types.PathInDataset] | None) – the files to download, if None (the default) it will download all files
max_workers (int | None) – how many connections to use in parallel to download the files

Return type:

list[Path]

get_file(path_in_dataset, start_transaction_rid=None, range_header=None)

Get bytes of a file in a Dataset.

Parameters:

path_in_dataset (api_types.PathInDataset) – the file to get
start_transaction_rid (api_types.TransactionRid | None) – start transaction rid
range_header (str | None) – HTTP range header

Return type:

bytes
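A small sketch of reading only the beginning of a file, for example to peek at a CSV header. It assumes that range_header takes the value of a standard HTTP Range header such as "bytes=0-1023"; that format is an assumption, as is the helper name read_head.

```python
def read_head(dataset, path_in_dataset, num_bytes=1024):
    """Return at most the first `num_bytes` bytes of a file in the dataset."""
    if num_bytes < 1:
        raise ValueError("num_bytes must be at least 1")
    # HTTP byte ranges are inclusive on both ends, so the first N bytes are 0..N-1.
    range_header = f"bytes=0-{num_bytes - 1}"
    return dataset.get_file(path_in_dataset, range_header=range_header)
```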

download_file(output_directory, path_in_dataset)

Downloads the file to the output directory.

Parameters:

path_in_dataset (api_types.PathInDataset) – the file to download
output_directory (Path) – the directory where the file will be saved

Return type:

Path

download_files_temporary(paths_in_dataset=None, max_workers=None)

Downloads dataset files to temporary directory and cleans it up afterwards.

A wrapper around foundry_dev_tools.resources.dataset.Dataset.download_files() together with tempfile.TemporaryDirectory as a contextmanager.

Parameters:

paths_in_dataset (list[api_types.PathInDataset] | None) – the files to download, if None (the default) it will download all files
max_workers (int | None) – how many connections to use in parallel to download the files

Return type:

Iterator[Path]
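Since the temporary directory is removed when the context manager exits, any results have to be computed inside the with block. The sketch below assumes the context manager yields the Path of that directory, as the return type suggests; the helper name summarize_files is hypothetical.

```python
from pathlib import Path


def summarize_files(dataset, paths_in_dataset=None):
    """Return {relative path: size in bytes} for the downloaded files."""
    with dataset.download_files_temporary(paths_in_dataset=paths_in_dataset) as tmp_dir:
        root = Path(tmp_dir)
        # Read everything needed here; the directory is gone after the block.
        return {
            f.relative_to(root).as_posix(): f.stat().st_size
            for f in sorted(root.rglob("*"))
            if f.is_file()
        }
```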

save_dataframe(df, transaction_type='SNAPSHOT', foundry_schema=None)

Saves a dataframe to Foundry. If the dataset in Foundry does not exist it is created.

If the branch does not exist, it is created. Creates SNAPSHOT transactions by default.

Parameters:

df (pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | pyspark.sql.DataFrame) – A pyspark, pandas or polars DataFrame to upload
transaction_type (api_types.FoundryTransaction) – Foundry Transaction type, see foundry_dev_tools.utils.api_types.FoundryTransaction
foundry_schema (api_types.FoundrySchema | None) – use a custom foundry schema instead of the inferred one

Return type:

Self

infer_schema()

Returns the inferred dataset schema.

Return type: dict

upload_schema(transaction_rid, schema)

Uploads the foundry dataset schema for a dataset, transaction, branch combination.

Parameters:

transaction_rid (api_types.TransactionRid) – The rid of the transaction
schema (api_types.FoundrySchema) – The foundry schema

Return type:

Self

list_files(end_ref=None, page_size=1000, logical_path=None, page_start_logical_path=None, include_open_exclusive_transaction=False, exclude_hidden_files=False, temporary_credentials_auth_token=None)

Wraps foundry_dev_tools.clients.CatalogClient.list_dataset_files().

Parameters:

end_ref (api_types.View | None) – branch or transaction rid of the dataset, defaults to the current branch
page_size (int) – the maximum page size returned
logical_path (api_types.PathInDataset | None) – If logical_path is absent, returns all files in the view. If logical_path matches a file exactly, returns just that file. Otherwise, returns all files in the “directory” of logical_path: (a slash is added to the end of logicalPath if necessary and a prefix-match is performed)
page_start_logical_path (api_types.PathInDataset | None) – if specified page starts at the given path, otherwise at the beginning of the file list
include_open_exclusive_transaction (bool) – if files added in open transaction should be returned as well in the response
exclude_hidden_files (bool) – if hidden files should be excluded (e.g. _log files)
temporary_credentials_auth_token (str | None) – to generate temporary credentials for presigned URLs

Returns:

[
    {
        "logicalPath": "..",
        "pageStartLogicalPath": "..",
        "includeOpenExclusiveTransaction": "..",
        "excludeHiddenFiles": "..",
    },
]

Return type:

list[FileResourcesPage]

query_foundry_sql(query: str, return_type: Literal['pandas'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) → pd.core.frame.DataFrame

query_foundry_sql(query: str, return_type: Literal['spark'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) → pyspark.sql.DataFrame

query_foundry_sql(query: str, return_type: Literal['arrow'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) → pa.Table

query_foundry_sql(query: str, return_type: Literal['raw'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) → tuple[dict, list[list]]

query_foundry_sql(query: str, return_type: api_types.SQLReturnType = 'pandas', sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) → tuple[dict, list[list]] | pd.core.frame.DataFrame | pa.Table | pyspark.sql.DataFrame

Wrapper around foundry_dev_tools.clients.foundry_sql_server.FoundrySqlServerClient.query_foundry_sql().

It automatically prepends the dataset location, so instead of:

>>> ctx.foundry_sql_server.query_foundry_sql("SELECT * FROM /path/to/dataset WHERE a=1")

You can just use:

>>> ds = ctx.get_dataset_by_path("/path/to/dataset")
>>> ds.query_foundry_sql("SELECT * WHERE a=1")
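The return_type argument selects which of the overloads above applies. The sketch below wraps the documented query form in a hypothetical helper, select_where, and assumes dataset is a Dataset instance; the condition is interpolated into the SQL text as-is, so it should never come from untrusted input.

```python
def select_where(dataset, condition, return_type="pandas", timeout=600):
    """Run `SELECT * WHERE <condition>` against the dataset itself."""
    # No FROM clause: query_foundry_sql prepends the dataset location.
    query = f"SELECT * WHERE {condition}"
    return dataset.query_foundry_sql(query, return_type=return_type, timeout=timeout)
```

With return_type="raw" the result is a (schema, rows) tuple, while "arrow" and "spark" return a pyarrow Table and a pyspark DataFrame respectively, which requires the matching package to be installed.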

to_spark()

Get dataset as pyspark.sql.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type: pyspark.sql.DataFrame

to_arrow()

Get dataset as a pyarrow.Table.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type: pa.Table

to_pandas()

Get dataset as a pandas.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type: pandas.core.frame.DataFrame

to_polars()

Get dataset as a polars.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type: polars.dataframe.frame.DataFrame

transaction_context(transaction_type=None, abort_on_error=True)

Handles transactions for dataset functions.

If there is no open transaction it will start one. If there is already an open transaction it will check if the transaction_type is correct. If this context manager started the transaction it will also commit or abort it (if abort_on_error is True).

Parameters:

abort_on_error (bool) – if an error happens while in transaction_context and this is set to true it will abort the transaction instead of committing it. Only takes effect if this context manager also started the transaction.
transaction_type (api_types.FoundryTransaction | None) – if there is no open transaction it will open a transaction with this type, if there is already an open transaction and it does not match this transaction_type it will raise an Error
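A minimal sketch of grouping several writes into one transaction with this context manager, assuming dataset is a Dataset instance. The helper name write_files_in_one_transaction and the files mapping are hypothetical; put_file() calls made inside the block should reuse the transaction that the context manager opened.

```python
def write_files_in_one_transaction(dataset, files, transaction_type="UPDATE"):
    """Write every {logical_path: content} pair inside a single transaction."""
    with dataset.transaction_context(
        transaction_type=transaction_type,
        abort_on_error=True,  # an exception inside the block aborts instead of committing
    ):
        for logical_path, content in files.items():
            dataset.put_file(logical_path, content)
    return dataset
```

If a transaction is already open when the block is entered, the context manager leaves committing or aborting to whoever started it, as described above.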

sync()

Fetches the attributes again, along with the dataset branch information.

Return type: Self
