foundry_dev_tools.resources.dataset module#

Dataset helper class.

class foundry_dev_tools.resources.dataset.Dataset[source]#

Bases: Resource

Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.

rid_start: ClassVar[str] = 'ri.foundry.main.dataset'#
__init__(*args, **kwargs)[source]#

Not intended to be initialized directly. Use Resource.from_rid() or Resource.from_path() instead.

Return type:

None

property branch: api_types.Branch#

The branch of the dataset.

If self._branch is not set, and the default “master” branch does not exist, it will be created.

classmethod from_rid(context, rid, /, *, branch='master', create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)[source]#

Returns the dataset with the given rid.

Parameters:
  • context (FoundryContext) – the foundry context for the dataset

  • rid (api_types.Rid) – the rid of the dataset

  • branch (api_types.Ref | api_types.Branch) – the branch of the dataset

  • create_branch_if_not_exists (bool) – create branch if branch does not exist

  • parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)

  • parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)

  • **kwargs – passed to foundry_dev_tools.resources.resource.Resource.from_rid()

Return type:

Dataset

classmethod from_path(context, path, /, *, branch='master', create_if_not_exist=False, create_branch_if_not_exists=True, parent_ref=None, parent_branch_id=None, **kwargs)[source]#

Returns dataset at path.

Parameters:
  • context (FoundryContext) – the foundry context for the dataset

  • path (api_types.FoundryPath) – the path where the dataset is located on foundry

  • branch (api_types.Ref) – the branch of the dataset

  • create_if_not_exist (bool) – if the dataset does not exist, create it and the branch

  • create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist

  • parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)

  • parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)

  • **kwargs – passed to foundry_dev_tools.resources.resource.Resource.from_path()

Return type:

Dataset
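As a minimal sketch of how from_path() might be called, assuming ctx is an already configured FoundryContext: the helper name open_or_create_dataset is hypothetical, and only parameters listed above are used.

```python
def open_or_create_dataset(ctx, path, branch="master"):
    """Return the dataset at `path`, creating it (and the branch) if missing.

    `ctx` is assumed to be a FoundryContext; `path` is a Foundry path.
    """
    # Imported lazily so that this sketch can be loaded without the package
    # being installed; a module-level import works just as well.
    from foundry_dev_tools.resources.dataset import Dataset

    return Dataset.from_path(
        ctx,
        path,
        branch=branch,
        create_if_not_exist=True,  # create the dataset if it does not exist
    )
```

With create_if_not_exist=False (the default) the helper would only open existing datasets.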

classmethod create(context, path, branch, parent_ref=None, parent_branch_id=None)[source]#

Create a foundry dataset.

See api_create_dataset().

Parameters:
  • context (FoundryContext)

  • path (api_types.FoundryPath)

  • branch (api_types.Ref)

  • parent_ref (api_types.Ref | None)

  • parent_branch_id (api_types.Ref | None)

Return type:

Dataset

create_branch(branch, parent_ref=None, parent_branch_id=None)[source]#

Creates a branch on a dataset and switches to it.

Parameters:
  • branch (api_types.DatasetBranch) – the branch to create

  • parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based

  • parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch

Return type:

Self

switch_branch(branch, create_branch_if_not_exists=False, parent_ref=None, parent_branch_id=None)[source]#

Switch to another branch.

Parameters:
  • branch (api_types.DatasetBranch) – the name of the branch to switch to

  • create_branch_if_not_exists (bool) – create the branch if it does not exist; the branch is always created if the resource does not exist

  • parent_ref (api_types.TransactionRid | None) – optionally the transaction off which the branch will be based (only used if branch needs to be created)

  • parent_branch_id (api_types.DatasetBranch | None) – optionally a parent branch name, otherwise a root branch (only used if branch needs to be created)

Return type:

Self
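Because switch_branch() returns Self, a branch switch can be wrapped in a small helper and chained with further calls. The sketch below assumes dataset is an existing Dataset instance; the helper name checkout_feature_branch and the example branch names are hypothetical.

```python
def checkout_feature_branch(dataset, branch, parent_branch="master"):
    """Switch `dataset` to `branch`, creating it off `parent_branch` if missing."""
    return dataset.switch_branch(
        branch,
        create_branch_if_not_exists=True,
        # Only used if the branch needs to be created.
        parent_branch_id=parent_branch,
    )
```

For example, checkout_feature_branch(ds, "feature/cleanup") would return ds itself, now pointing at the requested branch.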

get_branch(branch)[source]#

Returns the branch resource.

Parameters:

branch (api_types.DatasetBranch)

Return type:

api_types.Branch

get_transactions(page_size, end_transaction_rid=None, include_open_exclusive_transaction=False, allow_deleted_dataset=None, **kwargs)[source]#

Get the transactions of the dataset in reverse order, starting with the most recent.

Parameters:
  • page_size (int) – response page entry size

  • end_transaction_rid (api_types.TransactionRid | None) – at what transaction to stop listing

  • include_open_exclusive_transaction (bool | None) – include open exclusive transaction

  • allow_deleted_dataset (bool | None) – respond even if dataset was deleted

  • **kwargs – gets passed to APIClient.api_request()

Return type:

list[api_types.Transaction]

get_last_transaction()[source]#

Returns the last transaction or None if there are no transactions.

Return type:

api_types.Transaction | None

get_open_transaction()[source]#

Gets the open transaction or None.

Return type:

api_types.Transaction | None

start_transaction(start_transaction_type=None)[source]#

Start a transaction on the dataset.

Parameters:

start_transaction_type (api_types.FoundryTransaction | None)

Return type:

Self

property transaction: api_types.Transaction#

Get the current transaction, or raise an error if there is no open transaction.

commit_transaction()[source]#

Commit the transaction on the dataset.

Return type:

Self

abort_transaction()[source]#

Abort the transaction on the dataset.

Return type:

Self
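The three transaction methods can be combined by hand when explicit control is needed. This is a minimal sketch, assuming dataset is an existing Dataset instance and that the transaction type can be given as the string "UPDATE"; the helper name put_file_in_transaction is hypothetical.

```python
def put_file_in_transaction(dataset, logical_path, data, transaction_type="UPDATE"):
    """Write one file inside an explicitly managed transaction."""
    dataset.start_transaction(start_transaction_type=transaction_type)
    try:
        dataset.put_file(logical_path, data)
    except Exception:
        # Roll back so the dataset is not left with an open transaction.
        dataset.abort_transaction()
        raise
    # Only reached if the write succeeded.
    return dataset.commit_transaction()
```

In most cases transaction_context() is the shorter way to get the same commit-or-abort behaviour.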

upload_files(path_file_dict, max_workers=None, **kwargs)[source]#

Uploads multiple local files to a foundry dataset.

Parameters:
Return type:

Self

upload_file(file_path, path_in_foundry_dataset, **kwargs)[source]#

Upload a file to the dataset.

Parameters:
  • file_path (Path) – local file path to upload

  • path_in_foundry_dataset (api_types.PathInDataset) – file path inside the foundry dataset

  • transaction_type – if this dataset does not have an open transaction, opens a transaction with the specified type.

  • **kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self

upload_folder(folder_path, max_workers=None, **kwargs)[source]#

Uploads all files contained in the folder to the dataset.

The default transaction type is UPDATE.

Parameters:
Return type:

Self

delete_files(logical_paths, **kwargs)[source]#

Adds files in an open DELETE transaction.

Files added to DELETE transactions affect the dataset view by removing files from the view.

Parameters:
Return type:

Self

remove_file(logical_path, recursive=False)[source]#

Removes the given file from an open transaction.

If the logical path matches a file exactly then only that file will be removed, regardless of the value of recursive. If the logical path represents a directory, then all files prefixed with the logical path followed by ‘/’ will be removed when recursive is true and no files will be removed when recursive is false. If the given logical path does not match a file or directory then this call is ignored and does not throw an exception.

Parameters:
  • logical_path (api_types.FoundryPath) – logical path in the backing filesystem

  • recursive (bool) – recurse into subdirectories

Return type:

Self
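The matching rule described for remove_file() can be modelled in plain Python. The function below is an illustrative model of the documented behaviour, not the library's implementation; the name files_removed and the example paths are hypothetical.

```python
def files_removed(files, logical_path, recursive=False):
    """Return the subset of `files` that the documented rule would remove."""
    files = set(files)
    # An exact file match removes only that file, regardless of `recursive`.
    if logical_path in files:
        return {logical_path}
    # A directory match removes nothing unless `recursive` is true.
    if not recursive:
        return set()
    prefix = logical_path + "/"
    # Paths that match neither a file nor a directory yield an empty set.
    return {f for f in files if f.startswith(prefix)}


example = {"a/b.csv", "a/c/d.csv", "ab.csv"}
print(sorted(files_removed(example, "a", recursive=True)))  # ['a/b.csv', 'a/c/d.csv']
```

Note that "ab.csv" is not removed in the example, because the prefix match includes the trailing slash.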

put_file(logical_path, file_data, overwrite=None, **kwargs)[source]#

Opens, writes, and closes a file under the specified dataset and transaction.

Parameters:
  • dataset_rid – dataset rid

  • transaction_rid – transaction rid

  • logical_path (api_types.PathInDataset) – file path in dataset

  • file_data (str | bytes | IO[AnyStr]) – content of the file

  • overwrite (bool | None) – if true, overwrite the file if it already exists in the transaction; defaults to false

  • **kwargs – get passed to foundry_dev_tools.resources.dataset.Dataset.transaction_context()

Return type:

Self

download_files(output_directory, paths_in_dataset=None, max_workers=None)[source]#

Downloads multiple files from a dataset and saves them in the output directory.

Parameters:
  • output_directory (Path) – the directory where the files will be saved

  • paths_in_dataset (set[api_types.PathInDataset] | None) – the files to download, if None (the default) it will download all files

  • max_workers (int | None) – how many connections to use in parallel to download the files

Return type:

list[Path]

get_file(path_in_dataset, start_transaction_rid=None, range_header=None)[source]#

Get bytes of a file in a Dataset.

Parameters:
  • path_in_dataset (api_types.PathInDataset) – the file to get

  • start_transaction_rid (api_types.TransactionRid | None) – start transaction rid

  • range_header (str | None) – HTTP range header

Return type:

bytes
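The range_header parameter makes it possible to fetch only part of a file. The sketch below assumes that range_header takes a standard HTTP Range value such as "bytes=0-1023"; the helper names byte_range_header and read_head are hypothetical.

```python
def byte_range_header(start, length):
    """Build an HTTP Range value covering `length` bytes from offset `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{start + length - 1}"


def read_head(dataset, path_in_dataset, n_bytes=1024):
    """Fetch the first `n_bytes` bytes of a file in the dataset."""
    return dataset.get_file(
        path_in_dataset,
        range_header=byte_range_header(0, n_bytes),
    )
```

This can be useful for peeking at the header row of a large CSV file without downloading all of it.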

download_file(output_directory, path_in_dataset)[source]#

Downloads the file to the output directory.

Parameters:
  • path_in_dataset (api_types.PathInDataset) – the file to download

  • output_directory (Path) – the directory where the file will be saved

Return type:

Path

download_files_temporary(paths_in_dataset=None, max_workers=None)[source]#

Downloads dataset files to a temporary directory and cleans it up afterwards.

A wrapper around foundry_dev_tools.resources.dataset.Dataset.download_files() together with tempfile.TemporaryDirectory, used as a context manager.

Parameters:
  • paths_in_dataset (list[api_types.PathInDataset] | None) – the files to download, if None (the default) it will download all files

  • max_workers (int | None) – how many connections to use in parallel to download the files

Return type:

Iterator[Path]
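Since download_files_temporary() is a context manager, the downloaded files are only available inside the with block. A minimal sketch, assuming the context manager yields the temporary directory as a Path; the helper name total_size_of_files is hypothetical.

```python
def total_size_of_files(dataset, paths_in_dataset=None):
    """Download files to a temporary directory and return their total size in bytes."""
    with dataset.download_files_temporary(paths_in_dataset=paths_in_dataset) as tmp_dir:
        # The directory and its contents are deleted when the block exits,
        # so everything needed must be computed before leaving it.
        return sum(p.stat().st_size for p in tmp_dir.rglob("*") if p.is_file())
```

Use download_files() instead when the files should outlive the call.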

save_dataframe(df, transaction_type='SNAPSHOT', foundry_schema=None)[source]#

Saves a dataframe to Foundry. If the dataset in Foundry does not exist it is created.

If the branch does not exist, it is created. If the dataset exists, an exception is thrown. If exists_ok=True is passed, the dataset is overwritten. Creates SNAPSHOT transactions by default.

Parameters:
  • df (pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | pyspark.sql.DataFrame) – A pyspark, pandas or polars DataFrame to upload

  • dataset_path_or_rid – path or rid of the dataset in which the object should be stored.

  • branch – Branch of the dataset in which the object should be stored

  • exists_ok – By default, this method creates a new dataset. Pass exists_ok=True to overwrite according to strategy from parameter ‘mode’

  • transaction_type (api_types.FoundryTransaction) – Foundry Transaction type, see foundry_dev_tools.utils.api_types.FoundryTransaction

  • foundry_schema (api_types.FoundrySchema | None) – use a custom foundry schema instead of the inferred one

Return type:

Self

infer_schema()[source]#

Returns the inferred dataset schema.

Return type:

dict

upload_schema(transaction_rid, schema)[source]#

Uploads the foundry dataset schema for a dataset, transaction, branch combination.

Parameters:
  • transaction_rid (api_types.TransactionRid) – The rid of the transaction

  • schema (api_types.FoundrySchema) – The foundry schema

Return type:

Self

list_files(end_ref=None, page_size=1000, logical_path=None, page_start_logical_path=None, include_open_exclusive_transaction=False, exclude_hidden_files=False, temporary_credentials_auth_token=None)[source]#

Wraps foundry_dev_tools.clients.CatalogClient.list_dataset_files().

Parameters:
  • end_ref (api_types.View | None) – branch or transaction rid of the dataset, defaults to the current branch

  • page_size (int) – the maximum page size returned

  • logical_path (api_types.PathInDataset | None) – if logical_path is absent, returns all files in the view. If logical_path matches a file exactly, returns just that file. Otherwise, returns all files in the “directory” of logical_path (a slash is added to the end of logical_path if necessary and a prefix match is performed)

  • page_start_logical_path (api_types.PathInDataset | None) – if specified page starts at the given path, otherwise at the beginning of the file list

  • include_open_exclusive_transaction (bool) – if files added in open transaction should be returned as well in the response

  • exclude_hidden_files (bool) – if hidden files should be excluded (e.g. _log files)

  • temporary_credentials_auth_token (str | None) – to generate temporary credentials for presigned URLs

Returns:

[
    {
        "logicalPath": "..",
        "pageStartLogicalPath": "..",
        "includeOpenExclusiveTransaction": "..",
        "excludeHiddenFiles": "..",
    },
]

Return type:

list[FileResourcesPage]
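The selection rule described for the logical_path parameter can be modelled in plain Python. The function below illustrates the documented rule only and is not the server's implementation; the name filter_by_logical_path and the example paths are hypothetical.

```python
def filter_by_logical_path(files, logical_path=None):
    """Return the paths from `files` selected by the documented logical_path rule."""
    files = sorted(files)
    # Absent logical_path: every file in the view is returned.
    if logical_path is None:
        return files
    # Exact file match: just that file.
    if logical_path in files:
        return [logical_path]
    # Otherwise treat it as a directory: add a slash if needed, then prefix-match.
    prefix = logical_path if logical_path.endswith("/") else logical_path + "/"
    return [f for f in files if f.startswith(prefix)]


view = ["raw/2023/a.csv", "raw/2024/b.csv", "raw.csv"]
print(filter_by_logical_path(view, "raw"))  # ['raw/2023/a.csv', 'raw/2024/b.csv']
```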

query_foundry_sql(query: str, return_type: Literal['pandas'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pd.core.frame.DataFrame[source]#
query_foundry_sql(query: str, return_type: Literal['spark'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pyspark.sql.DataFrame
query_foundry_sql(query: str, return_type: Literal['arrow'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) pa.Table
query_foundry_sql(query: str, return_type: Literal['raw'], sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]]
query_foundry_sql(query: str, return_type: api_types.SQLReturnType = 'pandas', sql_dialect: api_types.SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]] | pd.core.frame.DataFrame | pa.Table | pyspark.sql.DataFrame

Wrapper around foundry_dev_tools.clients.foundry_sql_server.FoundrySqlServerClient.query_foundry_sql().

But it automatically prepends the dataset location, so instead of:

>>> ctx.foundry_sql_server.query_foundry_sql("SELECT * FROM /path/to/dataset WHERE a=1")

you can just use:

>>> ds = ctx.get_dataset_by_path("/path/to/dataset")
>>> ds.query_foundry_sql("SELECT * WHERE a=1")

to_spark()[source]#

Get dataset as pyspark.sql.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type:

pyspark.sql.DataFrame

to_arrow()[source]#

Get dataset as a pyarrow.Table.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type:

pa.Table

to_pandas()[source]#

Get dataset as a pandas.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type:

pandas.core.frame.DataFrame

to_polars()[source]#

Get dataset as a polars.DataFrame.

Via foundry_dev_tools.resources.dataset.Dataset.query_foundry_sql()

Return type:

polars.dataframe.frame.DataFrame

transaction_context(transaction_type=None, abort_on_error=True)[source]#

Handles transactions for dataset functions.

If there is no open transaction it will start one. If there is already an open transaction it will check if the transaction_type is correct. If this context manager started the transaction it will also commit or abort it (if abort_on_error is True).

Parameters:
  • abort_on_error (bool) – if an error happens while in transaction_context and this is set to true it will abort the transaction instead of committing it. Only takes effect if this context manager also started the transaction.

  • transaction_type (api_types.FoundryTransaction | None) – if there is no open transaction it will open a transaction with this type, if there is already an open transaction and it does not match this transaction_type it will raise an Error
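One use of transaction_context() is to group several writes into a single transaction. The following is a sketch under stated assumptions: dataset is an existing Dataset instance, the transaction type can be given as the string "UPDATE", and the helper name write_files_in_one_transaction is hypothetical.

```python
def write_files_in_one_transaction(dataset, files, transaction_type="UPDATE"):
    """Write all `files` (logical path -> content) in a single transaction."""
    with dataset.transaction_context(
        transaction_type=transaction_type,
        abort_on_error=True,  # abort instead of committing if a write fails
    ):
        for logical_path, data in files.items():
            # put_file() forwards extra keyword arguments to transaction_context();
            # the same type is passed so it matches the already open transaction.
            dataset.put_file(logical_path, data, transaction_type=transaction_type)
    return dataset
```

If the outer block started the transaction, it is committed once after the last file rather than once per file.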

sync()[source]#

Fetches the attributes and the dataset branch information again.

Return type:

Self