foundry_dev_tools.clients.data_proxy module#

The DataProxy API client.

class foundry_dev_tools.clients.data_proxy.DataProxyClient[source]#

Bases: APIClient

DataProxyClient class that implements methods from the 'foundry-data-proxy' API.

api_name: ClassVar[str] = 'foundry-data-proxy'#
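
The usage sketches below assume a configured FoundryContext that exposes this client; the attribute name data_proxy is an assumption based on the library's client naming and may differ in your version:

from foundry_dev_tools import FoundryContext

ctx = FoundryContext()  # picks up credentials from the standard configuration
data_proxy = ctx.data_proxy  # DataProxyClient instance used in the sketches below
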
upload_dataset_file(dataset_rid, transaction_rid, path, path_in_foundry_dataset)[source]#

Uploads a file into a foundry dataset.

Parameters:
  • dataset_rid (DatasetRid) – Unique identifier of the dataset

  • transaction_rid (TransactionRid) – transaction rid

  • path (Path) – File to upload

  • path_in_foundry_dataset (PathInDataset) – The destination path in the dataset

Return type:

requests.Response
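
A minimal sketch, assuming ctx from the example at the top of this page and an open transaction (the rids are placeholders):

from pathlib import Path

response = ctx.data_proxy.upload_dataset_file(
    dataset_rid="ri.foundry.main.dataset.<...>",          # placeholder
    transaction_rid="ri.foundry.main.transaction.<...>",  # placeholder: an open transaction
    path=Path("local/iris.csv"),
    path_in_foundry_dataset="iris.csv",
)
response.raise_for_status()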

upload_dataset_files(dataset_rid, transaction_rid, path_file_dict, max_workers=None)[source]#

Uploads multiple local files to a foundry dataset.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • transaction_rid (TransactionRid) – transaction rid

  • max_workers (int | None) – number of threads to use for the upload

  • path_file_dict (dict[PathInDataset, Path]) – A dictionary with the following structure:

{
    '<path_in_foundry_dataset>': '<local_file_path>',
    ...
}
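
A minimal sketch of a bulk upload, assuming ctx from the top of this page and an open transaction (the rids are placeholders):

from pathlib import Path

ctx.data_proxy.upload_dataset_files(
    dataset_rid="ri.foundry.main.dataset.<...>",
    transaction_rid="ri.foundry.main.transaction.<...>",
    path_file_dict={
        "data/a.csv": Path("local/a.csv"),  # PathInDataset -> local Path
        "data/b.csv": Path("local/b.csv"),
    },
    max_workers=4,  # optional: parallel upload threads
)
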
query_foundry_sql_legacy(query: str, return_type: Literal['pandas'], branch: Ref = 'master', sql_dialect: SqlDialect = 'SPARK', timeout: int = 600) pd.core.frame.DataFrame[source]#
query_foundry_sql_legacy(query: str, return_type: Literal['spark'], branch: Ref = 'master', sql_dialect: SqlDialect = 'SPARK', timeout: int = 600) pyspark.sql.DataFrame
query_foundry_sql_legacy(query: str, return_type: Literal['arrow'], branch: Ref = 'master', sql_dialect: SqlDialect = 'SPARK', timeout: int = 600) pa.Table
query_foundry_sql_legacy(query: str, return_type: Literal['raw'], branch: Ref = 'master', sql_dialect: SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]]
query_foundry_sql_legacy(query: str, return_type: SQLReturnType = 'raw', branch: Ref = 'master', sql_dialect: SqlDialect = 'SPARK', timeout: int = 600) tuple[dict, list[list]] | pd.core.frame.DataFrame | pa.Table | pyspark.sql.DataFrame

Queries the data proxy query API with Spark SQL.

Example

query_foundry_sql_legacy(
    query="SELECT * FROM `/Global/Foundry Operations/Foundry Support/iris`",
    branch="master",
)

Parameters:
  • query (str) – the SQL query to run

  • return_type (SQLReturnType) – one of 'pandas', 'spark', 'arrow' or 'raw' (the default), see the overloads above

  • branch (Ref) – the branch to query, defaults to 'master'

  • sql_dialect (SqlDialect) – the SQL dialect, defaults to 'SPARK'

  • timeout (int) – the query timeout, defaults to 600

Returns:

(foundry_schema, data)

data contains the data matrix and foundry_schema the Foundry schema (the fieldSchemaList key holds the column definitions). The pair can be converted to a pandas DataFrame, see below:

foundry_schema, data = self.query_foundry_sql_legacy(query, branch)
df = pd.DataFrame(
    data=data, columns=[e["name"] for e in foundry_schema["fieldSchemaList"]]
)

Return type:

tuple[dict, list[list]] | pd.core.frame.DataFrame | pa.Table | pyspark.sql.DataFrame, depending on return_type

Raises:
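
A minimal sketch of the typed overloads, assuming ctx from the top of this page (return_type="pandas" requires pandas to be installed):

df = ctx.data_proxy.query_foundry_sql_legacy(
    query="SELECT * FROM `/Global/Foundry Operations/Foundry Support/iris`",
    return_type="pandas",
)
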
download_dataset_file(dataset_rid, output_directory, foundry_file_path, view='master')[source]#

Downloads a single foundry dataset file into a directory.

The local folder (and its parents) will be created if it does not exist.

If you want the raw bytes instead of writing them to a file, use DataProxyClient.api_get_file_in_view() directly.

Parameters:
  • dataset_rid (DatasetRid) – the dataset rid

  • output_directory (Path) – the local output directory for the file

  • foundry_file_path (PathInDataset) – the file_path on the foundry file system

  • view (View) – branch or transaction rid of the dataset

Returns:

local file path

Return type:

Path
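
A minimal sketch, assuming ctx from the top of this page (the dataset rid is a placeholder):

from pathlib import Path

local_file = ctx.data_proxy.download_dataset_file(
    dataset_rid="ri.foundry.main.dataset.<...>",  # placeholder
    output_directory=Path("out"),
    foundry_file_path="iris.csv",
    view="master",  # branch name or transaction rid
)
print(local_file)  # e.g. out/iris.csv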

download_dataset_files(dataset_rid, output_directory, files=None, view='master', max_workers=None)[source]#

Downloads files of a dataset (in parallel) to a local output directory.

Parameters:
  • dataset_rid (DatasetRid) – the dataset rid

  • output_directory (Path) – the local output directory for the files

  • files (set[PathInDataset] | None) – set of files to download, or None to download all files of the view

  • view (View) – branch or transaction rid of the dataset

  • max_workers (int | None) – number of threads to use for the download

Returns:

paths of the downloaded files

Return type:

list[str]
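
A minimal sketch downloading a whole view, assuming ctx from the top of this page (the dataset rid is a placeholder):

from pathlib import Path

paths = ctx.data_proxy.download_dataset_files(
    dataset_rid="ri.foundry.main.dataset.<...>",  # placeholder
    output_directory=Path("out"),
    files=None,     # None downloads every file in the view
    max_workers=8,  # optional: parallel download threads
)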

api_put_file(dataset_rid, transaction_rid, logical_path, file_data, overwrite=None, **kwargs)[source]#

Opens, writes, and closes a file under the specified dataset and transaction.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • transaction_rid (TransactionRid) – transaction rid

  • logical_path (PathInDataset) – file path in dataset

  • file_data (str | bytes | IO[AnyStr]) – content of the file

  • overwrite (bool | None) – defaults to False; if True, the file is overwritten if it already exists in the transaction

  • **kwargs – gets passed to APIClient.api_request()

Return type:

requests.Response
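
A minimal sketch writing bytes into an open transaction, assuming ctx from the top of this page (the rids are placeholders):

response = ctx.data_proxy.api_put_file(
    dataset_rid="ri.foundry.main.dataset.<...>",
    transaction_rid="ri.foundry.main.transaction.<...>",
    logical_path="notes/readme.txt",
    file_data=b"hello foundry",
    overwrite=True,  # replace the file if it already exists in the transaction
)
response.raise_for_status()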

api_get_file(dataset_rid, transaction_rid, logical_path, range_header=None, requests_stream=True, **kwargs)[source]#

Returns a file from the specified dataset and transaction.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • transaction_rid (TransactionRid) – transaction rid

  • logical_path (PathInDataset) – path in dataset

  • range_header (str | None) – HTTP range header

  • requests_stream (bool) – passed to requests.Session.request() as stream

  • **kwargs – gets passed to APIClient.api_request()

Return type:

requests.Response
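
Since requests_stream defaults to True, the response can be consumed in chunks; a minimal sketch, assuming ctx from the top of this page (the rids are placeholders):

with ctx.data_proxy.api_get_file(
    dataset_rid="ri.foundry.main.dataset.<...>",          # placeholder
    transaction_rid="ri.foundry.main.transaction.<...>",  # placeholder
    logical_path="iris.csv",
) as response:
    with open("iris.csv", "wb") as f:
        for chunk in response.iter_content(chunk_size=65536):
            f.write(chunk)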

api_get_file_in_view(dataset_rid, end_ref, logical_path, start_transaction_rid=None, range_header=None, **kwargs)[source]#

Returns a file from the specified dataset and end ref.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • end_ref (View) – end ref/view

  • logical_path (PathInDataset) – PathInDataset

  • start_transaction_rid (TransactionRid | None) – start transaction rid

  • range_header (str | None) – HTTP range header

  • **kwargs – gets passed to APIClient.api_request()

Return type:

requests.Response
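
As noted under download_dataset_file(), this is the call to use when you want the bytes directly; a minimal sketch, assuming ctx from the top of this page:

response = ctx.data_proxy.api_get_file_in_view(
    dataset_rid="ri.foundry.main.dataset.<...>",  # placeholder
    end_ref="master",                             # branch name or transaction rid
    logical_path="iris.csv",
)
raw_bytes = response.content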

api_get_files(dataset_rid, transaction_rid, logical_paths, requests_stream=True, **kwargs)[source]#

Returns specified files as a zip archive.

If logical_paths is an empty set, it will return all files of the transaction.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • transaction_rid (TransactionRid) – transaction rid

  • logical_paths (set[PathInDataset]) – a set with paths in the dataset

  • requests_stream (bool) – passed to requests.Session.request() as stream

  • **kwargs – gets passed to APIClient.api_request()

Return type:

requests.Response
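
A minimal sketch unpacking the returned archive, assuming ctx from the top of this page (the rids are placeholders):

import io
import zipfile

response = ctx.data_proxy.api_get_files(
    dataset_rid="ri.foundry.main.dataset.<...>",          # placeholder
    transaction_rid="ri.foundry.main.transaction.<...>",  # placeholder
    logical_paths=set(),  # empty set: all files of the transaction
)
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    archive.extractall("out")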

api_get_files_in_view(dataset_rid, end_ref, logical_paths, start_transaction_rid=None, stream=True, **kwargs)[source]#

Returns the files specified by logical_paths at the given end_ref as a zip archive.

Parameters:
  • dataset_rid (DatasetRid) – dataset rid

  • end_ref (View) – end ref/view

  • logical_paths (set[PathInDataset]) – set of paths in the dataset

  • start_transaction_rid (TransactionRid | None) – start transaction rid

  • stream (bool) – passed to requests.Session.request()

  • **kwargs – gets passed to APIClient.api_request()

Return type:

requests.Response
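
The same zip handling applies here, but addressed by a view instead of a transaction; a minimal sketch, assuming ctx from the top of this page:

import io
import zipfile

response = ctx.data_proxy.api_get_files_in_view(
    dataset_rid="ri.foundry.main.dataset.<...>",  # placeholder
    end_ref="master",
    logical_paths={"iris.csv"},
)
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    print(archive.namelist())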

api_get_dataset_as_csv2(dataset_rid, branch_id, start_transaction_rid=None, end_transaction_rid=None, include_column_names=True, include_bom=True, **kwargs)[source]#

Gets dataset data with each record as a CSV line.

Parameters:
  • dataset_rid (DatasetRid) – the dataset rid

  • branch_id (Ref) – branch of the dataset

  • start_transaction_rid (TransactionRid | None) – start transaction rid

  • end_transaction_rid (TransactionRid | None) – end transaction rid

  • include_column_names (bool) – include column names

  • include_bom (bool) – include a byte order mark (BOM) in the output

  • **kwargs – gets passed to APIClient.api_request()

Returns:

Response with the CSV stream. Can be converted to a pandas DataFrame, see the sketch below.

Return type:

Response
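
A minimal sketch of the conversion mentioned above, assuming ctx from the top of this page:

import io

import pandas as pd

response = ctx.data_proxy.api_get_dataset_as_csv2(
    dataset_rid="ri.foundry.main.dataset.<...>",  # placeholder
    branch_id="master",
)
df = pd.read_csv(io.BytesIO(response.content))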

api_query_with_fallbacks2(query, fallback_branch_ids, dialect='SPARK', **kwargs)[source]#

Queries data from one or more tables and returns the results as JSON.

Parameters:
  • query (str) – the SQL query to run

  • fallback_branch_ids (list[Ref]) – fallback branch ids to try when resolving the datasets in the query

  • dialect (SqlDialect) – the SQL dialect, defaults to 'SPARK'

  • **kwargs – gets passed to APIClient.api_request()
Return type:

requests.Response
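
A minimal sketch, assuming ctx from the top of this page and that fallback_branch_ids takes a list of branch ids (an assumption based on the parameter name):

response = ctx.data_proxy.api_query_with_fallbacks2(
    query="SELECT 1",
    fallback_branch_ids=["master"],  # assumed: branch ids to fall back to
)
print(response.json())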