Loading a Delta Table
To load the current version, use the constructor:
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0")
Depending on your storage backend, you can use the storage_options parameter to provide configuration. The available options are defined per backend: S3 options, Azure options, and GCS options.
>>> storage_options = {"AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY":"THE_AWS_SECRET_ACCESS_KEY"}
>>> dt = DeltaTable("../rust/tests/data/delta-0.2.0", storage_options=storage_options)
The configuration can also be provided via the environment, in which case the storage service is inferred from the URL being used. Many of the well-known URL formats are supported (a minimal example follows the list below):
S3:
- s3://<bucket>/<path>
- s3a://<bucket>/<path>
Azure:
- az://<container>/<path>
- adl://<container>/<path>
- abfs://<container>/<path>
GCS:
- gs://<bucket>/<path>
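For example, with credentials already present in the environment, an S3 table can be loaded directly from its URI (a minimal sketch; the bucket and path are placeholders):

>>> import os
>>> os.environ["AWS_ACCESS_KEY_ID"] = "THE_AWS_ACCESS_KEY_ID"
>>> os.environ["AWS_SECRET_ACCESS_KEY"] = "THE_AWS_SECRET_ACCESS_KEY"
>>> from deltalake import DeltaTable
>>> dt = DeltaTable("s3://<bucket>/<path>")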
Alternatively, if you have a data catalog, you can load a table by referencing its database and table name. Currently only AWS Glue is supported; for the Glue catalog, authenticate via the standard AWS environment variables.
>>> from deltalake import DeltaTable
>>> from deltalake import DataCatalog
>>> database_name = "simple_database"
>>> table_name = "simple_table"
>>> data_catalog = DataCatalog.AWS
>>> dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)
>>> dt.to_pyarrow_table().to_pydict()
{'id': [5, 7, 9, 5, 6, 7, 8, 9]}
Custom Storage Backends
While deltalake always needs its internal storage backend, properly configured, to manage the Delta log, it can sometimes be advantageous - and is common practice in the Arrow ecosystem - to customize the storage interface used for reading the bulk data.
deltalake will work with any storage compliant with pyarrow.fs.FileSystem; however, the root of the filesystem has to be adjusted to point at the root of the Delta table. This can be achieved by wrapping the custom filesystem in a pyarrow.fs.SubTreeFileSystem.
import pyarrow.fs as fs
from deltalake import DeltaTable

path = "<path/to/table>"
# Scope the filesystem to the table root so file paths resolve relative to it.
filesystem = fs.SubTreeFileSystem(path, fs.LocalFileSystem())

dt = DeltaTable(path)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
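The returned object is an ordinary pyarrow dataset, so the usual dataset operations apply. For example (a brief sketch; the id column is borrowed from the sample tables above and may not exist in your table):

import pyarrow.dataset as pads

# Materialize only the rows matching a predicate; the filter is applied
# during the dataset scan.
table = ds.to_table(filter=pads.field("id") > 5)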
When using the pyarrow factory method for file systems, the normalized path is returned together with the filesystem on creation. For S3 this would look something like:
import pyarrow.fs as fs
from deltalake import DeltaTable

table_uri = "s3://<bucket>/<path>"
# from_uri returns both the filesystem and the path normalized for it.
raw_fs, normalized_path = fs.FileSystem.from_uri(table_uri)
filesystem = fs.SubTreeFileSystem(normalized_path, raw_fs)

dt = DeltaTable(table_uri)
ds = dt.to_pyarrow_dataset(filesystem=filesystem)
Time Travel
To load previous table states, you can provide the version number you wish to load:
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
Once you've loaded a table, you can also change versions using either a version number or datetime string:
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
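After loading, the active version can be verified with version() (a short sketch reusing the sample table; the output assumes that table has at least three versions):

>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
>>> dt.version()
2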
Warning: Previous table versions may not exist if they have been vacuumed, in which case an exception will be thrown. See Vacuuming tables for more information.