Notes:
- The format of the URL is not well documented; I also found that in some cases "s3a" can be used for non-AWS S3 object storage.
- DBMS_CLOUD does not allow URLs other than "https", so you will most likely have to create tables using the above syntax rather than the DBMS_CLOUD package.
Apache Iceberg
Once I was able to successfully access parquet objects directly, I began testing cataloging parquet objects with Apache Iceberg using Spark.
I found that the most difficult part was properly creating an Apache Iceberg manifest file that an external table definition could be built against.
Environment to build manifest
My testing environment to build the Apache Iceberg manifest contains the following.
Python
- Python 3.9.18
- Below is a list of the Python packages that are installed
Python 3.9.18
Package Version
------------------- -----------
aiobotocore 2.23.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.14
aioitertools 0.12.0
aiosignal 1.4.0
annotated-types 0.7.0
async-timeout 5.0.1
attrs 20.3.0
Babel 2.9.1
bcc 0.28.0
boto3 1.39.7
botocore 1.38.27
cachetools 5.5.2
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
circuitbreaker 1.3.2
click 8.1.8
cloud-init 23.4
cockpit 311.1
colorama 0.4.6
configobj 5.0.6
configshell-fb 1.1.30
cryptography 36.0.1
dasbus 1.4
dbus-python 1.2.18
distro 1.5.0
docutils 0.16
file-magic 0.4.0
frozenlist 1.7.0
fsspec 2025.7.0
gpg 1.15.1
greenlet 3.2.3
halo 0.0.31
idna 2.10
Jinja2 2.11.3
jmespath 1.0.1
jsonpatch 1.21
jsonpointer 2.0
jsonschema 3.2.0
kmod 0.1
Ksplice Uptrack 1.2.84
libcomps 0.1.18
lockfile 0.12.2
log-symbols 0.0.14
markdown-it-py 3.0.0
MarkupSafe 1.1.1
mdurl 0.1.2
mmh3 5.1.0
multidict 6.6.3
netaddr 0.8.0
netifaces 0.10.6
nftables 0.1
numpy 2.0.2
nvmetcli 0.7
oauthlib 3.1.1
oci 2.126.3
oci-utils 0.14.0
oracledb 3.2.0
pandas 2.3.1
parquet_tools 0.2.16
pciutils 2.3.6
pcp 5.0
perf 0.1
pexpect 4.8.0
pip 25.1.1
ply 3.11
prettytable 0.7.2
propcache 0.3.2
psutil 5.8.0
ptyprocess 0.6.0
py4j 0.10.9.9
pyarrow 20.0.0
pycparser 2.20
pycurl 7.43.0.6
pydantic 2.11.7
pydantic_core 2.33.2
Pygments 2.19.2
PyGObject 3.40.1
pyiceberg 0.9.1
pyOpenSSL 19.0.0
pyparsing 3.2.3
pyrsistent 0.17.3
pyserial 3.4
PySocks 1.7.1
pyspark 4.0.0
python-daemon 2.2.4
python-dateutil 2.9.0.post0
python-dmidecode 3.12.2
python-linux-procfs 0.7.3
pytz 2021.1
pyudev 0.22.0
PyYAML 5.4.1
requests 2.25.1
rhnlib 2.8.6
rich 13.9.4
rpm 4.16.1.3
s3fs 2025.7.0
s3transfer 0.13.0
sdnotify 0.3.2
selinux 3.6
sepolicy 3.6
setools 4.4.4
setroubleshoot 3.3.32
setuptools 53.0.0
six 1.15.0
sortedcontainers 2.4.0
sos 4.7.0
spinners 0.0.24
SQLAlchemy 2.0.41
strictyaml 1.7.3
systemd-python 234
tabulate 0.9.0
tenacity 9.1.2
termcolor 3.1.0
thrift 0.16.0
typing_extensions 4.14.0
typing-inspection 0.4.1
tzdata 2025.2
urllib3 1.26.5
urwid 2.1.2
wrapt 1.17.2
yarl 1.20.1
PySpark & Spark runtime
PySpark version: 4.0.0
Java version: 21.0.7
Scala version: 2.13.16
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 4.0.0
/_/
Using Scala version 2.13.16, OpenJDK 64-Bit Server VM, 21.0.7
Branch HEAD
Compiled by user wenchen on 2025-05-19T07:58:03Z
Revision fa33ea000a0bda9e5a3fa1af98e8e85b8cc5e4d4
Url https://github.com/apache/spark
Environment variables
HADOOP_OPTS= -Djava.library.path=/usr/local/hadoop/hadoop-3.4.1/lib/
JAVA_HOME=/usr/lib/jvm/java-21-openjdk-21.0.7.0.6-1.0.1.el9.x86_64
HADOOP_HOME=/usr/local/hadoop/hadoop-3.4.1
SPARK_HOME=/usr/local/spark
HADOOP_COMMON_LIB_NATIVE_DIR=/usr/local/hadoop/hadoop-3.4.1/lib/native/
Script to build the manifest and store the parquet objects
I used the publicly available NY State Taxi trip data to build my objects.
For my testing, I used the yellow taxi datasets.
Below is the section of my code that builds the Spark catalog on OCI S3 object storage and sets the environment using my credentials.
NOTE: It points to my endpoint and uses my access key and secret; I removed those values from the script.
Python script to set the environment and Hadoop catalog (Part #1)
from pyspark.sql import SparkSession

ICEBERG = "org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0"
HADOOP_CLOUD = "org.apache.spark:spark-hadoop-cloud_2.13:4.0.0"

spark = (
    SparkSession.builder
    .appName("iceberg-oci-s3")
    .config("spark.jars.packages", ",".join([ICEBERG, HADOOP_CLOUD]))
    # Iceberg catalog -> Hadoop catalog in an OCI S3 warehouse
    .config("spark.sql.catalog.oci", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.oci.type", "hadoop")
    .config("spark.sql.catalog.oci.warehouse", "s3a://icebergs3/iceberg_warehouse")
    # OCI S3-compatible endpoint
    .config("spark.hadoop.fs.s3a.endpoint", "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.endpoint.region", "us-ashburn-1")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.list.version", "1")
    .getOrCreate()
)

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "{my access key}")            # OCI 'Customer Secret Key' ID
hconf.set("fs.s3a.secret.key", "{my secret for that key}")   # the secret key
hconf.set("fs.s3a.connection.ssl.enabled", "true")
Python script to create namespace and create table (Part #2)
spark.sql("""
CREATE NAMESPACE IF NOT EXISTS oci_db
""")
import pyarrow.parquet as pq
spark.sql("""
CREATE NAMESPACE IF NOT EXISTS oci.tripdata
""")
spark.sql("""
CREATE TABLE IF NOT EXISTS oci.tripdata.yellow (
`VendorID` INT,
`tpep_pickup_datetime` TIMESTAMP,
`tpep_dropoff_datetime` TIMESTAMP,
`passenger_count` BIGINT,
`trip_distance` DOUBLE,
`RatecodeID` BIGINT,
`store_and_fwd_flag` STRING,
`PULocationID` INT,
`DOLocationID` INT,
`payment_type` BIGINT,
`fare_amount` DOUBLE,
`extra` DOUBLE,
`mta_tax` DOUBLE,
`tip_amount` DOUBLE,
`tolls_amount` DOUBLE,
`improvement_surcharge` DOUBLE,
`total_amount` DOUBLE,
`congestion_surcharge` DOUBLE,
`Airport_fee` DOUBLE
)
USING iceberg
PARTITIONED BY (months(tpep_pickup_datetime))
""")
Python script to read the parquet object, set the column types, and append to the Iceberg table (Part #3)
from pyspark.sql import functions as F

sdf = spark.read.parquet("{parquet object}")   # Location and name of parquet object to load

sdf_cast = (
    sdf
    .withColumn("VendorID", F.col("VendorID").cast("int"))
    .withColumn("tpep_pickup_datetime", F.col("tpep_pickup_datetime").cast("timestamp"))
    .withColumn("tpep_dropoff_datetime", F.col("tpep_dropoff_datetime").cast("timestamp"))
    .withColumn("passenger_count", F.col("passenger_count").cast("bigint"))
    .withColumn("trip_distance", F.col("trip_distance").cast("double"))
    .withColumn("RatecodeID", F.col("RatecodeID").cast("bigint"))
    .withColumn("store_and_fwd_flag", F.col("store_and_fwd_flag").cast("string"))
    .withColumn("PULocationID", F.col("PULocationID").cast("int"))
    .withColumn("DOLocationID", F.col("DOLocationID").cast("int"))
    .withColumn("payment_type", F.col("payment_type").cast("bigint"))
    .withColumn("fare_amount", F.col("fare_amount").cast("double"))
    .withColumn("extra", F.col("extra").cast("double"))
    .withColumn("mta_tax", F.col("mta_tax").cast("double"))
    .withColumn("tip_amount", F.col("tip_amount").cast("double"))
    .withColumn("tolls_amount", F.col("tolls_amount").cast("double"))
    .withColumn("improvement_surcharge", F.col("improvement_surcharge").cast("double"))
    .withColumn("total_amount", F.col("total_amount").cast("double"))
    .withColumn("congestion_surcharge", F.col("congestion_surcharge").cast("double"))
    .withColumn("Airport_fee", F.col("Airport_fee").cast("double"))
)

sdf_cast.writeTo("oci.tripdata.yellow").append()
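After the append, I like to confirm the row count and look at the snapshot history Iceberg keeps for the table. The queries below are an optional check, not part of the original script; the snapshots metadata table is a standard Iceberg feature exposed through the Spark catalog.

# Optional verification: total row count and the snapshot history of the table.
spark.sql("SELECT count(*) FROM oci.tripdata.yellow").show()
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM oci.tripdata.yellow.snapshots
    ORDER BY committed_at
""").show(truncate=False)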
Summary:
Combine the three parts of the script, fill in the credentials, endpoint, and bucket, and specify the parquet object to load; the data will then be written to object storage correctly, including the manifest file.
Investigating the resulting manifest and objects
Below is a list of the objects created in the metadata directory.
You can see that as I updated and added data, each write created a new version of the manifest along with snapshot information.
When writing the data, I chose to partition it; below is the partitioned data. You can see that when Apache Iceberg wrote the data, it automatically created parquet objects in a directory for each partition and added new data files to the current partition's directory.
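If you want to produce that listing without the OCI console, the s3fs package (already in the package list above) can browse the bucket through the same S3-compatible endpoint. The sketch below uses placeholder values for the endpoint and the customer secret key pair, just like the Spark script.

import s3fs

# Hypothetical listing of the Iceberg metadata and data prefixes; substitute
# your own endpoint and customer secret key pair.
fs = s3fs.S3FileSystem(
    key="{my access key}",
    secret="{my secret for that key}",
    client_kwargs={"endpoint_url": "https://{my tenancy/namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com"},
)
for obj in fs.ls("icebergs3/iceberg_warehouse/tripdata/yellow/metadata"):
    print(obj)   # v*.metadata.json files plus snapshot and manifest .avro files
for obj in fs.ls("icebergs3/iceberg_warehouse/tripdata/yellow/data"):
    print(obj)   # one subdirectory per month partition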
Creating an Oracle table on top of Iceberg
Now that I have created my Iceberg table and it is stored in object storage, I can create an Oracle external table that reads the manifest file.
In my example, the most recent manifest object is "iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json".
My External Table definition is below.
CREATE TABLE CUSTOMERS_ICEBERG
(
    VendorID               integer,
    tpep_pickup_datetime   timestamp,
    tpep_dropoff_datetime  timestamp,
    passenger_count        number,
    trip_distance          number,
    RatecodeID             number,
    store_and_fwd_flag     blob,
    PULocationID           number,
    DOLocationID           number,
    payment_type           number,
    fare_amount            number,
    extra                  number,
    mta_tax                number,
    tip_amount             number,
    tolls_amount           number,
    improvement_surcharge  number,
    total_amount           number,
    congestion_surcharge   number,
    Airport_fee            number
)
ORGANIZATION EXTERNAL
(
    TYPE ORACLE_BIGDATA
    DEFAULT DIRECTORY DATA_PUMP_DIR
    ACCESS PARAMETERS
    (
        com.oracle.bigdata.fileformat=parquet
        com.oracle.bigdata.credential.name=OCI_S3
        com.oracle.bigdata.access_protocol=iceberg
    )
    LOCATION ('iceberg:https://{namespace}.compat.objectstorage.us-ashburn-1.oci.customer-oci.com/icebergs3/iceberg_warehouse/tripdata/yellow/metadata/v7.metadata.json')
)
PARALLEL;
NOTE: the LOCATION prefix is "iceberg:https" rather than plain "https".
I can now select from my new table and it will read the Iceberg manifest file.
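As a quick check from the Python side, the python-oracledb driver (also in the package list above) can run that select. The connection details below are placeholders, not values from my environment.

import oracledb

# Hypothetical connection details; substitute your own user, password and connect string.
with oracledb.connect(user="{db user}", password="{db password}", dsn="{db connect string}") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM customers_iceberg")
        print("rows read through the Iceberg manifest:", cur.fetchone()[0])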
Conclusion:
Oracle Database 23ai not only supports creating external tables on top of Parquet objects, but it also supports creating external tables on top of Apache Iceberg manifest objects.