Creating, Querying and Loading Data in to the Oracle Autonomous Data Warehouse

ADW

Oracle Autonomous Data Warehouse (ADW) access via Python

The following shows how to access the Oracle Autonomous Data Warehouse using Python and how to load the data using the DBMS_CLOUD package via the cx_Oracle module. This is obviously simpler via the Graphical front end or SQL Developer but using Python provdes a simple scriptable model whilst hiding some of the complexities of using native REST APIs.

This simple example assumes that you've got an Oracle Cloud account and that you've created or got access an ADW database. You'll also have to download the credentials file to provide SQL*Net access. You can find the details on how to do that here. We'll be using Python in this short example but most of what we're doing could be achieved using the GUI and/or REST Calls.

Connecting to ADW Instance

To start with we'll make sure we can connect to the ADW Instance we've previously created. To do that we need to import the required libraries. If you dodn't have these I reccommend using PIP (and virtualenv)

In [ ]:
import cx_Oracle
import keyring
import os

We need to use an environment variable to reflect the location of the downloaded credentials files to be used by SQL*Net.

In [ ]:
os.environ['TNS_ADMIN'] = '/Users/dgiles/Downloads/wallet_DOMSDB'

This is equivlent to bash export TNS_ADMIN=/Users/dgiles/Downloads/wallet_DOMSDB and points to the unzipped directory containing the tnsnames.ora, sqlnet.ora etc. NOTE: you'll need to update the sqlnet.ora to ensure the wallet points to the same directory specified in the TNS_ADMIN environment variable. i.e.

WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/Users/dgiles/Downloads/wallet_DOMSDB")))
SSL_SERVER_DN_MATCH=yes

In the example above I've changed DIRECTORY to the location where I downloaded and unzipped the credentials file.

The next steps are to connect to the Oracle ADW instance. In the example below I've store my password using the Python Module "keyring". I'm also using the jupyter notebook magic sql functionality. We'll test the connection using the admin user and the connect string "domsdb_medium" which is one of the services included in the tnsnames.ora file.

In [ ]:
%load_ext sql
In [257]:
password = keyring.get_password('adw','admin')
%sql oracle+cx_oracle://admin:$password@domsdb_medium
Out[257]:
'Connected: admin@None'
In [258]:
%%sql admin@domsdb_medium
select 1 from dual
0 rows affected.
Out[258]:
1
1

Generating test data

We've connected to the oracle database and can now start uploading data to the instance. In this example we'll use datagenerator to generate the data into flat files and then place these on Oracle Object Store and load them from there.

The first step is to install datagenerator. You can find details on how to do that here. We can now simply generate data for the "SH" benchmark.

In [259]:
import subprocess

# Change the 2 following parameters to reflect your environment
generated_data_dir = '/Users/dgiles/Downloads/generated_data'
datagenerator_home = '/Users/dgiles/datagenerator'

# Change the following paramters relating to the way datagenerator will create the data
scale = 100
parallel = 8

dg_command = '{dg}/bin/datagenerator -c {dg}/bin/sh.xml -scale {s} -cl -f -d {gdd} -tc {p}'.format(
    dg = datagenerator_home,
    s = scale,
    gdd = generated_data_dir,
    p = parallel
)
# Typically we'd use a command similiar to the one below but since we're in a notebook it's easier to use the default functionality
# p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
# (output, err) = p.communicate()

print(dg_command)
!{dg_command}
/Users/dgiles/datagenerator/bin/datagenerator -c /Users/dgiles/datagenerator/bin/sh.xml -scale 100 -cl -f -d /Users/dgiles/Downloads/generated_data -tc 8
Started Datagenerator, Version 0.4.0.1083

============================================
|           Datagenerator Run Stats        |
============================================
Connection Time                        0:00:00.000
Data Generation Time                   0:00:26.274
DDL Creation Time                      0:00:00.000
Total Run Time                         0:00:26.284
Rows Inserted per sec                        5,367
Data Generated (MB) per sec                    0.5
Actual Rows Generated                      137,000
Commits Completed                                0
Batch Updates Completed                          0

We should now have a series of files in the "generated_data_dir" directory. These will be a mix of csv files, create table scripts, loader scripts etc.

In [255]:
!ls {generated_data_dir}
CHANNELS.csv                   SALES.csv
CHANNELS.ctl                   SALES.ctl
COUNTRIES.csv                  SUPPLEMENTARY_DEMOGRAPHICS.csv
COUNTRIES.ctl                  SUPPLEMENTARY_DEMOGRAPHICS.ctl
CUSTOMERS.csv                  constraints.sql
CUSTOMERS.ctl                  createindexes.sql
PRODUCTS.csv                   createsequences.sql
PRODUCTS.ctl                   createtables.sql
PROMOTIONS.csv                 droptables.sql
PROMOTIONS.ctl

Uploading the data to the Oracle Object Store

We're really only interested in the "csv" files so we'll upload just those. But before we do this we'll need to establish a connection to the Oracle Object Store. I give some detail behind how to do this in this notebook. I'll be using the object storage out of the Frankfurt Region.

In [256]:
import oci
import ast

my_config = ast.literal_eval(keyring.get_password('oci_opj','doms'))
my_config['region'] = 'eu-frankfurt-1'
object_storage_client = oci.object_storage.ObjectStorageClient(my_config)
namespace = object_storage_client.get_namespace().data

We've now got a handle to the Oracle Object Store Client so we can now create a bucket which we'll call and upload the "CSV" Files.

In [ ]:
import os, io

bucket_name = 'Sales_Data'

files_to_process = [file for file in os.listdir(generated_data_dir) if file.endswith('csv')]

try:
    create_bucket_response = object_storage_client.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name=bucket_name,
            compartment_id=my_config['tenancy']
        )
    )
except Exception as e:
    print(e.message)


for upload_file in files_to_process:
    print('Uploading file {}'.format(upload_file))
    object_storage_client.put_object(namespace, bucket_name, upload_file, io.open(os.path.join(generated_data_dir,upload_file),'r'))    

We need to create an authentication token that can be used by the ADW instance to access our Object storage. To do this we need to create an identity client.

In [ ]:
indentity_client = oci.identity.IdentityClient(my_config)
In [ ]:
token = indentity_client.create_auth_token(
    oci.identity.models.CreateAuthTokenDetails(
        description = "Token used to provide access to newly loaded files"
    ),
    user_id = my_config['user']
)

Creating users and tables in the ADW Instance

The following steps will feel very familiar to any DBA/developer of and Oracle database. We need to create a schema and assocated tables to load the data into.

First we'll need to create a user/schema and grant it the appropriate roles

In [ ]:
%sql create user mysh identified by ReallyLongPassw0rd default tablespace data

Grant the "mysh" user the DWROLE

In [ ]:
%sql grant DWROLE to mysh
In [ ]:
%sql oracle+cx_oracle://mysh:ReallyLongPassw0rd@domsdb_medium

We can now create the tables we'll use to load the data into.

In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE COUNTRIES (
  COUNTRY_ID           NUMBER       NOT NULL,
  COUNTRY_ISO_CODE     CHAR(2)      NOT NULL,
  COUNTRY_NAME         VARCHAR2(40) NOT NULL,
  COUNTRY_SUBREGION    VARCHAR2(30) NOT NULL,
  COUNTRY_SUBREGION_ID NUMBER       NOT NULL,
  COUNTRY_REGION       VARCHAR2(20) NOT NULL,
  COUNTRY_REGION_ID    NUMBER       NOT NULL,
  COUNTRY_TOTAL        NUMBER(9)    NOT NULL,
  COUNTRY_TOTAL_ID     NUMBER       NOT NULL,
  COUNTRY_NAME_HIST    VARCHAR2(40)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE SALES (
  PROD_ID       NUMBER     NOT NULL,
  CUST_ID       NUMBER     NOT NULL,
  TIME_ID       DATE       NOT NULL,
  CHANNEL_ID    NUMBER     NOT NULL,
  PROMO_ID      NUMBER     NOT NULL,
  QUANTITY_SOLD NUMBER(10) NOT NULL,
  AMOUNT_SOLD   NUMBER(10) NOT NULL
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE SUPPLEMENTARY_DEMOGRAPHICS (
  CUST_ID                 NUMBER NOT NULL,
  EDUCATION               VARCHAR2(21),
  OCCUPATION              VARCHAR2(21),
  HOUSEHOLD_SIZE          VARCHAR2(21),
  YRS_RESIDENCE           NUMBER,
  AFFINITY_CARD           NUMBER(10),
  BULK_PACK_DISKETTES     NUMBER(10),
  FLAT_PANEL_MONITOR      NUMBER(10),
  HOME_THEATER_PACKAGE    NUMBER(10),
  BOOKKEEPING_APPLICATION NUMBER(10),
  PRINTER_SUPPLIES        NUMBER(10),
  Y_BOX_GAMES             NUMBER(10),
  OS_DOC_SET_KANJI        NUMBER(10),
  COMMENTS                VARCHAR2(4000)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE CUSTOMERS (
  CUST_ID                NUMBER       NOT NULL,
  CUST_FIRST_NAME        VARCHAR2(20) NOT NULL,
  CUST_LAST_NAME         VARCHAR2(40) NOT NULL,
  CUST_GENDER            CHAR(1)      NOT NULL,
  CUST_YEAR_OF_BIRTH     NUMBER(4)    NOT NULL,
  CUST_MARITAL_STATUS    VARCHAR2(20),
  CUST_STREET_ADDRESS    VARCHAR2(40) NOT NULL,
  CUST_POSTAL_CODE       VARCHAR2(10) NOT NULL,
  CUST_CITY              VARCHAR2(30) NOT NULL,
  CUST_CITY_ID           NUMBER       NOT NULL,
  CUST_STATE_PROVINCE    VARCHAR2(40) NOT NULL,
  CUST_STATE_PROVINCE_ID NUMBER       NOT NULL,
  COUNTRY_ID             NUMBER       NOT NULL,
  CUST_MAIN_PHONE_NUMBER VARCHAR2(25) NOT NULL,
  CUST_INCOME_LEVEL      VARCHAR2(30),
  CUST_CREDIT_LIMIT      NUMBER,
  CUST_EMAIL             VARCHAR2(40),
  CUST_TOTAL             VARCHAR2(14) NOT NULL,
  CUST_TOTAL_ID          NUMBER       NOT NULL,
  CUST_SRC_ID            NUMBER,
  CUST_EFF_FROM          DATE,
  CUST_EFF_TO            DATE,
  CUST_VALID             VARCHAR2(1)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE CHANNELS (
  CHANNEL_ID       NUMBER       NOT NULL,
  CHANNEL_DESC     VARCHAR2(20) NOT NULL,
  CHANNEL_CLASS    VARCHAR2(20) NOT NULL,
  CHANNEL_CLASS_ID NUMBER       NOT NULL,
  CHANNEL_TOTAL    VARCHAR2(13) NOT NULL,
  CHANNEL_TOTAL_ID NUMBER       NOT NULL
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE PRODUCTS (
  PROD_ID               NUMBER(6)      NOT NULL,
  PROD_NAME             VARCHAR2(50)   NOT NULL,
  PROD_DESC             VARCHAR2(4000) NOT NULL,
  PROD_SUBCATEGORY      VARCHAR2(50)   NOT NULL,
  PROD_SUBCATEGORY_ID   NUMBER         NOT NULL,
  PROD_SUBCATEGORY_DESC VARCHAR2(2000) NOT NULL,
  PROD_CATEGORY         VARCHAR2(50)   NOT NULL,
  PROD_CATEGORY_ID      NUMBER         NOT NULL,
  PROD_CATEGORY_DESC    VARCHAR2(2000) NOT NULL,
  PROD_WEIGHT_CLASS     NUMBER(3)      NOT NULL,
  PROD_UNIT_OF_MEASURE  VARCHAR2(20),
  PROD_PACK_SIZE        VARCHAR2(30)   NOT NULL,
  SUPPLIER_ID           NUMBER(6)      NOT NULL,
  PROD_STATUS           VARCHAR2(20)   NOT NULL,
  PROD_LIST_PRICE       NUMBER(8)      NOT NULL,
  PROD_MIN_PRICE        NUMBER(8)      NOT NULL,
  PROD_TOTAL            VARCHAR2(13)   NOT NULL,
  PROD_TOTAL_ID         NUMBER         NOT NULL,
  PROD_SRC_ID           NUMBER,
  PROD_EFF_FROM         DATE,
  PROD_EFF_TO           DATE,
  PROD_VALID            VARCHAR2(1)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE PROMOTIONS (
  PROMO_ID             NUMBER(6)    NOT NULL,
  PROMO_NAME           VARCHAR2(30) NOT NULL,
  PROMO_SUBCATEGORY    VARCHAR2(30) NOT NULL,
  PROMO_SUBCATEGORY_ID NUMBER       NOT NULL,
  PROMO_CATEGORY       VARCHAR2(30) NOT NULL,
  PROMO_CATEGORY_ID    NUMBER       NOT NULL,
  PROMO_COST           NUMBER(10)   NOT NULL,
  PROMO_BEGIN_DATE     DATE         NOT NULL,
  PROMO_END_DATE       DATE         NOT NULL,
  PROMO_TOTAL          VARCHAR2(15) NOT NULL,
  PROMO_TOTAL_ID       NUMBER       NOT NULL
)
In [245]:
%%sql mysh@domsdb_medium
CREATE TABLE times AS
SELECT udate time_id,
  TO_CHAR(udate,'Day'Winking day_name,
  TO_CHAR(udate,'DD'Winking day_number_in_month,
  TO_CHAR(udate,'DDD'Winking day_number_in_year,
  TO_CHAR(udate,'YYYY' ) calendar_year,
  TO_CHAR(udate,'Q' ) calendar_quarter_number,
  TO_CHAR(udate,'MM' ) calendar_month_number,
  TO_CHAR(udate,'WW' ) calendar_week_number,
  TO_CHAR(udate,'YYYY-MM'Winking calendar_month_desc,
  TO_CHAR(udate,'YYYY-Q'Winking calendar_quarter_desc
FROM
  (SELECT to_date('31/12/1994','DD/MM/YYYY'Winking+rownum udate
  FROM all_objects
  WHERE to_date('31/12/1994','DD/MM/YYYY'Winking+rownum <= to_date( '31/12/2014','DD/MM/YYYY'Winking
  )
7305 rows affected.
Out[245]:
[]
In [248]:
%%sql mysh@domsdb_medium
select * from tab
0 rows affected.
Out[248]:
tname tabtype clusterid
COUNTRIES TABLE None
SALES TABLE None
SUPPLEMENTARY_DEMOGRAPHICS TABLE None
CUSTOMERS TABLE None
CHANNELS TABLE None
PRODUCTS TABLE None
PROMOTIONS TABLE None
TIMES TABLE None

Copying the data from the object store

We need to add the authorisation token to the newly created schema to allow it to access the object stores files. We can't do this using the sql magic syntax we've been using till this point so we'll do it using standard cx_Oracle calls.

In [ ]:
connection = cx_Oracle.connect('mysh', 'ReallyLongPassw0rd', 'domsdb_medium')
cursor = connection.cursor();
In [ ]:
cursor.callproc('DBMS_CLOUD.create_credential', keywordParameters = {'credential_name':'SALES_DATA_AUTH',
                                                                    'username':'dominic.giles@oracle.com',
                                                                    'password':token.data.token})

We can access the object storage using a url of the the format

https://swiftobjectstorage.<region>.oraclecloud.com/v1/<tenancy>/<bucket name>/<object name>

We can use this to dynamically generate a url for each of the objects inside of the bucket we've just created and use the DBMS_CLOUD package to copy the data into the ADW instance. The code below gets all of the names of the tables we've just created and the loops through each table copying the associated csv file into the ADW instance.

In [ ]:
from tqdm import tqdm

format = '''{"delimiter" : ",",
          "skipheaders" : 1,
          "ignoremissingcolumns" : "true",
          "removequotes" : "true",
          "dateformat" : "DD-MON-YYYY HH24:MI:SS",
          "blankasnull" : "true"}'''

file_location = '''https://swiftobjectstorage.{region}.oraclecloud.com/v1/{tenancy}/{bucket_name}/{table_name}.csv'''
region = my_config['region']
tenancy= 'oracleonpremjava'


rs = cursor.execute("select table_name from user_tables where table_name not like 'COPY%'")
rows = rs.fetchall()
for row in tqdm(rows):
    url = file_location.format(region=region, tenancy=tenancy, bucket_name=bucket_name, table_name=row[0])
    cursor.callproc('DBMS_CLOUD.copy_data', keywordParameters= {'table_name':row[0],
                                                           'credential_name':'SALES_DATA_AUTH',
                                                           'file_uri_list':url, 
                                                           'format': format
                                                           })
                               

We can now take a look and see how many rows we've loaded into the tables

In [261]:
rs = cursor.execute("select table_name from user_tables where table_name not like 'COPY%'")
rows = rs.fetchall()

for row in rows:
    rs2 = cursor.execute("select count(*) from {}".format(row[0]))
    rows2 = rs2.fetchone()
    print('{tn: <35}{rc:>10,}'.format(tn=row[0],rc=rows2[0]))
COUNTRIES                                  22
SALES                                  97,799
SUPPLEMENTARY_DEMOGRAPHICS             19,599
CUSTOMERS                              19,599
CHANNELS                                    4
PRODUCTS                                   71
PROMOTIONS                                502
TIMES                                   7,305

It's now possible to run standard queries against the newly loaded data. No need to create anything else (indexes etc.)

In [249]:
%%sql mysh@domsdb_medium
SELECT channels.channel_desc, 
       countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999'Winking SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id 
  AND sales.cust_id=customers.cust_id 
  AND customers.country_id = countries.country_id
  AND sales.channel_id = channels.channel_id 
  AND channels.channel_desc IN ('Tele Sales','Internet'Winking 
  AND times.calendar_year = '2006' 
  AND countries.country_iso_code IN ('GB','DE','FR','DK'Winking
GROUP BY 
  ROLLUP(channels.channel_desc,countries.country_iso_code)
ORDER BY 1
0 rows affected.
Out[249]:
channel_desc country_iso_code sales$
Internet DE          3,130
Internet DK          1,676
Internet FR          2,545
Internet GB          2,320
Internet None          9,671
Tele Sales DE          3,077
Tele Sales DK          3,116
Tele Sales FR          3,184
Tele Sales GB          2,386
Tele Sales None         11,763
None None         21,434

Tidying up the object store and database

You can run the following steps if you want to remove all of the tables from the schema and purge the object store of files. Lets start by removing the tables. NOTE : The code below will remove all of the tables from the schema. Make sure you've not got anything in the schema that you want to keep before running it.

In [ ]:
from tqdm import tqdm

rs = cursor.execute("select table_name from user_tables")
rows = rs.fetchall()

for row in tqdm(rows):
    rs2 = cursor.execute("drop table {} purge".format(row[0]))

And then removing the object store files and bucket

In [ ]:
object_list = object_storage_client.list_objects(namespace, bucket_name)

for o in object_list.data.objects:
    print('Deleting object {}'.format(o.name))
    object_storage_client.delete_object(namespace, bucket_name, o.name)

print('Deleting bucket')    
response = object_storage_client.delete_bucket(namespace, bucket_name)
blog comments powered by Disqus