Creating, Querying and Loading Data into the Oracle Autonomous Data Warehouse


Oracle Autonomous Data Warehouse (ADW) access via Python

The following shows how to access the Oracle Autonomous Data Warehouse using Python and how to load data with the DBMS_CLOUD package via the cx_Oracle module. This is obviously simpler via the graphical front end or SQL Developer, but using Python provides a simple, scriptable model whilst hiding some of the complexities of using the native REST APIs.

This simple example assumes that you've got an Oracle Cloud account and that you've created, or have access to, an ADW database. You'll also have to download the credentials file to provide SQL*Net access. You can find the details on how to do that here. We'll be using Python in this short example but most of what we're doing could be achieved using the GUI and/or REST calls.
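If you'd like to script the unzip of the downloaded credentials file too, a minimal sketch is below (the zip path is an assumed example matching the wallet directory used later):

import zipfile

# Assumed location of the downloaded wallet zip; adjust to wherever yours landed
wallet_zip = '/Users/dgiles/Downloads/wallet_DOMSDB.zip'
zipfile.ZipFile(wallet_zip).extractall('/Users/dgiles/Downloads/wallet_DOMSDB')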

Connecting to ADW Instance

To start with we'll make sure we can connect to the ADW instance we've previously created. To do that we need to import the required libraries. If you don't have these I recommend installing them with pip (inside a virtualenv), as sketched below.
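For reference, a notebook cell along the following lines will install everything used in this post (this is my assumed package list; ipython-sql is the package that provides the %sql magic used later):

!pip install cx_Oracle keyring oci tqdm ipython-sql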

In [ ]:
import cx_Oracle
import keyring
import os

We need to use an environment variable to reflect the location of the downloaded credentials files to be used by SQL*Net.

In [ ]:
os.environ['TNS_ADMIN'] = '/Users/dgiles/Downloads/wallet_DOMSDB'

This is equivalent to bash export TNS_ADMIN=/Users/dgiles/Downloads/wallet_DOMSDB and points to the unzipped directory containing the tnsnames.ora, sqlnet.ora etc. NOTE: you'll need to update the sqlnet.ora to ensure the wallet points to the same directory specified in the TNS_ADMIN environment variable, i.e.

WALLET_LOCATION = (SOURCE = (METHOD = file) (METHOD_DATA = (DIRECTORY="/Users/dgiles/Downloads/wallet_DOMSDB")))
SSL_SERVER_DN_MATCH=yes

In the example above I've changed DIRECTORY to the location where I downloaded and unzipped the credentials file.
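If you'd rather not edit sqlnet.ora by hand, a small sketch like the following (assuming the WALLET_LOCATION line looks like the one above) rewrites the DIRECTORY entry to match TNS_ADMIN:

import os, re

# Point the wallet DIRECTORY at the same location as TNS_ADMIN
sqlnet = os.path.join(os.environ['TNS_ADMIN'], 'sqlnet.ora')
with open(sqlnet) as f:
    contents = f.read()
contents = re.sub(r'DIRECTORY="[^"]*"',
                  'DIRECTORY="{}"'.format(os.environ['TNS_ADMIN']),
                  contents)
with open(sqlnet, 'w') as f:
    f.write(contents)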

The next step is to connect to the Oracle ADW instance. In the example below I've stored my password using the Python module "keyring". I'm also using the Jupyter notebook SQL magic functionality. We'll test the connection using the admin user and the connect string "domsdb_medium", which is one of the services included in the tnsnames.ora file.

In [ ]:
%load_ext sql
In [257]:
password = keyring.get_password('adw','admin')
%sql oracle+cx_oracle://admin:$password@domsdb_medium
Out[257]:
'Connected: admin@None'
In [258]:
%%sql admin@domsdb_medium
select 1 from dual
0 rows affected.
Out[258]:
1
1

Generating test data

We've connected to the Oracle database and can now start uploading data to the instance. In this example we'll use datagenerator to generate the data into flat files and then place these on the Oracle Object Store and load them from there.

The first step is to install datagenerator. You can find details on how to do that here. We can now simply generate data for the "SH" benchmark.

In [259]:
import subprocess

# Change the 2 following parameters to reflect your environment
generated_data_dir = '/Users/dgiles/Downloads/generated_data'
datagenerator_home = '/Users/dgiles/datagenerator'

# Change the following parameters relating to the way datagenerator will create the data
scale = 100
parallel = 8

dg_command = '{dg}/bin/datagenerator -c {dg}/bin/sh.xml -scale {s} -cl -f -d {gdd} -tc {p}'.format(
    dg = datagenerator_home,
    s = scale,
    gdd = generated_data_dir,
    p = parallel
)
# Typically we'd use a command similar to the one below, but since we're in a notebook it's easier to use the ! shell escape
# p = subprocess.Popen(dg_command, shell=True, stdout=subprocess.PIPE)
# (output, err) = p.communicate()

print(dg_command)
!{dg_command}
/Users/dgiles/datagenerator/bin/datagenerator -c /Users/dgiles/datagenerator/bin/sh.xml -scale 100 -cl -f -d /Users/dgiles/Downloads/generated_data -tc 8
Started Datagenerator, Version 0.4.0.1083

============================================
|           Datagenerator Run Stats        |
============================================
Connection Time                        0:00:00.000
Data Generation Time                   0:00:26.274
DDL Creation Time                      0:00:00.000
Total Run Time                         0:00:26.284
Rows Inserted per sec                        5,367
Data Generated (MB) per sec                    0.5
Actual Rows Generated                      137,000
Commits Completed                                0
Batch Updates Completed                          0

We should now have a series of files in the "generated_data_dir" directory. These will be a mix of csv files, create table scripts, loader scripts etc.

In [255]:
!ls {generated_data_dir}
CHANNELS.csv                   SALES.csv
CHANNELS.ctl                   SALES.ctl
COUNTRIES.csv                  SUPPLEMENTARY_DEMOGRAPHICS.csv
COUNTRIES.ctl                  SUPPLEMENTARY_DEMOGRAPHICS.ctl
CUSTOMERS.csv                  constraints.sql
CUSTOMERS.ctl                  createindexes.sql
PRODUCTS.csv                   createsequences.sql
PRODUCTS.ctl                   createtables.sql
PROMOTIONS.csv                 droptables.sql
PROMOTIONS.ctl

Uploading the data to the Oracle Object Store

We're really only interested in the "csv" files so we'll upload just those. But before we do this we'll need to establish a connection to the Oracle Object Store. I give some more detail on how to do this in this notebook. I'll be using the object storage out of the Frankfurt region.

In [256]:
import oci
import ast

my_config = ast.literal_eval(keyring.get_password('oci_opj','doms'))
my_config['region'] = 'eu-frankfurt-1'
object_storage_client = oci.object_storage.ObjectStorageClient(my_config)
namespace = object_storage_client.get_namespace().data

We've now got a handle to the Oracle Object Store client, so we can create a bucket, which we'll call "Sales_Data", and upload the CSV files to it.

In [ ]:
import os, io

bucket_name = 'Sales_Data'

files_to_process = [file for file in os.listdir(generated_data_dir) if file.endswith('csv')]

try:
    create_bucket_response = object_storage_client.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name=bucket_name,
            compartment_id=my_config['tenancy']
        )
    )
except Exception as e:
    print(e.message)


for upload_file in files_to_process:
    print('Uploading file {}'.format(upload_file))
    object_storage_client.put_object(namespace, bucket_name, upload_file, io.open(os.path.join(generated_data_dir,upload_file),'r'))    

We need to create an authentication token that can be used by the ADW instance to access our Object storage. To do this we need to create an identity client.

In [ ]:
identity_client = oci.identity.IdentityClient(my_config)
In [ ]:
token = identity_client.create_auth_token(
    oci.identity.models.CreateAuthTokenDetails(
        description = "Token used to provide access to newly loaded files"
    ),
    user_id = my_config['user']
)

Creating users and tables in the ADW Instance

The following steps will feel very familiar to any DBA/developer of an Oracle database. We need to create a schema and associated tables to load the data into.

First we'll need to create a user/schema and grant it the appropriate roles

In [ ]:
%sql create user mysh identified by ReallyLongPassw0rd default tablespace data

Grant the "mysh" user the DWROLE

In [ ]:
%sql grant DWROLE to mysh
In [ ]:
%sql oracle+cx_oracle://mysh:ReallyLongPassw0rd@domsdb_medium

We can now create the tables we'll use to load the data into.

In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE COUNTRIES (
  COUNTRY_ID           NUMBER       NOT NULL,
  COUNTRY_ISO_CODE     CHAR(2)      NOT NULL,
  COUNTRY_NAME         VARCHAR2(40) NOT NULL,
  COUNTRY_SUBREGION    VARCHAR2(30) NOT NULL,
  COUNTRY_SUBREGION_ID NUMBER       NOT NULL,
  COUNTRY_REGION       VARCHAR2(20) NOT NULL,
  COUNTRY_REGION_ID    NUMBER       NOT NULL,
  COUNTRY_TOTAL        NUMBER(9)    NOT NULL,
  COUNTRY_TOTAL_ID     NUMBER       NOT NULL,
  COUNTRY_NAME_HIST    VARCHAR2(40)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE SALES (
  PROD_ID       NUMBER     NOT NULL,
  CUST_ID       NUMBER     NOT NULL,
  TIME_ID       DATE       NOT NULL,
  CHANNEL_ID    NUMBER     NOT NULL,
  PROMO_ID      NUMBER     NOT NULL,
  QUANTITY_SOLD NUMBER(10) NOT NULL,
  AMOUNT_SOLD   NUMBER(10) NOT NULL
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE SUPPLEMENTARY_DEMOGRAPHICS (
  CUST_ID                 NUMBER NOT NULL,
  EDUCATION               VARCHAR2(21),
  OCCUPATION              VARCHAR2(21),
  HOUSEHOLD_SIZE          VARCHAR2(21),
  YRS_RESIDENCE           NUMBER,
  AFFINITY_CARD           NUMBER(10),
  BULK_PACK_DISKETTES     NUMBER(10),
  FLAT_PANEL_MONITOR      NUMBER(10),
  HOME_THEATER_PACKAGE    NUMBER(10),
  BOOKKEEPING_APPLICATION NUMBER(10),
  PRINTER_SUPPLIES        NUMBER(10),
  Y_BOX_GAMES             NUMBER(10),
  OS_DOC_SET_KANJI        NUMBER(10),
  COMMENTS                VARCHAR2(4000)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE CUSTOMERS (
  CUST_ID                NUMBER       NOT NULL,
  CUST_FIRST_NAME        VARCHAR2(20) NOT NULL,
  CUST_LAST_NAME         VARCHAR2(40) NOT NULL,
  CUST_GENDER            CHAR(1)      NOT NULL,
  CUST_YEAR_OF_BIRTH     NUMBER(4)    NOT NULL,
  CUST_MARITAL_STATUS    VARCHAR2(20),
  CUST_STREET_ADDRESS    VARCHAR2(40) NOT NULL,
  CUST_POSTAL_CODE       VARCHAR2(10) NOT NULL,
  CUST_CITY              VARCHAR2(30) NOT NULL,
  CUST_CITY_ID           NUMBER       NOT NULL,
  CUST_STATE_PROVINCE    VARCHAR2(40) NOT NULL,
  CUST_STATE_PROVINCE_ID NUMBER       NOT NULL,
  COUNTRY_ID             NUMBER       NOT NULL,
  CUST_MAIN_PHONE_NUMBER VARCHAR2(25) NOT NULL,
  CUST_INCOME_LEVEL      VARCHAR2(30),
  CUST_CREDIT_LIMIT      NUMBER,
  CUST_EMAIL             VARCHAR2(40),
  CUST_TOTAL             VARCHAR2(14) NOT NULL,
  CUST_TOTAL_ID          NUMBER       NOT NULL,
  CUST_SRC_ID            NUMBER,
  CUST_EFF_FROM          DATE,
  CUST_EFF_TO            DATE,
  CUST_VALID             VARCHAR2(1)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE CHANNELS (
  CHANNEL_ID       NUMBER       NOT NULL,
  CHANNEL_DESC     VARCHAR2(20) NOT NULL,
  CHANNEL_CLASS    VARCHAR2(20) NOT NULL,
  CHANNEL_CLASS_ID NUMBER       NOT NULL,
  CHANNEL_TOTAL    VARCHAR2(13) NOT NULL,
  CHANNEL_TOTAL_ID NUMBER       NOT NULL
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE PRODUCTS (
  PROD_ID               NUMBER(6)      NOT NULL,
  PROD_NAME             VARCHAR2(50)   NOT NULL,
  PROD_DESC             VARCHAR2(4000) NOT NULL,
  PROD_SUBCATEGORY      VARCHAR2(50)   NOT NULL,
  PROD_SUBCATEGORY_ID   NUMBER         NOT NULL,
  PROD_SUBCATEGORY_DESC VARCHAR2(2000) NOT NULL,
  PROD_CATEGORY         VARCHAR2(50)   NOT NULL,
  PROD_CATEGORY_ID      NUMBER         NOT NULL,
  PROD_CATEGORY_DESC    VARCHAR2(2000) NOT NULL,
  PROD_WEIGHT_CLASS     NUMBER(3)      NOT NULL,
  PROD_UNIT_OF_MEASURE  VARCHAR2(20),
  PROD_PACK_SIZE        VARCHAR2(30)   NOT NULL,
  SUPPLIER_ID           NUMBER(6)      NOT NULL,
  PROD_STATUS           VARCHAR2(20)   NOT NULL,
  PROD_LIST_PRICE       NUMBER(8)      NOT NULL,
  PROD_MIN_PRICE        NUMBER(8)      NOT NULL,
  PROD_TOTAL            VARCHAR2(13)   NOT NULL,
  PROD_TOTAL_ID         NUMBER         NOT NULL,
  PROD_SRC_ID           NUMBER,
  PROD_EFF_FROM         DATE,
  PROD_EFF_TO           DATE,
  PROD_VALID            VARCHAR2(1)
)
In [ ]:
%%sql mysh@domsdb_medium
CREATE TABLE PROMOTIONS (
  PROMO_ID             NUMBER(6)    NOT NULL,
  PROMO_NAME           VARCHAR2(30) NOT NULL,
  PROMO_SUBCATEGORY    VARCHAR2(30) NOT NULL,
  PROMO_SUBCATEGORY_ID NUMBER       NOT NULL,
  PROMO_CATEGORY       VARCHAR2(30) NOT NULL,
  PROMO_CATEGORY_ID    NUMBER       NOT NULL,
  PROMO_COST           NUMBER(10)   NOT NULL,
  PROMO_BEGIN_DATE     DATE         NOT NULL,
  PROMO_END_DATE       DATE         NOT NULL,
  PROMO_TOTAL          VARCHAR2(15) NOT NULL,
  PROMO_TOTAL_ID       NUMBER       NOT NULL
)
In [245]:
%%sql mysh@domsdb_medium
CREATE TABLE times AS
SELECT udate time_id,
  TO_CHAR(udate,'Day') day_name,
  TO_CHAR(udate,'DD') day_number_in_month,
  TO_CHAR(udate,'DDD') day_number_in_year,
  TO_CHAR(udate,'YYYY' ) calendar_year,
  TO_CHAR(udate,'Q' ) calendar_quarter_number,
  TO_CHAR(udate,'MM' ) calendar_month_number,
  TO_CHAR(udate,'WW' ) calendar_week_number,
  TO_CHAR(udate,'YYYY-MM') calendar_month_desc,
  TO_CHAR(udate,'YYYY-Q') calendar_quarter_desc
FROM
  (SELECT to_date('31/12/1994','DD/MM/YYYY')+rownum udate
  FROM all_objects
  WHERE to_date('31/12/1994','DD/MM/YYYY')+rownum <= to_date( '31/12/2014','DD/MM/YYYY')
  )
7305 rows affected.
Out[245]:
[]
In [248]:
%%sql mysh@domsdb_medium
select * from tab
0 rows affected.
Out[248]:
tname                       tabtype  clusterid
COUNTRIES                   TABLE    None
SALES                       TABLE    None
SUPPLEMENTARY_DEMOGRAPHICS  TABLE    None
CUSTOMERS                   TABLE    None
CHANNELS                    TABLE    None
PRODUCTS                    TABLE    None
PROMOTIONS                  TABLE    None
TIMES                       TABLE    None

Copying the data from the object store

We need to add the authorisation token to the newly created schema to allow it to access the object store's files. We can't do this using the SQL magic syntax we've been using up until this point, so we'll do it using standard cx_Oracle calls.

In [ ]:
connection = cx_Oracle.connect('mysh', 'ReallyLongPassw0rd', 'domsdb_medium')
cursor = connection.cursor()
In [ ]:
cursor.callproc('DBMS_CLOUD.create_credential', keywordParameters = {'credential_name':'SALES_DATA_AUTH',
                                                                    'username':'dominic.giles@oracle.com',
                                                                    'password':token.data.token})

We can access the object storage using a URL of the format

https://swiftobjectstorage.<region>.oraclecloud.com/v1/<tenancy>/<bucket name>/<object name>
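With the values used later in this notebook (region eu-frankfurt-1, tenancy oracleonpremjava, bucket Sales_Data), that would look like, for example:

https://swiftobjectstorage.eu-frankfurt-1.oraclecloud.com/v1/oracleonpremjava/Sales_Data/SALES.csv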

We can use this to dynamically generate a URL for each of the objects inside of the bucket we've just created and use the DBMS_CLOUD package to copy the data into the ADW instance. The code below gets the names of all of the tables we've just created and then loops through each table, copying the associated csv file into the ADW instance.

In [ ]:
from tqdm import tqdm

format = '''{"delimiter" : ",",
          "skipheaders" : 1,
          "ignoremissingcolumns" : "true",
          "removequotes" : "true",
          "dateformat" : "DD-MON-YYYY HH24:MI:SS",
          "blankasnull" : "true"}'''

file_location = '''https://swiftobjectstorage.{region}.oraclecloud.com/v1/{tenancy}/{bucket_name}/{table_name}.csv'''
region = my_config['region']
tenancy= 'oracleonpremjava'


rs = cursor.execute("select table_name from user_tables where table_name not like 'COPY%'")
rows = rs.fetchall()
for row in tqdm(rows):
    url = file_location.format(region=region, tenancy=tenancy, bucket_name=bucket_name, table_name=row[0])
    cursor.callproc('DBMS_CLOUD.copy_data', keywordParameters= {'table_name':row[0],
                                                           'credential_name':'SALES_DATA_AUTH',
                                                           'file_uri_list':url, 
                                                           'format': format
                                                           })
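DBMS_CLOUD.copy_data also writes log and bad file tables into the schema (these are the COPY% tables the queries here filter out). If you want to check how each load went, something along these lines should work, assuming the USER_LOAD_OPERATIONS view and its STATUS and ROWS_LOADED columns that Autonomous Database documents for DBMS_CLOUD loads:

# Sketch: summarise the DBMS_CLOUD copy operations (view and columns assumed as above)
rs = cursor.execute("select table_name, status, rows_loaded from user_load_operations where type = 'COPY'")
for table_name, status, rows_loaded in rs.fetchall():
    print('{tn: <35}{st: <12}{rl}'.format(tn=table_name, st=status, rl=rows_loaded))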
                               

We can now take a look and see how many rows we've loaded into the tables

In [261]:
rs = cursor.execute("select table_name from user_tables where table_name not like 'COPY%'")
rows = rs.fetchall()

for row in rows:
    rs2 = cursor.execute("select count(*) from {}".format(row[0]))
    rows2 = rs2.fetchone()
    print('{tn: <35}{rc:>10,}'.format(tn=row[0],rc=rows2[0]))
COUNTRIES                                  22
SALES                                  97,799
SUPPLEMENTARY_DEMOGRAPHICS             19,599
CUSTOMERS                              19,599
CHANNELS                                    4
PRODUCTS                                   71
PROMOTIONS                                502
TIMES                                   7,305

It's now possible to run standard queries against the newly loaded data. No need to create anything else (indexes etc.)

In [249]:
%%sql mysh@domsdb_medium
SELECT channels.channel_desc, 
       countries.country_iso_code,
       TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id 
  AND sales.cust_id=customers.cust_id 
  AND customers.country_id = countries.country_id
  AND sales.channel_id = channels.channel_id 
  AND channels.channel_desc IN ('Tele Sales','Internet')
  AND times.calendar_year = '2006'
  AND countries.country_iso_code IN ('GB','DE','FR','DK')
GROUP BY 
  ROLLUP(channels.channel_desc,countries.country_iso_code)
ORDER BY 1
0 rows affected.
Out[249]:
channel_desc  country_iso_code  sales$
Internet      DE                 3,130
Internet      DK                 1,676
Internet      FR                 2,545
Internet      GB                 2,320
Internet      None               9,671
Tele Sales    DE                 3,077
Tele Sales    DK                 3,116
Tele Sales    FR                 3,184
Tele Sales    GB                 2,386
Tele Sales    None              11,763
None          None              21,434

Tidying up the object store and database

You can run the following steps if you want to remove all of the tables from the schema and purge the object store of files. Let's start by removing the tables. NOTE: The code below will remove all of the tables from the schema. Make sure you've not got anything in the schema that you want to keep before running it.

In [ ]:
from tqdm import tqdm

rs = cursor.execute("select table_name from user_tables")
rows = rs.fetchall()

for row in tqdm(rows):
    rs2 = cursor.execute("drop table {} purge".format(row[0]))

And then removing the object store files and bucket

In [ ]:
object_list = object_storage_client.list_objects(namespace, bucket_name)

for o in object_list.data.objects:
    print('Deleting object {}'.format(o.name))
    object_storage_client.delete_object(namespace, bucket_name, o.name)

print('Deleting bucket')    
response = object_storage_client.delete_bucket(namespace, bucket_name)

Accessing the Oracle Object Store


OCI Object Store Examples

The following are a series of examples showing the loading of data into the Oracle Object Store. For these to work with your own data you'll need to have your own Oracle Cloud account and to have uploaded a key. You can find details on how to achieve this here.

I'll be using the Oracle OCI Python SDK, which wraps the REST API. You can find details on the API here.

Before we do anything we'll need to load the required Python modules.

In [127]:
import oci
import keyring
import ast
import os

Configuration needed to connect

I'm using the "keyring" Python module to hold the config for my connection to OCI (to avoid needlessly exposing sensitive information). It's of the form

{
    "user": "your user ocid",
    "key_file": "the path to your private key file",
    "fingerprint": "the fingerprint of your public key",
    "tenancy": "your tenancy ocid",
    "region": "the region you are working with"
}
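For reference, the config only needs to be stored once. Since keyring stores strings, I keep the dict's string form and read it back with ast.literal_eval (as below). A one-off cell of roughly this shape, with placeholder values, does the job:

# One-off: store the config dict (as a string) under the service/username used in this notebook
my_oci_config = {
    "user": "ocid1.user.oc1..aaaa...",         # placeholder
    "key_file": "/path/to/oci_api_key.pem",    # placeholder
    "fingerprint": "aa:bb:cc:...",             # placeholder
    "tenancy": "ocid1.tenancy.oc1..aaaa...",   # placeholder
    "region": "eu-frankfurt-1"
}
keyring.set_password('oci_opj', 'doms', str(my_oci_config))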

After retrieving it from my keyring store I then need to convert it into a dictionary before using it. You can also validate the config you are using, which is handy if this is the first time you've configured it.

In [128]:
my_config = ast.literal_eval(keyring.get_password('oci_opj','doms'))
oci.config.validate_config(my_config)

Create object storage client

Then I just need to retrieve an Object Storage client to start working with data.

In [129]:
object_storage_client = oci.object_storage.ObjectStorageClient(my_config)
In [130]:
namespace = object_storage_client.get_namespace().data
bucket_name = "doms_object_store"

Upload the contents of user directory to a bucket

I'll create a bucket and then select all of the files from a user-defined directory and upload them to the newly created bucket.

In [131]:
import os, io

directory = '/Users/dgiles/datagenerator/bin/generateddata'
files_to_process = [file for file in os.listdir(directory) if file.endswith('csv')]

Create a bucket named "Sales_Data" and give it the tenancy ocid from your config.

In [132]:
try:
    create_bucket_response = object_storage_client.create_bucket(
        namespace,
        oci.object_storage.models.CreateBucketDetails(
            name='Sales_Data',
            compartment_id=my_config['tenancy']
        )
    )
except Exception as e:
    print(e.message)

Then we just need to loop through the list of files in the directory specified and upload them to the newly created bucket

In [133]:
bucket_name = 'Sales_Data'
for upload_file in files_to_process:
    print('Uploading file {}'.format(upload_file))
    object_storage_client.put_object(namespace, bucket_name, upload_file, io.open(os.path.join(directory,upload_file),'r'))
Uploading file CUSTOMERS.csv
Uploading file PRODUCTS.csv
Uploading file COUNTRIES.csv
Uploading file PROMOTIONS.csv
Uploading file CHANNELS.csv
Uploading file SUPPLEMENTARY_DEMOGRAPHICS.csv
Uploading file SALES.csv

Retrieve a list of objects in a bucket

The following retrieves a bucket and gets a list of objects in the bucket.

In [134]:
bucket = object_storage_client.get_bucket(namespace, bucket_name)
object_list = object_storage_client.list_objects(namespace, bucket_name)

for o in object_list.data.objects:
    print(o.name)
CHANNELS.csv
COUNTRIES.csv
CUSTOMERS.csv
PRODUCTS.csv
PROMOTIONS.csv
SALES.csv
SUPPLEMENTARY_DEMOGRAPHICS.csv

Download the contents of an object

The following downloads a file from a named bucket in chunks and writes it to a user-defined directory on the client.

In [135]:
# Attempt to download a file

object_name = "CUSTOMERS.csv"
destination_dir = '/Users/dgiles/Downloads'
get_obj = object_storage_client.get_object(namespace, bucket_name, object_name)
with open(os.path.join(destination_dir,object_name), 'wb') as f:
    for chunk in get_obj.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)

Delete a bucket

We can just as simply delete the bucket we've just created but first we'll need to delete all of the objects inside of it.

In [136]:
object_list = object_storage_client.list_objects(namespace, bucket_name)

for o in object_list.data.objects:
    print('Deleting object {}'.format(o.name))
    object_storage_client.delete_object(namespace, bucket_name, o.name)

print('Deleting bucket')    
response = object_storage_client.delete_bucket(namespace, bucket_name)
Deleting object CHANNELS.csv
Deleting object COUNTRIES.csv
Deleting object CUSTOMERS.csv
Deleting object PRODUCTS.csv
Deleting object PROMOTIONS.csv
Deleting object SALES.csv
Deleting object SUPPLEMENTARY_DEMOGRAPHICS.csv
Deleting bucket

Python, cx_Oracle, Altair and Jupyter Notebooks


Simple Oracle/Jupyter/Keyring/Altair Example

In this trivial example we'll be using Jupyter Lab to create this small notebook of Python code. We will see how easy it is to run SQL and analyze the results. The point of the exercise here isn't really the SQL we use, but rather how simple it is for developers or analysts who prefer working with code rather than a full-blown UI to retrieve and visualise data.

For reference I'm using Python 3.6 and Oracle Database 12c Release 2. You can find the source here

First we'll need to import the required libraries. It's likely that you won't have them installed on your server/workstation. To install them, use the following pip command: "pip install cx_Oracle keyring pandas altair jupyter jupyter-sql". You should have pip installed and probably be using virtualenv, but if you don't I'm afraid that's beyond the scope of this example. Note that whilst I'm not importing jupyter and jupyter-sql in this example they are implicitly used.

In [1]:
import cx_Oracle
import keyring
import pandas as pd
import altair as alt

We will be using the magic-sql functionality inside of jupyter-lab. This means we can trivially embed SQL rather than coding the cursors and fetches we would typically have to do if we were using straightforward cx_Oracle. This uses the "magic function" syntax, which starts with "%" or "%%". First we need to load the support for SQL.

In [2]:
%load_ext sql

Next we can connect to the local docker Oracle Database. However, one thing we don't want to do with notebooks when working with server applications is to disclose our passwords. I use the python module "keyring", which enables you to store the password once in a platform-appropriate fashion and then recall it in code. Outside of this notebook I used the keyring command line utility and ran "keyring set local_docker system". It then prompted me for a password which I'm using here. Please be careful with passwords and code notebooks. It's possible to mistakenly serialise a password and then potentially expose it when sharing notebooks with colleagues.
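If you'd rather do that one-off store from Python than from the shell, the equivalent, using getpass so the password never appears in the notebook, is roughly:

import keyring, getpass

# One-off: prompt for the password and store it under the same service/username used below
keyring.set_password('local_docker', 'system', getpass.getpass('Password for system: '))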

In [3]:
password = keyring.get_password('local_docker','system')

We can then connect to the Oracle Database. In this instance I'm using the cx_Oracle driver and telling it to connect to a local docker database running Oracle Database 12c (12.2.0.1). Because I'm using a service I need to specify that in the connect string. I also substitute the password I fetched earlier.

In [4]:
%sql oracle+cx_oracle://system:$password@localhost/?service_name=soe
Out[4]:
'Connected: system@'

I'm now connected to the Oracle database as the system user. I'm simply using this user as it has access to a few more tables than a typical user would. Next I can issue a simple select statement to fetch some table metadata. The %%sql result << command retrieves the result set into a variable called result.

In [5]:
%%sql result <<
select table_name, tablespace_name, num_rows, blocks, avg_row_len, trunc(last_analyzed)
from all_tables
where num_rows  > 0
and tablespace_name is not null
 * oracle+cx_oracle://system:***@localhost/?service_name=soe
0 rows affected.
Returning data to local variable result

In Python a lot of tabular manipulation of data is performed using the module Pandas. It's an amazing piece of code, enabling you to analyze, group, filter, join and pivot columnar data with ease. If you've not used it before I strongly recommend you take a look. With that in mind, we need to take the resultset we retrieved and convert it into a DataFrame (the base Pandas tabular structure).

In [6]:
result_df = result.DataFrame()

We can see a sample of that result set using Pandas' head function.

In [7]:
result_df.head()
Out[7]:
  table_name  tablespace_name  num_rows  blocks  avg_row_len  TRUNC(LAST_ANALYZED)
0 BOOTSTRAP$  SYSTEM                 60       3          314  2018-06-14
1 CON$        SYSTEM               7068      28           22  2018-06-20
2 OBJ$        SYSTEM              72687    1211          112  2018-06-14
3 TS$         SYSTEM                  7       7           89  2018-06-14
4 IND$        SYSTEM               2892    1565           89  2018-06-19
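As a quick taste of the grouping mentioned above (a sketch using the columns returned by the query), we could summarise the row counts recorded in the optimizer statistics per tablespace straight from the DataFrame:

# Total num_rows per tablespace, largest first
result_df.groupby('tablespace_name')['num_rows'].sum().sort_values(ascending=False)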

All very useful, but what if we wanted to chart how many tables there are in each tablespace? To do that we could of course use the charting functionality of Pandas (and Matplotlib), but I've recently started experimenting with Altair and found it to make much more sense when defining what and how to plot data. So let's just plot the count of tables on the X axis and the tablespace name on the Y axis.

In [8]:
alt.Chart(result_df).mark_bar().encode(
    x='count()',
    y='tablespace_name',
)
Out[8]:

Altair makes much more sense to me as a SQL user than some of the cryptic functionality of Matplotlib. But the chart looks really dull. We can trivially tidy it up with a few additional commands.

In [9]:
alt.Chart(result_df).mark_bar().encode(
    x=alt.X('count()', title='Number of tables in Tablespace'),
    y=alt.Y('tablespace_name', title='Tablespace Name'),
    color='tablespace_name'
).properties(width=600, height= 150, title='Tables Overview')
Out[9]:

Much better, but Altair can do significantly more sophisticated charts than a simple bar chart. Next let's take a look at the relationship between the number of rows in a table and the blocks required to store them. It's worth noting that this won't, or at least shouldn't, produce any startling results, but it's a useful plot on a near-empty database that will give us something to look at. Using another simple scatter chart we get the following.

In [10]:
alt.Chart(result_df).mark_point().encode(
    y = 'num_rows',
    x = 'blocks',
).properties(width=600)
Out[10]:

Again, it's a near-empty database and so there's not much going on, or at least we have a lot of very small tables and a few big ones. Let's use a logarithmic scale to spread things out a little better and take the opportunity to brighten it up.

In [11]:
alt.Chart(result_df).mark_point().encode(
    y = alt.Y('num_rows',scale=alt.Scale(type='log')),
    x = alt.X('blocks', scale=alt.Scale(type='log')),
    color = 'tablespace_name',
    tooltip=['table_name']
).properties(width=600)
Out[11]:

Much better... But Altair and the libraries it uses have another fantastic trick up their sleeve. They can make the charts interactive (zoom and pan) and also enable you to select data in one chart and have it reflected in another. To do that we use a "Selection" object and use this to filter the data in a second chart. Let's do this for the two charts we created. Note you'll need to have run this notebook "live" to see the interactive selection.

In [12]:
interval = alt.selection_interval()

chart1 = alt.Chart(result_df).mark_point().encode(
    y = alt.Y('num_rows',scale=alt.Scale(type='log')),
    x = alt.X('blocks', scale=alt.Scale(type='log')),
    color = 'tablespace_name',
    tooltip=['table_name']
).properties(width=600, selection=interval)

chart2 = alt.Chart(result_df).mark_bar().encode(
    x='count()',
    y='tablespace_name',
    color = 'tablespace_name'
).properties(width=600, height= 150).transform_filter(
    interval
)

chart1 & chart2
Out[12]: