Oracle, Pythons and Pandas... Oh My.

Over the last year I've been using Python more frequently and enjoying the experience. Some things are just simpler to do in Python than my go to language, Java. One area that makes things simpler is the extensive range of libraries and the simplicity of installing and maintaining them… This blog isn't intended to be an introduction to python and it's many libraries just a quick intro to those I seem to use more and more.

There are few libraries that are perfect for working with a database.

  • cx_oracle : This driver enables you to connect to the oracle database via Oracle's OCI Client. You'll need to install Oracle's instant or full client.
  • MatPlotLib : A powerful charting library enabling you to visualise your data in an unimaginalable number of ways
  • Pandas : An easy to use library for data analysis
  • Numpy : A powerful scientific computing package

Also one aspect that I'm not going to cover but that really takes Python from an expressive programming language to a collaborative to a tool for sophisticated collaborative data exploration and development is IPython Notebook. IT provides a means of writing up and executing live code that can be modified by other collaborators. I'll cover this in a future blog.

Setting up Python

Before we go any further we need to make sure we have python installed and the correct libraries. I'm using Python 2.7 but the code we are using should work fine with python 3.0. Most operating systems will have have python installed by default. If yours doesn't you can get it here. Next we need to insure that we have all of the correct libraries. To do this I use the python library manager "pip". Again I won't go into it's installation if you don't have it installed but you can find details here. After you've installed pip all you need to do to ensure you've got the correct libraries is to issue a command like


pip install matplotlib numpy cx_oracle pandas --upgrade

This will sort out all the dependencies for you. I've also used the "—upgrade" option to ensure we refresh any out of date libraries. If you get permission errors it's likey to be because you don't have the privilege to install the libraries into the shared system lib location. Either rerun the command as root or with sudo i.e.


sudo pip install matplotlib numpy cx_oracle pandas --upgrade

Or create a virtual environment

You can check to see that your libraries are installed by using the "pip list" command.

Code

For the impatient amongst you lets start with code and explain some of the details later

__author__ = 'dgiles'

import cx_Oracle
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style

def read_from_db (username, password, connectString, mode=None, save=False):

    if mode is None:
        connection = cx_Oracle.connect(username, password, connectString)
    else:
        connection = cx_Oracle.connect(username, password, connectString, mode)
    with connection:
        try:
            df = pd.read_sql_query("SELECT\
                                      wc.wait_class                    AS waitclass,\
                                      TRUNC(begin_time, 'MI')          AS sample_time,\
                                      round((wh.time_waited) / wh.intsize_csec, 3) AS DB_time\
                                    FROM V$SYSTEM_WAIT_CLASS wc,\
                                      v$waitclassmetric_history wh\
                                    WHERE wc.wait_class != 'Idle'\
                                          AND wc.wait_class_id = wh.wait_class_id\
                                    UNION\
                                    SELECT\
                                      'CPU'                   AS waitclass,\
                                      TRUNC(begin_time, 'MI') AS sample_time,\
                                      round(VALUE/100, 3)         AS DB_time\
                                    FROM v$sysmetric_history\
                                    WHERE GROUP_ID = 2\
                                          AND metric_name = 'CPU Usage Per Sec'\
                                    ORDER by sample_time, waitclass",
                                   connection)
            if save:
                df.to_csv('results.csv')
            return df
        except cx_Oracle.DatabaseError as dberror:
            print dberror

def read_from_file(filename):
    return pd.read_csv(filename, parse_dates=['SAMPLE_TIME'])


if __name__ == '__main__':
    style.use('ggplot')
    df = read_from_db(username='sys', password='welcome1', connectString='oracle12c2/soe', mode=cx_Oracle.SYSDBA, save=True)
    # df = read_from_file('results.csv')

    print df.head()

    pdf = df.pivot(index='SAMPLE_TIME', columns='WAITCLASS', values='DB_TIME')

    print pdf.head()

    pdf.plot(kind='area', stacked=True, title='DB Time over the last hour', color=['red', 'green', 'orange', 'darkred', 'brown', 'brown', 'pink', 'lightgreen', 'cyan', 'blue'])
    plt.show()

On running the code with a command like. Note : I saved my code to a file dbtime60min.py

python dbtime60min.py


You should see a chart like the following.

Pasted Graphic 1

So in a relatively short amount of code we can quickly produce a useful chart. And the beauty of the solution is that we could replace the relatively complex piece of SQL I used with some thing more trivial and Pandas does much of the heavy lifting to convert it into a useful chart that we can share with others.

An explanation of the code

There are a view areas that are worth understanding in the code and that highlight the easy of working with a a framework like Pandas.

I created a function "def read_from_db" that connects to the database (via the cx_oracle library) and then asks the Pandas framework to read the data back from the database.
df = pd.read_sql_query("SELECT\
                      wc.wait_class                    AS waitclass,\
                      TRUNC(begin_time, 'MI')          AS sample_time,\
                      round((wh.time_waited) / wh.intsize_csec, 3) AS DB_time\
                    FROM V$SYSTEM_WAIT_CLASS wc,\
                      v$waitclassmetric_history wh\
                    WHERE wc.wait_class != 'Idle'\
                          AND wc.wait_class_id = wh.wait_class_id\
                    UNION\
                    SELECT\
                      'CPU'                   AS waitclass,\
                      TRUNC(begin_time, 'MI') AS sample_time,\
                      round(VALUE/100, 3)         AS DB_time\
                    FROM v$sysmetric_history\
                    WHERE GROUP_ID = 2\
                          AND metric_name = 'CPU Usage Per Sec'\
                    ORDER by sample_time, waitclass",
                   connection)

The Pandas reads this data into a multi dimensional structure, much like the table we read this information from. And just like a database table Pandas enables us to sort and filter the data. Using a command like

print.info()

Shows us the details of the information we've just read in. i.e.


Int64Index: 549 entries, 0 to 548
Data columns (total 3 columns):
WAITCLASS      549 non-null object
SAMPLE_TIME    549 non-null datetime64[ns]
DB_TIME        549 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 17.2+ KB

We can also see a sample of the data with "tail()" or "head()" i.e.

print.head()

results in

       WAITCLASS         SAMPLE_TIME  DB_TIME
0            CPU 2015-08-17 11:47:00        0
1         Commit 2015-08-17 11:47:00        0
2    Concurrency 2015-08-17 11:47:00        0
3  Configuration 2015-08-17 11:47:00        0
4        Network 2015-08-17 11:47:00        0

We can also select individual columns from this data set with a command like


wcdf = df['WAITCLASS'] print wcdf.head()


Which will give us


0 CPU 1 Commit 2 Concurrency 3 Configuration 4 Network


Instead of explicitly iterating through the data to find information we can also filter out just the relevant information i.e.

fdf = df[df['WAITCLASS'] == 'CPU']

Which will enable us to just select data from the dataset where the WAITCLASS column only contains 'CPU. Very similar to a SQL where clause i.e. "where WAITCLASS = 'CPU'. When we print the head of the fdf Data frame we get


print fdf.head()

  WAITCLASS         SAMPLE_TIME  DB_TIME
0        CPU 2015-08-17 12:38:00    0.001
9        CPU 2015-08-17 12:39:00    0.001
18       CPU 2015-08-17 12:40:00    0.000
27       CPU 2015-08-17 12:41:00    0.003
36       CPU 2015-08-17 12:42:00    0.000

Another capability of Pandas (and there are many and we've only touched on a few) is the ability to pivot the data. Now in this example you could make the case it would have been simpler to pivot the data in the database. But I'm doing it here to illustrate the point. All we need to do is to specify which columns will be the index (y axis) and which column(s) will be the column headers (x axis) and which column will be the value.


pdf = df.pivot(index='SAMPLE_TIME', columns='WAITCLASS', values='DB_TIME')

will turn this time series data


WAITCLASS SAMPLE_TIME DB_TIME 0 CPU 2015-08-17 12:47:00 0 1 Commit 2015-08-17 12:47:00 0 2 Concurrency 2015-08-17 12:47:00 0 3 Configuration 2015-08-17 12:47:00 0 4 Network 2015-08-17 12:47:00 0


into this pivoted data


WAITCLASS CPU Commit Concurrency Configuration Network \ SAMPLE_TIME 2015-08-17 12:47:00 0.000 0 0 0 0 2015-08-17 12:48:00 0.002 0 0 0 0 2015-08-17 12:49:00 0.001 0 0 0 0 2015-08-17 12:50:00 0.000 0 0 0 0 2015-08-17 12:51:00 0.000 0 0 0 0 WAITCLASS Other Scheduler System I/O User I/O SAMPLE_TIME 2015-08-17 12:47:00 0.000 0 0 0 2015-08-17 12:48:00 0.001 0 0 0 2015-08-17 12:49:00 0.002 0 0 0 2015-08-17 12:50:00 0.000 0 0 0 2015-08-17 12:51:00 0.000 0 0 0

The last step is to chart the data and this is achieved in a single command in Pandas.


pdf.plot(kind='area', stacked=True, title='DB Time over the last hour', color=['red', 'green', 'orange', 'darkred', 'brown', 'brown', 'pink', 'lightgreen', 'cyan', 'blue']) plt.show()

There's a couple of things to note. I've overwritten the default colour map to use colours that would be familiar to Oracle DBAs via the Enterprise Managers DB Time charts. And I also didn't use a colour map to ensure that CPU is always rendered in "green".

So just a quick example of the power of Pandas when used in conjunction with Oracle. I'll try and post a few more Python articles over the coming weeks.
blog comments powered by Disqus