Oracle, Pythons and Pandas... Oh My.
14/08/15 12:06
Over the last year I've been using Python more frequently and enjoying the experience. Some things are just simpler to do in Python than in my go-to language, Java. One area that makes things simpler is the extensive range of libraries and the ease of installing and maintaining them. This blog isn't intended to be an introduction to Python and its many libraries, just a quick intro to those I seem to use more and more.
There are a few libraries that are perfect for working with a database.
- cx_Oracle : This driver enables you to connect to the Oracle database via Oracle's OCI client. You'll need to install Oracle's Instant Client or full client.
- matplotlib : A powerful charting library enabling you to visualise your data in an almost unimaginable number of ways
- pandas : An easy-to-use library for data analysis
- NumPy : A powerful scientific computing package
Also, one aspect that I'm not going to cover here, but that really takes Python from an expressive programming language to a tool for sophisticated collaborative data exploration and development, is IPython Notebook. It provides a means of writing up and executing live code that can be modified by other collaborators. I'll cover this in a future blog.
Setting up Python
Before we go any further we need to make sure we have Python installed along with the correct libraries. I'm using Python 2.7; note that the code below uses Python 2 style print statements, so it will need minor changes to run under Python 3. Most operating systems will have Python installed by default. If yours doesn't you can get it here. Next we need to ensure that we have all of the correct libraries. To do this I use the Python package manager "pip". Again I won't go into its installation if you don't have it installed, but you can find details here. After you've installed pip, all you need to do to ensure you've got the correct libraries is to issue a command like
pip install matplotlib numpy cx_oracle pandas --upgrade
This will sort out all the dependencies for you. I've also used the "--upgrade" option to ensure we refresh any out-of-date libraries. If you get permission errors it's likely because you don't have the privilege to install the libraries into the shared system lib location. Either rerun the command as root or with sudo, i.e.
sudo pip install matplotlib numpy cx_oracle pandas --upgrade
Or create a virtual environment.
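If you'd rather leave the system libraries untouched, a virtual environment keeps everything self-contained. A minimal sketch, assuming you have the virtualenv tool installed (the environment name "oracle-pandas" is just an example):

virtualenv oracle-pandas
source oracle-pandas/bin/activate
pip install matplotlib numpy cx_oracle pandas --upgrade

Anything installed while the environment is active lands inside that directory rather than in the shared system location.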
You can check to see that your libraries are installed by using the "pip list" command.
Code
For the impatient amongst you, let's start with the code and explain some of the details later.
__author__ = 'dgiles'

import cx_Oracle
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style


def read_from_db(username, password, connectString, mode=None, save=False):
    # Connect via cx_Oracle (optionally as SYSDBA) and pull back the last hour
    # of DB Time broken down by wait class, plus CPU from the system metrics
    if mode is None:
        connection = cx_Oracle.connect(username, password, connectString)
    else:
        connection = cx_Oracle.connect(username, password, connectString, mode)
    with connection:
        try:
            df = pd.read_sql_query("SELECT\
                wc.wait_class AS waitclass,\
                TRUNC(begin_time, 'MI') AS sample_time,\
                round((wh.time_waited) / wh.intsize_csec, 3) AS DB_time\
                FROM V$SYSTEM_WAIT_CLASS wc,\
                v$waitclassmetric_history wh\
                WHERE wc.wait_class != 'Idle'\
                AND wc.wait_class_id = wh.wait_class_id\
                UNION\
                SELECT\
                'CPU' AS waitclass,\
                TRUNC(begin_time, 'MI') AS sample_time,\
                round(VALUE/100, 3) AS DB_time\
                FROM v$sysmetric_history\
                WHERE GROUP_ID = 2\
                AND metric_name = 'CPU Usage Per Sec'\
                ORDER by sample_time, waitclass", connection)
            if save:
                df.to_csv('results.csv')
            return df
        except cx_Oracle.DatabaseError as dberror:
            print dberror


def read_from_file(filename):
    # Re-read previously saved results, parsing the timestamp column
    return pd.read_csv(filename, parse_dates=['SAMPLE_TIME'])


if __name__ == '__main__':
    style.use('ggplot')
    df = read_from_db(username='sys', password='welcome1',
                      connectString='oracle12c2/soe',
                      mode=cx_Oracle.SYSDBA, save=True)
    # df = read_from_file('results.csv')
    print df.head()
    # Pivot so each wait class becomes a column, indexed by sample time
    pdf = df.pivot(index='SAMPLE_TIME', columns='WAITCLASS', values='DB_TIME')
    print pdf.head()
    # Stacked area chart with explicit colours rather than a colour map
    pdf.plot(kind='area', stacked=True, title='DB Time over the last hour',
             color=['red', 'green', 'orange', 'darkred', 'brown', 'brown',
                    'pink', 'lightgreen', 'cyan', 'blue'])
    plt.show()
Run the code with a command like the following (note: I saved my code to a file called dbtime60min.py):
python dbtime60min.py
You should see a chart like the following.
So in a relatively short amount of code we can quickly produce a useful chart. And the beauty of the solution is that we could replace the relatively complex piece of SQL I used with something more trivial, and Pandas would still do much of the heavy lifting to convert it into a useful chart that we can share with others.
An explanation of the code
There are a few areas that are worth understanding in the code and that highlight the ease of working with a framework like Pandas.
I created a function, read_from_db(), that connects to the database (via the cx_Oracle library) and then asks the Pandas framework to read the data back from the database.
df = pd.read_sql_query("SELECT\
    wc.wait_class AS waitclass,\
    TRUNC(begin_time, 'MI') AS sample_time,\
    round((wh.time_waited) / wh.intsize_csec, 3) AS DB_time\
    FROM V$SYSTEM_WAIT_CLASS wc,\
    v$waitclassmetric_history wh\
    WHERE wc.wait_class != 'Idle'\
    AND wc.wait_class_id = wh.wait_class_id\
    UNION\
    SELECT\
    'CPU' AS waitclass,\
    TRUNC(begin_time, 'MI') AS sample_time,\
    round(VALUE/100, 3) AS DB_time\
    FROM v$sysmetric_history\
    WHERE GROUP_ID = 2\
    AND metric_name = 'CPU Usage Per Sec'\
    ORDER by sample_time, waitclass", connection)
Pandas reads this data into a DataFrame, a tabular structure much like the table we read the information from. And just like a database table, Pandas enables us to sort and filter the data. Using a command like
print df.info()
shows us the details of the information we've just read in, i.e.
Int64Index: 549 entries, 0 to 548
Data columns (total 3 columns):
WAITCLASS      549 non-null object
SAMPLE_TIME    549 non-null datetime64[ns]
DB_TIME        549 non-null float64
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 17.2+ KB
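I mentioned sorting above, so as a quick aside (this isn't part of the script), you can order the rows by a column with something like the following. On recent versions of Pandas the method is sort_values(); on older versions it was simply sort().

print df.sort_values(by='DB_TIME', ascending=False).head()

This shows the five samples with the highest DB Time.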
We can also see a sample of the data with "tail()" or "head()", i.e.
print df.head()
results in
       WAITCLASS         SAMPLE_TIME  DB_TIME
0            CPU 2015-08-17 11:47:00        0
1         Commit 2015-08-17 11:47:00        0
2    Concurrency 2015-08-17 11:47:00        0
3  Configuration 2015-08-17 11:47:00        0
4        Network 2015-08-17 11:47:00        0
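And, as you'd expect, tail() gives us the last few rows in exactly the same way. Both head() and tail() return five rows by default; pass a number, e.g. head(10), to change that.

print df.tail()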
We can also select individual columns from this data set with a command like
wcdf = df['WAITCLASS']
print wcdf.head()
Which will give us
0              CPU
1           Commit
2      Concurrency
3    Configuration
4          Network
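Selecting more than one column is just as easy; pass a list of column names and you get a new DataFrame back rather than a single column, i.e.

print df[['WAITCLASS', 'DB_TIME']].head()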
Instead of explicitly iterating through the data to find information, we can also filter out just the relevant rows, i.e.
fdf = df[df['WAITCLASS'] == 'CPU']
This selects just the rows of the dataset where the WAITCLASS column contains 'CPU', very similar to a SQL where clause, i.e. "where WAITCLASS = 'CPU'". When we print the head of the fdf DataFrame we get
print fdf.head()
   WAITCLASS         SAMPLE_TIME  DB_TIME
0        CPU 2015-08-17 12:38:00    0.001
9        CPU 2015-08-17 12:39:00    0.001
18       CPU 2015-08-17 12:40:00    0.000
27       CPU 2015-08-17 12:41:00    0.003
36       CPU 2015-08-17 12:42:00    0.000
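Conditions can also be combined. The operators & and | stand in for "and" and "or", and each condition needs its own parentheses. For example, to find just the samples where CPU usage was actually non-zero (an illustrative filter, not part of the script):

busy = df[(df['WAITCLASS'] == 'CPU') & (df['DB_TIME'] > 0)]
print busy.head()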
Another capability of Pandas (and there are many, we've only touched on a few) is the ability to pivot the data. Now in this example you could make the case that it would have been simpler to pivot the data in the database, but I'm doing it here to illustrate the point. All we need to do is specify which column will be the index (y axis), which column(s) will provide the column headers (x axis) and which column will provide the values.
pdf = df.pivot(index='SAMPLE_TIME', columns='WAITCLASS', values='DB_TIME')
will turn this time series data
       WAITCLASS         SAMPLE_TIME  DB_TIME
0            CPU 2015-08-17 12:47:00        0
1         Commit 2015-08-17 12:47:00        0
2    Concurrency 2015-08-17 12:47:00        0
3  Configuration 2015-08-17 12:47:00        0
4        Network 2015-08-17 12:47:00        0
into this pivoted data
WAITCLASS              CPU  Commit  Concurrency  Configuration  Network  \
SAMPLE_TIME
2015-08-17 12:47:00  0.000       0            0              0        0
2015-08-17 12:48:00  0.002       0            0              0        0
2015-08-17 12:49:00  0.001       0            0              0        0
2015-08-17 12:50:00  0.000       0            0              0        0
2015-08-17 12:51:00  0.000       0            0              0        0

WAITCLASS            Other  Scheduler  System I/O  User I/O
SAMPLE_TIME
2015-08-17 12:47:00  0.000          0           0         0
2015-08-17 12:48:00  0.001          0           0         0
2015-08-17 12:49:00  0.002          0           0         0
2015-08-17 12:50:00  0.000          0           0         0
2015-08-17 12:51:00  0.000          0           0         0
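One thing to watch with pivot(): it will raise an error if the same index/column pair appears more than once. If your query could return duplicate samples, pivot_table() does the same job but aggregates the duplicates for you, i.e. something like

pdf = df.pivot_table(index='SAMPLE_TIME', columns='WAITCLASS', values='DB_TIME', aggfunc='mean')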
The last step is to chart the data and this is achieved in a single command in Pandas.
pdf.plot(kind='area', stacked=True, title='DB Time over the last hour',
         color=['red', 'green', 'orange', 'darkred', 'brown', 'brown', 'pink',
                'lightgreen', 'cyan', 'blue'])
plt.show()
There are a couple of things to note. I've overridden the default colours with ones that will be familiar to Oracle DBAs from Enterprise Manager's DB Time charts. And because I passed an explicit list of colours rather than a colour map, CPU is always rendered in "green".
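If you want to share the chart rather than just display it on screen, matplotlib will happily write it straight to a file. Call this in place of (or before) plt.show(); the filename is just an example.

plt.savefig('dbtime60min.png')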
So that's just a quick example of the power of Pandas when used in conjunction with Oracle. I'll try to post a few more Python articles over the coming weeks.