Pre-parsers
Notes on pre-parsing data for Oracle data loads
14/02/11 18:53 Filed in: Oracle
Sometimes data simply isn't in a form that is easy to load into an Oracle database, i.e. column form. It would be great if everybody exchanged data in a simple CSV form with a single file-to-table mapping. Sadly that isn't the case, and sometimes you have to do a little work to get it into a form that's usable. A recent benchmark highlighted this issue very well. The customer provided the data in compressed CSV form (so far so good), but the data was held in key-value pairs (not so good). They also provided us with a mapping file that describes how it all fits together.
Now, typically the approach many people would take is to develop some form of program that parses all of the data, writes it to a staging area and then loads it all in one go into the target database. I make no criticism of this approach; it works well as long as it's not time critical, and it's by far the simplest method. However, I'm a big fan of taking advantage of what's already available, and one of the most underused and powerful features of the Oracle database is the pre-parser. It enables you to pipeline various operations so they all run as quickly as possible. So, going back to my benchmark, we used this approach to load data into our target database. It consisted of four steps:
- Read the data off the filesystem as efficiently as possible and write it to stdout
- Read from stdin, unzip the data and write it to stdout
- Read from stdin into a Java program that does the key-value mapping and error detection/correction, writing the output to stdout
- Read from stdin into SQL*Loader
The Java program simply reads from stdin and writes to stdout. Handling the key-value pairs just required the program to read in the mapping file and then split and parse the values arriving on stdin. The data was then written to stdout in a well-known column order.
Java extract from my program...
// OneMB, delimitor, seperator, checkForNull() and MyTokenizer are defined elsewhere in the program
BufferedReader br = new BufferedReader(new InputStreamReader(System.in), OneMB);
String line = null;
HashMap<String, String> keyValuePairs = null;
MyTokenizer mt = null;

while ((line = br.readLine()) != null) {
    keyValuePairs = new HashMap<String, String>(200);
    mt = new MyTokenizer(line, delimitor);

    // Split each "key=value" token and store it in the map
    for (String token : mt) {
        int loc = token.indexOf("=");
        if (loc != -1) {
            String i = token.substring(0, loc);
            String s = token.substring(loc + 1, token.length());
            keyValuePairs.put(i, s);
        }
    }

    // Write the mapped columns to stdout in a well-known order
    StringBuffer outRec = new StringBuffer(1000);
    outRec.append(checkForNull(keyValuePairs.get("uniqueID"), "")).append(seperator);
    // mapping logic similar to the above repeats for the remaining columns
    System.out.println(outRec.toString());
}
All that was needed for SQL*Loader to process the files was a control file that understood the order of the columns and applied any additional formatting.
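The original control file isn't reproduced here, but a minimal kv.ctl for this kind of pipeline would look something along the lines of the sketch below; the staging table and column names are purely illustrative, and the '|' delimiter is assumed to match the one used in the external table definition later in this post.

LOAD DATA
-- the input file ("-", i.e. stdin) is supplied via data= on the sqlldr command line
APPEND
INTO TABLE staging_mydata
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  uniqueid,
  col_001,
  col_002
  -- ...and so on, in the order the Java pre-parser writes the columns
)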
One of the additional benefits is that we can load the data via direct path and use other features such as multi-table inserts. The Java pre-parser enables you to add all of the additional formatting needed to make this a trivial process.
The following diagram illustrates the process.
This equates to a Unix/Linux statement such as
/bin/dd if=myverybigfile.txt bs=1024k status=noxfer 2>/dev/null | \
  /bin/gunzip -c | \
  java -classpath /home/oracle/loader.jar com.dom.KeyValueParserStdIn | \
  sqlldr bench/bench control=kv.ctl data=\"-\" direct=TRUE
NOTE: one thing you may have noticed is that I'm using dd to do 1MB I/Os. This is just an efficiency measure and works well on structures such as DBFS; you could skip this part of the operation if needed.
Which brings us on to external tables and the pre-parser.
External Tables and pre-parsers
As I mentioned earlier I like to take advantage of functionality that's already available and one of those features in the Oracle database is external tables. I don't intend to go into much detail as to why you should use external tables other than they do much of the heavy lifting for you and they provide a seamless interface between the filesystem and the database. They effectively make files look like tables.
This means it's trivial to implement parallelism for our pre-parser. We don't need to worry about how to handle the files or how to schedule everything; external tables take care of all of that for you. In our benchmark we used them in the following way.
Our previous pipeline remains the same, except that we don't need SQL*Loader; it's all managed by the table definition itself. So we end up with something similar to the following for the table definition (I've abbreviated it quite substantially and highlighted the important bits):
create table staging_ext_mydata_jan01 (
  uniqueid NUMBER,
  ..            -- Lots of columns
  ..
)
ORGANIZATION EXTERNAL
(
  TYPE oracle_loader
  DEFAULT DIRECTORY load_dir_jan01
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'external_tab.sh'
    BADFILE log_dir:'external.bad'
    LOGFILE log_dir:'external.log'
    fields terminated by '|' OPTIONALLY ENCLOSED BY '"' AND '"'
    (
      uniqueid char(100),
      ..        -- Lots of definitions
      ..
    )
  )
  LOCATION ('data1.txt', 'data2.txt', ... 'data100.txt')
)
REJECT LIMIT UNLIMITED;
One of the things to note is that I've included the pipelined preprocessor inside a shell script, which looks like this:
/bin/dd if=$1 bs=1024k status=noxfer 2>/dev/null | /bin/gunzip -c | java -classpath /home/oracle/loader.jar com.dom.KeyValueParserStdIn
The important part of this script is the parameter ($1) that is passed to the shell script. This is the file name that the external table wants to process.
A simple select statement against my "staging_ext_mydata_jan01" table unzips and parses the data, converting it into usable columns. What's more, if I issue the statement in parallel, Oracle takes care of creating the processes for me and making sure everything is scheduled in an orderly fashion.
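For example, a query along the following lines is enough to drive the whole unzip-and-parse pipeline in parallel; the degree of parallelism here is purely illustrative.

-- Degree of 16 is illustrative; with a preprocessor in play, Oracle hands out
-- the files in the LOCATION clause to the parallel slaves one file at a time
select /*+ parallel(stg, 16) */ count(*)
from   staging_ext_mydata_jan01 stg;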
To finish the load, we simply used a multi-table insert to put the data into the correct tables in an efficient fashion. Using this approach we were able to read zipped files, parse them and insert them into our three target tables at over 1.5 million source records per second.
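The benchmark's real target tables and routing rules aren't shown here, but the multi-table insert was broadly of the following shape; the table names, columns and WHEN conditions below are made up for illustration only.

-- Sketch only: target tables, columns and routing predicates are illustrative
alter session enable parallel dml;

insert /*+ append */ all
  when record_type = 'A' then
    into target_table_a (uniqueid, payload) values (uniqueid, payload)
  when record_type = 'B' then
    into target_table_b (uniqueid, payload) values (uniqueid, payload)
  else
    into target_table_c (uniqueid, payload) values (uniqueid, payload)
select /*+ parallel(stg, 16) */ uniqueid, record_type, payload
from   staging_ext_mydata_jan01 stg;

commit;

With parallel DML enabled and the append hint, the insert runs direct path into all three targets in a single pass over the external table.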