Tuesday, June 11, 2013

Replication Renaming Schemas, Tables and Columns

Schema definitions on replication slaves sometimes differ. Ability to rename schemas, tables and columns and still setup replication with comfort is useful for MySQL to MySQL replication, while for heterogeneous topologies like these, it's a must. In this article I'll present a component which I recently developed to enable an easy way of renaming database objects during both (a) initial slave load (provisioning) and (b) real-time replication.

It is called RenameFilter and is part of the latest Tungsten Replicator build. RenameFilter works with all DBMS types that Replicator supports.

Specifying What to Rename


First, let's look at how rename definitions file looks like. It is a CSV file which first three columns specify what to rename, while last three - what into.
# CSV format:
#
# originalSchema,originalTable,originalColumn,newSchema,newTable,newColumn 
There are two special characters:
* (asterisk) - match all objects (schemas, tables or columns);
               used in the search part (first three columns).
- (dash)     - leave original object name;
               used in the rename part (last three columns).
Let's look into one example in detail:
# Rename schema from schemac to schemac1.
#
#    Match all tables
#       |         Leave table names intact   
#       |            |
schemac,*,*,schemac1,-,-
#         |            |
#         |         Leave column names intact
#      Match all columns 
It is quite flexible: it's possible to rename from one specific column in a specific table and schema all the way up to renaming all occurrences anywhere in the database.

Rename during Provisioning


First use case for this is during initial provisioning when you are preparing your slave database.

a.) For homogeneous MySQL to MySQL replication, it's a manual, though straightforward, process. After restoring a backup to the slave, we go inside the database and make needed changes (like rename schema, some tables and columns). While doing this, we log everything to the tables_to_rename.csv in format introduced above. We will then use this file while setting up replication (see one section below).

b.) For heterogeneous topologies, we'll use ddlscan tool. Let's consider a MySQL to Oracle replication topology. Tables are in MySQL master, but Oracle is completely empty without even a schema. We'll need:
  • MySQL DBMS with some tables on the master.
  • Empty Oracle DBMS on the slave.
  • Tungsten Replicator with ddl-mysql-to-oracle.vm template available for ./ddlscan utility (translates MySQL schema dialect into Oracle dialect).
  • CSV file with rename defitinions.
In the CSV we define all MySQL tables and columns that exceed 30 characters and how to rename (shorten) them, tables and columns which start with a number and how to rename these, etc. A trivial example:
test,3table,*,-,freetable,-
Then we provide this CSV to the ./ddlscan:
./ddlscan
  -template ddl-mysql-oracle.vm
  -rename tables_to_rename.csv
  -db test
  -out test.sql
In effect, ddlscan working together with the RenameFilter, will transform a MySQL table like this:
describe test.3table;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| i     | int(11) | NO   | PRI | NULL    |       |
+-------+---------+------+-----+---------+-------+
And provide DDL SQL that is Oracle-compatible (note how table was renamed):
DROP TABLE test.freetable;
CREATE TABLE test.freetable
(
  i NUMBER(10, 0),
  PRIMARY KEY (i)
);

Rename during Real-Time Replication


After schema (with some objects renamed) is ready on the slave, we can configure Tungsten Replicator for replication. However, we want all transactions that happen on the master be aware of the renamed objects - eg. if there's a transaction on "3table", it should be transformed and applied into "freetable" on the slave. Luckily, Replicator can use the same rename definitions CSV file we defined during initial provisioning. All we need to do is provide it upon installation:
RENAME=/home/oracle/install/tables_to_rename.csv
tungsten-installer -a \
  --master-slave \
  --datasource-user=tungsten \
  --datasource-password=secret \
  --service-name=mysql2oracle \
  --home-directory=$TUNGSTEN_DEPLOY \
  --cluster-hosts=$SLAVE_HOST \
  --datasource-type=oracle \
  --datasource-oracle-service=ORCL \
  --master-host=$MASTER_HOST \
  --svc-applier-filters=dropstatementdata,zerodate2null,rename \
  --property=replicator.filter.rename.definitionsFile=$RENAME \
  --start-and-report
When Replicator starts-up, you'll see how it parses the definition file and constructs a look-up structure in the trepsvc.log:
INFO  filter.RenameDefinitions Parsing /home/oracle/install/tables_to_rename.csv:
INFO  filter.RenameDefinitions test.3table.* -> -.freetable.-
Succeeding transactions on this table will be transformed accordingly.

More RenameFilter CSV Examples


More examples covering common cases:
# Rename schema for all tables.
schemac,*,*,schemac1,-,-

# This table moved to a different schema.
schemac,tablew,*,schemac2,-,-

# Table renamed and moved to a different schema.
schemav,tablee,*,schemav1,tableee,-

# Rename table in a particular schema.
schemab,tabler,*,-,tabler1,-

# Rename each table with a given name across all schemas.
*,tablet,*,-,tablet1,-

# Rename column of this particular table.
schemaz,tableq,cola,-,-,colaa # Column renaming doesn't change schema/table.

# Rename column of each table with a given name across all schemas.
*,tablew,colb,-,-,colbb

# Rename column occurring in any table across all schemas.
*,*,colc,-,-,colcc

# Rename column across all tables in the given schema.
schemax,*,cold,-,-,coldd

A Note on Order of Preference


Order of lines in the CSV file doesn't matter - all of them are read during first load of the filter and they fill a fast in-memory look-up structure. The first three CSV columns (the search path) is used for look-up. With that said, it is possible to define rules that overlap. Consider the following two cases:
# Renamed schema, but one table is left in the original schema
# and some of its columns renamed.
scheman,*,*,scheman2,-,-
scheman,tabley,*,-,tabley2,- # Leave table in original schema.
scheman,tabley,colj,-,-,colj2
scheman,tabley,colk,-,-,colk2

# Renamed schema and one table with one column in it.
schemam,*,*,schemam2,-,-
schemam,tableu,*,schemam2,tableu2,- # Move renamed table to the new schema too.
schemam,tableu,colj,-,-,colj2 # Column renaming doesn't change schema/table.
The order of preference that is followed is straightforward. There are three cases to cover: finding new name for a column, table and schema. Each of these search from the most specific definition to the most generic. To be exact:

a. Order of preference when searching for a new column name:

1. schema.table
2. schema.*
3. *.table
4. *.*

Example:
client100,orders,amount,-,-,amount1 # 1. schema.table
client100,*,amount,-,-,amount2      # 2. schema.*
*,orders,amount,-,-,amount3         # 3. *.table
*,*,amount,-,-,amount4              # 4. *.* 
b. Order of preference when searching for a new table name:

1. schema.table
2. *.table

Example:
client100,orders,*,-,orders1,- # 1. schema.table
*,orders,*,-,orders2,-         # 2. *.table
c. Order of preference when searching for a new schema name:

1. schema.table
2. schema.*

Example:
client100,orders,*,client101,-,- # 1. schema.table
client100,*,*,client102,-,-      # 2. schema.*

Conclusion


We have reviewed how to use the new RenameFilter, that is available in the open-source Tungsten Replicator. RenameFilter is designed to help with various renaming needs during both initial provisioning and following replication, as opposed to developing multiple small filters to handle various rename needs that appear.

Wednesday, February 27, 2013

ddlscan - Utility to Help Analyze and Migrate Database Schemas

Intro


While working on one of the MySQL to Oracle replication projects for Continuent, I needed to implement an open-source utility for transforming MySQL schema to an Oracle dialect (DDL statements that create specific schema on Oracle) to save from otherwise tedious work. This article introduces ./ddlscan tool, which does that and is extensible to do much more.

Ingredients


Here's what you'll need:
  • Your favorite DBMS with some tables. Currently supported MySQL, Oracle and PostgreSQL.
  • Latest Tungsten Replicator build. Not even needed to install, enough to untar.
  • Velocity template of your choice. You will find currently available ones in tungsten-replicator/samples/extensions/velocity/ folder.

Recipe


Mix ./ddlscan with your chosen schema and a template:

$ cd tungsten-replicator/bin
$
$ ./ddlscan -db test -template ddl-mysql-oracle.vm -url jdbc:mysql:thin://localhost/ -user tungsten -pass secret

If you have Replicator installed, it's even simpler - ./ddlscan can read connection credentials from Replicator properties for you (no need for "-url" parameter):

$ ./ddlscan -db test -template ddl-mysql-oracle.vm -user tungsten -pass secret

Result - rendered text file (or stdout). Eg. of ddl-mysql-oracle.vm template:

/*
SQL generated on Wed Feb 27 14:13:21 EET 2013 by ./ddlscan utility of Tungsten

url = jdbc:mysql:thin://localhost/
user = tungsten
dbName = test
*/

CREATE TABLE 3colors /* ERROR: table starts with a number, use rename option and a filter */
(
  id NUMBER(10, 0),
  color VARCHAR2(1) /* ENUM('R','G','B') */,
  enabled NUMBER(1) /* BIT(1) - no constraints for 0 or 1 */,
  acolumnthatdoesntfittooracleat NUMBER(10, 0) /* WARN: truncated column name exceeding 30 characters (acolumnthatdoesntfittooracleatall) */

  /* ERROR: table must have a primary key! */
);


Architecture


To make it more scientific, here's an overview of how it works:


When you call ./ddlscan it uses Tungsten Replicator libraries and connects to the underlying DBMS through appropriate JDBC driver. There are classes in the Replicator which extract metadata from the requested schema. Tables, columns, keys and so on are identified and then provided as JAVA objects to the Velocity template-generating engine.

Velocity templates are not hard to write. Here's an example of a template that reports tables without primary keys: ddl-check-pkeys.vm - if comments and headers were removed, it's only 3 lines of code (at least, at the time of writing this article). To make it easier to handle the syntax, use some plugin like veloedit.

Finally, user receives a rendered result from applying schema metadata to a template.

Possibilities


By leveraging this simple approach many tasks can be automated. Here's just a few ideas:
  • The mentioned transformation of schema dialects between different DBMS types.
  • Automatic trigger generation for trigger-based replication or ETL tools.
  • Staging tables generation for loading data warehouses (eg. Vertica).
  • Schema validation (scanning for potential issues).
  • Automatic code generation. Eg. generating shell scripts to do something with the tables, generating JavaScript filters for Tungsten Replicator.

Conclusion


We have briefly reviewed how ./ddlscan tool works. More articles about specific use cases to come. Happy rendering!