Is There Such a Thing as Easy ETL?

E.T.L. That’s Extract – Transform – Load.  That doesn’t sound like a lot of work when all you need to get loaded is a simple Access database or an Excel spreadsheet.  In a situation like that, the process is so simple, all you really need to focus on is the L in ETL.  There’s not a whole lot of E.T. to process, despite how wonderful that movie is. [pun intended]  But as soon as your data loading process involves some difficult or sophisticated cleansing or transformations, it gets really, really hard.

The other thread that has really caught my interest lately is the US federal government's Open Data Initiative. I think it's remarkable that President Obama is the first president to appoint a federal CIO. (Shouldn't that have happened long ago?) In addition, President Obama instructed the entire executive branch to open up its data (where security isn't at risk) and make it readily available to the public. And the US government collects mountains of interesting and valuable data for its own uses, but figuring out how, or with whom, to share it has always been an afterthought. While I was a contractor for NASA, for example, I worked on some incredibly interesting projects that yielded amazing and commercially valuable information. It was all public domain. But unless you knew it was there, you couldn't get to it. Making use of all of that data has always intrigued me.

Now, with the Open Data Initiative, it's all being put on the internet at an ever-increasing rate at Data.gov. However, all of this data, while open and available, is not standardized. One data set might be a CSV file, while another might be an Excel spreadsheet. That means you'll need to extract, transform, and load that data if you want to synthesize more valuable data sets from it.
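To show what I mean, here's a quick sketch of the kind of stitching you end up doing by hand without a good tool. I'm using Python and the pandas library purely for illustration, and the file names, columns, and table name are all hypothetical:

```python
# A minimal sketch of pulling two differently shaped data sets into one
# common table. The file names and columns here are made up for illustration.
import sqlite3
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case and underscore the column names so both sources line up."""
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df

# Extract: one agency publishes CSV, another publishes Excel (hypothetical files).
grants_csv = normalize(pd.read_csv("agency_a_grants.csv"))
grants_xls = normalize(pd.read_excel("agency_b_grants.xlsx"))  # needs openpyxl installed

# Transform: keep only the columns the two sets share, then stack them together.
common_cols = grants_csv.columns.intersection(grants_xls.columns)
combined = pd.concat([grants_csv[common_cols], grants_xls[common_cols]], ignore_index=True)

# Load: push the merged result into a local SQLite table for later analysis.
with sqlite3.connect("opendata_demo.db") as conn:
    combined.to_sql("grants", conn, if_exists="replace", index=False)
```

That's fine for one pair of files, but it gets ugly fast when the formats keep multiplying and changing.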

For those reasons, I've been researching tools to help make this process easier. (I also wanted to research SSIS and ETL tools for my Tool Time column in SQL Server Magazine.) Now, I've been following expressor software for quite some time and really like their unique approach. (I actually ran into the expressor software team at a PASS Summit a year or two ago and asked for a demo of their software, and I really liked what I saw.) Rather than the workflow approach used by SSIS, expressor uses a data mapping approach combined with reusable business rules. Their mapping approach is fundamentally different from the traditional point-to-point, source-to-target mapping paradigm. Basically, you define a semantic type representative of your business data, create one or more business rules to apply to the data, and then implement a “canonical” mapping that connects data sources and targets to that same semantic type. And it's free!

 

Abstraction is Awesome

What's cool about that? Don't forget that “semantic” means “meaning”. So a semantic type is an abstraction of the meaning of the data. The net result is that expressor shields your data integration application, with its associated business and transformation rules, from the changes that come when underlying source or target files with different field names and data type representations have to be processed.

For example, let's assume that you need to process invoices from different vendors in slightly different formats. If you use a traditional ETL tool like SSIS, any change in the source or target formats will require you to modify your data mappings and transformation rules, because the mappings are tied directly to the metadata structure of the invoice file formats. expressor, on the other hand, lets you define a common “invoice” semantic type, build all your downstream data processing off that type, and map one or more invoice file schemas to that type.

This approach greatly simplifies the mapping process and yields more flexible data integration applications that can be adapted more easily to changes in the underlying sources and targets.
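To make the idea concrete, here's my own rough Python sketch of the concept (not expressor's actual tooling or syntax; the vendor field names and the business rule are made up): a common invoice type, a per-vendor mapping onto it, and one business rule written against the type.

```python
# A rough illustration of the semantic-type idea, not expressor's actual API.
# Vendor field names and the validation rule are made-up examples.
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class Invoice:
    """The common, canonical invoice type that everything downstream uses."""
    invoice_id: str
    vendor: str
    amount: Decimal

# Each vendor's file layout maps onto the canonical type once, in one place.
VENDOR_MAPPINGS = {
    "vendor_a": lambda row: Invoice(row["InvNo"], "vendor_a", Decimal(row["TotalDue"])),
    "vendor_b": lambda row: Invoice(row["invoice_number"], "vendor_b", Decimal(row["amt"])),
}

def must_be_positive(inv: Invoice) -> Invoice:
    """A reusable business rule written against the type, not against any file format."""
    if inv.amount <= 0:
        raise ValueError(f"Invoice {inv.invoice_id} has a non-positive amount")
    return inv

def load_invoices(vendor: str, rows: list[dict]) -> list[Invoice]:
    """Downstream processing never changes when a vendor's file layout changes."""
    to_invoice = VENDOR_MAPPINGS[vendor]
    return [must_be_positive(to_invoice(row)) for row in rows]

# Example: two very different source layouts, one downstream code path.
print(load_invoices("vendor_a", [{"InvNo": "A-100", "TotalDue": "129.95"}]))
print(load_invoices("vendor_b", [{"invoice_number": "B-77", "amt": "42.00"}]))
```

If vendor A renames a column tomorrow, only that vendor's mapping entry changes; the rule and everything else built on the common type stays put.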

[Screenshot: expressor Studio Desktop]

Benefits Abound

Since the semantic types in expressor are captured as reusable artifacts, you can reuse them in new data flows within your projects. You can even share them across your entire organization. As I tinkered with the expressor Studio tool, I hit on a few other benefits of this approach (there's a quick sketch of a couple of them after the list):

  • Handles data type conversions automatically without having to write data transformation rules for these conversions
  • Builds new semantic types from existing types and reuses types in existing and new applications
  • Creates multiple, reusable business rules against a single type and applies them repeatedly as needed
  • Easily implements data quality rules and constraints
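
Here's a quick, standalone variation on the invoice sketch above showing the first and last points: a conversion rule and a couple of data quality constraints hung on the semantic type itself, so any flow that uses the type gets them for free. Again, this is my own hypothetical Python illustration, not expressor's syntax.

```python
# A hypothetical sketch of attaching reusable conversion and data quality
# rules to a semantic type, so every data flow that uses the type shares them.
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

def to_amount(value) -> Decimal:
    """Conversion rule: accept 129.95, '129.95', or '$1,299.50' without per-flow code."""
    try:
        return Decimal(str(value).replace("$", "").replace(",", ""))
    except InvalidOperation as exc:
        raise ValueError(f"cannot convert {value!r} to an amount") from exc

@dataclass
class Invoice:
    invoice_id: str
    amount: Decimal

# Data quality constraints registered once against the type; any flow can run them.
INVOICE_CONSTRAINTS = {
    "has an id": lambda inv: bool(inv.invoice_id.strip()),
    "positive amount": lambda inv: inv.amount > 0,
}

def quality_report(inv: Invoice) -> list[str]:
    """Names of the constraints this record violates (empty list means clean)."""
    return [name for name, check in INVOICE_CONSTRAINTS.items() if not check(inv)]

print(quality_report(Invoice("A-100", to_amount("$129.95"))))   # []
print(quality_report(Invoice(" ", to_amount("-5"))))            # ['has an id', 'positive amount']
```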

In an Ideal World…

In an ideal world, I'd figure out some brilliant way to make money from bringing together the kinds of government data I used to work with. Other folks are already doing it at the Windows Azure Data Market. In the meantime, I'm looking forward to tinkering with this data to build better demos. Along the way, I'm going to use the expressor Studio desktop ETL tool (did I mention that it's free?) and tell you about my experiences as I try to build out some Data.gov data sets.

Those of you who know me know that I like a good discussion and cooperative, constructive teamwork. So I encourage your feedback and suggestions as I work through these data integration challenges and share my experiences. I'm looking forward to sharing my insights on what the expressor data integration software can do with this challenge and what some of its features and capabilities are. As new releases come out, I'll let you know what I find intriguing and worth mentioning.

Check out their website, www.expressor-software.com, to learn more about their company and products.

Enjoy,

-Kev

Follow me on Twitter
