Integrating Reltio MDM with Snowflake

Saama Analytics · May 7, 2020

Master Data Management (MDM) software is used to cleanse, transform and integrate raw data that is typically gathered from multiple sources describing the same business entities. In this process, the data is unified by:

- Matching (often fuzzy matching) various data points and removing duplicates

- Defining the associations and rules

- Prioritizing data sources etc.

The result is a single, unique master reference for each entity, termed the Golden Record, which business functions across the enterprise can rely on because it is accurate, up to date, consistent and complete.

For example, a customer can transact through an online website, over the counter in a store, or over the phone. Values for attributes like home address, work address, email, phone and preferred mode of communication may change across these sources over time. How does the business create a Golden Record from such data?


This article describes one of the practical scenarios we came across while working with Reltio Cloud MDM. Reltio is hosted on the AWS cloud, leverages graph technology to discover relationships between data elements, and runs on the Cassandra NoSQL database, with Apache Spark for rapid processing.

Phase 1 was to integrate and leverage Reltio MDM with an existing on-premise infrastructure.

Phase 2 was to modernize the stack and use Reltio MDM with Snowflake, which is discussed below.

While working with Reltio MDM, we identified a few integration constraints:

  • Reltio consumes JSON (as files or via API calls), which requires the files to be staged somewhere
  • That intermediate storage should ideally support logging and archival
  • On-premise ETL tools are generally incapable of directly converting structured data to JSON
  • Reltio does not provide a ready connector to Snowflake for data consumption

After discussing these challenges on the forums and speaking to support, the suggested approach was along the same lines:

  • Convert data into a Reltio-supported format using an ETL solution or the utilities provided by Reltio
  • Stage the converted JSON files in intermediate storage and transfer them to Reltio, with careful file deletion and archival
  • Run MDM, then stage the golden and cross-reference records from Reltio and ingest them into Snowflake for downstream reporting and usage

This data pipeline or orchestration directly impacts:

  • Data storage costs (relatively low in $ terms)
  • Compute costs (high in $ terms)
  • Change Data Capture (CDC) or Slowly Changing Dimension (SCD) handling
  • And, above all, the software development and engineering cost to build and maintain the pipeline

These costs grow with the frequency of data ingestion and downstream integration.

Based on the above know-how, we worked on a solution primarily using Reltio MDM, Snowflake & Python (scripting). Here, we have:

  • Structured Customer Data in Snowflake Database Tables
  • Active Reltio Tenant Credentials
  • Python 3 Scripts

Here we use the ELT capabilities available in Snowflake and make maximum use of its compute at a lower cost, taking advantage of the full power of the underlying MPP engine. You can find more about migrating to Snowflake here.
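Throughout the steps below, Python talks to Snowflake over a standard connection. A minimal sketch using the snowflake-connector-python package is shown here; the account, warehouse, database and credential values are placeholders, and ctx is the connection object that the posting script closes at the end:

import snowflake.connector

# Open a connection to Snowflake; all connection values below are placeholders
ctx = snowflake.connector.connect(
    user="SNOWFLAKE_USER",
    password="SNOWFLAKE_PASSWORD",
    account="xy12345.us-east-1",
    warehouse="RELTIO_WH",
    database="RELTIO_DB",
    schema="RELTIO_SCHEMA"
)

cur = ctx.cursor()
cur.execute("select current_version()")  # quick connectivity check
print(cur.fetchone()[0])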

First, we prepared Reltio-supported JSON from the structured data. Snowflake has a powerful SQL engine that can mold your data into the required format; below is a sample SQL statement for converting structured data to JSON.

select
  array_construct(object_construct(
    'attributes', object_construct(
      'Name', array_construct(object_construct(
        'value', to_varchar(name)))),
    'type', to_varchar('configuration/entityTypes/Organization'),
    'crosswalks', array_construct(object_construct(
      'type', to_varchar('configuration/sources/Finance'),
      'value', to_varchar(id)
    ))
  ))
from reltio_schema.re_customer limit 100 offset 1;

Snowflake SQL Snippet

After conversion from structured data to JSON, a sample record looks like this:

[ {
  "attributes": { "Name": [ { "value": "Darakjy" } ] },
  "crosswalks": [ { "type": "configuration/sources/Finance", "value": "111101" } ],
  "type": "configuration/entityTypes/Organization"
} ]

Secondly, all converted JSON records are stored in a Snowflake table with a VARIANT column; creating such a table is plain DDL, just like any other table. Extra columns with other data types (varchar, timestamp, boolean, etc.) can be added as well. To handle delta detection, it is recommended to add a last-update-date or indicator column that carries the current status of the record.

JSON stored in Snowflake variant table
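A minimal sketch of such a variant table and a delta query against it, executed through the connection opened earlier. The table name re_customer_json and column json_record are illustrative assumptions, while LAST_UPDATE_DATE and IS_ACTIVE_IND are the delta columns referred to later:

# Illustrative DDL: a VARIANT column for the JSON plus delta-detection columns
cur = ctx.cursor()
cur.execute("""
    create table if not exists reltio_schema.re_customer_json (
        json_record      variant,
        last_update_date timestamp_ntz default current_timestamp(),
        is_active_ind    boolean default true
    )
""")

# Delta detection: pick up only active records touched since the last run (the 1-day window is an assumption)
cur.execute("""
    select json_record
    from reltio_schema.re_customer_json
    where is_active_ind = true
      and last_update_date > dateadd(day, -1, current_timestamp())
""")
rows = cur.fetchall()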

Finally, the Python script used for posting the JSON data to Reltio Cloud:

  • Query the JSON data from Snowflake, clean it and make it Reltio-supported
  • Handle the authorizations and POST it directly to Reltio MDM
  • Handle errors and exceptions, archive processed records, and automatically chunk JSON records for smoother API transactions
  • If the authorization token is only valid for a few more seconds, refresh it so a full window is available for the POST operation (a sketch of the chunking and token-refresh logic follows the snippet below)

The script can be used inside ETL Python components, which minimizes the number of components needed for token refreshes and API calls. We have already tested this with Matillion and it works as expected.

import requests

# Get an OAuth token from Reltio (the Basic value is the client credential pair)
token_headers = {
    "Authorization": "Basic cmRsdGlvX3gsgm1ha2l0YQ=="
}
response_getToken = requests.post(
    "https://auth.reltio.com/oauth/token?username=USERNAME&password=PASSWORD&grant_type=password",
    headers=token_headers)
print(response_getToken.json()["access_token"])
reltio_token = response_getToken.json()["access_token"]

# POST the JSON records ('data' holds the records queried from Snowflake) to the Reltio entities API
postData_headers = {
    "Authorization": "Bearer " + reltio_token,
    "Content-Type": "application/json"
}
response_postData = requests.post(
    "https://sndbx.reltio.com/reltio/api/TENANT_ID/entities",
    headers=postData_headers, data=data)
print(response_postData.json())

# Close the Snowflake connection opened earlier in the script
ctx.close()
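For the record chunking and token refresh described in the bullet list above, a minimal sketch could look like this; the chunk size, token lifetime and get_token() helper are illustrative assumptions rather than the exact production logic:

import json
import time
import requests

RELTIO_URL = "https://sndbx.reltio.com/reltio/api/TENANT_ID/entities"  # same endpoint as above
CHUNK_SIZE = 500        # assumed batch size; tune to your payload limits
TOKEN_LIFETIME = 3600   # assumed token lifetime in seconds

def get_token():
    # Same token call as in the snippet above; returns the token and the time it was issued
    headers = {"Authorization": "Basic cmRsdGlvX3gsgm1ha2l0YQ=="}
    resp = requests.post(
        "https://auth.reltio.com/oauth/token?username=USERNAME&password=PASSWORD&grant_type=password",
        headers=headers)
    return resp.json()["access_token"], time.time()

token, issued_at = get_token()

# rows come from the Snowflake delta query sketched earlier; assume each row holds one entity object
entities = [json.loads(r[0]) for r in rows]

for start in range(0, len(entities), CHUNK_SIZE):
    chunk = entities[start:start + CHUNK_SIZE]

    # Refresh the token if it is about to expire, so the POST has a full window
    if time.time() - issued_at > TOKEN_LIFETIME - 60:
        token, issued_at = get_token()

    resp = requests.post(
        RELTIO_URL,
        headers={"Authorization": "Bearer " + token,
                 "Content-Type": "application/json"},
        data=json.dumps(chunk))
    resp.raise_for_status()  # surface API errors instead of silently continuing

Chunking keeps each API call small and predictable, which is what makes the transactions smoother.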

And that’s about it!

Advantages of this Solution:

  1. Automated data loading to Reltio without complex transformations in ETL/ELT
  2. No intermediate storage required
  3. No file archival mechanism is needed, since the records and their metadata can always be re-queried from the table
  4. Delta records can be processed by merely querying records based on the LAST_UPDATE_DATE & IS_ACTIVE_IND columns in the Snowflake table holding the JSON data (as in the delta query sketched earlier)

We would like to hear about your experience integrating Reltio Cloud and other MDM tools with Snowflake.

Author: Suraj D.
Sr. Data Engineer (MDM | iPaaS-ETL | DQ | Snowflake | Cloud-AWS), Certified AWS Developer with 6+ years of experience in Master Data Management, Snowflake & iPaaS solutions.

Originally published on saamaanalytics.com
