Breaking all the rules: Why finance is turning to machine learning to manage data
The sun is setting on the “old ways” of managing data at modern financial institutions. Spurred on by new regulation, intense competition, and the scale of the modern data challenge, mapping and managing information principally based on rules is on the way out. In its place, the largest financial organizations are turning to machine learning to more precisely manage their complex and diverse data ecosystems.
Hear from two industry experts driving this change today: Thomas Pologruto, Chief Data Architect & CTO of Liquid Markets and Data Science at Blackstone, and Daniel Waldner, Principal Strategist, Financial Services at Traction on Demand (formerly Director of Customer Data at Scotiabank).
Data Masters Summit 2020 presented by Tamr.
Welcome everyone to this session of Data Masters, titled “Breaking All the Rules: Why Finance Is Turning to Machine Learning to Manage Data.” My name is Ravi Hulasi. I’m responsible for international pre-sales activities at Tamr, especially with partners in the financial services industry. My background is in data mastering and data management, and I will be today’s moderator. I’m very excited to dig into this session, as we have two great guests with deep practical experience in delivering transformative change by leveraging data for some of the world’s largest financial organizations.
Before I introduce our guests, I think it’s important to set the stage for this discussion. We are in a moment where the shift to digital is driving a rapid pace of change. The sun is setting on the old ways of managing data at modern financial institutions. Spurred on by new regulation, intense competition and the scale of modern data, mapping and managing information principally based on rules is on the way out. In its place, the largest financial organizations are turning to machine learning to more precisely manage their complex and diverse data ecosystems.
With that, let me get started. It’s my pleasure to first introduce Tom Pologruto. Tom is a Managing Director in the Blackstone Technology and Innovations group. He serves as the firm’s Chief Data Architect, as well as the Chief Technology Officer of Liquid Markets Technology and Data Science. His focus is on bringing advanced data analytics, warehousing and visualization to all of the Blackstone corporate and investment teams. Prior to this, he was the Chief Technology Officer of the Hedge Fund Solutions group, and before that a quantitative researcher and Senior Portfolio Manager at Kepos Capital.
Joining Tom is Dan Waldner. He is a Principal Strategist at Traction on Demand, formerly Director of Customer Data at Scotiabank, where he was responsible for the definition and execution of the customer data strategy, with a focus on the creation of a single customer data repository, anti-money laundering, risk, and business enablement for an improved customer experience. Dan served in a variety of technology- and data-oriented roles within Scotiabank throughout his 12-year tenure, geared towards improving how thousands of data consumers within the bank use data.
With that, let’s begin today’s session, starting with Tom. Tom, thank you for joining. First question for you: before you became the Chief Data Architect and CTO at Blackstone, you had a fantastic career leading quantitative research. How did this research influence your views on the importance of data quality?
That’s a great question. As someone who has been doing data analysis for over two and a half decades, first as an experimental scientist and then in the financial services industry, I’ve never really felt as if there was anything more fundamental than data itself. For any question that we’re trying to answer, we need good data and good statistics, and then we can answer that question reliably. The natural extension of that was quantitative finance in particular, where we’re very data driven. But for the last 10 years I’ve actually been focused more on a quantamental approach, which combines data with insights from investment professionals and from people across the organization, to give better data-driven answers to really complex questions in the investment space.
That’s certainly been our experience as well: there’s been a shift toward people wanting to ask more of the data, and to make sure it’s fit for purpose before running it through the systems that generate answers to those questions. So, now that you lead architecture for Blackstone as a whole, how has your view changed?
We’re gathering a lot more data now, and it’s a lot messier. When you look across an enterprise that has been around, and been a leader in the industry, as long as Blackstone has over the last four decades, there is a lot of chronology to the data. That chronology passes through many technology systems, many platforms, many data vendors, many third parties. Now, in 2020, we really have the ability to put all of that data into a single place with a good data warehousing strategy. But then, how do we join that data together? How do we master that data? The Master Data Management project at Blackstone is a key part and cornerstone of all of the data work that we do.
We view our MDM system, as many folks do, as the Rosetta Stone that connects everything together. This was ultimately how I was led to engage with Tamr: behind that MDM project was the question of how we can take four decades of data and produce a golden copy, a unified data set.
That’s quite the endeavor, and certainly we see it with a lot of our customers: they’re just trying to get to that Rosetta Stone, that single view of the data they can trust for the downstream analytics. Dan, in your role as Director of Customer Data at Scotiabank, moving to a machine learning based approach to build that single customer data repository must have involved letting go of previous behaviors, of actually building a rules based approach to solve this; certainly in my past I’ve written a bunch of rules myself. Tell me, how did you go about moving [inaudible 00:06:04] from rules?
It was surprisingly easy, considering the environment that was, and still is, Scotiabank: a very large organization, operating in 52 countries across capital markets and corporate banking, with all the various dimensions of data you find when you actually dig into the underlying systems that service those business lines. You have different models for how customer data is stored and broken up, both at the logical level and at the abstract level, across different pieces of data. In some systems you have clean, pristine data that’s tied to reference sources such as Thomson Reuters or Avox. In others, it’s keyed in by people on the trading desk, or worse. So we were dealing with a wildly complex environment.
A rules based approach, at a certain tier, was just not going to work. Usually, for a rules based engine within an MDM, that tier is about 20 systems. Once you get past 20, it’s far more difficult to maintain the solution and keep it working efficiently than not. The appeal of machine learning was that it scales as you start to add more and more sources of data; it basically handles the work itself. It scales as you add additional systems, and to us that was key: this approach was the only really sensible way forward.
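Dan’s point about rules breaking down past roughly 20 systems can be illustrated with a quick back-of-the-envelope sketch. The quadratic framing below is our illustration, not a figure from the talk: a rules-based MDM typically needs bespoke mapping logic for each pair of source systems, while a learned model is trained once against a unified schema.

```python
from math import comb

# Number of source-system pairs that hand-written match rules
# potentially have to cover grows quadratically with system count.
for n_systems in (5, 10, 20, 30, 40):
    pairs = comb(n_systems, 2)
    print(f"{n_systems} systems -> {pairs} pairwise mappings to maintain")
```

At 20 systems that is already 190 pairwise mappings; at 40 it is 780, which is why rule maintenance effort outpaces the value it delivers.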
When you consider that data is not shrinking but growing year over year, any rules based approach is just not going to have the strategic ability to grow with it, and that’s something we were able to take advantage of at Scotiabank. We were able to stand up mastered customer data across, I think it was 30 systems, as our initial phase, and we did that in six months. Those were customer rows across those 30 systems that we were able to [inaudible 00:08:38], at several times less money.
That’s quite an achievement for an initial investment in this approach, and certainly speed sounds like a key factor in applying a machine learning based approach. There must have been some inertia points along the way, where the temptation was to go back towards rules. Were there any other factors that helped drive the use of machine learning and avoid [inaudible 00:09:12] internal mindsets which wanted to stick with the status quo?
I think the key linchpin was leadership in the industry. Whether it was the Enterprise Data Management Council or the various technical architects included on the solution, it was key to recognize that machine learning is on the rise and is tackling problems exactly like this, and to change the mindset that only a rules based approach is tried, tested and true. Rules have been around for 30 years or longer, and the glide path for how such a project would go is relatively well defined. Machine learning took courage to adopt because, within the scope of an enterprise bank at the time, it was relatively unknown. But we had seen applications of machine learning, RPA and artificial intelligence in similar contexts, which allowed us to take the leap: this is an innovative investment, and the risks involved are not significantly higher than the risks of a rules based implementation.
If you look, Gartner put out a survey not too long ago showing that only 15% of rules based MDM projects actually succeed; 85% fail for one reason or another. Maybe it’s a technological limitation, maybe it’s an organizational shift, maybe it’s a leadership or governance problem, maybe it’s a data problem. But when you’re considering gambling $100 million or more and five to seven years of implementation on a 15% success rate, you’re willing to take a look at some of the new emerging technologies as maybe a better, smarter approach. That won the day at Scotiabank.
I think that’s a sound analysis. Especially when you look at that success rate, which just needs to get better, if machine learning can provide that competitive advantage, then by all means it should be adopted. Tom, earlier you mentioned [inaudible 00:11:23] data and enriching data from third parties, and how it’s becoming more common and important to that effort to build a 360-degree view of an entity in financial services, both from D&B hierarchies, [inaudible 00:11:35] reference data, Bloomberg, Capital IQ, [inaudible 00:11:39]. What sort of challenges does the mix of internal and external databases create?
The other challenge, I think, as Dan pointed out, is that everybody thinks their way of doing things is the correct way. In every system that you’re looking at, whether internal or external, it comes down to everyone trying to innovate and differentiate themselves. For us, we were trying to answer questions; I’m trying to make decisions based on data. So, in a way, we want to force the issue: we don’t want to have to spend the majority of our time searching, joining and pasting data together from 10 or 20 different systems in order to have a holistic view of the world, a view that we can ultimately make a decision on with confidence.
In that sense, the challenge does grow exponentially. Every new piece of data that you collect has a different way of doing things, a different way of synchronizing or joining up, and most of our world is really around private entities. There is no publicly available mastering of data in the world that we could just reference. Even those projects that do have, or claim to have, those things are typically not very encompassing. There are entire industries upon industries based on security mastering and security mastering systems, but I wanted to take a more holistic approach: thinking about entity mastering and our master data management project. If we could do that, and use a better mousetrap to accomplish our goals, that would set us up and future-proof us against that tide of exponential increase that is already here.
We don’t even have to wait very long to say that we already have dozens of different data sources all referencing what is ultimately a singular entity, a singular view of the world. So I just want to get there as fast as we can and have the right people enabled to build a better machine. That’s where I think the machine learning aspect of this comes in, and I think it was done very elegantly with Tamr. I’ve done this for a very long time; I have a background in physics, and I understand the models and their complexities acutely. I think this was really the right approach to take. We’re not rules based, but at the same time, there are reinforcement aspects to it, so we can build better models pretty straightforwardly. I think that’s a rarity among a lot of what we end up doing, or trying to do, with machine learning.
I think that’s a really good point. The variety in your data exists today; the problem is here and it needs to be solved. So how did your data architecture position address this?
The data architecture position, I think, just made it a firm-wide initiative. We looked across the entire enterprise, as opposed to just within a few of our businesses, and realized that this was a generic problem. It’s not a problem specific to one of our business groups, or to our investment professionals, or to our client management team. It’s a holistic problem. It’s a Blackstone problem, and it’s not just Blackstone: every company that wants to be better about its best asset, which is data, now has the tools to do it. We have the tools with warehouses and catalogs, but we still need to get the right data in so we can get the right analysis out. That’s always been my mandate; to state it quite simply, I’m trying to help people answer questions with good data analysis. If I can do that, do it as easily as possible, and bring to bear the decades and decades of experience that we have across so many different industries on every marginal decision we’re making in the firm, then I’m on the right path to accomplishing my overall goal.
Certainly the principles, process and technology that you apply can be applied to each of those different divisions, be it client, reference data, securities, [inaudible 00:15:53] next phase challenge. Daniel, from your perspective, what’s your view of the importance of data enrichment?
I think there are two problems, or two halves of the same problem, when it comes to data enrichment and data identification: namely, identity on one hand, and interrogation of attributes on the other. What I mean by that is, if you take a group of ants, put a rolled die in front of them, and ask each ant what it sees on its side of the die, one will say four, one will say six, one will say one, one will say five, and so on, because to each of them the identity is what they see in front of them. That’s analogous to the problem you experience with all of these disconnected systems: establishing the identity to say that entity A in one system, entity B in another system and entity C in yet another system are actually the same entity. Wrapping that in a bow is one half of the problem.
That is the problem we used Tamr for at Scotiabank, to great success: being able to determine that, in fact, all these various sides of the die are the same die. Without that, it is very difficult to leverage external data enrichment services successfully. But where data enrichment does come into play, to great success, is once you’ve got your arms around that entity. Then it’s very easy to say, “Okay, I have a DUNS ID,” or “I have an Avox ID,” or “I have an LEI for this particular entity.” Now I can go out and get reference data from people who spend their careers making sure it is correct, and focus back on the customer, who is the right source of that data, to get the right attributes for that entity.
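A minimal sketch of Dan’s point: once identity is resolved and an external identifier such as an LEI or DUNS number is attached, enrichment becomes a simple keyed lookup against curated reference data. All names, IDs and values below are invented for illustration.

```python
# Two source records resolved to the same entity, both carrying the LEI
# discovered during mastering (values are hypothetical).
mastered_records = [
    {"entity_id": "E1", "source_name": "Acme Holdings", "lei": "LEI-001"},
    {"entity_id": "E1", "source_name": "Acme Hldgs Ltd", "lei": "LEI-001"},
]

# Curated reference data maintained by an external provider, keyed by LEI.
reference_data = {
    "LEI-001": {"legal_name": "Acme Holdings Ltd", "country": "GB"},
}

# Enrichment is now a trivial keyed join rather than a fuzzy-matching problem.
enriched = [{**rec, **reference_data.get(rec["lei"], {})} for rec in mastered_records]
print(enriched[0]["legal_name"])
```

The hard work is establishing the shared key; after that, the authoritative attributes flow in from whoever curates them best.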
So, try not to tackle both sides of that problem all at once. Simply ask: what are the separate entities that I need to be concerned with? Then: tell me how they all relate and what their attributes are, so that I can get the right answer. That was the approach we took, to great success, and that enrichment of data is so critically important to being able to provide users and customers with correct, timely information that they can action.
It’s interesting that you bring up hierarchies, as we often see the need for, let’s say, a sales hierarchy that’s different from the required legal hierarchy view, or from the hierarchy used to examine credit risk. It’s almost like, taking your analogy of a six-sided die, now we’re moving to a 12-sided die, or even larger. How do you maintain that need for flexibility in the data view once you’ve done the mastering, especially when it comes to managing these hierarchies coming in from all these sources?
It’s incredibly difficult. I’ve seldom come across a problem in my career that can be posed relatively succinctly in 30 seconds but requires several days to adequately explain, like hierarchies. Whether it be credit hierarchies, risk hierarchies, legal hierarchies, you name it, there are relationships between these various entities at the base level, and then branches underneath those, within the financial services world. We love to make things extremely complex, because that’s the way they are in reality. You can leverage the same machine learning approach that Tamr provides to reconcile identities at every level of your hierarchy; it just depends on how you want to aggregate that information.
We took a first-pass approach to say: here are all the legal entities that we know about in our organization. That’s the first cut. The second cut was: tell me about the parentage of those entities, and help us coalesce that into a similar machine learning model, to say that maybe three versions of this one entity share a parent, but there’s a fourth one off to the side that may indicate either that the underlying identity is different or that the parentage in that one record is wrong.
The same machine learning approach used to identify each entity can be applied to its parentage all the way through. Then it’s a relatively trivial exercise to split that three or four ways, depending on the hierarchy you’re looking for. Tamr’s ability to solve that through its machine learning process saved us an incredible amount of time in reconciling all of it. Then, layering some human eyes on top to sanity-check some of our records made it incredibly accurate in a very short time.
I think that’s a key point there. Really critical to all of this is, as you build that master view, layering on the hierarchies and then splitting them out as required by the different consumers of the data.
Let’s finish off by talking about return on investment. Tom, let me pose the question to you: with financial services being such a numbers-driven industry, how do you go about positioning the business value and ROI of data mastering, especially when it comes to ensuring that your team is delivering and driving towards that end business value? What we often hear is that many people focus on a more statistical framing, where they say, “Okay, I’ve reduced [inaudible 00:21:37] by X percent,” but they struggle to translate that into business value. What’s been your experience?
I’m very fortunate at Blackstone that the entire upper management of our firm, the CTO we recently hired, and myself are all completely aligned in the sense that data is our most valuable asset. It’s our people and our data, and the question is what we do to make everyone’s life better. That starts with data, ultimately. Our master data management system is the backbone of so many of our processes across so many different businesses and organizations. We don’t look at it from an accounting perspective; we really look at it economically. That’s what we’re trained to do as well: the economic value of good data and good mastering of data is so plentiful, and there’s such a huge multiplier effect, that when you take everything into consideration, justifying the ROI is almost a non-issue.
We don’t particularly think about it that way; we want to see what value we can extract. What time are we saving? What other projects can we work on because we have a good solution for this particular problem? It’s so key to solving so many other problems that it unlocks that door. From that perspective, I think it was very uncontroversial. Blackstone has always been a thought leader in the space of data-driven investments, and I think the role just elevates that to the proper status I have now, which is making sure we can deliver on that in the easiest and cleanest path possible.
It is certainly the case that good quality data is an enabler; that plays into it significantly. Dan, what’s your take on positioning return on investment? How do you stay closely aligned to the business need?
As Tom mentioned, data is so important to business, and it’s a multiplier throughout your organization. I echo those comments 100%, but I also look at data in two different ways: the defensive value of data and the offensive value, to put it in a sports metaphor. The defensive value, especially from a financial services perspective, is the ease and ability to deliver on regulatory, risk and operational requirements, streamlining efficiencies throughout your organization. These are things that just make sense for people to do, and do quickly, to make sure that operations continue to run in an effective manner and you’re not being chased by the Fed or the various regulatory bodies throughout the world. Oftentimes, defensive approaches are relatively straightforward to get funding for, because regulatory flags tend to have money thrown at them in all financial institutions; we just need to go and get it done.
From an offensive perspective, once you’ve got those regulatory safeguards in place, you can look at becoming that data-driven organization, using the foundation of all the work you’ve done on the first side to drive white-space analysis and next-best-action analytical models efficiently. Oftentimes, what I’ve heard is that companies want to fast-forward to all of this cool stuff. They get an R programmer and a Python programmer and a couple of data scientists to bash together these cool models that are going to tell them where the next big gold mine is in their industry or their investments. But those people are crippled, because all of their data is in such a bad state that it requires a ton of manual effort to get it into place, and thus, once they start to work on those models, their data is out of date, and that starts a vicious cycle of inefficiencies.
You set the foundation with your defensive approaches, getting all of your data up to date and accurate on a regular and timely basis, and then you apply the offensive side of your data, producing analytical models for people to drive your business forward and actually impact the bottom line. That’s where you start to see the magic. People want to skip that, often to their detriment. Tamr allows a shortcut to get to the fun stuff without having to spend years building that foundation. Data is a strategic asset, and many companies are starting to realize that the investment gets a significant payback, but it’s very difficult to prove.
We’ve all spent countless hours a week rummaging through bad client data or bad transactional data or bad account data. It’s very hard to quantify on a global scale, but we all know it’s there, and we all know the order of magnitude is well beyond what it would take to rectify through an engagement like this.
That’s an important point: just having good data, getting the data right, is such a key foundational principle before getting to all the other fun stuff downstream. Now a question for both of you. In terms of how this is delivered today, you’ve mentioned different roles and competencies within an organization: data curators, machine learning experts. How do you see that breakdown of roles within a company changing as this becomes more established and widespread, say, in terms of taking external capabilities and competencies and blending them with the internal ones you’ve built out today?
From my perspective, it happens every day. There are constantly new clients, new businesses, new data sets, new questions. So all the work I have to do, I’m doing while the airplane is in the air, flying at very high velocity with a ton of cargo on board. Blackstone has $550 billion of assets; we’re very much a live business, and you’re trying to engineer a lot of these things in motion. I think that is a challenge in general, but we really do need to be cognizant of the fact that we have to do no harm in any of these things. We’re not starting from scratch; we typically don’t have that advantage in any of our businesses. If we were at a startup, it would be a different story, but almost everyone else is walking into a Scotiabank. You’re walking into 30 different systems on day one, and by the time you’re halfway through, there are 34 systems, or 38 systems. So there is quite a bit of that.
If I could piggyback on that: how we work and how we align as organizations will change as a result of leveraging these types of tools. I think there’s a misconception in the industry that these tools are going to supplant human workers in various parts of the organization. Over the last 30 or 40 years, we’ve seen the evolution of how computers interact with humans in business organizations, and there are roles that will go away as a result of things like RPA, analytics, data centralization and so on. But for the most part, those are roles like the elevator operator: there used to be a profession where a person would sit in an elevator, you’d get on and tell them the number, they’d push the button or raise the handle and take you up to the fourth floor, and you’d thank them and get off.
That role doesn’t exist today; that job has been replaced by a button. Processing forms, doing boilerplate legal agreements and pushing paper around an organization are going to go away. What this is going to bring, and what we’re going to focus on now, is giving humans more enhanced tools that augment their ability to do things like close deals, approve customer relationships, and spot trends in marketplaces and investment opportunities, enhancing humans’ innate abilities. Amazon doesn’t have greeters. Amazon doesn’t have janitorial staff in retail locations. They don’t have people to stock shelves. Those roles have changed, but there are still people in their warehouses who go and pick items off the shelves, and there are people whose job it is to think about logistics. A great number of jobs are going to shift, but still remain. It’s about spotting where humans can add value and where computers can do the heavy lifting, and I think, to your question, that’s the impact of all of this.
Even within Tamr, within pre-sales, we see that as well: just how much can be automated, reused and applied as we engage with different customers, and then how much [inaudible 00:31:35] you experience [inaudible 00:31:38] enhance that process. So it’s definitely a common theme. Great.
With that, on behalf of Tamr and our audience today, I’d like to thank you both, Tom and Dan, for the insights you shared. It’s been a pleasure talking to you and look forward to future conversations. Thank you both.
Definitely. Thank you as well.
That was a fantastic conversation. Thank you so much, Thomas and Daniel. I enjoyed hearing about your experience in utilizing machine learning to manage complex data ecosystems. And thank you all for joining the session. We’re thrilled that you’re a part of the Data Masters Summit. I’m Juliana [inaudible 00:32:19], Director of Analytics at Tamr. I lead our efforts in building new analytics and data science solutions.
And I’m Elizabeth Mitchell, Analytics Engineer at Tamr.
We are here to show you a quick demo of how Tamr can help financial institutions accurately assess their risk by understanding relationships between the entities in their data. Know Your Customer (KYC) and anti-money laundering (AML) compliance requires financial institutions to have a master database of their customers, as well as to understand relationships to external, high-risk entities such as criminal organizations. Our financial services clients often use graph databases to store this relational data, but they required the data mastering capabilities of Tamr to perform large-scale mastering of customers, institutions and addresses. Tamr’s solution for AML and KYC allows the user to leverage mastered entities, such as customers or institutions, in a data model specific to graph databases.
Downstream of Tamr, users can easily export the mastered data to a graph database and visualization tool, which allows them to analyze their customer data and identify potential risks.
Here we have a Tamr project where we’ve mastered data for an AML and KYC use case for a financial institution. On the left, in green, we have the entity mastering projects, which we’ll dive into first. In this solution, we master the entities of companies, people and institutions, and accurately analyze the relationships between them. The Pairs page is where the real magic happens. From the unified data set, Tamr identifies questions for the user to answer. This allows subject matter experts to tell the machine learning model whether two records actually reference the same entity.
For example, here we see two records: one pertaining to Brian with an A, the other pertaining to Brien with an E, both with the last name Thompson. We can see that they both work for the same company, with the same legal entity ID. These we might label as a match. Here we see two very different names, and the individuals work for different companies. I am able to decide whether or not these two people are the same and train the machine learning model accordingly. After providing my input on a few similar questions, I have retrained the underlying mastering model.
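As a toy sketch of the kind of signals behind a pair question like the one above, consider a fuzzy name similarity plus an exact match on the legal entity ID. The records and the LEI value are invented, and Tamr’s real feature set is far richer than this:

```python
from difflib import SequenceMatcher

def pair_features(a, b):
    """Simple features a labeled pair might contribute to a matching model."""
    return {
        "name_sim": SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        "same_lei": a["lei"] == b["lei"],
    }

brian = {"name": "Brian Thompson", "lei": "549300EXAMPLE"}
brien = {"name": "Brien Thompson", "lei": "549300EXAMPLE"}
features = pair_features(brian, brien)
# High name similarity plus an identical legal entity ID suggests a match.
print(features)
```

A subject matter expert’s yes/no answer on pairs like this is what supervises the model.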
Tamr’s machine learning model uses the information the user has provided, as well as similarity between records, to master entities into groups, or clusters. Here you can see that Sue, Susie and Susan Ryan have all been mastered into a single entity based on my model training. Even in a simple example such as this one, writing a set of rules that would recognize these records as one entity would be extremely difficult and unscalable. Each of these mastered people is now assigned a single persistent ID, a unique identifier that is applicable across data silos and can be used to label and identify an entity.
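The clustering step described above can be sketched with union-find: pairwise “match” decisions are merged transitively into clusters, and each cluster receives one persistent ID. This is our simplified illustration, not Tamr’s actual algorithm, and “Bob Smith” is an invented non-matching record:

```python
class UnionFind:
    """Minimal union-find for merging matched records into clusters."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
for a, b in [("Sue Ryan", "Susie Ryan"), ("Susie Ryan", "Susan Ryan")]:
    uf.union(a, b)  # pairwise match decisions merge transitively

# Every record in a cluster shares one persistent ID.
persistent_ids = {}
for record in ["Sue Ryan", "Susie Ryan", "Susan Ryan", "Bob Smith"]:
    root = uf.find(record)
    pid = persistent_ids.setdefault(root, f"PID-{len(persistent_ids) + 1}")
    print(record, "->", pid)
```

Note that Sue and Susan end up in the same cluster even though they were never directly compared; the transitivity comes from the shared link through Susie.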
We follow the same workflow for both companies and addresses, mastering them with user feedback to train the machine learning model. From each of these mastering projects, we are able to build golden records, the yellow projects on the upper right-hand side. Here are the golden records for the people we just mastered. Each record in this data set takes the most common value of each attribute among the mastered records, producing a single golden record per entity. These golden records will be the nodes of our graph. Finally, we build a data set containing the relationships, or edges, of the graph. Here you see the “has address” relationship, which connects companies to their respective addresses. For this relationship, the golden records for companies and addresses constitute the nodes. Now that we’ve mastered our entities and built our relationships, we have a data model suited for export to a graph database.
The data model is agnostic when it comes to the downstream graph data store or visualization software. This means that the output of Tamr can be integrated directly into the user’s analytic pipeline of choice, without the need for further post-processing. In this example, we exported the resulting golden record and relationship data sets into [inaudible 00:36:53], the graph database and visualization tool.
Say we want to investigate a potential customer, Feran Deniso, and all of his connections. For the purpose of demonstration, let’s first look at what his connections looked like before Tamr mastering was implemented. You can see that Feran works for a company called Dow Jam, where a couple of other employees in our database work as well. If any of these entities are risky, we will know that they’re related to Feran, and so his risk level should be evaluated accordingly. Now, let’s look at what his knowledge graph looks like after Tamr has mastered all entities.
Feran’s tree has expanded significantly. On the lower left, you can see the cluster of nodes we were partially able to see before Tamr mastering. Even just within Dow Jam and its relationships, you can see that the consolidation of entities has made several more entities and relationships visible. Beyond Dow Jam, we can see a shared address with two other companies, more people and addresses related to those companies, and, even further out, four more companies and their related people appearing in relationship to Feran. All of these new nodes that we’re able to see with Tamr-mastered entities are potential risks that we were unaware of before mastering the underlying data. If any of these entities is a bad actor and we do not factor that into our interactions with Feran Deniso, it puts us, as a financial institution, at a great deal of risk.
So, as you’ve seen, the Tamr KYC graph data model can be fed directly into a graph database such as [inaudible 00:38:36] and visualized in a knowledge graph visualization tool like [inaudible 00:38:41]. Tamr provides a feedback-driven, iterable workflow for keeping entities mastered, so your knowledge graphs can be trusted.
Thanks for joining us at Data Masters. It’s been great having you here. I hope you got tremendous value and look forward to seeing you the next time.