DataMasters Summit 2020

Best Practices for DataOps in the Public Sector

 

Donna Cotton

CDO, AFSEO

During this session, Donna Cotton, CDO of the U.S. Air Force SEEK EAGLE Office (AFSEO), discusses how the Air Force leveraged 50+ years of accumulated data in its data lake to synthesize previously approved analyses rather than building new analyses from scratch.

Join to learn how these efforts helped engineers certify new stores and capabilities with the agility required to support the modern-day U.S. Air Force.

Transcript

Speaker 1:
Data Masters Summit 2020 presented by Tamr.

Donna:
Hi everyone, thank you for joining us today. My name is Donna Cotton, and I’m the Chief Data Officer at the Air Force SEEK EAGLE office at Eglin Air Force Base near Pensacola, Florida. I’d like to take this time to introduce some of my colleagues from AFSEO, Tamr, and Dell EMC that have made all of this possible. Can you go ahead and introduce yourselves, you guys?

Scott:
Good afternoon, I’m Scott Gulig, I’m a data scientist with the Air Force SEEK EAGLE office.

Ted:
Hi, I’m Ted Gudmundsen, I’m the technical lead for the public sector at Tamr.

Aaron:
Hi, my name’s Aaron Solomon, I’m a machine learning engineer on the public sector team at Tamr.

Todd:
Hello, my name is Todd Nguyen, I am a senior principal engineer with Dell Technologies.

Brian:
Brian MacKinnon, senior principal engineer, Dell Technologies.

Donna:
Thank you guys. So, I’ve been working with engineering data over the past 28 years in one capacity or another, but over the past four years, my team and I have been working on this data lake and it’s a product I’m extremely proud of. We’ve managed to not just change the way that our office uses data, but we’ve turned that data into new products. So I’m looking forward to telling you about our journey in hopes that other CDOs in the room will find it interesting and potentially relevant.
But before we get started discussing the data lake, I’d like to explain a little bit about my office. We’re responsible for ensuring safety of flight, and making certification recommendations on whether new configurations are safe. These two pictures represent an F-16. It’s a 40-year-old aircraft and it’s the workhorse of the fleet. It’s capable of supersonic operations 10 miles over the Earth’s surface. The unloaded F-16 on the left is very sleek and fast. These modern-day fighter jets are like cheetahs.
But in reality, when you see the loaded F-16 on the right and the amount of stuff that gets strapped to its underbelly, you start to realize that it’s much less like a cheetah than it is a pack mule. But what’s a configuration? The configuration is the plane and everything else that gets strapped to it. You might be thinking to yourselves, “After 40 years you should know everything there is to know about an F-16,” but that’s just not true. People are coming up with new configurations and inventing new stores every day. So what could go wrong? We have some fun videos to share. We’ll start with the separations video.
Separations is evaluating how a store releases from the aircraft, with respect to the aircraft, and as you can see here, the store is hitting the aircraft, and it’s going to take out the rest of the stores too. Now this is not safe or acceptable. Next we’re going to show you a flutter video, and while this isn’t a military aircraft, you can see the effects of flutter on the wings. Flutter is basically looking at whether the wings will vibrate unsafely because of the weight and aerodynamic effects, and as you can see here, this is not safe or acceptable either.
So, the Air Force SEEK EAGLE office, or AFSEO: our mission is to deliver state-of-the-art capability to the field. We really value being a responsible steward of our resources, that’s critical to our leadership, and to be effective, we must be agile, trusted, and responsive. We get a couple of hundred new requests to fly new configurations per year from various customers like program offices, the DoD, other commands, and even some commercial customers, and with that, AFSEO must formally recommend whether a configuration can be flown safely and under what conditions, including Mach and altitude, with documentation of the engineering rationale for all recommendations.
So we’re the gatekeeper, for better or worse, to innovative new capabilities getting out to the field. So we have to be responsive. Neither the Air Force nor the pilots can afford for us to be the bottleneck.
And then with the F-35 coming online and continuing innovation on every front, we have more requests than ever, and it’s not slowing down, so we’ve got to get more effective ways to get certification recommendations out the door with less human effort. Next slide.
So what’s interesting to me with this timeline is it illustrates just how long AFSEO has been around. In the 60s, there weren’t a lot of strict rules or rigor about what configurations were flown on aircraft, and things went wrong, as you saw in the videos earlier. So in 1961, the Air Force decided, “We have to be more systematic about this,” and initiated project SEEK EAGLE, and this was at the height of the Cold War.
Then in 1987, the Air Force SEEK EAGLE office was chartered to manage this SEEK EAGLE process for the Air Force. I mean the fact that engineering data has been collected for some 59 years is especially relevant to a CDO. I mean, a lot of the aircraft that are more than 40 years old are still flying, so the data from the 80s is still relevant. It’s not like we can just archive this data to tape and forget about it. This data is still relevant to our day to day decisions.
So a little bit more about AFSEO. AFSEO consists of eight engineering disciplines that produce and analyze data. Each is responsible for preventing one mode of catastrophic failure from store loading on the flight line to weapon carriage during flight, and finally to store release and delivery to target.
As you can see here, fit and function, they evaluate the fitment, functionality, and loading and handling procedures of the aircraft store configuration. And then as we start to lift off, you’ve got stability and control that are evaluating the impact of the external stores on the aircraft handling qualities. Kind of like weight and balance at the liftoff.
Then you have loads, who is evaluating the structural compatibility of both the impact of the store on the aircraft, and the aircraft on the store. I mean, are things going to overheat in flight? Or is the store going to vibrate too much and stop working? EMC and EMI are evaluating whether there will be dangerous electromagnetic compatibility and interference between the store and the aircraft. You don’t want anything to happen before you’re ready.
With flutter, you saw that in the video earlier with the vibration of the wings. Next, we’re going to get into separations, which you also saw in the video earlier, the effects of that. Ballistics, they’re evaluating the aerodynamic characteristics of stores to model the flight trajectory, and that means will the store release on target? And then can we safely escape? So they’re evaluating the safe delivery of fragments to minimize risk to the aircraft.
So next we’re going to get into the data for each of these engineering disciplines. All eight engineering disciplines have very unique datasets based on their area of expertise, but they all share some commonality with respect to data types and data-gathering methodologies. They have all this data stovepiped into their own data silos. They conduct their own tests, they store their data in their own ways, and they’re not leveraging this data across the disciplines very efficiently either.
So fit and function, they have really cool laser scans and they’re evaluating the physical properties data from the aircraft and store. As you can see, across several of the disciplines, they’re actually doing a lot of modeling and simulation, they’re doing ground tests and flight tests, but there’s some unique things, like with EMC and EMI, where they’re using a Maxwell Solver software, and then you have ballistics and safe escape that are using very customized software platforms or models to evaluate trajectory. But several of these disciplines are also using modeling and simulation on a local high performance cluster, and that brings a whole other challenge as CDO.
So as you can see, we have a very diverse compute load, with very diverse datasets coming out of all of these very specialized engineering disciplines, but ultimately all this diverse data and analysis is synthesized into a single clearance recommendation on a unified store limitations sheet that you see here. This sets the limits for how fast and how far, how high, and under what conditions you can fly, fire, or jettison. I mean, this is our product. This is the most important strategic asset for AFSEO.
And there’s a couple of things that should jump out at you here. Compared to other organizations, we rely on our old data very heavily. A lot of organizations are using their data for this and that analytic, but 90% of our engineering decisions are by analogy. So what does that mean? That means that our old historical data is the key element in our new products, so we’re not kidding about it being valuable. The solution for a new product is a mix of old data, like a wind tunnel test from 15 years ago, and new data from a wind tunnel test that we did last week, and this blending right now is very manual. It involves a lot of engineers, but there’s a lot of potential here for making this process easier, and that’s the story we’re going to tell today.
I think the big win for AFSEO is this ability to connect all of our data together, and because of that, we’re streamlining the workflow from eight teams working independently to a collaborative group working together to produce a single product, and on top of that, we’re able to start leveraging the data and have transparency across disciplines where we don’t have that today.
So I was very fortunate to get to listen to the Air Force CDO session yesterday, and as she mentioned, the Air Force set up their office in September of 2017, and AFSEO was right on their heels around April of 2018 when we established our office, and I think we can all agree that the Air Force core goals that you see here on the left are best practice, and they are no different for AFSEO. I mean, to really begin to leverage SEEK EAGLE engineering data as a strategic asset, we had to focus on a few immediate priorities though.
One being data accessibility, and this one is a big one that’s addressed by the data lake, as you’ll see. Digitization is also related to that data accessibility, and we started this initiative back in 2016 to protect our data. There were filing cabinets in several buildings on multiple floors. I mean, they were lining hallways and filling entire rooms. This was a step in the right direction, but it caused its own accessibility issues.
So now the PhDs, and the SMEs, and more seasoned engineers can’t depend on their memory about where certain reports or test results are located, and until a few years ago, engineers were literally going to filing cabinets to get 30 year old CAD drawings to start their analysis. You’ll hear more about how we’ve used machine learning to automate a bunch of our data tasks like the data tagging and creating automatic products from past data.
One of the more important things, though, that I’ve had to realize as CDO in AFSEO is that a single data lake can’t do it all, and trying to shove all my compute applications into a single place doesn’t make sense. I do need some level of variety. Even when this data lake is complete, and I don’t think it’ll ever really be complete, I’ll have a large variety of compute environments, and that’s okay.
Like our computational fluid dynamics team has different compute requirements than our flutter team, and that’s okay. Even if they’re using standard hardware, these are PhDs that want to install and maintain their own tools, so they don’t need me slowing them down. The key is to unify where I can while supporting variety where it makes sense, and it becomes a win-win.
Data governance. All I can say is institutionalizing governance into a 33-year-old organization is the hardest part by far. So our data challenges: from a data perspective, what are the problems that we’re trying to solve? I mean, ultimately, AFSEO needed an innovative solution to data management that could address the volume and variety in our current environment, and do it without impacting the mission, because at the end of the day, AFSEO has to get product out to the field. So it’s like rebuilding and modernizing an aircraft while it’s flying and airborne, which we can all agree is a challenge.
I think that the challenges that we are facing though are common across industry. Technology and management processes must go together. Technology alone isn’t going to work. It’s also important to identify technology that’s going to fit in the workflows, or they aren’t going to be adopted without a fight. We have to get repeatable processes in place for this to work.
The other really big thing is to be able to show tangible value very early on. We have to demonstrate a big win early to get the buy-in across the organization. The skepticism is high, right? We need to have a future-proofed solution because, as we can see, AFSEO has a 50-year history and we’re likely to be in business for decades to come, so we can’t build a data lake architecture that’s archaic in 10 years, or even five. So we’ve been very intentional about choosing technologies that are open source, scalable, and heavily adopted to future-proof ourselves as much as possible.
We have some other work force issues due to being around for 33 years as well. Our PhDs and SMEs, they have good instincts, but they’ve also developed mental shortcuts. Some of these engineers have actually been around the office since the mid-80s. They have a lot of very personal processes that they haven’t written down, and we’re helping them to institutionalize some of that knowledge by putting it into code.
So, the overarching requirements for the data lake design are simple. The solution had to be on premises, open, scalable, and secure. But let’s start with the foundation. We’re using the Dell EMC Isilon for storage, and this is not your average hard disk. You’ll hear more about this storage and why it’s a transformative technology for us. In the middle, we have Cloudera. We know that we’re going to need to add nodes in the future because our data’s getting bigger and not smaller, so our big data applications are built on this inherently multi-node, scalable compute layer. This is the industry standard for big data technologies, and we know it’s going to hold up over time.
On the top, we have Tamr at the application layer for organizing data and interfacing with our organic tools. What’s great about Tamr is that they have this COTS software product for data unification, and it plugs into and runs on Cloudera components. Tamr makes it easy to apply machine learning to automate engineers’ manual work too. And then on top of that, on top of their COTS product, we get some applications that are really transformative, incorporating a Google-like search and filter for our files.
So in the next slide, Ted is going to talk about a few of those applications.

Ted:
Awesome, thanks. Yeah, so I want to tell you a little bit about what this data lake actually does, what it produces. So let me start at the beginning of the workflow. A new request comes to the SEEK EAGLE office. Once that happens, it goes to the data lake and it kicks off some automatic processes. The first is metadata extraction. The file is tagged with informative tags across a number of dimensions so that it can be related to other documents that are already in the data catalog, already in the data lake.
The second is the document goes through a number of data processing fundamentals, entity resolution and so forth, so that it can then go through some discipline-specific logic to be compared to past products. So the idea is you have a new request coming in, you want to compare it to the things that this office has produced over the last 30-some years and see which ones are most similar. That allows you to make these analogies by [inaudible 00:18:42].
And finally, it goes through a machine learning pipeline, where we’re actually predicting core values: at 10,000 feet, at 0.9 Mach, how much will the wing actually vibrate? And that’s coming out of the ML model.
So what are the outputs here? What you get from this data lake process is an actual recommendation, it’s a text document that says this configuration is safe under these conditions, this is how fast the plane can fly, and it actually cites its sources. It says, “Here’s why we think it can fly at this speed, because of this study, and this study, and this study,” wholly automatic. It also produces these predictions that you can see in the chart here. It says these are the conditions under which you’re green, here are your yellow conditions, here are your red conditions.
But we’re not trying to, and don’t believe it’s possible to, fully take the human out of the loop. The engineer, the expert engineer, is still an important part of this process, both to review the automatic outputs, and also to do things that are truly new and can’t really be done based on your past data. So for those expert engineers, we have some productivity tools.
There’s an interactive [inaudible 00:19:54] browser where they can look through all of the past things that have been produced in a very user friendly way, and then this document catalog that you’re going to actually see a screenshot of in a minute, that lets them search through the tens of millions of documents that are in this data lake and find just that one that they’re looking for.

Donna:
Thank you Ted. So now let me tell you a little bit about the architecture that makes this all possible. And let’s get started with the Isilon. I mean, the key and the most important thing to us about the Isilon is this functionality called multi-protocol. Basically, the same data could be accessed by users through the CIFS and NFS protocols and it looks just like their X drive, or whatever drive is mapped, but from the point of view of the data lake applications, all of the data is in HDFS.
Because in our environment, we have this fundamental problem, we support Windows, Linux, and Mac, but the data lake is entirely on HDFS. Cloudera wants the data in HDFS, and so does Tamr. So do we need two copies of the data? How do we keep them in sync? I mean, it was going to be a mess. But with this multi-protocol, there’s only one copy of the data, so there’s no synchronization necessary. I mean, the users just save files on their OS the way that they always have, and the data lake just thinks they’ve saved the files in HDFS.
We also don’t have to worry about mirroring permissions between the two locations. However the file-level access controls are set, all of that just transfers over to our Cloudera and Tamr applications out of the box, which is important, because in some cases we have visitors from different countries with restricted access to data. Because there’s only one copy of the file, there’s only one copy of the permissions. It’s great.
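To make the multi-protocol idea concrete, here is a minimal sketch, not AFSEO's actual setup: the share path, namenode hostname, and port below are hypothetical examples of the same physical file being read through an ordinary mapped drive and through HDFS.

```python
# Minimal sketch of the multi-protocol idea: one physical copy of a file on
# Isilon, reachable both as an ordinary mapped drive (SMB/NFS) and as HDFS.
# The share path, namenode host, and port below are hypothetical examples.
from pyarrow import fs

SMB_PATH = r"X:\flutter\wind_tunnel\run_042.csv"          # what the engineer sees
HDFS_PATH = "/ifs/data/flutter/wind_tunnel/run_042.csv"   # what the data lake sees

# Engineer's view: plain file I/O against the mapped drive.
with open(SMB_PATH, "rb") as f:
    smb_bytes = f.read()

# Data lake's view: the same bytes served over the HDFS protocol.
hdfs = fs.HadoopFileSystem(host="isilon-namenode.example.mil", port=8020)
with hdfs.open_input_file(HDFS_PATH) as f:
    hdfs_bytes = f.read()

# No copy, no sync job: both handles resolve to the same stored object,
# so the contents (and the permissions enforced on them) always agree.
assert smb_bytes == hdfs_bytes
```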
Cloudera, we’ll move on to the middle layer now. Cloudera’s making that multi-node architecture pretty easy. We’ve got a few big VMs, we’ve got Cloudera microservices distributed between them, and so between Cloudera Manager, Spark, HBase, and Solr, my application layer doesn’t even realize it’s running across multiple nodes, it just sees a lot of power.
Then let’s get to the app layer. The first application we’re going to highlight is this data catalog. And there’s really two parts to this: the tagging of the files so that you have something to search against, and then the search application itself. The tagging is Tamr; the visualization actually relies on this huge Solr index, which is piped through a UI called Banana, and Banana is the actual user experience.
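As an illustration of the catalog's indexing side, here is a hedged sketch using pysolr; the Solr collection name and the tag fields are hypothetical stand-ins for the tags Tamr produces, not the actual AFSEO schema.

```python
# Hedged sketch of the catalog's indexing side: once tags exist for a file,
# they get pushed into a Solr collection that the Banana UI reads.
# The collection name and field names here are hypothetical stand-ins.
import pysolr

solr = pysolr.Solr("http://solr.example.mil:8983/solr/afseo_catalog", timeout=30)

solr.add([
    {
        "id": "hdfs:///ifs/data/flutter/reports/2007/gbu39_rationale.pdf",
        "file_name": "gbu39_rationale.pdf",
        "doc_type": "Engineering Rationale",   # one of the ~70 document types
        "aircraft": "F-16",
        "primary_store": "GBU-39",
        "other_stores": ["AIM-9", "AIM-120"],
        "project_name": "GBU-39 Block 1 certification",
        "year": 2007,
        "discipline": "Flutter",
    },
])
solr.commit()
```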
The second application is the recommendation engine. It’s looking through a bunch of historical data and then making these recommendations automatically. There are Spark and HBase jobs running behind the scenes that are coordinated by Tamr.
Now the third application, this is custom machine learning, and it’s the deepest application of the three. We’re actually trying to predict key engineering values in very complicated, complex aerodynamic circumstances. We’re actually using finite element simulations and all kinds of things, so this is the most forward-looking, and the riskiest, of the apps, but it’s well supported in this architecture.
So let’s talk a little bit more about the Isilon. For me, this multi-protocol storage solution is a huge success for the data lake and for AFSEO. It satisfies all the Air Force CDO goals and our AFSEO immediate priorities for a solution, and it’s kind of easy to remember with this acronym, SPEED. There’s a scale-out architecture, and it can scale out to 58 petabytes on edge, core, or cloud. Performance keeps the solution simple without sacrifice. It’s efficient; it uses its own [inaudible 00:24:25] or ethernet network to communicate between nodes. It has enterprise features to include availability, redundancy, security, replication, and data protection. And then the data lake itself: because all the data is in HDFS, it’s a one-stop shop. Next slide.
So this Isilon storage solution brings flexibility to our data lake architecture, and as you can see, it supports a wide variety of software analytics platforms and tools because of its multi-protocol, traditional IT file share, and analytics support. By supporting all of this on a single platform, it enables us to do in-place analytics on consolidated data, because much of the [inaudible 00:25:15] processes can be run natively on the Isilon. This alone simplified our data life cycle. We didn’t need any pipelines to move data based on whether it was active or archive data; the data doesn’t have to be moved to leverage the power of the data lake.
So this flexibility supports multiple workloads simultaneously, and then the adaptability of it is minimizing our life cycle cost over time.
So before we start getting into the three applications, I’d like Aaron to give a demo of the processing foundation.

Aaron:
Great, thanks Donna. So as you’ve heard, at the core of AFSEO’s mission to accelerate its workflow is the need to unite data across engineering and disciplinary boundaries. Doing so enables the integration of multiple data sources into a singular global reference, which in turn powers the automated logic and analytics we’ve been talking about. The essential currency of AFSEO is the configuration, as Donna mentioned, and the configuration is made up of multiple stores. These are pylons, launchers, or munitions that can be attached to the wing of an aircraft and deployed from it.
As we’ve seen, AFSEO generates a lot of data about these stores, and each engineering discipline stores its store data in a variety of files. Some of these are very well formulated tabular documents with unique identifiers and model names and column names at different granularity levels, while others are just concatenated lists of free text parses or terabytes of historical data.
These highly variable sources must be unified into a common system of reference so that wind tunnel tests, mass properties, flight tests, and more can be analyzed automatically. Here, we used Tamr’s machine learning platform to power this unification. The first step in this process is taking the individual columns from each of our input datasets, shown here on the left hand side of the screen, and mapping them to a single unified dataset containing references to the stores from each of the individual data silos, which is shown on the right hand side of the screen.
This results in a large data frame that spans all of the input datasets and contains columns identifying key reference aspects of the store. Things like a common name, a model, or an internal ID number. And this is where Tamr’s machine learning kicks in.
Tamr’s ML platform takes all of the records that have been merged into that unified dataset across all of the individual input datasets and finds pairs of records that seem similar to each other. Each row here represents a pair of records that bear some similarity across their fields. Here, for example, we see AIM-9, AIM-9X. And Tamr asks the user, who is an AFSEO subject matter expert, to identify whether or not these two rows represent a reference to the same real world store, or not.
In addition to that, based on its prior training, Tamr’s machine learning model also makes its own guess as to whether or not they are the same, which you can see in this column right here. And then directly to the left of that, the user can provide their own feedback indicating that these two records, the AIM-9 and the AIM-9X, represent a reference to the same real world store or different ones.
Each column in this dataset represents some feature on which Tamr’s machine learning model is learning to compare these records and as we can see in each column, there is a status bar indicating how similar the two fields are to each other. The user annotates a number of stores, and then applies their feedback, which updates the machine learning model and allows it to reanalyze the records.
Under the hood, each of these features is being analyzed by a powerful machine learning pipeline to learn what similarity looks like for the particular kind of records we have at hand. Using this human in the loop machine learning process, Tamr can overcome the traditional limitations to data mastering without the messy business of crafting unmaintainable regular expression models, and [inaudible 00:29:28] one off ML code.
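To illustrate the pairwise, human-in-the-loop idea described here, the following is a toy sketch, not Tamr's implementation: the store references, features, and classifier are all made up for illustration.

```python
# Not Tamr's implementation -- just a minimal illustration of the pairwise,
# human-in-the-loop matching idea: featurize candidate pairs, let an expert
# label a few, train a classifier, and score the rest. All data is made up.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def features(a: str, b: str) -> list[float]:
    """Simple similarity signals between two store references."""
    a, b = a.upper(), b.upper()
    char_sim = SequenceMatcher(None, a, b).ratio()
    tok_a = set(a.replace("-", " ").split())
    tok_b = set(b.replace("-", " ").split())
    jaccard = len(tok_a & tok_b) / len(tok_a | tok_b)
    return [char_sim, jaccard]

# Candidate pairs plus expert labels: 1 = same real-world store, 0 = different.
labeled_pairs = [
    (("AIM-9X", "AIM 9X BLK II"), 1),
    (("GBU-39", "GBU-39/B SDB"), 1),
    (("AIM-9", "AIM-120"), 0),
    (("GBU-39", "GBU-31"), 0),
]
X = [features(a, b) for (a, b), _ in labeled_pairs]
y = [label for _, label in labeled_pairs]

model = LogisticRegression().fit(X, y)

# Score an unlabeled pair; low-confidence scores go back to the expert.
print(model.predict_proba([features("AIM-9", "AIM-9X")])[0][1])
```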
Once trained with enough feedback, Tamr can use this machine learning to group store references together into clusters. We see on the left hand side of this screen the identity of each cluster, and the number of records within it, and on the right hand side, we see the records within each of those clusters.
For example, if we scroll down here and say, pick this GBU-39, which is a particular air to surface munition, if we look within the GBU-39 cluster, we have GBU-39s with highly variable naming in which the users have concatenated different kinds of information, particular names, messy data of all different sorts, into these records, but using Tamr’s machine learning plus human user input and internal machine learning functions, Tamr’s been able to find that all of these records actually represent references to the same real world store, which is the GBU-39, and merge them together into one cluster.
Similarly, down here, in our AIM-9 cluster, we can see that we’ve got lots of variable spellings, some input datasets in which users concatenated two references to an AIM-9 on the same line, and yet, here still, Tamr has managed to successfully merge these instances together. Previously, if the user wanted to analyze an AIM-9 on a new position in an aircraft, they would’ve had to manually locate each of these records, coming from a handful of different datasets, transform them into a common format, and analyze them. Now, because these records have been successfully merged together and unified by Tamr, we can align the engineering data that’s tied with each record and use it to power computer-driven analysis and accelerate AFSEO’s mission and workflow.
To see how that works, I’m going to hand this back over to Donna.

Donna:
Thank you Aaron. Okay, so now that you’ve seen some of the foundational data processing, now we’re going to walk you through three of the [inaudible 00:31:35] applications that we’ve built on that data foundation.
The first application is a file catalog. We have three problems: too little data accessibility, too little data sharing, and too little data usage. From a data accessibility standpoint, I mean the data is seriously hard to find. As we’ve mentioned before, you’d have to have a PhD in library science and aerospace engineering just to find the right data.
For data sharing, this is the epitome of a stovepipe, where no one dares to ask or look outside their area. There are no standards or governance. Every discipline has their own system and naming conventions and reference files, so it’s like sediment in a lake that’s been layering up historically over time, and not all in one system either.
So this disorganization really limits the data usage because finding the right file, just the right file from 10 years ago for the judgment call that was made in the past is usually too difficult, so each engineer oftentimes is approaching the problem fresh. So for example, with flutter, you get a new request for a new configuration and that leads to 10,000 download configs, and each one of those is simulated. So an engineer may be considering the same problem repeatedly, but they don’t bother to check because they would literally have to be sifting through millions of past simulations. So this problem is crying out for some automation.
Ted will be reviewing this first application’s interface in the next slide.

Ted:
So this is the screen, the UI. It’s a modified version of Banana, and you can see some features right away. There’s a search bar across the top, you’ve got some tags along the left-hand side there, and then the actual files that we have found in this bottom right. What we have here is a sort of Amazon search-and-filter experience. So the way the workflow often goes is something like this.
The user has been given some new variant of the GBU-39, and they want to know whether or not this can be approved by analogy, or whether it needs to be tested in a flight test. So they type GBU-39 into the top bar. The problem is that that gives them too many results, let’s say 500. That’s more than they can look through by hand. So now they want to be able to filter those results down based on these tags. So these tags are really the heart of this data catalog.
We’ve got the tags such as aircraft, the primary store, all of the other stores that were also on the aircraft at the same time, the kind of document it was, who wrote the document, and so forth. Using these tags, I might say, “Well, I just want to see the engineering rationale, and I know that this project we worked on was somewhere in the 2007 range, let me just turn on the 2005 to 2010 filter,” right? Now I’ve taken my list of 500 documents and I’m down to, let’s say 20.
Those 20 documents are already grouped by the project name, and maybe it’s only three or four projects. That’s a reasonable number for a human to look for. They can say, “Oh, based on that, that one project sounds familiar, let me see the engineering rationale for that document,” and aha, now we have found the needle in the haystack, that one document out of the whole data lake that really was the thing that you worked on 15 years ago and want to see again.
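To make that search-and-filter step concrete, here is a hedged sketch of the kind of faceted Solr query that could sit behind it, reusing the same hypothetical fields as the indexing sketch above; the real catalog's fields and query parameters may differ.

```python
# Hedged sketch of the search-and-filter step: start broad with a keyword
# search, then narrow with tag filters until the list is human-sized.
# Collection, field names, and filter values are hypothetical.
import pysolr

solr = pysolr.Solr("http://solr.example.mil:8983/solr/afseo_catalog", timeout=30)

results = solr.search(
    "GBU-39",                                   # the broad keyword search (~500 hits)
    **{
        "fq": [                                 # tag filters that do the narrowing
            'doc_type:"Engineering Rationale"',
            "year:[2005 TO 2010]",
        ],
        "facet": "true",                        # facet counts drive the tag sidebar
        "facet.field": ["project_name", "aircraft"],
        "rows": 50,
    },
)
for doc in results:
    print(doc["project_name"], doc["file_name"])
```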
So as I mentioned, the tags themselves are at the heart of this, and I think that part of what has made this data catalog successful is that we’ve been more ambitious with our metadata tags than most data catalog projects. When I hear metadata, I sort of think of the computer science connotation of this, which is what are the permissions on this file, and who’s the file owner, and when was the last modified date?
But the average aerospace engineer doesn’t care what the file permissions are, that’s not really a relevant piece of metadata for them. They want things like what was the subject of the inquiry. We have a data type tag, which is one of our 30 tag types, and there are 70 possible file types. An engineering memo is totally different from an engineering rationale, which is itself totally different from a cover letter.
Now to an outsider, you might, you know, when I first heard this, I said, “Really? An engineering cover letter is totally different from an engineering rationale?” But to anybody who’s in the office, they say, “Yeah, of course, they’re totally different documents.” So being able to reflect back the differences in the kinds of data they have makes this a useful data catalog to the average AFSEO engineer.
The last piece here is that when you have these ambitious tags, you have to work a bit harder to populate them. So imagine a video file, it just comes in and says, “Wind tunnel test, July fourth, 2019.” It doesn’t tell you what the project name was, it doesn’t tell you what the originator of the request was, or any of these other things that at this point I’ve said we need as tags. And so what we have to do here is not just be satisfied with the data in the file itself, we have to take a couple of hops because that individual video file may not have much information on it that is useful metadata, but there’s some reference file that says, “Oh yes, for this project, on this date, we took this video file.” When you get to that reference file, now you have all of this crucial metadata. So a lot of our metadata is actually tagged on one or two hops away from the original file.
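A minimal sketch of that multi-hop tagging idea might look like the following; the column names, tables, and values are hypothetical, and the real pipeline runs in the data lake rather than in pandas.

```python
# Hedged sketch of the "one or two hops" tagging idea: a raw video file carries
# almost no metadata itself, but a reference table ties it to a project, and the
# project table carries the rest of the tags. Column names are hypothetical.
import pandas as pd

videos = pd.DataFrame([
    {"file": "wt_2019-07-04_cam2.avi", "test_id": "WT-1138"},
])
test_log = pd.DataFrame([                       # hop 1: test ID -> project
    {"test_id": "WT-1138", "project": "GBU-39 recert", "test_date": "2019-07-04"},
])
projects = pd.DataFrame([                       # hop 2: project -> request metadata
    {"project": "GBU-39 recert", "aircraft": "F-16",
     "requestor": "SPO", "primary_store": "GBU-39"},
])

tagged = videos.merge(test_log, on="test_id").merge(projects, on="project")
print(tagged.to_dict(orient="records"))
# The joined row now carries project, aircraft, requestor, and store tags that
# the video file alone never had -- those become its catalog tags.
```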
Okay, so the next thing I want to highlight is sort of an IT detail that I hope the user doesn’t even notice. When they find that needle in the haystack, they find that one file that they really want, they just click on it and the file appears on their desktop. Again, I hope they never think about that, but it actually is a very nice piece of IT, right? Because they didn’t have to go through and say, “Well, let me find the location in the windows file browser and then click through 12 layers of folders to get there.” They didn’t even realize that the data came through HDFS because they don’t even know what HDFS is, and that’s how we want it. We want this experience to be seamless, and it is because Cloudera Hue actually just serves that file up if they have permissions to see it.
And the last piece that I want to highlight here is that this system lives alongside the user’s usual experience, which is that most of the time they don’t interact with files in this way. Most of the time they interact with the files through their H drive, or whatever their Windows Apps drive is, right? And they can create a file there and that file will suddenly appear here in the data catalog, fully tagged, as if they had put it in HDFS, but they don’t know that that happened behind the scenes. They just work in their usual environment, but the data ends up here. All right, back to you Donna.

Donna:
Thanks Ted. So as we’ve seen on previous slides, finding data in an environment this complex is hard. And the data catalog helps, but we can go further. So as we said earlier, a large proportion of our requests are addressed by analogy, so if we can successfully make sense of our datasets, why not go to the next step and find the correct files for the engineer in response to the new request? Once we automate this process, the compute isn’t an issue; we can search through tens of thousands of possibilities very quickly and just spit out the right answer, so that’s what this next application does.
And today this process of providing products for us is very manual. You’ve got engineers in each discipline that have to be consulted. So even if an engineer ends up deciding that this is something very similar to what we’ve done in the past and it isn’t going to require any testing, they still have to sit down and get into the details, and they’ve got a long queue of work. From the point of view of the program managers, even planning is hard, because while most of the work is straightforward, sometimes the engineers come back and say, “Nope, this is actually totally new, and we need to do a flight test, so your schedule is backed up three months.” So the program managers are having to talk to at least eight engineers to take a serious technical look at things before they can even estimate whether a project is going to be three weeks or three months, so the program managers hate this, as you can imagine.
The big picture for us is that by comparing these configurations to the new request, we can automate up to 80% of the requests that come into our office. So by encoding the institutionalized knowledge into the logic of our application, we can power it with 50 years of antecedents and data that eases the workflow processes of the engineer.
The edge cases still go to the engineer, even if the engineer ends up saying that no additional testing is necessary and the configuration is safe. In this way, the application is kind of like an eager junior engineer consulting a senior engineer. It makes the straightforward calls without any input, but in the hard cases, it walks into the senior engineer’s office and says, “Hey, I did a bunch of research, and this is what I found out, can you give me your opinion?” So the logic of this application is incredibly important, as you can imagine, but we’ve managed to get a lot of the senior engineers to put their best ideas onto paper so that the application contains a lot of the wisdom that it took these guys 30 years to distill. And in the next slide, Ted will be reviewing how [inaudible 00:41:21] recommendation engine works in practice.

Ted:
All right. So, as we’ve said, this process gets kicked off by a new SEEK EAGLE request coming in the door. When that happens, it immediately goes into the Tamr data processing layer to do entity resolution, to do natural language processing, and transformations. Entity resolution for the store, for example, actually happens on multiple levels. You’ve got the version of the store, you’ve got the level above that, which is the group of stores, and then you even have family of stores, which are not at all the same, but aerodynamically similar enough that they can be considered analogous in some cases.
Once you have gone through this entity resolution process, transformations, the NLP, to standardize the incoming data, it can then be compared to other standardized data. You have reference data, things like physical properties, moment of inertia, center of gravity, and so forth, and you have historical data, which is all of these past documents that have been approved and all of these flight tests that have been run in the past. Again, the entity resolution and the standardization here is the key thing that allows the other steps to happen. Then you move into the scoring step. So in scoring there are really two different pieces here that are important. One is that sometimes you get a new store in. It’s a new version of the GBU-39 that they just invented and it doesn’t have a perfect antecedent because it just got invented, right? But that doesn’t necessarily mean that you can’t make an analogy. The new version of the GBU-39 has a certain center of mass, it has a certain moment of inertia, and that can be compared to those things that have been approved in the past. That’s what the tolerance check does, and that’s the UI for it.
The second piece here is maybe the store itself is not all that different, but the stores overall on the two wings of the aircraft have been scrambled up in a new way that has never been tried before. And so now you really need to do this configuration level comparison to say how does this configuration compare to other configurations that we’ve approved in the past. So once that logic is done, then there’s a yes or no here. Either you can certify the thing by analogy, in which case we go ahead and produce the publishable documents, the flight limits, here’s where and how high and how fast you can fly, and the engineering rationale with sources cited.
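As a toy illustration of the tolerance-check idea, the sketch below compares a new store's physical properties against an already-approved antecedent. This is not AFSEO's actual certification logic; the property names, values, and threshold are hypothetical.

```python
# Not AFSEO's actual certification logic -- just a toy tolerance check showing
# the shape of the "analogy" comparison: do the new store's mass properties
# fall within some percentage of an already-approved antecedent?
# Thresholds and property names are hypothetical.
APPROVED_ANTECEDENTS = {
    "GBU-39 v1": {"mass_kg": 129.0, "cg_m": 0.91, "moi_kg_m2": 14.2},
}
TOLERANCE = 0.05  # 5% relative difference allowed per property

def within_tolerance(new_store: dict, antecedent: dict, tol: float = TOLERANCE) -> bool:
    return all(
        abs(new_store[k] - antecedent[k]) / abs(antecedent[k]) <= tol
        for k in antecedent
    )

new_request = {"mass_kg": 131.5, "cg_m": 0.93, "moi_kg_m2": 14.6}

matches = [name for name, props in APPROVED_ANTECEDENTS.items()
           if within_tolerance(new_request, props)]
if matches:
    print("Certify by analogy, citing:", matches)   # produce the rationale document
else:
    print("No close antecedent; route to an engineer with the nearest matches.")
```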
Or, we say, “Nope, there are analogies, but they’re not close enough,” right? And that’s okay too. In that case, we don’t just leave the engineer on their own, we say, “Here are the closest documents that we can find, here’s the closest analogies that we’ve found even though they’re probably not close enough, you can use these as a starting point to decide what kind of testing you need to do, and if you need to get a plane in the air to actually try the phenomenon out in realtime.” All right, back to you Donna.

Donna:
Thanks, Ted. So the final method of using historical data that we’ve explored is to build this machine learning model, or models, really, to calculate key engineering values. And in this case we were trying to predict the amount of vibration in the wing of an F-16 under flight conditions, the actual number. And if you could predict this number accurately, then the decision about safe flight conditions is a trivial conclusion, but this problem is hard, it’s really hard, because you have new configurations that lead to tens of thousands of potential download configurations. In other words, what if I fire this missile from the wing tip? Then it’s a whole new configuration.
So we do have finite element software that helps with this, but the results of that software themselves require expert interpretation. And there’s not a lot of testing, right? This is intentional, because flight tests involve pilots flying loop-the-loops in the air at the speed of sound for hours and hours, burning fuel at astronomical rates. So this process is highly manual. So our solution has been a very machine-learning-based approach: let’s take all the simulation and flight test data and throw it into an ML model and see what happens. Tamr has been very instrumental in helping us explore this path. They’ve got it hooked up so that the model predicts flutter at every point, and then it adds it all together into a single recommendation for the flight envelope, which is the safe conditions at which a pilot can fly.
So this helps us run fewer flight tests and feel more confident that we’re making the certification recommendations correctly. Next slide.
So on the left here, you see the flutter predictions coming out of Tamr. On the x-axis is how fast the plane is flying and the y-axis is the altitude. The colors are intuitive, green is good, red is bad. So you can see an actual prediction at a variety of conditions. And this is all from the machine learning classifier. This is a new capability that has never existed for us before.
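To illustrate the idea behind that chart, here is a hedged sketch: a stand-in classifier trained on synthetic data predicts a label at each (Mach, altitude) grid point, and the grid of labels becomes the envelope. None of this is the actual AFSEO/Tamr model, data, or thresholds.

```python
# Hedged sketch of the envelope chart's idea: a classifier trained on past
# simulation/flight-test points predicts a safe/unsafe label at each
# (Mach, altitude) grid point, and the grid of labels becomes the envelope.
# The training data here is synthetic and the model is a stand-in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic history: [mach, altitude_kft] -> 1 if flutter margin was acceptable.
X_hist = rng.uniform([0.5, 5.0], [1.6, 45.0], size=(500, 2))
y_hist = ((X_hist[:, 0] < 1.2) | (X_hist[:, 1] > 25.0)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_hist, y_hist)

# Predict over the whole flight envelope grid.
mach = np.linspace(0.5, 1.6, 23)
alt_kft = np.linspace(5.0, 45.0, 17)
grid = np.array([[m, a] for a in alt_kft for m in mach])
safe_prob = model.predict_proba(grid)[:, 1].reshape(len(alt_kft), len(mach))

# Bucket probabilities into the green/yellow/red picture on the slide.
envelope = np.where(safe_prob > 0.9, "green",
                    np.where(safe_prob > 0.6, "yellow", "red"))
print(envelope[::4, ::4])  # coarse view of the chart
```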
On the right, you see the results from software called DataRobot that the Tamr team used to help automate the tuning of the machine learning classifier. Basically DataRobot is cycling through thousands of possible classifier and meta-parameter settings to try and choose the best model. From the point of view of the CDO, what I found very gratifying about DataRobot was that it had an HDFS Cloudera plug-in so my data didn’t have to go anywhere, I just plugged it in and they were ready to use the data and the compute that I had provisioned. And this really is the vision of the data lake. We didn’t choose the data lake architecture with DataRobot in mind, it was a late addition, and that’s actually worked out quite fine.
I want data scientists and aerospace engineers to be able to use this data lake to quickly solve problems using techniques that they haven’t even thought of yet. The fact that we were automating our automation is just amazing to me. I want this to be a foundation of the future and the fact that we’re already finding new ways in the first couple of years is very gratifying. Next slide.
So, I mean, the bottom line is that we built this Tamr and Cloudera data pipeline on state-of-the-art Dell EMC storage that not only reveals and organizes our data, it actually uses our data, allowing us to bring the messy, disorganized legacy data to bear on better decisions for the future. You know, Tamr helped bring everything together, organize our data swamp, as we called it for years, and we built some awesome ML-driven applications on top. So we’ve been able to reduce the onboarding time for our new engineers. We are able to be more productive. It’s saving us a lot of money, because for us it’s not about profit, it’s about hours; it’s about how much effort we are going to have to use to answer a new request. So we did it. AFSEO has a data lake architecture that provides the capability to leverage our 59 years’ worth of engineering data as a strategic asset, so I’m very thankful for this team and for our organization for supporting us in this initiative. Thank you so much.