With that in mind, my time at Two Six Labs has taught me many valuable (but humbling) lessons. The most consistent lesson is that of never-ending education; I had, and still have, a great deal to learn. With my one-year mark now in the rearview mirror, this is as good a time as any to reflect on my growth.
In an effort to avoid rambling about endless ideas and lessons, I’ll keep this discussion in the practical context of the first “full system” I had the opportunity to develop on this project: a data ETL (extract, transform, load) pipeline, written in Java, using Spring Integration.
This pipeline takes real world event data from the GDELT (Global Database of Events, Language, and Tone) Global Knowledge Graph, maps it to our project’s OWL ontologies, and inserts the resulting data into Virtuoso, our triple store database.
As our team’s position on this project requires extensive knowledge of triple store scaling, we built the pipeline to generate substantial amounts of data to fully test its limits. In the process of testing, we discovered bugs and limitations within Virtuoso, while developing a richer understanding of its underlying components and capabilities. My work on this ETL pipeline, and the lessons contained within, can be broken down into the three pieces of ETL: extraction, transformation, and loading.
The first step in any ETL pipeline is extraction: in this case, parsing the GDELT GKG (Global Knowledge Graph) datasets into meaningful RDF (Resource Description Framework) triples (more on “meaningful” later).
The GKG is a vast dataset which pulls from a large collection of news sources, encompassing a variety of events (e.g. criminal or political actions) with substantial context (such as location, actors, counts). The raw data is contained within a surprisingly unconventional CSV with an accompanying fifteen-page codebook.
By “unconventional”, I mean that this “CSV” utilizes field delimiters ranging from the expected tabs and commas to semicolons, colons, pound signs, and even pipe characters. Once each tab-delimited section has been separated, you end up with fields such as V2.1COUNTS. Take this excerpt for example:
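The excerpt below is an illustrative reconstruction of a V2.1COUNTS value; the values are invented to match the event examined later in this post, and the exact field layout is defined by the codebook (real records are messier):

```
WOUND#4#soldiers#1#Afghanistan#AF#AF#33#66#AF#1527;KILL#2#civilians#1#Afghanistan#AF#AF#33#66#AF#2214
```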
The line above denotes two blocks separated by a semicolon (“;”), with each sub-field identified by a hash symbol (“#”). This data would be useless without a descriptive codebook specifying each delimiter and field. Luckily, we had one readily available. After some data mangling, we ended up with a more useful representation of the data in the form of two Java objects:
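The actual classes aren’t reproduced here, so below is a minimal sketch of what those two objects might look like; the class and field names are assumptions drawn from the codebook’s field descriptions, and the real implementation maps many more fields:

```java
// Hypothetical sketch of the two parsed objects; names are illustrative
// assumptions based on the GKG codebook, not the project's real classes.
class Location {
    private final String fullName;
    private final String countryCode;
    private final double latitude;
    private final double longitude;

    Location(String fullName, String countryCode, double latitude, double longitude) {
        this.fullName = fullName;
        this.countryCode = countryCode;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    String getFullName() { return fullName; }
    String getCountryCode() { return countryCode; }
    double getLatitude() { return latitude; }
    double getLongitude() { return longitude; }
}

class Count {
    private final String countType;   // e.g. "WOUND"
    private final int count;          // e.g. 4
    private final String objectType;  // e.g. "soldiers"
    private final Location location;

    Count(String countType, int count, String objectType, Location location) {
        this.countType = countType;
        this.count = count;
        this.objectType = objectType;
        this.location = location;
    }

    String getCountType() { return countType; }
    int getCount() { return count; }
    String getObjectType() { return objectType; }
    Location getLocation() { return location; }
}
```

Immutable fields with plain getters keep the parsed data read-only once constructed.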
Abstracting the implementation of this translation would be trivial if each field didn’t employ unique values and delimiters. Instead of semicolons and pound symbols, other fields, such as V2GCAM, use comma-delimited (“,”) blocks with colon-delimited (“:”) key/value pairs, and fields like V2.1QUOTATIONS stray even further, utilizing pound-delimited (“#”) blocks with pipe-delimited (“|”) fields.
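Conceptually, parsing one of these fields boils down to nested splits, one per delimiter level. A minimal sketch (the real parser handles every field type and its quirks):

```java
class CountsFieldParser {
    // Split a V2.1COUNTS-style field into blocks (";") and sub-fields ("#").
    static String[][] parse(String field) {
        String[] blocks = field.split(";");
        String[][] parsed = new String[blocks.length][];
        for (int i = 0; i < blocks.length; i++) {
            // The -1 limit keeps empty trailing sub-fields instead of dropping them.
            parsed[i] = blocks[i].split("#", -1);
        }
        return parsed;
    }
}
```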
As I read the codebook and researched CSV parsing libraries, it was clear that an existing solution wouldn’t cut it. Setting out to build a custom GKG “CSV” parser, I made my first mistake: not properly employing Java types.
As mentioned earlier, I come from a JS and Python background, neither of which is a statically typed language. This project, however, was being written in Java, a heavily object-oriented and strictly typed language (ignoring the type inference in the recently released Java 10). This led to the first iteration of my parser using maps, lists, and nested combinations of both to store the data.
The variety of types in the GKG meant I often defaulted to using a generic Object in these list and map structures. Because of this, when it came time to use the parsed data, I quickly realized the error of my ways. I had to “type cast”, or manually tell Java which type each object was, just to retrieve an item from a list or a value from a map.
This was far from ideal, and a better solution had to be found. After speaking with Karl, our resident expert on Java and all things software engineering, I took the more object-oriented approach of creating classes for each nested data type. Not only was the data syntactically easier to work with, but each field was mapped to the exact type of data it contained. Accessing the location in which a count took place now looked like this:
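The original snippet isn’t reproduced here, so the contrast below is a sketch; the variable and accessor names are illustrative:

```java
// Before: generic Object maps, with an unchecked cast at every access.
String fullName = (String) ((Map<String, Object>) countMap.get("location")).get("fullName");

// After: typed classes make the same access a short, self-documenting chain.
String fullName = count.getLocation().getFullName();
```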
Not only is this shorter and easier to read, it is also far faster to compose with code autocompletion on hand.
Having learned my lesson in object-oriented programming, I moved on to transforming the data to align with our project’s ontologies. An ontology, for those as unfamiliar as I was, is a model describing the object types and individuals in a domain, and the relationships between them.
For instance, if I wanted to store data from Hollywood articles relating to Steve McQueen, I would need a way to differentiate between Steve McQueen, the coolest actor of the ’60s, and Steve McQueen, the modern Oscar-winning director. To create this distinction, I could uniquely identify each Steve McQueen within an ontology. Using an IRI (as dictated by OWL), I would create the identifiers “http://my.ontology#SteveMcQueenActor” and “http://my.ontology#SteveMcQueenDirector” to refer to each individual in the data.
Continuing with the previous V2.1COUNTS scenario, this excerpt:
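The original RDF isn’t reproduced here verbatim, so the Turtle below is an illustrative rendering: the ontology namespaces (example.org) and property names are placeholders, while the individuals follow the identifiers discussed in this post:

```turtle
@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:       <http://www.w3.org/2001/XMLSchema#> .
@prefix geo:       <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix event:     <http://example.org/ontology/event#> .
@prefix gkg-event: <http://example.org/data/gkg-event#> .
@prefix gkg-loc:   <http://example.org/data/gkg-location#> .

gkg-event:20150218230000-13-1
    rdf:type            event:WoundingEvent ;
    event:affectedCount "4"^^xsd:integer ;
    event:affectedType  "soldiers" ;
    event:location      gkg-loc:Afghanistan .

gkg-loc:Afghanistan
    rdf:type      event:Location ;
    event:isoCode "AF" ;
    rdfs:label    "Afghanistan" ;
    geo:lat       "33.0"^^xsd:decimal ;
    geo:long      "66.0"^^xsd:decimal .
```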
This RDF fragment describes gkg-event:20150218230000-13-1 as a wounding event located in Afghanistan with a group of 4 soldiers affected. Furthermore, Afghanistan is described as a location with an ISO code, label, latitude, and longitude. The gkg-event: prefix, along with all other prefixes preceding a colon, is simply shorthand notation for a namespace IRI (to avoid naming conflicts across ontologies).
The most common prefix, “rdf”, equates to http://www.w3.org/1999/02/22-rdf-syntax-ns#. Note that we could have opted to use the geonames database instead of defining our own Afghanistan individual. For illustration purposes, however, constructing our own represents the GKG data more thoroughly.
A fair amount of information can be garnered from a terse 50-character snippet of a single field. However, if the contextual properties are taken into account, a more complete picture begins to form. Using this additional information from the GDELT record, the following triples can be added:
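The exact triples aren’t reproduced here either, so the Turtle below is an illustrative sketch; the data-prov namespace IRI, the document URL, and the character offset value are placeholders, while the identifiers and the data-prov:character_offset property follow the discussion in this post:

```turtle
@prefix xsd:        <http://www.w3.org/2001/XMLSchema#> .
@prefix data-prov:  <http://example.org/ontology/data-prov#> .
@prefix gkg-event:  <http://example.org/data/gkg-event#> .
@prefix article:    <http://example.org/data/article#> .
@prefix gkg-record: <http://example.org/data/gkg-record#> .

gkg-event:20150218230000-13-1
    data-prov:sourced_from article:20150218230000-13 .

article:20150218230000-13
    data-prov:sourced_from     gkg-record:20150218230000-13 ;
    data-prov:document_iri     <http://example.com/news/afghanistan-attack> ;
    data-prov:date             "2015-02-18T23:00:00Z"^^xsd:dateTime ;
    data-prov:character_offset "1527"^^xsd:integer .
```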
As may be gleaned from the descriptive properties, this RDF supplements the information above with the V2.1DATE and V2DOCUMENTIDENTIFIER fields. This auxiliary knowledge provides us with an event source and associated time. For instance, the wounding event now defines an explicit source of article:20150218230000-13, which itself is sourced from the GDELT CSV at line 13 (gkg-record:20150218230000-13). Using the data-prov:character_offset property, we can follow the chain of provenance through to the web page and exact point in the text where the information was sourced:
“…mortally wounding four soldiers and catastrophically injuring several others.”
In actuality, a “mortal wounding” would imply death. Given this information, we could add the more precise event type of MurderOrHomicide. However, the original level of accuracy is adequate for our use case. Although seemingly meaningless at first glance, through proper identification and transformation, this data has become usable not only quantitatively, but qualitatively as well.
While learning exactly what an ontology is and how to go about mapping the extracted GKG data to our project ontology, I discovered the utility of truly understanding your data. Instead of simply slapping values on fields, actually breaking down the data, looking for patterns, and mapping to useful identifiers make the data vastly more practical for real world use cases.
In short, data isn’t simply the information that gets passed through the application, it’s useful and revealing in and of itself. I’d venture to say you could create an entire field of study around it ;). This may be common knowledge to any experienced data scientist, but as my first venture into this territory, it was new, exciting, and revealed the extent of learning I still have to experience within the data science field.
After the basic pipeline for extracting and transforming the data was in place, I was tasked with loading said data into Virtuoso, our triple store database for this project. This task didn’t go so smoothly the first few tries…
The insert rate of the first run started out slow; 400 individual inserts per second slow. Seeing this, I realized something had to be horribly wrong with the code. After my initial investigation yielded no results, I once again consulted our New Jersey software development connoisseur, Karl. This time, Karl had a look at the issue and promptly recommended using VisualVM to diagnose the problem. So I fired up VisualVM and Karl showed me around. The CPU sampler was what I needed for this task.
After starting the ETL pipeline, VisualVM allowed me to take a sample of the running threads and see exactly which function was occupying the largest share of CPU cycles. In this scenario, loading my statements using RDF4J was creating a new connection to Virtuoso for each and every insert. Enlightened to the whereabouts of the issue, I was able to locate and examine the offending code directly.
Consequently, a change was put in place to reuse the same connection with each insert via a pooled connection to the DB. This modification dramatically increased the insert rate from roughly 400 to 5,000 statements per second.
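A sketch of the change, using RDF4J’s repository API (the pooling details are simplified and the method names here are illustrative):

```java
// Before: open (and tear down) a fresh connection for every statement.
void insertSlow(Repository repository, Statement statement) {
    try (RepositoryConnection connection = repository.getConnection()) {
        connection.add(statement);
    }
}

// After: reuse one long-lived, pooled connection across all inserts.
void insertFast(RepositoryConnection pooledConnection, Statement statement) {
    pooledConnection.add(statement);
}
```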
But the optimization wasn’t finished there. Each RDF statement was still being inserted into the database as an inefficient individual operation. To remedy this, I incorporated an aggregator provided by Spring Integration. The code looked something like this:
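The flow isn’t shown in full here, so this is a sketch of the aggregator step using Spring Integration’s Java DSL; the channel and handler names are assumptions:

```java
@Bean
public IntegrationFlow ingestFlow() {
    return IntegrationFlows
            .from("statementChannel")
            // Queue statements until 200 have accumulated, then release
            // the whole group downstream as a single bulk insert.
            .aggregate(a -> a.releaseStrategy(group -> group.size() >= 200))
            .handle("virtuosoLoader", "bulkInsert")
            .get();
}
```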
This configuration instructs the aggregator to hold statements in a queue until it contains 200 statements, then to pass all the queued statements down the pipeline for a bulk insert into Virtuoso. Rinse and repeat. Simple enough, right? Well, as expected, initiating the pipeline with my fancy new aggregator revealed a dramatically increased ingest rate of 15K statements per second.
This worked great, for about five minutes, before it slowed down to a crawl. As I began debugging in VisualVM, I noticed the memory steadily increasing over time, to the point where garbage collection would continually trigger and nearly grind the pipeline to a halt.
CPU usage spikes from non-stop GC (displayed on the left) caused by ever-growing memory (showcased on the right)
What could trigger such a catastrophic leak? Taking memory samples in VisualVM revealed that NodeHash arrays were the culprit, but what piece of code could be generating all these objects? Much furious googling, stack overflowing, and Spring Integration example searching later, I still wasn’t quite sure why memory was filling up with these objects. So what was left to try? Selling my soul and using a debugger.
I placed a few breakpoints around the code and began debugging. For the first couple tries, the code appeared to be operating properly. But after a bit more time, a few strategically placed breakpoints, and taking advantage of IntelliJ’s conditional breakpoints, I was able to nail down the issue.
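The fix itself isn’t reproduced verbatim, so here is a sketch of the adjusted aggregator spec, a fragment of the flow definition (names remain assumptions):

```java
.aggregate(a -> a
        // Correlate by payload class rather than the correlation ID
        // that split() stamped on each message group.
        .correlationStrategy(message -> message.getPayload().getClass())
        .releaseStrategy(group -> group.size() >= 200))
```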
The fix was a small change to the aggregator’s correlation strategy. I had been taking advantage of Spring Integration’s “split” functionality earlier in the pipeline, which assigns “correlation IDs” to every group of messages it processes.
However, these “correlation IDs” are random and unique for each group, and there was no guarantee these groups would be exactly 200 messages in size (in fact, there is good reason for them not to be).
When I instructed the aggregator to group these messages until 200 were queued, the aggregator would group them by correlation ID. This led to a group of, say, 250 messages with the same correlation ID, releasing the first 200 and leaving a group of 50 behind. These 50 leftover messages would never reach the 200-count quota necessary to flow through the pipeline.
While the pipeline was running, incomplete message groups piled up, quickly filling memory to the brim. To fix the issue, the correlation strategy had to be explicitly instructed to group messages by payload class instead of by correlation ID.
So what have I learned? Well, in my time here at Two Six Labs, I’ve learned a bit about object-oriented programming and where to create types. The qualities of data itself and its ever-increasing usefulness have been continuously revealed throughout this project. I’ve learned new ways of debugging code via the use of a profiler. And finally, I’ve embraced using a fully integrated development environment and symbolic debugger.
Along the way, I’ve also learned various lessons about working in a team. The most important, I reckon, is how each individual’s skills and talents are especially invaluable when they are shared freely and effectively with others. In my case, I have Karl to thank for many of these lessons and many more not listed here. Of course, this post wouldn’t have happened without Mike Orr’s encouragement to share my experiences, so a tremendous thank you must be made there as well.
My hope is that in some way or another, you have been enlightened through my encounters with these lessons. Since my time spent creating this ETL pipeline, I’ve had the opportunity to work on many other enriching problems and projects that hold potential for a more technically in-depth review. As for the future, I am optimistically looking forward to the challenges, experiences, and lessons here at Two Six Labs.
Two Six Labs pushes the boundaries of the possible to protect the future. We design innovative solutions to complex challenges in data, cyber, IoT and beyond. We empower our clients’ critical missions, expanding operational capabilities and bringing new technologies to market.