The end of the snapshot for people analytics: Workday edition

I recently made a simple post on LinkedIn which received a crazy amount of views and overwhelmed us with requests to take a look at what we had built. The simple release was that we had managed to take workday's point in time (snapshot) based reporting and rebuild a data schema that is effective dated, and transactional in nature. The vast majority of organizations and people analytics vendors use snapshots for extracting data from workday because this is really the only choice they've been given to access the data.  

We don't like snapshots for several reasons

  • They are inaccurate - you will be typically miss out on the changes occurring between snapshots, this makes it impossible to track data/attribute changes in between, to pro rate, and create analysis any deeper than the snapshot's time context.  
  • They are inflexible - an object or time context has already been applied to the data which you can't change without replacing the entire data set with a new context.
  • They don't allow for changes - If data is corrected or changed in history you need to replace the entire data set, urggh.   
  • External data is difficult to connect - without effective dating joining in any external data means you have to assume a connection point and apply that time context's values to the external data set, see inaccurate.  This compounds the inaccuracy problem if you end up having to snapshot the external data as well.  
  • A pain in the #$% - to pull snapshots from workday now you need to create the report for each snapshot period that you need to provide.   3 years of data with a month end snapshot, that's 36 reports to build and maintain.  

With our background in working with raw data directly from HR systems this approach wasn't going to cut the mustard and couldn't deliver the accuracy that should be the basis of an HR data strategy.  

The solution is not to buy workday's big data tools because you're going to be living with many of the same challenges.  You need to take the existing structure, enhance, and fundamentally reconstruct a data architecture that solves these problems.  We do just that, we extract all employee and object data, analyse the data as it flows and generate additional requests to the workday api that work through the history of each object.  Data is materialized into a schema close to the original but has additional effective dated transactional records that you just wouldn't see in a snapshot based schema.  This becomes our raw data input into One Model or could be with other toolsets, delivered to your own warehouses or used any way you wish.     

The result is perfect for delivering accurate, flexible reporting and analytics.  The final structure is actually closer to what you would see with a traditional relational schema used by the HRIS sold by SAP, Oracle, PeopleSoft etc.   Say what you will about the interfaces of these systems but for the most part the way that they manage data is better suited for reporting and analytics.

Now don't get me wrong this is one area most people know Workday lags in, and in my opinion it should be a low priority decision point when selecting an HRIS.  Don't compromise the transactional fit to your business of the HRIS for reporting and analytics capability because ultimately you will be disappointed.   Choose the system that fits how your business operates.   

Time to get a little more technical.  What i'm going to discuss below is the original availability format of data in comparison to the approach we take at One Model.

Object Oriented - the why of the snapshot

Ok so we all know that Workday employs an Object Oriented approach to storing data which is impressively effective for it's transactional use case.  It's also quite good at being able to store the historical states of the object.  You can see what i mean by taking a look at the API references as below

note the time references

note the time references

The above means the history itself is there but the native format for access is a snapshot at a specific point in time.  We need to find a way of accessing this history and making the data useful for more advanced reporting and analytics.  

Time Context

In providing a point in time we are applying a time context to the data at the point of extraction.  This context is then static and will never change unless you replace the data set with a different time context.  Snapshot extractions are simply a collection of records with a time context applied.  Often when extracting for analytics, companies will take a snapshot at the end of each month for each person or object.  We get a result set similar to the below.

snapshot data example - end of month

snapshot data example - end of month

The above is a simple approach but will miss out on the changes that occur between snapshot because they're effectively hidden and ignored.  When connecting external data sets that are properly effective dated you need to make a decision on which snapshot is accurate to report against in the above an you simply don't have enough information available to make this connection correct. This is just simply an inaccurate representation of what is really occurring in the data set, it's terrible for pro-rating calculations to departments or cost centers and even something like basic like an average headcount is severely limited.  Close enough is not good enough.

If you are not starting out with a basis of accuracy then everything you do downstream has the potential to be compromised.

Remove the context of time

  There's a better way to represent data for reporting and analytics.  

  1. Connect transactional events into the timeline
  2. Extract the details associated with these events
  3. Collapse the record set to provide an effective dated set of records.
effective dating and collapsing records to match the transactional events

effective dating and collapsing records to match the transactional events

The above distills down the number of records to only that which is needed and matches transactional and other object changes which means you can join to the data set at the correct point in time rather than approximating.

Time becomes a flexible concept

This change requires that you apply a time context at query time providing infinite flexibility for aligning data with different time constructs like the below

  • Calendar
  • Fiscal
  • Pay Periods
  • Weeks
  • Any time construct you can think of

It's a simple enough join to create the linkage

left outer join timeperiods tp 
on between employee.effective_date and employee.end_date

We are joining at the day level here which gives us the most flexibility and accuracy but will absolutely explode the number of records used in calculations into the millions and potentially billions of intersections.  For us at One Model accuracy is a worthwhile trade off and the volume of data can be dealt with using clever query construction and of course some heavy compute power.  We recently moved to a Graphics Processing Unit (GPU) powered database because really why would you have dozens of compute cores when you can have thousands (as a side note it also allows us to run R and Python directly in the warehouse #realtimedatascience).   More on this in a future post but for a quick comparison take a look at the Mythbusters demonstration

What about other objects?

We also apply the same approach to the related objects within workday so that we're building a historical effective dated representation over time.  Not all objects support this so there are some alternative methods for building history.

Retroactive changes?

Data changes and corrections occur all the time as we know we regularly see volumes of changes being most active in the last six months and can occur several years in the past.  Snapshots will often ignore these changes unless you replace the complete data set each load.  The smarter way is to identify changes and replace only the data that is affected.  I.e. replace all historical data for a person who has had a retroactive change.  This approach facilitates a changes only feed and can get you close to a near-real time data set.  I say close to near-real time because the Workday API is quite slow so speed will differ depending on the amount of changes occurring.   

Ok so how do you accomplish this magic?

We have built our own integration software specifically for Workday that accomplishes all of the above.  

  1. Extracts all object data and for each
  2. Evaluates the data flow and identifies where additional requests are needed to extract historical data at a different time context
  3. Merge these records, collapse them, and effective date each record
  4. We now have an effective dated historical extract of each object sourced from the Workday API.  This is considered the raw input source into One Model, it is highly normalized and enormous in its scope with most customers having 300+ tables extracted.  The pattern in the below image is a representation of each object coming through, you can individually select the object slice itself
each green segment is an object sourced from workday.  the red slice shows one of those objects has suffered an error. Rarely do you see patterns like this illustrating the sheer size of the workday schema

each green segment is an object sourced from workday.  the red slice shows one of those objects has suffered an error. Rarely do you see patterns like this illustrating the sheer size of the workday schema

5. The One Model modelling and calculation engines take over to make sense of the highly normalized schema, connect in any other data sources available, and deliver a cohesive data warehouse built specifically for HR data.

One model has pre-built but fully configurable data models for known sources like workday, Successfactors, sap, oracle, etc that allows the data and it's behavioral interactions with other data to be contextually understood by the calculation engines.  

One model has pre-built but fully configurable data models for known sources like workday, Successfactors, sap, oracle, etc that allows the data and it's behavioral interactions with other data to be contextually understood by the calculation engines.  

6.  Data is available in our toolsets or you have the option to plug in your own software like tableau, SAS, etc

7. Up and running in a few days, all we need is a set of authorized api credentials with access provided to the objects you'd like us to access.   

Flexibility - the biggest advantage you now have

We now have virtually all data extracted from workday in a historically accurate transaction based format that is perfect for integrating additional data sources or generating an output with any desired time context (convert back to snapshots if you require).  Workday reporting and analytics starts with having a data strategy for overcoming the inherent limitations of a native architecture that is just not built for this purpose. 

We're HR data experts and we do this all day long, if you would like to take a look please feel free to contact us or book some time to talk directly below.