On May 25th, 2018, the General Data Protection Regulation (GDPR) will come into effect throughout the European Union.
The GDPR poses several challenges to businesses, as successful adoption means (amongst many other things) that customer records must be actively managed. This implies that customer records must be traceable and purgeable.
A customer may request an extract of the data that the business holds about her. A customer may also instruct the business to delete her data records. After a customer record has been purged, no residual information should remain.
The type of IT infrastructure or application landscape within a business can have a strong influence on the difficulty (and therefore cost) of becoming compliant with these two aspects.
For illustrative purposes, let’s distinguish between three different scenarios: a business utilizing a monolithic application landscape, a business with a Service Oriented Architecture (SOA) application landscape and a business using an event-driven application landscape. This distinction is overly simplistic, and in reality businesses often have a blend of all three. However, this categorization serves well for the underlying thought experiment.
Each of these landscapes has distinct characteristics as to where customer data has spread across the system, how accessible it is on demand and what the cost is of accessing individual records.
Scenario A: Monolithic Application Landscape
Here, we have one big application running all of the business logic, and typically we have a relational database catering for all of the business data persistence needs. If the business deals with a lot of full-text content, we may also have a search engine that stores data.
Typically, these two data sinks would be the candidates where customer data is materialized and persisted (data at rest, e.g. in a customer-details table and index) before it is retrieved by the various subsystems of the monolith as part of the processing.
Logs would be written by each individual subsystem, and customer data may have spread into those.
A monolithic application is typically deployed on a single server with limited interfaces on its boundary (we can call these seams), which makes the enumeration of areas where customer data may be located a straightforward exercise.
Records in tables of a relational database and a search index are easy to access for CRUD operations, and log files have a well-defined shelf-life.
If a customer requests an export of the personal information that is stored about her, a bunch of SQL (or other structured) queries will achieve the result. A similar bunch of queries can be utilized to purge the customer’s records from each of the data sources.
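To make this concrete, here is a minimal sketch of such export and purge queries, using Python's built-in sqlite3 module. The table names, column names and the `CUSTOMER_TABLES` map are assumptions made up for illustration; a real monolith would list its own tables here.

```python
import sqlite3

# Hypothetical map of tables holding customer data -> column identifying the customer.
CUSTOMER_TABLES = {
    "customer_details": "id",
    "orders": "customer_id",
}

def export_customer(conn, customer_id):
    """Return every row that references the customer, keyed by table name."""
    extract = {}
    for table, key in CUSTOMER_TABLES.items():
        cur = conn.execute(f"SELECT * FROM {table} WHERE {key} = ?", (customer_id,))
        columns = [c[0] for c in cur.description]
        extract[table] = [dict(zip(columns, row)) for row in cur.fetchall()]
    return extract

def purge_customer(conn, customer_id):
    """Delete every row that references the customer, in a single transaction."""
    with conn:  # commits on success, rolls back on error
        for table, key in CUSTOMER_TABLES.items():
            conn.execute(f"DELETE FROM {table} WHERE {key} = ?", (customer_id,))

def demo_connection():
    """Build a small in-memory database matching the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_details (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT)")
    conn.executemany(
        "INSERT INTO customer_details VALUES (?, ?, ?)",
        [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, "book"), (11, 2, "pen")],
    )
    conn.commit()
    return conn
```

Keeping the list of tables in one place means the export and the purge cannot drift apart: a table added to the map is covered by both operations.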
Log files are easily searchable and modifiable using standard tools provided by the operating system, e.g. grep, sed.
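Where this should become part of the application rather than a manual shell session, the same grep-and-sed idea can be sketched in a few lines of Python. The function names and the `[REDACTED]` placeholder are assumptions for illustration, and matching on a literal identifier such as an email address is a simplification.

```python
import re

def find_lines(lines, identifier):
    """grep-like: return (line_number, line) pairs that mention the identifier."""
    return [(n, line) for n, line in enumerate(lines, start=1) if identifier in line]

def redact_lines(lines, identifier, placeholder="[REDACTED]"):
    """sed-like: replace every literal occurrence of the identifier."""
    pattern = re.compile(re.escape(identifier))  # escape so '.' and '@' match literally
    return [pattern.sub(placeholder, line) for line in lines]
```

`find_lines` supports the export request, `redact_lines` the purge request. Applied to a real log file, the redacted lines would be written back in place of the original file.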
These processes can become part of the application, to be executed on behalf of the customer. The complexity of implementing this functionality is similar to the complexity of the application landscape.
Scenario B: Application Landscape based on Service Oriented Architecture
SOA is a common pattern across large enterprise applications. The enterprise IT infrastructure has grown over time, typically using enterprise-grade components such as application containers, message queues and a number of different databases.
While the overall picture is often complex and confusing, enterprises often have procedures in place that prevent sensitive customer data from leaking uncontrolled into parts of the system.
The first important step is to have a clear understanding of the data landscape: where customer records are persisted (data at rest) and where in the system customer data flows (data in transit).
Enterprise Software Architects and GDBAs ensure data governance at an appropriate level. Sensitive data may have been spread across a number of databases, but generally would have been subject to strong data-governance.
As enterprises typically have substantial resources available, satisfying GDPR requirements is a question of allocating enough of those resources to put the necessary procedures and processes in place.
Scenario C: Microservices in a distributed Event-driven Application Landscape
In an ideal microservices architecture, an event stream constitutes the data transport layer, thereby greatly increasing the amount of data in transit. In addition, each microservice may persist data in its own local database, leading to a further distribution of potentially sensitive data at rest.
Microservice architectures are popular among emerging software start-up companies, and they are very useful to encapsulate business responsibilities. They also require stringent data governance and skillful engineering.
Initially clean and tidy implementations tend to become unwieldy and difficult to manage once the number of microservices outgrows the number of engineers capable of assuming ownership of a service.
In the early stages of a software start-up (the first 2-3 years), it is quite reasonable to expect the number of microservices to outgrow the number of dedicated developers. If data governance has not been established by then, a microservices architecture, for all its merits, will pose significant challenges for day-to-day engineering, and GDPR will only add to that.
Data at rest is distributed all over the application, resulting in multiple entry points should a customer instruct the deletion of her records.
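One way to tame these multiple entry points is a single coordinator that fans an erasure request out to every service owning customer data. The sketch below is hypothetical throughout: `ErasureCoordinator`, the service names and the in-memory dictionaries stand in for real services and their local databases, where each purger would be a network call instead.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for each microservice's local database.
billing_db: Dict[str, List[str]] = {"alice": ["invoice-1", "invoice-2"], "bob": ["invoice-3"]}
shipping_db: Dict[str, List[str]] = {"alice": ["parcel-9"]}

def make_purger(store: Dict[str, List[str]]) -> Callable[[str], int]:
    """Build a purge function for one store; it returns the number of records removed."""
    def purge(customer_id: str) -> int:
        return len(store.pop(customer_id, []))
    return purge

class ErasureCoordinator:
    """Single entry point that forwards an erasure request to every registered service."""

    def __init__(self) -> None:
        self._purgers: Dict[str, Callable[[str], int]] = {}

    def register(self, service: str, purger: Callable[[str], int]) -> None:
        self._purgers[service] = purger

    def erase(self, customer_id: str) -> Dict[str, int]:
        # The per-service report doubles as evidence that the request was handled.
        return {service: purger(customer_id) for service, purger in self._purgers.items()}
```

The registry is only as good as its coverage: a service that persists customer data but never registers a purger is exactly the governance gap described above.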
Data in transit is plentiful and not randomly accessible. In event-driven architectures, message queues act as the data transport between different microservices, but individual messages cannot easily be accessed from a queue. Once a message containing sensitive customer information has found its way into a message queue, the message will stay there until its retention period has lapsed. Should a new microservice join the architecture, and start consuming from that queue, the sensitive customer information will resurface at the consumer.
As a result, retention periods for data in transit need to be chosen wisely in any event-driven architecture, so they can be utilized as a compliance instrument for GDPR. Otherwise, complex and expensive processing logic will have to be deployed across the application landscape to ensure that resurfacing sensitive customer data does not pose a risk of violating GDPR policy.
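The effect of a retention period on resurfacing data can be illustrated with a toy model of a retained message log. This is a simplified sketch and not the API of any real message broker: `RetainedLog`, its methods and the explicit `now` parameter are assumptions chosen to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    timestamp: float  # seconds, on whatever clock the caller uses
    payload: str

class RetainedLog:
    """Toy append-only log that drops messages once they reach the retention period."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._messages: List[Message] = []

    def append(self, payload: str, now: float) -> None:
        self._messages.append(Message(now, payload))

    def expire(self, now: float) -> None:
        # Keep only messages younger than the retention period.
        self._messages = [
            m for m in self._messages if now - m.timestamp < self.retention_seconds
        ]

    def read_from_start(self, now: float) -> List[str]:
        """What a newly joined consumer sees when it starts from the beginning."""
        self.expire(now)
        return [m.payload for m in self._messages]
```

A consumer that joins while a message is still retained receives it, sensitive content included; a consumer that joins after the retention period has lapsed never sees it. The retention period is therefore an upper bound on how long purged customer data can resurface from the transport layer.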
In all scenarios, further customer data may have leaked into Excel spreadsheets and as such would be lingering in reports as historical data on someone’s desktop or SharePoint.
One of the mantras that is often heard when discussing GDPR is ‘Know your data’, which becomes increasingly difficult in SOA and Microservices architectures, and which I will pick up in a follow-up post.
I am interested to hear about the challenges that you and your business are facing to ensure compliance with GDPR.
- Do you know your data, especially your customer data?
- How difficult was it to put GDPR procedures in place?
- Were you able to automate the processes that needed to be introduced?
I am looking forward to your replies, thank you for reading and until next time.