In 2017, The Economist wrote, “The world’s most valuable resource is no longer oil but data.” It was a catchphrase picked up by several other mainstream publications, but the idea had been around for well over a decade. As far back as 2006, British mathematician and data science entrepreneur Clive Humby had said, “Data is the new oil.”
In 2020, the International Data Corporation (IDC) estimated that more than 59 zettabytes (ZB) of data would be produced, acquired, duplicated, and consumed globally. It also predicted that the data produced over the next three years would surpass that of the previous thirty years!
These projections are likely conservative given the shift to the virtual that the pandemic entailed. The question today is not how to collect data but how to optimize it to provide valuable insights and timely action. Because ultimately, data is as data does.
It is in this context that data engineering has become so relevant. By enabling organizations to optimize data for usability, data engineering touches various stages of data, right from creation to movement, validation, curation, storage, and, in some cases, data retention. And as data engineering projects become more prominent and complex, testing is increasingly essential to keep pace.
The role of data testing
Testing works on the premise that, like software, data is vital to the success of your enterprise. Therefore, it must always be dependable and trustworthy. Data testing enables you to identify data quality concerns before they reach the production data stream. Data engineers can confirm their organization’s assumptions and develop rules to prevent problems.
Generally, when testing, two distinct aspects need consideration: software and data pipelines. Software traditionally followed a three-tier architecture, covering the front end, middle tier, and database, and testing mirrored those tiers. To maximize efficiency, the backend has since been divided into manageable components that are interlinked but can operate independently of each other.
This trend began about five to six years ago with the advent of service-oriented architecture (SOA). The transition to manageable, self-operating SOA components changed the entire IT industry. Today, we call these APIs or microservices. APIs are bridges that carry data. Testers need to focus on the front end, the GUI, and the middleware, which includes APIs. You can test APIs and microservices without touching the front end. Similarly, you can test the backend independently by running SQL queries to evaluate the performance of CRUD operations.
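As a rough illustration of testing the backend on its own, the sketch below runs create, read, update, and delete statements against a scratch table and times each step. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real backend; the `customers` table and the `run_crud_checks` helper are hypothetical names chosen for this example.

```python
import sqlite3
import time


def run_crud_checks(conn, rows=1000):
    """Run create/read/update/delete on a scratch table and time each step."""
    timings = {}
    cur = conn.cursor()
    # Hypothetical scratch table used only for this check.
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

    # Create: bulk insert.
    start = time.perf_counter()
    cur.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(i, f"customer-{i}") for i in range(rows)],
    )
    conn.commit()
    timings["create"] = time.perf_counter() - start

    # Read: the row count must match what was inserted.
    start = time.perf_counter()
    count = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    timings["read"] = time.perf_counter() - start
    assert count == rows, f"expected {rows} rows, found {count}"

    # Update: change one row and read it back.
    start = time.perf_counter()
    cur.execute("UPDATE customers SET name = ? WHERE id = ?", ("renamed", 1))
    conn.commit()
    timings["update"] = time.perf_counter() - start
    name = cur.execute("SELECT name FROM customers WHERE id = 1").fetchone()[0]
    assert name == "renamed", "update was not persisted"

    # Delete: remove one row and confirm the count dropped by one.
    start = time.perf_counter()
    cur.execute("DELETE FROM customers WHERE id = ?", (1,))
    conn.commit()
    timings["delete"] = time.perf_counter() - start
    remaining = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    assert remaining == rows - 1, "delete did not remove exactly one row"

    return timings


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for operation, seconds in run_crud_checks(connection).items():
        print(f"{operation}: {seconds:.6f}s")
```

Against a real system, the same pattern would point at a test database rather than production, and the timings would be compared with whatever thresholds the team has agreed on.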
When it comes to data, it is the quality of the data itself that needs to be evaluated. For instance, are there null values in the extract or the source system? Has the data been transformed correctly? Since data engineering touches upon various stages of data, we need to rely on data testing tools that can test the data from all these angles.
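The two questions above, null values at the source and correctness of the transformation, can be sketched as small checks in plain Python. The record layout, the `to_target` rule, and both helper functions are assumptions made for illustration, not part of any particular tool.

```python
def find_null_fields(rows, required):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    problems = []
    for index, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                problems.append((index, field))
    return problems


def check_transformation(source_rows, target_rows, key, transform):
    """Re-apply the expected transform to each source row and compare with the target.

    Returns the keys of source rows whose target row is missing or different.
    """
    target_by_key = {row[key]: row for row in target_rows}
    mismatches = []
    for source in source_rows:
        expected = transform(source)
        if target_by_key.get(source[key]) != expected:
            mismatches.append(source[key])
    return mismatches


def to_target(row):
    """Hypothetical transformation rule: trim names, upper-case country codes."""
    return {
        "id": row["id"],
        "name": (row["name"] or "").strip(),
        "country": row["country"].upper(),
    }


if __name__ == "__main__":
    source = [
        {"id": 1, "name": " Ada ", "country": "uk"},
        {"id": 2, "name": None, "country": "in"},
    ]
    target = [
        {"id": 1, "name": "Ada", "country": "UK"},
        {"id": 2, "name": "", "country": "in"},  # country was not upper-cased
    ]
    print("null fields:", find_null_fields(source, ["name", "country"]))
    print("mismatched keys:", check_transformation(source, target, "id", to_target))
```

Keeping the expected transformation as an ordinary function means the same rule can drive both the pipeline and its test, which is one way to stop the two from drifting apart.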
There is a host of tools for testing the data. For instance, Apache offers toolsets for extraction, AWS Lambda functions can be used to transform the data, and there are different tools for the load step.
Automating the process
There are several toolkits for automating testing, and this field is expanding. However, even today, only some degree of automation is achievable. For instance, in data validation, tools such as Datadog and the open-source Metabase (along with other commercial solutions) may assist in monitoring the data’s characteristics.
When transferring data in batches, you may use these tools to examine the data’s attributes. From the perspective of data ingestion, where you load the data, Apache ZooKeeper and Apache Kafka can be used to create a pipeline that transfers data from A to B. You can also set up monitoring and logging methods to collect data-related information.
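To make the idea of examining a batch's attributes and logging the result concrete, here is a minimal sketch in plain Python. It does not use Kafka or any monitoring product; the field names, the `max_null_ratio` threshold, and both functions are hypothetical and would need to be adapted to a real pipeline.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("batch_monitor")


def profile_batch(batch, expected_fields):
    """Summarise a batch: row count, per-field null counts, and unexpected fields.

    A field that is absent from a record is counted as null.
    """
    profile = {
        "rows": len(batch),
        "nulls": {field: 0 for field in expected_fields},
        "unexpected_fields": set(),
    }
    for record in batch:
        for field in expected_fields:
            if record.get(field) is None:
                profile["nulls"][field] += 1
        profile["unexpected_fields"].update(set(record) - set(expected_fields))
    return profile


def monitor_batch(batch, expected_fields, max_null_ratio=0.1):
    """Log the batch profile and return the fields whose null ratio is too high."""
    profile = profile_batch(batch, expected_fields)
    log.info("batch rows=%d nulls=%s", profile["rows"], profile["nulls"])
    if profile["unexpected_fields"]:
        log.warning("unexpected fields: %s", sorted(profile["unexpected_fields"]))

    alerts = []
    for field, nulls in profile["nulls"].items():
        ratio = nulls / profile["rows"] if profile["rows"] else 0.0
        if ratio > max_null_ratio:
            log.warning("field %r null ratio %.2f exceeds %.2f", field, ratio, max_null_ratio)
            alerts.append(field)
    return alerts


if __name__ == "__main__":
    sample = [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": None},
        {"id": 3, "amount": 5.0, "debug": True},
    ]
    print("fields needing attention:", monitor_batch(sample, ["id", "amount"]))
```

In practice the log lines would be shipped to whichever monitoring tool is in use, so that the alerts show up alongside the pipeline's other metrics.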
Data storage systems, such as your S3 buckets, data lakes, or HDFS file system, will include data-related metadata, indicating what kind of data is received and saved. Talend and Kettle are open-source ETL alternatives for data migration.
These tools help with configuring, testing, and moving the data. A test environment can be set up before these tools are configured in production. If you find anomalies in the source, you can rewrite the SQL queries or correct values such as malformed dates, which would otherwise eventually fail in the target system. With automation and semi-automation, you can use some of these tools to look at the logs, automate log analysis to monitor the data and inspect its quality, and then configure the setup in production.
Testing-first approach
Automation can help establish a testing-first approach, which has been important for the last decade. As opposed to bringing in testers at the end of the development cycle, both developers and testers are involved in requirements gathering right from the beginning.
DevOps pipelines automatically test code when it is fixed and redeployed, so fixing errors early on improves the quality of the system and the product. The same holds for data pipelines, especially as we move toward smart automation with AI and ML.
Smart automation brings low-code aspects into the machine world, with drag-and-drop tools such as Mendix or Claris FileMaker simplifying workflow configuration even further. However, AI-led automation is a complement rather than a substitute. While AI helps analyze data and build a model for forecasting, the testing world is guided by actual metrics rather than predictions. Even as the data engineering side of testing evolves, AI can still aid with automation and support quality assurance.
A script or code created today may be irrelevant tomorrow because the application will have changed. AI-led smart automation tools can identify elements through multiple means, such as element IDs, XPath, or CSS selectors, and generate scripts automatically.
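The underlying idea of identifying an element through more than one locator can be shown with a simple fallback chain. The sketch below is not how any AI-led tool works internally; it uses Python's built-in `xml.etree.ElementTree` (which supports only a subset of XPath) on a made-up page, and the class-based lookup is a crude stand-in for a CSS selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup; the button has lost the id an older script relied on.
PAGE = """
<html><body>
  <form>
    <input id="email" class="field" />
    <button class="btn primary" name="submit">Send</button>
  </form>
</body></html>
""".strip()


def locate(root, locators):
    """Try each (strategy, value) locator in order.

    Returns (element, strategy) for the first match, or (None, None).
    """
    for strategy, value in locators:
        if strategy == "id":
            element = root.find(f".//*[@id='{value}']")
        elif strategy == "xpath":
            element = root.find(value)
        elif strategy == "class":
            # Stand-in for a CSS class selector: match on one class token.
            element = next(
                (e for e in root.iter() if value in e.get("class", "").split()),
                None,
            )
        else:
            raise ValueError(f"unknown locator strategy: {strategy}")
        if element is not None:
            return element, strategy
    return None, None


if __name__ == "__main__":
    page = ET.fromstring(PAGE)
    element, used = locate(
        page,
        [
            ("id", "submit-btn"),                    # stale id, no longer present
            ("xpath", ".//button[@name='submit']"),  # still matches
            ("class", "primary"),
        ],
    )
    print(used, element.tag if element is not None else None)
```

A script built this way keeps working when one attribute changes, which is the property that makes generated scripts less brittle than a single hard-coded locator.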
Moving towards data assurance
In the days ahead, we can expect to see tools that will automatically analyze the data, curate it, and so on, without human intervention. As data evolves and SaaS takes over the IT world, drawing insights from the data deluge for last-mile adoption will become increasingly critical. And this is where data assurance and data engineering validation will play a crucial role.
Trustworthiness and ease of use will be pivotal to adoption. When data scientists work with data, they usually divide it into training and test data in a 60:40, 70:30, or 80:20 ratio. Whatever the ratio, the model will be inaccurate if the training data is wrong, and manually checking zillions of rows and columns is impossible. With tools such as Databricks, pandas, or plain Python, these checks can be automated, and data assurance of this kind can raise a model’s accuracy level from 80% to 90%.
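A minimal sketch of both steps, checking rows against simple rules and then splitting them 80:20, is shown below using only the Python standard library. The rule set, field names, and helper functions are hypothetical; a real project would more likely use pandas or a dedicated validation library.

```python
import random


def invalid_rows(rows, rules):
    """Return the rows that break any rule; `rules` maps a field to a predicate."""
    return [
        row
        for row in rows
        if not all(field in row and check(row[field]) for field, check in rules.items())
    ]


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle reproducibly and split into (train, test) by the given fraction."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical records: 100 plausible rows plus one with an impossible age.
    data = [{"age": age, "label": age % 2} for age in range(100)]
    data.append({"age": -5, "label": 1})

    rules = {
        "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
        "label": lambda v: v in (0, 1),
    }
    bad = invalid_rows(data, rules)
    clean = [row for row in data if row not in bad]

    train, test = split_rows(clean, train_fraction=0.8)
    print(f"rejected={len(bad)} train={len(train)} test={len(test)}")
```

Validating before splitting means a bad row cannot end up in either set, so the test data measures the model rather than the data errors.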
The next step for companies would be to establish R&D groups inside the data assurance practice and evaluate various open-source tools to create a spectrum of automation. Such customized automation solutions will take data assurance to the next level and prepare the system for the upcoming IoT revolution.
– The author is the Executive Vice President, Digital Assurance