Big Data Testing Overview#
Big data testing refers to the testing of systems and applications built on big data technologies. It can be divided into two dimensions:
- Data testing;
- Big data system testing and big data application product testing.
This article focuses on the first dimension: data testing.
Core Content of Testing#
Data Quality Testing#
- Completeness: Verify that no data is missing (e.g., empty fields, lost records).
- Consistency: Check whether the format and logic of data are consistent across different systems or storage.
- Accuracy: Ensure that data values are consistent with the real world (e.g., correct value range, proper format).
- Uniqueness: Detect duplicate data (e.g., primary key conflicts).
- Timeliness: Verify whether data is updated or synchronized as expected.
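As a minimal illustration of how the checks above can be automated, the following PySpark sketch validates completeness (no null key fields) and uniqueness (no duplicate primary keys). The table name `orders` and its columns are assumptions for the example, not a prescribed schema.

```python
# Hedged sketch: completeness and uniqueness checks with PySpark.
# Assumption: a Hive table named "orders" with columns order_id and user_id.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()
orders = spark.table("orders")

total = orders.count()
null_keys = orders.filter(F.col("order_id").isNull() | F.col("user_id").isNull()).count()
duplicate_ids = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()

# Completeness: no rows with missing keys; uniqueness: no repeated order_id values.
assert null_keys == 0, f"{null_keys} of {total} rows have null keys"
assert duplicate_ids == 0, f"{duplicate_ids} order_id values are duplicated"
```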
Data Processing Logic Testing#
- MapReduce/Spark Job Verification: Test the logical correctness of distributed computing tasks (e.g., aggregation, filtering, join operations).
- ETL (Extract-Transform-Load) Testing: Verify whether the transformation rules of data from source to target systems are accurate.
- Data Partitioning and Sharding Testing: Check whether data is correctly distributed across different nodes according to rules.
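A practical pattern for this kind of logic testing is to run the job's transformation function on a tiny hand-crafted dataset in Spark local mode and compare the output with values computed by hand. The sketch below follows that pattern; the aggregation function and sample rows are hypothetical.

```python
# Hedged sketch: unit-testing an aggregation's logic on a tiny local dataset.
from pyspark.sql import SparkSession, functions as F

def total_amount_per_user(df):
    """Transformation under test: sum of amount per user (hypothetical job logic)."""
    return df.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

spark = SparkSession.builder.master("local[2]").appName("logic-test").getOrCreate()
source = spark.createDataFrame(
    [("u1", 10.0), ("u1", 5.0), ("u2", 7.5)], ["user_id", "amount"])

result = {r["user_id"]: r["total_amount"] for r in total_amount_per_user(source).collect()}
assert result == {"u1": 15.0, "u2": 7.5}, f"Unexpected aggregation result: {result}"
```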
Performance Testing#
- Throughput: Test the amount of data the system processes per unit of time (e.g., records processed per second).
- Latency: Verify the response time of data processing (e.g., query duration).
- Scalability: Assess how much performance improves after the cluster is scaled out (nodes are added).
- Fault Tolerance: Simulate node failures to test the system's recovery capability.
System Integration Testing#
- Component Compatibility: Verify that components such as Hadoop, Spark, Kafka, and Hive work together correctly.
- Interface Testing: Check the correctness of data transmission in APIs or message queues (e.g., Kafka).
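For the Kafka interface in particular, a simple produce-and-consume round trip already catches many wiring and serialization problems. The sketch below uses the kafka-python client; the broker address `localhost:9092` and the topic name `orders_test` are assumptions.

```python
# Hedged sketch: Kafka produce/consume round trip with the kafka-python client.
# Assumptions: broker at localhost:9092 and a test topic named "orders_test".
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("orders_test", {"order_id": 1, "amount": 99.5})
producer.flush()

consumer = KafkaConsumer("orders_test",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000,
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
messages = [record.value for record in consumer]
assert {"order_id": 1, "amount": 99.5} in messages, "Sent message was not consumed back"
```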
Security Testing#
- Access Control: Verify user/role access permissions to data (e.g., HDFS ACL, Kerberos authentication).
- Data Encryption: Test whether data is encrypted in transit and at rest (e.g., SSL/TLS, at-rest encryption).
- Audit Logs: Check whether operation logs are completely recorded.
Main Steps of Testing#
Requirement Analysis and Test Planning#
- Clarify business requirements (e.g., data processing rules, performance metrics).
- Develop testing strategies (e.g., tool selection, environment configuration, data scale).
Test Environment Setup#
- Deploy infrastructure such as Hadoop clusters, Spark, databases, etc.
- Configure test data generation tools (e.g., Apache NiFi, custom scripts).
Test Data Preparation#
- Data Generation: Use tools (e.g., DBMonster, Mockaroo) to create structured/unstructured data.
- Data Injection: Load data into HDFS, Kafka, or other storage systems and message queues.
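When off-the-shelf generators do not fit, a short script is often enough. The hedged sketch below uses Python's Faker library to produce synthetic order records as JSON lines that can then be pushed to Kafka or copied into HDFS; all field names, counts, and paths are illustrative assumptions.

```python
# Hedged sketch: generate synthetic order records as JSON lines (illustrative fields/paths).
import json
import random
import uuid
from faker import Faker

fake = Faker()

def generate_orders(n):
    for _ in range(n):
        yield {
            "order_id": str(uuid.uuid4()),
            "user_id": fake.uuid4(),
            "user_name": fake.name(),
            "amount": round(random.uniform(1, 500), 2),
            "sale_date": fake.date_between(start_date="-1y").isoformat(),
        }

with open("orders.jsonl", "w", encoding="utf-8") as f:
    for record in generate_orders(100_000):
        f.write(json.dumps(record) + "\n")

# The file can then be injected, e.g. with `hdfs dfs -put orders.jsonl /user/orders/`,
# or replayed into a Kafka topic by a producer script.
```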
Test Case Design#
- Cover positive scenarios (e.g., normal data processing) and exceptional scenarios (e.g., node crashes, data skew).
- Design performance testing scenarios (e.g., high-concurrency queries, large-scale data writes).
Test Execution and Monitoring#
- Run test cases and record results.
- Use monitoring tools (e.g., Ganglia, Prometheus) to track resource usage (CPU, memory, disk I/O).
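Resource metrics can also be pulled programmatically during a run, for example from Prometheus' HTTP query API, so that the test itself can fail when a threshold is breached. The sketch below is a hedged example; the Prometheus address, the node_exporter metric, and the 80% CPU threshold are assumptions.

```python
# Hedged sketch: query cluster CPU usage from Prometheus' HTTP API during a test run.
# Assumptions: Prometheus at localhost:9090, node_exporter metrics, an 80% CPU threshold.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = '100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100'  # CPU usage in %

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    cpu_usage = float(result[0]["value"][1])
    print(f"Cluster CPU usage: {cpu_usage:.1f}%")
    assert cpu_usage <= 80, "CPU usage exceeded the 80% threshold during the test"
```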
Result Analysis and Reporting#
- Analyze failed cases and locate the root cause of issues (e.g., code logic errors, configuration problems).
- Generate test reports, including pass rates, performance metrics, and defect lists.
Regression Testing and Optimization#
- Retest after fixing defects to ensure issues are resolved and no new problems are introduced.
- Optimize system configuration based on performance testing results (e.g., adjust JVM parameters, optimize Shuffle processes).
Common Testing Methods#
Functional Testing Methods#
- Sampling Verification: Draw a subset from a large dataset and validate it exhaustively.
- End-to-End Testing: Simulate the complete business process to verify the correctness of data from input to output.
- Golden Dataset Comparison: Compare processing results with a pre-generated correct dataset.
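A golden-dataset comparison, for instance, can be as simple as diffing the job output against a stored reference with set-difference operations. The PySpark sketch below assumes both datasets are Parquet files at hypothetical paths.

```python
# Hedged sketch: compare job output against a golden dataset (assumed Parquet paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("golden-compare").getOrCreate()

actual = spark.read.parquet("/data/output/daily_agg")   # result produced by the job under test
golden = spark.read.parquet("/data/golden/daily_agg")   # pre-validated reference result

missing = golden.exceptAll(actual)   # rows expected but not produced
extra = actual.exceptAll(golden)     # rows produced but not expected

assert missing.count() == 0 and extra.count() == 0, (
    f"Mismatch: {missing.count()} missing rows, {extra.count()} extra rows")
```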
Performance Testing Methods#
- Benchmark Testing: Use standard datasets (e.g., TPC-DS) to evaluate system performance.
- Stress Testing: Gradually increase the load until the system crashes to identify bottlenecks.
- Stability Testing: Run tasks for an extended period to detect memory leaks or resource exhaustion issues.
Automated Testing#
- Tool Selection:
- Data Quality: Great Expectations, Deequ
- Performance Testing: JMeter, Gatling, YCSB (NoSQL benchmarking tool)
- ETL Testing: QuerySurge, Talend
- Framework Integration: Integrate test scripts into CI/CD pipelines (e.g., Jenkins, GitLab CI).
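A lightweight way to achieve this integration is to express data checks as ordinary pytest tests that the pipeline executes on every build (e.g., a `pytest` stage in Jenkins or GitLab CI). The sketch below is a hedged example; the Hive table name and session settings are assumptions.

```python
# Hedged sketch: data checks as pytest tests, so a CI stage can simply run `pytest -q`.
# Assumptions: a Hive-enabled Spark session and a table named "orders".
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.appName("ci-data-tests").enableHiveSupport().getOrCreate()

def test_orders_have_no_null_keys(spark):
    nulls = spark.sql(
        "SELECT COUNT(*) AS cnt FROM orders WHERE order_id IS NULL OR user_id IS NULL"
    ).collect()[0]["cnt"]
    assert nulls == 0

def test_order_ids_are_unique(spark):
    dupes = spark.sql(
        "SELECT COUNT(*) AS cnt FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1) t"
    ).collect()[0]["cnt"]
    assert dupes == 0
```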
Chaos Engineering#
- Simulate failures in a distributed environment (e.g., network partition, disk failure) to validate the system's fault tolerance.
Test Case Examples#
Data Quality Testing#
Scenario: Verify whether the order data collected in real time from Kafka is complete and free of duplicates.
Test Case Design:
| Test Case ID | DQ-001 |
|---|---|
| Test Objective | Verify the completeness and uniqueness of order data |
| Preconditions | 1. 100,000 simulated order records injected into the Kafka topic (including order_id, user_id, amount, etc.). 2. The data has been consumed and stored in HDFS at `/user/orders`. |
| Test Steps | 1. Use Hive to count the records in HDFS: `SELECT COUNT(*) FROM orders;` 2. Check the null rate of key fields: `SELECT COUNT(*) FROM orders WHERE order_id IS NULL OR user_id IS NULL;` 3. Detect duplicate order IDs: `SELECT order_id, COUNT(*) AS cnt FROM orders GROUP BY order_id HAVING cnt > 1;` |
| Expected Result | The total record count matches the amount injected from Kafka (100,000). The null rate of key fields (order_id, user_id) is 0%. There are no duplicate order_id records. |
Tool Support:
Use Great Expectations for automated validation of data distribution and constraints.
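A minimal sketch of such a validation, using Great Expectations' legacy Pandas API (0.x) and the same fields as DQ-001, might look like the following; newer releases express the same expectations through an expectation suite and validator instead.

```python
# Hedged sketch: validate completeness and uniqueness with Great Expectations (legacy 0.x API).
import pandas as pd
import great_expectations as ge

# In practice the sample would come from HDFS/Hive; a tiny inline frame keeps the sketch runnable.
orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 11, 12],
    "amount": [20.0, 35.5, 12.0],
}))

orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_not_be_null("user_id")
orders.expect_column_values_to_be_unique("order_id")

results = orders.validate()
assert results.success, results
```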
ETL Testing#
Scenario: Verify whether the ETL process of user data from MySQL to Hive is accurate.
Test Case Design:
| Test Case ID | ETL-002 |
|---|---|
| Test Objective | Verify the transformation logic for the user age field (birth_date in MySQL to age in Hive) |
| Preconditions | 1. The user table in MySQL contains the fields id, name, birth_date (DATE type). 2. The ETL job converts birth_date into age in the Hive table (INT type, calculated by year). |
| Test Steps | 1. Insert test data in MySQL: `INSERT INTO user (id, name, birth_date) VALUES (1, 'Alice', '1990-05-20'), (2, 'Bob', '2005-11-15');` 2. Execute the ETL job to synchronize the data to the Hive table `user_hive`. 3. Query the age field in the Hive table: `SELECT name, age FROM user_hive WHERE id IN (1, 2);` |
| Expected Result | Alice's age is the current year minus 1990 (e.g., 33 in 2023). Bob's age is the current year minus 2005 (e.g., 18 in 2023). |
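The expected result can itself be checked automatically by recomputing the age from the MySQL source and comparing it against the Hive target, in the spirit of source-to-target tools such as QuerySurge. The sketch below is a hedged example using the PyMySQL and PyHive clients; hosts, credentials, and database names are placeholders.

```python
# Hedged sketch: source-to-target check for ETL-002 (placeholder hosts and credentials).
from datetime import date
import pymysql
from pyhive import hive

mysql_conn = pymysql.connect(host="mysql-host", user="test", password="test", database="app")
hive_conn = hive.connect(host="hive-host", port=10000)

# Recompute the expected age (current year minus birth year) from the MySQL source.
with mysql_conn.cursor() as cur:
    cur.execute("SELECT id, YEAR(birth_date) FROM user WHERE id IN (1, 2)")
    expected = {row[0]: date.today().year - row[1] for row in cur.fetchall()}

# Read the transformed age from the Hive target.
hive_cur = hive_conn.cursor()
hive_cur.execute("SELECT id, age FROM user_hive WHERE id IN (1, 2)")
actual = {row[0]: row[1] for row in hive_cur.fetchall()}

assert actual == expected, f"ETL age mismatch: expected {expected}, got {actual}"
```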
Performance Testing#
Scenario: Test Hive's response time and resource usage under 100 concurrent queries.
Test Case Design:
| Test Case ID | PERF-003 |
|---|---|
| Test Objective | Verify Hive's stability under concurrent queries |
| Preconditions | 1. 100 million sales records loaded into the Hive sales table. 2. Test cluster configuration: 10 nodes (8-core CPU, 32 GB memory each). |
| Test Steps | 1. Use JMeter to create 100 threads, each executing the following query: `SELECT product_category, SUM(amount) FROM sales WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31' GROUP BY product_category;` 2. Monitor the CPU, memory, and GC status of HiveServer2 (via Ganglia or Prometheus). 3. Record the response time of each query and calculate the 90th percentile. |
| Expected Result | All queries execute successfully, with no timeouts or OOM errors. Average response time ≤ 15 seconds; 90th percentile ≤ 20 seconds. Peak CPU usage ≤ 80%, with no sustained memory growth. |
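Where JMeter is not available, an equivalent probe can be scripted directly. The hedged sketch below fires the same query from 100 threads through PyHive and checks the average and 90th-percentile latency against the expected thresholds; the HiveServer2 host and port are placeholders.

```python
# Hedged sketch: 100 concurrent Hive queries with a p90 latency check (placeholder endpoint).
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from pyhive import hive

QUERY = ("SELECT product_category, SUM(amount) FROM sales "
         "WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31' GROUP BY product_category")

def run_query(_):
    conn = hive.connect(host="hiveserver2-host", port=10000)
    cursor = conn.cursor()
    start = time.time()
    cursor.execute(QUERY)
    cursor.fetchall()
    conn.close()
    return time.time() - start

with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(run_query, range(100)))

p90 = statistics.quantiles(latencies, n=10)[-1]   # 90th percentile
print(f"avg={statistics.mean(latencies):.1f}s p90={p90:.1f}s")
assert statistics.mean(latencies) <= 15 and p90 <= 20, "Latency SLA exceeded"
```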
Fault Tolerance Testing#
Scenario: Verify whether the Spark Streaming job can automatically recover when a Worker node crashes.
Test Case Design:
| Test Case ID | FT-004 |
|---|---|
| Test Objective | Test the fault-tolerance capability of Spark jobs |
| Preconditions | 1. The Spark cluster has 3 Worker nodes. 2. A real-time word-count job is running, reading data from Kafka with a window interval of 1 minute. |
| Test Steps | 1. Continuously send data to Kafka (100 records per second). 2. After the job has run for 5 minutes, manually terminate one Worker node. 3. Observe: whether the Driver logs show task rescheduling; whether a new Worker automatically joins the cluster (if dynamic resource allocation is enabled); whether the window results are complete (compare total word counts before and after the failure). |
| Expected Result | The job resumes processing within 30 seconds, with no data loss. The final word-count results match the total amount of data sent. |
Tool Support:
Use Chaos Monkey to simulate node failures.
Security Testing#
Scenario: Verify whether the ACL permission controls on the HDFS directory are effective.
Test Case Design:
| Test Case ID | SEC-005 |
|---|---|
| Test Objective | Ensure that sensitive directories (e.g., `/finance`) are accessible only to authorized users |
| Preconditions | 1. The HDFS directory `/finance` has permissions set to 750 (owning group finance_team). 2. User alice belongs to finance_team; user bob does not. |
| Test Steps | 1. As alice, execute: `hdfs dfs -ls /finance` (should succeed); `hdfs dfs -put report.csv /finance` (should succeed). 2. As bob, execute: `hdfs dfs -ls /finance` (should return "Permission denied"); `hdfs dfs -rm /finance/report.csv` (should fail). |
| Expected Result | The authorized user (alice) can read and write the directory; the unauthorized user (bob) is denied access. |
Automated Verification:
```bash
# Shell script for automated verification.
# Run this as the unauthorized user (e.g., bob): listing /finance should be denied.
if hdfs dfs -ls /finance >/dev/null 2>&1; then
    echo "Test FAILED: unauthorized access was allowed."
else
    echo "Test PASSED: access was denied as expected."
fi
```