Big Data Testing Overview#
Big data testing refers to the testing of systems and applications built on big data technologies. It can be divided into two dimensions:
- Data testing;
- Big data system testing and big data application product testing.
This article focuses on the first dimension: data testing.
Core Content of Testing#
Data Quality Testing#
- Completeness: Verify that no data is missing (e.g., empty fields, lost records).
- Consistency: Check whether the format and logic of data are consistent across different systems or storage.
- Accuracy: Ensure that data values are consistent with the real world (e.g., correct value range, proper format).
- Uniqueness: Detect duplicate data (e.g., primary key conflicts).
- Timeliness: Verify whether data is updated or synchronized as expected.
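As a minimal illustration of how the checks above can be automated, the following PySpark sketch validates completeness (no null key fields) and uniqueness (no duplicate primary keys). The table name `orders` and its columns are assumptions for the example, not a prescribed schema.

```python
# Hedged sketch: completeness and uniqueness checks with PySpark.
# Assumption: a Hive table named "orders" with columns order_id and user_id.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()
orders = spark.table("orders")

total = orders.count()
null_keys = orders.filter(F.col("order_id").isNull() | F.col("user_id").isNull()).count()
duplicate_ids = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()

# Completeness: no rows with missing keys; uniqueness: no repeated order_id values.
assert null_keys == 0, f"{null_keys} of {total} rows have null keys"
assert duplicate_ids == 0, f"{duplicate_ids} order_id values are duplicated"
```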
Data Processing Logic Testing#
- MapReduce/Spark Job Verification: Test the logical correctness of distributed computing tasks (e.g., aggregation, filtering, join operations).
- ETL (Extract-Transform-Load) Testing: Verify whether the transformation rules of data from source to target systems are accurate.
- Data Partitioning and Sharding Testing: Check whether data is correctly distributed across different nodes according to rules.
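A practical pattern for this kind of logic testing is to run the job's transformation function on a tiny hand-crafted dataset in Spark local mode and compare the output with values computed by hand. The sketch below follows that pattern; the aggregation function and sample rows are hypothetical.

```python
# Hedged sketch: unit-testing an aggregation's logic on a tiny local dataset.
from pyspark.sql import SparkSession, functions as F

def total_amount_per_user(df):
    """Transformation under test: sum of amount per user (hypothetical job logic)."""
    return df.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

spark = SparkSession.builder.master("local[2]").appName("logic-test").getOrCreate()
source = spark.createDataFrame(
    [("u1", 10.0), ("u1", 5.0), ("u2", 7.5)], ["user_id", "amount"])

result = {r["user_id"]: r["total_amount"] for r in total_amount_per_user(source).collect()}
assert result == {"u1": 15.0, "u2": 7.5}, f"Unexpected aggregation result: {result}"
```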
Performance Testing#
- Throughput: Test the amount of data the system processes per unit of time (e.g., records processed per second).
- Latency: Verify the response time of data processing (e.g., query duration).
- Scalability: Assess how much performance improves after the cluster is scaled out (nodes are added).
- Fault Tolerance: Simulate node failures to test the system's recovery capability.
System Integration Testing#
- Component Compatibility: Verify that components such as Hadoop, Spark, Kafka, and Hive work together correctly.
- Interface Testing: Check the correctness of data transmission in APIs or message queues (e.g., Kafka).
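For the Kafka interface in particular, a simple produce-and-consume round trip already catches many wiring and serialization problems. The sketch below uses the kafka-python client; the broker address `localhost:9092` and the topic name `orders_test` are assumptions.

```python
# Hedged sketch: Kafka produce/consume round trip with the kafka-python client.
# Assumptions: broker at localhost:9092 and a test topic named "orders_test".
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("orders_test", {"order_id": 1, "amount": 99.5})
producer.flush()

consumer = KafkaConsumer("orders_test",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000,
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
messages = [record.value for record in consumer]
assert {"order_id": 1, "amount": 99.5} in messages, "Sent message was not consumed back"
```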
Security Testing#
- Access Control: Verify user/role access permissions to data (e.g., HDFS ACL, Kerberos authentication).
- Data Encryption: Test whether data is encrypted in transit and at rest (e.g., SSL/TLS, at-rest encryption).
- Audit Logs: Check whether operation logs are completely recorded.
Main Steps of Testing#
Requirement Analysis and Test Planning#
- Clarify business requirements (e.g., data processing rules, performance metrics).
- Develop testing strategies (e.g., tool selection, environment configuration, data scale).
Test Environment Setup#
- Deploy infrastructure such as Hadoop clusters, Spark, databases, etc.
- Configure test data generation tools (e.g., Apache NiFi, custom scripts).
Test Data Preparation#
- Data Generation: Use tools (e.g., DBMonster, Mockaroo) to create structured/unstructured data.
- Data Injection: Load data into HDFS, Kafka, or other storage systems and message queues.
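When off-the-shelf generators do not fit, a short script is often enough. The hedged sketch below uses Python's Faker library to produce synthetic order records as JSON lines that can then be pushed to Kafka or copied into HDFS; all field names, counts, and paths are illustrative assumptions.

```python
# Hedged sketch: generate synthetic order records as JSON lines (illustrative fields/paths).
import json
import random
import uuid
from faker import Faker

fake = Faker()

def generate_orders(n):
    for _ in range(n):
        yield {
            "order_id": str(uuid.uuid4()),
            "user_id": fake.uuid4(),
            "user_name": fake.name(),
            "amount": round(random.uniform(1, 500), 2),
            "sale_date": fake.date_between(start_date="-1y").isoformat(),
        }

with open("orders.jsonl", "w", encoding="utf-8") as f:
    for record in generate_orders(100_000):
        f.write(json.dumps(record) + "\n")

# The file can then be injected, e.g. with `hdfs dfs -put orders.jsonl /user/orders/`,
# or replayed into a Kafka topic by a producer script.
```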
Test Case Design#
- Cover positive scenarios (e.g., normal data processing) and exceptional scenarios (e.g., node crashes, data skew).
- Design performance testing scenarios (e.g., high-concurrency queries, large-scale data writes).
Test Execution and Monitoring#
- Run test cases and record results.
- Use monitoring tools (e.g., Ganglia, Prometheus) to track resource usage (CPU, memory, disk I/O).
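Resource metrics can also be pulled programmatically during a run, for example from Prometheus' HTTP query API, so that the test itself can fail when a threshold is breached. The sketch below is a hedged example; the Prometheus address, the node_exporter metric, and the 80% CPU threshold are assumptions.

```python
# Hedged sketch: query cluster CPU usage from Prometheus' HTTP API during a test run.
# Assumptions: Prometheus at localhost:9090, node_exporter metrics, an 80% CPU threshold.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = '100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100'  # CPU usage in %

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    cpu_usage = float(result[0]["value"][1])
    print(f"Cluster CPU usage: {cpu_usage:.1f}%")
    assert cpu_usage <= 80, "CPU usage exceeded the 80% threshold during the test"
```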
Result Analysis and Reporting#
- Analyze failed cases and locate the root cause of issues (e.g., code logic errors, configuration problems).
- Generate test reports, including pass rates, performance metrics, and defect lists.
Regression Testing and Optimization#
- Retest after fixing defects to ensure issues are resolved and no new problems are introduced.
- Optimize system configuration based on performance testing results (e.g., adjust JVM parameters, optimize Shuffle processes).
Common Testing Methods#
Functional Testing Methods#
- Sampling Verification: Draw a subset from a large dataset and validate it exhaustively.
- End-to-End Testing: Simulate the complete business process to verify the correctness of data from input to output.
- Golden Dataset Comparison: Compare processing results with a pre-generated correct dataset.
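A golden-dataset comparison, for instance, can be as simple as diffing the job output against a stored reference with set-difference operations. The PySpark sketch below assumes both datasets are Parquet files at hypothetical paths.

```python
# Hedged sketch: compare job output against a golden dataset (assumed Parquet paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("golden-compare").getOrCreate()

actual = spark.read.parquet("/data/output/daily_agg")   # result produced by the job under test
golden = spark.read.parquet("/data/golden/daily_agg")   # pre-validated reference result

missing = golden.exceptAll(actual)   # rows expected but not produced
extra = actual.exceptAll(golden)     # rows produced but not expected

assert missing.count() == 0 and extra.count() == 0, (
    f"Mismatch: {missing.count()} missing rows, {extra.count()} extra rows")
```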
Performance Testing Methods#
- Benchmark Testing: Use standard datasets (e.g., TPC-DS) to evaluate system performance.
- Stress Testing: Gradually increase the load until the system crashes to identify bottlenecks.
- Stability Testing: Run tasks for an extended period to detect memory leaks or resource exhaustion issues.
Automated Testing#
- Tool Selection:
- Data Quality: Great Expectations, Deequ
- Performance Testing: JMeter, Gatling, YCSB (NoSQL benchmarking tool)
- ETL Testing: QuerySurge, Talend
- Framework Integration: Integrate test scripts into CI/CD pipelines (e.g., Jenkins, GitLab CI).
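A lightweight way to achieve this integration is to express data checks as ordinary pytest tests that the pipeline executes on every build (e.g., a `pytest` stage in Jenkins or GitLab CI). The sketch below is a hedged example; the Hive table name and session settings are assumptions.

```python
# Hedged sketch: data checks as pytest tests, so a CI stage can simply run `pytest -q`.
# Assumptions: a Hive-enabled Spark session and a table named "orders".
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.appName("ci-data-tests").enableHiveSupport().getOrCreate()

def test_orders_have_no_null_keys(spark):
    nulls = spark.sql(
        "SELECT COUNT(*) AS cnt FROM orders WHERE order_id IS NULL OR user_id IS NULL"
    ).collect()[0]["cnt"]
    assert nulls == 0

def test_order_ids_are_unique(spark):
    dupes = spark.sql(
        "SELECT COUNT(*) AS cnt FROM (SELECT order_id FROM orders "
        "GROUP BY order_id HAVING COUNT(*) > 1) t"
    ).collect()[0]["cnt"]
    assert dupes == 0
```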
Chaos Engineering#
- Simulate failures in a distributed environment (e.g., network partition, disk failure) to validate the system's fault tolerance.
Test Case Examples#
Data Quality Testing#
Scenario: Verify whether the order data collected in real time from Kafka is complete and free of duplicates.
Test Case Design:
| Test Case ID | DQ-001 |
|---|---|
| Test Objective | Verify the completeness and uniqueness of order data |
| Preconditions | 1. 100,000 simulated order records injected into the Kafka topic (including order_id, user_id, amount, etc.). 2. The data has been consumed and stored in HDFS at `/user/orders`. |
| Test Steps | 1. Use Hive to count the records in HDFS: `SELECT COUNT(*) FROM orders;` 2. Check the null rate of key fields: `SELECT COUNT(*) FROM orders WHERE order_id IS NULL OR user_id IS NULL;` 3. Detect duplicate order IDs: `SELECT order_id, COUNT(*) AS cnt FROM orders GROUP BY order_id HAVING cnt > 1;` |
| Expected Result | The total record count matches the amount injected from Kafka (100,000). The null rate of key fields (order_id, user_id) is 0%. There are no duplicate order_id records. |
Tool Support:
Use Great Expectations for automated validation of data distribution and constraints.
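A minimal sketch of such a validation, using Great Expectations' legacy Pandas API (0.x) and the same fields as DQ-001, might look like the following; newer releases express the same expectations through an expectation suite and validator instead.

```python
# Hedged sketch: validate completeness and uniqueness with Great Expectations (legacy 0.x API).
import pandas as pd
import great_expectations as ge

# In practice the sample would come from HDFS/Hive; a tiny inline frame keeps the sketch runnable.
orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 11, 12],
    "amount": [20.0, 35.5, 12.0],
}))

orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_not_be_null("user_id")
orders.expect_column_values_to_be_unique("order_id")

results = orders.validate()
assert results.success, results
```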
ETL Testing#
Scenario: Verify whether the ETL process of user data from MySQL to Hive is accurate.
Test Case Design:
| Test Case ID | ETL-002 |
|---|---|
| Test Objective | Verify the transformation logic for the user age field (birth_date in MySQL to age in Hive) |
| Preconditions | 1. The user table in MySQL contains the fields id, name, birth_date (DATE type). 2. The ETL job converts birth_date into age in the Hive table (INT type, calculated by year). |
| Test Steps | 1. Insert test data in MySQL: `INSERT INTO user (id, name, birth_date) VALUES (1, 'Alice', '1990-05-20'), (2, 'Bob', '2005-11-15');` 2. Execute the ETL job to synchronize the data to the Hive table `user_hive`. 3. Query the age field in the Hive table: `SELECT name, age FROM user_hive WHERE id IN (1, 2);` |
| Expected Result | Alice's age is the current year minus 1990 (e.g., 33 in 2023). Bob's age is the current year minus 2005 (e.g., 18 in 2023). |
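The expected result can itself be checked automatically by recomputing the age from the MySQL source and comparing it against the Hive target, in the spirit of source-to-target tools such as QuerySurge. The sketch below is a hedged example using the PyMySQL and PyHive clients; hosts, credentials, and database names are placeholders.

```python
# Hedged sketch: source-to-target check for ETL-002 (placeholder hosts and credentials).
from datetime import date
import pymysql
from pyhive import hive

mysql_conn = pymysql.connect(host="mysql-host", user="test", password="test", database="app")
hive_conn = hive.connect(host="hive-host", port=10000)

# Recompute the expected age (current year minus birth year) from the MySQL source.
with mysql_conn.cursor() as cur:
    cur.execute("SELECT id, YEAR(birth_date) FROM user WHERE id IN (1, 2)")
    expected = {row[0]: date.today().year - row[1] for row in cur.fetchall()}

# Read the transformed age from the Hive target.
hive_cur = hive_conn.cursor()
hive_cur.execute("SELECT id, age FROM user_hive WHERE id IN (1, 2)")
actual = {row[0]: row[1] for row in hive_cur.fetchall()}

assert actual == expected, f"ETL age mismatch: expected {expected}, got {actual}"
```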
Performance Testing#
Scenario: Test Hive's response time and resource usage under 100 concurrent queries.
Test Case Design:
| Test Case ID | PERF-003 |
|---|---|
| Test Objective | Verify Hive's stability under concurrent queries |
| Preconditions | 1. 100 million sales records loaded into the Hive sales table. 2. Test cluster configuration: 10 nodes (8-core CPU, 32 GB memory each). |
| Test Steps | 1. Use JMeter to create 100 threads, each executing the following query: `SELECT product_category, SUM(amount) FROM sales WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31' GROUP BY product_category;` 2. Monitor the CPU, memory, and GC status of HiveServer2 (via Ganglia or Prometheus). 3. Record the response time of each query and calculate the 90th percentile. |
| Expected Result | All queries execute successfully, with no timeouts or OOM errors. Average response time ≤ 15 seconds; 90th percentile ≤ 20 seconds. Peak CPU usage ≤ 80%, with no sustained memory growth. |
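Where JMeter is not available, an equivalent probe can be scripted directly. The hedged sketch below fires the same query from 100 threads through PyHive and checks the average and 90th-percentile latency against the expected thresholds; the HiveServer2 host and port are placeholders.

```python
# Hedged sketch: 100 concurrent Hive queries with a p90 latency check (placeholder endpoint).
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from pyhive import hive

QUERY = ("SELECT product_category, SUM(amount) FROM sales "
         "WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31' GROUP BY product_category")

def run_query(_):
    conn = hive.connect(host="hiveserver2-host", port=10000)
    cursor = conn.cursor()
    start = time.time()
    cursor.execute(QUERY)
    cursor.fetchall()
    conn.close()
    return time.time() - start

with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(run_query, range(100)))

p90 = statistics.quantiles(latencies, n=10)[-1]   # 90th percentile
print(f"avg={statistics.mean(latencies):.1f}s p90={p90:.1f}s")
assert statistics.mean(latencies) <= 15 and p90 <= 20, "Latency SLA exceeded"
```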
Fault Tolerance Testing#
Scenario: Verify whether the Spark Streaming job can automatically recover when a Worker node crashes.
Test Case Design:
| Test Case ID | FT-004 |
|---|---|
| Test Objective | Test the fault-tolerance capability of Spark jobs |
| Preconditions | 1. The Spark cluster has 3 Worker nodes. 2. A real-time word-count job is running, reading data from Kafka with a window interval of 1 minute. |
| Test Steps | 1. Continuously send data to Kafka (100 records per second). 2. After the job has run for 5 minutes, manually terminate one Worker node. 3. Observe: whether the Driver logs show task rescheduling; whether a new Worker automatically joins the cluster (if dynamic resource allocation is enabled); whether the window results are complete (compare total word counts before and after the failure). |
| Expected Result | The job resumes processing within 30 seconds, with no data loss. The final word-count results match the total amount of data sent. |
Tool Support:
Use Chaos Monkey to simulate node failures.
Security Testing#
Scenario: Verify whether the ACL permission controls on the HDFS directory are effective.
Test Case Design:
| Test Case ID | SEC-005 |
|---|---|
| Test Objective | Ensure that sensitive directories (e.g., `/finance`) are accessible only to authorized users |
| Preconditions | 1. The HDFS directory `/finance` has permissions set to 750 (owning group finance_team). 2. User alice belongs to finance_team; user bob does not. |
| Test Steps | 1. As alice, execute: `hdfs dfs -ls /finance` (should succeed); `hdfs dfs -put report.csv /finance` (should succeed). 2. As bob, execute: `hdfs dfs -ls /finance` (should return "Permission denied"); `hdfs dfs -rm /finance/report.csv` (should fail). |
| Expected Result | The authorized user (alice) can read and write the directory; the unauthorized user (bob) is denied access. |
Automated Verification:
```bash
# Shell script for automated verification.
# Run this as the unauthorized user (e.g., bob): listing /finance should be denied.
if hdfs dfs -ls /finance >/dev/null 2>&1; then
    echo "Test FAILED: unauthorized access was allowed."
else
    echo "Test PASSED: access was denied as expected."
fi
```