Development History#
The emergence of new things is rooted in the resolution of old contradictions and the birth of new ones; in essence, it is a phased result of the movement of contradictions.
Contradictions are an inherent, universal property of things, while development is a dynamic process realized through the continuous cycle in which contradictions are generated, resolved, and regenerated.
Big data technology emerged from the contradiction between storage and computation, and its development has been a continuous tug-of-war between the explosive growth of data volume and the demand for computational efficiency and real-time performance. This contradiction drives ongoing technological innovation.
Data Scale Expansion vs Insufficient Storage and Computing Capacity#
- Essence of the contradiction: Traditional single-machine systems cannot handle exponentially growing data volumes (from TB to PB and even EB scale).
- Technological breakthroughs:
- Distributed storage (e.g., HDFS, cloud storage) distributes data across multiple nodes, removing the single-machine storage bottleneck.
- Distributed computing frameworks (e.g., MapReduce, Spark) raise computational throughput through parallel processing (see the sketch below).
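As a concrete illustration, here is a minimal PySpark word count showing the map-and-reduce pattern these frameworks parallelize across cores and nodes; the input and output paths are hypothetical.

```python
# A minimal PySpark word count: map records to key-value pairs,
# then reduce by key -- the pattern MapReduce-style frameworks parallelize.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# In a real cluster this path would point at distributed storage (hdfs://...).
lines = sc.textFile("data/logs.txt")          # hypothetical input file
counts = (lines
          .flatMap(lambda line: line.split()) # map: line -> words
          .map(lambda word: (word, 1))        # map: word -> (word, 1)
          .reduceByKey(lambda a, b: a + b))   # reduce: sum counts per word
counts.saveAsTextFile("out/wordcounts")       # hypothetical output directory

spark.stop()
```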
Batch Processing Latency vs Real-time Demand#
- Essence of the contradiction: Early Hadoop supported only offline batch processing (hour- or day-level latency), while many business scenarios require second-level or even millisecond-level responses.
- Technological breakthroughs:
- In-memory computing (Spark) reduces disk I/O, improving batch processing speed.
- Stream processing engines (Flink, Kafka Streams) process "data in motion," meeting real-time analysis needs (a sketch follows).
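The engines named above are Flink and Kafka Streams; to stay within this page's Python examples, the hedged sketch below uses Spark Structured Streaming as a stand-in, counting events per minute from a Kafka topic. The broker address and topic name are assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package.

```python
# Sketch: windowed counts over an event stream -- processing data "in motion".
# Spark Structured Streaming stands in here for Flink/Kafka Streams.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Assumed: a Kafka broker at localhost:9092 and a topic named "events".
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Count events per 1-minute window, keyed by the Kafka message key.
counts = (events
          .groupBy(window(col("timestamp"), "1 minute"), col("key"))
          .count())

query = (counts.writeStream
         .outputMode("update")   # emit updated counts as windows change
         .format("console")
         .start())
query.awaitTermination()
```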
Data Diversity vs Uniformity of Processing Paradigms#
- Essence of the contradiction: The complexity of structured, semi-structured, and unstructured data (text, images, etc.) is incompatible with the single processing model of traditional databases.
- Technological breakthroughs:
- Multimodal storage: NoSQL (e.g., MongoDB), data lakes (e.g., Delta Lake) support flexible data models.
- Hybrid computing engines: Spark supports batch processing, stream processing, graph computing, and machine learning, enabling "one-stack" processing (see the sketch below).
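A minimal sketch of the "one-stack" idea: one Spark session ingests semi-structured JSON and immediately queries it with SQL, with MLlib and GraphX available in the same session. The file path and field names are hypothetical.

```python
# One engine, several paradigms: the same Spark session handles semi-structured
# ingestion, SQL analytics, and (via MLlib/GraphX) ML and graph workloads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("one-stack-sketch").getOrCreate()

# Schema is inferred from semi-structured JSON (hypothetical file and fields).
clicks = spark.read.json("data/clicks.json")
clicks.createOrReplaceTempView("clicks")

# The same data is immediately queryable with SQL...
top_products = spark.sql("""
    SELECT product_id, COUNT(*) AS clicks
    FROM clicks
    GROUP BY product_id
    ORDER BY clicks DESC
    LIMIT 10
""")
top_products.show()

# ...and the resulting DataFrame could feed MLlib or GraphX without
# leaving the engine -- which is the "one-stack" point.
spark.stop()
```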
Static Resource Allocation vs Dynamic Elasticity Demand#
- Essence of the contradiction: Fixed cluster resources lead to low utilization and cannot cope with business fluctuations.
- Technological breakthroughs:
- Cloud-native architecture: Kubernetes enables containerized elastic scaling; Serverless platforms (e.g., AWS Lambda) allocate resources on demand, reducing costs.
Centralized Governance vs Distributed Complexity#
- Essence of the contradiction: Data is scattered across multiple systems (databases, data lakes, streaming platforms), making unified management and quality assurance difficult.
- Technological breakthroughs:
- Data Fabric: Achieves cross-platform data collaboration through metadata management (e.g., Apache Atlas) and automated pipelines (e.g., Airflow).
- Lakehouse: Tools like Delta Lake combine the flexibility of data lakes with the governance capabilities of data warehouses (see the sketch below).
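A minimal lakehouse sketch, assuming the delta-spark Python package is installed: Delta Lake adds transactional writes and in-place updates on top of plain lake files. The paths and data are illustrative.

```python
# Sketch: ACID writes and updates over lake storage with Delta Lake.
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder.appName("lakehouse-sketch")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Transactional write to a lake path (local here; S3/HDFS in practice).
df = spark.createDataFrame([(1001, "pending")], ["order_id", "status"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# ACID update in place -- warehouse-style governance over lake files.
orders = DeltaTable.forPath(spark, "/tmp/delta/orders")
orders.update(condition="order_id = 1001", set={"status": "'completed'"})
```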
Summary - Contradictions Drive Innovation#
The evolution of big data technology is essentially a process of "continuously breaking old balances and establishing new balances":
- From Hadoop's "exchanging storage for computation" to Spark's "exchanging memory for speed";
- From the determinism of batch processing to stream processing's handling of unbounded, uncertain data;
- From fixed resource allocation to cloud-native elastic scaling.
Each intensification of contradictions has given birth to new technologies, and future trends (such as edge computing and AI-native data platforms) will continue to revolve around this core contradiction.
Technical System#
Ecosystem Architecture#
Data Collection Layer#
- Goal: Efficiently collect data from multiple sources (databases, logs, sensors, etc.).
- Tools:
- Batch collection: Sqoop (relational database ↔ Hadoop), Flume (log collection).
- Real-time collection: Kafka (distributed message queue; see the producer sketch below), Debezium (CDC change capture).
- Web crawling: Scrapy, Apache Nutch.
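For the real-time path, here is a hedged sketch of producing an event into Kafka with the kafka-python client; the broker address and topic name are assumptions.

```python
# Sketch: pushing a click event into Kafka for real-time collection.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",            # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": "u_12345", "event": "click_product_detail"}
producer.send("user-clicks", value=event)  # hypothetical topic name
producer.flush()                           # block until the broker acknowledges
```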
Data Storage Layer#
- Distributed file systems:
- HDFS: Core storage of the Hadoop ecosystem, suitable for cold data.
- Object storage: AWS S3, Alibaba Cloud OSS (cloud-native scenarios).
- NoSQL databases:
- Key-value: Redis (in-memory cache), DynamoDB (high concurrency).
- Column storage: HBase (massive random read/write), Cassandra (high availability).
- Document-based: MongoDB (flexible JSON structure).
- Data lakes: Delta Lake, Iceberg (lakehouse architecture supporting ACID transactions).
Resource Management and Scheduling Layer#
- Cluster management:
- YARN: Hadoop resource scheduler.
- Kubernetes: Container orchestration, supporting hybrid cloud deployment.
- Workflow engines:
- Airflow: Task dependency management and scheduling (see the DAG sketch below).
- DolphinScheduler: Visual scheduling tool that originated in China.
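A minimal Airflow DAG sketch (Airflow 2.x API assumed): the value is the declared dependency chain, not the placeholder commands.

```python
# Sketch: a daily ETL dependency chain in Airflow. The bash commands are
# placeholders; the point is declaring task order and schedule.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2023, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load  # run order: extract -> transform -> load
```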
Data Computing Layer#
- Batch processing:
- MapReduce: Native Hadoop computing model, suitable for offline tasks.
- Spark SQL: Compatible with SQL syntax, optimizing complex ETL processes.
- Stream processing:
- Flink: Low-latency stream processing, supporting state management and window computation.
- Spark Streaming: Micro-batch processing, seamlessly integrated with the Spark ecosystem.
- Interactive querying:
- Presto: Federated querying across multiple data sources, suitable for ad-hoc analysis.
- ClickHouse: Columnar storage, ultra-fast response in OLAP scenarios.
Data Analysis and Mining Layer#
- Machine learning:
- Spark MLlib: Distributed machine learning library (see the pipeline sketch after this list).
- TensorFlow/PyTorch: Deep learning frameworks, integrated with big data platforms (e.g., Horovod).
- Data visualization:
- Tableau/Power BI: Business intelligence tools.
- Superset/Grafana: Open-source visualization dashboards.
- Graph computing: Neo4j (graph database), GraphX (Spark graph processing library).
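A hedged MLlib sketch: assembling features and fitting a model as a distributed pipeline. The columns and toy data are hypothetical.

```python
# Sketch: a distributed training pipeline with Spark MLlib.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy rows: (age, total_spend, label) -- hypothetical features.
df = spark.createDataFrame(
    [(28, 299.0, 1.0), (35, 1053.5, 0.0)],
    ["age", "total_spend", "label"],
)

# Assemble raw columns into the feature vector MLlib estimators expect.
assembler = VectorAssembler(inputCols=["age", "total_spend"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)  # training runs distributed
model.transform(df).select("label", "prediction").show()
```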
Data Governance and Security#
- Metadata management: Apache Atlas (data lineage tracking).
- Data quality: Great Expectations (data validation framework).
- Security compliance: Kerberos (authentication), Ranger (permission control), GDPR compliance tools.
Data Layering#
In big data architecture, data layering is a design method that divides data into different levels based on processing stages, purposes, and access needs, aiming to improve data management efficiency, reduce redundancy, optimize performance, and support diverse analysis scenarios.
Raw Data Layer#
Data operation layer: Operational Data Store, ODS. The data preparation/staging area, also known as the source layer.
Input table: None (directly connected to data sources)
Output table: Raw Data Tables
Example:
- User Click Log Table (ods_user_click_log)
```json
{
  "timestamp": "2023-10-01T14:22:35+08:00",
  "user_id": "u_12345",
  "event": "click_product_detail",
  "device": "Android 12|Xiaomi 13 Pro",
  "ip": "192.168.1.100",
  "extra": "{'product_id':'p_678', 'page_num':3}"
}
```
- MySQL Order Table Snapshot (ods_order_mysql)
order_id | user_id | amount | currency | create_time | status |
---|---|---|---|---|---|
1001 | u_123 | 299.00 | CNY | 2023-10-01 14:25:00 | pending |
1002 | u_456 | 150.50 | USD | 2023-10-01 14:30:00 | completed |
Characteristics: Retains all fields of the raw data, including redundant and uncleaned information.
Cleaning and Standardization Layer#
Input table: ods_user_click_log, ods_order_mysql
Output table: Cleaned structured table
Example:
- Standardized Click Log Table (cleaned_user_click)
log_id | event_time | user_id | event_type | device_os | device_model | ip_hash | product_id | page_num |
---|---|---|---|---|---|---|---|---|
1 | 2023-10-01 14:22:35 | 12345 | product_detail | Android | Xiaomi 13 Pro | a1b2c3d4 | p_678 | 3 |
Processing logic (see the PySpark sketch after this example list):
- Parse the extra field's JSON to extract product_id and page_num.
- Standardize user_id to pure numbers (removing the u_ prefix).
- Hash and anonymize the ip field.
- Split the device field into operating system and device model.
- Unified Order Table (cleaned_order)
order_id | user_id | amount_cny | create_time | status_code |
---|---|---|---|---|
1001 | 123 | 299.00 | 2023-10-01 14:25:00 | 1 |
1002 | 456 | 1053.50 | 2023-10-01 14:30:00 | 2 |
Processing logic:
- Convert currency to CNY (assuming 1 USD = 7.0 CNY).
- Map status codes (pending→1, completed→2).
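A hedged PySpark sketch of the cleaning logic above. The source locations are hypothetical, and because the raw extra field uses single quotes, it is normalized before JSON parsing.

```python
# Sketch: the ODS -> cleaned-layer transformations described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, from_json, regexp_replace, sha2,
                                   split, when)

spark = SparkSession.builder.appName("cleaning-sketch").getOrCreate()
raw = spark.read.json("ods/user_click_log")  # hypothetical ODS location

# The raw extra field uses single quotes; normalize it before parsing.
extra = from_json(regexp_replace(col("extra"), "'", '"'),
                  "product_id STRING, page_num INT")

cleaned_click = (raw
    .withColumn("extra", extra)
    .select(
        col("timestamp").cast("timestamp").alias("event_time"),
        # Strip the "u_" prefix to get a numeric user id.
        regexp_replace(col("user_id"), "^u_", "").cast("long").alias("user_id"),
        regexp_replace(col("event"), "^click_", "").alias("event_type"),
        # device is "OS|model"; split it into two columns.
        split(col("device"), r"\|").getItem(0).alias("device_os"),
        split(col("device"), r"\|").getItem(1).alias("device_model"),
        sha2(col("ip"), 256).alias("ip_hash"),   # anonymize the IP
        col("extra.product_id"),
        col("extra.page_num"),
    ))

# Order standardization: currency conversion and status-code mapping.
raw_orders = spark.read.table("ods_order_mysql")  # hypothetical table name
cleaned_order = (raw_orders
    .withColumn("amount_cny",
                when(col("currency") == "USD", col("amount") * 7.0)
                .otherwise(col("amount")))
    .withColumn("status_code",
                when(col("status") == "pending", 1).otherwise(2)))
```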
Integration and Modeling Layer#
Data detail layer: Data Warehouse Detail, DWD / dimensional model.
This layer isolates the business systems from the data warehouse, keeping the same data granularity as the ODS layer; it mainly performs cleaning and standardization on ODS data, such as removing empty records, dirty data, and outliers.
Data intermediate layer: Data Warehouse Middle, DWM.
This layer applies light aggregation on top of the DWD layer, producing intermediate result tables that improve the reusability of common metrics and reduce redundant processing.
Input table: cleaned_user_click, cleaned_order
Output table: Dimension table + Fact table
Example:
- Dimension Table: User Dimension (dim_user)
user_id | name | gender | age | reg_date | vip_level |
---|---|---|---|---|---|
123 | Zhang San | M | 28 | 2022-01-01 | 2 |
456 | Li Si | F | 35 | 2021-05-15 | 3 |
- Fact Table: Order Transaction Fact Table (fact_order)
order_id | user_id | product_id | amount | order_time | payment_time |
---|---|---|---|---|---|
1001 | 123 | p_678 | 299.00 | 2023-10-01 14:25:00 | 2023-10-01 14:26:05 |
1002 | 456 | p_901 | 1053.50 | 2023-10-01 14:30:00 | 2023-10-01 14:31:20 |
Modeling logic:
- Join the fact table with the dimension table on user_id, supporting scenarios such as "analyzing order amounts by gender" (see the sketch below).
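A minimal sketch of the join described above, assuming fact_order and dim_user are registered in the session's catalog.

```python
# Sketch: fact -> dimension lookup to analyze order amounts by gender.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count, sum as sum_

spark = SparkSession.builder.appName("modeling-sketch").getOrCreate()

fact_order = spark.read.table("fact_order")  # assumed catalog tables
dim_user = spark.read.table("dim_user")

gender_spend = (fact_order
    .join(dim_user, on="user_id", how="left")   # enrich facts with dimensions
    .groupBy("gender")
    .agg(sum_("amount").alias("total_amount"),
         count("order_id").alias("order_count"),
         avg("amount").alias("avg_amount")))
gender_spend.show()
```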
Summary and Aggregation Layer#
Data service layer: Data Warehouse Service, DWS / data mart.
Based on DWM data, this layer integrates and summarizes data for a specific subject area, generally as wide tables used for downstream business queries, OLAP analysis, and data distribution.
This layer usually contains relatively few tables; each table covers a broad slice of the business, and because of their many fields these tables are often called wide tables.
Input table: fact_order, dim_user
Output table: Pre-aggregated wide table
Example:
- Daily User Spending Summary Table (dws_user_daily_spend)
date | user_id | gender | total_amount | order_count | avg_amount |
---|---|---|---|---|---|
2023-10-01 | 123 | M | 299.00 | 1 | 299.00 |
2023-10-01 | 456 | F | 1053.50 | 1 | 1053.50 |
Calculation logic (see the sketch below):
- Aggregate each user's total order amount, order count, and average amount by day.
- Join with the dimension table to obtain the gender field, supporting quick generation of reports comparing spending by gender.
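A hedged sketch building dws_user_daily_spend with a SQL aggregation run through PySpark; the input tables are assumed to be registered in the catalog.

```python
# Sketch: the daily pre-aggregated wide table, built with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dws-sketch").getOrCreate()

dws_user_daily_spend = spark.sql("""
    SELECT
        to_date(f.order_time)  AS date,
        f.user_id,
        d.gender,                         -- joined in from the dimension
        SUM(f.amount)          AS total_amount,
        COUNT(f.order_id)      AS order_count,
        AVG(f.amount)          AS avg_amount
    FROM fact_order f
    LEFT JOIN dim_user d ON f.user_id = d.user_id
    GROUP BY to_date(f.order_time), f.user_id, d.gender
""")
dws_user_daily_spend.write.mode("overwrite").saveAsTable("dws_user_daily_spend")
```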
Application and Service Layer#
Data application layer: Application Data Service, ADS.
This layer mainly serves data products and data analysis; it is generally stored in systems such as Elasticsearch, Redis, or PostgreSQL for online use, and may also be kept in Hive or Druid for analysis and mining, e.g., commonly used data reports are stored here.
Input table: dws_user_daily_spend
Output table: Business interface or report
Example:
- BI Report Data (ads_bi_gender_spend)
Date | Gender | Total Spending | Order Count |
---|---|---|---|
2023-10-01 | Male | 299.00 | 1 |
2023-10-01 | Female | 1053.50 | 1 |
- API Response (User Profile Interface)
```json
{
  "user_id": 123,
  "last_purchase_date": "2023-10-01",
  "total_spend_7d": 299.00,
  "favorite_category": "Electronics"
}
```
Characteristics: Highly aggregated data, field naming conforms to business terminology, can be directly used for display or decision-making.
ETL#
Core Definition#
ETL is the standardized process of moving data from source systems into target storage, consisting of three stages (a minimal end-to-end sketch follows the list):
- Extract: Extract raw data from heterogeneous data sources.
- Transform: Clean, standardize, and process data.
- Load: Write the processed data into target storage.
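As a toy end-to-end illustration (not any particular tool's API), the sketch below runs all three stages with pandas, using a CSV export and SQLite as stand-ins for the source and target systems; the file names are hypothetical.

```python
# Sketch: a minimal extract-transform-load pass with pandas and SQLite
# standing in for the source and target systems.
import sqlite3

import pandas as pd

# Extract: pull raw rows from a source (here, a hypothetical CSV export).
raw = pd.read_csv("source/orders.csv")

# Transform: clean and standardize, mirroring the order example above
# (currency conversion at an assumed 1 USD = 7.0 CNY, status-code mapping).
raw["amount_cny"] = raw.apply(
    lambda r: r["amount"] * 7.0 if r["currency"] == "USD" else r["amount"],
    axis=1,
)
raw["status_code"] = raw["status"].map({"pending": 1, "completed": 2})
clean = raw[["order_id", "user_id", "amount_cny", "status_code"]]

# Load: write the processed rows into the target store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("cleaned_order", conn, if_exists="replace", index=False)
```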
Technology Stack and Tools#
Stage | Typical Tools |
---|---|
Extract | Sqoop, Flume, Kafka, Debezium (CDC), AWS Glue |
Transform | Spark, Flink, dbt, Python Pandas, SQL |
Load | Hive, HBase, ClickHouse, Snowflake, Redis, Elasticsearch |
References#
Detailed Explanation of Data Layering in Data Warehousing: ODS, DWD, DWM, DWS, ADS