Big Data - Overview of Big Data Technologies

Development History#

The emergence of new things is rooted in the resolution of old contradictions and the birth of new ones; in essence, it is a phased result of the movement of contradictions.

Contradiction is an inherent, universal property of things, and development is a dynamic process driven by the continuous cycle in which contradictions are generated, resolved, and generated anew.

Big data technology emerged, in essence, from the contradiction between storage and computation, and its development has been a continuous tug-of-war between the explosive growth of data volumes and the demand for computational efficiency and real-time performance. This contradiction drives ongoing technological innovation.

Data Scale Expansion vs Insufficient Storage and Computing Capacity#

  • Essence of the contradiction: Traditional single-machine systems cannot handle exponentially growing data volumes (from TB to PB and even EB scale).
  • Technological breakthroughs:
    • Distributed storage (e.g., HDFS, cloud storage) disperses data across multiple nodes, solving storage bottlenecks.
    • Distributed computing frameworks (e.g., MapReduce, Spark) enhance computational throughput through parallel processing.
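
As a minimal sketch of the parallel model behind these frameworks, the PySpark word count below splits the input into partitions, and the map and reduce steps run on each partition in parallel across the cluster; the paths and application name are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# The input is split into partitions; each partition is mapped and
# reduced in parallel on whichever node holds its data.
counts = (sc.textFile("hdfs:///data/logs/*.txt")          # illustrative path
            .flatMap(lambda line: line.split())           # line -> words
            .map(lambda word: (word, 1))                  # word -> (word, 1)
            .reduceByKey(lambda a, b: a + b))             # sum counts per word
counts.saveAsTextFile("hdfs:///data/wordcounts")          # illustrative path
spark.stop()
```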

Batch Processing Latency vs Real-time Demand#

  • Essence of the contradiction: Early Hadoop supported only offline batch processing (hour- or day-level latency), while business scenarios increasingly demand second- or even millisecond-level responses.
  • Technological breakthroughs:
    • In-memory computing (Spark) reduces disk I/O, improving batch processing speed.
    • Stream processing engines (Flink, Kafka Streams) achieve "data processing in motion," meeting real-time analysis needs.
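
To make "data processing in motion" concrete, here is a minimal framework-free sketch (not the Flink or Kafka Streams APIs): events are counted in tumbling one-second windows, and each window's result is emitted as soon as the stream moves past it, rather than after the whole dataset has been collected batch-style. All names are illustrative.

```python
from collections import Counter

def tumbling_window_counts(events, window_ms=1000):
    """Count events per tumbling window as they stream in.

    `events` is any iterable of (timestamp_ms, key) pairs in arrival order;
    a window's result is yielded as soon as the stream moves past it.
    """
    current_window, counts = None, Counter()
    for ts, key in events:
        window = ts // window_ms
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)   # emit the finished window
            counts = Counter()
        current_window = window
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)       # flush the last window

# Example: three events arriving over two seconds.
stream = [(100, "click"), (900, "click"), (1500, "view")]
for window, result in tumbling_window_counts(stream):
    print(window, result)   # 0 {'click': 2}, then 1 {'view': 1}
```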

Data Diversity vs Uniformity of Processing Paradigms#

  • Essence of the contradiction: The complexity of structured, semi-structured, and unstructured data (text, images, etc.) is incompatible with the single processing model of traditional databases.
  • Technological breakthroughs:
    • Multimodal storage: NoSQL (e.g., MongoDB), data lakes (e.g., Delta Lake) support flexible data models.
    • Hybrid computing engines: Spark supports batch processing, stream processing, graph computing, and machine learning, achieving "one-stack" processing.
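
A brief pymongo sketch of the flexible document model MongoDB provides: two differently shaped records live in the same collection with no schema migration. The connection string, database, and collection names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
products = client["shop"]["products"]               # illustrative db/collection

# Two documents with different shapes coexist in one collection --
# no ALTER TABLE needed when a new attribute appears.
products.insert_many([
    {"sku": "p_678", "name": "Phone", "specs": {"ram_gb": 12}},
    {"sku": "p_901", "name": "T-shirt", "sizes": ["S", "M", "L"]},
])
for doc in products.find({"sku": "p_678"}):
    print(doc)
```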

Static Resource Allocation vs Dynamic Elasticity Demand#

  • Essence of the contradiction: Fixed cluster resources lead to low utilization and cannot cope with business fluctuations.
  • Technological breakthroughs:
    • Cloud-native architecture: Kubernetes provides containerized elastic scaling, while serverless platforms (e.g., AWS Lambda) allocate resources on demand, reducing costs.

Centralized Governance vs Distributed Complexity#

  • Essence of the contradiction: Data is scattered across multiple systems (databases, data lakes, streaming platforms), making unified management and quality assurance difficult.
  • Technological breakthroughs:
    • Data Fabric: Achieves cross-platform data collaboration through metadata management (e.g., Apache Atlas) and automated pipelines (e.g., Airflow).
    • Lakehouse: Tools like Delta Lake integrate the flexibility of data lakes with the governance capabilities of data warehouses.

Summary - Contradictions Drive Innovation#

The evolution of big data technology is essentially a process of "continuously breaking old balances and establishing new balances":

  • From Hadoop's "trading storage for computation" to Spark's "trading memory for speed";
  • From the determinism of batch processing to stream processing's handling of uncertainty;
  • From fixed resource allocation to cloud-native elastic scaling.

Each intensification of contradictions has given birth to new technologies, and future trends (such as edge computing and AI-native data platforms) will continue to revolve around this core contradiction.


Technical System#

Ecological Architecture#

[Figure: big data ecosystem architecture diagram]

Data Collection Layer#

  • Goal: Efficiently collect data from multiple sources (databases, logs, sensors, etc.).
  • Tools:
    • Batch collection: Sqoop (relational database ↔ Hadoop), Flume (log collection).
    • Real-time collection: Kafka (distributed message queue), Debezium (CDC change capture).
    • Web scraping: Scrapy, Apache Nutch (web data scraping).
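
As a small example of the real-time collection path, the sketch below publishes a click event to Kafka with the kafka-python client; the broker address and topic name are assumptions.

```python
import json
from kafka import KafkaProducer  # kafka-python package, assumed installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",              # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Each click event becomes a message on the "user_clicks" topic (illustrative
# name); downstream stream processors consume it with low latency.
producer.send("user_clicks", {"user_id": "u_12345", "event": "click_product_detail"})
producer.flush()
```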

Data Storage Layer#

  • Distributed file systems:
    • HDFS: Core storage of the Hadoop ecosystem, suitable for cold data.
    • Object storage: AWS S3, Alibaba Cloud OSS (cloud-native scenarios).
  • NoSQL databases:
    • Key-value: Redis (in-memory cache; see the sketch after this list), DynamoDB (high concurrency).
    • Column storage: HBase (massive random read/write), Cassandra (high availability).
    • Document-based: MongoDB (flexible JSON structure).
  • Data lakes: Delta Lake, Iceberg (lake-house architecture supporting ACID transactions).
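
The key-value sketch referenced above, using redis-py against an assumed local instance; key names, fields, and the expiry are illustrative.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed local instance

# Key-value model: one hash per user, O(1) reads for hot profile data.
r.hset("user:123", mapping={"name": "Zhang San", "vip_level": 2})
print(r.hgetall("user:123"))     # {'name': 'Zhang San', 'vip_level': '2'}
r.expire("user:123", 3600)       # cache entries can age out after an hour
```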

Resource Management and Scheduling Layer#

  • Cluster management:
    • YARN: Hadoop resource scheduler.
    • Kubernetes: Container orchestration, supporting hybrid cloud deployment.
  • Workflow engines:
    • Airflow: Task dependency management and scheduling.
    • DolphinScheduler: A visual workflow scheduler developed in China.
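
A minimal Airflow sketch of task dependency management: transform runs only after extract succeeds, once per day. This assumes Airflow 2.4+ (where the schedule argument is available); the DAG and task names are illustrative.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...   # pull data from the source system

def transform():
    ...   # clean and aggregate

# A minimal daily pipeline: `transform` only runs after `extract` succeeds.
with DAG(dag_id="daily_etl", start_date=datetime(2023, 10, 1),
         schedule="@daily", catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform   # task dependency
```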

Data Computing Layer#

  • Batch processing:
    • MapReduce: Native Hadoop computing model, suitable for offline tasks.
    • Spark SQL: Compatible with SQL syntax, optimizing complex ETL processes (see the sketch after this list).
  • Stream processing:
    • Flink: Low-latency stream processing, supporting state management and window computation.
    • Spark Streaming: Micro-batch processing, seamlessly integrated with the Spark ecosystem.
  • Interactive querying:
    • Presto: Federated querying across multiple data sources, suitable for ad-hoc analysis.
    • ClickHouse: Columnar storage, ultra-fast response in OLAP scenarios.
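
The Spark SQL sketch referenced above: register a view over raw JSON files and express an ETL aggregation in plain SQL; the paths and table name are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Register a table over raw JSON files and aggregate with plain SQL.
orders = spark.read.json("s3a://bucket/orders/")    # illustrative path
orders.createOrReplaceTempView("orders")
daily = spark.sql("""
    SELECT to_date(create_time) AS dt, SUM(amount) AS total_amount
    FROM orders
    GROUP BY to_date(create_time)
""")
daily.write.mode("overwrite").parquet("s3a://bucket/daily_totals/")
spark.stop()
```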

Data Analysis and Mining Layer#

  • Machine learning:
    • Spark MLlib: Distributed machine learning library (see the sketch after this list).
    • TensorFlow/PyTorch: Deep learning frameworks, integrated with big data platforms (e.g., via Horovod).
  • Data visualization:
    • Tableau/Power BI: Business intelligence tools.
    • Superset/Grafana: Open-source visualization dashboards.
  • Graph computing: Neo4j (graph database), GraphX (Spark graph processing library).
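
The Spark MLlib sketch referenced above trains a toy logistic regression: features are assembled into a vector column and the fit runs on the cluster. The data and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy training set: spend and order count as features, churn flag as label.
df = spark.createDataFrame(
    [(299.0, 1.0, 0.0), (1053.5, 2.0, 0.0), (0.0, 0.0, 1.0)],
    ["total_spend", "order_count", "label"],
)
features = VectorAssembler(inputCols=["total_spend", "order_count"],
                           outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)
spark.stop()
```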

Data Governance and Security#

  • Metadata management: Apache Atlas (data lineage tracking).
  • Data quality: Great Expectations (data validation framework).
  • Security compliance: Kerberos (authentication), Ranger (permission control), GDPR compliance tools.
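
A small data-quality check with Great Expectations' classic pandas-style API (newer releases have reorganized this interface); the frame and expectations are illustrative.

```python
import pandas as pd
import great_expectations as ge

# Wrap a pandas frame so expectation methods become available.
df = ge.from_pandas(pd.DataFrame({"user_id": [123, 456],
                                  "amount": [299.0, 1053.5]}))

# Each expectation returns a result dict with a `success` flag.
print(df.expect_column_values_to_not_be_null("user_id")["success"])   # True
print(df.expect_column_values_to_be_between("amount", 0, 1_000_000)["success"])
```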

Data Layering#

In big data architecture, data layering is a design method that divides data into levels according to processing stage, purpose, and access needs. It aims to improve data management efficiency, reduce redundancy, optimize performance, and support diverse analysis scenarios.

[Figure: data warehouse layering diagram (ODS / DWD / DWM / DWS / ADS)]

Raw Data Layer#

Data operation layer: Operational Data Store (ODS), the data staging area, also known as the source layer.

Input table: None (directly connected to data sources)
Output table: Raw Data Tables
Example:

  • User Click Log Table (ods_user_click_log)
```json
{
  "timestamp": "2023-10-01T14:22:35+08:00",
  "user_id": "u_12345",
  "event": "click_product_detail",
  "device": "Android 12|Xiaomi 13 Pro",
  "ip": "192.168.1.100",
  "extra": "{'product_id':'p_678', 'page_num':3}"
}
```

  • MySQL Order Table Snapshot (ods_order_mysql)

| order_id | user_id | amount | currency | create_time         | status    |
|----------|---------|--------|----------|---------------------|-----------|
| 1001     | u_123   | 299.00 | CNY      | 2023-10-01 14:25:00 | pending   |
| 1002     | u_456   | 150.50 | USD      | 2023-10-01 14:30:00 | completed |

Characteristics: Retains all fields of the raw data, including redundant and uncleaned information.


Cleaning and Standardization Layer#

Input table: ods_user_click_log, ods_order_mysql
Output table: Cleaned structured table
Example:

  • Standardized Click Log Table (cleaned_user_click)
| log_id | event_time          | user_id | event_type     | device_os | device_model  | ip_hash  | product_id | page_num |
|--------|---------------------|---------|----------------|-----------|---------------|----------|------------|----------|
| 1      | 2023-10-01 14:22:35 | 12345   | product_detail | Android   | Xiaomi 13 Pro | a1b2c3d4 | p_678      | 3        |

Processing logic (a pandas sketch of these rules appears at the end of this subsection):

  • Parse the extra field's JSON to extract product_id and page_num.
  • Standardize user_id to pure numbers (removing the u_ prefix).
  • Hash and anonymize the ip field.
  • Split the device field into operating system and device model.

  • Unified Order Table (cleaned_order)

| order_id | user_id | amount_cny | create_time         | status_code |
|----------|---------|------------|---------------------|-------------|
| 1001     | 123     | 299.00     | 2023-10-01 14:25:00 | 1           |
| 1002     | 456     | 1053.50    | 2023-10-01 14:30:00 | 2           |

Processing logic:

  • Convert currency to CNY (assuming 1 USD = 7.0 CNY).
  • Map status codes (pending→1, completed→2).
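
The pandas sketch of the click-log cleaning rules above (pandas ≥ 1.4 for str.removeprefix). Everything mirrors the example record; the hash truncation length is an illustrative choice.

```python
import ast
import hashlib
import pandas as pd

# One raw record from ods_user_click_log, as shown above.
raw = pd.DataFrame([{
    "timestamp": "2023-10-01T14:22:35+08:00",
    "user_id": "u_12345",
    "event": "click_product_detail",
    "device": "Android 12|Xiaomi 13 Pro",
    "ip": "192.168.1.100",
    "extra": "{'product_id':'p_678', 'page_num':3}",
}])

cleaned = pd.DataFrame()
cleaned["event_time"] = pd.to_datetime(raw["timestamp"]).dt.strftime("%Y-%m-%d %H:%M:%S")
cleaned["user_id"] = raw["user_id"].str.removeprefix("u_").astype(int)  # drop the u_ prefix
cleaned["event_type"] = raw["event"].str.removeprefix("click_")
# Split "OS version|model" into operating system and device model.
device = raw["device"].str.split("|", expand=True)
cleaned["device_os"] = device[0].str.split().str[0]
cleaned["device_model"] = device[1]
# Anonymize the IP with a truncated hash (truncation length is illustrative).
cleaned["ip_hash"] = raw["ip"].map(lambda ip: hashlib.sha256(ip.encode()).hexdigest()[:8])
# The raw extra field is single-quoted pseudo-JSON, hence ast.literal_eval.
extra = raw["extra"].map(ast.literal_eval)
cleaned["product_id"] = extra.map(lambda d: d["product_id"])
cleaned["page_num"] = extra.map(lambda d: d["page_num"])
```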

Integration and Modeling Layer#

Data detail layer: Data Warehouse Detail (DWD), dimensional model.
This layer isolates the data warehouse from the business systems and keeps the same data granularity as the ODS layer; it mainly applies cleaning and standardization to ODS data, such as removing null records, dirty data, and outliers.
Data intermediate layer: Data Warehouse Middle (DWM).
This layer performs light aggregation on top of the DWD layer, producing intermediate result tables that improve the reusability of common metrics and reduce redundant computation.

Input table: cleaned_user_click, cleaned_order
Output table: Dimension table + Fact table
Example:

  • Dimension Table: User Dimension (dim_user)
| user_id | name      | gender | age | reg_date   | vip_level |
|---------|-----------|--------|-----|------------|-----------|
| 123     | Zhang San | M      | 28  | 2022-01-01 | 2         |
| 456     | Li Si     | F      | 35  | 2021-05-15 | 3         |

  • Fact Table: Order Transaction Fact Table (fact_order)

| order_id | user_id | product_id | amount  | order_time          | payment_time        |
|----------|---------|------------|---------|---------------------|---------------------|
| 1001     | 123     | p_678      | 299.00  | 2023-10-01 14:25:00 | 2023-10-01 14:26:05 |
| 1002     | 456     | p_901      | 1053.50 | 2023-10-01 14:30:00 | 2023-10-01 14:31:20 |

Modeling logic:

  • Join the fact table to the dimension table on user_id, supporting scenarios such as "analyzing order amounts by gender."

Summary and Aggregation Layer#

Data service layer: Data Warehouse Service (DWS), also called the data mart layer.
Built on the foundational data from DWM, this layer integrates and summarizes data around a specific subject area, generally as wide tables used for downstream business queries, OLAP analysis, data distribution, and so on.
This layer usually contains relatively few tables; each one covers a broad slice of the business, and because of their many fields these tables are commonly called wide tables.

Input table: fact_order, dim_user
Output table: Pre-aggregated wide table
Example:

  • Daily User Spending Summary Table (dws_user_daily_spend)
| date       | user_id | gender | total_amount | order_count | avg_amount |
|------------|---------|--------|--------------|-------------|------------|
| 2023-10-01 | 123     | M      | 299.00       | 1           | 299.00     |
| 2023-10-01 | 456     | F      | 1053.50      | 1           | 1053.50    |

Calculation logic:

  • Aggregate each user's total order amount, order count, and average amount by day.
  • Join with the dimension table to obtain the gender field, so that "spending by gender" comparison reports can be generated quickly (see the sketch below).
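
A pandas sketch of this calculation (the logic described above, not a production job): aggregate fact_order per day and user, then join dim_user for gender. The toy frames mirror the tables shown earlier.

```python
import pandas as pd

# Toy frames mirroring fact_order and dim_user above.
fact_order = pd.DataFrame({
    "user_id": [123, 456],
    "amount": [299.00, 1053.50],
    "order_time": pd.to_datetime(["2023-10-01 14:25:00", "2023-10-01 14:30:00"]),
})
dim_user = pd.DataFrame({"user_id": [123, 456], "gender": ["M", "F"]})

# One row per (date, user): total, count, and average spend, plus gender.
dws_user_daily_spend = (
    fact_order.assign(date=fact_order["order_time"].dt.date)
              .groupby(["date", "user_id"])["amount"]
              .agg(total_amount="sum", order_count="count", avg_amount="mean")
              .reset_index()
              .merge(dim_user, on="user_id")
)
print(dws_user_daily_spend)
```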

Application and Service Layer#

Data application layer: Application Data Service (ADS).
This layer serves data products and analysis directly. Data is generally stored in systems such as Elasticsearch, Redis, or PostgreSQL for online use, and may also be kept in Hive or Druid for analysis and mining; commonly used reports, for example, live here.

Input table: dws_user_daily_spend
Output table: Business interface or report
Example:

  • BI Report Data (ads_bi_gender_spend)
| Date       | Gender | Total Spending | Order Count |
|------------|--------|----------------|-------------|
| 2023-10-01 | Male   | 299.00         | 1           |
| 2023-10-01 | Female | 1053.50        | 1           |
  • API Response (User Profile Interface)
```json
{
  "user_id": 123,
  "last_purchase_date": "2023-10-01",
  "total_spend_7d": 299.00,
  "favorite_category": "Electronics"
}
```

Characteristics: Highly aggregated data, field naming conforms to business terminology, can be directly used for display or decision-making.


ETL#

Core Definition#

ETL is the standardized process of moving data from source systems to target storage, consisting of three stages:

  1. Extract: Extract raw data from heterogeneous data sources.
  2. Transform: Clean, standardize, and process data.
  3. Load: Write the processed data into target storage.

Technology Stack and Tools#

| Stage     | Typical Tools                                            |
|-----------|----------------------------------------------------------|
| Extract   | Sqoop, Flume, Kafka, Debezium (CDC), AWS Glue            |
| Transform | Spark, Flink, dbt, Python Pandas, SQL                    |
| Load      | Hive, HBase, ClickHouse, Snowflake, Redis, Elasticsearch |
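
A toy end-to-end ETL in pandas to tie the three stages together; the file paths, column names, and the fixed exchange rate are all assumptions for illustration.

```python
import pandas as pd

# Extract: read raw rows (a local CSV standing in for Sqoop/Kafka sources).
raw = pd.read_csv("orders_raw.csv")                      # hypothetical path

# Transform: clean, standardize, and derive fields.
raw = raw.dropna(subset=["order_id", "user_id"])         # drop incomplete rows
raw["create_time"] = pd.to_datetime(raw["create_time"])
raw["amount_cny"] = raw["amount"].where(raw["currency"] == "CNY",
                                        raw["amount"] * 7.0)  # assumed rate

# Load: write to a columnar target (Parquet standing in for Hive/ClickHouse).
raw.to_parquet("orders_cleaned.parquet", index=False)    # hypothetical path
```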

References#

Detailed Explanation of Data Layering in Data Warehousing: ODS, DWD, DWM, DWS, ADS
