OpenSource

[DaaS] Dremio

아르비스 2019. 2. 28. 11:39


Data as a Service Platform

  • https://www.dremio.com/
  • Open source (Community edition / Enterprise edition)
  • License : Apache 2.0
  • Main Language : Java
  • Reference documents
  • Document Site
  • Open source components used
    • Apache Arrow (in-memory analytics)
    • Apache Drill (SQL execution engine)
    • Apache Parquet (columnar file format optimized for analytics)
    • Apache Calcite (SQL parser)
    • React (front-end JavaScript framework)
    • Apache ZooKeeper (distributed key-value store)
    • RocksDB (embedded key-value store)
    • Ansible (CI/CD automation framework)
    • Git (source control, GitHub)
    • Gerrit (code collaboration tool, code review)


Other Information

  • Headquarters Regions
    San Francisco Bay Area, Silicon Valley, West Coast
  • Founded Date
    Jun 9, 2015
  • Founders
    Jacques Nadeau, Tomer Shiran
  • Funding Status
    Early Stage Venture
  • Number of Employees
    11-50


Dremio provides a quantum leap in performance, based on four areas of innovation.

  • Apache Arrow Execution

    From 1 to 1000+ nodes, architected for cloud deployments: elastic compute, runs on object stores.

  • Data Reflections™

    Accelerate data and queries automatically, up to 1000x faster, with the full power of relational algebra.

  • Native Push-Downs

    Optimized query semantics for each data source – Amazon S3, ADLS, RDBMS, NoSQL, HDFS, and more.

  • Vertically Integrated Query Engine

    Cost-based query planner automatically generates query plans to make optimal use of Data Reflections™ and push downs.






◎ Key Features

  • Data Acceleration. Using columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence of source data that is optimized for one or more query workloads through partitioning, sorting, aggregations, projections, and distributions.
  • Data Catalog. A searchable index of your data source metadata, as well as virtual datasets created by Dremio users.
  • Integrated Data Curation. Through a powerful and intuitive GUI, easy for business users, yet sufficiently powerful for your data engineers, and fully integrated into Dremio.
  • Push-Downs On Any Data Source. Including optimized push downs and parallel connectivity to relational databases, non-relational systems like MongoDB, Elasticsearch, as well as S3 and HDFS.
  • Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.
  • Data Lineage. Full visibility into your data lineage, from your data sources, through transformations, joining with other data sources, and sharing with other users.
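
The push-down and cross-source join features above can be illustrated with a toy sketch. This is not Dremio code or its API: the `orders` table, the `customers` documents, and both function names are hypothetical, with an in-memory SQLite database standing in for a relational source and a list of dicts standing in for a document store. The idea shown is that the filter is evaluated by the source itself, so only matching rows are pulled into the join.

```python
import sqlite3

# Hypothetical relational source: an in-memory SQLite table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 300.0), (3, 10, 120.0), (4, 12, 80.0)],
)

# Hypothetical non-relational source: customer documents held as dicts.
customers = [
    {"_id": 10, "name": "Kim", "region": "APAC"},
    {"_id": 11, "name": "Lee", "region": "EMEA"},
    {"_id": 12, "name": "Park", "region": "APAC"},
]


def scan_orders(min_amount):
    """Push the filter down: the source evaluates the WHERE clause itself."""
    cur = conn.execute(
        "SELECT order_id, customer_id, amount FROM orders "
        "WHERE amount >= ? ORDER BY order_id",
        (min_amount,),
    )
    return cur.fetchall()


def join_orders_with_customers(min_amount):
    """Hash join between the filtered relational rows and the document source."""
    by_id = {doc["_id"]: doc for doc in customers}
    joined = []
    for order_id, customer_id, amount in scan_orders(min_amount):
        doc = by_id.get(customer_id)
        if doc is not None:
            joined.append((order_id, doc["name"], doc["region"], amount))
    return joined


result = join_orders_with_customers(100.0)
```

Only two of the four order rows leave the relational source here; without the push-down, all four would be fetched and then filtered by the engine.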


Terminology & Concepts

Data Reflections™
Physically optimized representations of source data that both offload operational systems, and optimize one or more analytical workloads. Reflections are transparent to end users, and automatically substituted by Dremio’s query planner. Reflections have a configurable TTL SLA, so you can trade off freshness and query latency.
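
The freshness versus latency trade-off can be sketched as follows. This is a simplified model under stated assumptions, not Dremio's implementation: the `Reflection` class, its fields, and the fall-back-to-source rule are all hypothetical. A precomputed aggregate answers the query while it is within its TTL, and the planner stand-in goes back to a full scan once it has expired.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    """Hypothetical stand-in for a materialized aggregate, not Dremio's data model."""
    totals: dict        # precomputed SUM(amount) per region
    built_at: float     # time (in seconds) when it was materialized
    ttl_seconds: float  # how long it counts as fresh

    def is_fresh(self, now):
        return (now - self.built_at) <= self.ttl_seconds


def scan_source(rows):
    """Full scan of the source rows: always current, but touches every row."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def build_reflection(rows, now, ttl_seconds):
    return Reflection(totals=scan_source(rows), built_at=now, ttl_seconds=ttl_seconds)


def total_by_region(rows, reflection, now):
    """Answer from the reflection while it is fresh, otherwise scan the source."""
    if reflection is not None and reflection.is_fresh(now):
        return reflection.totals, "reflection"
    return scan_source(rows), "source"


rows = [("APAC", 100), ("EMEA", 50), ("APAC", 25)]
reflection = build_reflection(rows, now=0.0, ttl_seconds=60.0)
rows.append(("EMEA", 10))  # the source changes after the reflection was built

fresh_answer = total_by_region(rows, reflection, now=30.0)
stale_answer = total_by_region(rows, reflection, now=90.0)
```

Within the TTL the query is served from the reflection and does not yet see the new EMEA row; after the TTL it is served from the source and does. A longer TTL means fewer scans of the source but older answers.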
Data Catalog
An index of source metadata, including the names of tables, views, columns, fields, collections, indexes and more. Users can easily issue Google-style searches to find datasets for a given job. Data Catalog includes all metadata from virtual datasets as well.
Data Curation
A visual and intuitive way for analysts, data scientists, and data engineers to transform data for the needs of a particular job, without making copies of the data.
Data Lineage
As data is used for multiple jobs, it is transformed, joined, and shared with other users, forming an implicit graph of relationships and dependencies. These relationships help you understand how data is used, and they are essential for security, governance, and remediation.
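
The lineage graph described here can be modelled minimally as a mapping from each dataset to the datasets it was derived from. The dataset names and both helper functions below are hypothetical and only illustrate the kind of question such a graph answers: where a dataset comes from, and what is affected when a source changes.

```python
# Hypothetical lineage: each dataset maps to the datasets it was derived from.
lineage = {
    "sales_report": ["orders_clean", "customers"],
    "orders_clean": ["orders_raw"],
    "customers": [],
    "orders_raw": [],
}


def upstream(dataset, graph):
    """Return every dataset that `dataset` depends on, directly or transitively."""
    seen = set()
    stack = list(graph.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def downstream(dataset, graph):
    """Return every dataset that would be affected if `dataset` changed."""
    return {name for name in graph if dataset in upstream(name, graph)}


sources_of_report = upstream("sales_report", lineage)
impacted_by_raw = downstream("orders_raw", lineage)
```

The upstream walk serves governance questions (which sources feed this report), and the downstream walk serves remediation (which datasets need checking after a bad load into `orders_raw`).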
Recommendations
As users interact with datasets, their behavior can serve as the basis for recommendations to other users, helping to build joins and transformations more easily.
Apache Arrow-Based Execution
Apache Arrow is a columnar standard for in-memory analytics. It provides significant advantages in terms of memory and CPU efficiency, and is designed to work well with GPUs and FPGAs.
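
To make "columnar" concrete, the sketch below contrasts a row-oriented layout with a column-oriented one using only the Python standard library. It does not use the Arrow API; the `array` buffers are merely an assumption-level stand-in for Arrow's contiguous, typed column buffers, and the function names are hypothetical.

```python
from array import array

# Row-oriented layout: one record per tuple, fields interleaved.
rows = [(1, 25.0), (2, 300.0), (3, 120.0), (4, 80.0)]

# Column-oriented layout: one contiguous typed buffer per column.
order_ids = array("q", (r[0] for r in rows))  # signed 64-bit integers
amounts = array("d", (r[1] for r in rows))    # 64-bit floats


def total_from_rows(records):
    """Walks every record, unpacking both fields, to read a single column."""
    return sum(amount for _order_id, amount in records)


def total_from_column(column):
    """Reads only the one buffer the aggregate needs."""
    return sum(column)


row_total = total_from_rows(rows)
column_total = total_from_column(amounts)
```

Both functions return the same total, but the columnar version reads a single densely packed buffer of same-typed values, which is the property that makes analytical scans cache-friendly and amenable to vectorized execution.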

Feature Comparison

Dremio compared with SQL execution engines:

  • Scale-out architecture
    Dremio: Yes.
    SQL execution engines: Yes.
  • Accelerates aggregation queries
    Dremio: Yes. Queries are written against the logical schema, and Dremio's query planner automatically rewrites the query to use Aggregation Reflections, invisible to the end user.
    SQL execution engines: No. Requires a slow full table scan each time.
  • Accelerates ad-hoc queries
    Dremio: Yes. Queries are written against the logical schema, and Dremio's query planner automatically rewrites the query to use Raw Reflections, invisible to the end user.
    SQL execution engines: No. Requires a slow full table scan each time.
  • Accelerates relational data sources
    Dremio: Yes. Dremio Reflections, and native optimizers with first-class push-downs of queries.
    SQL execution engines: No. Varies by engine, but most require third-party ETL to move and prep data for HDFS or S3.
  • Accelerates NoSQL data sources
    Dremio: Yes. Dremio Reflections, and native optimizers with first-class push-downs of queries.
    SQL execution engines: No. Varies by engine, but most require third-party ETL to move and prep data for HDFS.
  • Integrated data curation
    Dremio: Yes. Natural and intuitive UI for data discovery, curation, acceleration, and collaboration.
    SQL execution engines: No. Requires a third-party tool or custom scripts written by data engineers.
  • Integrated data lineage
    Dremio: Yes. Full visibility into data lineage and access patterns for governance and error remediation.
    SQL execution engines: No. Requires a third-party tool or custom scripts written by data engineers.
  • License
    Dremio: Apache
    SQL execution engines: Apache