-
LGTM, but there is a small typo: in the Initial Goals part, the second point should be
-
LGTM
-
Hi @SemyonSinchenko, could you help list the external dependencies and licenses introduced by GraphAr-PySpark here?
-
Greetings!
We propose entering GraphAr, an open-source, language-independent data file format designed for efficient graph data storage and retrieval, into incubation. Please see the proposal below and let us know if you have any questions or concerns.
Abstract
GraphAr is an open source and language-independent data file format designed for efficient graph data storage and retrieval.
Proposal
GraphAr provides a standard data file format for graph data storage and exchange with the following features:
GraphAr currently supports cross-language operations, including C++, Java, and Spark, offering performance up to 6 times faster than CSV in loading graph data. It also facilitates the exchange of graph data between different graph systems, such as Nebula Graph, HugeGraph, and Neo4j.
Background
In recent years, several graph store systems, such as vineyard, groot, and GART, have been integrated into the GraphScope system. Each system has its own graph data storage layout, complicating the exchange of graph data between different systems. Moreover, there is no standard format for large graph data storage and exchange in the graph community.
To address this gap, we decided to develop GraphAr.
Initially, the project was developed by the Alibaba GraphScope team in 2022 and was open-sourced on GitHub in December 2022. A number of systems at Alibaba, including GraphScope and vineyard, have adopted GraphAr as their standard graph data storage format.
Rationale
Numerous graph systems, such as Neo4j, Nebula Graph, and HugeGraph, have been developed in recent years. Each of these systems has its own graph data storage format, complicating the exchange of graph data between different systems. The need for a standard data file format for large-scale graph data storage and processing that can be used by diverse existing systems is evident, as it would reduce overhead when various systems work together.
Our aim is to fill this gap and contribute to the open-source community by providing a standard data file format for graph data storage and exchange, as well as for out-of-core querying. This format, which we have named GraphAr, is engineered to be efficient, cross-language compatible, and to support out-of-core processing scenarios, such as those commonly found in data lakes. Furthermore, GraphAr's flexible design ensures that it can be easily extended to accommodate a broader array of graph data storage and exchange use cases in the future.
Initial Goals
Current Status
Meritocracy
This proposal intends to start building a community around GraphAr following the ASF meritocracy model.
Users and new contributors will be respected and welcomed.
They will earn credit by participating in the community and providing quality contributions that move the project forward. Contributions include not only code but also non-code work (documentation, discussion, testing, events, community development, etc.).
Those who make long-term, high-quality contributions will be encouraged to become committers.
Community
GraphAr is being developed by a team inside Alibaba Group that is also responsible for building graph processing systems.
The extensive graph community is in need of a standardized file format. By incorporating GraphAr into the Apache ecosystem, we anticipate further expansion of the community, benefiting from increased collaboration and adoption of a common format.
Users
GraphAr currently serves a group of users, including Alibaba, Fabarta and TuGraph. Here are some use cases of GraphAr:
Developer
Although GraphAr has attracted 13 contributors since it was open-sourced, it has four core developers: Weibin Zeng, Xue Li, Zhe Wang, and Semyon Sinchenko. Weibin Zeng and Xue Li are the founders of the GraphAr project, Zhe Wang is the author and maintainer of the GraphAr Java module, and Semyon Sinchenko is the author and maintainer of the GraphAr PySpark module.
The community's size and diversity are indeed areas of concern. However, we expect to attract more contributors in the future to address these issues by evolving the software and adhering to the Apache way.
The need for a standard graph data format is common in the graph community, and it provides the potential for a bigger community.
Core Developers
Known Risks
Project Name
We have checked and believe that GraphAr is an appropriate and easy-to-remember name.
Orphaned Products
GraphAr is used as a graph data archive format in Alibaba's graph systems GraphScope and vineyard. The developers of these systems will continue to improve GraphAr to meet current and future requirements. Other organizations, such as Fabarta, also use GraphAr in their core products. Furthermore, TuGraph and HugeGraph have shown interest in GraphAr and are considering using it as their graph data archive format. Given the extensive need in the graph community for a standardized file format, we believe the developer and user communities will continue to grow, mitigating the risk of GraphAr becoming an orphaned product.
Inexperience with Open Source
The creators of GraphAr have been working on open-source projects for many years. They have been dedicated to open-source projects such as GraphScope and vineyard.
Homogenous Developers
Currently, GraphAr has four core developers. However, we are working to diversify our developer base. We have already received interest from other organizations, such as Fabarta, which are using GraphAr in their core products. We believe that the developer and user communities will continue to grow.
Reliance on Salaried Developers
It is true that most developers are supported by their employers to contribute to GraphAr, which poses a significant risk. However, GraphAr has already been used by GraphScope and deployed within Alibaba, with no internal forked versions. The developers of GraphScope will continue to improve GraphAr to meet current and future requirements. As a result, Alibaba can ensure a long-term commitment. We believe that if GraphAr enters the incubator, we can attract more maintainers and developers from diverse backgrounds to address this risk.
Relationships with Other Apache Products
GraphAr relies on Parquet, ORC, and Arrow for data storage and exchange and on Spark to provide a connector to graph databases like Nebula Graph, HugeGraph, and Neo4j.
GraphAr can also be integrated with several Apache projects for graph data storage and exchange, including:
We believe that such integration could also benefit the open-source community, and we plan to discuss it with the community to make it happen.
An Excessive Fascination with the Apache Brand
We believe that the Apache way and its neutrality, not just the brand, will help GraphAr grow. The need for a standard graph data format is common in the graph community and is relevant to many other graph system projects, not just those in Alibaba. A neutral organization like Apache will ultimately better serve the community than a single company.
Documentation
GraphAr documentation is provided on docs.
Initial Source
GraphAr has been under development since June 2022 by the GraphScope team of engineers at Alibaba. It was open-sourced on GitHub in December 2022 and is available at https://github.com/alibaba/GraphAr. The project is licensed under Apache License 2.0.
Source and Intellectual Property Submission Plan
As soon as GraphAr is approved to join the Apache Incubator, our initial committers will submit ICLA(s), SGA, and CCLA(s). The codebase is already licensed under Apache License 2.0.
We will also deprecate the initial source repository and redirect it to the new incubator project repository after approval.
External Dependencies
GraphAr has several external dependencies with various licenses, including Apache 2.0, BSD, BSL-1.0, and MIT.
Apache 2.0
BSD
BSL-1.0
MIT
Cryptography
N/A
Required Resources
Mailing lists
[email protected]
[email protected]
[email protected]
Subversion Directory
N/A
Git Repositories
Upon entering incubation, we want to transfer the existing repo to the Apache Software Foundation:
Issue Tracking
The community would like to continue using GitHub Issues.
Other Resources
The community has chosen GitHub Actions as its continuous integration tool.
Initial Committers
Sponsors
Champion
Nominated Mentors
Sponsoring Entity
We are requesting the Incubator to sponsor this project.
References
[1] https://graphscope.io/blog/tech/2023/08/31/Getting-Started-with-GraphAr-Standardized-Graph-Storage-File-Format