
alibaba / feathub

Apache-2.0

FeatHub is a stream-batch unified feature store that simplifies feature development, deployment, monitoring, and sharing for machine learning applications.

Introduction

FeatHub is an open-source feature store designed to simplify the development and deployment of machine learning models. It supports feature ETL and provides an easy-to-use Python SDK that abstracts away the complexities of point-in-time correctness needed to avoid training-serving skew. With FeatHub, data scientists can speed up the feature deployment process and optimize feature ETL by automatically compiling declarative feature definitions into performant distributed ETL jobs using state-of-the-art computation engines of their choice, such as Flink or Spark.

Check out the Documentation for guidance on compute engines, connectors, expression language, and more.

Core Benefits

Similar to other feature stores, FeatHub provides the following core benefits:

  • Simplified feature development: The Pythonic FeatHub SDK makes it easy to develop features without worrying about point-in-time correctness. This helps to avoid training-serving skew, which can negatively impact the accuracy of machine learning models.
  • Faster feature deployment: FeatHub automatically compiles user-specified declarative feature definitions into performant distributed ETL jobs using state-of-the-art computation engines, such as Flink or Spark. This speeds up the feature deployment process and eliminates the need for data engineers to re-write Python programs into distributed stream or batch processing jobs.
  • Performant feature generation: FeatHub offers a range of built-in optimizations that leverage commonly observed feature ETL job patterns. These optimizations are automatically applied to ETL jobs compiled from the declarative feature definitions, much like how SQL optimizations are applied.
  • Facilitated feature sharing: FeatHub allows developers to register and query feature definitions in a persistent feature registry. This capability reduces the duplication of data engineering efforts and the resource cost of feature generation by allowing developers in the organization to share and re-use existing feature definitions and datasets.

In addition to the above benefits, FeatHub provides several architectural benefits compared to other feature stores, including:

  • Real-time feature generation: FeatHub supports real-time feature generation using Apache Flink as the stream computation engine, with millisecond latency. This gives lower feature latency than open-source feature stores that only support feature generation using Apache Spark.

  • Assisted feature monitoring: FeatHub provides built-in metrics to monitor the quality of features and alert users to issues such as feature drift. This helps to improve the accuracy and reliability of machine learning models.

  • Stream-batch unified computation: FeatHub allows for consistent feature computation across offline, nearline, and online stacks using Apache Flink for real-time features with low latency, Apache Spark for offline features with high throughput, and FeatureService for computing features online when the request is received.

  • Extensible framework: FeatHub's Python SDK is decoupled from the APIs of the underlying computation engines, which provides flexibility, avoids lock-in, and allows additional computation engines to be supported in the future. For example, in addition to Apache Flink and Apache Spark, FeatHub supports a Local Processor that is implemented using the Pandas library.

Usability is a crucial factor that sets feature store projects apart. Our SDK is designed to be Pythonic, declarative, intuitive, and highly expressive to support all the necessary feature transformations. We understand that a feature store's success depends on its usability as it directly affects developers' productivity. Check out the FeatHub SDK Highlights section below to learn more about the exceptional usability of our SDK.

What you can do with FeatHub

With FeatHub, you can:

  • Define new features: Define features as the result of applying expressions, aggregations, and cross-table joins on existing features, all with point-in-time correctness.
  • Read and write feature data: Read and write feature data into a variety of offline, nearline, and online storage systems for both offline training and online serving.
  • Backfill feature data: Process historical data with the given time range and/or keys to backfill feature data.
  • Run experiments: Run experiments on the local machine using LocalProcessor without connecting to an Apache Flink or Apache Spark cluster. Then deploy the FeatHub program in a distributed Apache Flink or Apache Spark cluster by changing the program configuration.
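
To illustrate what point-in-time correctness means for the cross-table joins mentioned above, here is a minimal sketch in plain pandas, not the FeatHub API. The table contents, column names, and the helper `point_in_time_join` are hypothetical. Each purchase is matched with the latest price update at or before the purchase time, so no value from the future leaks into the feature.

```python
import pandas as pd

# Hypothetical sample data: purchase events and price updates per item.
purchases = pd.DataFrame(
    {
        "item_id": ["a", "a", "b"],
        "ts": pd.to_datetime(
            ["2022-01-01 00:01:00", "2022-01-01 00:05:00", "2022-01-01 00:03:00"]
        ),
    }
)
price_updates = pd.DataFrame(
    {
        "item_id": ["a", "a", "b"],
        "ts": pd.to_datetime(
            ["2022-01-01 00:00:00", "2022-01-01 00:04:00", "2022-01-01 00:02:00"]
        ),
        "price": [10.0, 12.0, 5.0],
    }
)


def point_in_time_join(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Attach to each left row the latest right row with the same item_id
    whose timestamp is not later than the left row's timestamp."""
    # merge_asof requires both frames to be sorted by the "on" column.
    return pd.merge_asof(
        left.sort_values("ts"),
        right.sort_values("ts"),
        on="ts",
        by="item_id",
        direction="backward",
    )


joined = point_in_time_join(purchases, price_updates)
print(joined)
```

The purchase of item "a" at 00:01 gets price 10.0, while the purchase at 00:05 gets 12.0 because the price update at 00:04 had already happened by then.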

Architecture Overview

The architecture of FeatHub and its key components are shown in the figure below.

The workflow of defining, computing, and serving features using FeatHub is illustrated in the figure below.

See Basic Concepts for more details about the key components in FeatHub.

Supported Compute Engines

FeatHub supports the following compute engines to execute feature ETL pipelines:

  • Apache Flink
  • Apache Spark
  • Local Processor, which is implemented using the Pandas library

FeatHub SDK Highlights

The following examples demonstrate how to define a variety of features concisely using FeatHub SDK. See FeatHub SDK for more details.

See NYC Taxi Demo to learn more about how to define, generate and serve features using FeatHub SDK.

  • Define features via table joins with point-in-time correctness:
f_price = Feature(
    name="price",
    transform=JoinTransform(
        table_name="price_update_events",
        feature_name="price"
    ),
    keys=["item_id"],
)
  • Define over-window aggregation features:
f_total_payment_last_two_minutes = Feature(
    name="total_payment_last_two_minutes",
    transform=OverWindowTransform(
        expr="item_count * price",
        agg_func="SUM",
        window_size=timedelta(minutes=2),
        group_by_keys=["user_id"]
    )
)
  • Define sliding-window aggregation features:
f_total_payment_last_two_minutes = Feature(
    name="total_payment_last_two_minutes",
    transform=SlidingWindowTransform(
        expr="item_count * price",
        agg_func="SUM",
        window_size=timedelta(minutes=2),
        step_size=timedelta(minutes=1),
        group_by_keys=["user_id"]
    )
)
  • Define features via built-in functions and the FeatHub expression language:
f_trip_time_duration = Feature(
    name="f_trip_time_duration",
    transform="UNIX_TIMESTAMP(taxi_dropoff_datetime) - UNIX_TIMESTAMP(taxi_pickup_datetime)",
)
  • Define a feature via Python UDF:
f_lower_case_name = Feature(
    name="lower_case_name",
    dtype=types.String,
    transform=PythonUdfTransform(lambda row: row["name"].lower()),
)
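
As a rough illustration of what an over-window aggregation like `total_payment_last_two_minutes` computes, the sketch below reproduces the idea in plain pandas rather than with the FeatHub SDK. The sample events are hypothetical, and the window boundary behavior shown here is that of pandas (`rolling("2min")` covers the interval from two minutes before each event, exclusive, up to the event itself, inclusive); FeatHub's own boundary semantics may differ.

```python
import pandas as pd

# Hypothetical purchase events for two users.
events = pd.DataFrame(
    {
        "user_id": ["u1", "u1", "u1", "u2"],
        "ts": pd.to_datetime(
            [
                "2022-01-01 00:00:00",
                "2022-01-01 00:01:00",
                "2022-01-01 00:02:30",
                "2022-01-01 00:01:30",
            ]
        ),
        "item_count": [1, 2, 1, 3],
        "price": [10.0, 5.0, 4.0, 2.0],
    }
)

# Equivalent of expr="item_count * price".
events["payment"] = events["item_count"] * events["price"]

# For every event, sum the payments of the same user over the preceding
# two minutes (agg_func="SUM", group_by_keys=["user_id"]).
over = (
    events.sort_values("ts")
    .set_index("ts")
    .groupby("user_id")["payment"]
    .rolling("2min")
    .sum()
    .reset_index(name="total_payment_last_two_minutes")
)
print(over)
```

For user "u1", the event at 00:02:30 yields 14.0: the payment at 00:01:00 is still inside the two-minute window, while the one at 00:00:00 has fallen out of it.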

User Guide

Check out the Documentation for guidance on compute engines, connectors, expression language, and more.

Prerequisites

You need the following to run FeatHub installed using pip:

  • Unix-like operating system (e.g. Linux, Mac OS X)
  • Python 3.7/3.8/3.9

Install FeatHub Nightly Build

To install the nightly version of FeatHub and the corresponding extra requirements based on the compute engine you plan to use, run one of the following commands:

# Run the following command if you plan to run FeatHub using a local process
$ python -m pip install --upgrade feathub-nightly

# Run the following command if you plan to use Apache Flink cluster
$ python -m pip install --upgrade "feathub-nightly[flink]"

# Run the following command if you plan to use Apache Spark cluster, or to use
# Spark-supported storage in a local process. 
$ python -m pip install --upgrade "feathub-nightly[spark]"
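
After installing, a quick sanity check of the environment can save some debugging time. The following is a minimal sketch that uses only the Python standard library. The helper names are hypothetical, and the import name `feathub` is an assumption based on the `python/feathub` path used elsewhere in this README.

```python
import sys
from importlib.util import find_spec

# Python minor versions listed in the prerequisites above.
SUPPORTED_MINORS = {(3, 7), (3, 8), (3, 9)}


def python_version_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is one of the listed ones."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return find_spec(name) is not None


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    # "feathub" is the assumed import name of the installed package.
    print("feathub importable:", module_available("feathub"))
```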

Quickstart

Quickstart using Local Processor

Execute the following command to compute features defined in nyc_taxi.py in the given Python process.

$ python python/feathub/examples/nyc_taxi.py

Quickstart using Flink Processor

You can use the following quickstart guides to compute features in a Flink cluster with different deployment modes:

Quickstart using Spark Processor

You can use the following quickstart guides to compute features in a standalone Spark cluster.

Examples

The following examples can be run on Google Colab.

  • NYC Taxi Demo: Quickstart notebook that demonstrates how to define, extract, transform and materialize features with NYC taxi-fare prediction sample data.
  • Feature Embedding Demo: FeatHub UDF example showing how to define and use feature embedding with a pre-trained Transformer model and hotel review sample data.
  • Fraud Detection Demo: An example to demonstrate usage with multiple data sources such as user account and transaction data.

Examples in this repo can be run using docker-compose.

Developer Guide

Prerequisites

You need the following to build FeatHub from source:

  • Unix-like operating system (e.g. Linux, Mac OS X)
  • x86_64 architecture
  • Python 3.7/3.8/3.9
  • Java 8
  • Maven >= 3.1.1

Install Development Dependencies

  1. Install the required Python libraries.
$ python -m pip install -r python/dev-requirements.txt
  2. Start docker engine and pull the required images.
$ docker image pull redis:latest
$ docker image pull confluentinc/cp-kafka:5.4.3
  3. Increase the open file limit to at least 1024.
$ ulimit -n 1024
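
To confirm from Python that the open file limit from the last step is in effect, a check along the following lines can be used. This is a sketch based on the standard `resource` module, which is available on Unix-like systems only; the helper name `open_file_limit_ok` is hypothetical.

```python
import resource


def open_file_limit_ok(minimum: int = 1024) -> bool:
    """Return True if the soft limit on open files is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means that no limit is enforced at all.
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == "__main__":
    print("Open file limit is sufficient:", open_file_limit_ok())
```

Note that `ulimit -n` only affects the current shell and the processes started from it, so the check should be run from the same shell that will run the tests.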

Build and Install FeatHub from Source

$ mvn clean package -DskipTests -f ./java
$ python -m pip install "./python[flink]"
$ python -m pip install "./python[spark]"

Run Tests

Please execute the following commands under FeatHub's root folder to run tests.

$ mvn clean package -f ./java
$ pytest --tb=line -W ignore::DeprecationWarning ./python

While the commands above cover most of FeatHub's tests, some of the FlinkProcessor Python tests, such as those related to the Parquet format, are skipped by default because they require a Hadoop environment to function correctly. To run these tests, install Hadoop on your local machine and set the following environment variable before executing the commands above.

export FEATHUB_TEST_HADOOP_CLASSPATH=`hadoop classpath`

Refer to Flink's documentation for the Hive connector for the supported Hadoop and Hive versions.
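
The general pattern of gating tests on an environment variable can be sketched as follows. This is an illustration only and not necessarily how FeatHub's own test suite implements the check; the helper `hadoop_configured` and the test class are hypothetical, and only the variable name `FEATHUB_TEST_HADOOP_CLASSPATH` comes from this README.

```python
import os
import unittest

HADOOP_CLASSPATH_ENV = "FEATHUB_TEST_HADOOP_CLASSPATH"


def hadoop_configured(environ=os.environ) -> bool:
    """Return True if the Hadoop classpath variable is set and non-empty."""
    return bool(environ.get(HADOOP_CLASSPATH_ENV, "").strip())


# Hypothetical test case that is skipped unless the variable is set.
@unittest.skipUnless(hadoop_configured(), f"{HADOOP_CLASSPATH_ENV} is not set")
class HadoopDependentTest(unittest.TestCase):
    def test_classpath_is_available(self):
        self.assertTrue(os.environ[HADOOP_CLASSPATH_ENV])
```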

Format Code Style

FeatHub uses the following tools to maintain code quality:

  • Black to format Python code
  • flake8 to check Python code style
  • mypy to check type annotation

Before submitting pull requests (PRs) for review, format the code, check the code style, and check the type annotations using the following commands:

# Format Python code
$ python -m black ./python

# Check Python code style
$ python -m flake8 --config=python/setup.cfg ./python

# Check Python type annotations
$ python -m mypy --config-file python/setup.cfg ./python

Roadmap

Here is a list of key features that we plan to support:

  • Support all FeatureView transformations with FlinkProcessor
  • Support all FeatureView transformations with LocalProcessor
  • Support all FeatureView transformations with SparkProcessor
  • Support common online and offline feature storage systems (e.g. Kafka, Redis, Hive, MySQL)
  • Support persisting feature metadata in MySQL
  • Support exporting pre-defined and user-defined feature metrics to Prometheus
  • Support online transformation with feature service
  • Support feature metadata exploration (e.g. definition, lineage, metrics) with FeatHub UI

Contact Us

We recommend that Chinese-speaking users join the following DingTalk group for questions and discussion. You first need to join the "Apache Flink China" DingTalk organization via this link in order to join the DingTalk group.

English-speaking users can use this invitation link to join our Slack channel for questions and discussion.

We are actively looking for user feedback and contributors from the community. Please feel free to create pull requests and open GitHub issues for feedback and feature requests.

Come join us!

Additional Resources

  • Documentation: Our documentation provides guidance on compute engines, connectors, expression language, and more. Check it out if you need help getting started or want to learn more about FeatHub.
  • FeatHub Examples: This repository provides a wide variety of FeatHub demos that can be executed using Docker Compose. It's a great resource if you want to try out FeatHub and see what it can do.
  • Tech Talks and Articles
