Haiyang Shi

Research Scientist at ByteDance Infrastructure System Lab
Working on Hardware Acceleration and High-Performance Networking

My research interests include High-Performance Interconnects and Protocols (e.g., RDMA), Erasure Coding, In-Network Computing, and Distributed Systems.

I obtained my Ph.D. from the Department of Computer Science and Engineering at The Ohio State University, advised by Prof. Xiaoyi Lu. I received my bachelor’s degree from Tianjin University, China.

selected publications

A complete list can be found at my Google Scholar profile.

VLDB'23
Krypton: Real-Time Serving and Analytical SQL Engine at ByteDance

Jianjun Chen, Rui Shi, Heng Chen, Li Zhang, Ruidong Li, Wei Ding, Liya Fan, Hao Wang, Mu Xiong, Yuxiang Chen, Benchao Dong, Kuankuan Guo, Yuanjin Lin, Xiao Liu, Haiyang Shi, Peipei Wang, Zikang Wang, Yemeng Yang, Junda Zhao, Dongyan Zhou, Zhikai Zuo, and Yuming Liang

Proc. VLDB Endow., 2023

In recent years, at ByteDance, we have started seeing more and more business scenarios that require performing real-time data serving besides complex Ad Hoc analysis over large amounts of freshly imported data. The serving workload requires performing complex queries over massive newly added data items with minimal delay. These systems are often used in mission-critical scenarios, whereas traditional OLAP systems cannot handle such use cases. To work around the problem, ByteDance products often have to use multiple systems together in production, forcing the same data to be ETLed into multiple systems, causing data consistency problems, wasting resources, and increasing learning and maintenance costs. To solve the above problem, we built a single Hybrid Serving and Analytical Processing (HSAP) system to handle both workload types. HSAP is still in its early stage, and very few systems are yet on the market. This paper demonstrates how to build Krypton, a competitive cloud-native HSAP system that provides both excellent elasticity and query performance by utilizing many previously known query processing techniques, a hierarchical cache with persistent memory, and a native columnar storage format. Krypton can support high data freshness, high data ingestion rates, and strong data consistency. We also discuss lessons and best practices we learned in developing and operating Krypton in production.
@article{10.14778/3611540.3611545,
  author = {Chen, Jianjun and Shi, Rui and Chen, Heng and Zhang, Li and Li, Ruidong and Ding, Wei and Fan, Liya and Wang, Hao and Xiong, Mu and Chen, Yuxiang and Dong, Benchao and Guo, Kuankuan and Lin, Yuanjin and Liu, Xiao and Shi, Haiyang and Wang, Peipei and Wang, Zikang and Yang, Yemeng and Zhao, Junda and Zhou, Dongyan and Zuo, Zhikai and Liang, Yuming},
  title = {Krypton: Real-Time Serving and Analytical SQL Engine at ByteDance},
  year = {2023},
  issue_date = {August 2023},
  publisher = {VLDB Endowment},
  volume = {16},
  number = {12},
  issn = {2150-8097},
  url = {https://doi.org/10.14778/3611540.3611545},
  doi = {10.14778/3611540.3611545},
  journal = {Proc. VLDB Endow.},
  pages = {3528--3542},
  numpages = {15}
}

ICDE'23
Accelerating Cloud-Native Databases with Distributed PMem Stores

Jason Sun, Haoxiang Ma, Li Zhang, Huicong Liu, Haiyang Shi, Shangyu Luo, Kai Wu, Kevin Bruhwiler, Cheng Zhu, Yuanyuan Nie, Jianjun Chen, Lei Zhang, and Yuming Liang

In 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023

Relational databases have gone through a phase of architectural transition from a monolithic to a distributed architecture to take full advantage of cloud technology. These distributed databases can leverage remote storage to maintain larger amounts of data than monolithic databases at the cost of increased latency. At ByteDance, we have built a distributed database called veDB based on the popular compute-storage separation architecture, however we have observed the system is unable to provide both low latency and high throughput required by some business critical applications, such as batched order processing. In this paper we present our novel approaches to tackle this problem. We have modified our system’s storage to utilize persistent memory (PMem) coupled with a remote direct memory access (RDMA) network to reduce read/write latency and increase the throughput. We also propose a query push-down framework to push partial computations to the PMem storage layer to accelerate analytical queries and reduce the impact of the transaction workload in the computation layer. Our experiments show that our methods improve the throughput by up to 1.5× and reduce latency by up to 20× for standard benchmarks and real-world applications.
@inproceedings{10184639,
  author = {Sun, Jason and Ma, Haoxiang and Zhang, Li and Liu, Huicong and Shi, Haiyang and Luo, Shangyu and Wu, Kai and Bruhwiler, Kevin and Zhu, Cheng and Nie, Yuanyuan and Chen, Jianjun and Zhang, Lei and Liang, Yuming},
  booktitle = {2023 IEEE 39th International Conference on Data Engineering (ICDE)},
  title = {Accelerating Cloud-Native Databases with Distributed PMem Stores},
  year = {2023},
  pages = {3043--3057},
  doi = {10.1109/ICDE55515.2023.00233}
}

SC'21
HatRPC: Hint-Accelerated Thrift RPC over RDMA

Tianxi Li, Haiyang Shi, and Xiaoyi Lu

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021

* Tianxi Li and Haiyang Shi contributed equally to this work

In this paper, we propose a novel hint-accelerated Remote Procedure Call (RPC) framework based on Apache Thrift over Remote Direct Memory Access (RDMA) protocols, called HatRPC. HatRPC proposes a hierarchical hint scheme towards optimizing heterogeneous RPC services and functions. The proposed hint design is composed of service-granularity and function-granularity hints for achieving varied optimization goals and reducing design space for further optimizing the underneath RDMA communication engine. We co-design a key-value store called HatKV with HatRPC and LMDB. The effectiveness and efficiency of HatRPC are validated and evaluated with our proposed Apache Thrift Benchmarks (ATB), YCSB, and TPC-H workloads. Performance evaluations show that the proposed HatRPC approach can deliver up to 55% performance improvement for ATB benchmarks and up to 1.51X speedup for TPC-H queries compared with vanilla Thrift over IPoIB. In addition, the co-designed HatKV can achieve up to 85.5% improvement for YCSB workloads.
@inproceedings{10.1145/3458817.3476191,
  note = {* Tianxi Li and Haiyang Shi contributed equally to this work},
  author = {Li, Tianxi and Shi, Haiyang and Lu, Xiaoyi},
  title = {HatRPC: Hint-Accelerated Thrift RPC over RDMA},
  year = {2021},
  isbn = {9781450384421},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3458817.3476191},
  doi = {10.1145/3458817.3476191},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  articleno = {36},
  numpages = {14},
  keywords = {thrift, RDMA, RPC, code generation, hint},
  location = {St. Louis, Missouri},
  series = {SC '21}
}

SC'20
INEC: Fast and Coherent in-Network Erasure Coding

Haiyang Shi and Xiaoyi Lu

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

Erasure coding (EC) is a promising fault tolerance scheme that has been applied to many well-known distributed storage systems. The capability of Coherent EC Calculation and Networking on modern SmartNICs has demonstrated that EC will be an essential feature of in-network computing. In this paper, we propose a set of coherent in-network EC primitives, named INEC. Our analyses based on the proposed α-β performance model demonstrate that INEC primitives can enable different kinds of EC schemes to fully leverage the EC offload capability on modern SmartNICs. We implement INEC on commodity RDMA NICs and integrate it into five state-of-the-art EC schemes. Our experiments show that INEC primitives significantly reduce 50th, 95th, and 99th percentile latencies, and accelerate the end-to-end throughput, write, and degraded read performance of the key-value store co-designed with INEC by up to 99.57%, 47.30%, and 49.55%, respectively.
@inproceedings{10.5555/3433701.3433788,
  author = {Shi, Haiyang and Lu, Xiaoyi},
  title = {INEC: Fast and Coherent in-Network Erasure Coding},
  year = {2020},
  isbn = {9781728199986},
  publisher = {IEEE Press},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  articleno = {66},
  numpages = {17},
  keywords = {erasure coding, in-network computing, next generation networking, fault tolerance},
  location = {Atlanta, Georgia},
  series = {SC '20}
}

SC'19
TriEC: Tripartite Graph Based Erasure Coding NIC Offload

Haiyang Shi and Xiaoyi Lu

In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019

Best Student Paper Finalist

Erasure Coding (EC) NIC offload is a promising technology for designing next-generation distributed storage systems. However, this paper has identified three major limitations of current-generation EC NIC offload schemes on modern SmartNICs. Thus, this paper proposes a new EC NIC offload paradigm based on the tripartite graph model, namely TriEC. TriEC supports both encode-and-send and receive-and-decode operations efficiently. Through theorem-based proofs, co-designs with memcached (i.e., TriEC-Cache), and extensive experiments, we show that TriEC is correct and can deliver better performance than the state-of-the-art EC NIC offload schemes (i.e., BiEC). Benchmark evaluations demonstrate that TriEC outperforms BiEC by up to 1.82x and 2.33x for encoding and recovering, respectively. With extended YCSB workloads, TriEC reduces the average write latency by up to 23.2% and the recovery time by up to 37.8%. TriEC outperforms BiEC by 1.32x for a full-node recovery with 8 million records.
@inproceedings{10.1145/3295500.3356178,
  author = {Shi, Haiyang and Lu, Xiaoyi},
  title = {TriEC: Tripartite Graph Based Erasure Coding NIC Offload},
  year = {2019},
  isbn = {9781450362290},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3295500.3356178},
  doi = {10.1145/3295500.3356178},
  booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  articleno = {44},
  numpages = {34},
  keywords = {erasure coding, tripartite, NIC offload, bipartite},
  location = {Denver, Colorado},
  series = {SC '19}
}

HPDC'19
UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems

Haiyang Shi, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. Panda

In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, 2019

Distributed storage systems typically need data to be stored redundantly to guarantee data durability and reliability. While the conventional approach towards this objective is to store multiple replicas, today’s unprecedented data growth rates encourage modern distributed storage systems to employ Erasure Coding (EC) techniques, which can achieve better storage efficiency. Various hardware-based EC schemes have been proposed in the community to leverage the advanced compute capabilities on modern data center and cloud environments. Currently, there is no unified and easy way for distributed storage systems to fully exploit multiple devices such as CPUs, GPUs, and network devices (i.e., multi-rail support) to perform EC operations in parallel; thus, leading to the under-utilization of the available compute power. In this paper, we first introduce an analytical model to analyze the design scope of efficient EC schemes in distributed storage systems. Guided by the performance model, we propose UMR-EC, a Unified and Multi-Rail Erasure Coding library that can fully exploit heterogeneous EC coders. Our proposed interface is complemented by asynchronous semantics with optimized metadata-free scheme and EC rate-aware task scheduling that can enable a highly-efficient I/O pipeline. To show the benefits and effectiveness of UMR-EC, we re-design HDFS 3.x write/read pipelines based on the guidelines observed in the proposed performance model. Our performance evaluations show that our proposed designs can outperform the write performance of replication schemes and the default HDFS EC coder by 3.7x - 6.1x and 2.4x - 3.3x, respectively, and can improve the performance of read with failure recoveries up to 5.1x compared with the default HDFS EC coder. Compared with the fastest available CPU coder (i.e., ISA-L), our proposed designs have an improvement of up to 66.0% and 19.4% for write and read with failure recoveries, respectively.
@inproceedings{10.1145/3307681.3325406,
  author = {Shi, Haiyang and Lu, Xiaoyi and Shankar, Dipti and Panda, Dhabaleswar K.},
  title = {UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems},
  year = {2019},
  isbn = {9781450366700},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3307681.3325406},
  doi = {10.1145/3307681.3325406},
  booktitle = {Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing},
  pages = {219--230},
  numpages = {12},
  keywords = {high performance, multi-rail erasure coding, distributed storage systems},
  location = {Phoenix, AZ, USA},
  series = {HPDC '19}
}

Bench'18
EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. Panda

In Benchmarking, Measuring, and Optimizing, 2019

Best Paper Award

Various Erasure Coding (EC) schemes based on hardware accelerations have been proposed in the community to leverage the advanced compute capabilities on modern data centers, such as Intel ISA-L Onload EC coders and Mellanox InfiniBand Offload EC coders. These EC coders can play a vital role in designing next-generation distributed storage systems. Unfortunately, there does not exist a unified and easy way for distributed storage systems researchers and designers to benchmark, measure, and characterize the performance of these different EC coders. In this context, we propose a unified benchmark suite, called EC-Bench, to help the users to benchmark both onload and offload EC coders on modern hardware architectures. EC-Bench provides both encoding and decoding benchmarks with tunable parameter support. A rich set of metrics, including latency, actual and normalized throughput, CPU utilization, and cache pressure, can be reported through EC-Bench. Evaluations with EC-Bench demonstrate that hardware-optimized offload coders (e.g. Mellanox-EC) have lower demands on CPU and cache compared to onload coders, and highly optimized onload coders (e.g., Intel ISA-L) outperform offload coders for most configurations.
@inproceedings{10.1007/978-3-030-32813-9_18,
  address = {Cham},
  author = {Shi, Haiyang and Lu, Xiaoyi and Panda, Dhabaleswar K.},
  booktitle = {Benchmarking, Measuring, and Optimizing},
  editor = {Zheng, Chen and Zhan, Jianfeng},
  isbn = {978-3-030-32813-9},
  pages = {215--230},
  publisher = {Springer International Publishing},
  title = {EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures},
  year = {2019}
}