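A Survey of Distributed Stream Processing Technology
  • Abstract

    With the rapid development of computer and network technologies and the growing variety of data acquisition methods, more and more fields need to process massive, high-speed data in real time. Because such requirements often exceed the capabilities of traditional data processing techniques, the distributed stream processing paradigm has emerged. This survey first reviews the background from which distributed stream processing arose and how the technology has evolved, then compares it with other related big data processing technologies to delimit the scope of distributed stream data processing. It then analyzes in depth the main issues that distributed stream processing must address, including the data model, system model, storage management, semantic guarantees, load control, and fault tolerance, and points out the strengths and weaknesses of existing solutions. Next, it introduces several representative distributed stream processing systems, such as S4, Storm, and Spark Streaming, and compares them systematically. Finally, it presents several typical applications of distributed stream processing in areas such as social media processing and discusses directions for further research in the field.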

  • Authors

    Cui Xingcan (崔星灿), Yu Xiaohui (禹晓辉), Liu Yang (刘洋), Lü Zhaoyang (吕朝阳)

  • Affiliation

    School of Computer Science and Technology, Shandong University, Jinan 250101

  • Issue

    2015, Issue 2 (ISTIC, EI, PKU)

  • Keywords

    big data; data stream; distributed stream processing; real-time processing; distributed system
