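Two months after the release of Spark 2.0.0, Spark 2.0.1 has been released. It is a maintenance release that resolves more than 300 issues, covering stability improvements and bug fixes. Its release means Spark 2.0 is approaching production readiness, so anyone who has been wanting to try Spark 2.0 can now get started.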
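Apache Spark 2.0 is developed on the Spark branch-2.x line and, compared with branch-1.x, brings major improvements in both functionality and performance. On the performance side, Spark 2.x delivers a 2x to 10x speedup. On the functionality side, the Dataset API in Spark SQL has matured, and Spark 2.x rebuilds the Spark Streaming and MLlib APIs on top of Dataset, which substantially improves the usability and performance of both. In the near future, the DataFrame/Dataset API (the high-level API) is expected to replace the RDD API (the low-level API) as the mainstream Spark programming interface.
The main performance and functionality improvements in Apache Spark 2.x are as follows: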
1. Performance
Compared with Spark 1.x, Spark 2.0 substantially optimizes engine performance, mainly in Spark Core and Spark SQL. These gains come largely from Project Tungsten, whose main goal is to optimize Spark's use of memory and CPU so that it approaches the performance limits of the underlying hardware.
Whole-stage code generation speeds up SQL and DataFrame operators by 2 to 10 times
Vectorized execution raises the scan throughput of Parquet files
Read and write performance for ORC files is improved
The Catalyst query optimizer is faster
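To make the idea behind whole-stage code generation more concrete, here is a minimal plain-Python sketch. It is not Spark code, and every class and function name in it is hypothetical. It contrasts an operator-at-a-time plan, where each row passes through a chain of generic operators, with the single fused loop a code generator might emit for the same query (roughly `SELECT sum(x * 2) WHERE x % 2 = 0`).

```python
# Conceptual sketch only: plain Python, not Spark internals.
# All names below are hypothetical and exist just for illustration.

class Scan:
    """Leaf operator: yields rows from an in-memory list."""
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class Filter:
    """Generic operator: one predicate call per row."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        return (row for row in self.child if self.predicate(row))


class Project:
    """Generic operator: one function call per row."""
    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def __iter__(self):
        return (self.fn(row) for row in self.child)


def run_iterator_plan(rows):
    # Each row crosses three operator boundaries before it is summed.
    plan = Project(Filter(Scan(rows), lambda x: x % 2 == 0), lambda x: x * 2)
    return sum(plan)


def run_fused_plan(rows):
    # What a code generator would aim for: the whole stage collapsed
    # into one tight loop with no per-row operator dispatch.
    total = 0
    for x in rows:
        if x % 2 == 0:
            total += x * 2
    return total


rows = list(range(10))
assert run_iterator_plan(rows) == run_fused_plan(rows)
```

Both functions compute the same answer; the point is only that the fused version avoids the per-row calls between operators, which is the overhead whole-stage code generation is meant to remove.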
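2. Functionality
(1) Spark Core/SQL: Tungsten Phase 2, optimizing CPU and memory usage
Project Tungsten has completed its second phase, which further optimizes the engine's memory and CPU usage and reimplements a large number of data structures and algorithms. As a result, DataFrame/Dataset performance improves significantly, which makes DataFrame/Dataset capable of serving as the base API for other components such as Spark Streaming and MLlib.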
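Note: Project Tungsten covers three areas of optimization:
Memory Management and Binary Processing: JVM garbage collection is costly and Java objects carry a large memory overhead, so memory is managed explicitly, in a C-like manner, by operating directly on binary data (sun.misc.Unsafe)
Cache-aware Computation: make good use of the CPU's L1/L2/L3 caches by designing cache-friendly algorithms
Code Generation: eliminate condition checks, reduce virtual function dispatch, and so on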
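The binary-processing idea in the note above can be sketched in a few lines of plain Python. This is only an analogy, assuming a hypothetical fixed-width row layout of one 64-bit integer and one 64-bit float; Spark's actual row format is not shown here. Rows are packed into one contiguous buffer and read back by offset, instead of being kept as one heap object per row.

```python
import struct

# Hypothetical fixed-width row layout: int64 id followed by float64 amount.
# "<" selects little-endian with no padding, so every row is exactly 16 bytes.
ROW = struct.Struct("<qd")


def pack_rows(rows):
    """Store all rows in one contiguous buffer instead of one object per row."""
    buf = bytearray(ROW.size * len(rows))
    for i, (row_id, amount) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, row_id, amount)
    return buf


def sum_amounts(buf):
    """Aggregate by reading fields at computed offsets in the buffer."""
    total = 0.0
    for offset in range(0, len(buf), ROW.size):
        _, amount = ROW.unpack_from(buf, offset)
        total += amount
    return total


rows = [(1, 10.5), (2, 20.0), (3, 0.25)]
buf = pack_rows(rows)
```

Because the buffer is a single allocation with a predictable layout, there is far less for a garbage collector to track, and sequential scans touch memory in order, which also helps with the cache-aware goal listed above.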
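(2) Spark SQL: unified DataFrame and Dataset APIs
In Spark 1.x the DataFrame API had a number of well-known problems: it was not type-safe, and it offered little support for programming with domain objects and lambda functions (not object-oriented). To address these problems, the community introduced Dataset. Compared with DataFrame, Dataset offers type safety and an object-oriented programming style; support for semi-structured data such as JSON; a unified interface for Java and Scala; and a very efficient serialization framework. It is set to become the mainstream Spark programming interface (the RDD API is the low-level API, while Dataset is the high-level API).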
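(3) Spark SQL: SQL 2003 support
Spark SQL takes a major step forward in generality: it can run all 99 TPC-DS queries, and it adds the following improvements:
The parser supports both ANSI SQL and HiveQL
DDL commands are implemented
Most subqueries are supported
Views are supported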
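As a rough illustration of the kinds of statements listed above (DDL, a view, and a subquery), the sketch below runs them with Python's built-in `sqlite3` module. SQLite stands in for Spark SQL purely so that the snippet is self-contained; the `sales` table and `region_totals` view are hypothetical names, and nothing here demonstrates Spark's own parser or its exact dialect.

```python
import sqlite3

# SQLite is used only as a stand-in engine; table and view names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- DDL: define a table and a view over it
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 10.0), ('east', 30.0), ('west', 5.0), ('west', 50.0);
    CREATE VIEW region_totals AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

# Scalar subquery in the WHERE clause: rows above the overall average.
above_average = conn.execute("""
    SELECT region, amount
    FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
    ORDER BY amount
""").fetchall()

# Query the view like an ordinary table.
totals = dict(conn.execute("SELECT region, total FROM region_totals").fetchall())
```

Queries of this shape appear throughout TPC-DS, which is why subquery and view support matters for running the full benchmark.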
(4) Spark Streaming: Structured Streaming introduced
Spark Streaming now offers a high-level API built on Spark SQL (DataFrame/Dataset), so it benefits fully from Spark SQL's ease of use and performance gains. The rebuilt API is aimed mainly at structured data and is called "Structured Streaming". Its main features include:
DataFrame/Dataset API support
APIs for event time, windowing, sessions, sources, and sinks
Joining streaming data with static datasets
Interactive access to query results: RDD results are exposed through a JDBC server so that they can be queried interactively
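Event-time windowing, one of the features listed above, can be illustrated without Spark at all. The following plain-Python sketch assumes hypothetical `(event_time_seconds, key)` records and a 60-second tumbling window; it is not the Structured Streaming API. It groups records by the time carried in the event rather than by arrival order.

```python
from collections import Counter

WINDOW_SECONDS = 60  # assumed tumbling-window size for this sketch


def window_start(event_time):
    """Map an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)


def count_by_window(events):
    """Count events per (window start, key), keyed on event time."""
    counts = Counter()
    for event_time, key in events:
        counts[(window_start(event_time), key)] += 1
    return dict(counts)


# Records listed in arrival order; the event at t=59 arrives after t=61.
events = [(5, "a"), (61, "a"), (30, "b"), (59, "a"), (65, "b")]
result = count_by_window(events)
```

The late record with event time 59 still lands in the window starting at 0, which is the behavior event-time processing is meant to guarantee and arrival-time batching is not.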
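(5) Spark MLlib: MLlib 2.0 arrives
Spark MLlib is moving toward 2.0, with progress mainly in the variety, persistence, and customization of machine learning models. Specifically:
Comprehensive implementation of generalized linear models
Python and R API support
Stronger model persistence
Pipeline customization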
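To show what pipeline customization and model persistence mean in practice, here is a small plain-Python sketch. It does not use MLlib; the `MeanCenterer`, `Scaler`, and `Pipeline` classes and the JSON format are all hypothetical. A pipeline of custom stages is fitted, saved as text, and restored with identical behavior.

```python
import json


class MeanCenterer:
    """Hypothetical stage: learns the mean during fit, then subtracts it."""
    def __init__(self, mean=None):
        self.mean = mean

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]

    def params(self):
        return {"mean": self.mean}


class Scaler:
    """Hypothetical stage with a fixed, user-supplied factor."""
    def __init__(self, factor=1.0):
        self.factor = factor

    def fit(self, xs):
        return self  # nothing to learn

    def transform(self, xs):
        return [x * self.factor for x in xs]

    def params(self):
        return {"factor": self.factor}


STAGE_TYPES = {"MeanCenterer": MeanCenterer, "Scaler": Scaler}


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        # Fit each stage on the output of the previous one.
        for stage in self.stages:
            xs = stage.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for stage in self.stages:
            xs = stage.transform(xs)
        return xs

    def dumps(self):
        # Persist the stage types and their fitted parameters as JSON.
        return json.dumps(
            [{"type": type(s).__name__, "params": s.params()} for s in self.stages]
        )

    @classmethod
    def loads(cls, text):
        return cls([STAGE_TYPES[d["type"]](**d["params"]) for d in json.loads(text)])


pipeline = Pipeline([MeanCenterer(), Scaler(factor=2.0)]).fit([1.0, 2.0, 3.0])
restored = Pipeline.loads(pipeline.dumps())
```

The restored pipeline carries the fitted mean with it, so it can be applied to new data without refitting, which is the practical benefit of persisting whole pipelines rather than individual models.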
These major functional and performance improvements further consolidate Apache Spark 2.0's position in distributed computing. As Spark is adopted ever more widely, it will become a basic skill for data engineers.
Release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12336857
Download: http://spark.apache.org/downloads.html
Source: OSChina community (开源中国社区)

