V2EX = way to explore
V2EX 是一个关于分享和探索的地方
Sign Up Now
For Existing Member  Sign In
zhaoxj58
V2EX  ›  问与答

spark structured streaming 上可以基于 groupBy window 的结果自定义处理方法吗

  •  
  •   zhaoxj58 · Mar 12, 2021 · 641 views
    This topic created in 1879 days ago, the information mentioned may be changed or developed.

    官方的例子是这样的,最后用了一个 count()d 的方法来做统计:

    words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
    
    # Group the data by window and word and compute the count of each group
    windowedCounts = words.groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word
    ).count()
    
    

    现在我想这样做,基于 groupBy window 出来的 GroupedData 数据,使用自定义的方式来处理, 比如在 g()中,增加一些自定义逻辑。

    schema = StructType(
        [StructField("key", StringType()), StructField("avg_min", DoubleType())]
    )
    
    @panda_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
    def g(df):
        #whatever user-defined code 
    
    words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
    windowedCounts = words.groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word
    ).apply(g)
    
    

    我尝试过,但是没成功。不知道是我用法不对,还是说不能将用户自定义方法作用于 groupBy window 后的数据?

    No Comments Yet
    About   ·   Help   ·   Advertise   ·   Blog   ·   API   ·   FAQ   ·   Solana   ·   2351 Online   Highest 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 28ms · UTC 15:49 · PVG 23:49 · LAX 08:49 · JFK 11:49
    ♥ Do have faith in what you're doing.