When working with Spark, I sometimes need to make a non-serializable object available in each task.

A common pattern is @transient lazy val, e.g.:

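    import org.apache.spark.rdd.RDD

    // A is deliberately not Serializable.
    class A(val a: Int)

    def compute(rdd: RDD[Int]) = {
      // lazy val instance = {
      @transient lazy val instance = {
        println("in lazy object")
        new A(1)
      }
      val res = rdd.map(instance.a + _).count()
      println(res)
    }

    // sc is the SparkContext provided by spark-shell.
    compute(sc.makeRDD(1 to 100, 8))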

I found that @transient is not necessary here: a plain lazy val already creates the non-serializable object when each task is executed. Still, people suggest using @transient.

  1. What is the advantage of marking a not-yet-initialized lazy val as @transient when it is serialized?

  2. Does it make sense to mark a non-initialized val as transient for serialization, given that nothing will be serialized anyway, as in the example above?

  3. How is a @transient lazy val serialized? Is it treated as a method or as something else?

Some details on how a @transient lazy val is serialized, and on the compiled Java bytecode, would be awesome.


Best Answer


See here: http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/

In Scala lazy val denotes a field that will only be calculated once it is accessed for the first time and is then stored for future reference. With @transient on the other hand one can denote a field that shall not be serialized.
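As a rough mental model for question 3, a @transient lazy val can be pictured as a cached field plus a hidden "already initialized" flag, both transient, behind an accessor method. The sketch below is a hand-written approximation of what the Scala 2 compiler generates, not actual compiler output; the names Resource, Desugared, cached and initialized are made up for illustration.

```scala
// Hypothetical non-serializable class, standing in for A from the question.
class Resource(val a: Int)

// Hand-written approximation of `@transient lazy val instance = new Resource(1)`.
class Desugared extends Serializable {
  // Plays the role of the compiler's hidden flag; transient, so it is
  // false again after deserialization.
  @transient @volatile private var initialized: Boolean = false
  // Holds the computed value; transient, so it is never written out.
  @transient private var cached: Resource = _

  // The "lazy val" itself is exposed as an accessor method.
  def instance: Resource = {
    if (!initialized) {
      this.synchronized {
        if (!initialized) {
          cached = new Resource(1) // computed on first access only
          initialized = true
        }
      }
    }
    cached
  }
}
```

Under this model only the enclosing object is serialized. The value and the flag are skipped, so a deserialized copy starts out uninitialized and recomputes the value on first access.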

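A plain lazy val works in your example because you never access it in the driver process, so the field is still null when the closure is serialized and serialization succeeds. If you do access it in the driver process before the closure is shipped, the field holds the non-serializable object and @transient is required so that the field is skipped.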
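The difference can be reproduced without Spark using plain Java serialization. This is a minimal sketch, assuming Scala 2.12 or 2.13 compiled as a regular source file; PlainHolder, TransientHolder and TransientLazyValDemo are hypothetical names, and A mirrors the class from the question.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  NotSerializableException, ObjectInputStream, ObjectOutputStream}

// Not Serializable, like A in the question.
class A(val a: Int)

class PlainHolder extends Serializable {
  lazy val instance: A = new A(1)
}

class TransientHolder extends Serializable {
  @transient lazy val instance: A = new A(1)
}

object TransientLazyValDemo {
  // Serialize to a byte array and read the object back.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // True if Java serialization accepts the object.
  def serializes(obj: AnyRef): Boolean =
    try { roundTrip(obj); true }
    catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val plain = new PlainHolder
    println(s"plain, untouched: ${serializes(plain)}")
    plain.instance // simulate using the value in the driver
    println(s"plain, after access: ${serializes(plain)}")

    val marked = new TransientHolder
    marked.instance // same access, but the field is transient
    println(s"transient, after access: ${serializes(marked)}")
    println(s"recomputed after round trip: ${roundTrip(marked).instance.a}")
  }
}
```

The untouched PlainHolder serializes because its field is still null. Once instance has been accessed, only the @transient variant still serializes, and its deserialized copy rebuilds the value on first access.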