Working on spark, sometimes I need to send a non-serializable object in each task.
A common pattern is @transient lazy val
, e.g
class A(val a: Int)def compute(rdd: RDD[Int]) = {// lazy val instance = {@transient lazy val instance = {println("in lazy object")new A(1)}val res = rdd.map(instance.a + _).count()println(res)}compute(sc.makeRDD(1 to 100, 8))
I found that @transient
is not necessary here. lazy val
can already create the non-serializable upon each task is executed. But people suggest using @transient
.
What is the advantage, if we set
@transient
on the non-initializedlazy val
when serializing it ?Does it make sense to make a non-initialized
val
transient for serialization, knowing that nothing will be serialized, just like in the example above ?How is a
@transient lazy val
serialized ? Is it treated as a method or something else ?
Some details on serializing @transient lazy val
and the compiled java bytecode is awesome.
Best Answer
see here - http://fdahms.com/2015/10/14/scala-and-the-transient-lazy-val-pattern/
In Scala lazy val denotes a field that will only be calculated once it is accessed for the first time and is then stored for future reference. With @transient on the other hand one can denote a field that shall not be serialized.
lazy val
works because you don't use it in the driver process so the field remains null and serialization works. If you do use it in the driver process then @transient
is required.