This Spark configuration saved me 7 hours.
"multiLine": True
I’ve been reading JSON files, each ~1.3 GB and containing a single line, on a single worker with 11 cores. Unexpectedly, the amount of data being read was ~6-8x the file size: Spark read ~7-8 GB every time it tried to load one of these ~1.3 GB JSON files. Adding resources (more cores, or memory well beyond the total file size) only made the issue worse.
As these JSON files were a single line each, I assumed there was no need to set "multiLine": True. After much digging, I found that when multiLine is False (the default), Spark splits the JSON file into parts so it can read them in parallel. With 11 cores, Spark splits a ~1.3 GB file into 11 chunks and reads them in parallel, one chunk per core. But since the whole file is a single line, only the task that starts at the beginning of the line can produce the record, and it has to read all the way to the end of the file to get it; the other 10 tasks scan their chunks looking for a line boundary, hit the end of the file, and return no records (the "xx MiB / 0 records" rows in the image below). As far as I can tell, every task keeps reading from its split offset until it finds a newline or hits the end of the file, so the reads overlap heavily, and the job ended up reading ~7.4 GB in total even though the file is only ~1.3 GB.
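If that model is right (an assumption on my part, not something I pulled from the Spark docs), the read volume is easy to sanity-check with a few lines of Python:

file_gb, n = 1.3, 11  # file size in GB, number of parallel tasks
# Task i starts at offset i/n of the file and scans to the end looking for a
# newline, so it reads the remaining (n - i)/n fraction of the file.
total_read = sum((n - i) / n * file_gb for i in range(n))
print(round(total_read, 1))  # 7.8 -> same ballpark as the ~7.4 GB observed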
Setting "multiLine": True tells Spark that the file cannot be split and must be read as a whole by a single task. As shown in the 2nd image, Spark performs only 1 read and gets the record.
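In PySpark the fix is a one-line change (a minimal sketch; the path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-line-json").getOrCreate()

# Default (multiLine=False): Spark splits the file and each task scans for line boundaries.
df_default = spark.read.json("/data/big_single_line.json")

# multiLine=True: the file is treated as unsplittable and read once, by a single task.
df_fixed = spark.read.option("multiLine", True).json("/data/big_single_line.json")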
Thanks to this blog for being the guiding light here: https://lnkd.in/gdphUMf6
#dataengineering #data #dataanalytics #spark #sparkoptimization #json