Spark_setup_all

Introduction

Spark supports pulling datasets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small dataset or when running an iterative algorithm like random forests. Because operations in Spark are lazy, caching can also help force computation. sparklyr provides tools to cache and uncache DataFrames, and the Spark UI shows which DataFrames are in memory and what percentage of each is cached.

Using a reproducible example, we will review some of the main configuration settings, commands, and command arguments that can help you get the best out of Spark’s memory management options.

Preparation

Download Test Data

The 2007 and 2008 Flights data from the Statistical Computing site will be used for this exercise. The spark_read_csv function can read CSV files compressed in bz2 format, so no additional file preparation is needed.

Start a Spark session

A local deployment will be used for this example.
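A minimal sketch of starting such a session with sparklyr. The 16G driver memory is taken from the Driver Memory section; the object names `conf` and `sc` are illustrative, and the memory value is an assumption you should lower on a machine with less RAM.

```r
library(sparklyr)

# Start from the default configuration and request driver memory.
# "16G" mirrors the amount discussed in the Driver Memory section.
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "16G"

# master = "local" runs Spark on this machine; no cluster is required.
sc <- spark_connect(master = "local", config = conf)
```

In local mode the driver and the executor share a single JVM, which is why the driver memory is the setting that matters here.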

The Memory Argument

In the spark_read_… functions, the memory argument controls whether the data is loaded into memory as an RDD. Setting it to FALSE means that Spark will essentially map the file but not make a copy of it in memory. This makes the spark_read_csv command run faster, but the trade-off is that any data transformation operations will take much longer.
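As a sketch, reading the 2008 file without caching might look like this. It assumes an open connection named `sc` and that the compressed file sits in the working directory under the name `2008.csv.bz2`; both are assumptions rather than details given in the text.

```r
library(sparklyr)

# `sc` is assumed to be an open connection created with spark_connect().
# The file name and location are assumptions; adjust the path as needed.
flights_spark_2008 <- spark_read_csv(
  sc,
  name = "flights_spark_2008",
  path = "2008.csv.bz2",
  memory = FALSE  # map the file only; do not copy it into memory
)
```

With the default `memory = TRUE`, the same call would also cache the full table, which takes longer up front but speeds up later transformations.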

In the RStudio IDE, the flights_spark_2008 table now shows up in the Spark tab.

To access the Spark Web UI, click the SparkUI button in the RStudio Spark Tab. As expected, the Storage page shows no tables loaded into memory.

Loading Less Data into Memory

Using the pre-processing capabilities of Spark, the data will be transformed before being loaded into memory. In this section, we will continue to build on the example started in the previous section.

Lazy Transform

The dplyr script for this step is not run immediately, so the code returns quickly. Some checks are made, but for the most part it is building a Spark SQL statement in the background.
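The text does not show the transformation itself, so the pipeline below is only an illustrative sketch. It assumes the standard column names of the Flights data (DepDelay, ArrDelay, Origin, and so on); the derived `Gain` column and the filter are hypothetical choices made to shrink the data.

```r
library(sparklyr)
library(dplyr)

# `sc` is assumed to be an open connection with flights_spark_2008 registered.
flights_table <- tbl(sc, "flights_spark_2008") %>%
  mutate(
    DepDelay = as.numeric(DepDelay),
    ArrDelay = as.numeric(ArrDelay),
    Gain = DepDelay - ArrDelay        # hypothetical derived column
  ) %>%
  filter(ArrDelay > 0) %>%            # hypothetical filter: late arrivals only
  select(Origin, Dest, UniqueCarrier, Distance, DepDelay, ArrDelay, Gain)

# Nothing has been computed yet; show_query() prints the SQL built so far.
show_query(flights_table)
```

Keeping fewer rows and columns at this stage is what makes the later cached table smaller than the source file.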

Register in Spark

sdf_register registers the resulting Spark SQL in Spark. The results show up as a table called flights_spark, but that table is still not loaded into memory.
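A short sketch of the registration step, assuming `flights_table` is a lazy dplyr pipeline built on a Spark table and `sc` is the open connection (both names are illustrative).

```r
library(sparklyr)
library(dplyr)

# Give the lazy pipeline a name inside Spark; no data is cached by this call.
flights_spark <- sdf_register(flights_table, "flights_spark")

# The new name now appears in the list of tables known to the connection.
src_tbls(sc)
```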

Cache into Memory

The tbl_cache command loads the results into a Spark RDD in memory, so any analysis from then on will not need to re-read and re-transform the original file. The resulting Spark RDD is smaller than the original file because the transformations produced a smaller dataset.
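In code, the caching step can be sketched as follows, assuming the table was registered under the name flights_spark and `sc` is the open connection.

```r
library(sparklyr)

# Run the registered query and keep the result in Spark's memory.
# After this call the table should appear on the Storage page of the Spark UI.
tbl_cache(sc, "flights_spark")
```

When the cached table is no longer needed, `tbl_uncache(sc, "flights_spark")` releases the memory again.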

Driver Memory

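In the Executors page of the Spark Web UI, we can see that the Storage Memory is at about half of the 16 gigabytes requested. This is mainly because of a Spark setting called spark.memory.fraction, which defaults to 0.6 in Spark 2.0 and later: roughly 60% of the heap is shared by execution and storage, and the remaining 40% is reserved for user data structures and Spark's internal metadata.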
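The arithmetic can be sketched in plain R. This is a rough estimate that assumes the Spark 2.x defaults (300 MB of reserved memory and a spark.memory.fraction of 0.6); the variable names are illustrative.

```r
# Rough estimate of the memory shared by execution and storage
# for a driver started with 16 GB.
requested_mb    <- 16 * 1024  # memory requested for the driver
reserved_mb     <- 300        # assumed fixed reserve (Spark 2.x default)
memory_fraction <- 0.6        # assumed spark.memory.fraction default

unified_mb <- (requested_mb - reserved_mb) * memory_fraction
unified_gb <- unified_mb / 1024
round(unified_gb, 1)
```

The JVM can report somewhat less usable heap than was requested, so the figure shown in the UI may be lower than this estimate.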

Process on the fly

The plan is to read the Flights 2007 file, combine it with the 2008 file and summarize the data without bringing either file fully into memory.
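The first part of that plan, mapping the 2007 file without caching it, might be sketched like this. The table name flights_spark_2007 and the file name are assumptions chosen to mirror the 2008 table; `sc` is assumed to be the open connection.

```r
library(sparklyr)

# Map the 2007 file in the same way as the 2008 file: memory = FALSE
# registers the table without copying the data into memory.
flights_spark_2007 <- spark_read_csv(
  sc,
  name = "flights_spark_2007",  # hypothetical name, mirroring the 2008 table
  path = "2007.csv.bz2",        # assumed file name in the working directory
  memory = FALSE
)
```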

Union and Transform

The union command is akin to the dplyr bind_rows command. It lets us append the 2007 file to the 2008 file, and as with the previous transform, this script is evaluated lazily. Note that union drops duplicate rows, whereas union_all keeps every row, as bind_rows does.
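A sketch of the combined, still-lazy query. It assumes both files are registered in Spark as flights_spark_2008 and flights_spark_2007 (the 2007 name is hypothetical) and that the summary is a count of flights per Year and Month, which is an assumption about the summary rather than something the text specifies.

```r
library(sparklyr)
library(dplyr)

# Append the 2007 table to the 2008 table, then count flights by year
# and month. Nothing is executed in Spark until results are requested.
all_flights <- tbl(sc, "flights_spark_2008") %>%
  union(tbl(sc, "flights_spark_2007")) %>%
  group_by(Year, Month) %>%
  tally()
```

Because neither source table was cached, Spark streams through both files when the query finally runs instead of holding them in memory.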

Collect into R

When it receives a collect command, Spark executes the SQL statement and sends the results back to R in a data frame. In this case, R loads only 24 observations into a data frame called all_flights.
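The final step can be sketched as below, assuming `all_flights` currently holds the lazy summary query built with dplyr.

```r
library(dplyr)

# collect() triggers execution in Spark and returns a local data frame.
all_flights <- all_flights %>% collect()

# The result is now an ordinary R object that can be inspected or plotted.
nrow(all_flights)
```

Only the summarized rows cross from Spark into R, so the local session never has to hold either of the original files.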