The following items are included in this section:
- Configuring Cluster Components
- List of Properties
- Specifying Configuration Properties using Environment Variables
- Configuring SnappyData Smart Connector
- Auto-Configuring Off-Heap Memory Size
- Firewalls and Connections
Configuring Cluster Components
Configuration files for locator, lead, and server should be created in the conf folder located in the SnappyData home directory with names locators, leads, and servers.
To do so, you can copy the existing template files servers.template, locators.template, leads.template, and rename them to servers, locators, leads. These files should contain the hostnames of the nodes (one per line) where you intend to start the member. You can modify the properties to configure individual members.
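The copy step can be scripted. The sketch below is a hypothetical helper (the name init_conf_files is not part of SnappyData) that copies each template to its target name only when the target does not already exist. It assumes the directory passed in is the SnappyData home directory.

```shell
# Hypothetical helper, not part of SnappyData: create conf/locators,
# conf/leads, and conf/servers from their templates without overwriting
# files that already exist.
init_conf_files() {
  snappy_home="$1"
  for name in locators leads servers; do
    template="$snappy_home/conf/$name.template"
    target="$snappy_home/conf/$name"
    if [ -f "$template" ] && [ ! -e "$target" ]; then
      cp "$template" "$target"
    fi
  done
}
```

A call such as init_conf_files "$SNAPPY_HOME" leaves any existing locators, leads, or servers file untouched, so it is safe to run more than once.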
Locators provide discovery service for the cluster. Clients (for example, JDBC) connect to the locator and discover the lead and data servers in the cluster. The clients automatically connect to the data servers upon discovery (upon initial connection). Cluster members (Data servers, Lead nodes) also discover each other using the locator. Refer to the Architecture section for more information on the core components.
It is recommended to configure two locators (for HA) in production using the conf/locators file located in the <SnappyData_home>/conf directory.
In this file, you can specify:
- The hostname on which a SnappyData locator is started.
- The startup directory where the logs and configuration files for that locator instance are located.
- SnappyData-specific properties that can be passed.
You can refer to the conf/locators.template file for some examples.
$ cat conf/locators
node-a -peer-discovery-port=9999 -dir=/node-a/locator1 -heap-size=1024m -locators=node-b:8888
node-b -peer-discovery-port=8888 -dir=/node-b/locator2 -heap-size=1024m -locators=node-a:9999
Lead nodes primarily run the SnappyData managed Spark driver. There is one primary lead node at any given instant, but there can be multiple secondary lead node instances on standby for fault tolerance. Applications can run jobs using the REST service provided by the lead node. Most SQL queries are automatically routed to the lead to be planned and executed through a scheduler. You can refer to the conf/leads.template file for some examples.
Create the configuration file (leads) for leads in the <SnappyData_home>/conf directory.
In the conf/spark-env.sh file, set the SPARK_PUBLIC_DNS property to the public DNS name of the lead node. This enables the Member Logs to be displayed correctly to users accessing the SnappyData Monitoring Console from outside the network.
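As an illustration, the entry in conf/spark-env.sh might look like the following. The host name lead.example.com is a placeholder, not a value from this guide.

```shell
# conf/spark-env.sh (fragment)
# Placeholder host name; replace it with the public DNS name of your lead node.
export SPARK_PUBLIC_DNS="lead.example.com"
```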
Example: To start a lead (node-l), set spark.executor.cores to 10 on all servers, and change the Spark UI port from 5050 to 9090, update the configuration file as follows:
$ cat conf/leads
node-l -heap-size=4096m -spark.ui.port=9090 -locators=node-b:8888,node-a:9999 -spark.executor.cores=10
Configuring Secondary Lead
To configure secondary leads, you must add the required number of entries in the conf/leads file.
$ cat conf/leads
node-l1 -heap-size=4096m -locators=node-b:8888,node-a:9999
node-l2 -heap-size=4096m -locators=node-b:8888,node-a:9999
In this example, two leads (one on node-l1 and another on node-l2) are configured. When you launch the cluster using sbin/snappy-start-all.sh, one of them becomes the primary lead and the other becomes the secondary lead.
Configuring Data Servers
Data servers host data, embed a Spark executor, and also contain a SQL engine capable of executing certain queries independently and more efficiently than the Spark engine. Data servers use intelligent query routing to either execute the query directly on the node or pass it to the lead node for execution by Spark SQL. You can refer to the conf/servers.template file for some examples.
Create the configuration file (servers) for data servers in the <SnappyData_home>/conf directory.
Example: To start two servers on the same host (node-c), each with its own startup directory, update the configuration file as follows:
$ cat conf/servers
node-c -dir=/node-c/server1 -heap-size=4096m -memory-size=16g -locators=node-b:8888,node-a:9999
node-c -dir=/node-c/server2 -heap-size=4096m -memory-size=16g -locators=node-b:8888,node-a:9999
List of Properties
Refer to SnappyData properties.
Specifying Configuration Properties using Environment Variables
SnappyData configuration properties can be specified using the environment variables LOCATOR_STARTUP_OPTIONS, SERVER_STARTUP_OPTIONS, and LEAD_STARTUP_OPTIONS for locators, servers, and leads respectively. These environment variables are useful for specifying properties that are common to all locators, all servers, or all leads. They can be set in the conf/spark-env.sh file, which is sourced when the SnappyData system is started. A template file, conf/spark-env.sh.template, is provided in the conf directory for reference. You can copy this file and use it to configure properties.
# create a spark-env.sh from the template file
$ cp conf/spark-env.sh.template conf/spark-env.sh
# Following example configuration can be added to spark-env.sh,
# it shows how to add security configuration using the environment variables
SECURITY_ARGS="-auth-provider=LDAP -J-Dgemfirexd.auth-ldap-server=ldap://192.168.1.162:389/ -user=user1 -password=password123 -J-Dgemfirexd.auth-ldap-search-base=cn=sales-group,ou=sales,dc=example,dc=com -J-Dgemfirexd.auth-ldap-search-dn=cn=admin,dc=example,dc=com -J-Dgemfirexd.auth-ldap-search-pw=password123"
# applies the configuration specified by SECURITY_ARGS to all locators
LOCATOR_STARTUP_OPTIONS="$SECURITY_ARGS"
# applies the configuration specified by SECURITY_ARGS to all servers
SERVER_STARTUP_OPTIONS="$SECURITY_ARGS"
# applies the configuration specified by SECURITY_ARGS to all leads
LEAD_STARTUP_OPTIONS="$SECURITY_ARGS"
Configuring SnappyData Smart Connector
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). In Smart Connector mode, a Spark application connects to a SnappyData cluster to store and process data. SnappyData currently works with Spark version 2.1.1. To work with a SnappyData cluster, a Spark application must set the snappydata.connection property while starting.
|Property||Description|
|snappydata.connection||SnappyData cluster's locator host and the JDBC client port on which the locator listens for connections. Has to be specified while starting a Spark application.|
$ ./bin/spark-submit --deploy-mode cluster --class somePackage.someClass --master spark://localhost:7077 --conf spark.snappydata.connection=localhost:1527 --packages 'SnappyDataInc:snappydata:1.1.1-s_2.11'
Any Spark or SnappyData specific environment settings can be done by creating a snappy-env.sh or spark-env.sh in SNAPPY_HOME/conf.
Hadoop Provided Settings
If you want to run SnappyData with an existing custom Hadoop cluster such as MapR or Cloudera, you should download Snappy without Hadoop from the download link. This allows you to provide Hadoop at runtime.
To do this, you need to put an entry in $SNAPPY_HOME/conf/spark-env.sh as below:
export SPARK_DIST_CLASSPATH=$($OTHER_HADOOP_HOME/bin/hadoop classpath)
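A slightly more defensive variant of this entry is sketched here. The function name set_hadoop_classpath, the existence check, and the warning message are additions for illustration only. The sketch assumes, as in the entry above, that the Hadoop launcher is found at bin/hadoop under the Hadoop home directory.

```shell
# Hypothetical helper for conf/spark-env.sh: set SPARK_DIST_CLASSPATH only
# when the hadoop launcher is present, so that a wrong Hadoop home is
# reported instead of producing an empty classpath.
# Usage: set_hadoop_classpath <hadoop_home>
set_hadoop_classpath() {
  hadoop_bin="$1/bin/hadoop"
  if [ -x "$hadoop_bin" ]; then
    SPARK_DIST_CLASSPATH="$("$hadoop_bin" classpath)"
    export SPARK_DIST_CLASSPATH
  else
    echo "hadoop launcher not found: $hadoop_bin" >&2
    return 1
  fi
}
```

In spark-env.sh the helper would be called as set_hadoop_classpath "$OTHER_HADOOP_HOME".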
Currently, log files for SnappyData components go inside the working directory. To change the log file directory, you can specify a property -log-file as the path of the directory. The logging levels can be modified by adding a conf/log4j.properties file in the product directory.
$ cat conf/log4j.properties
log4j.logger.org.apache.spark.scheduler.DAGScheduler=DEBUG
log4j.logger.org.apache.spark.scheduler.TaskSetManager=DEBUG
For a set of applicable class names and default values see the file conf/log4j.properties.template, which can be used as a starting point. Consult the log4j 1.2.x documentation for more details on the configuration file.
Auto-Configuring Off-Heap Memory Size
Off-Heap memory size is auto-configured by default in the following scenarios:
- When the lead, locator, and server are set up on different host machines: In this case, off-heap memory size is configured by default for the host machines on which a server is set up. The total size of heap and off-heap memory does not exceed 75% of the total RAM. For example, if the RAM is greater than 8 GB, the heap memory is between 4 GB and 8 GB and the remainder becomes the off-heap memory.
- When leads and one of the server nodes are on the same host: In this case, off-heap memory size is configured by default and is adjusted based on the number of leads that are present. The total size of heap and off-heap memory does not exceed 75% of the total RAM. However, here the heap memory is the total heap size of the server as well as that of the lead.
The off-heap memory size is not auto-configured when the heap memory and the off-heap memory are explicitly configured through properties or when multiple servers are on the same host machine.
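The 75% rule can be illustrated with a little arithmetic. The helper below is a hypothetical sketch, not SnappyData's actual sizing logic: given the total RAM and a heap size (both in MB), it prints the largest off-heap size that keeps heap plus off-heap within 75% of RAM.

```shell
# Hypothetical helper: upper bound on off-heap memory (in MB) under the
# rule that heap + off-heap must not exceed 75% of total RAM.
# Usage: max_off_heap_mb <total_ram_mb> <heap_mb>
max_off_heap_mb() {
  total_ram_mb="$1"
  heap_mb="$2"
  budget_mb=$(( total_ram_mb * 75 / 100 ))
  off_heap_mb=$(( budget_mb - heap_mb ))
  # A heap that already uses the whole budget leaves nothing for off-heap.
  if [ "$off_heap_mb" -lt 0 ]; then
    off_heap_mb=0
  fi
  echo "$off_heap_mb"
}
```

For a host with 32 GB of RAM (32768 MB) and an 8 GB heap (8192 MB), max_off_heap_mb 32768 8192 prints 16384, that is, 16 GB. When a lead shares the host with a server, pass the combined heap size of the server and the lead as the second argument.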
Firewalls and Connections
SnappyData is a network-centric distributed system, so a firewall running on your machine can cause connection problems. For example, your connections may fail if your firewall places restrictions on inbound or outbound permissions for Java-based sockets. You may need to modify your firewall configuration to permit traffic to Java applications running on your machine. The specific configuration depends on the firewall you are using.
As one example, firewalls may close connections to SnappyData due to timeout settings. If a firewall senses no activity in a certain time period, it may close a connection and open a new connection when activity resumes, which can cause some confusion about which connections you have.
Firewall and Port Considerations
You can configure and limit port usage for situations that involve firewalls, for example, between client-server or server-server connections.
- The port that the server or locator listens on for client connections. This is configurable using the -client-port option to the snappy server or snappy locator command.
- The peer discovery port. SnappyData members connect to the locator for peer-to-peer messaging. The locator port is configurable using the -peer-discovery-port option to the snappy server or snappy locator command.
By default, SnappyData servers and locators discover each other on a pre-defined port (10334) on the localhost.
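When writing firewall rules, it can help to list the ports that are set explicitly in the configuration files. The sketch below uses a hypothetical helper, list_configured_ports, that scans a conf file for the -client-port and -peer-discovery-port options and prints the host, option name, and port for each one it finds. It reports only ports that appear in the file and knows nothing about defaults.

```shell
# Hypothetical helper: print "<host> <option> <port>" for every -client-port
# or -peer-discovery-port option found in a conf file such as conf/locators.
list_configured_ports() {
  conf_file="$1"
  # The "|| [ -n ... ]" part handles a last line without a trailing newline.
  while read -r host options || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    case "$host" in
      ''|'#'*) continue ;;
    esac
    for opt in $options; do
      case "$opt" in
        -client-port=*|-peer-discovery-port=*)
          name="${opt%%=*}"
          echo "$host ${name#-} ${opt#*=}"
          ;;
      esac
    done
  done < "$conf_file"
}
```

Run against the conf/locators example in this section, list_configured_ports conf/locators would report peer-discovery-port 9999 for node-a and 8888 for node-b.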
Limiting Ephemeral Ports for Peer-to-Peer Membership
By default, SnappyData uses ephemeral ports for UDP messaging and TCP failure detection. Ephemeral ports are temporary ports assigned from a designated range, which can encompass a large number of possible ports. When a firewall is present, the ephemeral port range usually must be limited to a much smaller number, for example, six. If you are configuring P2P communications through a firewall, you must also set the TCP port for each process and ensure that UDP traffic is allowed through the firewall.
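As a quick sanity check on a restricted range, the sketch below counts the ports in a range written as low-high. The helper name and the sample range 21000-21005 are illustrative, and the low-high notation is an assumption about how the range is written, so check the property reference for the exact syntax.

```shell
# Hypothetical helper: count the ports in an inclusive "low-high" range.
# Usage: port_range_size <low>-<high>
port_range_size() {
  range="$1"
  low="${range%-*}"
  high="${range#*-}"
  if [ "$low" -gt "$high" ]; then
    echo "invalid range: $range" >&2
    return 1
  fi
  echo $(( high - low + 1 ))
}
```

For example, port_range_size 21000-21005 prints 6, which matches the six-port example mentioned above.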
Properties for Firewall and Port Configuration
The following tables contain properties potentially involved in firewall behavior, with a brief description of each property. The Configuration Properties section contains detailed information for each property.
|Configuration Area||Property or Setting||Definition|
|peer-to-peer config||locators||The list of locators used by system members. The list must be configured consistently for every member of the distributed system.|
|peer-to-peer config||membership-port-range||The range of ephemeral ports available for unicast UDP messaging and for TCP failure detection in the peer-to-peer distributed system.|
|member config||-J-Dgemfirexd.hostname-for-clients||The IP address or host name that this server/locator sends to the JDBC/ODBC/thrift clients to use for the connection.|
|member config||client-port option to the snappy server and snappy locator commands||Port that the member listens on for client communication.|
The following table lists the Spark properties you can set to configure the ports required for Spark infrastructure. Refer to Spark Configuration in the official documentation for detailed information.
|Property||Default||Description|
|spark.blockManager.port||random||Port for all block managers to listen on. These exist on both the driver and the executors.|
|spark.driver.blockManager.port||(value of spark.blockManager.port)||Driver-specific port for the block manager to listen on, for cases where it cannot use the same configuration as executors.|
|spark.driver.port||random||Port for the driver to listen on. This is used for communicating with the executors and the standalone Master.|
|spark.port.maxRetries||16||Maximum number of retries when binding to a port before giving up. When a port is given a specific value (non 0), each subsequent retry will increment the port used in the previous attempt by 1 before retrying. This essentially allows it to try a range of ports from the start port specified to port + maxRetries.|
|spark.shuffle.service.port||7337||Port on which the external shuffle service will run.|
|spark.ui.port||4040||Port for your application's dashboard, which shows memory and workload data.|
|spark.ssl.[namespace].port||None||The port on which the SSL service listens. The port must be defined within a namespace configuration; see SSL Configuration for the available namespaces. When not set, the SSL port will be derived from the non-SSL port for the same service. A value of "0" will make the service bind to an ephemeral port.|
|spark.history.ui.port||18080||The port to which the web interface of the history server binds.|
|SPARK_MASTER_PORT||7077||Start the master on a different port.|
|SPARK_WORKER_PORT||random||Start the Spark worker on a specific port.|
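The retry behavior described for spark.port.maxRetries determines which ports a firewall has to allow for a given service. The sketch below is a hypothetical helper that prints the range from the start port to port + maxRetries, using the default of 16 when no retry count is given. It applies only when the port is set to a specific non-zero value.

```shell
# Hypothetical helper: print the inclusive port range Spark may try for a
# service, from the configured start port up to start port + maxRetries.
# Usage: spark_port_range <start_port> [max_retries]
spark_port_range() {
  start_port="$1"
  max_retries="${2:-16}"
  echo "$start_port-$(( start_port + max_retries ))"
}
```

For a Spark UI port of 9090 with the default retry count, spark_port_range 9090 prints 9090-9106.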
Locators and Ports
The ephemeral port range and TCP port range for locators must be accessible to members through the firewall.
Locators are used in the peer-to-peer cache to discover other processes. They can be used by clients to locate servers as an alternative to configuring clients with a collection of server addresses and ports.
Locators have a TCP/IP port that all members must be able to connect to. They also start a distributed system and so need to have their ephemeral port range and TCP port accessible to other members through the firewall.
Clients need only be able to connect to the locator's locator port. They don't interact with the locator's distributed system; clients get server names and ports from the locator and use these to connect to the servers. For more information, see Using Locators.