CREATE EXTERNAL TABLE
CREATE EXTERNAL TABLE [IF NOT EXISTS] [schema_name.]table_name [( column-definition [ , column-definition ] * )] USING datasource [OPTIONS (key1 val1, key2 val2, ...)]
For more information on column-definition, refer to Column Definition For Column Table.
External tables point to external data sources. SnappyData supports all the data sources supported by Spark. You should use external tables to load data in parallel from any of the external sources. The table definition is persisted in the catalog and visible across all sessions.
USING <data source>
Specify the file format to use for this table. The data source may be one of TEXT, CSV, JSON, JDBC, PARQUET, ORC, and LIBSVM, or a fully qualified class name of a custom implementation of org.apache.spark.sql.sources.DataSourceRegister. Note that most of the prominent datastores provide an implementation of 'DataSource' and accessible as a table. For instance, you can use the Cassandra spark package to create external tables pointing to Cassandra tables and directly run queries on them. You can mix any external table and SnappyData managed tables in your queries.
Create an external table using PARQUET data source on local filesystem
snappy> CREATE EXTERNAL TABLE STAGING_AIRLINE USING parquet OPTIONS(path '../../quickstart/data/airlineParquetData');
Create an external table using CSV data source on local filesystem
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_STAGING USING csv OPTIONS(path '../../quickstart/src/main/resources/customer.csv');
CREATE EXTERNAL TABLE CUSTOMER_STAGING_1 (C_CUSTKEY INTEGER NOT NULL, C_NAME VARCHAR(25) NOT NULL, C_ADDRESS VARCHAR(40) NOT NULL, C_NATIONKEY INTEGER NOT NULL, C_PHONE VARCHAR(15) NOT NULL, C_ACCTBAL DECIMAL(15,2) NOT NULL, C_MKTSEGMENT VARCHAR(10) NOT NULL, C_COMMENT VARCHAR(117) NOT NULL) USING csv OPTIONS (path '../../quickstart/src/main/resources/customer.csv');
You can also load data from AWS S3 buckets, as given in the example below:
CREATE EXTERNAL TABLE NYCTAXI USING parquet OPTIONS(path 's3a://<AWS_SECRET_KEY>:<AWS_SECRET_ID>@<folder>/<data>');
Specifying AWS credentials to access S3 buckets
Providing AWS credentials explicitly in the path url may not be advisable. There are alternative ways users can specify these credentials:
Users can provide those as properties in
conf/leadsfile at the time of launching the cluster.
For example, append these properties with appropriate values for each entry in your
pre -spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem -spark.hadoop.fs.s3a.access.key=<access-key-id> -spark.hadoop.fs.s3a.secret.key=<secret-access-key>
One can also provide them as environment variables in
conf/spark-env.sh. Simply add these two entries with appropriate values in that file before launching the cluster.
pre export AWS_ACCESS_KEY_ID=<access-key-id> export AWS_SECRET_ACCESS_KEY=<secret-access-key>
This option is applicable only if your cluster is running on AWS EC2 instance(s). Also, this may not work if your S3 buckets are created in regions where AWS signature version 2 is not supported.
You can attach an IAM role with appropriate permissions to the instance(s) where the cluster is setup. To do this, go to EC2 dashboard page on AWS console, select your EC2 instance and right-click on it. Select Instance Settings and then Attach/Replace IAM Role. Then click on the drop-down list and select appropriate IAM role. If you do not see any IAM role in the list, you need to create one. Refer to AWS documentation for more details.
When you attach an IAM role to an instance, a temporary set of credentials are generated by AWS for the instance(s). These are then picked up by the cluster while accessing the S3 buckets.
Then provide below configuration in
conf/spark-defaults.conffile before launching the cluster.
pre spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider