How to Read a CSV File in Spark SQL

CSV (Comma-Separated Values) is one of the most common file formats for receiving data. That is why, when you are working with Spark, having a good grasp on how to process CSV files is a must. Spark provides out-of-the-box support for CSV files. In this blog, we will learn how to read CSV data in Spark and the different options available with this method.

Reading CSV File

Spark has built-in support for reading CSV files. We can use the spark.read command, and it will read the CSV data and return us a DataFrame.

df = spark.read.csv("data/flight-data/csv/2010-summary.csv")

df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)

df.show(2)

+-----------------+-------------------+-----+
|              _c0|                _c1|  _c2|
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
|    United States|            Romania|    1|
+-----------------+-------------------+-----+

We can call the read CSV function and pass it the path to our CSV file. Spark will read this file and return us a DataFrame. There are other generic ways to read a CSV file as well.

# reading data using spark format option
df2 = spark.read.format("csv").load("data/flight-data/csv/2010-summary.csv")

# we can also pass path as an option to spark read
df3 = spark.read.format("csv") \
    .option("path", "data/flight-data/csv/2010-summary.csv") \
    .load()

You can use any of these methods to read a CSV file. In the end, Spark will return an appropriate DataFrame.

Handling Headers in CSV

Generally, you may have headers in your CSV file. If you directly read the CSV in Spark, Spark will treat that header as a normal data row.

When we print our DataFrame using the show command, we can see that the column names are _c0, _c1 and _c2, and our first data row is DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count.

To handle headers in a CSV file, we can pass the header flag as true while reading the data in Spark.

df = spark.read \
    .option("header", "true") \
    .csv("data/flight-data/csv/2010-summary.csv")

df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows

We can see that Spark has handled the header row from the CSV properly.

Reading CSV Files in a Folder

While reading CSV files in Spark, we can also pass the path of a folder which has CSV files. This will read all CSV files in that folder.

df = spark.read \
    .option("header", "true") \
    .csv("data/flight-data/csv")

df.count()

1502

You will need to be more careful when passing the path of a directory. If there are other files (in whatever format) in that directory, Spark will treat them as input data, and you may see wrong results or exceptions while processing such DataFrames.
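One way to guard against stray files, assuming you are on Spark 3.0 or later, is the pathGlobFilter option, which restricts which files in the directory are read. A minimal sketch:

# only pick up files ending in .csv from the folder (Spark 3.0+)
df = spark.read \
    .option("header", "true") \
    .option("pathGlobFilter", "*.csv") \
    .csv("data/flight-data/csv")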

Another way to avoid this is to pass multiple paths explicitly while reading data.

df = spark.read \
    .option("header", "true") \
    .csv(["data/flight-data/csv/2010-summary.csv",
          "data/flight-data/csv/2011-summary.csv"])

DataFrame schema while reading CSV files

If you print the schema of our current DataFrame, you will notice that all the column names are correct, picked up from the header row. But the data types of all columns are string.

df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: string (nullable = true)

Spark by default sets the column data type to string. If we want Spark to correctly identify the data types, we can pass the inferSchema option while reading the CSV file.

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("data/flight-data/csv/2010-summary.csv")

df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)

Now we can see that the data type for the "count" column has been correctly identified as integer. When we pass inferSchema as true, Spark reads a few lines from the file so that it can correctly identify the data types for each column.
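This extra pass over the data has a cost on large files. If that matters, the samplingRatio option controls what fraction of rows Spark samples for schema inference (the 0.1 below is an illustrative value, not a recommendation):

# infer the schema from roughly 10% of the rows
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("samplingRatio", 0.1) \
    .csv("data/flight-data/csv/2010-summary.csv")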

Though in most cases Spark identifies column data types correctly, in production workloads it is recommended to pass our own custom schema while reading the file. We can do that using Spark's "StructType" and "StructField" classes.


from pyspark.sql.types import StructField, StructType, StringType, LongType

custom_schema = StructType([
    StructField("destination", StringType(), True),
    StructField("source", StringType(), True),
    StructField("total_flights", LongType(), True),
])

df = spark.read.format("csv") \
    .schema(custom_schema) \
    .option("header", True) \
    .load("data/flight-data/csv/2010-summary.csv")

df.show(2)

+-------------+-------+-------------+
|  destination| source|total_flights|
+-------------+-------+-------------+
|United States|Romania|           15|
|United States|Croatia|            1|
+-------------+-------+-------------+
only showing top 2 rows

If you want to learn more about how to add a custom schema while reading files in Spark, you can check this article: Adding Custom Schema to Spark DataFrame.
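As a lighter-weight alternative (supported since Spark 2.3), the same schema can be written as a DDL-style string instead of building StructType objects:

# equivalent schema expressed as a DDL string
df = spark.read.format("csv") \
    .schema("destination STRING, source STRING, total_flights LONG") \
    .option("header", True) \
    .load("data/flight-data/csv/2010-summary.csv")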

Reading CSV with a Different Delimiter

Sometimes, we have a delimiter in the file other than a comma ",". In such cases we can specify the separator character while reading the CSV file. Below is an example reading a pipe (|) delimited file, but you can use any other character applicable in your case.

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", "|") \
    .csv("data/flight-data/csv/piped_data.csv")

Handling Commas in Column Values

Sometimes, our column value itself has a comma in it. When Spark tries to read such a file, it cannot identify each column value correctly. Consider the sample data below.

name,address,age
abc,123 some road, city,30

We have a comma present in our address, so when Spark reads this it will give us the result below, which is incorrect.

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("data/flight-data/csv/sample_data.csv")

df.show()

+----+-------------+----+
|name|      address| age|
+----+-------------+----+
| abc|123 some road|city|
+----+-------------+----+

If we want to handle such cases, we need to make sure our data is enclosed in quotes. Then we can set escapeQuotes as true while reading the data to handle such cases.


df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("escapeQuotes", "true") \
    .csv("data/flight-data/csv/sample_data.csv")

df.show()

+----+-------------------+----+
|name|            address| age|
+----+-------------------+----+
| abc|      123 some road|city|
| abc|123 some road, city|  30|
+----+-------------------+----+

# data in file
name,address,age
abc,123 some road, city,30
"abc","123 some road, city",30

We can see that Spark has correctly put all the address information in one column, and our age column has the right value when we have the data enclosed in quotes.
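Relatedly, if a quoted value itself contains the quote character, the quote and escape read options control how Spark interprets it. A minimal sketch, assuming the data escapes embedded quotes by doubling them:

# treat " as the quote character and "" inside a value as a literal quote
df = spark.read \
    .option("header", "true") \
    .option("quote", '"') \
    .option("escape", '"') \
    .csv("data/flight-data/csv/sample_data.csv")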

A Few Important Options While Reading CSV Files

Though CSV is one of the most common formats for storing data, it is one of the most difficult ones to process. There are a lot of cases we have to account for, like spaces, date formats, null values, etc.

Apart from the options we have discussed, below are a few more options which you may find useful while dealing with CSV data in Spark. If you need more details about these options, let me know.
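As a sketch, here are a few commonly used ones; the option names come from Spark's standard CSV reader, and the values shown are purely illustrative:

# nullValue: string to interpret as null
# dateFormat / timestampFormat: patterns for parsing date and timestamp columns
# ignoreLeading/TrailingWhiteSpace: trim padding around values
# mode: PERMISSIVE (default), DROPMALFORMED, or FAILFAST for malformed rows
df = spark.read \
    .option("header", "true") \
    .option("nullValue", "NA") \
    .option("dateFormat", "yyyy-MM-dd") \
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
    .option("ignoreLeadingWhiteSpace", "true") \
    .option("ignoreTrailingWhiteSpace", "true") \
    .option("mode", "DROPMALFORMED") \
    .csv("data/flight-data/csv/2010-summary.csv")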

Conclusion

In this blog, we have written Spark code to read CSV data. We have also checked different options to deal with common pitfalls while working with CSV files. You can find the code in this git repo. I hope you have found this useful. See you in the next blog.
