I tried to reproduce the error and I succeeded. I started my investigation by figuring out which dates were problematic, and after running a small for loop I ended up with:
val spark = SparkSession.builder.appName("My Application").getOrCreate()

for (hour <- 0 to 23) {
  println(s"checking ${hour}")
  val queryHour = if (hour < 10) { s"0${hour}" } else { hour }
  spark.sql("SELECT CAST(UNIX_TIMESTAMP('2015-03-29T01:11:23Z', \"yyyy-MM-dd'T'HH:mm:ss'Z'\") AS TIMESTAMP) as column1, " +
    "CAST(UNIX_TIMESTAMP('2015-03-29T" + queryHour + ":11:23Z', \"yyyy-MM-dd'T'HH:mm:ss'Z'\") AS TIMESTAMP) as column2").show(false)
}
It returned:
checking 0
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 00:11:23|
+-------------------+-------------------+

checking 1
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 01:11:23|
+-------------------+-------------------+

checking 2
+-------------------+-------+
|column1            |column2|
+-------------------+-------+
|2015-03-29 01:11:23|null   |
+-------------------+-------+

checking 3
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 03:11:23|
+-------------------+-------------------+

checking 4
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 04:11:23|
+-------------------+-------------------+

checking 5
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 05:11:23|
+-------------------+-------------------+

checking 6
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 06:11:23|
+-------------------+-------------------+

checking 7
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 07:11:23|
+-------------------+-------------------+

checking 8
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 08:11:23|
+-------------------+-------------------+

checking 9
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 09:11:23|
+-------------------+-------------------+

checking 10
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 10:11:23|
+-------------------+-------------------+

checking 11
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 11:11:23|
+-------------------+-------------------+

checking 12
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 12:11:23|
+-------------------+-------------------+

checking 13
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 13:11:23|
+-------------------+-------------------+

checking 14
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 14:11:23|
+-------------------+-------------------+

checking 15
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 15:11:23|
+-------------------+-------------------+

checking 16
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 16:11:23|
+-------------------+-------------------+

checking 17
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 17:11:23|
+-------------------+-------------------+

checking 18
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 18:11:23|
+-------------------+-------------------+

checking 19
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 19:11:23|
+-------------------+-------------------+

checking 20
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 20:11:23|
+-------------------+-------------------+

checking 21
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 21:11:23|
+-------------------+-------------------+

checking 22
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 22:11:23|
+-------------------+-------------------+

checking 23
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 01:11:23|2015-03-29 23:11:23|
+-------------------+-------------------+
So only hour 02 is affected. From there I dug into Spark's datetimeExpressions.scala, the file used to convert datetime content. After adding some breakpoints, I found that the problem is caused by a silently swallowed parsing error in the org.apache.spark.sql.catalyst.expressions.UnixTime#eval method, more exactly here:
case StringType if right.foldable =>
  if (constFormat == null || formatter == null) {
    null
  } else {
    try {
      formatter.parse(t.asInstanceOf[UTF8String].toString).getTime / 1000L
    } catch {
      case NonFatal(_) => null
    }
  }
together with the formatter, which is created with the time zone:
private lazy val formatter: DateFormat =
  try {
    DateTimeUtils.newDateFormat(constFormat.toString, timeZone)
  } catch {
    case NonFatal(_) => null
  }
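For context, DateTimeUtils.newDateFormat essentially returns a SimpleDateFormat bound to the given time zone with lenient parsing disabled; the sketch below is an approximation of that behaviour rather than the exact Spark source. The non-lenient mode is the important part, since it is what turns a non-existent local time into a ParseException instead of a silently shifted value.

import java.text.{DateFormat, SimpleDateFormat}
import java.util.{Locale, TimeZone}

// Approximation (assumed, not the exact Spark source) of DateTimeUtils.newDateFormat:
// a SimpleDateFormat bound to the given time zone with strict parsing enabled.
def newDateFormatSketch(formatString: String, timeZone: TimeZone): DateFormat = {
  val sdf = new SimpleDateFormat(formatString, Locale.US)
  sdf.setTimeZone(timeZone)
  sdf.setLenient(false) // strict mode: invalid or non-existent local times throw
  sdf
}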
Next, I isolated the code that introduces the problem:
val formatterParseError: DateFormat = DateTimeUtils.newDateFormat(
  "yyyy-MM-dd'T'HH:mm:ss'Z'", DateTimeUtils.getTimeZone("Europe/Paris"))
val formattedDateParseError = formatterParseError.parse("2015-03-29T02:11:23Z")
And indeed, I got a ParseException:
Unparseable date: "2015-03-29T02:11:23Z"
java.text.ParseException: Unparseable date: "2015-03-29T02:11:23Z"
  at java.text.DateFormat.parse(DateFormat.java:366)
  at com.waitingforcode.stackoverflow.SchemaTest$$anonfun$2.apply$mcV$sp(SchemaTest.scala:225)
  at com.waitingforcode.stackoverflow.SchemaTest$$anonfun$2.apply(SchemaTest.scala:218)
  at com.waitingforcode.stackoverflow.SchemaTest$$anonfun$2.apply(SchemaTest.scala:218)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
From there I checked the Javadoc on time zone handling: https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html#timezone. When 'Z' is quoted in the pattern it is only a literal character, so no offset is parsed and the string is interpreted as Europe/Paris local time; 2015-03-29 02:11:23 does not exist in that zone because of the DST switch from 02:00 to 03:00, which is why only hour 02 fails. When the time zone is expressed as 'Z' (ISO 8601 style), the correct pattern letter to use is X. When I changed it in the isolated test case, the date was parsed correctly:
val formatter: DateFormat = DateTimeUtils.newDateFormat(
  "yyyy-MM-dd'T'HH:mm:ssX", DateTimeUtils.getTimeZone("Europe/Paris"))
val formattedDate = formatter.parse("2015-03-29T02:11:23Z")
println(s"formattedDate = ${formattedDate}")
// formattedDate = Sun Mar 29 04:11:23 CEST 2015
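The same before-and-after behaviour can be reproduced without any Spark classes, using a plain SimpleDateFormat configured like the sketch above (an illustration, not Spark code; note that printing a java.util.Date uses the JVM default time zone):

import java.text.SimpleDateFormat
import java.util.TimeZone

// With 'Z' quoted as a literal, no offset is parsed, so the string is read as
// Europe/Paris wall-clock time; 02:11:23 falls into the DST gap and strict
// parsing rejects it.
val literalZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
literalZ.setTimeZone(TimeZone.getTimeZone("Europe/Paris"))
literalZ.setLenient(false)
try {
  literalZ.parse("2015-03-29T02:11:23Z")
} catch {
  case e: java.text.ParseException => println(s"literal 'Z' fails: ${e.getMessage}")
}

// With the X pattern letter the trailing 'Z' is parsed as a UTC offset, so the
// instant is unambiguous and the DST gap no longer matters.
val isoOffset = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssX")
isoOffset.setTimeZone(TimeZone.getTimeZone("Europe/Paris"))
isoOffset.setLenient(false)
println(s"X pattern parses: ${isoOffset.parse("2015-03-29T02:11:23Z")}")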
It also works in Spark SQL (here the conversion goes to the Europe/Paris time zone; you can change that with the spark.sql.session.timeZone option):
checking 0
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 01:11:23|
+-------------------+-------------------+

checking 1
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 03:11:23|
+-------------------+-------------------+

checking 2
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 04:11:23|
+-------------------+-------------------+

checking 3
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 05:11:23|
+-------------------+-------------------+

checking 4
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 06:11:23|
+-------------------+-------------------+

checking 5
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 07:11:23|
+-------------------+-------------------+

checking 6
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 08:11:23|
+-------------------+-------------------+

checking 7
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 09:11:23|
+-------------------+-------------------+

checking 8
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 10:11:23|
+-------------------+-------------------+

checking 9
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 11:11:23|
+-------------------+-------------------+

checking 10
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 12:11:23|
+-------------------+-------------------+

checking 11
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 13:11:23|
+-------------------+-------------------+

checking 12
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 14:11:23|
+-------------------+-------------------+

checking 13
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 15:11:23|
+-------------------+-------------------+

checking 14
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 16:11:23|
+-------------------+-------------------+

checking 15
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 17:11:23|
+-------------------+-------------------+

checking 16
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 18:11:23|
+-------------------+-------------------+

checking 17
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 19:11:23|
+-------------------+-------------------+

checking 18
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 20:11:23|
+-------------------+-------------------+

checking 19
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 21:11:23|
+-------------------+-------------------+

checking 20
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 22:11:23|
+-------------------+-------------------+

checking 21
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-29 23:11:23|
+-------------------+-------------------+

checking 22
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-30 00:11:23|
+-------------------+-------------------+

checking 23
+-------------------+-------------------+
|column1            |column2            |
+-------------------+-------------------+
|2015-03-29 03:11:23|2015-03-30 01:11:23|
+-------------------+-------------------+
To recap, the fact that null is returned when the parsing fails can be seen both in the code shown above and in the code comment:
/**
 * Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string
 * representing the timestamp of that moment in the current system time zone in the given
 * format. If the format is missing, using format like "1970-01-01 00:00:00".
 * Note that hive Language Manual says it returns 0 if fail, but in fact, it returns null.
 */
We can solve it by using the ISO 8601 time zone pattern letter (X) instead.
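To close the loop, here is a minimal sketch of the fix on the Spark SQL side, assuming the same spark session as in the reproduction above; spark.sql.session.timeZone is only pinned explicitly to make the result deterministic:

// Pin the session time zone (optional; it defaults to the JVM time zone).
spark.conf.set("spark.sql.session.timeZone", "Europe/Paris")

// The previously failing 02:11:23Z value now parses, because 'Z' is read as a
// UTC offset rather than a literal character.
spark.sql(
  "SELECT CAST(UNIX_TIMESTAMP('2015-03-29T02:11:23Z', \"yyyy-MM-dd'T'HH:mm:ssX\") AS TIMESTAMP) AS column2"
).show(false)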