Parsing a Pretty Big File


Got a pretty big log file. Sure, I could read it using glogg or by splitting it first. But where's the fun in that?

I am going to use conduit for that job.

But before I write gibberish about it, perhaps you want to read my previous entry, which talks about parsing and tweeting. I use the same library, attoparsec, in both.

Log Files

Pretty much like this.

2018-02-31 00:00:02,968 INFO  [Polling.Thingy.You.Know] method.from.program - StatementCache: [size=100, count=28, hits=231522, misses=47, aged=0]
2018-02-31 07:58:33,479 ERROR [TaskDispatcherThingy] problematic.method.from.program - DoingSomething.validateRequestData :: Validated things doesn't have Persistable: problematic.method.from.program.another.

Just a standard format from log4j (but please note that the message may span more than one line). The problem, though, is the size of the file.

> du komeror.log
1797720 komeror.log
> du xaa.log
76188   xaa.log

That komeror.log is a filtered log file which contains only the lines with the word ERROR in them. And that xaa.log is the first one million lines, split using the split command. There are more than 200 such files, actually.

Data Types

There are three datatypes that I'm using.
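
They aren't shown in this post, so here is a minimal sketch of the first two; the field names, and Day and TimeOfDay from the time package, are assumptions inferred from the parsers below, and the exact definitions in the repo may differ. The third one, Opsi, shows up in the CLI Parser section.

data Severity = Info | Warn | Error | Fatal
  deriving (Show, Read, Eq)

-- One parsed log entry: date, time, severity, event, and message.
data KomLog = KomLog
  { komlogTanggal  :: Day        -- the date part
  , komlogJam      :: TimeOfDay  -- the time part, milliseconds included
  , komlogSeverity :: Severity
  , komlogMethod   :: ByteString -- the part between [ and ]
  , komlogMessage  :: ByteString -- the rest, possibly multi-line
  } deriving (Show, Eq)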

Defining the Log Parser

It's better to start by defining the simplest one, a Parser Severity.

parseSeverity :: Parser Severity
parseSeverity =
  ("INFO" >> return Info)
    <|> ("WARN" >> return Warn)
    <|> ("ERROR" >> return Error)
    <|> ("FATAL" >> return Fatal)

Which basically matches the string on the left side of the >> operator and returns the corresponding constructor. We are going to use it as a building block of the log parser.
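
A quick sanity check in GHCi, using parseOnly from attoparsec (those bare string literals work as parsers thanks to the OverloadedStrings extension):

λ> parseOnly parseSeverity "ERROR"
Right Error
λ> parseOnly parseSeverity "WARN and whatever trails is ignored"
Right Warn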

And then, defining the event parser (the [Polling.Thingy.You.Know] part, which I dumbly misnamed as parseMethod).

parseMethod :: Parser ByteString
parseMethod = do
  _         <- char '['
  something <- takeWhile (/= ']')
  _         <- char ']'
  return something

Pretty simple, right? That function only accepts input which starts with a [ character, stops at the ] character, and then returns whatever is in between.
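
For example, in GHCi:

λ> parseOnly parseMethod "[Polling.Thingy.You.Know]"
Right "Polling.Thingy.You.Know"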

What about the date and time, you ask? Actually, I added an external dependency for it, but when I realised that I only needed one or two functions, I decided to remove it from my cabal file and use the "copy and paste" method instead. Shameless, I know.
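
The copied functions aren't shown here either, so below is a rough sketch of what parseTanggal and parseJam might look like for the 2018-02-31 00:00:02,968 timestamp format, assuming Day and TimeOfDay from the time package; the actual copied code differs.

import Data.Time (Day, TimeOfDay (..), fromGregorian)

-- Parses "YYYY-MM-DD".
parseTanggal :: Parser Day
parseTanggal = do
  y <- decimal <* char '-'
  m <- decimal <* char '-'
  d <- decimal
  -- fromGregorian clips out-of-range dates, which conveniently
  -- tolerates the February 31st in the sample lines above.
  return (fromGregorian y m d)

-- Parses "HH:MM:SS,mmm".
parseJam :: Parser TimeOfDay
parseJam = do
  h  <- decimal <* char ':'
  m  <- decimal <* char ':'
  s  <- decimal <* char ','
  ms <- decimal
  return (TimeOfDay h m (fromIntegral (s * 1000 + ms :: Int) / 1000))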

Now, the parser for an individual log entry.

parseKomLog :: Parser KomLog
parseKomLog = do
  tanggal  <- parseTanggal
  _        <- skipSpace
  jam      <- parseJam
  _        <- skipSpace
  severity <- parseSeverity
  _        <- skipSpace
  method   <- parseMethod
  _        <- skipSpace
  message  <- manyTill anyChar $ endOfInput <|> tanggalDiDepan
  return $ KomLog tanggal jam severity method (pack message)

Pretty simple, actually. It parses the date and the time, skips whitespace, parses the log's severity (using parseSeverity), skips whitespace again, parses the log's event (using parseMethod), and then reads the message until either the end of input or tanggalDiDepan, which is just a lookahead for the previous steps, minus the message part.
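
tanggalDiDepan itself isn't shown above; based on that description, a minimal sketch could look like the following (the real definition in the repo may differ):

-- Succeeds, without consuming anything, when the upcoming input looks
-- like the start of the next log entry; manyTill stops right there.
tanggalDiDepan :: Parser ()
tanggalDiDepan = lookAhead $ do
  _ <- parseTanggal
  skipSpace
  _ <- parseJam
  skipSpace
  _ <- parseSeverity
  return ()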

CLI Parser

I think I only cared about how many lines it should print, which file it should read, and which severity it should print. So, I created a record like this.

data Opsi = Opsi
  { opsiJumlahMaksimal :: Int
  , opsiNamaBerkas     :: FilePath
  , opsiSeverity       :: Severity
  }

And then, using optparse-applicative, I created a parser like the following:

opsi :: Parser Opsi
opsi =
  Opsi
    <$> option auto (short 'x' <> showDefault <> value 1) -- [1]
    <*> strOption (short 'b') -- [2]
    <*> option auto (short 's' <> value Error) -- [3]

where [1] is an optional switch which takes an integer and corresponds to opsiJumlahMaksimal. [2] is a mandatory switch, because it corresponds to the log file which this program should read. [3] is pretty much like number one; note that auto needs a Read instance for Severity, which is why the default value is Error.
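
The entry point below also uses an infoopsi value which isn't shown here; a plausible sketch, assuming the usual optparse-applicative boilerplate (the description string is made up):

-- Wraps the opsi parser with --help support and a short description.
infoopsi :: ParserInfo Opsi
infoopsi =
  info (opsi <**> helper)
       (fullDesc <> progDesc "Print matching entries from a big log file")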

Entry Point

Here's the entry point. I tried to keep it as simple as possible.

someFunc :: IO ()
someFunc = do
  Opsi {..} <- execParser infoopsi -- [1]
  runConduitRes -- [2]
    $  C.sourceFile opsiNamaBerkas -- [3]
    .| conduitParser parseKomLog -- [4]
    .| filterC (\(_, KomLog _ _ c _ _) -> c == opsiSeverity) -- [5]
    .| takeC opsiJumlahMaksimal -- [6]
    .| mapC posrangekomlogketeks -- [7]
    .| iterMC putText -- [8]
    .| sinkNull

  1. Reads the command-line switches.
  2. Conduit in action.
  3. Reads the file which was given on the command line.
  4. conduit & attoparsec in action.
  5. Filters the parsed entries by severity. The default is Error; I created this program because I wanted to know why that program went kaput because of a simple error.
  6. Takes at most the amount given on the command line.
  7. posrangekomlogketeks is basically show with some weird formatting choices (see the sketch after this list).
  8. Prints it to stdout.
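
And since conduitParser yields (PositionRange, KomLog) pairs, here's a guess at what posrangekomlogketeks looks like; the actual weird formatting choices are in the repo:

import qualified Data.Text as T
import Data.Conduit.Attoparsec (PositionRange)

-- Drops the position info and renders the parsed entry as Text.
posrangekomlogketeks :: (PositionRange, KomLog) -> T.Text
posrangekomlogketeks (_, entri) = T.pack (show entri)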

You can look at the repo here at ibnuda/guling or its mirror. Somebody once told me that I don't write anything in my spare time, just because I don't use github.

Results

From the error-only log file:

> wc komeror.log
  15850949   95106864 1840857389 komeror.log
> time guling -x 15000000 -b komeror.log >> /dev/null
  guling -x 15000000 -b komeror.log >> /dev/null  177,02s user 21,16s system 130% cpu 2:31,69 total
> time cat komeror.log >> /dev/null
  cat komeror.log >> /dev/null  0,00s user 0,23s system 99% cpu 0,229 total

From the log file with the first million lines:

> wc xaa.log
  1000000  3445994 78009277 xaa.log
> time guling -b xaa.log -x 75449 >> /dev/null # 75449 is the number of ERROR lines according to grep
  guling -b xaa.log -x 75449 >> /dev/null  13,74s user 3,74s system 161% cpu 10,796 total
> time cat xaa.log >> /dev/null
  cat xaa.log >> /dev/null  0,00s user 0,02s system 96% cpu 0,018 total

Am I happy with the result? Not exactly. It's so slow. Too slow, even.

Not to mention that the writing process of this article was underwhelmingly monotonous. And the writing process for this program? Even more so.

God, this is so boring.



This material is shared under the CC-BY License.