Got a pretty big log file. Sure, I could read it with glogg or by splitting it first. But where's the fun in that? I'm going to use conduit for the job.
But, before I write gibberish about it, perhaps you want to read my previous entry which talks about parsing and tweeting. I use the same library, attoparsec, here and there.
Pretty much like this.
2018-02-31 00:00:02,968 INFO [Polling.Thingy.You.Know] method.from.program - StatementCache: [size=100, count=28, hits=231522, misses=47, aged=0]
2018-02-31 07:58:33,479 ERROR [TaskDispatcherThingy] problematic.method.from.program - DoingSomething.validateRequestData :: Validated things doesn't have Persistable: problematic.method.from.program.another.
Just a standard format from log4j (but please note that the message can span more than one line).
But the problem is the size of the file.
> du komeror.log
1797720 komeror.log
> du xaa.log
76188 xaa.log
That komeror.log is a filtered log file which contains only the lines with the word ERROR in them. And xaa.log is the first one million lines, split off using the split command. There are actually more than 200 such chunks.
There are three datatypes that I'm using:
- KomLog, which reflects the format of the log produced by log4j.
- Severity, which reflects the severity of a log entry. It is a part of KomLog and a sum of Info | Warn | Error | Fatal.
- Opsi, which reflects what the command line should accept, e.g. which file it should read and how many log entries it should print.

It's better to start by defining the simplest parser, Parser Severity.
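Before the parsers, here is a sketch of the first two datatypes. Only Severity's constructors are spelled out in the post; the KomLog field names and types are my assumptions, chosen to line up with how parseKomLog and someFunc use the type later on:

```haskell
import Data.ByteString (ByteString)

-- The severity sum type, as described in the post.
data Severity = Info | Warn | Error | Fatal
  deriving (Show, Eq)

-- A guess at the log-entry record: date, time, severity,
-- event name, and the (possibly multi-line) message.
data KomLog = KomLog
  { komTanggal  :: ByteString -- the date part
  , komJam      :: ByteString -- the time part
  , komSeverity :: Severity
  , komMethod   :: ByteString -- the [Event] part
  , komMessage  :: ByteString
  }
  deriving (Show)
```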
parseSeverity :: Parser Severity
parseSeverity =
      ("INFO" >> return Info)
  <|> ("WARN" >> return Warn)
  <|> ("ERROR" >> return Error)
  <|> ("FATAL" >> return Fatal)
It basically matches the string on the left side of the >> operator and returns the severity on the right. We are going to use it as a building block of the log parser.
And then we define the event parser (the [Polling.Thingy.You.Know] part, which I dumbly misnamed as parseMethod).
parseMethod :: Parser ByteString
parseMethod = do
  _ <- char '['
  something <- takeWhile (/= ']')
  _ <- char ']'
  return something
Pretty simple, right? That function only accepts input which starts with a [ character, stops at the ] character, and then returns whatever is in between.
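A quick sanity check in ghci (not in the original post), using attoparsec's parseOnly with OverloadedStrings enabled:

```haskell
λ> parseOnly parseMethod "[Polling.Thingy.You.Know] the rest of the line"
Right "Polling.Thingy.You.Know"
```

Note that parseOnly happily leaves the rest of the line unconsumed, which is exactly what we want here.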
What about the date and time, you ask? I actually added an external dependency for that, but when I realised that I only needed one or two functions, I removed it from my cabal file and used the "copy and paste" method instead. Shameless, I know.
Now, the parser for an individual log entry.
parseKomLog :: Parser KomLog
parseKomLog = do
  tanggal <- parseTanggal
  _ <- skipSpace
  jam <- parseJam
  _ <- skipSpace
  severity <- parseSeverity
  _ <- skipSpace
  method <- parseMethod
  _ <- skipSpace
  message <- manyTill anyChar $ endOfInput <|> tanggalDiDepan
  return $ KomLog tanggal jam severity method (pack message)
Pretty simple, actually. It parses the date and time, skips whitespace, parses the log's severity (using parseSeverity), skips whitespace again, parses the log's event (using parseMethod), and then reads the message until the end of input or tanggalDiDepan, which is just a lookahead for the previous steps minus the message part.
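tanggalDiDepan itself never appears in the post, so here is a plausible sketch of it under my reading: it should succeed when the next log entry's timestamp is ahead, and consume nothing, so the following parseKomLog run still sees that timestamp. It assumes parseTanggal and parseJam from the post:

```haskell
import Data.Attoparsec.Combinator (lookAhead)

-- Succeeds, consuming no input thanks to lookAhead, when the
-- input ahead starts with a "date time" pair, i.e. a new entry.
tanggalDiDepan :: Parser ()
tanggalDiDepan = lookAhead $ do
  _ <- parseTanggal
  skipSpace
  _ <- parseJam
  pure ()
```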
I think I only cared about how many lines it should print, which file it should read, and what kind of severity it should print. So I created a record like this.
data Opsi = Opsi
  { opsiJumlahMaksimal :: Int
  , opsiNamaBerkas :: FilePath
  , opsiSeverity :: Severity
  }
And then, using optparse-applicative, I created a parser like the following.
opsi :: Parser Opsi
opsi =
  Opsi
    <$> option auto (short 'x' <> showDefault <> value 1) -- [1]
    <*> strOption (short 'b')                             -- [2]
    <*> option auto (short 's' <> value Error)            -- [3]
where [1] is an optional switch which takes an integer and corresponds to opsiJumlahMaksimal; [2] is a required switch, because it corresponds to the log file this program should read; and [3] is pretty much like number one.
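One thing the post never shows is infoopsi, the ParserInfo value that the entry point below hands to execParser. A minimal sketch of it, with a description string of my own invention:

```haskell
import Options.Applicative

-- Wrap the opsi parser with the standard --help machinery.
infoopsi :: ParserInfo Opsi
infoopsi =
  info (opsi <**> helper)
       (fullDesc <> progDesc "Filter a log4j log by severity")
```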
Here's the entry point. I tried to keep it as simple as possible.
someFunc :: IO ()
someFunc = do
  Opsi {..} <- execParser infoopsi -- [1]
  runConduitRes -- [2]
    $  C.sourceFile opsiNamaBerkas -- [3]
    .| conduitParser parseKomLog -- [4]
    .| filterC (\(_, KomLog _ _ c _ _) -> c == opsiSeverity) -- [5]
    .| takeC opsiJumlahMaksimal -- [6]
    .| mapC posrangekomlogketeks -- [7]
    .| iterMC putText -- [8]
    .| sinkNull
I created this program because I wanted to know why that program went kaput because of a simple error. posrangekomlogketeks is just something like show with weird choices. You can look at the repo at ibnuda/guling or its mirror. Somebody told me that I don't write anything in my spare time just because I don't use github.
From the error-only log file:
> wc komeror.log
15850949 95106864 1840857389 komeror.log
> time guling -x 15000000 -b komeror.log >> /dev/null
guling -x 15000000 -b komeror.log >> /dev/null 177,02s user 21,16s system 130% cpu 2:31,69 total
> time cat komeror.log >> /dev/null
cat komeror.log >> /dev/null 0,00s user 0,23s system 99% cpu 0,229 total
From the first one-million-line log file:
> wc xaa.log
1000000 3445994 78009277 xaa.log
> time guling -b xaa.log -x 75449 >> /dev/null # 75449 is the amount of ERROR based on grep
guling -b xaa.log -x 75449 >> /dev/null 13,74s user 3,74s system 161% cpu 10,796 total
> time cat xaa.log >> /dev/null
cat xaa.log >> /dev/null 0,00s user 0,02s system 96% cpu 0,018 total
Am I happy with the result? Not exactly. It's slow. Too slow, even.
Not to mention that the writing process for this article was underwhelmingly monotonous. And the writing process for this program? Even more so.
God, this is so boring.