Author Topic: How to format IRC scraper ignore REGEX?  (Read 6192 times)

d3athsd00r

  • Guest
How to format IRC scraper ignore REGEX?
« on: 2014-05-29, 10:06:52 pm »
I have a couple things added to my IRC scraper ignore statement, but it doesn't seem to be working.

Code: [Select]
define('SCRAPE_IRC_CATEGORY_IGNORE', '/^(MP3|FLAC|XXX)$/i');

However, as you can see in the screenshot, I'm still getting some XXX. Any ideas?

Offline kevin123

  • Overlord
  • ******
  • Posts: 456
  • Helpful: +49/-0
Re: How to format IRC scraper ignore REGEX?
« Reply #1 on: 2014-05-30, 03:28:33 am »
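The regex is for ignoring categories, not titles. Because your pattern is anchored with ^ and $, it only ignores a category that is exactly "XXX"; a category like "XXX: HD-CLIPS" still gets through.

In those cases it would be: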

define('SCRAPE_IRC_CATEGORY_IGNORE', '/^(MP3|FLAC|XXX(: HD-CLIPS)?)$/i');
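To see why the first pattern lets "XXX: HD-CLIPS" through, here is a small sketch. It uses Python's re module as a stand-in for PHP, since these regex constructs behave the same way in both. The is_ignored helper and the "TV: HD" category are made up for illustration, and how the scraper really applies the setting is not shown in this thread.

```python
import re

# Python translations of the two PHP patterns from this thread.
# The trailing /i in PHP corresponds to re.IGNORECASE here.
ORIGINAL = re.compile(r"^(MP3|FLAC|XXX)$", re.IGNORECASE)
REVISED = re.compile(r"^(MP3|FLAC|XXX(: HD-CLIPS)?)$", re.IGNORECASE)


def is_ignored(pattern, category):
    """Hypothetical helper: True if the category matches the ignore pattern."""
    return pattern.search(category) is not None


# "TV: HD" is an invented category, used only as a non-matching example.
for category in ["XXX", "XXX: HD-CLIPS", "flac", "TV: HD"]:
    print(category, is_ignored(ORIGINAL, category), is_ignored(REVISED, category))
```

Only "XXX: HD-CLIPS" differs between the two columns: the original pattern rejects it, the revised one ignores it.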
« Last Edit: 2014-05-30, 03:33:23 am by kevin123 »

Offline kevin123

  • Overlord
  • ******
  • Posts: 456
  • Helpful: +49/-0
Re: How to format IRC scraper ignore REGEX?
« Reply #2 on: 2014-05-30, 03:47:46 am »
/ = Delimiter (tells PHP where the regex starts and ends).
^ = Anchors the match to the beginning of the string.
( ) = A group that encloses alternatives (keywords, like MP3, etc).
| = This means "or" (MP3 OR FLAC OR XXX).
? = After the ), it means everything in the preceding ( ) is optional. In this case ": HD-CLIPS" is optional.
$ = Anchors the match to the end of the string.
/ = Again, a delimiter, to tell PHP the regex ends here.
i = Tells PHP the regex is case insensitive.
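
The same breakdown can be written into the pattern itself. This is only a sketch in Python, not PHP: Python has no / delimiters, the i flag becomes re.IGNORECASE, and re.VERBOSE allows comments inside the pattern. In verbose mode plain spaces are ignored, so the space in ": HD-CLIPS" has to be escaped. ANNOTATED is a name chosen for this example.

```python
import re

ANNOTATED = re.compile(
    r"""
    ^                   # start of the string
    (                   # open the group of alternatives
        MP3             # the literal text MP3
      | FLAC            # or FLAC
      | XXX             # or XXX ...
        (:\ HD-CLIPS)?  # ... optionally followed by ": HD-CLIPS"
    )                   # close the group
    $                   # end of the string
    """,
    re.IGNORECASE | re.VERBOSE,
)

print(ANNOTATED.search("xxx: hd-clips") is not None)
print(ANNOTATED.search("MP3 ") is not None)
```

The second check fails because of the trailing space: with ^ and $ in place, the whole string has to match.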

So if my string is "FLAC"

My regex can be : "/FLAC/i"

But that regex will also match "Flacid"
Since nothing is saying that the string must start with F and end with C.
So we add the start/end operators: "/^FLAC$/i"
Which now says it needs to start with an F and end with a C, and "Flacid" would not match.
What if my string is "FLAC: CD" ? This would not match : "/^FLAC$/i"

I can add optional stuff after FLAC, like this : "/^FLAC(: CD)?$/i"
Now my regex will match both FLAC and FLAC: CD
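
A quick way to check these steps yourself, sketched in Python rather than PHP (the patterns are the same, minus the / delimiters; matches is a helper name made up for this example):

```python
import re


def matches(pattern, text):
    """True if the pattern is found in text, ignoring case (like PHP's /i)."""
    return re.search(pattern, text, re.IGNORECASE) is not None


print(matches(r"FLAC", "Flacid"))             # True: nothing anchors the match
print(matches(r"^FLAC$", "Flacid"))           # False: "id" is left over before the end
print(matches(r"^FLAC$", "FLAC: CD"))         # False: ": CD" is left over
print(matches(r"^FLAC(: CD)?$", "FLAC: CD"))  # True: the optional group is used
print(matches(r"^FLAC(: CD)?$", "FLAC"))      # True: the optional group is skipped
```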
What if I want to match FLAC: DVD also?
Now I can use the "or" operator:
"/^FLAC(: (CD|DVD))?$/i"

What if I want to match MP3 too ?
Again, I can use the "or" operator.
"/^(FLAC(: (CD|DVD))?|MP3)$/i"

Now I match FLAC: CD, FLAC: DVD, FLAC and MP3
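
Here is a sketch that runs the combined pattern against a few strings, again in Python as a stand-in for PHP. The rejected examples are invented to show what the anchors and groups rule out.

```python
import re

PATTERN = re.compile(r"^(FLAC(: (CD|DVD))?|MP3)$", re.IGNORECASE)

accepted = ["FLAC", "FLAC: CD", "FLAC: DVD", "MP3"]
# Invented strings that look close but should not match.
rejected = ["FLAC: ", "FLAC: BLURAY", "MP3: CD", "MP4"]

for text in accepted + rejected:
    print(repr(text), PATTERN.search(text) is not None)
```

Note that "MP3: CD" is rejected: the optional ": CD" or ": DVD" part sits inside the FLAC alternative, so it does not apply to MP3.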

Edit: More stuff:

What if I want to match MP3 or MP4?

"/^(FLAC(: (CD|DVD))?|MP[34])$/i"

/[34]/ would be the same as (3|4).

/MP[2-4]/ ; the - means "to" (2 to 4), so this would match MP2, MP3, MP4

I can also use the ? operator after [], so /MP[2-4]?/ would match MP, MP2, MP3, MP4

You can also use the - inside the [] for letters : /[a-c]/ would mean a to c

You can also do more than 1 range : /[a-cf-g2-5]/  = a b c f g 2 3 4 5
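
The bracket examples can be checked with a short sketch (Python standing in for PHP). The whole helper is made up for this example; it uses re.fullmatch, which requires the entire string to match, much like wrapping the pattern in ^ and $, so the results are unambiguous.

```python
import re


def whole(pattern, text):
    """True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


names = ["MP", "MP1", "MP2", "MP3", "MP4", "MP5"]

print([n for n in names if whole(r"MP[34]", n)])    # ['MP3', 'MP4']
print([n for n in names if whole(r"MP(3|4)", n)])   # ['MP3', 'MP4'], same as [34]
print([n for n in names if whole(r"MP[2-4]", n)])   # ['MP2', 'MP3', 'MP4']
print([n for n in names if whole(r"MP[2-4]?", n)])  # ['MP', 'MP2', 'MP3', 'MP4']

# Several ranges in one class: a-c, f-g and 2-5.
print("".join(c for c in "abcdefg0123456789" if whole(r"[a-cf-g2-5]", c)))  # abcfg2345
```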

More stuff again:

/./ (the period) means any single character (by default, anything except a line break)

/long.tu.e/ This could match longitude or long tube or longatu4e etc..

* and + ;The star means 0 or more of the previous, the + means 1 or more of the previous.

/The.*Cat/ This could be TheCat, The Cat, The Big Cat, etc.. since there could be anything between The and Cat
/The.+Cat/ This could be The Cat, The Big Cat, but it could not be TheCat, as there needs to be something between the 2 words.

/Se+/ This could be Se or See or Seee, etc
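
A sketch to try the period, star and plus examples, in Python as a stand-in for PHP. The whole helper is made up here and checks the entire string with re.fullmatch; "longtube" and "S" are extra strings added to show a failing case.

```python
import re


def whole(pattern, text):
    """True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Each . must consume exactly one character, so "longtube" fails.
for word in ["longitude", "long tube", "longatu4e", "longtube"]:
    print(word, whole(r"long.tu.e", word))

# .* allows nothing in between, .+ needs at least one character.
for text in ["TheCat", "The Cat", "The Big Cat"]:
    print(text, whole(r"The.*Cat", text), whole(r"The.+Cat", text))

# e+ means one or more e, so a bare "S" fails.
for text in ["S", "Se", "See", "Seee"]:
    print(text, whole(r"Se+", text))
```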

More stuff:
These are similar to * and +

{2} This means exactly 2 of the preceding in a row.

{1,} This is exactly the same as +. It means 1 or more of the preceding.
{0,} This is exactly the same as *. It means 0 or more of the preceding.

{1,2} This means 1 or 2 of the preceding.
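
The curly-brace counts can be compared against + and * with a short sketch (Python standing in for PHP). The sample strings and the whole and matching helpers are made up for this example.

```python
import re


def whole(pattern, text):
    """True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


samples = ["S", "Se", "See", "Seee"]


def matching(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in samples if whole(pattern, s)]


print(matching(r"Se{2}"))    # ['See']
print(matching(r"Se{1,2}"))  # ['Se', 'See']
print(matching(r"Se{1,}") == matching(r"Se+"))  # True: {1,} behaves like +
print(matching(r"Se{0,}") == matching(r"Se*"))  # True: {0,} behaves like *
```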
« Last Edit: 2014-08-06, 07:18:37 am by kevin123 »

Offline bobtongue

  • Prolific Indexer
  • ****
  • Posts: 109
  • Helpful: +7/-0
  • Wherever you go, There you are!
Re: How to format IRC scraper ignore REGEX?
« Reply #3 on: 2014-05-30, 03:59:39 am »
OMG! What a great explanation. Please sticky this somewhere where it can be easily found. :D Thanks Kevin

d3athsd00r

  • Guest
Re: How to format IRC scraper ignore REGEX?
« Reply #4 on: 2014-05-30, 03:38:27 pm »
What bobtongue said. It now makes all the sense. Thanks for all the help :)