python - Regular expression for a list of elements in patents


Given a patent, how can I write a regular expression that extracts the list of elements from its description? The elements can be identified by:

  1. 'a' or 'the' before the element
  2. a number after the element

For example, given the paragraph:

'Fig. 1 shows a base 10 of the adjustable cord holding device according to an embodiment of the present invention. The base 10 may comprise a base hole 16 to allow the cord to pass through the base 10. The shape of the base hole 16 depends on the intended use of the adjustable cord holding device. If the cross section of the cord is round, the base hole 16 may be round. On the other hand, when the intended cord is a belt, whose cross section is a rounded rectangle, the base hole 16 may be a rounded rectangle.'

I would like to use a regular expression to extract:

['a base 10', 'the base 10', 'a base hole 16', 'the base 10', 'the base hole 16', 'the base hole 16', 'the base hole 16'] 

You can use re.findall():

>>> re.findall(r'((?:a|the)(?:(?!(?:\ba\b|\bthe\b)).)*\d+)', s, re.I)
['a base 10', 'the base 10', 'a base hole 16', 'the base 10', 'the base hole 16', 'the base hole 16', 'the base hole 16']

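The following regex:

r'((?:a|the)(?:(?!(?:\ba\b|\bthe\b)).)*\d+)'

will match a substring that starts with a or the and ends with digits. Between them, (?:(?!(?:\ba\b|\bthe\b)).)* uses a negative lookahead to match any characters except the standalone words a and the. This gets rid of long matches such as 'the present invention. The base 10', because the match cannot run across a second article. The re.I flag makes the match ignore case, so a sentence-initial 'The' is found as well.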
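As a minimal sketch of the explanation above, the following script runs the answer's pattern next to a naive one on a shortened sample. The sample text is an assumed reconstruction of the first two sentences of the example paragraph. The names `find_elements`, `TEMPERED` and `NAIVE` are hypothetical, and the naive pattern is only a comparison, not part of the answer. Since re.IGNORECASE affects matching but not the returned text, the sketch lowercases the results to get the list in the form the question asks for.

```python
import re

# Assumed reconstruction of the first two sentences of the example paragraph.
text = (
    "Fig. 1 shows a base 10 of the adjustable cord holding device "
    "according to an embodiment of the present invention. "
    "The base 10 may comprise a base hole 16 to allow the cord "
    "to pass through the base 10."
)

# Pattern from the answer: an article, then any characters that do not
# begin a standalone "a" or "the", then one or more digits.
TEMPERED = re.compile(
    r'((?:a|the)(?:(?!(?:\ba\b|\bthe\b)).)*\d+)', re.IGNORECASE
)

# Naive comparison pattern (hypothetical): article, anything, digits.
# Nothing stops it from running across a second article.
NAIVE = re.compile(r'\b(?:a|the)\b.*?\d+', re.IGNORECASE)


def find_elements(s):
    """Return the element mentions found in s, lowercased."""
    return [m.lower() for m in TEMPERED.findall(s)]


elements = find_elements(text)
naive_matches = NAIVE.findall(text)

print(elements)
# ['a base 10', 'the base 10', 'a base hole 16', 'the base 10']

# The second naive match starts at "the adjustable" and only stops at the
# "10" of the next sentence, which is the long match the lookahead avoids.
print(naive_matches[1])
```

Note that (?:a|the) has no word boundary in front of it, so the pattern can also start on an 'a' inside a word such as 'may'. On this sample those attempts fail because no digits follow before the next article, but prefixing the alternation with \b would be a reasonable tightening for other texts.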

