PaddleNLP Text Classification in Practice

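I used Docker here because it makes installation much easier: I installed paddlenlp 2.6 on top of the cuda-12.2 Docker image provided by NVIDIA. At first I tried installing everything by hand and spent a long time fighting version mismatches, so I switched to the Docker image.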
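
Since version mismatches were the main pain point, a quick check of what is actually installed inside the container can help. The following is only a minimal sketch: the distribution names paddlepaddle-gpu and paddlenlp are assumptions about how the packages were installed with pip, and the helper name installed_versions is made up for illustration.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed pip distribution names; adjust them if your image uses a CPU build.
    found = installed_versions(["paddlepaddle-gpu", "paddlenlp"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the container prints one line per package, which makes it easy to compare against the versions the PaddleNLP release expects.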

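Once the environment is ready, the next step is to provide the training data. For the data format and the training steps, see the documentation on GitHub.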

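label.txt and train.txt must be prepared in the correct format. After that, training is run from the command line, and once training finishes the model still has to be exported: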

python export_model.py --params_path ./checkpoint/ --output_path ./export
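
Because label.txt and train.txt have to be in the right format, a small sanity check before training may save a failed run. The sketch below assumes a layout of one label per line in label.txt and one tab-separated text and label pair per line in train.txt; that layout is an assumption here, so check it against the GitHub documentation for your task type. The function names are hypothetical.

```python
def find_bad_lines(train_lines, labels):
    """Return (line_number, reason) for every line that breaks the assumed format."""
    known = set(labels)
    problems = []
    for number, line in enumerate(train_lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly one tab"))
        elif not parts[0].strip():
            problems.append((number, "empty text"))
        elif parts[1].strip() not in known:
            problems.append((number, f"unknown label: {parts[1].strip()}"))
    return problems


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    # Made-up sample data; in practice use read_lines("label.txt") and read_lines("train.txt").
    sample_labels = ["sports", "finance"]
    sample_train = ["The match ended 2-1\tsports", "no tab here", "Stocks fell\tweather"]
    for number, reason in find_bad_lines(sample_train, sample_labels):
        print(number, reason)
```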

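The model can actually be used without exporting it, but in real projects classification is usually exposed as a web service API, so web serving capability is needed.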

PaddleNLP ships its own web service, PaddleNLP SimpleServing. Its input parameters are essentially fixed, so to use it we only need to import BaseTaskflowHandler (from paddlenlp.server import BaseTaskflowHandler) and implement its process method. An example follows:

# coding:utf-8
# Copyright (c) 2022  PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddlenlp.server import BaseTaskflowHandler
from paddlenlp.utils.log import logger

# Status strings returned in the "result" field of every response.
RESP_SUCCESS = "SUCCESS"
RESP_FAILURE = "FAILURE"


class TrsTaskflowHandler(BaseTaskflowHandler):
    def __init__(self):
        self._name = "trs_taskflow_handler"

    @classmethod
    def process(cls, predictor, data, parameters):
        # Nothing to predict on: return an empty result.
        if data is None or len(data) == 0:
            return {}
        text = data
        # When request parameters are present, they are used as the extraction schema.
        if parameters is not None:
            predictor.set_schema(parameters)
        try:
            result = predictor(text)
            return format_response(result)
        except Exception as e:
            # Log the error and return it as a string so the response stays JSON-serializable.
            logger.error(str(e))
            return {"result": RESP_FAILURE, "response": str(e), "history": []}


def format_response(result):
    # Collect every non-empty extracted text, de-duplicated in first-seen order.
    rw = {}
    for r in result:
        if len(r) == 0:
            continue
        for w in r.values():
            for t in w:
                if t["text"] == "":
                    continue
                rw[t["text"]] = t
    return {"result": RESP_SUCCESS, "response": list(rw.keys()), "history": []}
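
To make the flattening done by format_response concrete, here is a small standalone run on made-up data. The function is repeated with a single argument so the snippet runs on its own. The shape of the sample (a list of dicts that map a schema key to a list of items with a "text" field) is an assumption inferred from the fields the function reads, and the names and probabilities are invented.

```python
RESP_SUCCESS = "SUCCESS"


def format_response(result):
    # Collect every non-empty extracted text, de-duplicated in first-seen order.
    rw = {}
    for r in result:
        if len(r) == 0:
            continue
        for w in r.values():
            for t in w:
                if t["text"] == "":
                    continue
                rw[t["text"]] = t
    return {"result": RESP_SUCCESS, "response": list(rw.keys()), "history": []}


# Hypothetical predictor output for three input texts.
sample = [
    {"人物": [{"text": "Alice", "probability": 0.98}, {"text": "Bob", "probability": 0.91}]},
    {},  # an input with no matches
    {"人物": [{"text": "Alice", "probability": 0.95}, {"text": "", "probability": 0.10}]},
]

response = format_response(sample)
print(response["response"])  # ['Alice', 'Bob']
```

Duplicates across inputs collapse into one entry and empty texts are dropped, so the caller receives only a flat list of distinct strings.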

The web service class, TrsUieServer.py, looks like this:

from paddlenlp import Taskflow
from trs_server import TrsSimpleServer
from trs_taskflow_handler import TrsTaskflowHandler

# Change the schema to the one you have defined ("人物" means "person")
schema = ["人物"]
# Change the task path to the path of your best model
uie = Taskflow("information_extraction", schema=schema, task_path="../../checkpoint/model_best/")
# Register the fine-tuned UIE task together with the custom handler
trsApp = TrsSimpleServer()
trsHandler = TrsTaskflowHandler()
trsApp.register_taskflow("chat", uie, trsHandler)

To start the service, just run the following command:

paddlenlp server TrsUieServer:trsApp --host 0.0.0.0 --port 9090
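
Once the service is running, it can be called over HTTP. The client below is only a sketch under several assumptions that the post does not confirm: that the route equals the registered task name (so "chat" is served at /chat), that the JSON body carries "data" and "parameters" keys, and that these are passed to process as data and parameters. The helper names are hypothetical, so check the server code before relying on any of this.

```python
import json
from urllib import request


def build_request(host, port, task_name, texts, schema=None):
    """Build the URL and JSON body for one prediction request."""
    # Assumption: the route path is the task name passed to register_taskflow.
    url = f"http://{host}:{port}/{task_name}"
    # Assumption: the handler receives "data" as the texts and "parameters" as the schema.
    body = {"data": texts}
    if schema is not None:
        body["parameters"] = schema
    return url, json.dumps(body, ensure_ascii=False).encode("utf-8")


def call_service(url, payload, timeout=10):
    """POST the payload and decode the JSON reply."""
    req = request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    url, payload = build_request("127.0.0.1", 9090, "chat", ["some input text"], ["人物"])
    print(url)
    print(payload.decode("utf-8"))
    # With the server running, the request would be sent with:
    # print(call_service(url, payload))
```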

The complete NLP server code is available on GitHub.

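Overall, PaddleNLP's documentation is fairly comprehensive and the tooling is complete. The only tedious part is the data, which has to be labeled by hand. I used doccano, an open-source annotation tool, for this.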